which saves immense effort in design and implementation. Second, new theoretical guarantees for ordinal ranking can be easily extended from known ones for binary classification, which saves tremendous effort in derivation and analysis.
We introduced one such reduction framework in Subsection 2.3.2. The framework not only forms a cost-sensitive classification algorithm (CSOVO) by calling an underlying binary classification algorithm, but also guarantees that a good cost-sensitive classifier can be obtained by combining a set of decent binary classifiers. Since the framework is designed for general cost-sensitive classification rather than for ordinal ranking, arguably it does not use all the properties of ordinal ranking. For instance, it is not clear whether the framework explicitly makes ordinal comparisons between the ranks (see Section 1.2). In this chapter, we study another reduction framework that fully takes the properties of ordinal ranking into account. The framework includes both the classification and the regression perspectives of ordinal ranking. Under this framework, we will eventually show an interesting fact: ordinal ranking (with its full properties) is equivalent to binary classification.
4.1 Reduction Framework
The reduction framework was first proposed in our earlier work, which considered a more restricted cost-sensitive setup (Li and Lin 2007b).¹ The core of the framework is the following reduction method, which is composed of three stages: preprocessing, training, and prediction. Next, we introduce the stages of the reduction method and the resulting theoretical guarantees.
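1. Preprocessing: Transform each training example $(x_n, y_n, c_n)$ into $K-1$ weighted binary examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$, $k = 1, \ldots, K-1$, which together form the extended training set $Z_E$, where
$$X_n^{(k)} = (x_n, k), \qquad Y_n^{(k)} = 2\llbracket y_n > k \rrbracket - 1, \qquad W_n^{(k)} = (K-1) \cdot \bigl| c_n[k+1] - c_n[k] \bigr|.$$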
2. Training: Use some binary classification algorithm $\mathcal{A}_b$ on $Z_E$ and get a binary classifier $g$, where $g(X^{(k)}) = g(x, k)$.
3. Prediction: For any $x \in \mathcal{X}$, estimate its rank with
$$\hat{r}(x) = r_g(x) \equiv 1 + \sum_{k=1}^{K-1} \llbracket g(x, k) > 0 \rrbracket. \tag{4.1}$$
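The preprocessing and prediction stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `extend_example` and `predict_rank`, the toy feature `"x0"`, and the thresholding classifier `g` are hypothetical and not part of the original framework, and the weights use the absolute difference of neighboring costs.

```python
def extend_example(x, y, c, K):
    """Preprocessing: turn one ordinal example (x, y, c) into K-1 weighted
    binary examples ((x, k), Y, W). Ranks are 1..K; list entry c[k-1] stores c[k]."""
    extended = []
    for k in range(1, K):
        label = 1 if y > k else -1                  # Y = 2[y > k] - 1
        weight = (K - 1) * abs(c[k] - c[k - 1])     # W = (K-1) |c[k+1] - c[k]|
        extended.append(((x, k), label, weight))
    return extended


def predict_rank(g, x, K):
    """Prediction (4.1): one plus the number of positive binary answers."""
    return 1 + sum(1 for k in range(1, K) if g(x, k) > 0)


# Usage with the absolute cost c[k] = |y - k| for K = 4 and y = 3.
K = 4
cost = [abs(3 - k) for k in range(1, K + 1)]
examples = extend_example("x0", 3, cost, K)
labels = [Y for (_, Y, _) in examples]
weights = [W for (_, _, W) in examples]

# Hypothetical binary classifier: compares a fixed score of 2.5 with k.
g = lambda x, k: 2.5 - k
rank = predict_rank(g, "x0", K)
```

With the absolute cost every weight is the same, so the extended problem is an ordinary unweighted binary classification problem over the pairs $(x, k)$.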
We have encountered the extended examples $(X^{(k)}, Y^{(k)})$ when proving Theorem 3.2. Here the extended examples are weighted and are of a more generic form. In particular, $X^{(k)}$ now takes an abstract encoding of $(x, k)$ rather than a concrete encoding $(x, 1_k)$. The actual encoding of $k$ can then depend on the binary classification algorithm $\mathcal{A}_b$. We can think of the extended training vector $X^{(k)}$ as a representation of the question “is the rank of $x$ greater than $k$?”, the binary label $Y^{(k)}$ as a representation of the desired answer to the question, and the classifier $g(X^{(k)})$ as a function that predicts the answer to the question.
A threshold ensemble $r_{H_T,\theta}$, for instance, is the same as $r_g$ using
$$g(X^{(k)}) = H_T(x) - \theta_k = H_T(x) - \sum_{\ell=1}^{K-1} \theta_\ell \cdot s_\ell(x, 1_k).$$
Then, ORBoost-LR (Algorithm 3.4) is simply a special case of the reduction method above using the classification cost vectors $\{c_c^{(\ell)}\}_{\ell=1}^{K}$ in $Z$, the vector $(x, 1_k)$ as the actual encoding of $X^{(k)}$, and a variant of AdaBoost as $\mathcal{A}_b$ that works on the learning model $\mathcal{G}$ in (3.5).
While the reduction method is simple, it comes with a strong theoretical guarantee: the cost bound theorem, which depends on the following probability measure $dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$ that generates weighted binary examples.
1. Draw a tuple (x, y, c) independently from dF(x, y, c) and draw k uniformly from the set {1, 2, . . . , K −1}.
2. Generate $(X^{(k)}, Y^{(k)}, W^{(k)})$ by
$$X^{(k)} = (x, k), \qquad Y^{(k)} = 2\llbracket y > k \rrbracket - 1, \qquad W^{(k)} = (K-1) \cdot \bigl| c[k+1] - c[k] \bigr|. \tag{4.2}$$
As shown in the proof of Theorem 3.2, the extended training set $Z_E$ contains dependent training examples from $dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$. For any given binary classifier $g$, we can then define its out-of-sample cost
$$\pi_b(g) \equiv \pi(g, F_b) = \int_{X^{(k)}, Y^{(k)}, W^{(k)}} W^{(k)} \cdot \llbracket Y^{(k)} \neq g(X^{(k)}) \rrbracket \, dF_b(X^{(k)}, Y^{(k)}, W^{(k)}).$$
Next, we introduce the cost bound theorem (Li and Lin 2007b).
Theorem 4.1 (Cost bound of the reduction framework). Consider a ranker $r_g$ constructed from a binary classifier $g$ using (4.1). Assume that $c$ is V-shaped and $c[y] = 0$ for every example $(x, y, c)$. If $g(x, k)$ is rank-monotonic² or if every cost vector $c$ is convex (see Section 1.2), then $\pi(r_g) \le \pi_b(g)$.
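Proof. When $g(x, k)$ is rank-monotonic, by (4.1), $\llbracket g(x, k) \le 0 \rrbracket = \llbracket r_g(x) \le k \rrbracket$. Thus,
$$\begin{aligned}
c[r_g(x)] &= c[K] + \sum_{k=r_g(x)}^{K-1} \bigl( c[k] - c[k+1] \bigr) \\
&= c[K] + \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \llbracket g(x, k) \le 0 \rrbracket \qquad (4.3) \\
&= c[y] + \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \llbracket g(X^{(k)}) \neq Y^{(k)} \rrbracket .
\end{aligned}$$
² A rank-monotonic $g(x, k)$ means $g(x, k-1) \ge g(x, k)$ for $k = 2, 3, \ldots, K-1$.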
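Then, because $c[y] = 0$, we get
$$\begin{aligned}
\pi(r_g) &= \int_{x,y,c} c[r_g(x)] \, dF(x, y, c) \\
&= \int_{x,y,c} \frac{1}{K-1} \sum_{k=1}^{K-1} W^{(k)} \llbracket g(X^{(k)}) \neq Y^{(k)} \rrbracket \, dF(x, y, c) \\
&= \int_{X^{(k)}, Y^{(k)}, W^{(k)}} W^{(k)} \llbracket g(X^{(k)}) \neq Y^{(k)} \rrbracket \, dF_b(X^{(k)}, Y^{(k)}, W^{(k)}) \\
&= \pi_b(g).
\end{aligned}$$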
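Now consider the case where every cost vector $c$ is convex. Note that there are exactly $r_g(x) - 1$ zeros and $K - r_g(x)$ ones among the values $\llbracket g(x, k) \le 0 \rrbracket$, $k = 1, \ldots, K-1$. In addition, there are also exactly $r_g(x) - 1$ zeros and $K - r_g(x)$ ones among the values $\llbracket r_g(x) \le k \rrbracket$. The sequence $\bigl( \llbracket r_g(x) \le k \rrbracket \bigr)_{k=1}^{K-1}$ is monotonically non-decreasing in $k$, whereas the convexity condition makes $c[k+1] - c[k]$ non-decreasing, so the sequence $\bigl( c[k] - c[k+1] \bigr)_{k=1}^{K-1}$ is monotonically non-increasing. Pairing a non-increasing sequence with the non-decreasing arrangement of the same zeros and ones gives the smallest possible sum of products (the rearrangement inequality). Hence,
$$\begin{aligned}
c[K] + \sum_{k=r_g(x)}^{K-1} \bigl( c[k] - c[k+1] \bigr) &= c[K] + \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \llbracket r_g(x) \le k \rrbracket \\
&\le c[K] + \sum_{k=1}^{K-1} \bigl( c[k] - c[k+1] \bigr) \llbracket g(x, k) \le 0 \rrbracket . \qquad (4.4)
\end{aligned}$$
The desired proof follows by replacing (4.3) with (4.4).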
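The two cases of the proof can be checked numerically on a small example. The sketch below enumerates every possible vector of binary answers for $K = 5$; the two cost vectors and all function names are made up for illustration, and the binary cost is computed per example as $\frac{1}{K-1} \sum_k W^{(k)} \llbracket g(X^{(k)}) \neq Y^{(k)} \rrbracket$ with the absolute-difference weights.

```python
from itertools import product


def rank_from_answers(answers):
    # (4.1): answers[k-1] plays the role of g(x, k) in {-1, +1}
    return 1 + sum(1 for a in answers if a > 0)


def binary_cost(answers, y, c):
    # (1/(K-1)) * sum_k W^(k) [g(X^(k)) != Y^(k)], with W^(k) = (K-1)|c[k+1]-c[k]|;
    # the two factors of K-1 cancel. List entry c[k-1] stores c[k].
    K = len(c)
    total = 0
    for k in range(1, K):
        label = 1 if y > k else -1
        if answers[k - 1] != label:
            total += abs(c[k] - c[k - 1])
    return total


def is_rank_monotonic(answers):
    return all(a >= b for a, b in zip(answers, answers[1:]))


K, y = 5, 2
v_shaped = [3, 0, 1, 1, 4]                        # V-shaped around y, not convex
convex = [(y - k) ** 2 for k in range(1, K + 1)]  # squared cost, convex

equalities, bounds = [], []
for answers in product([-1, 1], repeat=K - 1):
    r = rank_from_answers(answers)
    if is_rank_monotonic(answers):
        # rank-monotonic case: the per-example cost is matched exactly
        equalities.append(v_shaped[r - 1] == binary_cost(answers, y, v_shaped))
    # convex case: the bound holds for every answer vector
    bounds.append(convex[r - 1] <= binary_cost(answers, y, convex))
```

All five rank-monotonic answer vectors give equality for the V-shaped cost, and all sixteen answer vectors satisfy the inequality for the convex cost.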
Theorem 4.1 indicates that if there exists a decent binary classifier $g$, we can obtain a good ranker $r_g$. Nevertheless, it does not guarantee how good $r_g$ is in comparison with other rankers. In particular, if we consider the target function $g_*$ under $dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$ and the target function $r_*$ under $dF(x, y, c)$, does a small regret $\pi_b(g) - \pi_b(g_*)$ in binary classification translate to a small regret $\pi(r_g) - \pi(r_*)$ in ordinal ranking? Furthermore, is $\pi(r_{g_*})$ close to $\pi(r_*)$? Next, we introduce the reverse-reduction technique, which helps to answer the questions above.
Reverse reduction goes through the preprocessing and the prediction stages of the reduction method in a different direction. In the preprocessing stage, instead of starting with ordinal examples $(x_n, y_n, c_n)$, reverse reduction deals with weighted binary examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$. It first combines each set of binary examples sharing the same $x_n$ into an ordinal example by
$$y_n = 1 + \sum_{k=1}^{K-1} \llbracket Y_n^{(k)} > 0 \rrbracket ; \qquad c_n[k] = \sum_{\ell=1}^{K-1} \frac{W_n^{(\ell)}}{K-1} \cdot \llbracket y_n \le \ell < k \ \text{or} \ k \le \ell < y_n \rrbracket . \tag{4.5}$$
It is easy to verify that (4.5) is the exact inverse transform of (4.2) on the training examples. These ordinal examples are then given to an ordinal ranking algorithm to obtain a ranker $r$. In the prediction stage, reverse reduction works by decomposing the prediction $r(x)$ into $K-1$ binary predictions, each as if coming from a binary classifier
$$g_r(X^{(k)}) = 2\llbracket r(x) > k \rrbracket - 1. \tag{4.6}$$
Then, a lemma on the out-of-sample cost of gr immediately follows.
Lemma 4.2. With the definitions of $F(x, y, c)$ and $F_b(X^{(k)}, Y^{(k)}, W^{(k)})$ in Theorem 4.1, for every ordinal ranker $r$, $\pi(r) = \pi_b(g_r)$.
Proof. Because $g_r$ is rank-monotonic and $r_{g_r} = r$, the same proof as that of Theorem 4.1, which gives an equality in the rank-monotonic case, leads to the desired result.
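The claim that the reverse-reduction transform inverts (4.2) can be illustrated with a round trip on one example. The sketch below is an assumption-laden illustration: the function names and the cost vector are hypothetical, the cost vector is V-shaped with $c[y] = 0$, and the indicator in the reconstruction is taken as $y \le \ell < k$ or $k \le \ell < y$, which is the form that sums exactly the cost differences between $y$ and $k$.

```python
def to_binary(y, c):
    """(4.2) for every k: labels Y^(k) and weights W^(k); c[k-1] stores c[k]."""
    K = len(c)
    labels = [1 if y > k else -1 for k in range(1, K)]
    weights = [(K - 1) * abs(c[k] - c[k - 1]) for k in range(1, K)]
    return labels, weights


def to_ordinal(labels, weights):
    """Reverse reduction: rebuild the rank y and the cost vector c."""
    K = len(labels) + 1
    y = 1 + sum(1 for Y in labels if Y > 0)
    c = []
    for k in range(1, K + 1):
        total = 0
        for l in range(1, K):
            # accumulate the weights lying between the rank y and the rank k
            if (y <= l < k) or (k <= l < y):
                total += weights[l - 1] / (K - 1)
        c.append(total)
    return y, c


def binary_answers(rank, K):
    """(4.6): g_r(x, k) = 2[r(x) > k] - 1 for k = 1, ..., K-1."""
    return [1 if rank > k else -1 for k in range(1, K)]


y, c = 3, [5, 2, 0, 1, 4]          # V-shaped around y = 3 with c[y] = 0
labels, weights = to_binary(y, c)
y_back, c_back = to_ordinal(labels, weights)
```

The round trip recovers both the rank and the cost vector, and `binary_answers(y, 5)` reproduces the binary labels, which mirrors the fact that $g_r$ answers every question consistently with $r$.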
The stages of reduction and reverse reduction are illustrated in Figure 4.1. Next, we show how the reverse-reduction technique complements the reduction method and allows us to draw a strong theoretical connection between ordinal ranking and binary classification. In addition, reverse reduction is useful in designing boosting algorithms for ordinal ranking, which will be demonstrated in Subsection 4.2.2.
[Figure 4.1, top (reduction): ordinal examples $(x_n, y_n, c_n)$ ⇒ weighted binary examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$, $k = 1, \ldots, K-1$ ⇒ core binary classification algorithm ⇒ related binary classifiers $g(X^{(k)})$, $k = 1, \ldots, K-1$ ⇒ ordinal ranker $r_g(x)$. Bottom (reverse reduction): weighted binary examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$, $k = 1, \ldots, K-1$ ⇒ ordinal examples $(x_n, y_n, c_n)$ ⇒ core ordinal ranking algorithm ⇒ ordinal ranker $r(x)$ ⇒ related binary classifiers $g_r(X^{(k)})$, $k = 1, \ldots, K-1$.]
Figure 4.1: Reduction (top) and reverse reduction (bottom)
Recall that by the definition of $r_*$ and $g_*$, for any ranker $r$ and any binary classifier $g$,
$$\pi(r) \ge \pi(r_*), \qquad \pi_b(g) \ge \pi_b(g_*). \tag{4.7}$$
With the definitions of r∗ and g∗, the reverse-reduction technique allows a simple proof of the following regret bound.
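Theorem 4.3 (Regret bound of the reduction framework). If $g(x, k)$ is rank-monotonic, or if every cost vector $c$ is convex, then
$$\pi(r_g) - \pi(r_*) \le \pi_b(g) - \pi_b(g_*).$$
Proof.
$$\begin{aligned}
\pi(r_g) - \pi(r_*) &\le \pi_b(g) - \pi(r_*) && \text{(from Theorem 4.1)} \\
&= \pi_b(g) - \pi_b(g_{r_*}) && \text{(from Lemma 4.2)} \\
&\le \pi_b(g) - \pi_b(g_*) && \text{(from (4.7))}.
\end{aligned}$$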
The cost bound (Theorem 4.1) and the regret bound (Theorem 4.3) provide different guarantees for the reduction method. The former describes how the ordinal ranking cost is upper bounded by the binary classification cost in an absolute sense, and the latter describes the upper bound in a relative sense. An immediate implication of the regret bound is as follows. If there exists an optimal binary classifier $g_+$ that is also rank-monotonic, then for $g = g_+$ the right-hand side of the inequality is 0, and so is the nonnegative left-hand side. That is, every such optimal binary classifier under $dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$ corresponds to an optimal ranker under $dF(x, y, c)$. In other words, there is no gap between ordinal ranking and binary classification in terms of optimality. In the following theorem, we show a general sufficient condition for the correspondence.
Theorem 4.4. Assume that the effective cost
$$c_x[k] = \int_{c,y} c[k] \, dF(c, y \mid x) - \min_{1 \le \ell \le K} \int_{c,y} c[\ell] \, dF(c, y \mid x)$$
is V-shaped with respect to $y_x = \operatorname{argmin}_{1 \le \ell \le K} c_x[\ell]$ on every feature vector $x \in \mathcal{X}$. Let
$$g_+(x, k) \equiv 2\llbracket y_x > k \rrbracket - 1.$$
Then $g_+$ is rank-monotonic and is optimal under $dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$.
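Proof. By construction $g_+$ is rank-monotonic. Because the effective cost $c_x$ is V-shaped with respect to $y_x$, we have $c_x[k] \ge c_x[k+1]$ for $k < y_x$ and $c_x[k] \le c_x[k+1]$ for $k \ge y_x$. Hence, for all $c_x[k] - c_x[k+1] \neq 0$,
$$g_+(x, k) = \operatorname{sign}\bigl( c_x[k] - c_x[k+1] \bigr).$$
That is,
$$g_+(x, k) \cdot \bigl( c_x[k] - c_x[k+1] \bigr) = \bigl| c_x[k] - c_x[k+1] \bigr| .$$
In addition, because every cost vector $c$ is V-shaped with respect to $y$ (as assumed in Theorem 4.1), the definitions in (4.2) give $W^{(k)} \cdot Y^{(k)} = (K-1) \cdot (c[k] - c[k+1])$. Then, for any binary classifier $g$ with outputs in $\{-1, +1\}$,
$$\begin{aligned}
\pi_b(g) &= \int_{X^{(k)}, Y^{(k)}, W^{(k)}} W^{(k)} \llbracket Y^{(k)} \neq g(X^{(k)}) \rrbracket \, dF_b(X^{(k)}, Y^{(k)}, W^{(k)}) \\
&= \frac{1}{4} \int_{X^{(k)}, Y^{(k)}, W^{(k)}} W^{(k)} \bigl( Y^{(k)} - g(X^{(k)}) \bigr)^2 \, dF_b(X^{(k)}, Y^{(k)}, W^{(k)}) \\
&= \Delta - \frac{1}{2} \int_{X^{(k)}, Y^{(k)}, W^{(k)}} W^{(k)} \cdot Y^{(k)} \cdot g(X^{(k)}) \, dF_b(X^{(k)}, Y^{(k)}, W^{(k)}) \\
&= \Delta - \frac{1}{2} \sum_{k=1}^{K-1} \int_{x} g(x, k) \int_{y,c} \bigl( c[k] - c[k+1] \bigr) \, dF(y, c \mid x) \, dF(x) \\
&= \Delta - \frac{1}{2} \sum_{k=1}^{K-1} \int_{x} g(x, k) \cdot \bigl( c_x[k] - c_x[k+1] \bigr) \, dF(x) \\
&\ge \Delta - \frac{1}{2} \sum_{k=1}^{K-1} \int_{x} \bigl| c_x[k] - c_x[k+1] \bigr| \, dF(x) \\
&= \Delta - \frac{1}{2} \sum_{k=1}^{K-1} \int_{x} g_+(x, k) \cdot \bigl( c_x[k] - c_x[k+1] \bigr) \, dF(x) \\
&= \pi_b(g_+),
\end{aligned}$$
where $\Delta = \frac{1}{2} \int W^{(k)} \, dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$ is a constant that does not depend on $g$. The fourth line uses the uniform choice of $k$ in $dF_b$, whose probability $\frac{1}{K-1}$ cancels the factor $K-1$ in $W^{(k)} \cdot Y^{(k)}$. Thus, the classifier $g_+$ is optimal under $dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$.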
Note that if every cost vector $c$ is convex, the effective cost $c_x$ would also be convex and hence V-shaped. Thus, the convexity of $c$ is also a sufficient (though more restrictive) condition for the correspondence between optimal binary classifiers and optimal rankers.
As we can see from the definition of $r_*$ in (1.1), the effective cost $c_x$ conveys sufficient information for determining the optimal prediction at $x$. Because ordinal ranking predictions should take “closeness” into account (see Section 1.2), it is reasonable to assume that $c_x$ is V-shaped. Hence, in general (with such a minor assumption), optimal binary classifiers correspond to optimal rankers.
The results above demonstrate that ordinal ranking can be reduced to binary classification without any loss of optimality. That is, ordinal ranking is “no harder than” binary classification. Intuitively, binary classification is also “no harder than”
ordinal ranking, because the former is a special case of the latter with K = 2. Next,
we formalize the notion of hardness with the probably approximately correct (PAC) setup in computational learning theory (Kearns and Vazirani 1994) and prove that ordinal ranking and binary classification are indeed equivalent in hardness. We use the following definition of PAC in our coming theorems (Kearns and Vazirani 1994;
Valiant 1984).
Definition 4.5. In cost-sensitive classification, a learning model $\mathcal{G}$ is efficiently PAC-learnable (using the same representation class) if there exists a learning algorithm $\mathcal{A}$ satisfying the following property: for every distribution $dF(x, y, c)$ being considered, where
$$c[g_*(x)] = c[y] = c_{\min} = 0,$$
with some $g_* \in \mathcal{G}$; for all $0 < \epsilon$ and $0 < \delta < \frac{1}{2}$, if $\mathcal{A}$ is given access to an oracle generating examples $(x, y, c)$ from $dF(x, y, c)$, as well as inputs $\epsilon$ and $\delta$, then $\mathcal{A}$ outputs $\hat{g} \in \mathcal{G}$ such that $\pi(\hat{g}, F) \le \epsilon$ with probability at least $1 - \delta$, in time polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.
Briefly speaking, the definition assumes that the target function g∗ is within the learning model G and is of cost 0 (the minimum cost). In other words, it is the noiseless setup of learning. In the noisy case, we can use the following notion of agnostic PAC learning.
Definition 4.6. In cost-sensitive classification, a learning model $\mathcal{G}$ is efficiently agnostic PAC-learnable (using the same representation class) if there exists a learning algorithm $\mathcal{A}$ satisfying the following property: for every distribution $dF(x, y, c)$ being considered, where $c[y] = c_{\min}$; for all $0 < \epsilon$ and $0 < \delta < \frac{1}{2}$, if $\mathcal{A}$ is given access to an oracle generating examples $(x, y, c)$ from $dF(x, y, c)$, as well as inputs $\epsilon$ and $\delta$, then $\mathcal{A}$ outputs $\hat{g} \in \mathcal{G}$ satisfying $\pi(\hat{g}, F) - \pi(g_*, F) \le \epsilon$ with probability at least $1 - \delta$, in time polynomial in $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$.
With Definitions 4.5 and 4.6, we now introduce the equivalence theorem.
Theorem 4.7 (Equivalence theorem of the reduction framework). Consider a learning model $\mathcal{R}$ for ordinal ranking, its associated learning model $\mathcal{G} = \{ g_r : r \in \mathcal{R} \}$ for binary classification, and distributions $dF(x, y, c)$ such that all cost vectors $c$ and effective cost vectors $c_x$ are V-shaped.
1. $\mathcal{R}$ is efficiently PAC-learnable if and only if $\mathcal{G}$ is efficiently PAC-learnable.
2. $\mathcal{R}$ is efficiently agnostic PAC-learnable if and only if $\mathcal{G}$ is efficiently agnostic PAC-learnable.
Proof. If $\mathcal{G}$ is efficiently PAC-learnable using algorithm $\mathcal{A}_G$, we can convert $\mathcal{A}_G$ to an efficient algorithm $\mathcal{A}_R$ for ordinal ranking.
1. Transform the oracle generating $(x, y, c)$ from $dF(x, y, c)$ to an oracle generating $(X^{(k)}, Y^{(k)}, W^{(k)})$ by (4.2).
2. Run $\mathcal{A}_G$ with the transformed oracle until it outputs some $g(X^{(k)})$.
3. Return $r_g$.
It is not hard to see that $\mathcal{A}_R$ is as efficient as $\mathcal{A}_G$, and the cost guarantee comes from Theorems 4.1 and 4.3.
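Now we consider the other direction. If $\mathcal{R}$ is efficiently PAC-learnable using algorithm $\mathcal{A}_R$, we can convert $\mathcal{A}_R$ to an efficient algorithm $\mathcal{A}_G$ for binary classification.
1. Transform the oracle generating $(X^{(k)}, Y^{(k)}, W^{(k)})$ from $dF_b(X^{(k)}, Y^{(k)}, W^{(k)})$ to an oracle generating $(x, y, c)$ by
$$x = \bigl( X^{(k)}[1], X^{(k)}[2], \ldots, X^{(k)}[D] \bigr);$$
$$c = \begin{cases}
\dfrac{W^{(k)}}{K-1} \cdot \bigl( \underbrace{0, \ldots, 0}_{k}, 1, \ldots, 1 \bigr) & \text{for } Y^{(k)} = -1, \\[2ex]
\dfrac{W^{(k)}}{K-1} \cdot \bigl( \underbrace{1, \ldots, 1}_{k}, 0, \ldots, 0 \bigr) & \text{for } Y^{(k)} = +1;
\end{cases}$$
$$y = \operatorname*{argmin}_{1 \le \ell \le K} c[\ell].$$
2. Run $\mathcal{A}_R$ with the transformed oracle until it outputs some $r(x)$.
3. Return $g_r$.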
We can easily see that $\mathcal{A}_G$ is as efficient as $\mathcal{A}_R$. Denote $dF_R(x, y, c)$ as the probability measure described in step one. It is simple to prove that $F_R$ with (4.2) would also induce $F_b$. Then, by Lemma 4.2,
$$\pi(g_r, F_b) = \pi(r, F_R) \quad \text{for all } r \in \mathcal{R}.$$
Therefore, $\pi(g_r, F_b) \le \epsilon$ after running $\mathcal{A}_G$.
The proof above deals with the noiseless PAC-learnability result. Similar steps can be used to show the agnostic PAC-learnability result.
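The oracle transformation in the second half of the proof can be illustrated as follows. This is a minimal sketch with hypothetical function names: it builds the ordinal example from one weighted binary example with index $k$, then applies (4.2) at the same $k$ to confirm that the label and weight are recovered. The sketch breaks ties in the argmin by taking the smallest minimizer, which is an assumption; any minimizer falls on the correct side of $k$.

```python
def binary_to_ordinal(k, Y, W, K):
    """Build (y, c) from one weighted binary example with index k.
    List entry c[i] stores the cost of rank i + 1."""
    scale = W / (K - 1)
    if Y < 0:
        c = [0.0] * k + [scale] * (K - k)   # ranks 1..k are free, larger ranks cost
    else:
        c = [scale] * k + [0.0] * (K - k)   # ranks 1..k cost, larger ranks are free
    y = 1 + min(range(K), key=lambda i: c[i])   # smallest minimizer (assumption)
    return y, c


def ordinal_to_binary(y, c, k):
    """(4.2) at a single index k: returns (Y^(k), W^(k))."""
    K = len(c)
    label = 1 if y > k else -1
    weight = (K - 1) * abs(c[k] - c[k - 1])
    return label, weight


K, k, W = 4, 2, 6.0
y_pos, c_pos = binary_to_ordinal(k, +1, W, K)
y_neg, c_neg = binary_to_ordinal(k, -1, W, K)
```

At every other index the recovered weight is zero, so the ordinal example carries exactly the information of the single binary example it was built from.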
Theorem 4.7 demonstrates that ordinal ranking is as easy (hard) as the associated binary classification problem. If we look at g(x, k) for one particular k, we see that the binary classification problem can be simplified to classifying all (x, y) with y > k as Y(k) = +1, and all (x, y) with y ≤ k as Y(k) =−1. Recall that in Section 1.2, we discussed that ordinal ranking allows natural “ordinal comparison between different ranks.” When the comparison between “all x with rank less than k” and “all x with rank at least k” is natural, it is simple for Ab to locate a decent g(x, k). In other words, the underlying binary classification problem is easy. On the other hand, if the comparison is not natural, such as the fruit categorization example in Section 1.2, both the “ordinal ranking” problem and the underlying binary classification are hard.
SVM is a popular binary classification algorithm (Schölkopf and Smola 2002; Vapnik 1995), which will be further introduced in Subsection 5.1.1. It maps the feature vector $x$ to $\phi(x)$ in a possibly higher-dimensional space and implicitly computes the inner products with a kernel function
$$\mathcal{K}(x, x') = \langle \phi(x), \phi(x') \rangle .$$
If we encode $(x, k)$ by $(x, -\gamma 1_k)$, we can then compute the inner products of the extended examples by
$$\mathcal{K}_E\bigl( (x, k), (x', k') \bigr) = \bigl\langle (\phi(x), k), (\phi(x'), k') \bigr\rangle = \mathcal{K}(x, x') + \gamma^2 \llbracket k = k' \rrbracket .$$
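The extended kernel is simple to compute from any base kernel. The sketch below is only an illustration: the names `linear_kernel` and `extended_kernel` are hypothetical, and the linear kernel is chosen as the base kernel purely to keep the arithmetic easy to follow.

```python
def linear_kernel(x, x2):
    # Assumption for this sketch: a plain inner product as the base kernel.
    return sum(a * b for a, b in zip(x, x2))


def extended_kernel(xk, xk2, base_kernel, gamma=1.0):
    """Base kernel on the feature parts plus gamma^2 when the indices k agree."""
    (x, k), (x2, k2) = xk, xk2
    return base_kernel(x, x2) + gamma ** 2 * (1.0 if k == k2 else 0.0)


same_k = extended_kernel(((1, 2), 1), ((3, 4), 1), linear_kernel, gamma=2.0)
diff_k = extended_kernel(((1, 2), 1), ((3, 4), 2), linear_kernel, gamma=2.0)
```

Because the extra term depends only on whether the two indices match, an SVM package that accepts a user-supplied kernel could be given this function without any change to its solver.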
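With the reduction framework, we can plug $\mathcal{K}_E$ and $O(NK)$ extended training examples into the standard SVM to obtain a ranker
$$r(x) = 1 + \sum_{k=1}^{K-1} \llbracket \langle v, \phi(x) \rangle + b - \theta_k > 0 \rrbracket ,$$
based on an optimal solution to
$$\min_{v, b, \theta_k, \xi_n^{(k)}} \ \frac{1}{2} \langle v, v \rangle + \frac{1}{2\gamma^2} \langle \theta, \theta \rangle + \kappa \sum_{n=1}^{N} \sum_{k=1}^{K-1} W_n^{(k)} \xi_n^{(k)} \tag{4.8}$$
subject to $Y_n^{(k)} \bigl( \langle v, \phi(x_n) \rangle + b - \theta_k \bigr) \ge 1 - \xi_n^{(k)}$ and $\xi_n^{(k)} \ge 0$, for $n = 1, \ldots, N$ and $k = 1, \ldots, K-1$.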
If $\theta_1 \le \theta_2 \le \ldots \le \theta_{K-1}$, or if the cost vectors considered are convex, Theorems 4.1 and 4.3 can guarantee the expected out-of-sample cost of $r(x)$ based on the expected out-of-sample cost of the binary classifier
$$g(x, k) = \operatorname{sign}\bigl( \langle v, \phi(x) \rangle + b - \theta_k \bigr).$$
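The role of ordered thresholds can be seen in a small sketch. Here `score` stands for the value $\langle v, \phi(x) \rangle + b$, and the function names are hypothetical. When the thresholds are sorted, the binary answers are rank-monotonic and the rank can be read off by a binary search; when they are not sorted, rank-monotonicity can fail.

```python
from bisect import bisect_left


def rank_by_counting(score, thetas):
    # One plus the number of k with score - theta_k > 0.
    return 1 + sum(1 for theta in thetas if score - theta > 0)


def rank_by_bisect(score, sorted_thetas):
    # Valid only for non-decreasing thetas: counts the thresholds strictly below score.
    return 1 + bisect_left(sorted_thetas, score)


def is_rank_monotonic(score, thetas):
    # g(x, k-1) >= g(x, k) for the real-valued answers score - theta_k.
    answers = [score - theta for theta in thetas]
    return all(a >= b for a, b in zip(answers, answers[1:]))


ordered = [-1.0, 0.5, 2.0]
unordered = [2.0, -1.0]
```

For ordered thresholds the two rank computations agree for every score, including scores that coincide with a threshold.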
The oSVM approach of Cardoso and da Costa (2007) is an instance of (4.8) with the absolute cost, in which all $W_n^{(k)}$ are equal. The SVOR-IMC approach of Chu and Keerthi (2007) can also be thought of as a modified instance of the formulation with the absolute cost, except that the $\frac{1}{2\gamma^2} \langle \theta, \theta \rangle$ term is dropped. Their SVOR-EXC approach is another modified instance using the classification cost plus an additional constraint to guarantee that $\theta_1 \le \theta_2 \le \ldots \le \theta_{K-1}$.
RED-SVM unifies these algorithms under a generic formulation (4.8) with the reduction framework and allows us to deal with any convex cost vectors by changing $W_n^{(k)}$, or with any cost vectors by changing $W_n^{(k)}$ as well as respecting the constraint $\theta_1 \le \theta_2 \le \ldots \le \theta_{K-1}$.³
³ The additional constraint can be respected by a coordinate-descent procedure that switches between optimizing $(v, b)$ (using the standard SVM solver) and optimizing $\theta$ under the constraints (a small quadratic programming problem with an analytic solution).
Chu and Keerthi (2007) found that SVOR-EXC performed better in terms of the classification cost, and SVOR-IMC performed better in terms of the absolute cost. Their findings can be well explained through the reduction framework with the formulation above. Note that Chu and Keerthi (2007) put considerable effort into designing and implementing suitable optimizers for the modified formulation that does not contain the $\frac{1}{2\gamma^2} \langle \theta, \theta \rangle$ term. If we use the standard soft-margin SVM instead, we can directly and efficiently use state-of-the-art SVM software to deal with the ordinal ranking problem. The formulation of Chu and Keerthi (2007) can be approximated by using a large $\gamma$. As we shall see in Subsection 4.3.1, even a simple assignment of $\gamma = 1$ performs similarly to the approaches of Chu and Keerthi (2007) in practice.
In addition to the algorithmic benefits described above, the reduction framework can also be used theoretically. For instance, we demonstrated how we can derive novel bounds of some common cost functions in Section 3.1. Next, we extend the bounds to SVM-based formulations and to a wider class of cost functions. While Shashua and Levin (2003) derived one such bound with a specific cost function, their bound is not data dependent and hence does not fully explain the out-of-sample performance of SVM-based rankers in reality (for more discussions on data-dependent bounds, see the work of, for example, Bartlett and Shawe-Taylor (1998)). Our bound, on the other hand, is not only more general, but also data dependent.
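Theorem 4.8 (Large-margin bounds for SVM-based rankers). Consider a collection
$$\mathcal{F} = \Bigl\{ f_{v,b,\theta} : f_{v,b,\theta}(X^{(k)}) = \langle v, \phi(x) \rangle + b - \theta_k, \ \text{where } \|v\|^2 + \|b - \theta\|^2 \le 1, \ \|\phi(x)\|^2 + 1 \le R^2 \Bigr\}.$$
Let $B_{\max} = \max_{c \in \mathcal{C}} (c[1] + c[K])$, $B_{\min} = \min_{c \in \mathcal{C}} (c[1] + c[K])$, and $\beta = B_{\max} / B_{\min}$. If $\theta_1 \le \theta_2 \le \ldots \le \theta_{K-1}$, or if every $c$ is convex, for any $\Delta > 0$, with probability at least $1 - \delta$, and for every $f$ in $\mathcal{F}$, the associated ranker
$$r(x) = 1 + \sum_{k=1}^{K-1} \llbracket f(X^{(k)}) > 0 \rrbracket$$
satisfies
$$\pi(r) \le \frac{\beta}{N \cdot (K-1)} \sum_{n=1}^{N} \sum_{k=1}^{K-1} W_n^{(k)} \llbracket Y_n^{(k)} f(X_n^{(k)}) \le \Delta \rrbracket + O\!\left( \frac{\log N}{\sqrt{N}}, \frac{R}{\Delta}, \sqrt{\log \frac{1}{\delta}} \right).$$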
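Proof. For every example $(x, y, c)$, by the same derivation as Theorem 4.1,
$$\begin{aligned}
(K-1) \cdot c[r(x)] &\le \sum_{k=1}^{K-1} W^{(k)} \llbracket Y^{(k)} f(X^{(k)}) \le 0 \rrbracket \\
&\le (K-1) \cdot (c[1] + c[K]) \cdot \sum_{k=1}^{K-1} \frac{W^{(k)}}{(K-1) \cdot (c[1] + c[K])} \llbracket Y^{(k)} f(X^{(k)}) \le 0 \rrbracket .
\end{aligned}$$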
Note that
$$P^{(k)} = \frac{W^{(k)}}{(K-1) \cdot (c[1] + c[K])}$$
sums to 1 over $k = 1, \ldots, K-1$. Then, for each example $(x, y, c)$ obtained from $dF(x, y, c)$, we can randomly choose $k$ according to $P^{(k)}$ and form an unweighted binary example $(X^{(k)}, Y^{(k)})$. The procedure above defines a probability measure $dF_u(X^{(k)}, Y^{(k)})$. Integrating over all $(x, y, c)$, we get
$$\pi(r) \le B_{\max} \int_{X^{(k)}, Y^{(k)}} \llbracket Y^{(k)} f(X^{(k)}) \le 0 \rrbracket \, dF_u(X^{(k)}, Y^{(k)}).$$
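When each $k_n$ is chosen independently according to $P_n^{(k)}$, we can generate from $Z$ a set of $N$ independent examples $(X_n^{(k_n)}, Y_n^{(k_n)})$ that follow $dF_u(X^{(k)}, Y^{(k)})$. Then, using a cost bound for SVM in binary classification (Bartlett and Shawe-Taylor 1998), with probability at least $1 - \frac{\delta}{2}$ over the choice of $\bigl\{ (X_n^{(k_n)}, Y_n^{(k_n)}) \bigr\}_{n=1}^{N}$,
$$\int_{X^{(k)}, Y^{(k)}} \llbracket Y^{(k)} f(X^{(k)}) \le 0 \rrbracket \, dF_u(X^{(k)}, Y^{(k)}) \le \frac{1}{N} \sum_{n=1}^{N} \llbracket Y_n^{(k_n)} f(X_n^{(k_n)}) \le \Delta \rrbracket + O\!\left( \frac{\log N}{\sqrt{N}}, \frac{R}{\Delta}, \sqrt{\log \frac{1}{\delta}} \right).$$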
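Using the same technique as the proof of Theorem 3.2 with $b_n = \llbracket Y_n^{(k_n)} f(X_n^{(k_n)}) \le \Delta \rrbracket$ and a union bound, with probability greater than $1 - \delta$,
$$\begin{aligned}
\pi(r) &\le \frac{B_{\max}}{N} \sum_{n=1}^{N} \llbracket Y_n^{(k_n)} f(X_n^{(k_n)}) \le \Delta \rrbracket + O\!\left( \frac{\log N}{\sqrt{N}}, \frac{R}{\Delta}, \sqrt{\log \frac{1}{\delta}} \right) \\
&\le \frac{B_{\max}}{N} \sum_{n=1}^{N} \frac{1}{(K-1) \cdot (c_n[1] + c_n[K])} \sum_{k=1}^{K-1} W_n^{(k)} \cdot \llbracket Y_n^{(k)} f(X_n^{(k)}) \le \Delta \rrbracket \\
&\qquad + O\!\left( \frac{\log N}{\sqrt{N}}, \frac{R}{\Delta}, \sqrt{\log \frac{1}{\delta}} \right) + O\!\left( \frac{1}{\sqrt{N}}, \sqrt{\log \frac{1}{\delta}} \right) \\
&\le \frac{\beta}{N \cdot (K-1)} \sum_{n=1}^{N} \sum_{k=1}^{K-1} W_n^{(k)} \cdot \llbracket Y_n^{(k)} f(X_n^{(k)}) \le \Delta \rrbracket + O\!\left( \frac{\log N}{\sqrt{N}}, \frac{R}{\Delta}, \sqrt{\log \frac{1}{\delta}} \right).
\end{aligned}$$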
Thus, if $f$ achieves large margins ($\ge \Delta$) on most of the extended training examples $(X_n^{(k)}, Y_n^{(k)}, W_n^{(k)})$, $\pi(r)$ is guaranteed to be small. A similar proof can be used to extend Theorem 3.2 and Corollary 3.4 to a more general class of cost functions.
stage, we first apply the reverse-reduction technique in (4.6) to cast each ranker $r_t$ as a binary classifier $g_t = g_{r_t}$. The weighted votes from all the binary classifiers in the ensemble are gathered to form binary predictions. Then, the reduction method comes into play and constructs an ordinal prediction from the binary ones by (4.1).
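Combining the steps above, we get the following prediction rule for an ordinal ranking ensemble $U$:
$$r_U(x) \equiv 1 + \sum_{k=1}^{K-1} \left\llbracket \sum_{t=1}^{T} v_t \llbracket k < r_t(x) \rrbracket \ge \frac{1}{2} \sum_{t=1}^{T} v_t \right\rrbracket . \tag{4.9}$$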
The steps of going back and forth between reduction and reverse reduction may seem complicated. Nevertheless, we can simplify many of them with careful derivations, which are illustrated below. We start with the prediction steps and derive a simplified form of (4.9) as follows.
Theorem 4.9. For any ordinal ranking ensemble $U = \{ (r_t, v_t) \}_{t=1}^{T}$ such that $v_t \ge 0$ and $\sum_{t=1}^{T} v_t = 1$,
$$r_U(x) = \min \left\{ k : \sum_{t=1}^{T} v_t \llbracket k \ge r_t(x) \rrbracket > \frac{1}{2} \right\}. \tag{4.10}$$
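Proof. Let $k^* = \min \bigl\{ k : \sum_{t=1}^{T} v_t \llbracket k \ge r_t(x) \rrbracket > \frac{1}{2} \bigr\}$. Then, $\sum_{t=1}^{T} v_t \llbracket k \ge r_t(x) \rrbracket > \frac{1}{2}$ if and only if $k^* \le k$. That is, $\sum_{t=1}^{T} v_t \llbracket k < r_t(x) \rrbracket \ge \frac{1}{2}$ if and only if $k < k^*$. Therefore, $r_U(x) = 1 + k^* - 1 = k^*$.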
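The equivalence of the two prediction rules can be checked by brute force on a small ensemble. The sketch below uses hypothetical function names and a made-up vote vector whose entries are exact binary fractions, so that the floating-point comparisons with one half are exact.

```python
from itertools import product


def predict_by_votes(ranks, votes, K):
    """(4.9): weighted majority vote on each question 'is the rank greater than k?'."""
    half = 0.5 * sum(votes)
    rank = 1
    for k in range(1, K):
        support = sum(v for r, v in zip(ranks, votes) if k < r)
        if support >= half:
            rank += 1
    return rank


def predict_by_weighted_median(ranks, votes, K):
    """(4.10): smallest k whose cumulative vote mass exceeds one half."""
    for k in range(1, K + 1):
        if sum(v for r, v in zip(ranks, votes) if k >= r) > 0.5:
            return k
    raise ValueError("votes must be nonnegative and sum to 1")


K = 4
votes = [0.25, 0.25, 0.5]
# Compare the two rules on every combination of base-ranker predictions.
agree = all(
    predict_by_votes(ranks, votes, K) == predict_by_weighted_median(ranks, votes, K)
    for ranks in product(range(1, K + 1), repeat=len(votes))
)
```

The second function never needs the binary decomposition at all, which is the practical benefit of the simplified rule.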
Thus, the seemingly complicated prediction rule (4.9) can be equivalently performed in (4.10) by computing a simple and intuitive statistic: the weighted median. Note that the rule in (4.10) is not specific to our approach. It can be applied to ordinal ranking ensembles produced by any ensemble learning approach, such as bagging (Breiman 1996).
We now look at the training steps. First, we list the steps of the original AdaBoost.
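Algorithm 4.2 (AdaBoost, Freund and Schapire 1997).
1. For a given training set $\tilde{Z} = \{ (\tilde{x}_m, \tilde{y}_m, \tilde{w}_m) \}_{m=1}^{M}$, initialize $\tilde{w}_m^{(1)} = \tilde{w}_m$ for all $m$.
2. For $t = 1, 2, \ldots, T$,
(a) Obtain $\tilde{g}_t$ from the base binary classification algorithm $\mathcal{A}_b$.
(b) Compute the weighted training error $\tilde{\epsilon}_t$:
$$\tilde{\epsilon}_t = \left( \sum_{m=1}^{M} \tilde{w}_m^{(t)} \cdot \llbracket \tilde{y}_m \neq \tilde{g}_t(\tilde{x}_m) \rrbracket \right) \Bigg/ \left( \sum_{m=1}^{M} \tilde{w}_m^{(t)} \right).$$
If $\tilde{\epsilon}_t > \frac{1}{2}$, set $T = t - 1$ and abort loop.
(c) Let $\tilde{v}_t = \frac{1}{2} \log \frac{1 - \tilde{\epsilon}_t}{\tilde{\epsilon}_t}$.
(d) Let $\tilde{\Lambda}_t = \exp(2 \tilde{v}_t) - 1$, and
$$\tilde{w}_m^{(t+1)} = \begin{cases}
\tilde{w}_m^{(t)}, & \tilde{y}_m = \tilde{g}_t(\tilde{x}_m); \\
\tilde{w}_m^{(t)} + \tilde{\Lambda}_t \tilde{w}_m^{(t)}, & \tilde{y}_m \neq \tilde{g}_t(\tilde{x}_m).
\end{cases}$$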
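Steps (b) through (d) of one boosting round can be sketched as follows. The function name `adaboost_round` is hypothetical, and the sketch assumes that the base classifier's predictions on the training set are already available. The case of zero weighted error is not covered by the steps above, so the sketch simply raises an error there; that choice is an assumption of this illustration.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One round of steps (b)-(d): returns (v_t, new_weights), or None on abort."""
    wrong = [y != p for y, p in zip(labels, predictions)]
    # (b) weighted training error
    eps = sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)
    if eps > 0.5:
        return None                               # abort the boosting loop
    if eps == 0.0:
        raise ValueError("zero weighted error: the vote would be unbounded")
    # (c) vote of the base classifier
    v = 0.5 * math.log((1.0 - eps) / eps)
    # (d) raise the weights of the misclassified examples only
    lam = math.exp(2.0 * v) - 1.0
    new_weights = [w + lam * w if bad else w for w, bad in zip(weights, wrong)]
    return v, new_weights


result = adaboost_round([1.0, 1.0, 1.0, 1.0], [1, 1, -1, -1], [1, 1, -1, 1])
```

In this example the single misclassified example ends up with weight close to 3, so the classifier just chosen has weighted error one half under the updated weights.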
After plugging AdaBoost into reduction and a base ordinal ranking algorithm into reverse reduction, we can equivalently obtain the AdaBoost for ordinal ranking (AdaBoost.OR) algorithm below.