
Algorithm 4.2 (AdaBoost), main loop:

2. For t = 1, 2, ..., T,

   (a) Obtain $\tilde g_t$ from the base binary classification algorithm $A_b$.

   (b) Compute the weighted training error $\tilde\epsilon_t$,
       $$\tilde\epsilon_t = \left( \sum_{m=1}^{M} \tilde w_m^{(t)} \cdot [\![\, \tilde y_m \neq \tilde g_t(\tilde x_m) \,]\!] \right) \Bigg/ \left( \sum_{m=1}^{M} \tilde w_m^{(t)} \right).$$
       If $\tilde\epsilon_t > \frac{1}{2}$, set $T = t - 1$ and abort the loop.

   (c) Let $\tilde v_t = \frac{1}{2} \log \frac{1 - \tilde\epsilon_t}{\tilde\epsilon_t}$.

   (d) Let $\tilde\Lambda_t = \exp(2 \tilde v_t) - 1$, and
       $$\tilde w_m^{(t+1)} =
         \begin{cases}
           \tilde w_m^{(t)}, & \tilde y_m = \tilde g_t(\tilde x_m);\\
           \tilde w_m^{(t)} + \tilde\Lambda_t\, \tilde w_m^{(t)}, & \tilde y_m \neq \tilde g_t(\tilde x_m).
         \end{cases}$$

After plugging AdaBoost into reduction and a base ordinal ranking algorithm into reverse reduction, we can equivalently obtain the AdaBoost for ordinal ranking (AdaBoost.OR) algorithm below.

Algorithm 4.3 (AdaBoost.OR: AdaBoost for ordinal ranking), cost-update step:

(d) Let $\Lambda_t = \exp(2 v_t) - 1$. If $r_t(x_n) \ge y_n$, then
    $$c_n^{(t+1)}[k] =
      \begin{cases}
        c_n^{(t)}[k], & k \le y_n;\\
        c_n^{(t)}[k] + \Lambda_t \cdot c_n^{(t)}[k], & y_n < k \le r_t(x_n);\\
        c_n^{(t)}[k] + \Lambda_t \cdot c_n^{(t)}[r_t(x_n)], & k > r_t(x_n).
      \end{cases}$$
    Otherwise, switch $>$ to $<$ (and $\le$ to $\ge$) and vice versa in the conditions above.
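For concreteness, the cost update in step (d) can be written as a short routine; the function name and the 0-indexed array layout below are our own illustrative choices, not part of the algorithm's statement.

```python
import numpy as np

def adaboost_or_update(c, y, r, Lambda_t):
    """One AdaBoost.OR cost update (step (d)) for a single example.

    c        : length-K array, c[k-1] holds the current cost c_n^{(t)}[k]
    y        : true rank y_n in {1, ..., K}
    r        : rank r_t(x_n) predicted by the t-th base ranker
    Lambda_t : exp(2 v_t) - 1
    """
    K = len(c)
    c_new = np.array(c, dtype=float)
    ks = np.arange(1, K + 1)
    if r >= y:
        between = (y < ks) & (ks <= r)       # y_n < k <= r_t(x_n)
        beyond = ks > r                       # k > r_t(x_n)
    else:                                     # mirrored case: inequalities flipped
        between = (r <= ks) & (ks < y)        # r_t(x_n) <= k < y_n
        beyond = ks < r                       # k < r_t(x_n)
    c_new[between] += Lambda_t * c_new[between]
    c_new[beyond] += Lambda_t * c[r - 1]      # add Lambda_t * c_n^{(t)}[r_t(x_n)]
    return c_new
```

For instance, with K = 5, $y_n = 2$, and $r_t(x_n) = 4$, the costs at ranks 3 and 4 are scaled by $(1 + \Lambda_t)$, and rank 5 receives an extra $\Lambda_t \cdot c_n^{(t)}[4]$.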

The connection between Algorithms 4.2 and 4.3 is based on maintaining the following invariance in each iteration.

Lemma 4.10. Substitute the indices $m$ in Algorithm 4.2 with $(n, k)$. That is,
$$\tilde x_m = X_n^{(k)}, \qquad \tilde y_m = \tilde y_{nk} = Y_n^{(k)}, \qquad \tilde w_m = \tilde w_{nk} = W_n^{(k)}.$$
Take $\tilde g_t(x, k) = g_{r_t}(x, k)$ and assume that in Algorithms 4.2 and 4.3,
$$c_n^{(\tau)}[k] = \sum_{\ell=1}^{K-1} \tilde w_{n\ell}^{(\tau)} \cdot [\![\, y_n \le \ell < k \ \text{ or } \ k \le \ell < y_n \,]\!] \tag{4.11}$$
is satisfied for $\tau = t$ with $\tilde w_{n\ell}^{(\tau)} \ge 0$. Then, equation (4.11) is satisfied for $\tau = t + 1$ with $\tilde w_{n\ell}^{(\tau)} \ge 0$.

Proof. Because (4.11) is satisfied for $\tau = t$ and $\tilde w_{n\ell}^{(t)} \ge 0$, the cost vector $c_n^{(t)}$ is V-shaped with respect to $y_n$ and $c_n^{(t)}[y_n] = 0$. Thus,
$$\sum_{n=1}^{N} \left( c_n^{(t)}[1] + c_n^{(t)}[K] \right) = \sum_{n=1}^{N} \sum_{k=1}^{K-1} \tilde w_{nk}^{(t)}.$$
In addition, since $\tilde g_t(x, k) = g_{r_t}(x, k)$, by a proof similar to that of Lemma 4.2,
$$\sum_{n=1}^{N} c_n^{(t)}[r_t(x_n)] = \sum_{n=1}^{N} \sum_{k=1}^{K-1} \tilde w_{nk}^{(t)} \cdot [\![\, \tilde y_{nk} \neq \tilde g_t(x_n, k) \,]\!].$$
Therefore, $\tilde\epsilon_t = \epsilon_t$, $\tilde v_t = v_t$, and $\tilde\Lambda_t = \Lambda_t$.

Because $\tilde g_t(x_n, k) \neq \tilde y_{nk}$ if and only if $r_t(x_n) \le k < y_n$ or $y_n \le k < r_t(x_n)$,
$$\tilde w_{nk}^{(t+1)} =
  \begin{cases}
    \tilde w_{nk}^{(t)} + \tilde\Lambda_t\, \tilde w_{nk}^{(t)}, & r_t(x_n) \le k < y_n \ \text{ or } \ y_n \le k < r_t(x_n);\\
    \tilde w_{nk}^{(t)}, & \text{otherwise}.
  \end{cases} \tag{4.12}$$

It is easy to check that the $\tilde w_{nk}^{(t+1)}$ are nonnegative. Furthermore, we see that the update rule in Algorithm 4.3 is equivalent to combining (4.12) and (4.11) with $\tau = t + 1$.

Thus, equation (4.11) is satisfied for τ = t + 1.
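To see the lemma in action, here is a small self-contained numerical check of the invariance, written for the absolute cost with an arbitrary fixed ranker output; the array layout and variable names are ours, not the thesis's. One boosting iteration is performed with the AdaBoost update on the reduced weights and with the AdaBoost.OR update on the costs, and both sides of (4.11) are compared.

```python
import numpy as np

# Fixed toy data: ranks in {1, ..., K} for N examples, plus one base ranker's outputs.
N, K = 8, 5
y = np.array([1, 2, 3, 4, 5, 2, 4, 3])        # true ranks y_n
r = np.array([3, 2, 1, 5, 4, 4, 4, 1])        # predicted ranks r_t(x_n)

ks = np.arange(1, K + 1)                      # ranks 1, ..., K
ells = np.arange(1, K)                        # thresholds 1, ..., K-1

# Absolute costs c_n[k] = |k - y_n| and matching reduction weights w~_{nk} = |c_n[k+1] - c_n[k]|.
c = np.abs(ks[None, :] - y[:, None]).astype(float)   # shape (N, K)
w = np.abs(c[:, 1:] - c[:, :-1])                     # shape (N, K-1)

def costs_from_weights(w, y):
    """Right-hand side of (4.11): sum_l w_{nl} * [[ y_n <= l < k  or  k <= l < y_n ]]."""
    lo = np.minimum(y[:, None], ks[None, :])         # min(y_n, k), shape (N, K)
    hi = np.maximum(y[:, None], ks[None, :])         # max(y_n, k)
    between = (lo[:, :, None] <= ells) & (ells < hi[:, :, None])
    return (w[:, None, :] * between).sum(axis=2)

assert np.allclose(c, costs_from_weights(w, y))      # (4.11) holds at tau = 1

# Reduced binary labels y~_{nl} = +1 iff y_n > l, and predictions g~_t(x_n, l) = +1 iff r_t(x_n) > l.
wrong = (y[:, None] > ells) != (r[:, None] > ells)

eps_bin = (w * wrong).sum() / w.sum()
eps_ord = c[np.arange(N), r - 1].sum() / (c[:, 0] + c[:, -1]).sum()
assert np.isclose(eps_bin, eps_ord)                  # epsilon~_t = epsilon_t
Lam = np.exp(2 * 0.5 * np.log((1 - eps_bin) / eps_bin)) - 1

# AdaBoost update (4.12) on the reduced weights ...
w_next = w * (1 + Lam * wrong)

# ... and the AdaBoost.OR update (Algorithm 4.3, step (d)) on the costs.
c_next = c.copy()
for n in range(N):
    if r[n] >= y[n]:
        mid, far = (y[n] < ks) & (ks <= r[n]), ks > r[n]
    else:
        mid, far = (r[n] <= ks) & (ks < y[n]), ks < r[n]
    c_next[n, mid] += Lam * c[n, mid]
    c_next[n, far] += Lam * c[n, r[n] - 1]

assert np.allclose(c_next, costs_from_weights(w_next, y))
print("invariance (4.11) preserved after one boosting iteration")
```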

Then, by mathematical induction from $\tau = 1$ up to $T$ with Lemma 4.10, plugging AdaBoost into reduction and a base ordinal ranking algorithm into reverse reduction is equivalent to running AdaBoost.OR with that base algorithm. AdaBoost.OR includes AdaBoost as the special case $K = 2$. It can use any base ordinal ranking algorithm that produces individual rankers $r_t$ with errors $\epsilon_t \le \frac{1}{2}$. In binary classification, the $\frac{1}{2}$ error bound can be naturally achieved by a constant classifier or a fair coin flip. For ordinal ranking, is $\frac{1}{2}$ still easy to achieve? The short answer is yes. In the following theorem, we demonstrate that there always exists a constant ranker that satisfies the error bound.

Theorem 4.11. Define constant rankers $r^{(k)}$ by $r^{(k)}(x) \equiv k$ for all $x$. For any set $\{c_n\}_{n=1}^{N}$, there exists a constant ranker with $1 \le k \le K$ such that
$$\epsilon^{(k)} = \left( \sum_{n=1}^{N} c_n\bigl[ r^{(k)}(x_n) \bigr] \right) \Bigg/ \left( \sum_{n=1}^{N} \bigl( c_n[1] + c_n[K] \bigr) \right) \le \frac{1}{2}.$$

Proof. Either $r^{(1)}$ or $r^{(K)}$ achieves error at most $\frac{1}{2}$ because, by definition, $\epsilon^{(1)} + \epsilon^{(K)} = 1$.
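As a quick numerical illustration (with an arbitrary toy cost matrix), the snippet below evaluates $\epsilon^{(k)}$ for every constant ranker and confirms that the better of $r^{(1)}$ and $r^{(K)}$ already meets the bound.

```python
import numpy as np

C = np.array([[0., 1., 2., 3.],        # cost vectors c_n[1..K] for N = 3, K = 4
              [2., 1., 0., 1.],
              [3., 2., 1., 0.]])

denom = (C[:, 0] + C[:, -1]).sum()
eps = C.sum(axis=0) / denom            # eps^{(k)} for the constant ranker r^{(k)}(x) = k
print(eps)                             # eps^{(1)} + eps^{(K)} = 1 by construction
assert min(eps[0], eps[-1]) <= 0.5
```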

Therefore, even the simplest deterministic rankers can always achieve the desired error bound.4 If the base ordinal ranking algorithm produces better rankers, the following theorem bounds the normalized training cost of the final ensemble U.

4Similarly, the error bound can be achieved by a randomized ordinal ranker that returns either 1 or K with equal probability.

Theorem 4.12. Suppose the base ordinal ranking algorithm produces rankers with errors $\epsilon_1, \ldots, \epsilon_T$, where each $\epsilon_t \le \frac{1}{2}$. Let $\gamma_t = \frac{1}{2} - \epsilon_t$; the final ensemble $r_U$ satisfies the following bound:
$$\frac{N}{\sum_{n=1}^{N} \bigl( c_n[1] + c_n[K] \bigr)} \cdot \nu(r_U) \;\le\; \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\!\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right).$$

Proof. Similar to the proof for Lemma 4.10, the left-hand side of the bound equals
$$\left( \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_{nk} \, [\![\, y_{nk} \neq g_{\tilde U}(x_n, k) \,]\!] \right) \Bigg/ \left( \sum_{n=1}^{N} \sum_{k=1}^{K-1} w_{nk} \right),$$
where $\tilde U$ is a binary classification ensemble $\{(\tilde g_t, v_t)\}_{t=1}^{T}$ with $\tilde g_t = g_{r_t}$. Then, the bound is a simple consequence of the well-known AdaBoost bound (Freund and Schapire 1997).
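For completeness, the second inequality in Theorem 4.12 is just the elementary bound $1 + x \le e^x$ applied to each factor:
$$\sqrt{1 - 4\gamma_t^2} \;\le\; \sqrt{\exp(-4\gamma_t^2)} \;=\; \exp(-2\gamma_t^2), \qquad \text{so} \qquad \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \;\le\; \exp\!\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right).$$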

Theorem 4.12 indicates that if the base algorithm always produces a ranker with $\epsilon_t \le \frac{1}{2} - \gamma$ for some $\gamma > 0$, the training cost of $r_U$ decreases exponentially with $T$. That is, AdaBoost.OR can rapidly boost the training performance of such a base algorithm.

Similar to Theorems 3.2 and 4.8, we can also use reduction to extend the out-of-sample cost bounds of AdaBoost to AdaBoost.OR, including the iteration-based bound (Freund and Schapire 1997) and the margin-based ones (Schapire et al. 1998).

Table 4.2: Test absolute cost of SVM-based ordinal ranking algorithms

data set      RED-SVM (perceptron)   SVOR-IMC (Gaussian)
pyrimidines   1.304±0.040            1.294±0.046
machine       0.842±0.022            0.990±0.026
boston        0.732±0.013            0.747±0.011
abalone       1.383±0.004            1.361±0.003
bank          1.404±0.002            1.393±0.002
computer      0.565±0.002            0.596±0.002
california    0.940±0.001            1.008±0.001
census        1.143±0.002            1.205±0.002

(those within one standard error of the lowest one are marked in bold)

Table 4.3: Test classification cost of SVM-based ordinal ranking algorithms

data set      RED-SVM (perceptron)   SVOR-EXC (Gaussian)
pyrimidines   0.762±0.021            0.752±0.014
machine       0.572±0.013            0.661±0.012
boston        0.541±0.009            0.569±0.006
abalone       0.721±0.002            0.736±0.002
bank          0.751±0.001            0.744±0.001
computer      0.451±0.002            0.462±0.001
california    0.613±0.001            0.640±0.001
census        0.688±0.001            0.699±0.000

(those within one standard error of the lowest one are marked in bold)

4.3.1 SVM for Ordinal Ranking

We perform experiments on our proposed RED-SVM algorithms with the same eight benchmark data sets from Chu and Keerthi (2007) and the same setup as we did in Subsections 2.4.2 and 3.3.2. The $\gamma$ parameter in (4.8) is fixed to 1. Similar to the SVM-based approaches in Subsection 2.4.2, the $\kappa$ parameter is chosen within $\{2^{-17}, 2^{-15}, \ldots, 2^{3}\}$ using a 5-fold CV procedure on the training set (Hsu, Chang and Lin 2003).
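As a rough sketch of this parameter-selection step (not the actual LIBSVM-based setup used in the thesis), the grid over $\kappa$ can be searched with 5-fold cross-validation as below; $\kappa$ plays the role of the usual soft-margin parameter (the box constraint in (5.2)), and the RBF kernel merely stands in for the perceptron kernel, whose definition appears elsewhere in the thesis.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder training data; in RED-SVM these would be the reduced binary examples.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 10))
y_train = np.where(rng.standard_normal(200) > 0, 1, -1)

grid = {"C": [2.0 ** e for e in range(-17, 4, 2)]}    # kappa in {2^-17, 2^-15, ..., 2^3}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)  # 5-fold CV on the training set
search.fit(X_train, y_train)
print("selected kappa:", search.best_params_["C"])
```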

Table 4.2 compares our proposed RED-SVM algorithm (with the perceptron kernel) against the SVOR-IMC results listed by Chu and Keerthi (2007) using the mean absolute cost (and its standard error), and Table 4.3 compares our algorithm with their SVOR-EXC results using the mean classification cost. We can see that our proposed RED-SVM often performs significantly better than the SVOR algorithms in both tables.

Note, however, that Chu and Keerthi (2007) use the Gaussian kernel rather than the perceptron kernel in their experiments. For a fair comparison, we implemented their SVOR-IMC algorithm with the perceptron kernel by modifying LIBSVM (Chang and Lin 2001) and conducted experiments with the same parameter selection procedure.5 With the same perceptron kernel, we compare RED-SVM with SVOR-IMC in Table 4.4. We see that our direct reduction to the standard SVM (RED-SVM) performs similarly to SVOR-IMC. In other words, the change from the Gaussian kernel to the perceptron kernel explains most of the performance differences between the columns of Tables 4.2 and 4.3. Our RED-SVM, nevertheless, is much easier to implement.

In addition, RED-SVM is significantly faster than SVOR-IMC in training, as illustrated in Figure 4.2 using the four largest data sets.6 The main cause of the time difference is the speed-up heuristics. While, to the best of our knowledge, not much has been done to improve the original SVOR-IMC algorithm, plenty of heuristics, such as shrinking and advanced working set selection in LIBSVM, can be seamlessly adopted by RED-SVM because of the reduction framework. The difference demonstrates an important advantage of the reduction framework: any improvements to binary classification approaches can be immediately inherited by reduction-based ordinal ranking algorithms.

Tables 4.5 and 4.6 compare RED-SVM to CSOVA and CSOVO using SVM with the perceptron kernel as the underlying binary classification algorithm. We conduct experiments on both the eight data sets for ordinal ranking and the six data sets for classification. We can see a clear difference between the proposed cost-sensitive classification algorithms and the proposed ordinal ranking algorithms. For ordinal ranking data sets, RED-SVM enjoys an advantage, even in the classification cost setup. Such a result justifies our final arguments in Section 4.1: When ordinal ranking allows

5We only focus on SVOR-IMC because it is more difficult to implement SVOR-EXC with LIBSVM and to compare it fairly to RED-SVM.

6We gathered the CPU time on a 1.7 GHz dual Intel Xeon machine with 1 GB of RAM.

We now demonstrate the validity of AdaBoost.OR. We will first illustrate its behavior on an artificial data set. Then, we test its training and test performance on the eight ordinal ranking data sets.

Table 4.5: Test absolute cost of all SVM-based algorithms

data set      CSOVA          CSOVO          RED-SVM
pyrimidines   1.627±0.055    1.337±0.054    1.304±0.040
machineCPU    0.975±0.024    0.842±0.023    0.842±0.022
boston        0.946±0.017    0.789±0.015    0.732±0.013
abalone       1.674±0.007    1.422±0.006    1.383±0.004
bank          1.801±0.004    1.414±0.003    1.404±0.002
computer      0.644±0.003    0.575±0.002    0.565±0.002
california    1.121±0.002    0.951±0.002    0.940±0.001
census        1.329±0.003    1.135±0.001    1.143±0.002
vehicle       0.226±0.007    0.225±0.007    0.282±0.006
vowel         0.030±0.005    0.030±0.005    0.331±0.009
segment       0.043±0.003    0.045±0.003    0.082±0.003
dna           0.054±0.002    0.067±0.002    0.178±0.003
satimage      0.123±0.003    0.127±0.003    0.192±0.002
usps          0.077±0.002    0.089±0.002    0.294±0.003

(those within one standard error of the lowest one are marked in bold)

Table 4.6: Test classification cost of all SVM-based algorithms

data set      CSOVA (OVA)    CSOVO (OVO)    RED-SVM
pyrimidines   0.750±0.015    0.792±0.018    0.762±0.021
machine       0.608±0.012    0.612±0.012    0.572±0.013
boston        0.614±0.004    0.583±0.007    0.541±0.009
abalone       0.735±0.002    0.726±0.002    0.721±0.002
bank          0.767±0.001    0.750±0.001    0.751±0.001
computer      0.502±0.001    0.468±0.001    0.451±0.002
california    0.631±0.001    0.611±0.001    0.613±0.001
census        0.692±0.001    0.674±0.001    0.688±0.001
vehicle       0.191±0.005    0.185±0.005    0.265±0.006
vowel         0.015±0.002    0.011±0.002    0.225±0.006
segment       0.024±0.001    0.024±0.001    0.070±0.002
dna           0.040±0.002    0.043±0.002    0.168±0.002
satimage      0.071±0.002    0.072±0.002    0.161±0.001
usps          0.022±0.000    0.023±0.000    0.218±0.002

(those within one standard error of the lowest one are marked in bold)

Figure 4.3: Decision boundaries produced by AdaBoost.OR on an artificial data set; panels (a)-(d) show T = 1, 10, 100, and 1000.

We first generate 500 input vectors xn ∈ [0, 1] × [0, 1] uniformly and rank them with {1, 2, 3, 4} based on three quadratic boundaries. Then, we apply AdaBoost.OR on these examples with the absolute cost setup.
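A data set of this flavor can be generated as follows; the particular quadratic boundaries are arbitrary stand-ins, since the exact curves used in the experiment are not specified here.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(500, 2))               # 500 points in [0, 1] x [0, 1]

def boundary(X, a, b, c):
    """A (hypothetical) quadratic curve x2 = a*x1^2 + b*x1 + c."""
    return a * X[:, 0] ** 2 + b * X[:, 0] + c

y = np.ones(500, dtype=int)
for a, b, c in [(0.8, -0.2, 0.25), (0.8, -0.2, 0.50), (0.8, -0.2, 0.75)]:
    y += (X[:, 1] > boundary(X, a, b, c)).astype(int)   # rank = 1 + #{boundaries below the point}

c_abs = np.abs(np.arange(1, 5)[None, :] - y[:, None])   # absolute costs c_n[k] = |k - y_n|
```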

We use a simple base ordinal ranking algorithm called ORStump, which solves the following optimization problem efficiently with essentially the same dynamic programming technique used in RankBoost-OR (Subsection 3.2.1):
$$\min_{\theta, d, q} \ \sum_{n=1}^{N} c_n\bigl[ r(x_n, \theta, d, q) \bigr], \quad \text{subject to } \theta_1 \le \theta_2 \le \cdots \le \theta_{K-1},$$
where $r(x, \theta, d, q) \equiv \max\{\, k : q \cdot x[d] < \theta_k \,\}$.

The ordinal ranking decision stump $r(\cdot, \theta, d, q)$ is a natural extension of the binary decision stump (Holte 1993). Note that the set of all possible ordinal ranking decision stumps includes constant rankers. Therefore, ORStump can always achieve $\epsilon_t \le \frac{1}{2}$.
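The following self-contained sketch implements one reading of ORStump. Because the thresholds are ordered, the predicted ranks must be non-decreasing along the sorted feature values, so the minimization can be solved by a small dynamic program over (sorted position, rank), in the spirit of the technique mentioned above; the stump is written in the equivalent counting form $r(x) = 1 + \#\{k : q \cdot x[d] > \theta_k\}$, and all names are illustrative.

```python
import numpy as np

def train_orstump(X, C):
    """Fit an ordinal ranking decision stump minimizing sum_n c_n[r(x_n)].

    X : (N, D) inputs.   C : (N, K) cost vectors, C[n, k-1] = c_n[k].
    Returns (cost, d, q, theta) with theta_1 <= ... <= theta_{K-1}; the stump
    predicts r(x) = 1 + #{k : q * x[d] > theta_k}.  (Ties in x[d] may make the
    realized cost differ slightly from the reported optimum.)
    """
    N, D = X.shape
    K = C.shape[1]
    best = (np.inf, 0, +1, np.zeros(K - 1))
    for d in range(D):
        for q in (+1, -1):
            s = q * X[:, d]
            order = np.argsort(s, kind="stable")
            Cs, ss = C[order], s[order]
            # Ordered thresholds force non-decreasing predicted ranks along the
            # sorted scores; dp[i][k] = min cost of the first i points with the
            # i-th point's rank at most k+1.
            dp = np.zeros((N + 1, K))
            for i in range(1, N + 1):
                dp[i] = np.minimum.accumulate(dp[i - 1] + Cs[i - 1])
            cost = dp[N, K - 1]
            if cost < best[0]:
                # Backtrack one optimal rank assignment a[0..N-1].
                a, k = np.empty(N, dtype=int), K - 1
                for i in range(N, 0, -1):
                    while k > 0 and dp[i, k - 1] == dp[i, k]:
                        k -= 1
                    a[i - 1] = k + 1
                # Put each threshold between the last score of rank <= k and the
                # first score of rank > k (or outside the data range if one side is empty).
                theta = np.empty(K - 1)
                for k in range(1, K):
                    low, high = ss[a <= k], ss[a > k]
                    if low.size and high.size:
                        theta[k - 1] = (low.max() + high.min()) / 2
                    else:
                        theta[k - 1] = ss.max() + 1.0 if high.size == 0 else ss.min() - 1.0
                best = (cost, d, q, theta)
    return best

def predict_orstump(X, d, q, theta):
    return 1 + (q * X[:, [d]] > theta[None, :]).sum(axis=1)
```

Setting every threshold outside the data range recovers a constant ranker, which is why ORStump can always match the bound of Theorem 4.11.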

The decision boundaries generated by AdaBoost.OR with ORStump using T = 1, 10, 100, and 1000 are shown in Figure 4.3. The case of T = 1 is the same as applying ORStump directly on the artificial set, and we can see that its resulting decision boundary cannot capture the full characteristics of the data. As T gets larger, however, AdaBoost.OR is able to boost ORStump to form more sophisticated boundaries that approximate the underlying quadratic curves better.

Next, we run AdaBoost.OR on the eight benchmark data sets. We couple AdaBoost.OR with two base ordinal ranking algorithms: ORStump and PRank (Crammer and Singer 2005). For PRank, we adopt the SiPrank variant and make it cost-sensitive by presenting random examples $(x_n, y_n)$ with probability proportional to $\max_{1 \le k \le K} c_n[k]$. In addition, we apply the pocket technique with ratchet (Gallant 1990) for 2000 epochs to get a decent training cost minimizer.
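The cost-sensitive sampling step can be sketched as follows; this is our reading of "presenting random examples with probability proportional to $\max_k c_n[k]$", and the PRank update itself is omitted.

```python
import numpy as np

def sample_indices(C, num, rng):
    """Draw example indices with probability proportional to max_k c_n[k]."""
    p = C.max(axis=1)
    return rng.choice(len(C), size=num, p=p / p.sum())

# Example: feed the sampled stream of (x_n, y_n) to a regular PRank learner.
rng = np.random.default_rng(0)
C = np.abs(np.arange(1, 6)[None, :] - np.array([1, 3, 5])[:, None]).astype(float)
print(sample_indices(C, num=10, rng=rng))
```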

We run AdaBoost.OR for T = 1000 iterations for ORStump, and for 100 iterations for PRank. Such a setup is intended to compensate for the computational complexity of each individual base ordinal ranking algorithm. Nevertheless, a more sophisticated choice of T should further improve the performance of AdaBoost.OR.

For each algorithm, the mean training absolute cost as well as its standard error is reported in Table 4.7; the mean test absolute cost and its standard error is reported in Table 4.8. For each pair of single and AdaBoost.OR entries, we mark the one with the lowest cost in bold.

From the tables, we see that AdaBoost.OR almost always boosts both the training and test performance of the base ordinal ranking algorithm significantly. Note, however, that it is harder for AdaBoost.OR to improve the performance of PRank, because PRank sometimes cannot produce a good $r_t$ in terms of minimizing the training cost.

Table 4.7: Training absolute cost of base and AdaBoost.OR algorithms

              ORStump                        PRank
data set      single         AdaBoost.OR    single         AdaBoost.OR
pyrimidines   1.757±0.017    0.024±0.007    0.457±0.029    0.268±0.048
machine       1.118±0.015    0.122±0.009    0.880±0.011    0.864±0.010
boston        1.049±0.010    0.000±0.000    0.845±0.009    0.831±0.008
abalone       1.528±0.008    1.048±0.008    1.439±0.010    1.437±0.010
bank          1.975±0.005    1.141±0.004    1.514±0.003    1.467±0.003
computer      1.178±0.003    0.499±0.002    0.659±0.003    0.658±0.002
california    1.615±0.004    0.883±0.004    1.205±0.004    1.205±0.004
census        1.826±0.002    1.113±0.004    1.582±0.008    1.562±0.006

(the lowest one among the two using the same base algorithm is marked in bold)

Table 4.8: Test absolute cost of base and AdaBoost.OR algorithms

              ORStump                        PRank
data set      single         AdaBoost.OR    single         AdaBoost.OR
pyrimidines   1.913±0.087    1.244±0.051    1.569±0.070    1.417±0.066
machine       1.286±0.040    0.842±0.020    0.969±0.012    0.932±0.022
boston        1.172±0.013    0.887±0.014    0.906±0.012    0.892±0.011
abalone       1.592±0.003    1.475±0.004    1.477±0.009    1.475±0.008
bank          2.000±0.003    1.530±0.003    1.540±0.004    1.502±0.004
computer      1.200±0.003    0.627±0.002    0.661±0.003    0.660±0.003
california    1.636±0.001    0.995±0.003    1.206±0.002    1.206±0.003
census        1.851±0.001    1.253±0.002    1.598±0.005    1.578±0.003

(the lowest one among the two using the same base algorithm is marked in bold)

Chapter 5

Studies on Binary Classification

In Chapter 4, we proved that ordinal ranking is PAC-learnable if and only if binary classification is PAC-learnable. In other words, under the PAC setup, if we want to have a good learning algorithm (and learning model) for ordinal ranking, it is necessary and sufficient to design a good learning algorithm for binary classification.

In this chapter, we discuss two projects that aim at understanding and improving binary classification in the context of ensemble learning. The first one identifies some restrictions of AdaBoost (Algorithm 4.2) and resolves them with the help of SVM.

The second one, on the other hand, focuses on a particular advantage of AdaBoost, and uses that advantage to improve other learning algorithms. The findings in these projects reveal the relative strengths and weaknesses of AdaBoost and SVM, two of the most important binary classification algorithms.

5.1 SVM for Infinite Ensemble Learning

Recall that we proposed the threshold ensemble model for ordinal ranking in Chapter 3. The model originates from the ensemble model for binary classification (Meir and Rätsch 2003), which is accompanied by many successful algorithms such as bagging (Breiman 1996) and AdaBoost (Freund and Schapire 1997). The algorithms construct a classifier that averages over some base hypotheses in a set H. While the size of H can be infinite, most existing algorithms use only a finite subset of H, and the classifier is effectively a finite ensemble of hypotheses. Some theories show that the finiteness places a restriction on the capacity of the ensemble (Freund and Schapire 1997), and some theories suggest that the performance of AdaBoost can be linked to its asymptotic behavior when the ensemble is allowed to be of an infinite size (Rätsch, Onoda and Müller 2001). Thus, it is possible that an infinite ensemble is superior for learning. Nevertheless, the possibility has not been fully explored because constructing such an ensemble is a challenging task (Vapnik 1998).

Next, we discuss how we conquer the task of infinite ensemble learning and demonstrate that better performance can be achieved by going from finite ensembles to infinite ones. In particular, we formulate a framework for infinite ensemble learning using SVM (Lin 2005; Lin and Li 2008). The key to the framework is to embed an infinite number of hypotheses into an SVM kernel. Such a framework can be applied both to construct new kernels for SVM and to interpret some existing ones (Lin 2005; Lin and Li 2008). Furthermore, the framework allows us to compare SVM and AdaBoost in a fair manner using the same base hypothesis set. Experimental results show that SVM with these kernels is superior to AdaBoost with the same base hypothesis set, and they help us understand both SVM and AdaBoost better.

Here $\kappa > 0$ is the regularization parameter, and $\phi$ is a feature mapping from $\mathcal{X}$ to a Hilbert space $\mathcal{F}$ (Schölkopf and Smola 2002). Because $\mathcal{F}$ can be of an infinite number of dimensions, SVM solvers usually work on the dual problem:

$$\min_{\lambda_n} \ \frac{1}{2} \sum_{m=1}^{N} \sum_{n=1}^{N} \lambda_m \lambda_n y_m y_n \cdot K(x_m, x_n) \;-\; \sum_{n=1}^{N} \lambda_n, \tag{5.2}$$
$$\text{subject to } 0 \le \lambda_n \le \kappa, \ \text{for } n = 1, 2, \ldots, N, \qquad \sum_{n=1}^{N} y_n \lambda_n = 0.$$

Here $K$ is the kernel function defined by $K(x, x') \equiv \langle \phi(x), \phi(x') \rangle$. Then, the optimal classifier becomes
$$\hat g(x) = \operatorname{sign}\!\left( \sum_{n=1}^{N} y_n \lambda_n K(x_n, x) + b \right), \tag{5.3}$$
where $b$ can be computed through the primal-dual relationship (Schölkopf and Smola 2002; Vapnik 1998).
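A minimal numpy sketch of evaluating (5.3), assuming the dual solution $\{\lambda_n\}$ and the intercept $b$ have already been obtained from a solver; the Gaussian kernel here is only a placeholder for $K$.

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=1.0):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def svm_predict(x, X_train, y_train, lam, b, kernel=gaussian_kernel):
    """Evaluate g(x) = sign( sum_n y_n lambda_n K(x_n, x) + b ), as in (5.3)."""
    score = sum(y_n * l_n * kernel(x_n, x)
                for x_n, y_n, l_n in zip(X_train, y_train, lam)) + b
    return np.sign(score)
```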

It is known that SVM is connected to AdaBoost (Demiriz, Bennett and Shawe-Taylor 2002; Freund and Schapire 1999; Rätsch et al. 2002; Rätsch, Onoda and Müller 2001). Recall that AdaBoost (Algorithm 4.2) iteratively selects $T$ hypotheses $h_t \in H$ and weights $v_t \ge 0$ to construct an ensemble classifier $H_T(x) = \operatorname{sign}\bigl( \sum_{t=1}^{T} v_t h_t(x) \bigr)$. Under some assumptions (Rätsch, Onoda and Müller 2001), it is shown that when $T \to \infty$, AdaBoost asymptotically approximates an infinite ensemble classifier $H_\infty(x)$ such that $\{(v_t, h_t)\}_{t=1}^{\infty}$ is an optimal solution to

$$\min_{v_t, h_t} \ \sum_{t=1}^{\infty} v_t, \tag{5.4}$$
$$\text{subject to } y_n \left( \sum_{t=1}^{\infty} v_t h_t(x_n) \right) \ge 1, \ \text{for } n = 1, 2, \ldots, N, \qquad v_t \ge 0, \ \text{for } t = 1, 2, \ldots, \infty.$$

Comparing (5.4) with (5.1) plus the feature mapping
$$\phi(x) = \bigl( h_1(x), h_2(x), \ldots \bigr), \tag{5.5}$$

we see that the elements of $\phi(x)$ in SVM are similar to the hypotheses $h_t(x)$ in AdaBoost. They both work on linear combinations of these elements, though SVM deals with an additional intercept term $b$. SVM minimizes the $\ell_2$-norm of the weights while AdaBoost works on the $\ell_1$-norm. SVM introduces slack variables $\xi_n$ and uses the parameter $\kappa$ for regularization, while AdaBoost relies on the choice of the parameter $T$ (Rosset, Zhu and Hastie 2004). Note that AdaBoost requires $v_t \ge 0$ for ensemble learning.

Let us take a deeper look at (5.4), which contains infinitely many variables. In order to approximate the optimal solution well with a fixed and finite T , AdaBoost resorts to two related properties of some of the optimal solutions for (5.4): finiteness and sparsity.

Finiteness: When two hypotheses share the same prediction patterns on the training input vectors, they can be used interchangeably during training and are thus ambiguous. Since there are at most $2^N$ prediction patterns on $N$ training input vectors, we can partition H into at most $2^N$ groups, each of which contains mutually ambiguous hypotheses. Some optimal solutions of (5.4) only assign one or a few nonzero weights within each group (Demiriz, Bennett and Shawe-Taylor 2002). Thus, it is possible to work on a finite data-dependent subset of H instead of H itself without losing optimality.

Sparsity: Minimizing the $\ell_1$-norm $\|v\|_1 = \sum_{t=1}^{\infty} |v_t|$ often leads to sparse solutions (Meir and Rätsch 2003; Rosset et al. 2007). That is, for hypotheses in the finite (but possibly still large) subset of H, only a small number of weights needs to be nonzero. AdaBoost can be viewed as a greedy search algorithm that approximates such a finite and sparse ensemble (Rosset, Zhu and Hastie 2004).

Although there exist some good algorithms that can return an optimal solution of (5.4) when H is infinitely large (Rätsch, Demiriz and Bennett 2002; Rosset et al. 2007), the resulting ensemble relies on the sparsity property and is effectively of only finite size. Thus, it is possible that the learning performance could be further improved if either or both of the finiteness and sparsity restrictions are removed. This possibility motivates us to study the task of infinite ensemble learning, as discussed next.

When we use $K_H$ in (5.2), the classifier obtained is equivalent to
$$\hat g(x) = \operatorname{sign}\!\left( \int_{W} v(\alpha)\, \mu(\alpha)\, h_\alpha(x)\, d\alpha + b \right). \tag{5.7}$$

Nevertheless, $\hat g$ is not an ensemble classifier yet, because we do not have the constraints $v(\alpha) \ge 0$, and we have an additional term $b$. Next, we explain that such a classifier is equivalent to an ensemble classifier under some reasonable assumptions.

We start from the constraints v(α) ≥ 0, which cannot be directly considered in (5.1). Vapnik (1998) showed that even if we add a countably infinite number of constraints to (5.1), infinitely many variables and constraints would be introduced to (5.2). Then, the latter problem would still be difficult to solve.

One remedy is to assume that H is negation complete, that is,1
$$h \in H \ \Longleftrightarrow \ (-h) \in H.$$
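With this assumption, a negative ensemble weight can simply be absorbed into the negated hypothesis, since for any $v_t < 0$,
$$v_t\, h_t(x) = |v_t| \bigl( -h_t(x) \bigr), \qquad \text{with } (-h_t) \in H \text{ by negation completeness}.$$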

Then, every linear combination over H can be easily transformed to an equivalent linear combination with only nonnegative weights. Negation completeness is usually a mild assumption for a reasonable H (Rätsch et al. 2002). Following this assumption, the classifier (5.7) can be interpreted as an ensemble classifier over H with an intercept term $b$. Now $b$ can be viewed as the weight on a constant hypothesis $h_c$, which always predicts $h_c(x) = 1$ for all $x \in \mathcal{X}$. We shall further add a mild assumption that H contains both $h_c$ and $(-h_c)$. Then, the classifier (5.7) or (5.3) is indeed equivalent to an ensemble classifier, and we get the following framework for infinite ensemble learning.