
SVM-based framework for infinite ensemble learning

use $K_{\mathcal{H}}$ in (5.2), the classifier obtained is equivalent to

$$\hat{g}(x) = \operatorname{sign}\left(\int_{\mathcal{W}} v(\alpha)\,\mu(\alpha)\,h_\alpha(x)\,d\alpha + b\right). \qquad (5.7)$$

Nevertheless, $\hat{g}$ is not an ensemble classifier yet, because we do not have the constraints $v(\alpha) \ge 0$, and we have an additional term $b$. Next, we explain that such a classifier is equivalent to an ensemble classifier under some reasonable assumptions.

We start from the constraints v(α) ≥ 0, which cannot be directly considered in (5.1). Vapnik (1998) showed that even if we add a countably infinite number of constraints to (5.1), infinitely many variables and constraints would be introduced to (5.2). Then, the latter problem would still be difficult to solve.

One remedy is to assume that $\mathcal{H}$ is negation complete, that is,¹

$$h \in \mathcal{H} \iff (-h) \in \mathcal{H}.$$

Then, every linear combination over $\mathcal{H}$ can be easily transformed to an equivalent linear combination with only nonnegative weights. Negation completeness is usually a mild assumption for a reasonable $\mathcal{H}$ (Rätsch et al. 2002). Under this assumption, the classifier (5.7) can be interpreted as an ensemble classifier over $\mathcal{H}$ with an intercept term $b$. Now $b$ can be viewed as the weight on a constant hypothesis $h_c$, which predicts $h_c(x) = 1$ for all $x \in \mathcal{X}$. We further add the mild assumption that $\mathcal{H}$ contains both $h_c$ and $(-h_c)$. Then, the classifier (5.7) or (5.3) is indeed equivalent to an ensemble classifier, as sketched below, and we obtain the following framework (Algorithm 5.1) for infinite ensemble learning.
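To make the transformation explicit, here is a short sketch (not verbatim from the original, and assuming $\mu(\alpha) \ge 0$ and $\operatorname{sign}(0) = 1$):

$$\int_{\mathcal{W}} v(\alpha)\,\mu(\alpha)\,h_\alpha(x)\,d\alpha + b = \int_{\mathcal{W}} \lvert v(\alpha)\rvert\,\mu(\alpha)\,\tilde{h}_\alpha(x)\,d\alpha + \lvert b\rvert\,\tilde{h}_c(x),$$

where $\tilde{h}_\alpha = \operatorname{sign}(v(\alpha))\,h_\alpha$ and $\tilde{h}_c = \operatorname{sign}(b)\,h_c$ belong to $\mathcal{H}$ by negation completeness and by the assumption on $h_c$, so all the combination weights are nonnegative.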

Algorithm 5.1 (SVM-based framework for infinite ensemble learning).

1. Consider a training set $\{(x_n, y_n)\}_{n=1}^{N}$ and a hypothesis set $\mathcal{H}$, which is assumed to be negation complete and to contain a constant hypothesis.

2. Construct a kernel $K_{\mathcal{H}}$ according to Definition 5.1 with a proper embedding function $\mu$.

3. Choose proper parameters, such as the soft-margin parameter $\kappa$.

4. Solve (5.2) with $K_{\mathcal{H}}$ and obtain the Lagrange multipliers $\lambda_n$ and the intercept term $b$.

5. Output the classifier

$$\hat{g}(x) = \operatorname{sign}\left(\sum_{n=1}^{N} y_n \lambda_n K_{\mathcal{H}}(x_n, x) + b\right),$$

which is equivalent to some ensemble classifier over $\mathcal{H}$.
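For concreteness, the following is a minimal sketch of Algorithm 5.1 in Python, assuming scikit-learn's SVC stands in for the soft-margin SVM solver of (5.2) and `kernel_H` is a user-supplied Gram-matrix function playing the role of $K_{\mathcal{H}}$; the function names are illustrative and not part of the original framework.

```python
import numpy as np
from sklearn.svm import SVC

def infinite_ensemble_classifier(X_train, y_train, X_test, kernel_H, C=1.0):
    """Sketch of Algorithm 5.1 with a generic kernel K_H.

    kernel_H(A, B) should return the Gram matrix [K_H(a, b)] for rows a of A
    and rows b of B; C plays the role of the soft-margin parameter kappa.
    """
    clf = SVC(kernel=kernel_H, C=C)        # steps 2-3: kernel and parameters
    clf.fit(X_train, y_train)              # step 4: Lagrange multipliers and b
    # step 5: predictions of g(x) = sign(sum_n y_n * lambda_n * K_H(x_n, x) + b);
    # SVC stores y_n * lambda_n of the support vectors in clf.dual_coef_.
    return clf.predict(X_test)
```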

The framework should generally inherit the strong performance of SVM. Most of the steps in the framework can be carried out by existing SVM implementations; the hard part is mostly in obtaining the kernel $K_{\mathcal{H}}$. We have derived several kernels for the framework (Lin 2005; Lin and Li 2008). Next, we introduce two important ones: the stump kernel and the perceptron kernel.

Stump kernel: The stump kernel embodies infinitely many decision stumps of the form

$$s_{q,d,\alpha}(x) = q \cdot \operatorname{sign}\bigl(x[d] - \alpha\bigr).$$

The decision stump $s_{q,d,\alpha}$ works on the $d$-th element of $x$ and classifies $x$ according to the direction $q \in \{-1, +1\}$ and the threshold $\alpha$ (Holte 1993). It is widely used for ensemble learning because of its simplicity (Freund and Schapire 1996).
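As a tiny illustration (hypothetical code, not from the original), a single decision stump can be evaluated as follows:

```python
import numpy as np

def decision_stump(x, q, d, alpha):
    """s_{q,d,alpha}(x) = q * sign(x[d] - alpha), taking sign(0) = +1.
    Here d is a 0-based feature index (the text uses 1-based d)."""
    return q * (1 if x[d] - alpha >= 0 else -1)

x = np.array([0.1, 0.9, 0.3])
print(decision_stump(x, q=-1, d=2, alpha=0.5))   # -1 * sign(0.3 - 0.5) = +1
```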

To construct the stump kernel, we consider the following set of decision stumps

$$\mathcal{S} = \Bigl\{ s_{q,d,\alpha_d} : q \in \{-1, +1\},\ d \in \{1, \ldots, D\},\ \alpha_d \in [L_d, R_d] \Bigr\}.$$

We also assume $\mathcal{X} \subseteq (L_1, R_1) \times (L_2, R_2) \times \cdots \times (L_D, R_D)$. Thus, the set $\mathcal{S}$ is negation complete and contains $s_{+1,1,L_1}$ as a constant hypothesis. The stump kernel $K_{\mathcal{S}}$ defined below can then be used in Algorithm 5.1 to obtain an infinite ensemble of decision stumps.

Definition 5.2 (Lin and Li 2008). The stump kernel is $K_{\mathcal{S}}$ using Definition 5.1 and $\mu(q, d, \alpha_d) = \mu_{\mathcal{S}} = \tfrac{1}{2}$. In particular,

$$K_{\mathcal{S}}(x, x') = \Delta_{\mathcal{S}} - \|x - x'\|_1,$$

where $\Delta_{\mathcal{S}} = \tfrac{1}{2}\sum_{d=1}^{D}(R_d - L_d)$ is a constant.

Definition 5.2 is a concrete instance that follows Definition 5.1. Because scaling $\mu_{\mathcal{S}}$ is equivalent to scaling the parameter $\kappa$ in SVM (Lin and Li 2008), we use $\mu_{\mathcal{S}} = \tfrac{1}{2}$ to obtain a cosmetically cleaner kernel function.

Given the ranges $(L_d, R_d)$, the stump kernel is very simple to compute. Furthermore, the ranges are not even necessary in general, because dropping the constant $\Delta_{\mathcal{S}}$ does not affect the classifier obtained from SVM (Lin and Li 2008). That is, in (5.2), the simplified stump kernel $\tilde{K}_{\mathcal{S}}(x, x') = -\|x - x'\|_1$ can be used instead of $K_{\mathcal{S}}$ without changing the resulting classifier $\hat{g}$. The simplified stump kernel is simple to compute, yet useful in the sense of dichotomizing the training set, that is, fitting the training set perfectly.
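As an illustration (not code from the original work), the stump kernel and its simplified form can be computed from pairwise $\ell_1$ distances and plugged into the framework sketch above:

```python
import numpy as np

def stump_kernel(X1, X2, ranges=None):
    """K_S(x, x') = Delta_S - ||x - x'||_1.

    With ranges=None this is the simplified kernel K~_S(x, x') = -||x - x'||_1,
    which yields the same SVM classifier; otherwise ranges is a list of
    (L_d, R_d) pairs and Delta_S = 0.5 * sum_d (R_d - L_d).
    """
    l1 = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=2)  # pairwise L1 distances
    delta = 0.5 * sum(R - L for (L, R) in ranges) if ranges else 0.0
    return delta - l1

# e.g., SVC(kernel=stump_kernel, C=10.0) in the framework sketch above.
```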

Theorem 5.3 (Lin and Li 2008). Consider training input vectors $\{x_n\}_{n=1}^{N}$. If there is a dimension $d$ such that $x_m[d] \neq x_n[d]$ for all $m \neq n$, then there exists some $\kappa^* > 0$ such that for all $\kappa \geq \kappa^*$, SVM with $K_{\mathcal{S}}$ (or $\tilde{K}_{\mathcal{S}}$) can always dichotomize the training set $\{(x_n, y_n)\}_{n=1}^{N}$.

We shall make a remark here. Although Theorem 5.3 indicates how the stump kernel can be used with SVM to dichotomize the training set perfectly, the classifier obtained may suffer from overfitting (Keerthi and Lin 2003). Thus, SVM is usually coupled with a reasonable parameter selection procedure to achieve good test performance (Hsu, Chang and Lin 2003; Keerthi and Lin 2003).

Perceptron kernel: The perceptron kernel embodies infinitely many perceptrons, which are linear threshold classifiers of the form

$$p_{u,\alpha}(x) = \operatorname{sign}\bigl(\langle u, x \rangle - \alpha\bigr).$$

It is a basic theoretical model for a neuron and is important for building neural networks (Haykin 1999).

To construct the perceptron kernel, we consider the following set of perceptrons

$$\mathcal{P} = \Bigl\{ p_{u,\alpha} : u \in \mathbb{R}^D,\ \|u\|_2 = 1,\ \alpha \in [-R, R] \Bigr\}.$$

We assume that $\mathcal{X}$ is within the interior of a ball of radius $R$ centered at the origin in $\mathbb{R}^D$. Then, the set $\mathcal{P}$ is negation complete and contains a constant hypothesis ($u = e_1$, the first standard basis vector, and $\alpha = -R$). Thus, the perceptron kernel $K_{\mathcal{P}}$ defined below can be used in Algorithm 5.1 to obtain an infinite ensemble of perceptrons.

Definition 5.4 (Lin and Li 2008). Let

$$\Theta_D = \int_{\|u\|_2 = 1} du, \qquad \Xi_D = \int_{\|u\|_2 = 1} \bigl|\cos\bigl(\operatorname{angle}(u, e_1)\bigr)\bigr|\, du,$$

where the operator $\operatorname{angle}(\cdot, \cdot)$ is the angle between two vectors, and the integrals are calculated with the uniform measure on the surface $\|u\|_2 = 1$. The perceptron kernel is $K_{\mathcal{P}}$ with $\mu(u, \alpha) = \mu_{\mathcal{P}}$. In particular,

$$K_{\mathcal{P}}(x, x') = \Delta_{\mathcal{P}} - \|x - x'\|_2,$$

where the constants $\mu_{\mathcal{P}} = (2\,\Xi_D)^{-1/2}$ and $\Delta_{\mathcal{P}} = \Theta_D\, \Xi_D^{-1} R$.
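A matching sketch for the perceptron kernel (illustrative code, not from the original); only the simplified form $-\|x - x'\|_2$ is computed here, since, as noted below for $\tilde{K}_{\mathcal{P}}$, dropping the constant $\Delta_{\mathcal{P}}$ does not change the SVM classifier:

```python
import numpy as np

def perceptron_kernel(X1, X2):
    """Simplified perceptron kernel K~_P(x, x') = -||x - x'||_2.

    The constant Delta_P = Theta_D * Xi_D^{-1} * R is omitted; like Delta_S for
    the stump kernel, it can be dropped without changing the resulting classifier.
    """
    diff = X1[:, None, :] - X2[None, :, :]
    return -np.sqrt((diff ** 2).sum(axis=2))   # negative pairwise L2 distances
```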

With the perceptron kernel, we can construct an infinite ensemble of perceptrons.

Such an ensemble is equivalent to a neural network with one hidden layer, infinitely many hidden neurons, and hard-threshold activation functions. Williams (1998) built an infinite neural network with either the sigmoid or the Gaussian activation function through computing the corresponding covariance function for Gaussian process models. Analogously, our approach returns an infinite neural network with hard-threshold activation functions (an ensemble of perceptrons) through computing the perceptron kernel for SVM. Williams (1998) stated that "Paradoxically, it may be easier to carry out Bayesian prediction with infinite networks rather than finite ones." Similar claims can be made with ensemble learning.

The perceptron kernel shares many properties with the stump kernel. First, the constant $\Delta_{\mathcal{P}}$ can also be dropped. That is, we can use the simplified perceptron kernel $\tilde{K}_{\mathcal{P}}(x, x') = -\|x - x'\|_2$ instead of $K_{\mathcal{P}}$. Second, SVM with the perceptron kernel can also dichotomize the training set perfectly, as formalized below.

Theorem 5.5 (Lin and Li 2008). For the training set $\{(x_n, y_n)\}_{n=1}^{N}$, if $x_m \neq x_n$ for all $m \neq n$, there exists some $\kappa^* > 0$ such that for all $\kappa \geq \kappa^*$, SVM with $K_{\mathcal{P}}$ (or $\tilde{K}_{\mathcal{P}}$) can always dichotomize the training set.

For AdaBoost, we conduct a similar parameter selection procedure and search T within {10, 20, . . . , 1500}.

The three artificial data sets from Breiman (1999) (twonorm, threenorm, and ringnorm) are generated with training set size 300 and test set size 3000. We create three more data sets (twonorm-n, threenorm-n, ringnorm-n), which contain mislabeling noise on 10% of the training examples, to test the performance of the algorithms in noisy environments. We also use eight real-world data sets from the UCI repository (Hettich, Blake and Merz 1998): australian, breast, german, heart, ionosphere, pima, sonar, and votes84. Their feature elements are scaled to [−1, 1]. We randomly pick 60% of the examples for training, and the rest for testing. For the data sets above, we compute the means and the standard errors of the results over 100 different random runs. In addition, four larger real-world data sets are used to test the validity of the framework for large-scale learning. They are a1a (Hettich, Blake and Merz 1998; Platt 1998), splice (Hettich, Blake and Merz 1998), svmguide1 (Hsu, Chang and Lin 2003), and w1a (Platt 1998).² Each of them comes with a benchmark test set, on which we report the results. The information of the data sets used is summarized in Table 5.1.

Tables 5.2 and 5.3 show the test performance of the ensemble learning algorithms with different base hypothesis sets. We can see that SVM-Stump and SVM-Perc are usually better than AdaBoost with the corresponding base hypothesis set. On noisy data sets, the SVM-based framework for infinite ensemble learning always significantly outperforms AdaBoost. These results demonstrate that it is beneficial to go from a finite ensemble to an infinite one with suitable regularization.

Note that AdaBoost and our SVM-based framework differ in the concept of sparsity. As illustrated in Subsection 5.1.1, AdaBoost prefers sparse ensemble classifiers, that is, ensembles that include a small number of hypotheses. Our framework works with an infinite number of hypotheses, but results in a sparse classifier in the support vector domain. Both concepts can be justified with various bounds on the expected test cost (Freund and Schapire 1997; Graepel, Herbrich and Shawe-Taylor 2005).

²These data sets are downloadable from the tools page of LIBSVM (Chang and Lin 2001).

Table 5.1: Binary classification data sets

data set # training examples # test examples # features (D)

twonorm 300 3000 20

twonorm-n 300 3000 20

threenorm 300 3000 20

threenorm-n 300 3000 20

ringnorm 300 3000 20

ringnorm-n 300 3000 20

australian 414 276 14

breast 409 274 10

german 600 400 24

heart 162 108 13

ionosphere 210 141 34

pima 460 308 8

sonar 124 84 60

votes84 261 174 16

a1a 1605 30956 123

splice 1000 2175 60

svmguide1 3089 4000 4

w1a 2477 47272 300


Nevertheless, our experimental results indicate that sparse ensemble classifiers are sometimes not sophisticated enough in practice, especially when the base hypothesis set is simple. For instance, when using decision stumps, a general data set may require many of them to describe a suitable decision boundary. Thus, AdaBoost-Stump could be limited by the finiteness and sparsity restrictions (Lin and Li 2008). On the other hand, our framework (SVM-Stump), which suffers from neither restriction, can perform better by averaging over an infinite number of hypotheses.

In our earlier work (Lin and Li 2008), we observed another advantage of the perceptron kernel (SVM-Perc). In particular, the perceptron kernel and the popular Gaussian kernel share almost indistinguishable performance in the experiments, but the former enjoys the benefit of faster parameter selection. For instance, determining good parameters for the Gaussian kernel involves solving 550 optimization problems, whereas SVM-Perc deals with only 55. With the indistinguishable performance, SVM-Perc should be a preferable choice in practice.
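To illustrate the difference in parameter-selection cost, here is a sketch under assumed grid sizes of 55 soft-margin values and 10 Gaussian widths, chosen only to match the 550-versus-55 ratio above; the actual grids used in the experiments may differ.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def perceptron_kernel(X1, X2):
    # simplified perceptron kernel, as in the earlier sketch
    return -np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2))

C_grid = np.logspace(-5, 5, 55)        # 55 candidate soft-margin values (assumed)
gamma_grid = np.logspace(-5, 4, 10)    # 10 candidate Gaussian widths (assumed)

# Gaussian kernel: a 2-D grid, 55 * 10 = 550 SVM trainings per cross-validation fold.
gaussian_search = GridSearchCV(SVC(kernel="rbf"),
                               {"C": C_grid, "gamma": gamma_grid}, cv=5)

# Perceptron kernel: only the soft-margin parameter, 55 trainings per fold.
perceptron_search = GridSearchCV(SVC(kernel=perceptron_kernel),
                                 {"C": C_grid}, cv=5)
```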

Table 5.2: Test classification cost (%) of SVM-Stump and AdaBoost-Stump

data set      SVM-Stump       AdaBoost-Stump
twonorm       2.858±0.038     5.022±0.062
twonorm-n     3.076±0.055     12.748±0.165
threenorm     17.745±0.100    22.096±0.117
threenorm-n   19.047±0.144    26.136±0.167
ringnorm      3.966±0.067     10.082±0.140
ringnorm-n    5.558±0.110     19.620±0.200
australian    14.446±0.205    14.232±0.179
breast        3.113±0.080     4.409±0.103
german        24.695±0.183    25.363±0.193
heart         16.352±0.274    19.222±0.349
ionosphere    8.128±0.173     11.340±0.252
pima          24.149±0.226    24.802±0.225
sonar         16.595±0.420    19.441±0.383
votes84       4.759±0.139     4.270±0.152
a1a           16.194          15.984
splice        6.207           5.747
svmguide1     2.925           3.350
w1a           2.090           2.177

(for the last 4 rows, the best results are marked in bold; for the other rows, those within one standard error of the lowest one are marked in bold)


Both advantages of SVM-Perc above were inherited by the REDSVM (and SVOR-IMC) algorithm for ordinal ranking via the reduction framework (Algorithm 4.1).

First, we list the results of ORBoost-All with perceptrons (Table 3.1) and REDSVM with the perceptron kernel (Table 4.2) in Table 5.4. Both algorithms return a threshold ensemble of perceptrons. ORBoost-All is rooted in AdaBoost, while REDSVM is rooted in SVM. In the table, we see that REDSVM with the perceptron kernel is usually better than ORBoost-All with perceptrons, just as SVM-Perc is usually better than AdaBoost-Perc.

Second, when using the perceptron kernel, SVOR-IMC (and REDSVM) also enjoys the benefit of faster parameter selection. In addition, in Tables 4.2 and 4.4, we see that SVOR-IMC performs decently with both the perceptron and the Gaussian kernels (actually, better with the perceptron kernel). Such a result makes the perceptron kernel a preferable choice for ordinal ranking.

Table 5.3: Test classification cost (%) of SVM-Perc and AdaBoost-Perc

data set      SVM-Perc        AdaBoost-Perc
twonorm       2.548±0.033     3.114±0.041
twonorm-n     2.755±0.052     4.529±0.101
threenorm     14.643±0.084    17.322±0.113
threenorm-n   16.299±0.103    20.018±0.182
ringnorm      2.464±0.038     36.278±0.141
ringnorm-n    3.505±0.086     37.812±0.196
australian    14.482±0.170    15.656±0.159
breast        3.230±0.080     3.493±0.101
german        24.593±0.196    25.027±0.184
heart         17.556±0.307    18.222±0.324
ionosphere    6.404±0.198     11.425±0.234
pima          23.545±0.212    24.825±0.197
sonar         15.619±0.401    19.774±0.427
votes84       4.425±0.138     4.374±0.164
a1a           15.690          19.986
splice        10.391          13.655
svmguide1     3.100           3.275
w1a           1.915           2.348

(for the last 4 rows, the best results are marked in bold; for the other rows, those within one standard error of the lowest one are marked in bold)

Table 5.4: Test absolute cost of algorithms for threshold perceptron ensembles

data set      RED-SVM (perceptron)   ORBoost-All (perceptron)
pyrimdines    1.304±0.040            1.360±0.046
machine       0.842±0.022            0.889±0.019
boston        0.732±0.013            0.791±0.013
abalone       1.383±0.004            1.432±0.003
bank          1.404±0.002            1.490±0.002
computer      0.565±0.002            0.626±0.002
california    0.940±0.001            0.977±0.002
census        1.143±0.002            1.265±0.002

(those within one standard error of the lowest one are marked in bold)

These advantages clearly demonstrate how we can improve both binary classification and ordinal ranking simultaneously with the reduction framework.

5.2.1 Algorithm

Before we introduce the SeedBoost algorithm, we take a closer look at AdaBoost (Algorithm 4.2). Following the gradient descent view of Mason et al. (2000), in the 1-st iteration, AdaBoost greedily chooses $(h_1, v_1)$ to approximately minimize

$$\sum_{n=1}^{N} w_n \exp\bigl(-y_n v_1 h_1(x_n)\bigr). \qquad (5.8)$$

Then, in the $(t+1)$-th iteration, AdaBoost chooses $(h_{t+1}, v_{t+1})$ to approximately minimize

$$\sum_{n=1}^{N} w_n \exp\Bigl(-y_n \bigl(H_t(x_n) + v_{t+1} h_{t+1}(x_n)\bigr)\Bigr) = \sum_{n=1}^{N} w_n^{(t)} \exp\bigl(-y_n v_{t+1} h_{t+1}(x_n)\bigr), \qquad (5.9)$$

where $w_n^{(t)} = w_n \exp\bigl(-y_n H_t(x_n)\bigr)$.

Comparing (5.8) and (5.9), we see that AdaBoost at the $(t+1)$-th iteration using the original training set $\{(x_n, y_n, w_n)\}$ is equivalent to AdaBoost at the 1-st iteration using a modified training set $\{(x_n, y_n, w_n^{(t)})\}$. Therefore, using (5.8) as a basic step, AdaBoost with $(t + T)$ iterations can be recursively defined as follows.

Algorithm 5.2 (A recursive view of AdaBoost with $(t + T)$ iterations).

1. Run AdaBoost on $\{(x_n, y_n, w_n)\}_{n=1}^{N}$ for $t$ steps and get an ensemble classifier $g^{(1)}(x) = \operatorname{sign}\bigl(H_t^{(1)}(x)\bigr)$.

2. Run AdaBoost on $\{(x_n, y_n, w_n^{(t)})\}_{n=1}^{N}$ for $T$ steps and get an ensemble classifier $g^{(2)}(x) = \operatorname{sign}\bigl(H_T^{(2)}(x)\bigr)$.

3. Return $g_{t+T}(x) = \operatorname{sign}\bigl(H_t^{(1)}(x) + H_T^{(2)}(x)\bigr)$.
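A sketch of this recursive view in Python (illustrative; `adaboost` is an assumed routine that trains from example weights and returns the real-valued ensemble $H(\cdot)$), with the modified weights $w_n^{(t)} = w_n \exp(-y_n H_t(x_n))$ taken from (5.9):

```python
import numpy as np

def recursive_adaboost(adaboost, X, y, w, t, T):
    """Sketch of Algorithm 5.2: AdaBoost with (t + T) iterations in two stages.

    adaboost(X, y, w, steps) is assumed to return a callable H with
    H(X) = real-valued sum of the weighted base hypotheses.
    """
    H1 = adaboost(X, y, w, t)                    # stage 1: t steps on (x_n, y_n, w_n)
    w_t = w * np.exp(-y * H1(X))                 # modified weights w_n^(t), cf. (5.9)
    H2 = adaboost(X, y, w_t, T)                  # stage 2: T steps on (x_n, y_n, w_n^(t))
    return lambda X_new: np.sign(H1(X_new) + H2(X_new))   # g_{t+T}(x)
```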

Our proposed SeedBoost algorithm simply generalizes the recursive steps above by replacing the first step with any learning algorithm, as listed below.