
IV. Different Implementations for RSVM

4.5 Stopping criteria

In order to compare the performance of different methods, we should use comparable stopping criteria for them. However, the many differences among these methods prevent us from finding precisely equivalent stopping criteria for all of them. Still, we try to use reasonable criteria, which are described below in detail.

In the next chapter we will use LIBSVM as the representative for solving the regular nonlinear SVM, so we first explain its stopping criterion. LIBSVM is a decomposition method which solves the dual form of the standard SVM with a linear cost function:

min_α   f(α) = (1/2) α^T Q α − e^T α                              (4.16)

subject to   y^T α = 0,
             0 ≤ α_i ≤ C,  i = 1, . . . , l.

It is shown in [6] that if C > 0, the KKT condition of (4.16) is equivalent to

m(α) = max( max_{α_i<C, y_i=1} −∇f(α)_i,  max_{α_i>0, y_i=−1} ∇f(α)_i )

     ≤ min( min_{α_i<C, y_i=−1} ∇f(α)_i,  min_{α_i>0, y_i=1} −∇f(α)_i ) = M(α).     (4.17)

For practical implementation, a tolerance ε is set, and the stopping criterion can be

m(α) ≤ M(α) + ε,                                                  (4.18)

where we choose ε = 0.001.
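To make the criterion concrete, the following is a minimal sketch (not LIBSVM's actual code; the arrays alpha, grad, y and the bound C are assumed inputs) of computing m(α) and M(α) from the dual gradient and checking (4.18):

```python
import numpy as np

def reached_stopping_criterion(alpha, grad, y, C, eps=1e-3):
    """Check (4.18): m(alpha) <= M(alpha) + eps, with m and M as in (4.17)."""
    v = -y * grad                                                  # -y_i * grad_i covers both sign cases of (4.17)
    up  = ((y ==  1) & (alpha < C)) | ((y == -1) & (alpha > 0))    # index set defining m(alpha)
    low = ((y == -1) & (alpha < C)) | ((y ==  1) & (alpha > 0))    # index set defining M(alpha)
    return v[up].max() <= v[low].min() + eps
```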

To compare the performance of the linear and quadratic cost functions, we also modify LIBSVM to solve (2.7). It is easy to see that a similar stopping criterion can be used. Another decomposition package, BSVM, is used for solving RSVM as described in Section 4.4. It also has a similar stopping criterion.

The LSVM method, another implementation for RSVM, solves problems in the dual form (4.13). Hence we can use a similar stopping criterion. Now

∇f(α̂) = ( Q̃Q̃^T + I/(2C) ) α̂ − e.

The KKT condition of (4.13) shows that if

min_i ∇f(α̂)_i ≥ max_{α̂_i>0} ∇f(α̂)_i,

then α̂ is an optimal solution. Thus, a stopping criterion with a tolerance ε can be

max_{α̂_i>0} ∇f(α̂)_i ≤ min_i ∇f(α̂)_i + ε.                        (4.19)

We also set ε = 0.001 here.
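A corresponding check could look like the sketch below, again only an illustration with assumed names (Q_tilde for Q̃, alpha_hat for α̂):

```python
import numpy as np

def lsvm_reached_stopping(Q_tilde, alpha_hat, C, eps=1e-3):
    """Check (4.19) for the LSVM dual (4.13): compare the maximal gradient
    component over the positive variables with the global minimum."""
    grad = (Q_tilde @ (Q_tilde.T @ alpha_hat)
            + alpha_hat / (2.0 * C)
            - np.ones(alpha_hat.shape[0]))        # gradient of the dual objective (4.13)
    active = alpha_hat > 0
    if not np.any(active):                        # no positive component: criterion holds trivially
        return True
    return grad[active].max() <= grad.min() + eps
```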


For the SSVM, we simply employ the original stopping criterion of TRON:

‖∇f(α̃^k)‖_2 ≤ ε ‖∇f(α̃^1)‖_2,                                    (4.20)

where f(α̃) is defined in (4.3), α̃^1 is the initial solution, α̃^k is the current solution, and ε = 10^−5.
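This relative gradient-norm rule can be expressed compactly; the sketch below assumes grad_current and grad_initial are the gradient vectors at the current and initial iterates:

```python
import numpy as np

def tron_reached_stopping(grad_current, grad_initial, eps=1e-5):
    """Relative gradient-norm rule (4.20) used as the TRON stopping test."""
    return np.linalg.norm(grad_current) <= eps * np.linalg.norm(grad_initial)
```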

Note that for the LS-SVM implementation, we use direct methods to solve the linear system, so no stopping criterion is needed.

V. Experiments on RSVM

In this chapter we conduct experiments on some commonly used problems. First, we extend RSVM to multi-class problems. Several methods have been proposed for SVM multi-class classification. A common way is to consider a set of binary SVM problems, while some authors have also proposed methods that consider all classes at once. These methods in general can also be applied to RSVM.

There have been some comparisons of methods for multi-class SVM. In [15] we compare different decomposition implementations. Results indicate that different strategies for multi-class SVM achieve similar testing accuracy, but the one-against-one method is faster in practice. The comparison in [46] for LS-SVM also prefers the one-against-one method, so we use it for our implementation here. Suppose there are k classes of data; this method constructs k(k − 1)/2 classifiers, each trained on data from two classes. In classification we use a voting strategy: each binary classifier casts a vote for one of its two classes on every data point x, and in the end x is assigned to the class with the maximum number of votes.
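The following sketch illustrates the one-against-one construction and voting; the helpers train_binary and predict_binary are assumed stand-ins for whichever binary SVM or RSVM solver is used:

```python
from itertools import combinations
import numpy as np

def train_one_against_one(X, y, train_binary):
    """Train k(k-1)/2 binary classifiers, one for each pair of classes."""
    classes = np.unique(y)
    models = {}
    for ci, cj in combinations(classes, 2):
        mask = (y == ci) | (y == cj)
        labels = np.where(y[mask] == ci, 1, -1)          # +1 for class ci, -1 for class cj
        models[(ci, cj)] = train_binary(X[mask], labels)
    return classes, models

def predict_one_against_one(x, classes, models, predict_binary):
    """Each binary classifier casts one vote; return the class with the most votes."""
    votes = {c: 0 for c in classes}
    for (ci, cj), model in models.items():
        votes[ci if predict_binary(model, x) > 0 else cj] += 1
    return max(votes, key=votes.get)
```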



5.1 Problems and settings

We choose large multi-class datasets from the Statlog collection: dna, satimage, letter, and shuttle [30]. We also consider mnist [19], an important benchmark for handwritten digit recognition. The problem ijcnn1 is from the first problem of IJCNN challenge 2001 [38]. Note that we use the winner’s transformation of raw data [5].

The last one, protein, is a data set for protein secondary structure prediction [49].

Except for the problems dna, ijcnn1, and protein, whose data values are already in a small range, we scale all training data to be in [−1, 1]. Test data, available for all problems, are then adjusted to [−1, 1] accordingly. Note that for the problem mnist, training takes too long if all 60,000 training samples are used, so we consider the training and testing data together (i.e., 70,000 samples), use the first 30% for training, and test on the remaining 70%. Also note that for the problem satimage, there is one missing class; that is, in the original application there is one more class, but no examples of this class appear in the data set. We give problem statistics in Table 5.1. Some problems with a large number of attributes may be very sparse. For example, for each instance of protein, only 17 of the 357 attributes are nonzero.
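Scaling the test data "accordingly" means reusing the per-attribute ranges computed on the training set; a minimal sketch of this convention (array and function names are assumptions, not the experimental scripts):

```python
import numpy as np

def fit_scaling(X_train, lo=-1.0, hi=1.0):
    """Record per-attribute minima and maxima on the training data only."""
    return X_train.min(axis=0), X_train.max(axis=0), lo, hi

def apply_scaling(X, params):
    """Map attributes into [lo, hi] using the training ranges (also applied to test data)."""
    xmin, xmax, lo, hi = params
    span = np.where(xmax > xmin, xmax - xmin, 1.0)   # guard constant attributes
    return lo + (hi - lo) * (X - xmin) / span
```

With this convention, test values that exceed the training range simply fall slightly outside [−1, 1].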

Table 5.1: Problem statistics

Problem   #training data   #testing data   #class   #attribute
dna       2000             1300            3        180

We compare the four implementations of RSVM discussed in Chapter IV with two implementations of the regular SVM: linear and quadratic cost functions (i.e., (2.2) and (2.5)). For the regular SVM with the linear cost function, we use the software LIBSVM, which implements a simple decomposition method. We can easily modify LIBSVM to use the quadratic cost function, which we will refer to as LIBSVM-q in the rest of this thesis. However, we will not use it to solve RSVM, as LIBSVM implements an SMO-type algorithm where the size of the working set is restricted to two. In Section 4.4 we have shown that larger working sets should be used when applying decomposition methods to the linear SVM.

The computational experiments for this section were done on a Pentium III-1000 with 1024MB RAM using the gcc compiler. For three of the four RSVM methods (SSVM, LS-SVM, and LSVM), the main computational work is basic matrix operations, so we use ATLAS to optimize the performance [50]. This is crucial, as otherwise a direct implementation of these matrix operations can double or triple the computational time. For decomposition methods, where the kernel matrix cannot be fully stored, we allocate 500MB of memory as a cache for recently used kernel elements. Furthermore, LIBSVM, LIBSVM-q, and BSVM all use a shrinking technique: if most variables end up at bounds, smaller problems are solved by considering only the free variables. Details on the shrinking technique are in [6, Section 4].

Next we discuss the selection of parameters in different implementations. Selecting m, the size of the subset for RSVM, is of course a tricky issue; it depends on practical considerations such as how large the problem is. Here, in most cases, we fix m to be 10% of the training data, which was also considered in [21]. For multi-class problems we cannot use the same m for all binary problems, as the data set may be highly unbalanced. Hence we choose m to be 10% of the size of each binary problem, so a smaller binary problem uses a smaller m. In addition, for the problems shuttle, ijcnn1, and protein, some binary problems may be too large for training. Thus, we set m = 200 for these large binary problems. This is similar to how [21] deals with large problems.

Once the size is determined, for all four implementations we select the same subset for each binary RSVM problem. Note that n ≪ m for most problems we consider, so the time for kernel evaluations is O(lmn), which is much less than the time spent by each implementation of RSVM.
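One way to encode this subset-size rule and the random selection itself is sketched below (function and variable names are assumptions, not the thesis code):

```python
import numpy as np

def choose_subset_indices(n_binary, is_large_problem, frac=0.10, cap=200, seed=0):
    """Reduced set for one binary problem: 10% of its size in general, but a fixed
    size (200 here) for the binary problems flagged as too large for training."""
    m = cap if is_large_problem else int(frac * n_binary)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_binary, size=m, replace=False))
```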

We need to decide some additional parameters prior to the experiments. For the SSVM method, the smoothing parameter β is set to 5 because the performance is almost the same for all β ≥ 5. For the LSVM method, we set β = 1.9C, which is the same as in [28]. We do not conduct cross validation for selecting these parameters, as otherwise there would be too many parameters. For BSVM, which is used to solve the linear SVM arising from RSVM, we use all default settings. In particular, the size of the working set is 10.

The most important criterion for evaluating the performance of these methods is their accuracy. However, it is unfair to use only one parameter set and then compare the methods. In practice, for any method we have to find the best parameters by performing model selection. This is conducted on the training data, where the test data are assumed unknown. The best parameter set is then used for constructing the model for future testing. To reduce the search space of parameter sets, here we consider only the RBF kernel K(x_i, x_j) ≡ e^(−γ‖x_i − x_j‖²), so the parameters left to be decided are the kernel parameter γ and the cost parameter C. In addition, for multi-class problems we use the same C and γ for all k(k − 1)/2 binary problems of the one-against-one approach.

For each problem, we estimate the generalized accuracy using γ = [2^4, 2^3, 2^2, . . . , 2^−10] and C = [2^12, 2^11, 2^10, . . . , 2^−2]. Therefore, for each problem we try 15 × 15 = 225 combinations. For each pair (C, γ), the validation performance is measured by training on 70% of the training set and testing on the other 30% of the training set. Then we train on the whole training set using the pair (C, γ) that achieves the best validation rate and predict the test set. The resulting accuracy is presented in the "rate" columns of Table 5.2. Note that if several (C, γ) have the same accuracy in the validation stage, we apply all of them to the test data and report the highest rate. If several parameters result in the highest testing rate, we report the one with the minimal training time.
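A compact sketch of this grid search follows; train_model and accuracy are assumed stand-ins for whichever SVM or RSVM implementation is being tuned:

```python
import numpy as np

def grid_search(X, y, train_model, accuracy, seed=0):
    """Evaluate all 15 x 15 (C, gamma) pairs on a 70%/30% split of the training
    set and return the pair with the best validation accuracy."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    cut = int(0.7 * len(y))
    train_idx, valid_idx = perm[:cut], perm[cut:]

    C_grid     = [2.0 ** e for e in range(12, -3, -1)]    # 2^12, 2^11, ..., 2^-2
    gamma_grid = [2.0 ** e for e in range(4, -11, -1)]    # 2^4, 2^3, ..., 2^-10

    best_C, best_gamma, best_acc = None, None, -1.0
    for C in C_grid:
        for gamma in gamma_grid:
            model = train_model(X[train_idx], y[train_idx], C=C, gamma=gamma)
            acc = accuracy(model, X[valid_idx], y[valid_idx])
            if acc > best_acc:
                best_C, best_gamma, best_acc = C, gamma, acc
    return best_C, best_gamma, best_acc
```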

5.2 Results

Table 5.2 shows the results of comparing LIBSVM, LIBSVM-q, and the four RSVM implementations. We present the optimal parameters (C, γ) and the corresponding accuracy rates. It can be seen that the optimal parameters (C, γ) fall in different ranges for different implementations, so it is essential to test this many parameter sets. We observe that LIBSVM and LIBSVM-q have very similar accuracy. This does not contradict the current understanding in the SVM area, as we have not seen any report showing that one has higher accuracy than the other. Except for ijcnn1, the difference among the four RSVM implementations is also small. This is reasonable, as essentially they solve (3.10) with minor modifications.

For all problems, LIBSVM and LIBSVM-q perform better than the RSVM implementations. We can expect this because for RSVM the support vectors are randomly chosen in advance, so we cannot ensure that they are important representatives of the training data. This seems to imply that if problems are not too large, we should stick with the original SVM formulation. We think this situation is similar to the comparison between RBF networks and SVM [42]: as RBF networks select only a few centers, they may not extract enough information.


In general the optimal C of RSVM is much larger than that of the regular SVM.

As RSVM is in effect a linear SVM with far more data points than attributes, it tends to need a larger C so that the data can be correctly separated. How this property affects its model selection remains to be investigated.

We also observe that the accuracy of LS-SVM is a little lower than that of SSVM and LSVM. In particular, for the problem ijcnn1 the difference is quite large. Note that ijcnn1 is an unbalanced problem where 90% of the data have the same label. Thus the 91.7% accuracy of LS-SVM is quite poor. We suspect that the change of inequalities to equalities in LS-SVM may not be suitable for some problems.

Table 5.2: A comparison on RSVM: testing accuracy

              SVM                                    RSVM
          LIBSVM            LIBSVM-q          SSVM                LS-SVM            LSVM              Decomposition
Problem   C,γ        rate   C,γ        rate   C,γ         rate    C,γ        rate   C,γ        rate   C,γ        rate
dna       2^4,2^-6   95.447 2^2,2^-6   95.447 2^12,2^-10  92.833  2^4,2^-6   92.327 2^5,2^-7   93.002 2^9,2^-6   92.327

Table 5.3: A comparison on RSVM: number of support vectors

              SVM                        RSVM (SSVM, LS-SVM, LSVM, Decomposition)
Problem   LIBSVM #SV   LIBSVM-q #SV      #SV (all the same)

In Table 5.3 we report the number of “unique” support vectors for each method.

Table 5.4: A comparison on RSVM: training time and testing time (in seconds)

              SVM                                        RSVM
          LIBSVM              LIBSVM-q            SSVM       LS-SVM     LSVM       Decomposition
Problem   training  testing   training  testing   training   training   training   training   testing
dna       7.09      4.65      8.5       5.39      5.04       2.69       23.4       7.59       1.52
satimage  16.21     9.04      19.04     10.21     23.77      11.59      141.17     43.75      11.4
letter    230       89.53     140.14    75.24     193.39     71.06      1846.12    446.04     149.77
shuttle   113       2.11      221.04    3.96      576.1      150.59     3080.56    562.62     74.82
mnist     1265.67   4475.54   1273.29   4470.95   1464.63    939.76     4346.28    1913.86    7836.99
ijcnn1    492.53    264.58    2791.5    572.58    57.87      19.42      436.46     16152.54   6.36
protein   1875.9    687.9     9862.25   808.68    84.21      64.6       129.47     833.35     35

We say "unique" support vectors because, with the one-against-one approach, one training point may be a support vector of several binary classifiers. Hence, we report only the number of training points that are a support vector of at least one binary problem. Note that, as specified earlier, all four RSVM implementations have the same number of support vectors.
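In code, this count is simply the size of the union of support-vector index sets over all binary classifiers; the sketch below assumes each trained model exposes an attribute sv_indices holding indices into the full training set:

```python
def count_unique_support_vectors(models):
    """Number of training points that are a support vector of at least one
    of the k(k-1)/2 binary classifiers."""
    unique = set()
    for model in models.values():
        unique.update(model.sv_indices)    # assumed attribute: indices into the full training set
    return len(unique)
```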

For letter and mnist the RSVM approach has a large number of support vectors.

One reason is that the subsets selected for the binary problems are quite different. This is unlike the standard SVM, where important vectors may appear in several binary problems, so the number of unique support vectors is not that large. An alternative for RSVM would be to select a subset of all data first; then, for each binary problem, the support vectors are the elements of this subset that belong to the two corresponding classes. This would reduce the testing time since, as discussed in [15], when the number of classes is not huge the testing time is proportional to the number of unique support vectors. On the other hand, the training time would not be affected much, as the size of the binary problems stays similar.

Though the best parameters are different, we can roughly see that LIBSVM-q requires more support vectors than LIBSVM. This is consistent with the common understanding that the quadratic cost function leads to more data points with small nonzero ξ_i. For protein, LIBSVM and LIBSVM-q both require many training data as support vectors. Indeed, most of them are "free" support vectors (i.e., the corresponding dual variables α_i are not at the upper bound). Such cases are very difficult for decomposition methods, and their training time will be discussed later.

We report the training time and testing time in Table 5.4. For the four RSVM implementations, we find that their testing times are quite close, so we report only that of the decomposition implementation. Note that here we report only the time for solving the optimal model. Except for the LS-SVM implementation of RSVM, which solves one linear system for each parameter set, the training time of all other approaches depends on the parameters. In other words, their number of iterations may vary for different parameter sets. Thus, the results here cannot directly imply which method is faster, as the total training time depends on the model selection strategy. However, in general we can see that for problems with up to tens of thousands of data points, the decomposition method for the traditional SVM is still competitive. Therefore, RSVM will be mainly useful for larger problems. In addition, RSVM has more of an advantage for large binary problems, since for multi-class data sets we can afford to use decomposition methods to solve several smaller binary problems.

Table 5.4 also indicates that the training time of decomposition methods for SVM strongly depends on the number of support vectors. For the problem ijcnn1, compared to the standard LIBSVM, the number of support vectors using LIBSVM-q is doubled, and the training time is much higher. This is known in the literature: starting from the zero vector as the initial solution, the smaller the number of support vectors (i.e., nonzero elements of an optimal solution), the fewer variables may need to be updated in the decomposition iterations. As discussed earlier, protein is a very challenging case for LIBSVM and LIBSVM-q due to the high percentage of training data that become support vectors. For such problems, RSVM is a promising alternative: by using very few data points as support vectors, the computational time is greatly reduced while the accuracy does not suffer much.

Among the four RSVM implementations, the LS-SVM method is the fastest in the training stage. This is to be expected, as its cost is just that of one iteration of the SSVM method. Therefore, from the training times of the two implementations we can roughly infer the number of iterations of the SSVM method. As expected, Newton's method converges quickly, usually in fewer than 10 iterations. On the other hand, LSVM, which is cheaper per iteration, needs hundreds of iterations. For quite a few problems it is the slowest.

We observe that for ijcnn1 and protein the testing time of RSVM is much less than that of the traditional SVM, while for mnist the behavior is reversed. We have already mentioned, in the comparison of support vectors, how the testing time grows with the number of support vectors. For ijcnn1 and protein we reduce the number of support vectors to a few hundred, so the testing time is extremely small; for mnist our selection strategy results in a large number of support vectors in total, so the testing time is huge.

It is interesting to see that the decomposition method, not originally designed for the linear SVM, performs well on some problems. However, it is not very stable: for some difficult cases, the number of training iterations is prohibitive. In fact, we observe that for the problem ijcnn1, the number of nonzero dual variables α̂ is extremely large (more than 20,000). In such situations, it is very hard for the decomposition implementation to solve the linear SVM dual problem.


5.3 Some modifications on RSVM and their performance

In this section, two types of modifications are applied to the original RSVM formulation. The first concerns the regularization term. Recall from Chapter III that, following the generalized SVM, the authors of [21] replace (1/2)ᾱ^T Q_RR ᾱ in (3.9) by (1/2)ᾱ^T ᾱ and solve (3.10). So far we have only seen that for the LSVM implementation, without this replacement we may have trouble obtaining and using the dual problem. For SSVM and LS-SVM, the term (1/2)ᾱ^T Q_RR ᾱ can be kept and the same methods can still be applied. Since the change loses the margin-maximization property, we are interested in whether the performance (testing accuracy) worsens or not.

Keeping the (1/2)ᾱ^T Q_RR ᾱ term in (4.8), the LS-SVM formulation leads to a different linear system:

( Q̃^T Q̃ + (1/(2C)) Q_RR ) α̃ = Q̃^T e.                           (5.2)

(4.9) is a positive definite linear system because I/(2C) is positive definite. However, (5.2) does not share this property, since in some cases Q_RR/(2C) is only positive semi-definite. An example when using the RBF kernel is that some training data lie at the same point (e.g., dna in our experiments). Therefore, the Cholesky factorization which we used to solve (4.9) can fail on (5.2), so LU factorization is used instead. This change affects the training time little, according to the time complexity analysis in Section 3.2. A comparison of the testing accuracy between solving (4.9) and (5.2) is in Table 5.5. We can see that their accuracies are very similar. Therefore, we conclude that the use of the simpler quadratic term in RSVM is basically fine.
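A sketch of this solve-with-fallback logic (matrix names Q_tilde and Q_RR are assumed; this is an illustration, not the thesis code):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

def solve_modified_system(Q_tilde, Q_RR, C):
    """Solve (5.2): (Q~^T Q~ + Q_RR/(2C)) alpha~ = Q~^T e.
    Try Cholesky first and fall back to LU when the coefficient matrix is only
    positive semi-definite (e.g., duplicated training points under the RBF kernel)."""
    A = Q_tilde.T @ Q_tilde + Q_RR / (2.0 * C)
    b = Q_tilde.T @ np.ones(Q_tilde.shape[0])
    try:
        return cho_solve(cho_factor(A), b)
    except np.linalg.LinAlgError:
        return lu_solve(lu_factor(A), b)
```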

The second modification is changing the random selection of support vectors.

Table 5.5: A comparison on modified versions of RSVM: testing accuracy

          LS-SVM                                                   Decomposition
          (4.9)       (5.2)       ΣK:,i+(4.9)   ICF+(4.9)          (4.15)      loosen+(4.15)
Problem   C,γ  rate   C,γ  rate   C,γ  rate     C,γ  rate          C,γ  rate   C,γ  rate

The random selection is efficient, as it costs O(lmn) time, where n is the number of attributes of a training vector. However, we suspect that a more careful selection might improve the testing accuracy. We try several heuristics to select a subset of training data that we consider more important as support vectors, and then use the LS-SVM implementation to solve the reduced problem. They are listed as follows:

1. For each column of K, we calculate the sum of all its entries. Then we take the m vectors corresponding to the columns with the largest sums. Since the RBF kernel is used, all entries are positive, and so are the sums. We think columns with larger sums might be more important. The main work for this strategy is to obtain all entries of K, which costs O(l²n) time. This selection strategy is denoted as "ΣK:,i" in Table 5.5; a sketch of it is given after this list.

2. Conducting incomplete Cholesky factorization with symmetric pivoting on K
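As referenced in item 1, a minimal sketch of the column-sum heuristic (the kernel routine and variable names are assumptions, not the experimental code):

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Full kernel matrix with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def select_by_column_sum(X, gamma, m):
    """Heuristic 1: keep the m points whose kernel columns have the largest sums."""
    K = rbf_kernel(X, gamma)              # forming K costs O(l^2 n)
    col_sums = K.sum(axis=0)
    return np.argsort(col_sums)[-m:]      # indices of the m largest column sums
```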
