
V. Experiments on RSVM

5.2 Results

Table 5.2 shows the result of comparing LIBSVM, LIBSVM-q, and the four RSVM implementations. We present the optimal parameters (C, γ) and the corresponding accuracy rates. It can be seen that the optimal parameters (C, γ) fall in very different ranges for different implementations, so it is essential to test a wide range of parameter sets. We observe that LIBSVM and LIBSVM-q have very similar accuracy. This does not contradict the current understanding in the SVM area, as we have not seen any report showing that one has higher accuracy than the other. Except for ijcnn1, the differences among the four RSVM implementations are also small. This is reasonable as essentially they solve (3.10) with minor modifications.
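As an illustration of this kind of parameter search, the following sketch selects (C, γ) for an RBF-kernel SVM by cross-validation over a power-of-two grid; the grid boundaries, the 5-fold split, and the use of scikit-learn are assumptions for illustration, not the exact procedure of these experiments.

```python
# A sketch of (C, gamma) model selection for an RBF-kernel SVM over a
# power-of-two grid with 5-fold cross-validation.  The grid ranges, the
# fold count, and scikit-learn itself are illustrative assumptions.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def select_parameters(X_train, y_train):
    grid = {
        "C":     [2.0 ** k for k in range(-2, 13)],   # 2^-2 ... 2^12
        "gamma": [2.0 ** k for k in range(-10, 7)],   # 2^-10 ... 2^6
    }
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_
```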

For all problems, LIBSVM and LIBSVM-q perform better than the RSVM implementations. We can expect this because for RSVM the support vectors are randomly chosen in advance, so we cannot ensure that they are important representatives of the training data. This seems to imply that if problems are not too large, we would prefer to stick to the original SVM formulation. We think this situation resembles the comparison between RBF networks and SVM [42]: as RBF networks select only a few centers, they may not extract enough information.
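To make the point about randomly chosen support vectors concrete, here is a minimal sketch of the RSVM idea: fix a random subset as the only candidate support vectors, map every training example onto that subset through the kernel, and solve the resulting linear problem. The RBF kernel and scikit-learn's LinearSVC below are illustrative stand-ins, not the SSVM/LS-SVM/LSVM/decomposition solvers compared in this chapter.

```python
# Rough sketch of RSVM: pick a random subset of the training data as the
# only candidate support vectors, form the rectangular kernel K(A, A_bar),
# and solve the resulting (linear) problem in the reduced space.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import rbf_kernel

def train_rsvm(X, y, subset_size=200, gamma=1.0, C=16.0, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    X_sub = X[idx]                              # randomly chosen "support vectors"
    K = rbf_kernel(X, X_sub, gamma=gamma)       # rectangular kernel K(A, A_bar)
    clf = LinearSVC(C=C).fit(K, y)              # linear SVM in the reduced space
    return clf, X_sub

def predict_rsvm(clf, X_sub, X_test, gamma=1.0):
    return clf.predict(rbf_kernel(X_test, X_sub, gamma=gamma))
```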


In general the optimal C of RSVM is much larger than that of the regular SVM. As RSVM is essentially a linear SVM with many more data points than attributes, it tends to need a larger C so that the data can be correctly separated. How this property affects its model selection remains to be investigated.

We also observe that the accuracy of LS-SVM is a little lower than that of SSVM and LSVM. In particular, for the problem ijcnn1 the difference is quite large. Note that ijcnn1 is an unbalanced problem where 90% of the data have the same label, so the 91.7% accuracy of LS-SVM is quite poor. We suspect that the change from inequalities to equalities in LS-SVM may not be suitable for some problems.

Table 5.2: A comparison on RSVM: testing accuracy

                         SVM                                           RSVM
            LIBSVM              LIBSVM-q            SSVM                 LS-SVM              LSVM                Decomposition
Problem     (C, γ)       rate   (C, γ)       rate   (C, γ)        rate   (C, γ)       rate   (C, γ)       rate   (C, γ)       rate
dna         2^4, 2^6   95.447   2^2, 2^6   95.447   2^12, 2^10  92.833   2^4, 2^6   92.327   2^5, 2^7   93.002   2^9, 2^6   92.327

Table 5.3: A comparison on RSVM: number of support vectors

                     SVM                               RSVM
            LIBSVM      LIBSVM-q      SSVM      LS-SVM      LSVM      Decomposition
Problem     #SV         #SV           #SV (the same for all four RSVM implementations)

In Table 5.3 we report the number of “unique” support vectors for each method. We say “unique” support vectors because, under the one-against-one approach, one training example may be a support vector of several binary classifiers. Hence, we report only the number of training data that correspond to at least one support vector of a binary problem. Note that, as specified earlier, all four RSVM implementations have the same number of support vectors.
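The counting of “unique” support vectors can be sketched as follows: train one binary classifier per class pair and take the union of the support-vector indices, mapped back to the original training set. The use of scikit-learn's SVC here is an illustrative assumption.

```python
# Sketch of counting "unique" support vectors under one-against-one:
# train one binary SVM per class pair and take the union of the indices
# of all support vectors, expressed in the original training set.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def count_unique_support_vectors(X, y, C=16.0, gamma=1.0):
    unique_sv = set()
    for a, b in combinations(np.unique(y), 2):
        idx = np.where((y == a) | (y == b))[0]     # original indices of this pair's data
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X[idx], y[idx])
        unique_sv.update(idx[clf.support_])        # map back to original indices
    return len(unique_sv)
```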

Table 5.4: A comparison on RSVM: training time and testing time (in seconds)

                        SVM                                              RSVM
            LIBSVM               LIBSVM-q             SSVM        LS-SVM      LSVM        Decomposition
Problem     training   testing   training   testing   training    training    training    training    testing
dna         7.09       4.65      8.5        5.39       5.04        2.69        23.4        7.59        1.52
satimage    16.21      9.04      19.04      10.21      23.77       11.59       141.17      43.75       11.4
letter      230        89.53     140.14     75.24      193.39      71.06       1846.12     446.04      149.77
shuttle     113        2.11      221.04     3.96       576.1       150.59      3080.56     562.62      74.82
mnist       1265.67    4475.54   1273.29    4470.95    1464.63     939.76      4346.28     1913.86     7836.99
ijcnn1      492.53     264.58    2791.5     572.58     57.87       19.42       436.46      16152.54    6.36
protein     1875.9     687.9     9862.25    808.68     84.21       64.6        129.47      833.35      35

For letter and mnist the RSVM approach has a large number of support vectors. A reason is that the subsets selected for the different binary problems are quite different. This is unlike the standard SVM, where important vectors tend to reappear across binary problems, so the number of unique support vectors is not that large. An alternative for RSVM would be to first select a subset of all data and then, for each binary problem, use as support vectors the elements of this subset that belong to the two corresponding classes. This would reduce the testing time, which, as discussed in [15], is proportional to the number of unique support vectors when the number of classes is not huge. In contrast, the training time would not be affected much, as the size of the binary problems remains similar.
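A minimal sketch of this alternative selection strategy, under the same illustrative assumptions as before (RBF kernel, LinearSVC), is the following: one global subset is drawn once, and each binary problem reuses only the subset members that belong to its two classes, so the union of support vectors cannot exceed the subset size.

```python
# Sketch of the alternative strategy: draw one global subset, then each
# one-against-one binary problem reuses the subset members belonging to
# its own two classes, so the union of support vectors stays small.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import rbf_kernel

def train_shared_subset_rsvm(X, y, subset_size=300, gamma=1.0, C=16.0, seed=0):
    rng = np.random.default_rng(seed)
    subset = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    models = {}
    for a, b in combinations(np.unique(y), 2):
        sub_ab = subset[np.isin(y[subset], [a, b])]     # subset members of classes a, b
        if len(sub_ab) == 0:
            continue                                    # no subset members for this pair
        data_ab = np.where((y == a) | (y == b))[0]      # all training data of classes a, b
        K = rbf_kernel(X[data_ab], X[sub_ab], gamma=gamma)
        models[(a, b)] = (LinearSVC(C=C).fit(K, y[data_ab]), sub_ab)
    return models
```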

Though the best parameters are different, roughly we can see that LIBSVM-q requires more support vectors than LIBSVM. This is consistent with the common understanding that the quadratic cost function leads to more data with small nonzero ξi. For protein, LIBSVM and LIBSVM-q both require many training data as support vectors. Indeed most of them are “free” support vectors (i.e., the corresponding dual variables αi are not at the upper bound). Such cases are very difficult for decomposition methods; their training time will be discussed later.
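For reference, the contrast here is between the standard L1-loss objective and its quadratic-loss variant; the forms below are the textbook ones, and the exact scaling of the penalty term may differ slightly from the formulation used earlier in the thesis:

$$\min_{w,b,\xi}\ \tfrac{1}{2}w^Tw + C\sum_i \xi_i \quad\text{(L1 loss, LIBSVM)},\qquad
\min_{w,b,\xi}\ \tfrac{1}{2}w^Tw + C\sum_i \xi_i^2 \quad\text{(quadratic loss, LIBSVM-q)},$$

both subject to $y_i(w^T\phi(x_i)+b)\ge 1-\xi_i$ (and $\xi_i\ge 0$ in the L1 case). Because a small violation is penalized only quadratically, more training examples settle at small but nonzero ξi, and every such example is a support vector.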

We report the training time and testing time in Table 5.4. For the four RSVM implementations, we find their testing times are quite close, so we report only that of the decomposition implementation. Note that here we report only the time for solving the optimal model. Except for the LS-SVM implementation of RSVM, which solves a linear system for each parameter set, the training time of all other approaches depends on the parameters; in other words, their number of iterations may vary across parameter sets. Thus, the results here cannot directly tell which method is faster, as the total training time depends on the model-selection strategy. However, generally we can see that for problems with up to tens of thousands of data, the decomposition method for the traditional SVM is still competitive. Therefore, RSVM will be mainly useful for larger problems. In addition, RSVM offers more of an advantage for large binary problems, since for multi-class data sets we can afford to use decomposition methods to solve several smaller binary problems.

Table 5.4 also indicates that the training time of decomposition methods for SVM strongly depends on the number of support vectors. For the problem ijcnn1, compared to the standard LIBSVM, the number of support vectors using LIBSVM-q is doubled and the training time is much longer. This has been known in the literature: starting from the zero vector as the initial solution, the smaller the number of support vectors (i.e., nonzero elements at an optimal solution), the fewer variables need to be updated in the decomposition iterations. As discussed earlier, protein is a very challenging case for LIBSVM and LIBSVM-q due to the high percentage of training data that are support vectors. For such a problem, RSVM is a promising alternative: by using very few data points as support vectors, the computational time is greatly reduced and the accuracy does not suffer much.

Among the four RSVM implementations, the LS-SVM method is the fastest in the training stage. This is expected, as its cost is just that of one iteration of the SSVM method. Therefore, from the training times of the two implementations we can roughly infer the number of iterations of the SSVM method. As expected, Newton's method converges quickly, usually in fewer than 10 iterations. On the other hand, LSVM, which is cheaper per iteration, needs hundreds of iterations; for quite a few problems it is the slowest.

We observe that for ijcnn1 and protein the testing time of RSVM is much less than that of the traditional SVM, while for mnist the behavior reverses. We have already mentioned, when comparing support vectors, how the testing time grows with the number of support vectors. For ijcnn1 and protein we reduce the number of support vectors to a few hundred, so the testing time is extremely small; for mnist our selection strategy results in a large number of support vectors in total, so the testing time is huge.

It is interesting to see that the decomposition method, though not originally designed for the linear SVM, performs well on some problems. However, it is not very stable: for some difficult cases, the number of training iterations is prohibitive. In fact, we observe that for the problem ijcnn1, the number of nonzero dual variables α̂ is extremely large (more than 20000). In such a situation, it is very hard for the decomposition implementation to solve the linear SVM dual problem.

