
VI. ICF Approximated Kernel Representation for SVM

6.5 Experiments and results

We use the same target problems as in the RSVM experiments, with the same settings described in Chapter V. In particular, m here controls the number of columns of the ICF matrix V, and it is selected in the same way as for RSVM: in most cases m is set to 10% of the training data. However, since the number of dual variables is still l, the number of support vectors is not the same as for RSVM; instead, it depends on the sparsity of the solution.
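For concreteness, the following is a minimal NumPy sketch of a greedy pivoted incomplete Cholesky factorization that produces an l × m factor V with V Vᵀ ≈ K. It is only an illustration, not a reproduction of the algorithm of [24] or (VI.1); the helper rbf_column and the toy data are assumptions introduced for the example.

```python
import numpy as np

def rbf_column(X, i, gamma):
    """Hypothetical helper: column i of the RBF kernel matrix, K(x_j, x_i) for all j."""
    return np.exp(-gamma * ((X - X[i]) ** 2).sum(axis=1))

def pivoted_icf(X, m, gamma, tol=1e-8):
    """Greedy pivoted incomplete Cholesky: returns V (l x m) with V @ V.T ~ K."""
    l = X.shape[0]
    V = np.zeros((l, m))
    d = np.ones(l)                      # diagonal of an RBF kernel is all ones
    pivots = []
    for j in range(m):
        i = int(np.argmax(d))           # pivot on the largest residual diagonal
        if d[i] <= tol:                 # residual already negligible: stop early
            return V[:, :j], pivots
        pivots.append(i)
        V[:, j] = (rbf_column(X, i, gamma) - V @ V[i, :]) / np.sqrt(d[i])
        d -= V[:, j] ** 2               # update the residual diagonal
        d[i] = 0.0                      # guard the pivot entry against round-off
    return V, pivots

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))   # toy data, not the thesis benchmarks
    V, piv = pivoted_icf(X, m=20, gamma=0.25)
    K = np.exp(-0.25 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    print("rank-%d approximation error: %.3e" % (V.shape[1], np.linalg.norm(K - V @ V.T)))
```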

We compare five methods:

1. LIBSVM, the standard decomposition method for nonlinear SVM

2. Decomposition implementation for RSVM

3. ICFSVM using the first ICF algorithm (denoted as [24])

4. ICFSVM using the second ICF algorithm (denoted as (VI.1))

5. Re-training on the support vectors obtained by method 4.

To reduce the search space of parameter sets, here we consider only the RBF kernel K(x_i, x_j) ≡ exp(−γ‖x_i − x_j‖^2), so the parameters left to decide are the kernel parameter γ and the cost parameter C. For all methods, we conduct model selection on the training data, where the test data are assumed unknown. For each problem, we estimate the generalized accuracy using γ = 2^4, 2^3, 2^2, ..., 2^{-10} and C = 2^12, 2^11, 2^10, ..., 2^{-2}. Therefore, for each problem we try 15 × 15 = 225 combinations. For each pair (C, γ), the validation performance is measured by training on 70% of the training set and testing on the other 30%. The best parameter set is then used to construct the model for future testing. In addition, for multi-class problems, C and γ of all k(k − 1)/2 binary problems of the one-against-one approach are taken to be the same.
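As an illustration of this protocol, the sketch below runs the same 15 × 15 grid with a single 70/30 validation split. scikit-learn's SVC is used purely as a stand-in solver (an assumption on our part; the experiments above use LIBSVM, RSVM, and ICFSVM implementations), and X, y denote a generic training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_rbf_parameters(X, y, seed=0):
    """Pick (C, gamma) from the 15 x 15 grid by one 70/30 validation split."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    best_acc, best_C, best_gamma = -np.inf, None, None
    for log_C in range(12, -3, -1):            # C = 2^12, 2^11, ..., 2^-2
        for log_gamma in range(4, -11, -1):    # gamma = 2^4, 2^3, ..., 2^-10
            clf = SVC(C=2.0 ** log_C, gamma=2.0 ** log_gamma, kernel="rbf")
            clf.fit(X_tr, y_tr)
            acc = clf.score(X_val, y_val)      # validation accuracy on the 30% split
            if acc > best_acc:
                best_acc, best_C, best_gamma = acc, 2.0 ** log_C, 2.0 ** log_gamma
    return best_C, best_gamma, best_acc
```

Note that SVC already shares one (C, γ) pair across all k(k − 1)/2 one-against-one binary subproblems, which matches the convention adopted above.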

Table 6.1 shows the testing accuracy of all methods. ICFSVM using the second ICF algorithm achieves the highest rate. However, this best rate is similar to that of RSVM, so for moderate-sized training sets ICFSVM is likewise not recommended. We also note that the optimal C for ICFSVM is smaller than that for RSVM. This nice property helps the decomposition implementation, since it is more stable for normal parameter values, especially for moderately small C. In addition, it is easier to apply different model selection techniques to ICFSVM than to RSVM: many of them work only in the region where C and γ are neither too small nor too large.

In Table 6.2 we report the number of "unique" support vectors at the optimal model of each method. We observe that the result of ICFSVM using the second algorithm is similar to that of the original SVM. Hence, this method also retains the SVM property that support vectors are often sparse. The number of support vectors for RSVM is decided in advance, and hence is not comparable with that of the other methods.

We report the training time and testing time for solving the optimal model in Table 6.3. Since ICFSVM has to perform the ICF, an extra step before solving (6.2), its training time may be longer than that of RSVM, whose main work is solving (4.15), a QP similar to (6.2). Moreover, from the table we see that for some problems performing the ICF occupies most of the training time.


Table 6.1: A comparison on ICFSVM: testing accuracy

SVM    RSVM    ICFSVM (decomposition implementation)

N/A: training time too large to apply the model selection

Table 6.2: A comparison on ICFSVM: number of support vectors

             SVM       RSVM            ICFSVM
Problem      LIBSVM    Decomposition   [24]      (VI.1)    (VI.1)+retrain
dna          973       372             1688      1389      1588
satimage     1611      1826            4022      1187      1507
letter       8931      13928           12844     5390      8953
shuttle      285       4982            43026     308       3714
mnist        8333      12874           N/A       5295      5938
ijcnn1       4555      200             49485     4507      8731
protein      14770     596             N/A       15049     15512

N/A: training time too large to apply the model selection

Table 6.3: A comparison on ICFSVM: training time and ICF time (in seconds)

             SVM        RSVM            ICFSVM
             LIBSVM     Decomposition   [24]                  (VI.1)               (VI.1)+retrain
Problem      training   training        training    ICF       training   ICF       training   ICF
dna          7.09       7.59            440.41      427.18    9.62       5.45      33.77      5.52
satimage     16.21      43.75           558.23      467.48    48.49      28.37     61.59      28.32
letter       230        446.04          3565.31     2857.95   484.59     222.4     635.41     221.93
shuttle      113        562.62          70207.76    13948.14  1251.17    1184.63   1811.6     1265.51
mnist        1265.67    1913.86         N/A         N/A       2585.13    2021.64   2565.08    1866.9
ijcnn1       492.53     16152.54        21059.3     4680.63   5463.8     103.97    1579.73    102.52
protein      1875.9     833.35          N/A         N/A       217.53     92.52     3462.57    110.54

N/A: training time too large to apply the model selection

However, for the best model the training time of ICFSVM may not exceed that of RSVM. This is because ICFSVM attains its best model at a smaller C. Both methods here use decomposition methods, which can be very slow when C is large (see the discussion in [15]). Thus, if proper model selection techniques are used, the overall training time of ICFSVM can be competitive with that of RSVM. In addition, if the model selection process performs a line search by fixing the other parameters and varying C, we only have to do the ICF once. This may be another advantage of ICFSVM. However, as mentioned earlier, RSVM can be solved through both primal and dual forms, so in this respect it is more flexible and efficient if only a single parameter set is considered.
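To illustrate the point about reusing one factorization across a line search in C, the sketch below precomputes the approximated Gram matrix V Vᵀ once and then solves the dual for each C. SVC(kernel="precomputed") is used only as a stand-in dual solver (an assumption); pivoted_icf refers to the hypothetical sketch given earlier in this section, and y is assumed to be a NumPy array of labels.

```python
import numpy as np
from sklearn.svm import SVC

def c_line_search(y, V, C_grid, val_fraction=0.3, seed=0):
    """Search over C while reusing a single ICF factor V (V @ V.T ~ K)."""
    l = V.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(l)
    n_val = int(val_fraction * l)
    val, tr = idx[:n_val], idx[n_val:]
    Q = V @ V.T                                # approximated kernel, built once
    best_C, best_acc = None, -np.inf
    for C in C_grid:
        clf = SVC(C=C, kernel="precomputed")
        clf.fit(Q[np.ix_(tr, tr)], y[tr])      # Gram matrix among training points
        acc = clf.score(Q[np.ix_(val, tr)], y[val])  # rows: validation, columns: training
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C, best_acc

# Example: C_grid = [2.0 ** k for k in range(12, -3, -1)], with V from pivoted_icf above.
```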

The performance of re-training on the support vectors obtained by ICFSVM is generally worse than that of the original ICFSVM. Moreover, the optimal parameters for all problems are quite different. This indicates that ICFSVM actually finds another set of support vectors, and re-training on this set leads to a quite different model.

The results for the first ICF method are stranger still. Although for some problems it is comparable with the other methods, for others the performance is extremely bad, which shows that the trained model is quite far from the original SVM model. Moreover, we cannot provide results for mnist and protein due to the huge training time of each model. The reason is that for these problems the resulting ICF is not close to the kernel matrix, as detailed in the earlier discussion of this algorithm. That discussion also explains the huge training time: the frequent restarting of the ICF process wastes much time. Moreover, as the identity matrix rather than the kernel matrix dominates the approximated problem, many training models with different parameters are alike. That is, the optimal model is less sensitive to the parameters C and γ.

CHAPTER VII

Discussions and Conclusions

In this thesis we first discuss four multi-class implementations of RSVM and compare them with two decomposition methods based on LIBSVM. Experiments indicate that in general the test accuracy of RSVM is a little worse than that of the standard SVM. Though RSVM keeps constraints similar to the primal form of SVM, restricting support vectors to a randomly selected subset still downgrades the performance. Moreover, we show that it is hard to improve the performance by changing the selection strategy. For the training time, which is the main motivation of RSVM, we show that, based on current implementation techniques, RSVM is faster than regular SVM on large problems or on some difficult cases with many support vectors. Therefore, for medium-sized problems the standard SVM should be used, but for large problems, as RSVM can effectively restrict the number of support vectors, it can be an appealing alternative. Regarding the implementation of RSVM, least-squares SVM (LS-SVM) is the fastest among the four methods compared here, though its accuracy is a little lower. Thus, for very large problems it is appropriate to try this implementation first.

We then discuss implementations for applying the ICF approximated kernel to the SVM problem. Experiments show that the testing accuracy is also a little lower than that of traditional SVM. Compared with RSVM, we avoid the random selection of support vectors by using the ICF strategy, but the approximation does not seem good enough to generate models as good as those of SVM. Moreover, though both RSVM and ICFSVM can be implemented by linear SVM solvers, ICFSVM is restricted to solvers focusing on the dual form. For solvers focusing on the primal form, we explain why their solutions cannot be used to generate ICFSVM models.

However, the optimal penalty parameter of ICFSVM is smaller than that of RSVM, which is usually a merit for model selection strategies. We also collect the set of support vectors obtained by ICFSVM and then use SVM solvers to train on this set. Experimental results show that this extra work downgrades the testing accuracy further.

BIBLIOGRAPHY

[1] K. P. Bennett, D. Hui, and L. Auslender. On support vector decision trees for database marketing. Department of Mathematical Sciences Math Report No. 98-100, Rensselaer Polytechnic Institute, Troy, NY 12180, Mar. 1998.

[2] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 1992.

[3] M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares, Jr., and D. Haussler. Knowledge-based analysis of microarray gene expression data using support vector machines. Proceedings of the National Academy of Science, 97(1):262–267, 2000.

[4] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[5] C.-C. Chang and C.-J. Lin. IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN, 2001. Winner of IJCNN Challenge.

[6] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

[8] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.

[9] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002.

[10] A. Edelman. Large dense numerical linear algebra in 1993: The parallel computing influence. International J. Supercomputer Appl., 7:113–128, 1993.

[11] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.


[12] T.-T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In Proceedings of 15th Intl. Conf. Machine Learning. Morgan Kaufman Publishers, 1998.

[13] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, second edition, 1989.

[14] M. Heiler. Optimization criteria and learning algorithms for large margin classifiers. Master's thesis, University of Mannheim, Department of Mathematics and Computer Science, Computer Vision, Graphics, and Pattern Recognition Group, D-68131 Mannheim, Germany, 2001.

[15] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.

[16] C.-W. Hsu and C.-J. Lin. A simple decomposition method for support vector machines. Machine Learning, 46:291–314, 2002.

[17] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

[18] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of International Conference on Machine Learning, 1999.

[19] Y. LeCun. MNIST database of handwritten digits. Available at http://www.research.att.com/~yann/exdb/mnist.

[20] Y. LeCun, L. Jackel, L. Bottou, A. Brunot, C. Cortes, J. Denker, H. Drucker, I. Guyon, U. Muller, E. Sackinger, P. Simard, and V. Vapnik. Comparison of learning algorithms for handwritten digit recognition. In F. Fogelman and P. Gallinari, editors, International Conference on Artificial Neural Networks, pages 53–60, Paris, 1995. EC2 & Cie.

[21] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining, 2001.

[22] C.-J. Lin. Formulations of support vector machines: a note from an optimization point of view. Neural Computation, 13(2):307–317, 2001.

[23] C.-J. Lin and J. J. Moré. Newton's method for large-scale bound constrained problems. SIAM J. Optim., 9:1100–1127, 1999.

[24] C.-J. Lin and R. Saigal. An incomplete Cholesky factorization for dense matrices. BIT, 40:536–558, 2000.

[25] K.-M. Lin and C.-J. Lin. A study on reduced support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, 2002.


[26] O. L. Mangasarian. Generalized support vector machines. In A. J. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 135–146. MIT Press, 2000.

[27] O. L. Mangasarian and D. R. Musicant. Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10(5):1032–1037, 1999.

[28] O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Journal of Machine Learning Research, 1:161–177, 2001.

[29] N. Matic, I. Guyon, J. Denker, and V. Vapnik. Writer adaptation for on-line handwritten character recognition. In Second International Conference on Pattern Recognition and Document Analysis, pages 187–191, Tsukuba, Japan, 1993. I. C. S. Press.

[30] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at anonymous ftp: ftp.ncc.up.pt/pub/statlog/.

[31] K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 243–254, Cambridge, MA, 1999. MIT Press.

[32] M. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Edinburgh University, 1996. http://www.anc.ed.ac.uk/#mjo/rbf.html.

[33] J. M. Ortega. Introduction to Parallel and Vector Solution of Linear Systems. Plenum Press, New York, 1988.

[34] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, 1997. IEEE.

[35] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In Proceedings of International Conference on Pattern Recognition, 1998.

[36] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In International Conference on Computer Vision ICCV’98, 1998.

[37] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

[38] D. Prokhorov. IJCNN 2001 neural network competition. Slide presentation in IJCNN'01, Ford Research Laboratory, 2001. http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.

[39] M. Rychetsky, S. Ortmann, and M. Glesner. Construction of a support vector machine with local experts. In Workshop on Support Vector Machines at the International Joint Conference on Artificial Intelligence (IJCAI 99), 1999.

[40] B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA, 1998.

[41] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.

[42] B. Schölkopf, K.-K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758–2765, 1997.

[43] A. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, 2000.

[44] M. O. Stitson, A. Gammerman, V. Vapnik, V. Vovk, C. Watkins, and J. Weston. Support vector regression with ANOVA decomposition kernels. In Schölkopf et al. [40], pages 285–292.

[45] J. Suykens and J. Vandewalle. Least square support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.

[46] J. Suykens and J. Vandewalle. Multiclass LS-SVMs: Moderated outputs and coding-decoding schemes. In Proceedings of IJCNN, Washington D.C., 1999.

[47] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, 1995.

[48] V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.

[49] J.-Y. Wang. Application of support vector machines in bioinformatics. Master’s thesis, Department of Computer Science and Information Engineering, National Taiwan University, 2002.

[50] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. Technical Report UT-CS-97-366, 1997.

APPENDICES


APPENDIX A

The Relation between (3.7) and (3.1)

The derivation from (3.5) and (3.6) to (3.7) shows only that if (w, b, ξ) and α are primal and dual optimal solutions, respectively, then (α, b, ξ) is feasible for (3.7). Hence, before using (3.7) in practice we want to make sure that any optimal (α, ξ, b) of (3.7) can construct an optimal w for the primal problem (3.1).

It is easy to see this: by using w = Σ_{i=1}^{l} y_i α_i φ(x_i), the resulting (w, b, ξ) is feasible for (3.1) and attains the same objective value as (α, b, ξ) in (3.7). Thus, (w, b, ξ) is optimal for the primal.
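The omitted step can be sketched as follows. This assumes that (3.1) is the standard soft-margin primal and that (3.7) is the formulation obtained by substituting the kernel expansion of w, so the exact equation forms here follow the standard SVM literature rather than being copied from the thesis:

\[
  w = \sum_{i=1}^{l} y_i \alpha_i \phi(x_i)
  \;\Longrightarrow\;
  w^T \phi(x_j) = \sum_{i=1}^{l} y_i \alpha_i K(x_i, x_j),
  \qquad
  w^T w = \alpha^T Q \alpha, \;\; Q_{ij} = y_i y_j K(x_i, x_j).
\]

With this substitution, the constraints and the objective value of (3.1) at (w, b, ξ) coincide with those of (3.7) at (α, b, ξ), which is why optimality transfers between the two problems.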


APPENDIX B

Derivations of (4.4) and (4.5)

In order to simplify the complex P_β(e − Q̃α̃) term, we define the vector

v ≡ exp(−β(e − Q̃α̃)),                                        (B.1)

and then

P_β(e − Q̃α̃) = (e − Q̃α̃) + β⁻¹ log(1 + v).                    (B.2)

The differentiations of v and P_β are:
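The derivatives can be reconstructed from (B.1) and (B.2); the following is only a sketch, under the assumptions that differentiation is taken with respect to α̃, that operations on v are elementwise, and that diag(v) denotes the diagonal matrix defined below:

\[
  \frac{\partial v}{\partial \tilde{\alpha}}
     = \beta \,\mathrm{diag}(v)\, \tilde{Q},
  \qquad
  \frac{\partial}{\partial \tilde{\alpha}}\, P_\beta(e - \tilde{Q}\tilde{\alpha})
     = -\tilde{Q} + \mathrm{diag}\!\left(\frac{v}{1+v}\right) \tilde{Q}
     = -\,\mathrm{diag}\!\left(\frac{1}{1+v}\right) \tilde{Q}.
\]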

Finally, we obtain the gradient and Hessian of f, where δ_ij denotes the Kronecker delta (δ_ij = 1 if i = j and 0 otherwise).

If we define diag(v) to be the d × d diagonal matrix with diagonal entries v_1, ..., v_d for a vector v of dimension d, we obtain a clearer matrix form:

