
An SMO-type Method for General non-PSD Kernels

Results in Section 6.1 depend on properties of the sigmoid kernel. Here we propose an SMO-type method which is able to handle all kernel matrices, whether they are PSD or not. For such a method, the key is solving the sub-problem when $K_{ii} - 2K_{ij} + K_{jj} < 0$. In this case, (24a) is a concave quadratic function like that in Figure 4. The two sub-figures clearly show that the global optimal solution of (24) can be obtained by checking the objective values at the two bounds L and H.

A disadvantage is that this procedure of checking two points differs from the solution procedure for the case $K_{ii} - 2K_{ij} + K_{jj} \ge 0$. Thus, we propose to consider only the lower bound L which, as L < 0, always ensures a strict decrease of the objective function.

Therefore, the algorithm is as follows:

If $K_{ii} - 2K_{ij} + K_{jj} > 0$, then $\hat d_j$ is the maximum of (25) and $L$; else $\hat d_j = L$.  (27)

In practice, the only change to the code may be replacing (25) by

$$
\frac{-\bigl(-y_i \nabla F(\alpha^k)_i + y_j \nabla F(\alpha^k)_j\bigr)}{\max\bigl(K_{ii} - 2K_{ij} + K_{jj},\, 0\bigr)}. \qquad (28)
$$

When $K_{ii} + K_{jj} - 2K_{ij} < 0$, (28) is $-\infty$. Then, as in the situation $K_{ii} + K_{jj} - 2K_{ij} = 0$, $\hat d_j = L$ is taken.
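To make rule (27)/(28) concrete, the following minimal sketch carries out one such sub-problem step; the names K, y, grad, L and the pair (i, j) are illustrative stand-ins for the kernel matrix, the labels, the gradient of F at the current iterate, the lower bound of the sub-problem, and the selected working set.

def smo_substep(K, y, grad, i, j, L):
    """Rule (27)/(28): step d_j for the selected pair (i, j), with d_i = -d_j.

    K    -- kernel matrix (possibly non-PSD)
    y    -- labels in {+1, -1}
    grad -- gradient of F at the current alpha^k
    L    -- lower bound of the sub-problem (L < 0)
    """
    a = K[i][i] - 2.0 * K[i][j] + K[j][j]
    b = -y[i] * grad[i] + y[j] * grad[j]
    if a > 0.0:
        d_j = max(-b / a, L)   # (27), first case: the Newton-like step (25) clipped at L
    else:
        d_j = L                # (27), second case; matches (28) being -infinity when a < 0
    return d_j, -d_j           # d_i = -d_j, as used in the proof of Lemma 3

Compared with a PSD-only implementation, only the denominator changes to max(K_ii - 2K_ij + K_jj, 0), as in (28); taking only the lower bound L keeps the objective strictly decreasing.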

An advantage of this strategy is that we do not have to solve (24) exactly. Moreover, (28) shows that only a very simple modification of the PSD-kernel version is needed, and it becomes easier to prove asymptotic convergence; the reason will be discussed after Lemma 3. In the following we prove that any limit point of the decomposition procedure discussed above is a stationary point of (2). In earlier convergence results, Q is PSD, so a stationary point is already a global minimum.

If the working set selection is via (22), existing convergence proofs for PSD kernels (Lin 2001; Lin 2002) require the following important lemma which is also needed here:

Lemma 3 There exists $\sigma > 0$ such that for any $k$,

$$
F(\alpha^{k+1}) \le F(\alpha^k) - \frac{\sigma}{2}\,\|\alpha^{k+1} - \alpha^k\|^2. \qquad (29)
$$

Proof.

If $K_{ii} + K_{jj} - 2K_{ij} \ge 0$ in the current iteration, (Lin 2002) shows that (29) holds by selecting $\sigma$ as

$$
\min\left\{\frac{2}{C},\ \min_{t,r}\left\{\frac{K_{tt}+K_{rr}-2K_{tr}}{2} \;\middle|\; K_{tt}+K_{rr}-2K_{tr} > 0\right\}\right\}. \qquad (30)
$$

If $K_{ii}+K_{jj}-2K_{ij} < 0$, the step chosen is $\hat d_j = L < 0$, so $(-y_i \nabla F(\alpha^k)_i + y_j \nabla F(\alpha^k)_j)\,\hat d_j < 0$; that is, the linear term of (24a) is negative. As $\|\alpha^{k+1}-\alpha^k\|^2 = 2\hat d_j^2$ from $\hat d_i = -\hat d_j$, (24a) therefore implies that

$$
\begin{aligned}
F(\alpha^{k+1}) - F(\alpha^k) &< \frac{1}{2}(K_{ii}+K_{jj}-2K_{ij})\,\hat d_j^2 \qquad (31)\\
&= \frac{1}{4}(K_{ii}+K_{jj}-2K_{ij})\,\|\alpha^{k+1}-\alpha^k\|^2\\
&\le -\frac{\sigma'}{2}\,\|\alpha^{k+1}-\alpha^k\|^2,
\end{aligned}
$$

where

$$
\sigma' \equiv -\max_{t,r}\left\{\frac{K_{tt}+K_{rr}-2K_{tr}}{2} \;\middle|\; K_{tt}+K_{rr}-2K_{tr} < 0\right\}. \qquad (32)
$$

Therefore, by defining $\sigma$ as the minimum of (30) and (32), the proof is complete. □

Next we give the main convergence result:

Theorem 11 For the decomposition method using (22) for the working set selection and (27) for solving the sub-problem, any limit point of $\{\alpha^k\}$ is a stationary point of (2).

Proof.

A careful check of the proofs in (Lin 2001; Lin 2002) shows that they can be extended to non-PSD Q if (1) (29) holds and (2) a local minimum of the sub-problem is obtained in each iteration. We have (29) from Lemma 3. In addition, $\hat d_j = L$ is essentially one of the two local minima of problem (24), as clearly seen from Figure 4. Thus, the same proof follows. □

There is an interesting remark about Lemma 3. If (24) is solved exactly, so far we have not been able to establish Lemma 3. The reason is that if $\hat d_j = H$ is taken, then $(-y_i \nabla F(\alpha^k)_i + y_j \nabla F(\alpha^k)_j)\,\hat d_j > 0$, so (31) may not hold. Therefore, the convergence is not clear. In the whole convergence proof, Lemma 3 is used to obtain $\|\alpha^{k+1}-\alpha^k\| \to 0$ as $k \to \infty$. A different way to obtain this property is to slightly modify the sub-problem (20), as shown in (Palagi and Sciandrone 2002). Then the convergence holds when the new sub-problem is solved exactly.

Although Theorem 11 shows only that the improved SMO algorithm converges to a stationary point rather than a global minimum, the algorithm nevertheless shows a way to design robust SVM software that takes separability into account. Theorem 1 indicates that a stationary point is feasible for the separability problem (5). Thus, if the number of support vectors of this stationary point is not too large, the training error is not too large either. Furthermore, with the additional constraints yTα = 0 and 0 ≤ αi ≤ C, i = 1, . . . , l, a stationary point may already be a global one. If this happens at parameters giving better accuracy, we need not worry about multiple stationary points at other parameters. An example is the sigmoid kernel, where the discussion in Section 3 indicates that parameters with better accuracy tend to correspond to CPD kernel matrices.

It is well known that neural networks have a similar problem with local minima (Sarle 1997), and a popular way to avoid being trapped in a bad one is to use multiple random initializations. Here we adapt this method and present an empirical study in Figure 5.

We use the heart data set, with the same setting as in Figure 2. Figure 5(a) is the contour obtained using the zero vector as the initial α^0. Figure 5(b) is the contour obtained by choosing, among five runs with different random initial α^0, the solution with the smallest objective value.
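The procedure behind Figure 5(b) can be sketched as follows; solve_dual is an assumed interface to any solver of (2) (for example the modified SMO above) that takes an initial α and returns the solution together with its objective value.

import numpy as np

def best_of_random_starts(solve_dual, l, C, n_starts=5, seed=0):
    """Run the non-convex dual solver from several initial points and keep
    the solution with the smallest objective value (as in Figure 5(b)).

    solve_dual(alpha0) -> (alpha, objective) is an assumed interface;
    l is the number of training points and C the upper bound on each alpha_i.
    """
    rng = np.random.default_rng(seed)
    starts = [np.zeros(l)]                                   # zero start, as in Figure 5(a)
    starts += [rng.uniform(0.0, C, size=l) for _ in range(n_starts - 1)]
    # A practical implementation would first map each random start onto the
    # feasible region {0 <= alpha <= C, y^T alpha = 0} before calling the solver.
    results = [solve_dual(a0) for a0 in starts]
    return min(results, key=lambda t: t[1])                  # (alpha, objective)

The zero vector is kept as one of the candidates so the comparison with the single-start strategy of Figure 5(a) is direct.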

The performance in Figures 5(a) and 5(b) is similar, especially in regions with good rates. For example, when r < −0.5 the two contours are almost the same, a property which may be explained by the CPD-ness in that region. In the regions where multiple stationary points may occur (e.g., C > 2^6 and r > 0.5), the two contours differ, but the rates are still similar. We observe similar results on other data sets as well. Therefore, the stationary point obtained with zero initialization seems good enough in practice.

Figure 5: Comparison of cross validation rates between approaches without (left) and with (right) five random initializations

7 Modification to Convex Formulations

While (5) is non-convex, it is possible to slightly modify the formulation to be convex.

If the objective function is replaced by $\frac{1}{2}\sum_{i=1}^{l}\alpha_i^2$, keeping the rest of the formulation, then (5) becomes convex. Note that the non-PSD kernel K still appears in the constraints. The main drawback of this approach is that the $\alpha_i$ are in general non-zero, so unlike standard SVM, sparsity is lost.

There are other formulations which use a non-PSD kernel matrix but remain convex.

For example, we can consider the kernel logistic regression (KLR) (e.g., (Wahba 1998)) and use a convex regularization term:

$$
\min_{\alpha, b} \quad \frac{1}{2}\sum_{j=1}^{l} \alpha_j^2 + C\sum_{r=1}^{l} \log\bigl(1 + e^{\xi_r}\bigr), \qquad (33)
$$

where

$$
\xi_r \equiv -y_r\Bigl(\sum_{j=1}^{l} \alpha_j K(x_r, x_j) + b\Bigr).
$$

By defining an $(l+1) \times l$ matrix $\tilde K$ with

$$
\tilde K_{ij} =
\begin{cases}
K_{ij} & \text{if } 1 \le i \le l,\ 1 \le j \le l,\\
1 & \text{if } i = l+1,
\end{cases}
$$

the Hessian matrix of (33) is

$$
\tilde I + C \tilde K\,\mathrm{diag}(\tilde p)\,\mathrm{diag}(\mathbf{1} - \tilde p)\,\tilde K^T,
$$

where $\tilde I$ is an $(l+1) \times (l+1)$ identity matrix with the last diagonal element replaced by zero, $\tilde p \equiv [1/(1+e^{\xi_1}), \ldots, 1/(1+e^{\xi_l})]^T$, and $\mathrm{diag}(\tilde p)$ is the diagonal matrix generated by $\tilde p$.

Clearly, the Hessian matrix is always positive semidefinite, so (33) is convex.
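The PSD claim can also be verified numerically. The sketch below builds the Hessian exactly as written above, with K, xi, and C as illustrative inputs (the l × l kernel matrix, the vector of ξ_r values, and the regularization constant).

import numpy as np

def klr_hessian(K, xi, C):
    """Hessian I~ + C K~ diag(p)(I - diag(p)) K~^T of (33) in the variables (alpha, b).

    K  : l x l kernel matrix (need not be PSD)
    xi : vector of xi_r = -y_r (sum_j alpha_j K(x_r, x_j) + b)
    C  : regularization constant
    """
    l = K.shape[0]
    K_tilde = np.vstack([K, np.ones((1, l))])   # (l+1) x l, last row all ones
    I_tilde = np.eye(l + 1)
    I_tilde[l, l] = 0.0                         # zero out the entry corresponding to b
    p = 1.0 / (1.0 + np.exp(xi))                # p_r = 1 / (1 + e^{xi_r})
    W = np.diag(p * (1.0 - p))                  # diag(p) diag(1 - p)
    return I_tilde + C * K_tilde @ W @ K_tilde.T

Both terms of the sum are positive semidefinite, so the smallest eigenvalue returned by np.linalg.eigvalsh(klr_hessian(K, xi, C)) is nonnegative (up to rounding) even when K itself is indefinite.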

In the following we compare SVM (RBF and sigmoid kernels) and KLR (sigmoid kernel). Four data sets are tested: heart, german, diabete, and a1a. They are from (Michie, Spiegelhalter, and Taylor 1994) and (Blake and Merz 1998). The first three data sets are linearly scaled, so the values of each attribute are in [-1, 1]. For a1a, the values are binary (0 or 1), so we do not scale it. We train SVM (RBF and sigmoid kernels) by LIBSVM (Chang and Lin 2001), an SMO-type decomposition implementation that uses the techniques in Section 6 for solving non-convex optimization problems. For KLR, two optimization procedures are compared. The first one, KLR-NT, is a Newton's method implemented by modifying the software TRON (Lin and Moré 1999). The second one, KLR-CG, is a conjugate gradient method (see, for example, (Nash and Sofer 1996)). The stopping criteria for the two procedures are set to be the same so that the solutions are comparable.

For the comparison, we conduct a two-level cross validation. At the first level, the data are separated into five folds, and each fold is predicted by a model trained on the remaining four folds.

For each such training set, we perform another five-fold cross validation and choose the best parameters by CV accuracy. We try all $(\log_2 C, \log_2 a, r)$ in the region $[-3, 0, \ldots, 12] \times [-12, -9, \ldots, 3] \times [-2.4, -1.8, \ldots, 2.4]$. The average testing accuracy is then reported in Table 1. Note that for the parameter selection, the RBF kernel $e^{-a\|x_i - x_j\|^2}$ does not involve $r$.
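A sketch of this two-level procedure is given below, using scikit-learn's LIBSVM-based SVC (kernel='sigmoid', where gamma plays the role of a and coef0 the role of r) rather than calling LIBSVM directly; the helper name and random seed are illustrative, and the grid follows the text.

import numpy as np
from itertools import product
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def two_level_cv(X, y, seed=0):
    """Two-level (nested) cross validation for the sigmoid kernel.

    X, y are numpy arrays.  Grid: log2(C) in {-3, 0, ..., 12},
    log2(a) in {-12, -9, ..., 3}, r in {-2.4, -1.8, ..., 2.4}.
    """
    grid = list(product(2.0 ** np.arange(-3, 13, 3),        # C
                        2.0 ** np.arange(-12, 4, 3),         # a  (gamma in SVC)
                        np.arange(-2.4, 2.41, 0.6)))         # r  (coef0 in SVC)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    test_acc = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner five-fold CV selects (C, a, r) using the training part only
        best = max(grid, key=lambda p: cross_val_score(
            SVC(C=p[0], kernel='sigmoid', gamma=p[1], coef0=p[2]),
            X_tr, y_tr, cv=5).mean())
        clf = SVC(C=best[0], kernel='sigmoid', gamma=best[1], coef0=best[2])
        clf.fit(X_tr, y_tr)
        test_acc.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(test_acc)

The outer loop gives the test accuracy reported in Table 1; only the inner five-fold scores are used for parameter selection, so the outer test folds never influence the chosen (C, a, r).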

The resulting accuracy is similar for all three approaches. The sigmoid kernel seems to work well in practice, but it is not better than the RBF kernel. Since the RBF kernel has the advantages of being PD and having fewer parameters, there seems to be no strong reason to use the sigmoid kernel.

KLR with the sigmoid kernel is competitive with SVM, and a nice property is that it solves a convex problem. However, without sparsity, the training and testing time for KLR is much longer. Moreover, CG is worse than NT for KLR. These observations are clearly shown in Table 2. The experiments were run on Pentium IV 2.8 GHz machines with 1024 MB RAM.

Optimized linear algebra subroutines (Whaley, Petitet, and Dongarra 2000) are linked to reduce the computational time for KLR solvers. The time is measured in CPU seconds.

The number of support vectors (#SV) and the training/testing time are averaged over the results of the first level of the five-fold CV. This means that the maximum possible #SV here is 4/5 of the original data size (for example, 216 = (4/5) × 270 for heart), and we can see from Table 2 that the KLR models are completely dense, attaining this maximum.

Table 1: Comparison of test accuracy

                                  RBF kernel e^{-a||x_i-x_j||^2}   sigmoid kernel tanh(a x_i^T x_j + r)
data set   #data   #attributes    SVM                              SVM       KLR-NT    KLR-CG
heart        270        13        83.0%                            83.0%     83.7%     83.7%
german      1000        24        76.6%                            76.1%     75.6%     75.6%
diabete      768         8        77.6%                            77.3%     77.1%     76.7%
a1a         1605       123        83.6%                            83.1%     83.7%     83.8%

Table 2: Comparison of time usage for the sigmoid kernel tanh(a x_i^T x_j + r)

                   #SV                           training/testing time (CPU seconds)
data set   SVM      KLR-NT    KLR-CG       SVM          KLR-NT       KLR-CG
heart      115.2    216       216          0.02/0.01    0.12/0.02    0.45/0.02
german     430.2    800       800          0.51/0.07    5.76/0.10    73.3/0.11
diabete    338.4    614.4     614.4        0.09/0.03    2.25/0.04    31.7/0.05
a1a        492      1284      1284         0.39/0.08    46.7/0.25    80.3/0.19

8 Discussions

From the results in Sections 3 and 5, we clearly see the importance of the CPD-ness, which is directly related to the linear constraint yTα = 0. We suspect that for many non-PSD kernels used so far, their viability relies on this constraint as well as on the inequality constraints 0 ≤ αi ≤ C, i = 1, . . . , l, of the dual problem. It is known that some non-PSD kernels are not CPD. For example, the tangent distance kernel matrix in (Haasdonk and Keysers 2002) may contain more than one negative eigenvalue, a property that indicates the matrix is not CPD. Further investigation of such non-PSD kernels and of the effect of the inequality constraints 0 ≤ αi ≤ C will be an interesting research direction.
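The claim that more than one negative eigenvalue rules out CPD-ness can be checked directly; a minimal sketch follows, assuming the standard definition of CPD (v^T K v >= 0 for every v with sum_i v_i = 0), with is_cpd and tol as illustrative names.

import numpy as np

def is_cpd(K, tol=1e-10):
    """Check conditional positive definiteness of a symmetric matrix K.

    K is CPD iff v^T K v >= 0 for all v with e^T v = 0, which is equivalent
    to P K P being positive semidefinite for the projector P = I - e e^T / l.
    """
    l = K.shape[0]
    P = np.eye(l) - np.ones((l, l)) / l       # projector onto {v : e^T v = 0}
    eig = np.linalg.eigvalsh(P @ K @ P)       # K symmetric, so eigvalsh applies
    return eig.min() >= -tol

A matrix with two or more negative eigenvalues always fails this test, since the subspace {v : sum_i v_i = 0} has dimension l - 1 and must intersect the span of the corresponding eigenvectors.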

Even though the CPD-ness of the sigmoid kernel for certain parameters explains its practical viability, for other parameters the quality of the local minimum obtained may not be guaranteed. This makes it hard to select suitable parameters for the sigmoid kernel. Thus, in general we do not recommend the use of the sigmoid kernel.

In addition, our analysis indicates that for certain parameters the sigmoid kernel behaves like the RBF kernel, and experiments show that their performance is similar.

Therefore, together with the result in (Keerthi and Lin 2003) that the linear kernel is essentially a special case of the RBF kernel, among existing kernels the RBF kernel should be the first choice for general users.

Acknowledgments

This work was supported in part by the National Science Council of Taiwan via the grant NSC 90-2213-E-002-111. We thank users of LIBSVM (in particular, Carl Staelin), who somewhat forced us to study this issue. We also thank Bernhard Schölkopf and Bernard Haasdonk for some helpful discussions.

References

Berg, C., J. P. R. Christensen, and P. Ressel (1984). Harmonic Analysis on Semigroups. New York: Springer-Verlag.

Blake, C. L. and C. J. Merz (1998). UCI repository of machine learning databases. Technical report, University of California, Department of Information and Computer Science, Irvine, CA. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.

Boser, B., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory.

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121-167.

Burges, C. J. C. (1999). Geometry and invariance in kernel based methods. In B. Schölkopf, C. Burges, and A. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, pp. 89-116. MIT Press.

Chang, C.-C. and C.-J. Lin (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cortes, C. and V. Vapnik (1995). Support-vector network. Machine Learning 20, 273-297.

DeCoste, D. and B. Schölkopf (2002). Training invariant support vector machines. Machine Learning 46, 161-190.

Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys 23(1), 5-48.

Haasdonk, B. and D. Keysers (2002). Tangent distance kernels for support vector machines. In Proceedings of the 16th ICPR, pp. 864-868.

Hsu, C.-W. and C.-J. Lin (2002). A simple decomposition method for support vector machines. Machine Learning 46, 291-314.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA. MIT Press.

Keerthi, S. S. and C.-J. Lin (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15(7), 1667-1689.

Keerthi, S. S., S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13, 637-649.

Lin, C.-J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks 12(6), 1288-1298.

Lin, C.-J. (2002). Asymptotic convergence of an SMO algorithm without any assumptions. IEEE Transactions on Neural Networks 13(1), 248-250.

Lin, C.-J. and J. J. Moré (1999). Newton's method for large-scale bound constrained problems. SIAM Journal on Optimization 9, 1100-1127.

Lin, K.-M. and C.-J. Lin (2003). A study on reduced support vector machines. IEEE Transactions on Neural Networks 14(6), 1449-1559.

Micchelli, C. A. (1986). Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation 2, 11-22.

Michie, D., D. J. Spiegelhalter, and C. C. Taylor (1994). Machine Learning, Neural and Statistical Classification. Englewood Cliffs, N.J.: Prentice Hall. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

Nash, S. G. and A. Sofer (1996). Linear and Nonlinear Programming. McGraw-Hill.

Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, pp. 130-136. IEEE.

Osuna, E. and F. Girosi (1998). Reducing the run-time complexity of support vector machines. In Proceedings of International Conference on Pattern Recognition.

Palagi, L. and M. Sciandrone (2002). On the convergence of a modified version of the SVMlight algorithm. Technical Report IASI-CNR 567.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, Cambridge, MA. MIT Press.

Sarle, W. S. (1997). Neural Network FAQ. Periodic posting to the Usenet newsgroup comp.ai.neural-nets.

Schölkopf, B. (1997). Support Vector Learning. Ph.D. thesis.

Schölkopf, B. (2000). The kernel trick for distances. In NIPS, pp. 301-307.

Schölkopf, B. and A. J. Smola (2002). Learning with Kernels. MIT Press.

Sellathurai, M. and S. Haykin (1999). The separability theory of hyperbolic tangent kernels and support vector machines for pattern classification. In Proceedings of ICASSP99.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.

Wahba, G. (1998). Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In B. Schölkopf, C. J. C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, pp. 69-88. MIT Press.

Whaley, R. C., A. Petitet, and J. J. Dongarra (2000). Automatically tuned linear algebra software and the ATLAS project. Technical report, Department of Computer Sciences, University of Tennessee.

A Proof of Theorem 8

The proof of Theorem 8 contains three parts: the convergence of the optimal solution, the convergence of the decision value without the bias term, and the convergence of the bias term. Before entering the proof, we first note that (17) has a PD kernel under our assumption that $x_i \ne x_j$ for all $i \ne j$. Therefore, the optimal solution $\hat\alpha$ of (17) is unique. From now on we denote by $\hat\alpha^r$ a local optimal solution of (2) and by $b^r$ the associated optimal $b$ value. For (17), $b$ denotes its optimal bias term.

1. The convergence of optimal solution:

$$
\lim_{r\to-\infty} \theta_r \hat\alpha^r = \hat\alpha, \quad \text{where } \theta_r \equiv 1 + \tanh(r). \qquad (34)
$$

Proof.

By the equivalence between (2) and (16), $\theta_r \hat\alpha^r$ is the optimal solution of (16). The convergence to $\hat\alpha$ comes from (Keerthi and Lin 2003, Lemma 2), since $\bar Q$ is PD and the kernel of (16) approaches $\bar Q$ by Lemma 1. □

2. The convergence of the decision value without the bias term: For any $x$,

$$
\begin{aligned}
\lim_{r\to-\infty} \sum_{i=1}^{l} y_i \hat\alpha_i^r \tanh(a x_i^T x + r)
&= \lim_{r\to-\infty} \sum_{i=1}^{l} y_i \hat\alpha_i^r \bigl(\tanh(a x_i^T x + r) + 1\bigr) \qquad (36)\\
&= \lim_{r\to-\infty} \sum_{i=1}^{l} y_i (\theta_r \hat\alpha_i^r)\,\frac{\tanh(a x_i^T x + r) + 1}{\theta_r}
 = \sum_{i=1}^{l} y_i \hat\alpha_i \bar K(x_i, x), \qquad (37)
\end{aligned}
$$

where $\bar K$ denotes the kernel of (17), the limit kernel given by Lemma 1. (36) comes from the equality constraint in (2) and (37) comes from (34) and Lemma 1. □

3. The convergence of the bias term:

$$
\lim_{r\to-\infty} b^r = b. \qquad (38)
$$

Proof.

By the KKT condition that $b^r$ must satisfy,

$$
\max_{i \in I_{up}(\hat\alpha^r, C)} -y_i \nabla F(\hat\alpha^r)_i \;\le\; b^r \;\le\; \min_{i \in I_{low}(\hat\alpha^r, C)} -y_i \nabla F(\hat\alpha^r)_i,
$$

where $I_{up}$ and $I_{low}$ are defined in (21). In addition, because $b$ is unique,

$$
\max_{i \in I_{up}(\hat\alpha, \tilde C)} -y_i \nabla F_T(\hat\alpha)_i \;=\; b \;=\; \min_{i \in I_{low}(\hat\alpha, \tilde C)} -y_i \nabla F_T(\hat\alpha)_i.
$$

Note that the equivalence between (2) and (16) implies $\nabla F(\hat\alpha^r)_i = \nabla F_r(\theta_r \hat\alpha^r)_i$. Thus,

$$
\max_{i \in I_{up}(\theta_r \hat\alpha^r, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i \;\le\; b^r \;\le\; \min_{i \in I_{low}(\theta_r \hat\alpha^r, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i.
$$

By the convergence of $\theta_r \hat\alpha^r$ as $r \to -\infty$, once $r$ is small enough, every index $i$ with $\hat\alpha_i < \tilde C$ also satisfies $\theta_r \hat\alpha_i^r < \tilde C$. That is, $I_{up}(\hat\alpha, \tilde C) \subseteq I_{up}(\theta_r \hat\alpha^r, \tilde C)$. Therefore, when $r$ is small enough,

$$
\max_{i \in I_{up}(\hat\alpha, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i \;\le\; \max_{i \in I_{up}(\theta_r \hat\alpha^r, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i.
$$

Similarly,

$$
\min_{i \in I_{low}(\hat\alpha, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i \;\ge\; \min_{i \in I_{low}(\theta_r \hat\alpha^r, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i.
$$

Thus, for $r < 0$ small enough,

$$
\max_{i \in I_{up}(\hat\alpha, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i \;\le\; b^r \;\le\; \min_{i \in I_{low}(\hat\alpha, \tilde C)} -y_i \nabla F_r(\theta_r \hat\alpha^r)_i.
$$

Taking $\lim_{r\to-\infty}$ on both sides and using Lemma 1 and (34),

$$
\lim_{r\to-\infty} b^r = \max_{i \in I_{up}(\hat\alpha, \tilde C)} -y_i \nabla F_T(\hat\alpha)_i = \min_{i \in I_{low}(\hat\alpha, \tilde C)} -y_i \nabla F_T(\hat\alpha)_i = b. \qquad (39)
$$

□

Therefore, with (37) and (39), our proof of Theorem 8 is complete.
