
V. Experiments on RSVM

5.3 Some modifications on RSVM and their performances

In this section, two types of modifications are applied to the original RSVM formulation. The first concerns the regularization term. Recall from Chapter III that, following the generalized SVM, the authors of [21] replace $\frac{1}{2}\bar{\alpha}^T Q_{RR}\bar{\alpha}$ in (3.9) by $\frac{1}{2}\bar{\alpha}^T\bar{\alpha}$ and solve (3.10). So far we have only seen that, for the LSVM implementation, not doing so causes difficulties in obtaining and using the dual problem.

For SSVM and LS-SVM, $\frac{1}{2}\bar{\alpha}^T Q_{RR}\bar{\alpha}$ can be kept and the same methods can still be applied. Since this change loses the margin-maximization property, we are interested in whether the performance (testing accuracy) is worsened or not.

Without changing the $\frac{1}{2}\bar{\alpha}^T Q_{RR}\bar{\alpha}$ term in (4.8), the LS-SVM formulation retains the original regularization, so we solve a different linear system:

$$\Big(\tilde{Q}^T\tilde{Q} + \frac{1}{2C}Q_{RR}\Big)\tilde{\alpha} = \tilde{Q}^T e. \qquad (5.2)$$

(4.9) is a positive definite linear system because $I/2C$ is positive definite. However, (5.2) does not share this property: in some cases $Q_{RR}/2C$ is only positive semi-definite. An example when using the RBF kernel is that some training data are at the same point (e.g. dna in our experiments). Therefore, the Cholesky factorization we used to solve (4.9) can fail on (5.2), so LU factorization is used instead. According to the time complexity analysis in Section 3.2, this change affects the training time very little. A comparison of the testing accuracy between solving (4.9) and (5.2) is given in Table 5.5. Their accuracy is very similar, so we conclude that using the simpler quadratic term in RSVM is basically fine.
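As an illustration of the two linear systems, the following is a minimal Python/NumPy sketch (our own, not the implementation used in the experiments) that solves a reduced LS-SVM system either in the (4.9) style with a Cholesky factorization or in the (5.2) style with an LU factorization. The helper names (rbf_kernel, reduced_lssvm) and the exact way the labels and the bias term enter $\tilde{Q}$ are assumptions; they only approximate the notation of (4.8)-(5.2).

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

def rbf_kernel(X1, X2, gamma):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    d2 = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * d2)

def reduced_lssvm(X, y, sv_idx, C, gamma, keep_QRR=False):
    """Solve the reduced LS-SVM linear system (bias term omitted for brevity).

    keep_QRR=False: (Q~^T Q~ + I/(2C)) a = Q~^T e, solved by Cholesky, as in (4.9).
    keep_QRR=True : (Q~^T Q~ + Q_RR/(2C)) a = Q~^T e, solved by LU, as in (5.2),
                    since Q_RR may be only positive semi-definite.
    """
    # Q~: l x m rectangular kernel block with the label signs absorbed (an assumption).
    Q_tilde = y[:, None] * rbf_kernel(X, X[sv_idx], gamma)
    Q_RR = rbf_kernel(X[sv_idx], X[sv_idx], gamma)
    e = np.ones(len(y))
    rhs = Q_tilde.T @ e
    if keep_QRR:
        A = Q_tilde.T @ Q_tilde + Q_RR / (2.0 * C)
        return lu_solve(lu_factor(A), rhs)    # LU works even if the shift is only PSD
    A = Q_tilde.T @ Q_tilde + np.eye(len(sv_idx)) / (2.0 * C)
    return cho_solve(cho_factor(A), rhs)      # here A is positive definite

# Example usage with random data and a random 10% subset as "support vectors".
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(rng.normal(size=200))
sv_idx = rng.choice(200, size=20, replace=False)
alpha_bar = reduced_lssvm(X, y, sv_idx, C=1.0, gamma=0.5, keep_QRR=True)
```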

The second modification is changing the random selection of support vectors.

Table 5.5: A comparison on modified versions of RSVM: testing accuracy

                LS-SVM                                                      Decomposition
                (4.9)        (5.2)        ΣK:,i+(4.9)   ICF+(4.9)           (4.15)       loosen+(4.15)
Problem         C, γ   rate  C, γ   rate  C, γ   rate   C, γ   rate         C, γ   rate  C, γ   rate

The random selection is efficient: it costs O(lmn) time, where n is the number of attributes of a training vector. However, we suspect that a more careful selection might improve the testing accuracy. We try several heuristics to select a subset of training data which we consider more important as support vectors, and then use the LS-SVM implementation to solve the reduced problem. They are listed as follows:

1. For each column of K, we calculate the sum of all its entries, and take the m vectors corresponding to the columns with the largest sums. Since the RBF kernel is used, all entries are positive, and so are the sums; we think columns with larger sums might be more important. The main work of this strategy is obtaining all entries of K, which costs O(l²n) time. This selection strategy is denoted as "ΣK:,i" in Table 5.5; a small sketch of this heuristic is given right after this list.

2. Conduct incomplete Cholesky factorization with symmetric pivoting on K (detailed in Chapter VI), and use the first m pivoted indices to select the support vectors. This is a by-product of a factorization process for K. We use an algorithm that performs incomplete Cholesky factorization in O(lm²) time, and denote this as the "ICF" strategy in Table 5.5.

3. Train the original SVM by a conventional decomposition method with a very loose stopping tolerance (here BSVM with ε = 0.1 is used). BSVM solves a problem of the form (4.15), and the dual variables equal to C are called bounded dual variables. Since they are nonzero, the corresponding training data are all support vectors; moreover, a notable property is that all misclassified training data are among these support vectors. For RSVM, where only a few support vectors are selected, selecting all these bounded support vectors severely affects RSVM and yields a bad model. Thus, we mainly select the free support vectors of BSVM (those whose dual variables are not equal to C). If there are fewer than m free support vectors, we randomly select other training data, including bounded support vectors. We find that loosening the stopping tolerance does not reduce the training time of BSVM by much, so overall this strategy is slower than BSVM with the default tolerance. However, it lets us see whether an approximate solution of a decomposition method can help to select the support vectors of RSVM. We denote it as the "loosen" strategy in Table 5.5.
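As referenced in item 1, here is a minimal Python/NumPy sketch of the "ΣK:,i" heuristic next to the plain random selection. The helper names (rbf_kernel, select_by_column_sum, select_randomly) are our own, and a real implementation would compute K block-wise rather than storing all l² entries at once.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    d2 = (np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * d2)

def select_by_column_sum(X, m, gamma):
    """Pick the m columns of K with the largest column sums (O(l^2 n) work)."""
    K = rbf_kernel(X, X, gamma)          # full l x l kernel matrix
    col_sums = K.sum(axis=0)             # all entries are positive for the RBF kernel
    return np.argsort(-col_sums)[:m]     # indices of the m largest column sums

def select_randomly(l, m, rng):
    """The original RSVM selection: a uniform random subset of size m."""
    return rng.choice(l, size=m, replace=False)
```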

The results are also shown in Table 5.5. For some problems, the "ICF" (pivoting-based) and "loosen" strategies improve the testing accuracy only very slightly. "ΣK:,i" in general does not improve the accuracy, and for shuttle the result is poor. Compared with Table 5.2, none of these improvements make RSVM competitive with the original SVM, so we suggest the original simple and fast random selection.

VI. ICF Approximated Kernel Representation for SVM

As we have mentioned before, the difficulty in solving the quadratic programming problem (2.7) is the dense matrix Q, which is so large that we can neither store it in memory nor efficiently compute its inverse. RSVM circumvents this problem by modifying the problem so that only some columns of Q are used during the optimization.

This is an example of approximating the large dense Q by a sub-matrix, and other approximations exist. For example, a greedy approximation [43] approximates Q by an l × m basis matrix whose columns are linear combinations of some columns of Q. Such algorithms, like RSVM, involve random selection as well. Here, we study another technique, which approximates Q by its incomplete Cholesky factorization (ICF).

This idea was proposed in [11, Section 4], and in this chapter we will detail the concept and implementation for this kind of reduction, and compare it with RSVM.

6.1 Incomplete Cholesky factorization (ICF)

Linear systems can be solved by two popular schemes: direct methods, such as Gaussian elimination, and iterative methods, which approximate the solution by repeatedly solving simpler problems. The conjugate gradient method (CG) is one major iterative approach, but a problem of CG is that it sometimes converges very slowly. A modern approach suggested in the survey [10] is the preconditioned CG method. When solving the linear system $Qx = b$ for x, the first step of this approach is to obtain a preconditioner V from Q. Then we use CG to solve the system $(V^{-1}QV^{-T})y = V^{-1}b$ for y, and finally solve $V^T x = y$ for x. If the preconditioner V is chosen well, CG converges fast and the linear system is solved efficiently. If Q is symmetric positive definite, a perfect choice is the Cholesky factorization satisfying $Q = VV^T$, which transforms Q to the identity matrix. However, obtaining the Cholesky factorization is usually as hard as solving the linear system.
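As a small illustration of the preconditioned CG mechanics just described (our own sketch, not an implementation from [10]), the following Python/SciPy code uses a triangular factor V as a preconditioner. SciPy applies the preconditioner M as an approximate inverse of Q, which is algebraically equivalent to running CG on the split-preconditioned system $(V^{-1}QV^{-T})y = V^{-1}b$; here V is simply the exact Cholesky factor of a perturbed Q, standing in for an incomplete factor.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n = 300
A = rng.normal(size=(n, n))
Q = A @ A.T + n * np.eye(n)          # symmetric positive definite test matrix
b = rng.normal(size=n)

# A stand-in "incomplete" factor: the exact Cholesky factor of a perturbed Q.
V = cholesky(Q + 0.1 * np.eye(n), lower=True)

def apply_preconditioner(r):
    # Solve (V V^T) z = r with two triangular solves: V u = r, then V^T z = u.
    u = solve_triangular(V, r, lower=True)
    return solve_triangular(V.T, u, lower=False)

M = LinearOperator((n, n), matvec=apply_preconditioner)

def run_cg(M=None):
    count = [0]
    cb = lambda xk: count.__setitem__(0, count[0] + 1)   # count CG iterations
    x, info = cg(Q, b, M=M, maxiter=1000, callback=cb)
    return x, count[0]

x_plain, it_plain = run_cg()
x_prec, it_prec = run_cg(M=M)
print("CG iterations without / with preconditioner:", it_plain, it_prec)
```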

We then search for a V which can be obtained more easily and such that $VV^T$ approximates Q. Such a matrix V is called an incomplete Cholesky factorization (ICF). ICF has been a general way of obtaining preconditioners, especially for solving sparse linear systems. In this chapter, however, we focus on the use of ICF for handling problems involving dense matrices.

Currently most ICF methods are modifications of traditional Cholesky factorization algorithms. The modifications may be, for example, discarding entries of Q that are too small, or using fast but inaccurate operations in the ICF algorithm. Based on different orders of loops and indexes, there are several versions of the Cholesky factorization [33, Appendix]. However, since we do not store the dense matrix but calculate each element during the factorization, only some special forms are usable. For example, we consider the popular outer product form [13, Section 10.3.2], which is based on the following formula:

$$\begin{pmatrix} \alpha & v^T \\ v & B \end{pmatrix}
   = \begin{pmatrix} \sqrt{\alpha} & 0 \\ v/\sqrt{\alpha} & I \end{pmatrix}
     \begin{pmatrix} 1 & 0 \\ 0 & B - vv^T/\alpha \end{pmatrix}
     \begin{pmatrix} \sqrt{\alpha} & v^T/\sqrt{\alpha} \\ 0 & I \end{pmatrix}. \qquad (6.1)$$

In the beginning, the left-hand side is the whole matrix Q. The vector $\big(\sqrt{\alpha},\; v/\sqrt{\alpha}\big)^T$ is the leftmost column of V, and the same procedure is recursively applied to the sub-matrix $B - vv^T/\alpha$.

In each iteration of the Cholesky factorization procedure, since we are not able to store B (it can be as large as Q), $B - vv^T/\alpha$ is not stored either. This means that we must delay the operation of subtracting $vv^T/\alpha$ until the respective column is considered. In other words, if v is the jth column, the elements of the jth column of Q are not computed until the jth step. The jth column of V is then computed using the 1st to (j − 1)st columns of V.

For the ICF, each column of V can also be computed by this strategy. In the next section we consider two such ICF methods.

6.2 ICF algorithms

The first method discussed here is proposed in [24]. It examines all entries of Q, but in each step only the largest m elements of v are stored as a column of V. When later v's are computed, the previously stored columns of V are used to update them. [24] showed that in practice the time complexity of this algorithm is O(lm²).

The memory requirement is O(lm), so it can be pre-allocated conveniently; this situation is similar to that of RSVM. Since v is not wholly stored as a column of V, the updating is not exact. Therefore, after a number of steps, negative diagonals may occur; that is, in (6.1) the entry α becomes negative. The process cannot continue, since we must take the square root of α. When this algorithm encounters such a problem, it adds βI to Q and restarts itself. If unfortunately the same problem occurs again, β is iteratively increased. We observe that for some problems β becomes so large that the resulting Q is dominated by βI. Then the ICF may not be a good approximation of the original Q.


Algorithm VI.1: Column-wise ICF with symmetric pivoting

The second algorithm is proposed in [11, Section 4], which is an early stopping version of the Cholesky factorization. It stops generating the columns of the Cholesky factorization when the trace of the remaining submatrix B is smaller than a stopping tolerance. In each step, the updating procedure is exactly the same as the Cholesky factorization, so it inherits the good property that negative diagonals never occur.

Moreover, while [24] does not perform pivoting, [11] implements symmetric pivoting to improve numerical stability. This is done by swapping the column and row having the largest remaining diagonal with the current column and row. Here, instead of checking the trace of the remaining submatrix, we stop the algorithm after m columns are generated. This also means that only O(lm) entries of Q are examined, the same as for RSVM. We give the modified procedure in Algorithm VI.1. Each iteration consists of symmetric pivoting and one column update, and both cost O(lm) time, so the total time complexity is O(lm²).
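The following Python/NumPy sketch shows one possible column-wise ICF with symmetric pivoting in the spirit of Algorithm VI.1; it is our own illustrative version, not a transcription of Algorithm VI.1 or of [11]. Kernel entries are computed on demand, so only O(lm) entries of Q are touched.

```python
import numpy as np

def icf_symmetric_pivoting(kernel_entry, l, m):
    """Compute V (l x m) with V V^T approximating Q, evaluating entries on demand.

    kernel_entry(i, j) must return Q[i, j]. Rows of V are stored in pivot order;
    perm records that order, and its first m indices could also serve as RSVM
    support vectors. V[np.argsort(perm)] restores the original row order.
    """
    V = np.zeros((l, m))
    perm = np.arange(l)
    # Remaining diagonal of the implicitly updated matrix (in permuted order).
    d = np.array([kernel_entry(i, i) for i in range(l)], dtype=float)
    for j in range(m):
        # Symmetric pivoting: bring the largest remaining diagonal to position j.
        p = j + int(np.argmax(d[j:]))
        perm[[j, p]] = perm[[p, j]]
        d[[j, p]] = d[[p, j]]
        V[[j, p], :] = V[[p, j], :]
        pivot = np.sqrt(d[j])          # stays positive for a positive definite Q
        V[j, j] = pivot
        # One column update: Q[perm[i], perm[j]] minus earlier contributions.
        q_col = np.array([kernel_entry(perm[i], perm[j]) for i in range(j + 1, l)])
        V[j + 1:, j] = (q_col - V[j + 1:, :j] @ V[j, :j]) / pivot
        d[j + 1:] -= V[j + 1:, j] ** 2
    return V, perm

# Example with an RBF kernel computed entry by entry.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
gamma = 0.1
kfun = lambda i, j: np.exp(-gamma * np.sum((X[i] - X[j]) ** 2))
V, perm = icf_symmetric_pivoting(kfun, l=500, m=50)
```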

6.3 Using ICF in SVM

In [11] the authors proposed using an interior point method (IPM) to solve the SVM dual problem (4.16) with a low-rank kernel matrix Q. The need for a low-rank kernel matrix comes from the most time-consuming operation in every step of the IPM: calculating $(D + Q)^{-1}u$, where D is a diagonal matrix and u is some vector. If we already have the low-rank representation $Q = VV^T$, where V is an l × m matrix with m ≪ l, this calculation is efficient. We have in fact introduced one approach, the SMW identity (4.3), which can invert $(D + VV^T)$ in O(lm²) time. Another approach with the same order of time is detailed in [11, Section 3.2].
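To recall how the SMW identity gives this O(lm²) cost, here is a short NumPy sketch (our own, with generic names) that applies $(D + VV^T)^{-1}$ to a vector without ever forming the l × l matrix.

```python
import numpy as np

def smw_solve(d, V, u):
    """Return (D + V V^T)^{-1} u via the Sherman-Morrison-Woodbury identity.

    d: the diagonal of D (length l), V: l x m, u: length l.
    Cost is O(l m^2) plus an m x m solve; the l x l matrix is never formed.
    """
    Dinv_u = u / d                                  # D^{-1} u
    Dinv_V = V / d[:, None]                         # D^{-1} V
    small = np.eye(V.shape[1]) + V.T @ Dinv_V       # I + V^T D^{-1} V  (m x m)
    correction = Dinv_V @ np.linalg.solve(small, V.T @ Dinv_u)
    return Dinv_u - correction

# Quick check against the direct dense computation on a small example.
rng = np.random.default_rng(0)
l, m = 200, 10
d = rng.uniform(1.0, 2.0, size=l)
V = rng.normal(size=(l, m))
u = rng.normal(size=l)
direct = np.linalg.solve(np.diag(d) + V @ V.T, u)
assert np.allclose(smw_solve(d, V, u), direct)
```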

We observe that if the linear kernel is used, for example $Q = \mathrm{diag}(y)XX^T\mathrm{diag}(y)$ in (4.16), then $V = \mathrm{diag}(y)X$ is a low-rank matrix as long as X has more rows than columns, that is, the number of training data vectors l exceeds the number of attributes n. However, in other situations, such as using the RBF kernel, Q has full rank, so an exact low-rank representation of Q does not exist. This problem is overcome in [11, Section 4] by introducing the ICF approximation: a low-rank ICF $VV^T$ is obtained, and then $\tilde{Q} = VV^T$ substitutes for Q in the IPM procedure.

The resulting optimal dual solution $\hat{\alpha}$ is considered an approximate solution of the original dual problem and is used to construct $w = \sum_{i=1}^{l} y_i\hat{\alpha}_i\phi(x_i)$. The decision function is then the same as that of the original SVM (compare (3.4) and (2.8)). Note that every $x_i$ corresponding to a nonzero $\hat{\alpha}_i$ is saved as a support vector. This is different from RSVM, which fixes the support vectors in advance. Consequently, for the ICF approximation method it is interesting to see whether the support vectors will be sparse or not.
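For concreteness, prediction with such an approximate dual solution could look like the sketch below (our own illustration; the bias term b and the exact decision function follow (3.4) and (2.8), and here b is simply assumed to be given).

```python
import numpy as np

def decision_values(alpha_hat, y, X_train, X_test, b, gamma):
    """f(x) = sum_i y_i * alpha_hat_i * K(x_i, x) + b, keeping only nonzero alpha_hat_i."""
    sv = np.nonzero(alpha_hat > 1e-12)[0]          # indices of the support vectors
    d2 = (np.sum(X_test**2, 1)[:, None] + np.sum(X_train[sv]**2, 1)[None, :]
          - 2.0 * X_test @ X_train[sv].T)
    K = np.exp(-gamma * d2)                        # RBF kernel between test points and SVs
    return K @ (y[sv] * alpha_hat[sv]) + b

def predict(alpha_hat, y, X_train, X_test, b, gamma):
    return np.sign(decision_values(alpha_hat, y, X_train, X_test, b, gamma))
```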

Here we show that, besides the IPM, other methods that focus on solving linear SVMs can also solve the SVM problem with an ICF kernel representation. We replace the kernel matrix Q in the dual problem (4.15) by its ICF $VV^T$:

$$\min_{\hat{\alpha}}\quad \frac{1}{2}\hat{\alpha}^T(VV^T + yy^T)\hat{\alpha} - e^T\hat{\alpha}$$
$$\text{subject to}\quad 0 \le \hat{\alpha}_i \le C,\ i = 1,\dots,l. \qquad (6.2)$$

We observe that (6.2) is already in the dual form of a linear SVM with more data points than attributes. Therefore, all methods suitable for the dual problem of linear SVM can be applied without modification.

6.4 Implementations

We have described several techniques for solving linear SVM problems in Chapter IV. The solvers focusing on the dual form (LSVM and decomposition) can be immediately applied to (6.2). Here we construct the two ICFs described in Section 6.2 and then solve the linear SVM problem (6.2) by the decomposition method. We denote both by ICFSVM in our experiments. The total time complexity of ICFSVM is O(lm²) + #iterations × O(lm).

We also want to see whether we can use the implementations for the linear SVM in the primal form. Clearly the primal form of (6.2), denoted (6.3), can be considered as a linear SVM. Although solving (6.3) is as easy as solving a linear SVM problem, we have difficulties in interpreting its solution. (6.3) is a linear SVM problem with "virtual" data points, namely the rows of V. The solution $\tilde{\alpha}$ is the normal vector of the hyperplane separating these "virtual" data points. That is, for the optimal $\hat{\alpha}$ we have $\tilde{\alpha} = \sum_i \hat{\alpha}_i V_{i,:}$, so $\hat{\alpha}_i$ actually represents the weight of $V_{i,:}$, which is very different from $\phi(x_i)$. Thus, in the prediction stage, since we do not know how to transform the test data into the "virtual" space, the separating hyperplane is not usable. To conclude, we find it impossible to make use of the solution of the primal problem (6.3). Therefore, we do not apply the implementations (SSVM, LS-SVM) that solve the primal form of SVM.

An inference from this observation concerns a concluding remark made in [11], which believes that the set of nonzero dual variables (support vectors) obtained by ICFSVM is (almost) the same as the set obtained by solving the standard SVM dual form. From the earlier discussion, we doubt this assertion: the nonzero dual variables of (6.2) are only meaningful for constructing the separating hyperplane of the "virtual" data points in (6.3), and we can hardly believe that the actual w is well constructed as a linear combination of the $\phi(x_i)$ whose corresponding $\hat{\alpha}_i$ are nonzero. We conduct some experiments on this issue in the following section.

An additional implementation combines the ICF technique with RSVM. Some information from the ICF process can be used to select the support vectors for the RSVM procedure. Here we use the data corresponding to the m columns chosen in Algorithm VI.1 as the support vectors, and then follow the regular RSVM routines. Experimental results using the LS-SVM implementation have been shown in Table 5.5; the accuracy is similar to that of the other RSVM modifications discussed in Section 5.3.

6.5 Experiments and results

We use the same target problems as in the RSVM experiments, with the same settings mentioned in Chapter V. In particular, m here controls the size (number of columns) of the ICF matrix V, and it is selected in the same way as for RSVM: in most cases m is set to 10% of the training data. However, since the number of dual variables is still l, the number of support vectors is not fixed in advance as in RSVM; instead, it depends on the sparsity of the solution.

We compare five methods:

1. LIBSVM, the standard decomposition method for nonlinear SVM

2. Decomposition implementation for RSVM

3. ICFSVM using the first ICF algorithm (denoted as [24])

4. ICFSVM using the second ICF algorithm (denoted as (VI.1))

5. Re-training on the support vectors obtained by method 4.

To reduce the search space of parameter sets, we consider only the RBF kernel $K(x_i, x_j) \equiv e^{-\gamma\|x_i - x_j\|^2}$, so the parameters left to decide are the kernel parameter γ and the cost parameter C. For all methods, we conduct model selection on the training data, where the test data are assumed unknown. For each problem, we estimate the generalized accuracy using γ = [2⁴, 2³, 2², ..., 2⁻¹⁰] and C = [2¹², 2¹¹, 2¹⁰, ..., 2⁻²]; therefore, for each problem we try 15 × 15 = 225 combinations. For each pair (C, γ), the validation performance is measured by training on 70% of the training set and testing on the other 30%. The best parameter set is then used to construct the model for future testing. In addition, for multi-class problems, the C and γ of all k(k − 1)/2 binary problems in the one-against-one approach are taken to be the same.
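A minimal sketch of this model selection loop follows (our own illustration; train_and_validate is a hypothetical callback standing in for any of the five methods).

```python
import numpy as np

def grid_search(X, y, train_and_validate, rng):
    """Model selection over the (C, gamma) grid used in the experiments.

    train_and_validate(X_tr, y_tr, X_val, y_val, C, gamma) is a user-supplied
    callback returning the validation accuracy of the chosen method.
    """
    idx = rng.permutation(len(y))                   # 70%/30% split of the training set
    cut = int(0.7 * len(y))
    tr, val = idx[:cut], idx[cut:]
    gammas = [2.0**k for k in range(4, -11, -1)]    # 2^4, 2^3, ..., 2^-10 (15 values)
    Cs = [2.0**k for k in range(12, -3, -1)]        # 2^12, 2^11, ..., 2^-2 (15 values)
    best_acc, best_C, best_gamma = -1.0, None, None
    for C in Cs:
        for gamma in gammas:                        # 15 x 15 = 225 combinations per problem
            acc = train_and_validate(X[tr], y[tr], X[val], y[val], C, gamma)
            if acc > best_acc:
                best_acc, best_C, best_gamma = acc, C, gamma
    return best_C, best_gamma, best_acc
```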

Table 6.1 shows the testing accuracy of all methods. ICFSVM using the second ICF algorithm achieves the highest rate. However, its best rate is similar to that of RSVM, so for moderate-sized training sets ICFSVM is also not recommended. We also note that the optimal C parameter for ICFSVM is smaller than that of RSVM. This nice property helps the decomposition implementation, which is more stable for normal parameters, especially for moderately small C's. In addition, it is easier to apply different model selection techniques to ICFSVM than to RSVM: many of them work only in the region where C and γ are neither too small nor too large.

In Table 6.2 we report the number of "unique" support vectors at the optimal model of each method. We observe that the result of ICFSVM using the second algorithm is similar to that of the original SVM. Hence, we can say that such a method also retains the SVM property that support vectors are often sparse. The number of support vectors for RSVM is decided in advance, and thus does not resemble that of the other techniques.

We report the training time and the testing time for solving the optimal model in Table 6.3. Since ICFSVM has to perform the ICF, an extra step before solving (6.2), its training time may exceed that of RSVM, whose main work is solving (4.15), a QP similar to (6.2). Moreover, from the table we see that for some problems the ICF occupies most of the training time.

Table 6.1: A comparison on ICFSVM: testing accuracy

SVM    RSVM    ICFSVM (decomposition implementation)

N/A: training time too large to apply the model selection

Table 6.2: A comparison on ICFSVM: number of support vectors

                 SVM       RSVM            ICFSVM
Problem          LIBSVM    Decomposition   [24]      (VI.1)    (VI.1)+retrain
dna                 973       372            1688      1389      1588
satimage           1611      1826            4022      1187      1507
letter             8931     13928           12844      5390      8953
shuttle             285      4982           43026       308      3714
mnist              8333     12874             N/A      5295      5938
ijcnn1             4555       200           49485      4507      8731
protein           14770       596             N/A     15049     15512

N/A: training time too large to apply the model selection

Table 6.3: A comparison on ICFSVM: training time and ICF time (in seconds)

                 SVM        RSVM            ICFSVM
                 LIBSVM     Decomposition   [24]                  (VI.1)               (VI.1)+retrain
Problem          training   training        training   ICF       training   ICF       training   ICF
dna                  7.09       7.59          440.41    427.18       9.62      5.45      33.77      5.52
satimage            16.21      43.75          558.23    467.48      48.49     28.37      61.59     28.32
letter             230        446.04         3565.31   2857.95     484.59    222.4      635.41    221.93
shuttle            113        562.62        70207.76  13948.14    1251.17   1184.63    1811.6    1265.51
mnist             1265.67    1913.86             N/A       N/A    2585.13   2021.64    2565.08   1866.9
ijcnn1             492.53   16152.54         21059.3    4680.63    5463.8    103.97    1579.73    102.52
protein           1875.9      833.35             N/A       N/A     217.53     92.52    3462.57    110.54

N/A: training time too large to apply the model selection

However, for the best model the training time of ICFSVM may not exceed that of RSVM. This is because ICFSVM attains its best model at a smaller C, and both now use decomposition methods, which can be very slow when C is large (see the discussion in [15]). Thus, if proper model selection techniques are used, the overall training time of ICFSVM can be competitive with that of RSVM. In addition, if the model selection process requires a line search that fixes the other parameters and varies C, the ICF has to be performed only once; this may be another advantage of ICFSVM. On the other hand, we have mentioned that RSVM can be solved through both the primal and the dual forms, so in this respect RSVM is more flexible and efficient if only a single parameter set is considered.

The performance of re-training on the support vectors obtained by ICFSVM is generally worse than that of the original ICFSVM. Moreover, the optimal parameters for all problems are quite different. This indicates that ICFSVM actually finds another set of support vectors, and re-training on this set leads to a quite different model.

The result for the first ICF method is even stranger. Although for some problems it is comparable with the other methods, for others the performance is extremely bad, which shows that the trained model is quite far from the original one.
