
V. Experiments on RSVM

5.1 Problems and settings

We choose large multi-class datasets from the Statlog collection: dna, satimage, letter, and shuttle [30]. We also consider mnist [19], an important benchmark for handwritten digit recognition. The problem ijcnn1 is from the first problem of IJCNN challenge 2001 [38]. Note that we use the winner’s transformation of raw data [5].

The last one, protein, is a data set for protein secondary structure prediction [49].

Except for dna, ijcnn1, and protein, whose data values are already in a small range, we scale all training data to be in [−1, 1]. Test data, available for all problems, are then adjusted to [−1, 1] accordingly. Note that for mnist, training takes too long if all 60,000 training samples are used, so we pool the training and testing data (70,000 samples in total), use the first 30% for training, and test on the remaining 70%. Also note that satimage has one missing class: in the original application there is one more class, but no examples in the data set belong to it. Problem statistics are given in Table 5.1. Some problems with a large number of attributes may be very sparse; for example, each instance of protein has only 17 of its 357 attributes nonzero.
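The scaling step can be sketched as follows (hypothetical helper names): per-attribute ranges are computed from the training data only, and the same affine map is then applied to the test data.

```python
def fit_minmax(train):
    """Per-attribute min/max, computed from the training data only."""
    cols = list(zip(*train))
    return [min(c) for c in cols], [max(c) for c in cols]

def to_unit_range(x, lo, hi):
    """Map each attribute into [-1, 1] using the training ranges.
    Test values outside the training range may fall outside [-1, 1]."""
    return [0.0 if h == l else 2.0 * (v - l) / (h - l) - 1.0
            for v, l, h in zip(x, lo, hi)]
```

Adjusting the test data "accordingly" means transforming it with the lo/hi learned from the training set, not with its own ranges.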

Table 5.1: Problem statistics

Problem   #training data   #testing data   #class   #attribute
dna       2000             1300            3        180

We compare the four implementations of RSVM discussed in Chapter IV with two implementations of the regular SVM: the linear and quadratic cost functions (i.e., (2.2) and (2.5)). For the regular SVM with the linear cost function, we use the software LIBSVM, which implements a simple decomposition method. LIBSVM is easily modified to use the quadratic cost function; we refer to this variant as LIBSVM-q in the rest of this thesis. However, we do not use it to solve RSVM, as LIBSVM implements an SMO-type algorithm in which the size of the working set is restricted to two. In Section 4.4 we showed that larger working sets should be used when applying decomposition methods to the linear SVM.

The computational experiments for this section were done on a Pentium III-1000 with 1024MB RAM using the gcc compiler. For three of the four RSVM methods (SSVM, LS-SVM, and LSVM), the main computational work consists of basic matrix operations, so we use ATLAS to optimize their performance [50]. This is crucial, as a direct implementation of these matrix operations can otherwise double or triple the computational time. For decomposition methods, where the kernel matrix cannot be fully stored, we allocate 500MB of memory as a cache for recently used kernel elements. Furthermore, LIBSVM, LIBSVM-q, and BSVM all use a shrinking technique: if most variables end up at bounds, smaller problems involving only the free variables are solved. Details of the shrinking technique are in [6, Section 4].

Next we discuss the selection of parameters in the different implementations. Selecting m, the size of the subset in RSVM, is of course a tricky issue; it depends on practical considerations such as the size of the problem. Here, in most cases, we fix m to 10% of the training data, a choice also considered in [21]. For multi-class problems we cannot use the same m for all binary problems, as the data set may be highly unbalanced. Hence we choose m to be 10% of the size of each binary problem, so a smaller binary problem uses a smaller m. In addition, for shuttle, ijcnn1, and protein, some binary problems are too large for training, so we set m = 200 for these large binary problems. This is similar to how [21] deals with large problems.
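The subset-size rule above can be encoded as a fraction with a cap (a sketch; the cap of 200 corresponds to the fixed m used for the largest binary problems):

```python
def subset_size(l_binary, fraction=0.1, cap=200):
    """m = 10% of the binary problem size, capped at 200 for very
    large binary problems (as for shuttle, ijcnn1, and protein)."""
    return min(int(round(fraction * l_binary)), cap)
```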

Once the size is determined, for all four implementations we select the same subset for each binary RSVM problem. Note that n ≪ m for most problems we consider, so the O(lmn) time for kernel evaluations is much less than the time spent by each implementation of RSVM.
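Sharing one fixed subset across implementations amounts to forming a single l × m rectangular kernel per binary problem; a minimal sketch with a generic kernel function k:

```python
def reduced_kernel(X, subset_idx, k):
    """l x m rectangular kernel K(X, X_S) shared by the RSVM solvers.
    Building it costs O(l*m*n) kernel work for n-attribute data."""
    S = [X[j] for j in subset_idx]
    return [[k(xi, xs) for xs in S] for xi in X]
```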

We need to decide some additional parameters prior to the experiments. For the SSVM method, the smoothing parameter β is set to 5, because the performance is almost the same for all β ≥ 5. For the LSVM method, we set β = 1.9C, the same as in [28]. We do not conduct cross validation to select these, as otherwise there would be too many parameters. For BSVM, which is used to solve the linear SVM arising from RSVM, we use all default settings; in particular, the size of the working set is 10.

The most important criterion for evaluating the performance of these methods is their accuracy rate. However, it would be unfair to compare the methods using only one parameter set; in practice, for any method, the best parameters must be found by model selection. This is conducted on the training data, with the test data assumed unknown. The best parameter set is then used to construct the model for final testing. To reduce the search space of parameter sets, we consider only the RBF kernel K(xᵢ, xⱼ) ≡ e^(−γ‖xᵢ − xⱼ‖²), so the parameters left to determine are the kernel parameter γ and the cost parameter C. In addition, for multi-class problems, we use the same C and γ for all k(k − 1)/2 binary problems of the one-against-one approach.
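The RBF kernel formula above can be written directly:

```python
import math

def rbf_kernel(xi, xj, gamma):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)
```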

For each problem, we estimate the generalized accuracy using γ ∈ {2⁴, 2³, 2², …, 2⁻¹⁰} and C ∈ {2¹², 2¹¹, 2¹⁰, …, 2⁻²}. Therefore, for each problem we try 15 × 15 = 225 combinations. For each pair (C, γ), the validation performance is measured by training on 70% of the training set and testing on the other 30%. We then train on the whole training set using the pair (C, γ) that achieves the best validation rate, and predict the test set. The resulting accuracy is presented in the "rate" columns of Table 5.2. Note that if several (C, γ) pairs achieve the same accuracy in the validation stage, we apply all of them to the test data and report the highest rate. If several parameter sets give the same highest testing rate, we report the one with the minimal training time.
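The grid described above, and the rule of carrying all validation ties forward to the testing stage, can be sketched as follows (validate is a hypothetical callback returning the 70/30 validation accuracy for one (C, γ) pair):

```python
def parameter_grid():
    """All 15 x 15 = 225 (C, gamma) pairs from the search ranges."""
    costs = [2.0 ** p for p in range(12, -3, -1)]    # 2^12 ... 2^-2
    gammas = [2.0 ** p for p in range(4, -11, -1)]   # 2^4 ... 2^-10
    return [(C, g) for C in costs for g in gammas]

def best_pairs(validate):
    """Return every (C, gamma) tying for the best validation rate;
    all of them are later applied to the test data."""
    scored = [(validate(C, g), (C, g)) for C, g in parameter_grid()]
    best = max(rate for rate, _ in scored)
    return [pair for rate, pair in scored if rate == best]
```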
