
PROPOSED APPROACHES


$$x_{new} = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{4.1}$$
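As a small illustration of the scaling in Equation (4.1), read here as min-max normalization of a single feature, the following Python sketch rescales a list of values to the range [0, 1]. The function name and the handling of a constant feature are assumptions made for this example; they are not taken from the original work.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers to [0, 1] as in Equation (4.1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy feature column (illustrative values only).
feature = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_scale(feature)
print(scaled)
```

Scaling each feature separately in this way keeps features with large numeric ranges from dominating the inner products and distances that the kernels rely on.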

4.2 Implementation Results

Five approaches were implemented for the classification tasks: the linear kernel, two popular kernels (polynomial and RBF), and two proposed kernels (polynomial plus RBF, k_{P+G}; polynomial multiplied by RBF, k_{P·G}). For multi-class data sets, classification accuracy was calculated with the one-against-one procedure; for two-class data sets, the standard SVM procedure was used. Furthermore, a popular classifier, K nearest neighbor (KNN), was employed as the benchmark in our experiment. To simplify the classification process, two parameters of the polynomial kernel were fixed (one at 0 and b at 1), and only the degree d was varied. The RBF kernel remained in its original form, i.e. only the kernel width could be changed. In our experiment, parameter d was set between 2 and 10, and parameter γ was set at 10⁻³, 10⁻², 10⁻¹, 10⁰, 10¹, and 10².
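To make the two proposed kernels concrete, here is a minimal Python sketch of a summation kernel and a multiplication kernel built from a polynomial and an RBF kernel, together with the parameter grid described above. The kernel formulas are not restated in this section, so the forms used here, (x·y + offset)^d for the polynomial kernel and exp(−‖x − y‖² / (2γ²)) for the RBF kernel with γ treated as a width, are assumptions, as are all function and parameter names.

```python
import math


def poly_kernel(x, y, degree, offset=1.0):
    """Polynomial kernel, assumed form (x . y + offset) ** degree."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + offset) ** degree


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel, assumed form exp(-||x - y||^2 / (2 * gamma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * gamma ** 2))


def sum_kernel(x, y, degree, gamma):
    """Summation kernel: polynomial plus RBF."""
    return poly_kernel(x, y, degree) + rbf_kernel(x, y, gamma)


def product_kernel(x, y, degree, gamma):
    """Multiplication kernel: polynomial times RBF."""
    return poly_kernel(x, y, degree) * rbf_kernel(x, y, gamma)


# Parameter grid from the text: d = 2..10 and gamma = 10^-3 .. 10^2.
degrees = range(2, 11)
widths = [10.0 ** e for e in range(-3, 3)]
grid = [(d, g) for d in degrees for g in widths]
print(len(grid), "parameter combinations")
```

Sums and products of valid (positive semi-definite) kernels are themselves valid kernels, which is why both combinations can be used inside an SVM.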

As mentioned before, the twelve data sets were separated into larger ones (more than 5000 samples) and smaller ones (fewer than 5000 samples). The only imbalanced data set is Shuttle. Table 4.2 compares the classification accuracy for the larger and smaller data sets. For the larger data sets, the SVM-based approaches are in general more accurate than the KNN approach. Among them, the proposed kernel k_{P·G} (polynomial multiplied by RBF) performs best, followed by the other proposed kernel, k_{P+G}. The results for the smaller data sets are similar: the SVM-based approaches generally outperform the KNN approach, and the combined kernel k_{P·G} again performs best. In addition, we found that the proposed kernels do not perform as well in the classification of imbalanced data. With respect to the ML, our proposed combined kernels perform better when the ML is small.

The results of feature selection are shown in Tables 4.3 and 4.4. In this procedure, the kernels were applied to the L-J method for feature selection, and the optimal parameter settings were used in the SVM model. As above, the original SVM technique was used when the data had two classes; otherwise, the SVM model was run with the one-against-one procedure.

The average accuracies for the seven larger data sets and the five smaller data sets are shown in Tables 4.3 and 4.4, with standard deviations in brackets. The two tables indicate that the combined kernel performs better than the other approaches. After feature selection (from 75% down to 25% of the features), the combined kernel still performed better for both the larger and the smaller data sets. For the larger data sets, the combined kernel outperformed the polynomial and RBF kernels, and the same held for the smaller data sets. Furthermore, the combined kernel had almost the lowest standard deviation among the four approaches for the larger data sets, and it also performed well for the smaller data sets.
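The one-against-one procedure used for the multi-class data sets can be sketched as a majority vote over all pairs of classes. The Python example below is only an illustration: the pairwise classifiers are hypothetical threshold rules standing in for trained two-class SVMs, and the tie-breaking behaviour (the label counted first wins) is an assumption.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise_classifiers):
    """Predict a label by majority vote over all class pairs.

    pairwise_classifiers maps a pair (class_a, class_b) to a function
    that returns whichever of the two labels it assigns to x.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = pairwise_classifiers[(a, b)](x)
        votes[winner] += 1
    # most_common keeps first-seen order for ties (assumed tie-break rule).
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained binary SVMs on a one-dimensional input.
classes = ["low", "mid", "high"]
pairwise = {
    ("low", "mid"): lambda x: "low" if x < 1.0 else "mid",
    ("low", "high"): lambda x: "low" if x < 2.0 else "high",
    ("mid", "high"): lambda x: "mid" if x < 3.0 else "high",
}

for point in (0.5, 2.5, 3.5):
    print(point, one_vs_one_predict(point, classes, pairwise))
```

With k classes this scheme needs k(k − 1)/2 binary classifiers, so a 26-class data set such as Letter requires 325 of them.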

Table 4.1 Data sets used in this study.

No  Data set             # of samples  # of features  # of classes  Data style  Data complexity (ML)  Ratio of positive to negative
1   Hyperlipidemia       6000          33             2             c           1.00                  1:2.08
2   Liver disease        6000          33             2             c           1.46                  1:2.81
3   Renal disease        6000          33             2             c           1.06                  1:3.60
4   Census income        32561         14             2             c, d        1.41                  1:3.15
5   Shuttle*             14500         9              7             c           3.63                  1:14.20
6   Mushroom             8124          22             2             d           1.01                  1:1.07
7   Letter               15000         16             26            c           1.00                  1:1.13
8   Sonar                208           60             2             c           1.01                  1:1.14
9   Ionosphere           351           34             2             c           1.08                  1:1.79
10  Vehicle silhouettes  846           18             4             c           1.09                  1:1.40
11  Spambase             4601          57             2             c           1.04                  1:1.54
12  Vowel                990           13             11            c, d        1.05                  1:1.00

c: continuous; d: discrete.

Table 4.2 Comparison of classification accuracy with the larger and smaller data sets.

No.  Data set             SVM linear      SVM polynomial  SVM RBF         SVM Poly + RBF  SVM Poly × RBF  KNN (k=1)       KNN (k=3)
1    Hyperlipidemia       68.06           51.92           68.06           69              69              59              68
2    Liver disease        73.75           73.5            73.5            77              77              60              68
3    Renal disease        78.5            78.58           78.5            82              82              67              81
4    Census income        70.5            75              71.5            74              75.8            74              73
5    Shuttle*             95.5            98.2            95.5            96.8            95.5            100             100
6    Mushroom             97.33           100             99.73           99.73           100             91.33           94
7    Letter               82.25           86.75           90.75           87.5            90.75           81              78.65
     AVERAGE-large        80.84(11.65)    80.56(16.49)    82.51(12.65)    83.72(11.55)    84.29(11.39)    76.05(15.63)    80.38(12.48)
8    Sonar                85.33           88.1            88.1            92.86           95.23           83.33           85.71
9    Ionosphere           81.43           84.29           84.29           91.43           91.43           81.43           78.57
10   Vehicle silhouettes  75              79.3            82.84           79.3            85.8            75              83
11   Spambase             94.5            94.5            94.5            94.5            95              88              87.5
12   Vowel                98.89           90.91           98.89           99.49           99.49           97.22           99.72
     AVERAGE-smaller      87.03(9.69)     87.42(5.88)     89.72(6.83)     91.52(7.48)     93.39(5.11)     85(8.27)        86.9(7.92)

*: imbalanced data set. ( ): standard deviation.
Nos. 1-7: larger data sets. Nos. 8-12: smaller data sets.
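The AVERAGE rows of Table 4.2 can be checked directly from the individual accuracies. The short Python sketch below assumes that the bracketed value is the sample standard deviation (n − 1 in the denominator); under that assumption it reproduces the SVM linear entry for the larger data sets. The variable names are chosen for this example only.

```python
from statistics import mean, stdev

# SVM linear accuracies for data sets 1-7 (larger data sets) in Table 4.2.
svm_linear_large = [68.06, 73.75, 78.5, 70.5, 95.5, 97.33, 82.25]

avg = round(mean(svm_linear_large), 2)
# stdev() is the sample standard deviation (assumed convention).
sd = round(stdev(svm_linear_large), 2)

print(f"{avg}({sd})")  # compare with the AVERAGE-large entry 80.84(11.65)
```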

Table 4.3 The accuracy of feature selection for the SVM using the L-J method (larger data sets).

                 Full (100%)                         Reduced (75%)
Data set         k_P      k_G      k_{P+G}  k_{P·G}  k_P      k_G      k_{P+G}  k_{P·G}
Hyperlipidemia   51.92    68.06    69       69       50.83    67.83    70       69.5
Liver disease    73.5     73.5     77       77       71.13    71.13    75.5     76.5
Renal disease    78.58    78.5     82       82       76.25    72.13    81.75    80.38
Census income    75       71.5     74       75.8     69.6     71.6     71.6     76
Shuttle*         98.2     95.5     96.8     95.5     98.4     98.2     99.8     98.8
Mushroom         100      99.73    99.73    100      96.8     98.2     98       98.8
Letter           86.75    90.75    87.5     90.75    81.25    81       83       83
AVERAGE

                 Reduced (50%)                       Reduced (25%)
Data set         k_P      k_G      k_{P+G}  k_{P·G}  k_P      k_G      k_{P+G}  k_{P·G}
Hyperlipidemia   50.25    67       68.5     68.86    48.5     62.5     61.83    62.38
Liver disease    72       72       73       75       62.5     71.25    60.25    71.75
Renal disease    75       70       78       78.5     69.4     63.94    74.75    72.88
Census income    68.5     69.5     78       72.5     72       73.6     72       74
Shuttle*         91.5     91.41    91.5     95.5     81.4     82.6     81.4     82.8
Mushroom         98.82    99.73    99.73    99.73    100      99.89    100      99.92
Letter           67.5     73       73.5     79       46.5     52.5     52       57.5
AVERAGE

*: imbalanced data set. ( ): standard deviation.

Table 4.4 The accuracy of feature selection for the SVM using the L-J method (smaller data sets).

                 Full (100%)                         Reduced (75%)
Data set         k_P      k_G      k_{P+G}  k_{P·G}  k_P      k_G      k_{P+G}  k_{P·G}
Vehicle          79.3     82.84    79.3     85.8     78.85    77.51    78.25    83.25
Spambase         94.5     94.5     94.5     95       92.5     94       91.5     94
Vowel            90.91    98.98    99.49    99.49    88.43    94.47    92.13    95
AVERAGE

                 Reduced (50%)                       Reduced (25%)
Data set         k_P      k_G      k_{P+G}  k_{P·G}  k_P      k_G      k_{P+G}  k_{P·G}
Sonar            76.19    78.57    85.71    88.1     78.57    76.2     85.71    85.71
Ionosphere       71.42    78.57    85.71    90       72.86    77.14    82.56    88.57
Vehicle          78.1     72.19    78.1     80.47    69.29    69.41    72.92    79.51
Spambase         90.8     93.4     92       94.4     87.5     86.8     88       89.75
Vowel            84.48    93.93    89.39    94.94    44.44    39.71    44.95    45.51
AVERAGE

( ): standard deviation.

4.3 Discussions

In the experiment, we found that the parameters d and γ heavily influenced the classification accuracy, and that they affect larger and smaller data sets differently. For larger data sets, the degree d should be higher and γ should be lower; for smaller data sets, d should be lower and γ should be higher. Figures 4.1 and 4.2 show the relationship between the parameters and accuracy for a larger data set (Renal disease) and a smaller data set (Vowel), respectively.

Figure 4.1 The relationship between parameters and accuracy for the larger data set. [Two panels: classification accuracy (%) against degree d (2 to 10) using the polynomial kernel, and against kernel width (10⁻³ to 10²) using the RBF kernel.]

Figure 4.2 The relationship between parameters and accuracy for the smaller data set. [Two panels: classification accuracy (%) against degree d (2 to 10) using the polynomial kernel, and against kernel width (10⁻³ to 10²) using the RBF kernel.]

Some research has indicated that the SVM with the kernel method classifies better than linear methods (Tefas et al., 2001). Our experiment showed similar results (see Table 4.2). Although the linear kernel is not the best of the kernel-based approaches for the larger data sets, it is acceptable compared with the KNN approach. The other two popular kernels, polynomial and RBF, also performed acceptably for both the larger and the smaller data sets, with the RBF kernel performing better than the polynomial kernel in both cases.

Regarding the parameter settings of the polynomial and RBF kernels, Pardo and Sberveglieri (2005) consider that larger values of the polynomial kernel parameter d mean more complex classification functions (higher-order polynomials), which are useful for solving classification problems. At the same time, however, a smaller value of the RBF kernel parameter γ is also good at solving classification problems. The results of our experiment are similar to those of Pardo and Sberveglieri (2005) (see Figures 4.1 and 4.2). It is evident that a larger d suits complex data because it gives a greater probability of correct classification. Hence, we feel that transforming the lower-dimensional input space into a higher-dimensional feature space makes it easier to find a separating boundary.

In the following we discuss the effect of the parameters d and γ on classification accuracy. First, for the polynomial kernel with a = b, if d is adjusted from 3 to 5, the number of terms in the expanded polynomial grows from 4 to 6. As the number of terms grows, the number of boundaries also increases. Although the larger the value of d, the poorer the classification performance, d can be adjusted slightly according to the complexity of the data.
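The growth in the number of terms can be checked with the binomial theorem: expanding (a·u + b)^d in powers of u gives d + 1 terms. The Python sketch below assumes that u stands for the inner product of two samples and uses a = b = 1 purely for illustration; the function name is hypothetical.

```python
from math import comb


def binomial_terms(a, b, d):
    """Coefficients of u**k, k = 0..d, in the expansion of (a*u + b)**d."""
    return [comb(d, k) * a ** k * b ** (d - k) for k in range(d + 1)]


for d in (3, 5):
    coeffs = binomial_terms(1, 1, d)
    print(f"d={d}: {len(coeffs)} terms, coefficients {coeffs}")
```

Raising the degree from 3 to 5 therefore adds the u⁴ and u⁵ terms, which is what allows the decision function to bend more sharply.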

Next, suppose the width γ of the RBF kernel is adjusted from 10⁰ to 10¹; then the increment of the RBF kernel is positive. Conversely, the increment is negative when the kernel width is decreased. Thus the user can change the kernel width until the kernel meets the user's needs. From the mathematical viewpoint, when a smaller data set lies in a lower-dimensional space, a larger width helps to reach the optimal solution easily and quickly. However, when a larger data set lies in a higher-dimensional space with many local optimal solutions, a larger kernel width makes it easy to become trapped in one of them. Thus, a small width is best for larger data sets. Our experiment only shows how classification accuracy differs between larger and smaller data sets under different kernel widths; we could not find a significant difference in classification accuracy between data sets of different data complexity.
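The sign of the increment can be illustrated numerically. The Python sketch below assumes the Gaussian form exp(−‖x − y‖² / (2γ²)) with γ acting as the width, which is consistent with the statement that widening the kernel increases its value; the exact formula used in the experiments is not restated in this section, and the two sample points are arbitrary.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel, assumed form exp(-||x - y||^2 / (2 * gamma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * gamma ** 2))


# Two arbitrary points at squared distance 2.
x, y = [0.0, 0.0], [1.0, 1.0]

k_narrow = rbf_kernel(x, y, 1.0)   # width 10^0
k_wide = rbf_kernel(x, y, 10.0)    # width 10^1
increment = k_wide - k_narrow

print(f"k(width=1)={k_narrow:.4f}  k(width=10)={k_wide:.4f}  increment={increment:.4f}")
```

Under this form a wider kernel makes distant points look more similar, which smooths the decision function.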

Based on the above discussion, some useful strategies for determining the parameters d and γ are summarized in Table 4.5. For the polynomial kernel, a larger d is suitable for larger data sets and a smaller d is suitable for smaller data sets. For the RBF kernel, a smaller γ is suitable for larger data sets and a larger γ is suitable for smaller data sets.

Table 4.5 The strategies of parameter setting of polynomial and RBF kernels.

Data set size      Polynomial (d)   RBF (γ)
Larger data set    larger           smaller
Smaller data set   smaller          larger
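The strategy in Table 4.5 can be written as a small lookup. The Python sketch below uses the 5000-sample boundary between larger and smaller data sets given earlier in the chapter; the function name is hypothetical, and treating a data set of exactly 5000 samples as "smaller" is an assumption, since the text does not cover that case.

```python
def parameter_strategy(n_samples, large_threshold=5000):
    """Return the search direction for d and gamma suggested by Table 4.5."""
    if n_samples > large_threshold:
        return {"d": "larger", "gamma": "smaller"}
    # Assumption: exactly `large_threshold` samples counts as a smaller data set.
    return {"d": "smaller", "gamma": "larger"}


print("Renal disease (6000 samples):", parameter_strategy(6000))
print("Vowel (990 samples):", parameter_strategy(990))
```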

In our experiment, the multiplication kernel (k_{P·G}) seems to be superior to the summation kernel (k_{P+G}). The reason may be that, in the multiplication kernel, changing the degree and adjusting the width act on the kernel at the same time, which seems to increase classification performance. In the summation kernel, however, the influence of these two adjustments is not significant.

In addition, ML was used to evaluate the data complexity. As expected, the combined kernel k_{P·G} provides better classification performance when the ML approaches 1. However, the combined kernels do not seem to be superior to the other approaches when the ML is greater than approximately 1.5. A possible explanation is that, for simple problems, the SVM with an original kernel is already good enough for classification. The combined kernels are not recommended for simple problems because they complicate the data space.

Next, we show the results with 100%, 75%, 50%, and 25% of the features after feature selection on the twelve data sets. As expected, classification performance decreases as the number of features is reduced. It is interesting to note that the more classes a data set has, the larger the percentage decrease in classification accuracy.
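To look at the decrease for individual data sets, the relative drop in accuracy between the full feature set and the 25% subset can be computed from Tables 4.3 and 4.4. The Python sketch below uses the multiplication-kernel accuracies for four data sets with different numbers of classes; the choice of data sets and of the relative-drop measure is an assumption made for this illustration.

```python
def relative_drop(full, reduced):
    """Percentage decrease in accuracy from `full` to `reduced`."""
    return 100.0 * (full - reduced) / full


# (number of classes, accuracy at 100% features, accuracy at 25% features),
# multiplication kernel, values copied from Tables 4.3 and 4.4.
results = {
    "Hyperlipidemia": (2, 69.0, 62.38),
    "Shuttle": (7, 95.5, 82.8),
    "Vowel": (11, 99.49, 45.51),
    "Letter": (26, 90.75, 57.5),
}

for name, (n_classes, full, reduced) in results.items():
    drop = relative_drop(full, reduced)
    print(f"{name:15s} classes={n_classes:2d} drop={drop:.1f}%")
```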

As for the feature selection process, many investigators consider the most straightforward idea to be using a leave-one-out procedure or a cross-validation set to assess the generalization error with regard to the number of features, and choosing the number of attributes that minimizes the test error. This, however, is computationally unfavorable. In comparison, the L-J method simply selects variables by their influence index (α_j) and so avoids this predicament. However, kernel selection plays an important role in the L-J method and greatly affects classification performance.
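Selecting variables by an influence index amounts to ranking the features and keeping a fixed fraction of them. The Python sketch below shows only that ranking step for the 75%, 50%, and 25% settings; the scores are hypothetical stand-ins for α_j, and how the L-J method actually computes α_j is not shown in this section, so this is not an implementation of the L-J method itself.

```python
import math


def keep_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring features, keeping `fraction` of them."""
    n_keep = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical influence scores for eight features (not real alpha_j values).
scores = [0.10, 0.80, 0.05, 0.40, 0.60, 0.20, 0.90, 0.30]

for fraction in (0.75, 0.50, 0.25):
    print(f"{int(fraction * 100)}% of features kept:", keep_top_fraction(scores, fraction))
```

Unlike leave-one-out or cross-validation searches over feature subsets, a ranking of this kind needs the scores to be computed only once.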

CHAPTER 5
