
8 Multi-Class ν-SV Classifiers

Though SVM was originally designed for two-class problems, several approaches have been developed to extend SVM for multi-class data sets. In this section, we discuss the extension of the “one-against-one” approach for multi-class ν-SVM.

Most approaches for multi-class SVM decompose the data set into several binary problems. For example, the "one-against-one" approach trains a binary SVM for any two classes of data and obtains a decision function. Thus, for a k-class problem, there are k(k − 1)/2 decision functions. In the prediction stage, a voting strategy is used: the testing point is designated to be in the class with the maximum number of votes. In [18], it was experimentally shown that for general problems, using the C-SV classifier, various multi-class approaches give similar accuracy. However, the "one-against-one" method is more efficient for training. Here, we will focus on extending it to ν-SVM.
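The voting stage of the "one-against-one" approach can be sketched as follows; the `decision_functions` mapping and its sign convention are assumptions for illustration, not part of any particular library:

```python
from itertools import combinations

def one_against_one_predict(x, classes, decision_functions):
    """Predict the class of x by majority vote over all k(k-1)/2
    pairwise decision functions.

    decision_functions: hypothetical dict mapping an ordered class
    pair (ci, cj) to a trained binary decision function f(x) that
    returns a positive value when x is judged to belong to class ci.
    """
    votes = {c: 0 for c in classes}
    for ci, cj in combinations(classes, 2):
        if decision_functions[(ci, cj)](x) > 0:
            votes[ci] += 1
        else:
            votes[cj] += 1
    # the class with the maximum number of votes wins
    return max(classes, key=lambda c: votes[c])
```

Ties are broken here by class order; in practice any fixed tie-breaking rule can be used.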

Multi-class methods must be considered together with parameter-selection strategies. That is, we search for appropriate C and kernel parameters for constructing a better model. In the following, we restrict the discussion to the Gaussian (radial basis function) kernel k(x_i, x_j) = e^{−γ‖x_i − x_j‖²}, so the kernel parameter is γ. With parameter selection considered, there are two ways to implement the "one-against-one" method. First, for any two classes of data, parameter selection is conducted to find the best (C, γ). Thus, in the best model selected, each decision function has its own (C, γ). For the experiments here, the parameter selection of each binary SVM is done by five-fold cross-validation. The second way is that for each (C, γ), an evaluation criterion (e.g. cross-validation) combined with the "one-against-one" method is used to estimate the performance of the model. A sequence of pre-selected (C, γ) is tried to select the best model. Therefore, for each model, all k(k − 1)/2 decision functions share the same C and γ.
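The second scheme amounts to a simple grid search over a shared (C, γ). A minimal sketch, where `cv_accuracy` is a hypothetical callback that trains the full one-against-one model for a given (C, γ) and returns its cross-validation accuracy:

```python
def select_shared_parameters(cv_accuracy):
    """Pick one (C, gamma) shared by all k(k-1)/2 decision functions,
    by trying a pre-selected grid and keeping the parameters with the
    best cross-validated accuracy.

    cv_accuracy: assumed callback (C, gamma) -> accuracy of the full
    one-against-one model under those parameters.
    """
    C_grid = [2.0 ** p for p in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
    gamma_grid = [2.0 ** p for p in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3
    best = None
    for C in C_grid:
        for gamma in gamma_grid:
            acc = cv_accuracy(C, gamma)
            if best is None or acc > best[0]:
                best = (acc, C, gamma)
    return best  # (best accuracy, best C, best gamma)
```

The first scheme would instead run this search once per class pair, with `cv_accuracy` evaluating only that binary problem.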

It is not very clear which of the two implementations is better. On one hand, a single parameter set may not be uniformly good for all k(k − 1)/2 decision functions. On the other hand, as the overall accuracy is the final consideration, one parameter set per decision function may lead to over-fitting. [14] is the first to compare the two approaches using C-SVM, where the preliminary results show that both give similar accuracy.

For ν-SVM, each binary SVM using data from the ith and the jth classes has an admissible interval [ν_min^{ij}, ν_max^{ij}], where ν_max^{ij} = 2 min(m_i, m_j)/(m_i + m_j) according to proposition 3. Here m_i and m_j are the numbers of data points in the ith and jth classes, respectively. Thus, if all k(k − 1)/2 decision functions share the same ν, the admissible interval is

[max_{i≠j} ν_min^{ij}, min_{i≠j} ν_max^{ij}].    (91)

This set is non-empty if the kernel matrix is positive definite. The reason is that proposition 3 then implies ν_min^{ij} = 0, ∀i ≠ j, so max_{i≠j} ν_min^{ij} = 0. Therefore, unlike C of C-SVM, which has the large valid range [0, ∞), for ν-SVM we worry that the admissible interval may be too small. For example, if the data set is highly unbalanced, min_{i≠j} ν_max^{ij} is very small.
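As a concrete illustration, when ν_min^{ij} = 0 for every pair (positive definite kernel matrix, cf. proposition 3), the shared interval (91) reduces to [0, min_{i≠j} 2 min(m_i, m_j)/(m_i + m_j)] and can be computed directly from the class sizes. `shared_nu_interval` is a hypothetical helper, not part of the original experiments:

```python
from itertools import combinations

def shared_nu_interval(class_sizes):
    """Admissible interval for a nu shared by all k(k-1)/2 pairwise
    nu-SVMs, assuming nu_min^{ij} = 0 for every pair (positive
    definite kernel matrix, cf. proposition 3).

    class_sizes: list of the number of training points per class.
    Returns (lower, upper) of the shared interval (91).
    """
    upper = min(2 * min(mi, mj) / (mi + mj)
                for mi, mj in combinations(class_sizes, 2))
    return 0.0, upper

# A highly unbalanced problem shrinks the interval: with classes of
# sizes 1000 and 10, the shared upper bound is only 2*10/1010.
```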

We redo the same comparison as that in [14] for ν-SVM. Results are in Table 2. We consider the multi-class problems tested in [18], most of which are from the statlog collection [25]. Except for the data sets dna, shuttle, letter, satimage, and usps, for which test sets are available, we separate each problem into 80% training and 20% testing. Then, cross-validation is conducted only on the training data. All other settings, such as data scaling, are the same as those in [18]. Experiments are conducted using LIBSVM [10], which solves both C-SVM and ν-SVM.

Results in Table 2 show no significant difference among the four implementations. Note that some problems (e.g. shuttle) are highly unbalanced, so the admissible interval (91) is very small. Surprisingly, from such intervals we can still find a suitable ν which leads to a good model. This preliminary experiment indicates that in general the use of the "one-against-one" approach for multi-class ν-SVM is viable.

Table 2. Test accuracy (in percentage) of multi-class data sets by C-SVM and ν-SVM. The columns "Common C", "Different C", "Common ν", and "Different ν" give the testing accuracy of using the same or different (C, γ) (or (ν, γ)) for all k(k − 1)/2 decision functions. The validation is conducted on the following points of (C, γ): [2^−5, 2^−3, ..., 2^15] × [2^−15, 2^−13, ..., 2^3]. For ν-SVM, the range of γ is the same, but we validate a 10-point discretization of ν in the interval (91) or [ν_min^{ij}, ν_max^{ij}], depending on whether the k(k − 1)/2 decision functions share the same parameters or not. For small problems (number of training data ≤ 1000), we do cross-validation five times and then average the testing accuracy.

Data set   Class No.  # training  # testing  Common C  Different C  Common ν  Different ν
vehicle        4            677        169      86.5       87.1        85.9       87.8
glass          6            171         43      72.2       70.7        73.0       69.3
iris           3            120         30      96.0       93.3        94.0       94.6
dna            3           2000       1186      95.6       95.1        95.0       94.8
segment        7           1848        462      98.3       97.2        96.7       97.6
shuttle        7          43500      14500      99.9       99.9        99.7       99.8
letter        26          15000       5000      97.9       97.7        97.9       96.8
vowel         11            423        105      98.1       97.7        98.3       96.0
satimage       6           4435       2000      91.9       92.2        92.1       91.9
wine           3            143         35      97.1       97.1        97.1       96.6
usps          10           7291       2007      95.3       95.2        95.3       94.8

We also present the contours of C-SVM and ν-SVM in Figure 6, using the approach in which all decision functions share the same (C, γ). In the contour of C-SVM, the x-axis and y-axis are log2 C and log2 γ, respectively. For ν-SVM, the x-axis is ν in the interval (91). Clearly, the good region of ν-SVM is smaller. This confirms our earlier concern, which motivated us to conduct the experiments in this section. Fortunately, points in this smaller good region still lead to models that are competitive with those by C-SVM.

There are some ways to enlarge the admissible interval of ν. [27] extends the algorithm to the case of very small values of ν by allowing negative margins. For the upper bound, according to proposition 3, if the classes are balanced, then the upper bound is 1. This leads to the idea of modifying the algorithm by adjusting the cost function so that the classes are balanced in terms of cost, even if they are not balanced in terms of the mere numbers of training examples. An earlier discussion of such formulations is in [12]. For example, we can consider the following formulation:

minimize_{w∈H, ξ∈R^m, ρ, b∈R}   τ(w, ξ, ρ) = (1/2)‖w‖² − νρ + (1/(2m_+)) Σ_{i: y_i = 1} ξ_i + (1/(2m_−)) Σ_{i: y_i = −1} ξ_i

subject to y_i(⟨x_i, w⟩ + b) ≥ ρ − ξ_i, and ξ_i ≥ 0, ρ ≥ 0.


Fig. 6. 5-fold cross-validation accuracy of the data set satimage. Left: C-SVM, Right: ν-SVM

The dual is

maximize_{α∈R^m}   W(α) = −(1/2) Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j)

subject to 0 ≤ α_i ≤ 1/(2m_+) if y_i = 1,   0 ≤ α_i ≤ 1/(2m_−) if y_i = −1,
Σ_{i=1}^m α_i y_i = 0,   Σ_{i=1}^m α_i ≥ ν.

Clearly, when every α_i equals its corresponding upper bound, α is a feasible solution with Σ_{i=1}^m α_i = 1. Then, the largest admissible ν is 1.
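The last claim can be checked numerically: setting every α_i to its upper bound (1/(2m_+) for positive and 1/(2m_−) for negative examples, matching the per-class costs of the primal) yields Σ α_i = 1 and Σ α_i y_i = 0, so ν = 1 is feasible. A small sketch with hypothetical class sizes:

```python
def check_full_bound_solution(m_plus, m_minus):
    """Set every alpha_i to its per-class upper bound (1/(2m+) for
    positive, 1/(2m-) for negative examples) and return
    (sum of alpha_i, sum of alpha_i * y_i) for the modified dual."""
    alphas = [1 / (2 * m_plus)] * m_plus + [1 / (2 * m_minus)] * m_minus
    ys = [1] * m_plus + [-1] * m_minus
    return (sum(alphas), sum(a * y for a, y in zip(alphas, ys)))

# Regardless of how unbalanced the classes are, the sums are 1 and 0:
# each class contributes exactly 1/2 to sum(alphas), and the two
# halves cancel in sum(alphas * ys).
```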

A slight modification of the implementation in Section 7 for the above formulations is in [13].
