Support Vector Machine

Chapter 3 Methodology

3.1 Support Vector Machine

Support Vector Machine (SVM) is a new generation learning system based on recent advances in statistical learning theory [8]. It has been widely and successfully applied to many real-world classification problems, such as text categorization, image detection, and biochemical technology [9-11]. In most cases, the generalization performance of SVM is outstanding. This subsection gives a brief introduction.

Figure 3.1 Optimal Separating Hyperplane

The basic idea of an SVM classifier is that a set of binary-labeled training data vectors can be separated by a hyperplane. In the simplest case of a linear hyperplane, many separating hyperplanes may exist. Consider the example in Figure 3.1. Many linear classifiers can separate the data, but among them the SVM classifier seeks the separating hyperplane that produces the largest separation margin, that is, the one that maximizes the distance between itself and the nearest data point of each class (shown in Figure 3.2). This linear classifier is termed the optimal separating hyperplane. Finding the hyperplane with maximal margin is the learning goal in statistical learning theory, and such a hyperplane is likely to perform well in classifying new data.

Figure 3.2 Margin: Distance Between Hyperplane and The Nearest Data Point of Each Class

Such a scheme is associated with structural risk minimization, which seeks a learning machine that yields a good trade-off between low empirical risk and small capacity.
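As a minimal sketch of the margin idea, the following computes the distance from a hyperplane w·x + b = 0 to the nearest training point for two candidate separators. The toy points, the two hyperplanes, and the helper name `geometric_margin` are illustrative assumptions and are not taken from this work; an SVM would search over all separating hyperplanes rather than compare two fixed ones.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest value of y_i * (w . x_i + b) / ||w|| over all points.

    A positive result means the hyperplane separates the two classes;
    the value is the distance to the nearest point.
    """
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Hypothetical two-class toy data (labels are +1 / -1).
points = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]

# Two hypothetical hyperplanes that both separate the data.
margin_a = geometric_margin((1.0, 1.0), 0.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), 0.0, points, labels)

print("margin of hyperplane A:", margin_a)
print("margin of hyperplane B:", margin_b)
```

Both hyperplanes classify the toy data correctly, but A leaves a wider margin, so a maximum-margin criterion would prefer it over B.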

In the more general case in which the data points are not linearly separable in the input space (shown in Figure 3.3(a)), a non-linear transformation is used to map the data vector x into a high-dimensional feature space prior to applying the linear maximum-margin classifier (shown in Figure 3.3(b), Figure 3.4).

Figure 3.3 Mapping Not Linearly Separable Data Vector into High-Dimensional Feature Space

Figure 3.4 Mapping Function ψ: Maps Samples into Higher Dimension Feature Space

Figure 3.5 Example of One Dimension Non Linearly Separable Data Vector

Figure 3.5(a) shows a one-dimensional data set that is not linearly separable. After a mapping function maps the data into a two-dimensional space, however, it can be separated by a linear hyperplane (shown in Figure 3.5(b)).
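A small sketch of this effect, assuming the hypothetical mapping x → (x, x²) and made-up sample values (the actual data and mapping behind Figure 3.5 are not specified here):

```python
def feature_map(x):
    """Hypothetical mapping from one dimension to two: x -> (x, x^2)."""
    return (x, x * x)


def separable_by_threshold(xs, ys):
    """Return True if a single threshold on the line splits the two classes."""
    ordered = sorted(xs)
    thresholds = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    for t in thresholds:
        for sign in (1, -1):
            if all((sign if x > t else -sign) == y for x, y in zip(xs, ys)):
                return True
    return False


# Made-up one-dimensional data: the -1 class sits between the +1 points.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# Hypothetical linear separator in the mapped space: second coordinate = 2.5.
w, b = (0.0, 1.0), -2.5


def classify(x):
    z = feature_map(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1


predictions = [classify(x) for x in xs]
print("separable in one dimension:", separable_by_threshold(xs, ys))
print("separated after mapping:", predictions == ys)
```

No single threshold separates the classes on the line, while a straight line in the mapped two-dimensional space does.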

To avoid over-fitting in this higher-dimensional space, an SVM uses kernel functions (polynomial and Gaussian radial basis kernels are the most common) in which the nonlinear mapping is implicitly embedded. With the use of a kernel, the decision function of an SVM classifier has the following form:

$$f(x) = \sum_{i=1}^{L_S} \alpha_i y_i K(x_i, x) + b,$$

where K(·,·) is the kernel function, x_i are the so-called support vectors determined from the training data, L_S is the number of support vectors, y_i is the class indicator associated with each x_i, α_i are the Lagrange multipliers, and b is the bias term; the predicted class is given by the sign of f(x). In addition, for a given kernel it is necessary to specify the cost factor C, a positive regularization parameter that controls the trade-off between the complexity of the machine and the allowed classification error. For more details about SVM, refer to the literature [12-13].
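The kernel decision function can be sketched as follows. This is only an illustration under stated assumptions: the RBF kernel is written in its standard form K(u, v) = exp(-γ‖u − v‖²), a bias term is added to the weighted kernel sum, and the support vectors, multipliers, bias, and γ below are made-up values rather than the result of training.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, bias, kernel):
    """Weighted kernel sum over the support vectors, plus the bias."""
    total = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return total + bias


def predict(x, support_vectors, labels, alphas, bias, kernel):
    """Predicted class is the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, bias, kernel)
    return 1 if value >= 0 else -1


# Hypothetical "trained" model: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
bias = 0.0


def kernel(u, v):
    return rbf_kernel(u, v, gamma=0.5)


label_near_positive = predict((0.8, 1.2), support_vectors, labels, alphas, bias, kernel)
label_near_negative = predict((-1.1, -0.7), support_vectors, labels, alphas, bias, kernel)
print(label_near_positive, label_near_negative)
```

Only the support vectors enter the sum, so the cost of a prediction grows with the number of support vectors, not with the size of the full training set.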

To design an effective SVM model, values of parameters in SVM have to be chosen carefully in advance [14]. These parameters include the following.

1. The kernel function used in the SVM, which constructs a nonlinear decision boundary in the input space.

2. The regularization parameter C, which determines the trade-off between minimizing the training error and minimizing the complexity of the model.

3. The parameters of the kernel function, which define the nonlinear mapping from the input space to some high-dimensional feature space.

RBF Kernel

Because the mapping function ψ may be a very complicated expression, working in the feature space raises two problems: the computational cost of handling very large vectors and, from the viewpoint of generalization theory, the curse of dimensionality. Kernels solve the computational problem of working with many dimensions, and they make it possible to use even infinite-dimensional feature spaces efficiently in time and space [15].
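To illustrate how a kernel sidesteps the explicit mapping, the sketch below uses a degree-2 polynomial kernel on two-dimensional inputs, for which the feature map can still be written out by hand. The kernel choice, the function names, and the sample vectors are assumptions made for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feature_map(x):
    """Explicit map for the kernel (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Same inner product, computed without leaving the input space."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(feature_map(x), feature_map(z))  # map first, then inner product
implicit = poly2_kernel(x, z)                   # kernel evaluation only

print(explicit, implicit)
```

Both routes give the same value, but the kernel never builds the mapped vectors. For kernels whose feature space is very large or infinite-dimensional, only the kernel route is practical.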

Though there are four common kernels, we used the SVM in its simplest case of a linear hyperplane and with the radial basis function (RBF) kernel. The first reason for choosing the RBF kernel is that it non-linearly maps samples into a higher-dimensional space, so it can handle the case in which the relation between class labels and attributes is nonlinear. The second reason is the number of hyperparameters, which influences the complexity of model selection. Finally, the RBF kernel has fewer numerical difficulties.

Model Selection

Two parameters must be set when using the RBF kernel: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done.

The goal is to identify a good (C, γ) so that the classifier can accurately predict unknown data, i.e., testing data. Note that achieving high training accuracy (i.e., a classifier that accurately predicts training data whose class labels are already known) may not be useful. Therefore, a common approach is to separate the training data into two parts, one of which is treated as unknown when training the classifier (shown in Figure 3.6). The prediction accuracy on this held-out part then reflects more precisely the performance on classifying unknown data. An improved version of this procedure is k-fold cross-validation.

Figure 3.6 Support Vector Machine Operation Diagram
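The hold-out idea of splitting the training data into a part used for fitting and a part treated as unknown can be sketched as below. The function name `holdout_split`, the 30% validation fraction, and the fixed random seed are illustrative assumptions.

```python
import random


def holdout_split(samples, validation_fraction=0.2, seed=0):
    """Randomly split samples into (training part, held-out part)."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_validation = int(round(len(samples) * validation_fraction))
    validation_idx = indices[:n_validation]
    train_idx = indices[n_validation:]
    train_part = [samples[i] for i in train_idx]
    validation_part = [samples[i] for i in validation_idx]
    return train_part, validation_part


# Stand-in for ten labeled training samples.
samples = list(range(10))
train_part, validation_part = holdout_split(samples, validation_fraction=0.3, seed=42)

print("used for training:", sorted(train_part))
print("treated as unknown:", sorted(validation_part))
```

The classifier would be trained on the first part only, and its accuracy measured on the second part.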

Cross-validation

In k-fold cross-validation, the training data is randomly split into k mutually exclusive subsets of approximately equal size. Each subset in turn is tested using the classifier trained on the remaining k-1 subsets. This procedure is repeated k times, so that each subset is used for testing exactly once. The cross-validation accuracy is the percentage of data that is correctly classified [16]. Figure 3.7 shows a 5-fold cross-validation example.

Figure 3.7 The 5-fold Cross-Validation Example
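The k-fold procedure can be sketched in a few lines. To keep the example self-contained, a trivial one-dimensional threshold classifier stands in for the SVM; that classifier, the toy data, and all function names are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_accuracy(xs, ys, k, train_fn, predict_fn, seed=0):
    """Percentage of samples classified correctly while held out."""
    correct = 0
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct += sum(predict_fn(model, xs[i]) == ys[i] for i in test_idx)
    return 100.0 * correct / len(xs)


# Stand-in classifier (not an SVM): threshold halfway between the class means.
def train_threshold(xs, ys):
    positives = [x for x, y in zip(xs, ys) if y == 1]
    negatives = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2.0


def predict_threshold(threshold, x):
    return 1 if x > threshold else -1


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 5.0, 5.5, 6.0, 6.5, 7.0]
ys = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]

folds = k_fold_indices(len(xs), 5)
accuracy = cross_validation_accuracy(xs, ys, 5, train_threshold, predict_threshold)
print("5-fold cross-validation accuracy (%):", accuracy)
```

Every sample is predicted exactly once, by a model that did not see it during training, which is what makes the resulting accuracy a fairer estimate than the training accuracy.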

Besides the estimation method, the mechanism for searching for parameter sets that make the resulting SVM model perform well is also important. The most common and reliable approach to model selection is the exhaustive grid search.

Grid-search

When searching for a good combination of C and γ, it is usual to form a two-dimensional uniform grid (say p × p) of points in a pre-specified search range (for example, C = 2^-5, 2^-3, …, 2^15 and γ = 2^-15, 2^-13, …, 2^3) and to find the combination (point) that gives the smallest value of some estimate of the generalization error. This is expensive, since it requires trying p × p pairs of (C, γ). Figure 3.8 shows a grid search example where p = 5.

Figure 3.8 An Exhaustive Grid Search Example
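A minimal sketch of the exhaustive search over the exponentially spaced ranges quoted above. In practice the error estimate for each pair would come from cross-validating an SVM; here a made-up function `toy_error` stands in for it so that the example runs on its own. All names are illustrative assumptions.

```python
import math


def exponent_grid(first_exp, last_exp, step):
    """Powers of two from 2^first_exp to 2^last_exp, inclusive."""
    return [2.0 ** e for e in range(first_exp, last_exp + 1, step)]


def grid_search(c_values, gamma_values, estimate_error):
    """Try every (C, gamma) pair and keep the one with the lowest error."""
    best = None
    for c in c_values:
        for gamma in gamma_values:
            error = estimate_error(c, gamma)
            if best is None or error < best[0]:
                best = (error, c, gamma)
    return best


def toy_error(c, gamma):
    """Hypothetical stand-in for a cross-validation error estimate."""
    return (math.log2(c) - 3.0) ** 2 + (math.log2(gamma) + 5.0) ** 2


c_values = exponent_grid(-5, 15, 2)       # 2^-5, 2^-3, ..., 2^15
gamma_values = exponent_grid(-15, 3, 2)   # 2^-15, 2^-13, ..., 2^3

best_error, best_c, best_gamma = grid_search(c_values, gamma_values, toy_error)
print("pairs tried:", len(c_values) * len(gamma_values))
print("best C:", best_c, "best gamma:", best_gamma)
```

The cost is one error estimate per grid point, which is why the number of grid points directly determines the expense of the search.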

Scaling

Scaling the data vectors before applying SVM is very important [17]. The main advantage is that it prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges. Another advantage is that it avoids numerical difficulties during the calculation.

Because kernel values usually depend on the inner products of feature vectors, large attribute values might cause numerical problems.
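One possible way to scale attributes is a linear mapping of each attribute to a fixed interval. The sketch below assumes the target range [-1, 1] and reuses the ranges found on the training data for later data; neither choice is stated in this section, and the function names and sample values are hypothetical.

```python
def fit_ranges(rows):
    """Record (min, max) of every attribute (column) in the training rows."""
    return [(min(column), max(column)) for column in zip(*rows)]


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each attribute so its training range becomes [lower, upper]."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                # Constant attribute: place it at the middle of the interval.
                new_row.append((lower + upper) / 2.0)
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


# Two attributes with very different numeric ranges.
train_rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
test_rows = [[4.0, 2000.0]]

ranges = fit_ranges(train_rows)
scaled_train = scale_rows(train_rows, ranges)
scaled_test = scale_rows(test_rows, ranges)  # same ranges as the training data

print(scaled_train)
print(scaled_test)
```

After scaling, both attributes span the same interval on the training data, so neither dominates the inner products. Values outside the training range, as in the test row, simply fall outside [-1, 1].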