A Guide to Support Vector Machines

Chih-Jen Lin

July 13, 2006

Preface

These are the course notes that I have been using for the past few years. The main purpose is to help students without a machine learning background learn how to use support vector machines effectively as a tool.

The first part (basic topics) is suitable for people who would like to use SVM. The second part (advanced topics) is for people who would like to know implementation details of SVM software.

Part I

Basic Topics

Chapter 1 Introduction to Classification Problems

1.1 Data Classification

The basic idea of the data classification problem can be described simply as follows: given training data with known labels (classes), we would like to learn a model that can be used for predicting data with unknown labels.

For example, suppose the height and weight of the following six persons are available and medical experts have identified whether each of them is over-weighted or under-weighted:

Table 1.1.1: Six training points

ID             1    2    3    4    5    6
Weight (kg)    50   60   70   70   80   90
Height (m)     1.6  1.7  1.9  1.5  1.7  1.6
Over-weighted  No   No   No   Yes  Yes  Yes

They are considered as “two classes” of data and the label is whether the person is over- or under-weighted. The above data can also be illustrated by Figure 1.1.

Consulting with experts may be expensive, so we would like to construct a model from available information. Then for any person, this model could easily predict whether he/she is over- or under-weighted.

Figure 1.1: Six training points (axes: Weight and Height)

1.2 Training and Testing Error

A model can be a rule like

If weight ≥ 60, then over-weighted.

Clearly, this rule does not make sense, as some tall people may be thin even though their weights are more than 60. A better model may be the following rule:

If weight/(height)² ≥ 23, then over-weighted.

The goal of classification is to identify a good model so that future prediction is accurate.

Here “weight” and “height” are called features or attributes. In statistics, they are called variables. Each person is considered as a data instance (or a data observation).

Mathematically, we have x = [weight, height] as a data instance and y = 1 or −1 as the label of each instance. Here, there are six training instances x1, . . . , x6 with corresponding class labels y = [−1, −1, −1, 1, 1, 1]^T (if −1 and 1 mean under-weighted and over-weighted, respectively).
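To make the notation concrete, the following minimal sketch stores the six training instances of Table 1.1.1 and measures how well the two rules of this section reproduce the known labels. The function names (rule_weight, rule_ratio, training_accuracy) are our own choices for illustration.

```python
# Training data from Table 1.1.1: (weight in kg, height in m) and labels.
X = [(50, 1.6), (60, 1.7), (70, 1.9), (70, 1.5), (80, 1.7), (90, 1.6)]
y = [-1, -1, -1, 1, 1, 1]  # -1: under-weighted, +1: over-weighted


def rule_weight(x):
    """First rule: over-weighted if weight >= 60."""
    weight, _ = x
    return 1 if weight >= 60 else -1


def rule_ratio(x):
    """Second rule: over-weighted if weight / height^2 >= 23."""
    weight, height = x
    return 1 if weight / height ** 2 >= 23 else -1


def training_accuracy(rule):
    """Fraction of training instances whose predicted label equals y."""
    correct = sum(1 for x_i, y_i in zip(X, y) if rule(x_i) == y_i)
    return correct / len(X)


acc_weight = training_accuracy(rule_weight)
acc_ratio = training_accuracy(rule_ratio)
print(acc_weight, acc_ratio)
```

On these six points the first rule mislabels persons 2 and 3, while the second rule agrees with all six labels.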

1.3 Nearest Neighbor Methods

Here, we introduce a simple classifier: nearest neighbor. For any new person, we find the person in the training set whose (weight, height) is closest. For example, if a person has weight = 70 and height = 1.8, then the closest one in the training set is the third person, who is under-weighted. Thus, the new one is predicted as under-weighted as well. Note that weight and height are in two very different ranges, so we may have to scale them before calculating the distance. This issue will be discussed in Chapter 3.2.

If the Euclidean distance is considered, for any given $x$, the nearest neighbor method essentially predicts it to be in the class of the training instance $x_{i^*}$, where

$$ i^* = \arg\min_i \|x - x_i\|^2. $$

We may worry that the closest instance is a wrongly recorded data point, so sometimes the k closest points are considered and the prediction is made by a majority vote. This method is called k-nearest neighbor. How to select k is an issue and will be discussed in Chapter 1.6.
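The nearest neighbor rule can be sketched in a few lines. This is only an illustration on the unscaled data of Table 1.1.1; the function names are our own, and ties between equally distant points are broken by training order, which is an arbitrary choice.

```python
from collections import Counter

# Training data from Table 1.1.1: (weight in kg, height in m) and labels.
X = [(50, 1.6), (60, 1.7), (70, 1.9), (70, 1.5), (80, 1.7), (90, 1.6)]
y = [-1, -1, -1, 1, 1, 1]  # -1: under-weighted, +1: over-weighted


def squared_distance(a, b):
    """Squared Euclidean distance between two instances."""
    return sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b))


def knn_predict(x, X_train, y_train, k=1):
    """Majority vote among the k training instances closest to x."""
    order = sorted(range(len(X_train)),
                   key=lambda i: squared_distance(x, X_train[i]))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# The new person from the text: weight = 70, height = 1.8.
pred = knn_predict((70, 1.8), X, y, k=1)
print(pred)
```

With k = 1 the closest training point is the third person, so the prediction is under-weighted, as stated in the text.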

1.4 Linear Classifiers

A model after the training procedure can be, for example, a rule set, or the whole training set like that by the nearest neighbor. Here we show that a straight line can be a model as well.

Figure 1.2: Example of a linear classifier (axes: Weight and Height)

In Figure 1.2, a line

$$ 0.2 \times \text{weight} - 10 \times \text{height} + 3 = 0 $$

separates all the training data. In general we represent such a line as

$$ w^Tx + b = 0, $$

where $x = [\text{weight}, \text{height}]^T$, $w = [0.2, -10]^T$, and $b = 3$. Then for any new data $x$, we check whether it is on the right- or left-hand side of the line. That is,

$$ \text{predict } x \text{ as } \begin{cases} \text{over-weighted} & \text{if } w^Tx + b > 0, \\ \text{under-weighted} & \text{if } w^Tx + b < 0. \end{cases} $$

How to find such a straight line will be discussed in Chapter 2.
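As a quick check of the line given in this section, the sketch below evaluates w^Tx + b for the six training points of Table 1.1.1 and compares the sign with the known labels. The helper names are ours.

```python
# Training data from Table 1.1.1: (weight in kg, height in m) and labels.
X = [(50, 1.6), (60, 1.7), (70, 1.9), (70, 1.5), (80, 1.7), (90, 1.6)]
y = [-1, -1, -1, 1, 1, 1]  # -1: under-weighted, +1: over-weighted

# The line 0.2 * weight - 10 * height + 3 = 0 from the text.
w = [0.2, -10.0]
b = 3.0


def decision_value(x):
    """Value of w^T x + b for one instance."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def predict(x):
    """+1 (over-weighted) if w^T x + b > 0, otherwise -1 (under-weighted)."""
    return 1 if decision_value(x) > 0 else -1


predictions = [predict(x_i) for x_i in X]
print(predictions)
```

All six predictions agree with the labels, which is what "separates all the training data" means.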

1.5 Overfitting and Underfitting

Figure 1.3: An overfitting classifier and a better classifier (● and ▲: training data; ○ and △: testing data). Panels: (a) training data and an overfitting classifier; (b) applying an overfitting classifier on testing data; (c) training data and a better classifier; (d) applying a better classifier on testing data.

Note that it may not be useful to achieve high training accuracy (i.e., classifiers accurately predict training data whose class labels are indeed known). A clear illustration is in Figure 1.3. It is a problem with two classes of data: triangles and circles. Filled circles and triangles are the training data while hollow circles and triangles are the testing data. The testing accuracy of the classifier in Figures 1.3(a) and 1.3(b) is not good since it overfits the training data. On the other hand, the classifier in Figures 1.3(c) and 1.3(d) does not overfit the training data and hence gives better testing accuracy.

Some training data may be wrongly recorded, so sometimes we should allow training errors. That is, under the obtained model, we predict some training data to be in their opposite class. Note that if there are no duplicated data instances, we can always fit training data so that training accuracy is 100%. An example is in Figure 1.4.

Figure 1.4: We can always achieve 100% training accuracy

Such perfect training accuracy usually comes from overfitting and does not imply good testing accuracy, so we should avoid overfitting the training data.

On the other hand, a good model should also avoid underfitting, which means the model does not extract enough information from the training data. An example is in Figure 1.5. Clearly, the linear classifier does not use the information that most circles are at the upper-right corner and most triangles are at the lower-left. Therefore, from the discussion in this section, we conclude that a good classifier should

avoid overfitting and avoid underfitting.

1.6 Cross Validation

The above discussion also hints at another important fact about classification problems:

Training accuracy is not important; only test accuracy counts.

Figure 1.5: An underfitting example

This statement is quite obvious, as for training data we already know their class labels. However, as the true class labels of test data are not known, how do we estimate the performance on predicting them? A common way is to separate the training data into two parts, one of which is considered unknown when training the classifier. Then the prediction accuracy on this set can more precisely reflect the performance on classifying unknown data. An improved version of this procedure is cross-validation.

In v-fold cross-validation, we first divide the training set into v subsets of equal size. Sequentially one subset is tested using the classifier trained on the remaining v − 1 subsets. Thus, each instance of the whole training set is predicted once so the cross-validation accuracy is the percentage of data which are correctly classified.

Usually v ≥ 5 is used.

In Chapter 1.3, we mentioned the k-nearest neighbor method; the selection of k can be done via cross validation. For example, sequentially we try k = 1, 3, 5, 7, . . . and calculate the corresponding cross validation accuracy. The k with the highest accuracy is the best and is used for future prediction. Note that we consider only odd k, so the majority vote in prediction can produce a single winner.
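The procedure can be sketched as follows. This is a minimal illustration, not a reference implementation: the toy data set is hypothetical, and assigning instance i to fold (i mod v) is just one simple way to obtain subsets of nearly equal size.

```python
from collections import Counter


def knn_predict(x, X_train, y_train, k):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(X_train)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(x, X_train[i])))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_validation_accuracy(X, y, k, v):
    """v-fold cross-validation accuracy of k-nearest neighbor."""
    n = len(X)
    correct = 0
    for fold in range(v):
        # Instance i belongs to fold (i mod v).
        test_idx = [i for i in range(n) if i % v == fold]
        train_idx = [i for i in range(n) if i % v != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        for i in test_idx:
            if knn_predict(X[i], X_train, y_train, k) == y[i]:
                correct += 1
    return correct / n  # each instance is predicted exactly once


# Hypothetical 1-D toy data: label -1 below 1.0, +1 otherwise,
# with one wrongly recorded label.
X = [(i / 10,) for i in range(20)]
y = [-1 if i < 10 else 1 for i in range(20)]
y[3] = 1

cv_accuracy = {k: cross_validation_accuracy(X, y, k, v=5)
               for k in (1, 3, 5, 7)}
best_k = max(cv_accuracy, key=cv_accuracy.get)
print(cv_accuracy, best_k)
```

Only odd k are tried, so with two classes the vote always has a single winner.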

1.7 Exercises

1. Write a k-nearest neighbor code to train

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary/ijcnn1.bz2

and test

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/binary/ijcnn1.t.bz2

You need to conduct cross-validation on the training data in order to select a good k.

Chapter 2 Support Vector Classification

2.1 Linear Separating Hyperplane with Maximal Margin

The original idea of SVM classification is to use a linear separating hyperplane to create a classifier. Given training vectors $x_i$, $i = 1, \ldots, l$, of length $n$, and a vector $y$ defined as follows

$$ y_i = \begin{cases} 1 & \text{if } x_i \text{ in class 1}, \\ -1 & \text{if } x_i \text{ in class 2}, \end{cases} $$

the support vector technique tries to find the separating hyperplane with the largest margin between two classes, measured along a line perpendicular to the hyperplane.

For example, in Figure 2.1, two classes could be fully separated by a dotted line w^Tx + b = 0. We would like to decide the line with the largest margin. In other words, intuitively we think that the distance between two classes of training data should be as large as possible. That means we find a line with parameters w and b such that the distance between w^Tx + b = ±1 is maximized.

The distance between $w^Tx + b = 1$ and $w^Tx + b = -1$ can be calculated in the following way. Consider a point $\bar{x}$ on $w^Tx + b = -1$:

[Diagram: the point $\bar{x}$ on the line $w^Tx + b = -1$ and the point $\bar{x} + tw$ on the line $w^Tx + b = 1$, joined by the vector $tw$]

Figure 2.1: Separating hyperplane (the lines $w^Tx + b = +1$, $0$, and $-1$)

As $w$ is the “normal vector” of the line $w^Tx + b = -1$, $w$ and the line are perpendicular to each other. Starting from $\bar{x}$ and moving along the direction $w$, we assume $\bar{x} + tw$ touches the line $w^Tx + b = 1$. Thus,

$$ w^T(\bar{x} + tw) + b = 1 \quad \text{and} \quad w^T\bar{x} + b = -1. $$

Then, $t w^Tw = 2$, so the distance (i.e., the length of $tw$) is $\|tw\| = 2\|w\|/(w^Tw) = 2/\|w\|$. Note that $\|w\| = \sqrt{w_1^2 + \cdots + w_n^2}$. As maximizing $2/\|w\|$ is equivalent to minimizing $w^Tw/2$, we have the following problem:

$$ \min_{w,b} \quad \frac{1}{2} w^Tw $$
$$ \text{subject to} \quad y_i(w^Tx_i + b) \ge 1, \quad i = 1, \ldots, l. \tag{2.1} $$

The constraint $y_i(w^Tx_i + b) \ge 1$ means

$$ w^Tx_i + b \ge 1 \ \text{ if } y_i = 1, \qquad w^Tx_i + b \le -1 \ \text{ if } y_i = -1. $$

That is, data in class 1 must be on the right-hand side of w^Tx + b = 0 while data in the other class must be on the left-hand side. Note that the reason for maximizing the distance between w^Tx + b = ±1 is based on Vapnik’s Structural Risk Minimization (Vapnik, 1998).

The following example gives a simple illustration of maximal-margin separating hyperplanes:

Example 2.1.1 Given two training data in $R^1$ as in the following figure:

[Figure: △ at $x = 0$ and • at $x = 1$ on the real line]

What is the separating hyperplane?

Now the two data are $x_1 = 1$, $x_2 = 0$ with $y = [+1, -1]^T$. Furthermore, $w \in R^1$, so (2.1) becomes

$$ \min_{w,b} \quad \frac{1}{2} w^2 $$
$$ \text{subject to} \quad w \cdot 1 + b \ge 1, \tag{2.2} $$
$$ -1(w \cdot 0 + b) \ge 1. \tag{2.3} $$

From (2.3), $-b \ge 1$. Putting this into (2.2), $w \ge 2$. In other words, for any $(w, b)$ which satisfies (2.2) and (2.3), $w \ge 2$. As we are minimizing $\frac{1}{2}w^2$, the smallest possibility is $w = 2$. Thus, $(w, b) = (2, -1)$ is the optimal solution. The separating hyperplane is $2x - 1 = 0$, in the middle of the two training data:

[Figure: △ at $x = 0$, • at $x = 1$, and the separating point $x = 1/2$]
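The solution of Example 2.1.1 can be checked numerically. The sketch below is a brute-force scan over a grid of (w, b) values, not how SVM software solves the problem; the grid ranges and the step 0.01 are arbitrary choices of ours.

```python
def feasible(w, b):
    """Constraints (2.2) and (2.3) for x1 = 1 (y = +1) and x2 = 0 (y = -1)."""
    return w * 1 + b >= 1 and -1 * (w * 0 + b) >= 1


best = None  # (objective, w, b) of the best feasible grid point so far
for wi in range(0, 401):            # w = 0.00, 0.01, ..., 4.00
    for bi in range(-300, 101):     # b = -3.00, -2.99, ..., 1.00
        w, b = wi / 100, bi / 100
        if feasible(w, b):
            objective = 0.5 * w * w
            if best is None or objective < best[0]:
                best = (objective, w, b)

best_objective, best_w, best_b = best
print(best_w, best_b, -best_b / best_w)  # w, b, and the point where wx + b = 0
```

The scan returns w = 2 and b = −1, so the separating point is x = 1/2, in agreement with the derivation above.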

2.2 Mapping Data to Higher Dimensional Spaces

Figure 2.2: An example which is not linearly separable

In practice, however, problems may not be linearly separable; an example is in Figure 2.2. That is, there is no $(w, b)$ which satisfies the constraints of (2.1). In this situation, we say (2.1) is “infeasible.” In (Cortes and Vapnik, 1995) the authors introduced slack variables $\xi_i$, $i = 1, \ldots, l$, in the constraints:

$$ \min_{w,b,\xi} \quad \frac{1}{2} w^Tw + C \sum_{i=1}^{l} \xi_i $$
$$ \text{subject to} \quad y_i(w^Tx_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l. \tag{2.4} $$

That is, constraints (2.4) allow that training data may not be on the correct side of the separating hyperplane $w^Tx + b = 0$. This situation happens when $\xi_i > 1$, and an example is in the following figure:

[Figure: a training instance $x_i$ on the wrong side of the hyperplane, with $w^Tx_i + b = 1 - \xi_i < -1$]

We require $\xi_i \ge 0$ because if $\xi_i < 0$, then $y_i(w^Tx_i + b) \ge 1 - \xi_i \ge 1$ and the training data is already on the correct side. The new problem is always feasible since for any $(w, b)$,

$$ \xi_i \equiv \max(0, 1 - y_i(w^Tx_i + b)), \quad i = 1, \ldots, l, $$

gives a feasible solution $(w, b, \xi)$.

Using this setting, we may worry that for linearly separable data, some $\xi_i > 1$ and hence the corresponding data are wrongly classified. For the case that most data except some noisy ones are separable by a linear function, we would like $w^Tx + b = 0$ to correctly classify the majority of points. Thus, in the objective function we add a penalty term $C\sum_{i=1}^{l}\xi_i$, where $C > 0$ is the penalty parameter. To have the objective value as small as possible, most $\xi_i$ should be zero, so the constraint goes back to its original form. Theoretically we can prove that if data are linearly separable and $C$ is larger than a certain number, problem (2.4) goes back to (2.1) and all $\xi_i$ are zero (Lin, 2001a).
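To see how the slack variables and the penalty term behave, the following sketch evaluates them for a fixed (w, b). The data are those of Example 2.1.1 plus one noisy instance that we add purely for illustration; the function names are ours.

```python
def dot(a, b):
    """Inner product of two vectors given as sequences."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def slack_variables(w, b, X, y):
    """Smallest feasible slack: max(0, 1 - y_i (w^T x_i + b)) per instance."""
    return [max(0.0, 1.0 - y_i * (dot(w, x_i) + b)) for x_i, y_i in zip(X, y)]


def primal_objective(w, b, X, y, C):
    """Objective of (2.4), using the smallest feasible slack for (w, b)."""
    return 0.5 * dot(w, w) + C * sum(slack_variables(w, b, X, y))


# Example 2.1.1 plus one hypothetical noisy instance at x = 0.4 labeled +1.
X = [(1.0,), (0.0,), (0.4,)]
y = [1, -1, 1]
w, b = (2.0,), -1.0

slack = slack_variables(w, b, X, y)
objective = primal_objective(w, b, X, y, C=1.0)
print(slack, objective)
```

The first two instances need no slack, while the added instance lies on the wrong side of 2x − 1 = 0 and has slack larger than 1.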

Unfortunately, such a setting is not enough for practical use. If data are distributed in a highly nonlinear way, employing only a linear function causes many training instances to be on the wrong side of the hyperplane. So underfitting occurs and the decision function does not perform well.

To fit the training data better, we may think of using a nonlinear curve like that in Figure 2.2. The problem is that it is very difficult to model nonlinear curves. All we are familiar with are elliptic, hyperbolic, or parabolic curves, which are far from enough in practice. Instead of using more sophisticated curves, another approach is to map data into a higher dimensional space. For example, in the example in Chapter 1, each data instance has two features (attributes): height and weight. We may consider two other attributes

height − weight, weight/(height²).

Such features may provide more information for separating under-weighted/over-weighted people. Each new data instance is now in a four-dimensional space, so if the two new features are good, it should be easier to have a separating hyperplane so that most $\xi_i$ are zero.

Thus SVM non-linearly transforms the original input space into a higher dimensional feature space. More precisely, the training data x is mapped into a (possibly infinite) vector in a higher dimensional space:

φ(x) = [φ₁(x), φ₂(x), . . .].

In this higher dimensional space, it is more likely that data can be linearly separated. An example of mapping $x$ from $R^3$ to $R^{10}$ is as follows:

$$ \phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3). $$

An extreme example is to map a data instance x ∈ R¹ to an infinite dimensional space:

$$ \phi(x) = \left[1, \frac{x}{1!}, \frac{x^2}{2!}, \frac{x^3}{3!}, \ldots\right]^T. $$

We then try to find a linear separating plane in a higher dimensional space so (2.4) becomes

$$ \min_{w,b,\xi} \quad \frac{1}{2} w^Tw + C \sum_{i=1}^{l} \xi_i $$
$$ \text{subject to} \quad y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l. \tag{2.5} $$

2.3 The Dual Problem

The remaining problem is how to effectively solve (2.5). Especially after data are mapped into a higher dimensional space, the number of variables (w, b) becomes very large or even infinite. We handle this difficulty by solving the dual problem of (2.5):

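$$ \min_{\alpha} \quad \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j \phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l}\alpha_i $$
$$ \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \tag{2.6} $$
$$ \sum_{i=1}^{l} y_i\alpha_i = 0. $$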

This new problem of course has some relation with the original problem (2.5), and we hope that it can be solved more easily. Sometimes we write (2.6) in a matrix form for convenience:

$$ \min_{\alpha} \quad \frac{1}{2}\alpha^TQ\alpha - e^T\alpha $$
$$ \text{subject to} \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \tag{2.7} $$
$$ y^T\alpha = 0. $$

In (2.7), $e$ is the vector of all ones, $C$ is the upper bound, $Q$ is an $l$ by $l$ positive semidefinite matrix, $Q_{ij} \equiv y_iy_jK(x_i, x_j)$, and $K(x_i, x_j) \equiv \phi(x_i)^T\phi(x_j)$ is the kernel, which will be addressed in Chapter 2.4.

If (2.7) is called the “dual” problem of (2.5), we refer to (2.5) as the “primal” problem. Suppose $(\bar{w}, \bar{b}, \bar{\xi})$ and $\bar{\alpha}$ are optimal solutions of the primal and dual problems, respectively. Then the following two properties hold:

$$ \bar{w} = \sum_{i=1}^{l} \bar{\alpha}_i y_i \phi(x_i), \tag{2.8} $$

$$ \frac{1}{2}\bar{w}^T\bar{w} + C\sum_{i=1}^{l}\bar{\xi}_i = e^T\bar{\alpha} - \frac{1}{2}\bar{\alpha}^TQ\bar{\alpha}. \tag{2.9} $$

In other words, if the dual problem is solved with a solution $\bar{\alpha}$, the optimal primal solution $\bar{w}$ is easily obtained from (2.8). If an optimal $\bar{b}$ can also be easily found, the decision function is hence determined.

Thus, the crucial point is whether the dual is easier to solve than the primal.

The number of variables in the dual, which is the size of the training set: l, is a fixed number. In contrast, the number of variables in the primal problem varies depending on how data are mapped to a higher dimensional space. Therefore, moving from the primal to the dual means that we solve a finite-dimensional optimization problem instead of a possibly infinite-dimensional problem.

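We illustrate this primal-dual relationship using data in Example 2.1.1 without mapping them to a higher dimensional space. As the problem is linearly separable, it is fine to consider the formulation (2.1) without slack variables $\xi_i$, $i = 1, \ldots, l$. Then the dual is

$$ \min_{\alpha} \quad \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i\alpha_j y_i y_j x_i^Tx_j - \sum_{i=1}^{l}\alpha_i $$
$$ \text{subject to} \quad 0 \le \alpha_i, \quad i = 1, \ldots, l, $$
$$ \sum_{i=1}^{l} y_i\alpha_i = 0. $$

Using data in Example 2.1.1, the objective function is

$$ \frac{1}{2}\alpha_1^2 - (\alpha_1 + \alpha_2) = \frac{1}{2}\begin{bmatrix}\alpha_1 & \alpha_2\end{bmatrix}\begin{bmatrix}1 & 0\\ 0 & 0\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix} - \begin{bmatrix}1 & 1\end{bmatrix}\begin{bmatrix}\alpha_1\\ \alpha_2\end{bmatrix}. $$

Constraints are

$$ \alpha_1 - \alpha_2 = 0, \quad 0 \le \alpha_1, \quad 0 \le \alpha_2. $$

Substituting $\alpha_2 = \alpha_1$ into the objective function,

$$ \frac{1}{2}\alpha_1^2 - 2\alpha_1 $$

has the smallest value at $\alpha_1 = 2$. As $[2, 2]^T$ satisfies constraints $0 \le \alpha_1$ and $0 \le \alpha_2$, it is the optimal solution. Using the primal-dual relation (2.8),

$$ w = y_1\alpha_1x_1 + y_2\alpha_2x_2 = 1 \cdot 2 \cdot 1 + (-1) \cdot 2 \cdot 0 = 2, $$

the same as what we obtained by directly solving the primal problem.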
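The same calculation can be replayed numerically. The sketch below builds Q for Example 2.1.1, scans the dual objective along the line α1 = α2 forced by the equality constraint, and then recovers w from (2.8) and compares the two sides of (2.9). The scan over a grid of step 0.01 is our own shortcut for this tiny problem, not a general dual solver.

```python
# Data of Example 2.1.1.
x = [1.0, 0.0]
y = [1, -1]
l = len(x)

# Q_ij = y_i y_j x_i x_j (no mapping to a higher dimensional space).
Q = [[y[i] * y[j] * x[i] * x[j] for j in range(l)] for i in range(l)]


def quadratic_term(alpha):
    """alpha^T Q alpha."""
    return sum(alpha[i] * Q[i][j] * alpha[j]
               for i in range(l) for j in range(l))


def dual_objective(alpha):
    """(1/2) alpha^T Q alpha - e^T alpha, the function to be minimized."""
    return 0.5 * quadratic_term(alpha) - sum(alpha)


# y^T alpha = 0 forces alpha_1 = alpha_2 = a; scan a = 0.00, 0.01, ..., 5.00.
candidates = [i / 100 for i in range(0, 501)]
a_best = min(candidates, key=lambda a: dual_objective([a, a]))
alpha = [a_best, a_best]

# Primal-dual relation (2.8) and the two sides of (2.9) without slack.
w = sum(y[i] * alpha[i] * x[i] for i in range(l))
primal_value = 0.5 * w * w
dual_value = sum(alpha) - 0.5 * quadratic_term(alpha)
print(alpha, w, primal_value, dual_value)
```

Both sides of (2.9) evaluate to 2 here, and w = 2 matches the primal solution of Example 2.1.1.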

The calculation of $b$ is easy, but is left to Chapter 6 for implementation details. The remaining issue of using the dual problem is about the inner product $\phi(x_i)^T\phi(x_j)$.

If $\phi(x)$ is an infinitely long vector, there is no way to fully write it down and then calculate the inner product. Thus, even though the dual possesses the advantage of having a finite number of variables, we could not even write the problem down before solving it. This is resolved by using special mapping functions $\phi$ so that $\phi(x_i)^T\phi(x_j)$ is efficiently calculated. Details are in the next section.

2.4 Kernel and Decision Functions

Consider a special $\phi(x)$ mentioned earlier (assume $x \in R^3$):

$$ \phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3). $$

In this case it is easy to see that $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2$, which is easier to calculate than a direct inner product. To be more precise, a direct calculation of $\phi(x_i)^T\phi(x_j)$ takes 10 multiplications and 9 additions, but using $(1 + x_i^Tx_j)^2$, only four multiplications and three additions are needed. Therefore, if a special $\phi(x)$ is considered, even though it is a long vector, $\phi(x_i)^T\phi(x_j)$ may still be easily available.
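The identity can be verified numerically for any pair of vectors in R³. The sketch below compares the explicit 10-dimensional inner product with the closed form; the two test vectors are arbitrary values chosen for illustration.

```python
import math


def dot(a, b):
    """Inner product of two vectors given as sequences."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def phi(x):
    """Explicit mapping from R^3 to R^10 used in the text."""
    x1, x2, x3 = x
    r = math.sqrt(2.0)
    return [1.0, r * x1, r * x2, r * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r * x1 * x2, r * x1 * x3, r * x2 * x3]


def kernel(x, z):
    """Closed form (1 + x^T z)^2 of the same inner product."""
    return (1.0 + dot(x, z)) ** 2


# Two arbitrary vectors in R^3.
xi = (1.0, 2.0, 3.0)
xj = (0.5, -1.0, 2.0)

explicit = dot(phi(xi), phi(xj))   # inner product in R^10
implicit = kernel(xi, xj)          # never forms phi at all
print(explicit, implicit)
```

Both values agree up to floating-point rounding, although the second one never forms the mapped vectors.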

We call such inner products the “kernel function.” Some popular kernels are, for example,

1. $e^{-\gamma\|x_i - x_j\|^2}$ (Gaussian kernel or radial basis function (RBF) kernel),

2. $(x_i^Tx_j/\gamma + \delta)^d$ (polynomial kernel),

where γ, d, and δ are kernel parameters. The following calculation shows that the Gaussian (RBF) kernel indeed is an inner product of two vectors in an infinite dimensional space. Assume x ∈ R¹ and γ > 0.

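$$
\begin{aligned}
e^{-\gamma\|x_i - x_j\|^2} &= e^{-\gamma(x_i - x_j)^2} \\
&= e^{-\gamma x_i^2 + 2\gamma x_ix_j - \gamma x_j^2} \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 + \frac{2\gamma x_ix_j}{1!} + \frac{(2\gamma x_ix_j)^2}{2!} + \frac{(2\gamma x_ix_j)^3}{3!} + \cdots\right) \\
&= e^{-\gamma x_i^2 - \gamma x_j^2}\left(1 \cdot 1 + \sqrt{\frac{2\gamma}{1!}}x_i \cdot \sqrt{\frac{2\gamma}{1!}}x_j + \sqrt{\frac{(2\gamma)^2}{2!}}x_i^2 \cdot \sqrt{\frac{(2\gamma)^2}{2!}}x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}}x_i^3 \cdot \sqrt{\frac{(2\gamma)^3}{3!}}x_j^3 + \cdots\right) \\
&= \phi(x_i)^T\phi(x_j),
\end{aligned}
$$

where

$$ \phi(x) = e^{-\gamma x^2}\left[1, \sqrt{\frac{2\gamma}{1!}}x, \sqrt{\frac{(2\gamma)^2}{2!}}x^2, \sqrt{\frac{(2\gamma)^3}{3!}}x^3, \cdots\right]^T. $$

Note that $\gamma > 0$ is used for the existence of terms such as $\sqrt{\frac{2\gamma}{1!}}$, $\sqrt{\frac{(2\gamma)^3}{3!}}$, etc.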
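Although φ(x) is infinitely long, its first few components already reproduce the kernel value closely. The sketch below truncates φ(x) to a finite number of terms and compares the inner product with the exact RBF value for x in R¹; the values of γ, the two points, and the truncation length 20 are arbitrary choices for illustration.

```python
import math


def phi_truncated(x, gamma, terms):
    """First `terms` components of the infinite mapping phi(x), x in R^1."""
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2.0 * gamma) ** n / math.factorial(n)) * x ** n
            for n in range(terms)]


def rbf(xi, xj, gamma):
    """Exact RBF kernel value for scalars xi and xj."""
    return math.exp(-gamma * (xi - xj) ** 2)


gamma = 0.5
xi, xj = 0.8, -0.3

approx = sum(a * b for a, b in zip(phi_truncated(xi, gamma, 20),
                                   phi_truncated(xj, gamma, 20)))
exact = rbf(xi, xj, gamma)
print(approx, exact)
```

For these values the truncated inner product and the exact kernel value agree to many digits, because the omitted terms of the exponential series are tiny.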

After (2.7) is solved with a solution $\alpha$, the vectors $x_i$ for which $\alpha_i > 0$ are called support vectors. Then, from (2.8), a decision function is written as

$$ f(x) = \text{sign}(w^T\phi(x) + b) = \text{sign}\left(\sum_{i=1}^{l} y_i\alpha_i\phi(x_i)^T\phi(x) + b\right). \tag{2.10} $$

In other words, for a test vector $x$, if $\sum_{i=1}^{l} y_i\alpha_i\phi(x_i)^T\phi(x) + b > 0$, we classify it to be in class 1. Otherwise, we think it is in the second class. We can see that only support vectors will affect results in the prediction stage. In general, the number of support vectors is not large. Therefore we can say SVM is used to find important data (support vectors) from training data.
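The prediction rule can be sketched as follows. The support vectors, the values of α and b, and γ below are made-up numbers that only illustrate the formula; they are not the result of solving (2.7) on any real data set.

```python
import math


def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))


def decision_value(x, support_vectors, labels, alphas, b, gamma):
    """sum_i y_i alpha_i K(x_i, x) + b, summed over support vectors only."""
    return sum(y_i * a_i * rbf_kernel(sv, x, gamma)
               for sv, y_i, a_i in zip(support_vectors, labels, alphas)) + b


def predict(x, support_vectors, labels, alphas, b, gamma):
    """Class +1 if the decision value is positive, otherwise class -1."""
    value = decision_value(x, support_vectors, labels, alphas, b, gamma)
    return 1 if value > 0 else -1


# Hypothetical model: two support vectors with y^T alpha = 0.
support_vectors = [(0.0, 0.0), (1.0, 1.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
b = 0.0
gamma = 1.0

pred_a = predict((0.9, 0.8), support_vectors, labels, alphas, b, gamma)
pred_b = predict((0.1, 0.0), support_vectors, labels, alphas, b, gamma)
print(pred_a, pred_b)
```

Training instances with zero α never enter the sum, which is why only support vectors need to be stored for prediction.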

We use Figure 2.2 as an illustration. Two classes of training data are not linearly separable. Using the RBF kernel, we obtain a hyperplane w^Tφ(x) + b = 0. In the original space, it is indeed a nonlinear curve

$$ \sum_{i=1}^{l} y_i\alpha_i\phi(x_i)^T\phi(x) + b = 0. \tag{2.11} $$

In the figure, all points in red color are support vectors and they are selected from both classes of training data. Clearly, support vectors, which are close to the nonlinear curve (2.11), are the more important points.

Figure 2.3: Support vectors (marked as +) are important data from training data

2.5 Multi-class SVM

The discussion so far assumes that data are in only two classes. Many practical applications involve more classes. For example, hand-written digit recognition considers data in 10 classes: digits 0 to 9. There are many ways to extend SVM for such cases. Here, we discuss two simple methods.

2.5.1 One-against-all Multi-class SVM

This commonly mis-named method should be called “one-against-the-rest.” It constructs binary SVM models so that each one is trained with one class as positive and the rest as negative. We illustrate this method by a simple situation of four classes.

The four two-class SVMs are

y_i = 1    y_i = −1         Decision function
class 1    classes 2,3,4    f¹(x) = (w¹)^Tx + b¹
class 2    classes 1,3,4    f²(x) = (w²)^Tx + b²
class 3    classes 1,2,4    f³(x) = (w³)^Tx + b³
class 4    classes 1,2,3    f⁴(x) = (w⁴)^Tx + b⁴

For any test data x, if it is in the ith class, we would expect that

$$ f^i(x) \ge 1 \ \text{ and } \ f^j(x) \le -1, \ \text{ if } j \ne i. $$

This “expectation” directly follows from our setting of training the four two-class problems and from the assumption that data are correctly separated. Therefore, $f^i(x)$ has the largest value among $f^1(x), \ldots, f^4(x)$ and hence the decision rule is

$$ \text{Predicted class} = \arg\max_{i=1,\ldots,4} f^i(x). $$
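The decision rule is a simple arg max over the four decision values. In the sketch below the four pairs (w^i, b^i) are hypothetical numbers in R², chosen only to show the mechanics; in practice they come from training the four two-class SVMs.

```python
def dot(a, b):
    """Inner product of two vectors given as sequences."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


# Hypothetical (w^i, b^i) for classes 1..4; real values come from training.
models = {
    1: ([1.0, 0.0], 0.0),
    2: ([-1.0, 0.0], 0.0),
    3: ([0.0, 1.0], 0.0),
    4: ([0.0, -1.0], 0.0),
}


def decision_values(x):
    """f^i(x) = (w^i)^T x + b^i for every class i."""
    return {i: dot(w, x) + b for i, (w, b) in models.items()}


def predict(x):
    """Class whose decision function gives the largest value."""
    values = decision_values(x)
    return max(values, key=values.get)


pred_a = predict((2.0, 0.5))
pred_b = predict((-0.2, -3.0))
print(pred_a, pred_b)
```

Note that the rule only compares the values f^i(x); it does not require any of them to exceed 1.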

2.5.2 One-against-one Multi-class SVM

This method also constructs several two-class SVMs but each one is trained on data from only two different classes. Thus, this method is sometimes called a “pairwise” approach. For the same example of four classes, six two-class problems are constructed:

y_i = 1    y_i = −1    Decision function
class 1    class 2     f¹²(x) = (w¹²)^Tx + b¹²
class 1    class 3     f¹³(x) = (w¹³)^Tx + b¹³
class 1    class 4     f¹⁴(x) = (w¹⁴)^Tx + b¹⁴
class 2    class 3     f²³(x) = (w²³)^Tx + b²³
class 2    class 4     f²⁴(x) = (w²⁴)^Tx + b²⁴
class 3    class 4     f³⁴(x) = (w³⁴)^Tx + b³⁴

For any test data x, we put it into the six functions. If the problem of classes i and j indicates the data x should be in i, the class i gets one vote. For example, assume

Classes    Winner
1  2       1
1  3       1
1  4       1
2  3       2
2  4       4
3  4       3

Then, we have

Class      1  2  3  4
# votes    3  1  1  1

Thus, x is predicted to be in the first class.

For a data set with k different classes, this method constructs k(k − 1)/2 two-class SVMs. We may worry that sometimes more than one class obtains the highest number of votes. Practically this situation does not happen so often and there are some further strategies to handle it.
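The voting step can be sketched with the six winners from the example above. Breaking a tie by taking the smallest class label is our own simplification; the text only says that further strategies exist.

```python
from collections import Counter

# Winner of each pairwise problem (i, j), taken from the example in the text.
pairwise_winners = {
    (1, 2): 1,
    (1, 3): 1,
    (1, 4): 1,
    (2, 3): 2,
    (2, 4): 4,
    (3, 4): 3,
}

# Each pairwise problem gives one vote to its winner.
votes = Counter(pairwise_winners.values())

# Class with the most votes; a tie goes to the smallest label (an assumption).
predicted = max(sorted(votes), key=lambda c: votes[c])


def num_pairwise_problems(k):
    """Number of two-class SVMs constructed for k classes: k(k - 1)/2."""
    return k * (k - 1) // 2


print(dict(votes), predicted, num_pairwise_problems(4))
```

For the ten-class digit recognition example mentioned earlier, num_pairwise_problems(10) gives 45 two-class problems.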

2.6 Notes

The formulas of SVM were developed in (Boser et al., 1992; Cortes and Vapnik, 1995), where the mapping function and the dual problem are introduced. Other general SVM references are, for example, (Cristianini and Shawe-Taylor, 2000; Schölkopf and Smola, 2002).

Many works have shown that data in a higher dimensional space have a larger opportunity to be separated (e.g., Cover (1965)). Such results explain why our mapping here should be useful.

The one-against-one method was introduced in (Knerr et al., 1990; Friedman, 1996), and the first use on SVM was in (Kreßel, 1999). A comparison of one-against-all, one-against-one, and other approaches for multi-class SVM is (Hsu and Lin, 2002).

2.7 Exercises

1. Given three training data in $R^2$ as in the following figure:

[Figure: training points (0, 0) marked ○, and (0, 1) and (1, 0) marked ●]

What is the separating hyperplane?

2. Given four training data in $R^2$ as in the following figure:

[Figure: training points (0, 0) and (1, 1) marked ○, and (0, 1) and (1, 0) marked ●]

What is the separating hyperplane?

3. Solve problem 2 via its dual optimization formula.

4. Assume $x \in R^2$ and $\gamma > 0$. Show that $e^{-\gamma\|x_i - x_j\|^2}$ can be written in the form $\phi(x_i)^T\phi(x_j)$.

Chapter 3 Training and Testing a Data set

3.1 Categorical Features

SVM requires that each data instance is represented as a vector of real numbers.

Hence, if there are categorical attributes, we first have to convert them into numeric data:

1. Use one integer number to represent an m-category attribute. For example, a three-category attribute such as {red, green, blue} can be represented as 1,2,3.

2. Use m binary values to represent an m-category attribute. Only one of the m numbers is one, and others are zero. Thus, {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).

Our experience indicates that if the number of values in an attribute is not too large, the second coding might be more stable than using a single number to represent a categorical attribute.
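Both codings are easy to write down. The sketch below follows the ordering used in the text, where {red, green, blue} become (0,0,1), (0,1,0), and (1,0,0); which position is set to one for which category is arbitrary, as long as it is used consistently. The function names are ours.

```python
def integer_code(value, categories):
    """First coding: one integer 1..m per category."""
    return categories.index(value) + 1


def one_hot(value, categories):
    """Second coding: m binary values, exactly one of which is 1."""
    m = len(categories)
    # Position chosen so that the first category maps to (0, ..., 0, 1),
    # matching the example in the text.
    position = m - 1 - categories.index(value)
    return tuple(1 if j == position else 0 for j in range(m))


colors = ["red", "green", "blue"]
codes = {c: (integer_code(c, colors), one_hot(c, colors)) for c in colors}
print(codes)
```

The integer coding implies an order and a distance between categories (red is "closer" to green than to blue), whereas the binary coding treats all categories symmetrically.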

3.2 Data Scaling

Assume there are three training instances as follows:

        height  gender   y
x1      150     F       −1
x2      180     M        1
x3      185     M        1

The attribute “height” is in centimeters. Clearly, the second attribute “gender” is consistent with the target class y and hence should be more important. However, if F and M are transformed to be 0 and 1, respectively, training data and the separating hyperplane are

[Figure: the points x1, x2, x3 and the separating hyperplane in the original space]

The separating hyperplane is nearly a vertical line, so the decision strongly depends on the first attribute. This result is not good as the second attribute should play a more important role.

If we linearly scale the first attribute to the range [0, 1] by

$$ \frac{\text{1st attribute} - 150}{185 - 150}, $$

then the new points and the separating hyperplane are

[Figure: the scaled points x1, x2, x3 and the separating hyperplane]

This, transformed back to the original space, is

[Figure: the points x1, x2, x3 and the transformed-back separating hyperplane in the original space]

Therefore, the second attribute plays a role in the decision function.

This example explains that when features are in different numerical ranges, those in larger ranges may dominate the others. Thus, a proper scaling of features before training SVM can be very important.

Another reason for doing data scaling is to avoid numerical difficulties during the calculation. For example, if the polynomial kernel

$$ (x_i^Tx_j + 1)^8 \tag{3.1} $$

is used, the first attribute which ranges from 150 to 185 will cause the value of (3.1) to be larger than $(10^2)^8 = 10^{16}$. Computer overflow easily happens when dealing with such numbers. Moreover, if the following RBF kernel is used:

$$ e^{-\|x_i - x_j\|^2}, \tag{3.2} $$

we may have values smaller than $e^{-10000}$. Such a value is so small ($< 10^{-300}$) that the decision function has

$$ \sum_{i=1}^{l} \alpha_iy_iK(x_i, x) + b \approx b, $$

if $x \ne x_i$, $i = 1, \ldots, l$. Apparently, this decision function is not good.

A simple linear scaling is formally stated as follows. Assume $M_i$ and $m_i$ are respectively the largest and smallest values of the $i$th attribute and we would like to scale the $i$th attribute to the range of $[-1, +1]$, so that $m_i$ is mapped to $-1$ and $M_i$ to $1$. If $x$ is the original value of the $i$th attribute in one data instance, the new value should be

$$ x' = \frac{x - \frac{M_i + m_i}{2}}{\frac{M_i - m_i}{2}} = 2\,\frac{x - m_i}{M_i - m_i} - 1. $$

There are many other possible ways of data scaling, but they will not be discussed here.

Of course we must use the same method to scale testing data before testing. For example, suppose that we scaled the first attribute of training data from [−10, +10] to [−1, +1]. If the first attribute of testing data lies in the range [−11, +8], we must scale the testing data to [−1.1, +0.8].
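The point about testing data is easy to get wrong, so here is a minimal sketch: the smallest and largest values are taken from the training data only and then reused for the testing data. The training values below are hypothetical, chosen so that their range is [−10, +10] as in the example.

```python
def fit_scaler(values):
    """Smallest and largest value of one attribute in the training data."""
    return min(values), max(values)


def scale(x, m, M):
    """Linear scaling that maps m to -1 and M to +1."""
    return 2.0 * (x - m) / (M - m) - 1.0


# Hypothetical training values of the first attribute, ranging over [-10, +10].
train = [-10.0, -4.0, 0.0, 3.0, 10.0]
m, M = fit_scaler(train)

# Testing values are scaled with the training m and M, not with their own.
test = [-11.0, 8.0]
scaled_train = [scale(x, m, M) for x in train]
scaled_test = [scale(x, m, M) for x in test]
print(scaled_train, scaled_test)
```

The scaled testing values fall slightly outside [−1, +1], which is expected: refitting the scaler on the testing data would map the same original value to different numbers in training and testing.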

Data scaling is important for many other classification methods. Sarle (1997) explains why we scale data while using Neural Networks, and most considerations also apply to SVM.

Some ask about the difference between scaling each feature to [−1, +1] and [0, +1]. In Homework 1, we show that for the linear and RBF kernels, if different parameters have been considered, they are fully equivalent. However, for polynomial kernels, they are different. Scaling to [0, 1] makes all kernel elements nonnegative. It is not clear yet whether this is a good property or not.

3.3 Model Selection

Though there are only a few common kernels mentioned in Chapter 2, we must decide which one to try first. Then we also need to choose the penalty parameter C and kernel parameters.

3.3.1 RBF Kernel

We suggest that in general RBF is a reasonable first choice. The RBF kernel nonlinearly maps samples into a higher dimensional space, so it, unlike the linear kernel, can handle the case when the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of RBF as (Keerthi and Lin, 2003) shows that the linear kernel with a penalty parameter $\tilde{C}$ has the same performance as the RBF kernel with some parameters $(C, \gamma)$.

The second reason is the number of hyperparameters, which influences the complexity of model selection. The polynomial kernel has more hyperparameters than the RBF kernel.

Finally, the RBF kernel has fewer numerical difficulties. One key point is $0 < K(x_i, x_j) \le 1$ in contrast to polynomial kernels of which kernel values may go to infinity ($x_i^Tx_j/\gamma + \delta > 1$) or zero ($x_i^Tx_j/\gamma + \delta < 1$) while the degree is large.

3.3.2 Cross-validation and Grid-search

There are two parameters while using the RBF kernel: C and γ. It is not known beforehand which C and γ are the best for one problem; consequently some kind of model selection (parameter search) must be done. The goal is to identify good (C, γ) so that the classifier can accurately predict unknown data (i.e., testing data).

Note that Chapter 1.5 has explained that it may not be useful to achieve high training accuracy (i.e., classifiers accurately predict training data whose class labels are indeed known). Similar to selecting k of k-nearest neighbor in Chapter 1.6, cross validation estimates the performance of the model.

We recommend a “grid-search” on C and γ using cross-validation. Basically pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. We found that trying exponentially growing sequences of C and γ is a practical method to identify good parameters (for example, C = 2⁻⁵, 2⁻³, . . . , 2¹⁵, γ = 2⁻¹⁵, 2⁻¹³, . . . , 2³).
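The grid itself is simple to generate. In the sketch below, grid_search accepts any function that returns a cross-validation accuracy for a pair (C, γ); the function toy_score is a made-up stand-in so that the example runs on its own, and it is not a real cross-validation result.

```python
import itertools
import math

# Exponentially growing sequences from the text.
C_grid = [2.0 ** e for e in range(-5, 16, 2)]       # 2^-5, 2^-3, ..., 2^15
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]   # 2^-15, 2^-13, ..., 2^3


def grid_search(cv_accuracy):
    """Try every (C, gamma) pair and keep the one with the best score."""
    best = None
    for C, gamma in itertools.product(C_grid, gamma_grid):
        acc = cv_accuracy(C, gamma)
        if best is None or acc > best[0]:
            best = (acc, C, gamma)
    return best


def toy_score(C, gamma):
    """Made-up score peaking at (2^3, 2^-5); NOT a real CV accuracy."""
    return -((math.log2(C) - 3.0) ** 2 + (math.log2(gamma) + 5.0) ** 2)


best_score, best_C, best_gamma = grid_search(toy_score)
print(len(C_grid) * len(gamma_grid), best_C, best_gamma)
```

Each of the 110 pairs is evaluated independently of the others, so the loop body can be distributed over several machines or processes without changing the result.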

The grid-search is straightforward but seems naive. In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate. However, there are two motivations why we prefer the simple grid-search approach.

One is that psychologically we may not feel safe using methods which avoid an exhaustive parameter search by approximations or heuristics. The other reason is that the computational time to find good parameters by grid-search is not much more than that by advanced methods since there are only two parameters. Furthermore, the grid-search can be easily parallelized because each (C, γ) is independent. Many advanced methods are iterative processes, e.g., walking along a path, which might be difficult to parallelize.

[Contour plot of cross-validation accuracy for german.numer_scale; axes lg(C) and lg(gamma); contour levels from 75 to 77.5.]

Figure 3.1: Loose grid search on C = 2⁻⁵, 2⁻³, . . . , 2¹⁵ and γ = 2⁻¹⁵, 2⁻¹³, . . . , 2³.

Since doing a complete grid-search may still be time-consuming, we recommend using a coarse grid first. After identifying a “better” region on the grid, a finer grid search on that region can be conducted. To illustrate this, we do an experiment on the problem german from the Statlog collection (Michie et al., 1994). After scaling this set, we first use a coarse grid (Figure 3.1) and find that the best (C, γ) is (2³, 2⁻⁵) with the cross-validation rate 77.5%. Next we conduct a finer grid search on the neighborhood of (2³, 2⁻⁵) (Figure 3.2) and obtain a better cross-validation rate 77.6% at (2^3.25, 2^−5.25). After the best (C, γ) is found, the whole training set is trained again to generate the final classifier. Note that there is no need to conduct a very fine grid search. Figure 3.1 clearly shows that good parameters are in a quite wide region.

[Contour plot of cross-validation accuracy for german.numer_scale; axes lg(C) and lg(gamma); contour levels from 75 to 77.5.]

Figure 3.2: Fine grid-search on C = 2¹, 2^1.25, . . . , 2⁵ and γ = 2⁻⁷, 2^−6.75, . . . , 2⁻³.

The above approach works well for problems with thousands or more data points.

For very large data sets, a feasible approach is to randomly choose a subset of the data set, conduct grid-search on them, and then do a better-region-only grid-search on the complete data set.
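The coarse-to-fine refinement and the random-subset idea can be sketched as follows. This is an illustration under stated assumptions, not the procedure used by any particular tool: the helper names `refine_exponents` and `random_subset` are hypothetical, and the half-width of 2 and step of 0.25 are simply the values that reproduce the fine grid of Figure 3.2 around (2³, 2⁻⁵).

```python
import random
from math import log2


def refine_exponents(best_value, half_width=2.0, step=0.25):
    """Exponents of a finer grid centred on log2(best_value)."""
    centre = log2(best_value)
    n = int(round(2 * half_width / step)) + 1
    return [centre - half_width + i * step for i in range(n)]


def random_subset(instances, size, seed=0):
    """Randomly choose at most `size` instances (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(instances, min(size, len(instances)))


# Fine grid around the coarse optimum (C, gamma) = (2^3, 2^-5).
fine_c_exponents = refine_exponents(2.0 ** 3)       # 1, 1.25, ..., 5
fine_gamma_exponents = refine_exponents(2.0 ** -5)  # -7, -6.75, ..., -3
fine_c = [2.0 ** e for e in fine_c_exponents]
fine_gamma = [2.0 ** e for e in fine_gamma_exponents]

# For a very large set, a coarse search could first be run on a subset.
subset = random_subset(list(range(100000)), 3000)
```

The subset is only used to locate a promising region; the finer search on that region would then be run on the complete data set, as described above.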

3.4 A General Procedure

To use SVM, we propose trying the following procedure first:

• Conduct simple scaling on the data.

• Consider the RBF kernel $K(x, y) = e^{-\gamma \|x - y\|^2}$.

• Use cross-validation to find the best parameters C and γ.

• Use the best parameters C and γ to train the whole training set.

• Test.
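For readers who want to see the RBF kernel from the list above in executable form, here is a small sketch in plain Python. The function name `rbf_kernel` is hypothetical and the code is not taken from LIBSVM; it only evaluates the formula for two feature vectors.

```python
from math import exp


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)   # identical points
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
```

The kernel value is 1 for identical points and decays towards 0 as the points move apart, at a rate controlled by γ. This is one way to see why the kernel is sensitive to the scale of the attributes and why scaling comes first in the procedure.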


3.5 Using LIBSVM

We use LIBSVM, a library for support vector machines, to demonstrate the training and testing procedure (Chang and Lin, 2001b). It is available at

http://www.csie.ntu.edu.tw/~cjlin/libsvm

Instructions for installation on different platforms are in the README file of the package.

The format of training and testing data file is:

<label> <index1>:<value1> <index2>:<value2> ...

<label> is the target value of the training data. For classification, it should be an integer which identifies a class (multi-class classification is supported). <index> is an integer starting from 1 and <value> is a real number. The indices must be in ascending order. An example is the following:

1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02
1.0 1:5.707397e+01 2:2.214040e+02 3:8.607959e-02 4:1.229114e+02
1.0 1:1.725900e+01 2:1.734360e+02 3:-1.298053e-01 4:1.250318e+02
1.0 1:2.177940e+01 2:1.249531e+02 3:1.538853e-01 4:1.527150e+02
0.0 1:2.391101e+01 2:3.890001e+01 3:4.704049e-01 4:1.257871e+02
0.0 1:2.230670e+01 2:2.262220e+01 3:2.117224e-01 4:1.012818e+02
0.0 1:1.640820e+01 2:3.920219e+01 3:-9.912787e-02 4:3.248707e+01

Clearly this data set contains four features. Next we consider a data set at

http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/train.1

It includes 3,089 training instances.
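To make the file format described above concrete, the following sketch parses one line of the form `<label> <index>:<value> ...`. The function `parse_libsvm_line` is a hypothetical helper written for this illustration, not part of LIBSVM, and it assumes a non-empty, well-formed line.

```python
def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    previous_index = 0
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        # Indices start from 1 and must be in ascending order.
        if index <= previous_index:
            raise ValueError("indices must start from 1 and be ascending")
        features[index] = float(value_text)
        previous_index = index
    return label, features


label, features = parse_libsvm_line(
    "1.0 1:2.617300e+01 2:5.886700e+01 3:-1.894697e-01 4:1.251225e+02"
)
```

Storing the features in a dictionary keyed by index is a choice made for this sketch; it keeps the one-based indices of the file visible.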

A simple training shows

$./svm-train train.1

...*

optimization finished, #iter = 6131

nu = 0.606144

obj = -1061.528899, rho = -0.495258

nSV = 3053, nBSV = 724

Total nSV = 3053


From the output, obj is the optimal objective value of the dual SVM problem (2.7). The value rho is −b in the decision function (2.10). nSV and nBSV are the numbers of support vectors (i.e., 0 < αᵢ ≤ C) and bounded support vectors (i.e., αᵢ = C), respectively. Other information such as #iter will be explained in Chapter 6.

The training procedure generates a model file train.1.model which includes information such as support vectors and dual optimal solutions. The test file has the same format as the training file. If labels of the testing data are available, we can calculate the test accuracy. If they are unknown, you still have to fill this column with any number. The testing procedure is as follows:

$./svm-predict test.1 train.1.model test.1.predict

Accuracy = 66.925% (2677/4000)

The test accuracy is not satisfactory. If we predict the training data using the same model:

$./svm-predict train.1 train.1.model o

Accuracy = 99.7734% (3082/3089)

We find that the training and testing accuracies are rather different. In fact, overfitting has happened. From the discussion earlier in this chapter, we understand that scaling and parameter selection may be needed.

Thus, we use the program svm-scale provided in LIBSVM to conduct data scaling:

$./svm-scale -l -1 -u 1 train.1 > train.1.scale

This means that each attribute is linearly scaled to the range [−1, 1]. A common mistake is then to scale the test data in the same way:

$./svm-scale -l -1 -u 1 test.1 > test.1.scale

Remember that we should use the same scaling factors for the training and testing sets. The correct way is

$./svm-scale -s range1 train.1 > train.1.scale

$./svm-scale -r range1 test.1 > test.1.scale
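The difference between the two ways of scaling can be illustrated with a small sketch. The functions `fit_ranges` and `scale_rows` below are hypothetical helpers, not the svm-scale implementation; the list `ranges` merely plays the role of the stored range file, and the treatment of a constant feature is an arbitrary choice made for this example.

```python
def fit_ranges(rows):
    """Record the (min, max) of every feature column in the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature using the stored training ranges."""
    scaled = []
    for row in rows:
        new_row = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                # Constant feature: arbitrary choice made for this sketch.
                new_row.append(0.0)
            else:
                new_row.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(new_row)
    return scaled


train = [[10.0, 0.5], [20.0, 1.5], [30.0, 1.0]]
test = [[15.0, 2.0]]

ranges = fit_ranges(train)               # computed from the training set only
train_scaled = scale_rows(train, ranges)
test_scaled = scale_rows(test, ranges)   # reuses the training factors
```

Here the second test attribute maps to 2.0, outside [−1, 1]. That is expected: the test set is transformed by exactly the same linear map as the training set, instead of being stretched to fill the range on its own.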

That is, we store the scaling factors used in training and apply them to the testing set. By training and predicting on the scaled sets we obtain:


$./svm-train train.1.scale

$./svm-predict test.1.scale train.1.scale.model test.1.predict

→ Accuracy = 96.15%

The test accuracy is now much better. The parameters used so far are the default ones: C = 1 and the RBF kernel with γ = 1/n = 0.25, where n is the number of features. Note that different parameters can lead to very different performance. For example, if we use C = 20, γ = 400

$./svm-train -c 20 -g 400 train.1.scale

$./svm-predict train.1.scale train.1.scale.model o

Accuracy = 100% (3089/3089) (classification)

we obtain 100% training accuracy but a much worse testing accuracy

$./svm-predict test.1.scale train.1.scale.model o

Accuracy = 82.7% (3308/4000) (classification)

Thus parameter selection is quite important.

In LIBSVM there is a simple tool grid.py for parameter selection:

$./grid.py train.1.scale

[local] -1 -7 85.1408 (best c=0.5, g=0.0078125, rate=85.1408)

[local] 5 -7 95.4354 (best c=32.0, g=0.0078125, rate=95.4354)

.

.

.

Best c=2.0, g=2.0

A contour plot like Figures 3.1 and 3.2, showing the cross-validation accuracy, is also generated.

3.6 Other Examples

Table 3.6.1 presents some real-world examples. These data sets were reported by our users, who could not obtain reasonable accuracy in the beginning. Using the procedure illustrated in Section 3.4, we helped them to achieve better performance.

Table 3.6.1: Problem characteristics and performance comparisons.

Application       #training data  #testing data  #features  #classes  Accuracy by users  Accuracy by our procedure
Astroparticle∗    3,089           4,000          4          2         75.2%              96.9%
Bioinformatics†   391             0§             20         3         36%                85.2%
Vehicle‡          1,243           41             21         2         4.88%              87.8%

∗Courtesy of Jan Conrad from Uppsala University, Sweden.

†Courtesy of Cory Spencer from Simon Fraser University, Canada (Gardy et al., 2003).

‡Courtesy of a user from Germany.

§As there are no testing data, cross-validation instead of testing accuracy is presented here.

These sets are at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/data/.

The first set has been discussed in Section 3.5. Here we present details of the other two. We also demonstrate how to use a script, easy.py, which carries out exactly the general procedure in Section 3.4.

• Bioinformatics

– Original sets with default parameters

$./svm-train -v 5 train.2

→ Cross Validation Accuracy = 56.5217%

– Scaled sets with default parameters

$./svm-scale -l -1 -u 1 train.2 > train.2.scale

$./svm-train -v 5 train.2.scale

– Scaled sets with parameter selection

$python grid.py train.2.scale

· · ·

2.0 0.5 85.1662

(Best C=2.0, γ=0.5 with five-fold cross-validation rate=85.1662%)

– Using an automatic script

$python easy.py train.2

Scaling training data...

Cross validation...

Best c=2.0, g=0.5

Training...

• Vehicle

– Original sets with default parameters

$./svm-train train.3

$./svm-predict test.3 train.3.model test.3.predict

→ Accuracy = 2.43902%

– Scaled sets with default parameters

$./svm-scale -l -1 -u 1 -s range3 train.3 > train.3.scale

$./svm-scale -r range3 test.3 > test.3.scale

$./svm-train train.3.scale

→ Accuracy = 12.1951%

– Scaled sets with parameter selection

$python grid.py train.3.scale

· · ·

128.0 0.125 84.8753

(Best C=128.0, γ=0.125 with five-fold cross-validation rate=84.8753%)

$./svm-train -c 128 -g 0.125 train.3.scale

→ Accuracy = 87.8049%

– Using an automatic script

$python easy.py train.3 test.3

Scaling training data...

Cross validation...

Best c=128.0, g=0.125

Training...

Scaling testing data...

Testing...

Accuracy = 87.8049% (36/41) (classification)

[Contour plot of cross-validation accuracy for d2-3k; axes lg(C) and lg(gamma); contour levels from 95 to 97.]

Figure 3.3: Cross validation using 3,000 points

3.7 A Large Practical Example

In this section we discuss how to deal with a large and practical data set. It comes from the first problem of the IJCNN Challenge 2001, organized by Ford Scientific Research Labs (Prokhorov, 2001). We summarize the approach of the winning entry (Chang and Lin, 2001a). The training set consists of 50,000 instances like the following:

0.000000 -0.999991 0.169769 0.000000 1.000000
0.000000 -0.659538 0.169769 0.000292 1.000000
0.000000 -0.660738 0.169128 -0.020372 1.000000
1.000000 -0.660307 0.169128 0.007305 1.000000
0.000000 -0.660159 0.169525 0.002519 1.000000
0.000000 -0.659091 0.169525 0.018198 1.000000
0.000000 -0.660532 0.169525 -0.024526 1.000000
0.000000 -0.659798 0.169525 0.012458 1.000000

[Contour plot of cross-validation accuracy for d2-10k; axes lg(C) and lg(gamma); contour levels from 96 to 97.6.]

Figure 3.4: Cross validation using 10,000 points

They are results at 50,000 time points and hence are a time series. There are 100,000 testing points. The kth instance contains

x₁(k), x₂(k), x₃(k), x₄(k), x₅(k), y(k),

where y(k) = ±1 is the class label. For time-series data, past and future information may affect the current class label. Moreover, it is known that the fifth attribute x5(k) is independent of y(k), but a test instance is considered for evaluation only if x5(k) = 1. Therefore, among the 100,000 test instances, only around 90,000 are evaluated. Another known fact is that x4(k) is more important than the other features.

To begin, we analyze features in more detail. The first feature x1(k) has a periodicity so that sequentially we have nine 0s, one 1, nine 0s, one 1, and so on. Other attributes, x2(k), . . . , x4(k), are real numbers in the range of ±1.5. An interesting observation is that for the 50,000 training data, 90% of y(k) are −1. Thus it might be possible that if we just guess all test data to be −1, there is already 90% accuracy. The difficulty is how to use learning techniques to achieve higher accuracy.

[Contour plot of cross-validation accuracy for d2; axes lg(C) and lg(gamma); contour levels from 97 to 98.8.]

Figure 3.5: Cross validation using 50,000 points

To use SVM for constructing a model, we first have to decide the attributes (i.e., features) of each data instance. There are various possible variables which may affect y(k). In addition, for each attribute we need an encoding scheme. For example, to represent the periodicity of x1(k), we can include x1(k − 5), . . . , x1(k + 4) as 10 binary attributes of the kth data instance. Alternatively, we can use only one integer between 1 and 10 which indicates the position of the 1 in x1(k − 5), . . . , x1(k + 4). Based on our experience we choose the former way, as it might be better for support vector machines.

We directly use x2(k) and x3(k) as they are. As x4(k) is more important, we consider some past and future elements. After conducting some cross validation tests, we decide to use x4(k − 5), . . . , x4(k + 4). Therefore, each training data instance consists of 22 attributes.
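The 22-attribute encoding described above can be sketched as follows. The helper `encode_instance` is hypothetical and uses 0-based indices; how the winning entry treated instances near the two ends of the series is not stated in the text, so this sketch simply refuses windows that fall outside the series. The short synthetic series is made up for the example.

```python
def encode_instance(x1, x2, x3, x4, k):
    """Build the 22 attributes of instance k (0-based index into the series).

    Layout: x1[k-5..k+4] (10 binary values), x2[k], x3[k],
    then x4[k-5..k+4] (10 values).
    """
    if k - 5 < 0 or k + 4 >= len(x1):
        # Assumption of this sketch: boundary instances are not encoded.
        raise IndexError("window k-5..k+4 falls outside the series")
    window = range(k - 5, k + 5)
    return [x1[j] for j in window] + [x2[k], x3[k]] + [x4[j] for j in window]


# Synthetic series of length 20; x1 follows the pattern nine 0s, one 1.
n = 20
x1 = [1 if j % 10 == 9 else 0 for j in range(n)]
x2 = [0.1 * j for j in range(n)]
x3 = [-0.1 * j for j in range(n)]
x4 = [0.01 * j for j in range(n)]

features = encode_instance(x1, x2, x3, x4, 5)
```

Because x1 has period 10, every window of ten consecutive values contains exactly one 1, so the first ten attributes form a one-of-ten indicator of the phase.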

For learning techniques like Neural Networks or Support Vector Machines, it is recommended to scale each attribute of data into an appropriate range such as [−1, 1] or [0, 1]. Since all raw data under our encoding scheme are already in a small region [−1.5, 1.5], we do not conduct any scaling.

After preparing the training data, we do the model selection by 5-fold cross validation. We consider only the RBF kernel $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$. Thus the two parameters are the kernel parameter γ and the penalty parameter C in (2.5).

First we work on a small subset of the training data: 3,000 randomly selected points. The contour of cross validation accuracy is in Figure 3.3 where two axes are log₂C and log₂γ. It can be seen that the best cross validation rates happen at around C = 2⁷ and γ = 2⁰. We then work on a larger subset with 10,000 data points. Results are in Figure 3.4. Parameters with the best cross validation rate are at around C = 2⁴ to 2⁵ and γ = 2⁰ to 2¹, a different range than that in Figure 3.3. Finally we do the model selection on all 50,000 training data where results are shown in Figure 3.5.

Again we note that the best parameters slightly move to another region.

Therefore, the experiment seems to show that the best parameters depend on the size of the training data. Remember that the objective function of SVM is

$$\frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i.$$

Under the same w, a larger number of instances causes a larger $\sum_{i=1}^{l} \xi_i$. Thus, some have proposed using

$$\frac{1}{2} w^T w + \frac{C}{l} \sum_{i=1}^{l} \xi_i.$$

This matches our experimental results as roughly

2⁷ · 3000 ≈ 2⁵ · 10000 ≈ 2³ · 50000.
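The rough equality above is easy to check numerically. The snippet below only multiplies out the three products quoted in the text; the variable names are arbitrary.

```python
# (exponent of C, number of training points) for the three products in the text.
observations = [(7, 3000), (5, 10000), (3, 50000)]

# C * l for each experiment.
products = [2 ** exponent * size for exponent, size in observations]

# Ratio between the largest and the smallest product.
spread = max(products) / min(products)
```

The three products are 384,000, 320,000 and 400,000, which differ by a factor of at most 1.25, while C itself changes by a factor of 16 between the smallest and the largest subset. In this sense C · l stays roughly constant.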

Though using a subset for parameter selection saves time, a procedure using all available training points may still be the most reliable.

Finally we select C = 2⁴ and γ = 2² to train the 50,000 data and obtain a model for testing. There are 1,293 test errors. This is the winning entry of the IJCNN 2001 competition; the second-place entry has more than 2,000 errors. For the 50,000 training data, the number of support vectors is around 3,000. Note that the number of SVs varies from one data set to another.

3.8 A Failed Example

3.9 Homework

1. (a) Show that for the linear kernel, training an SVM with parameter C on data scaled to [−1, 1] is equivalent to training an SVM with parameter 4C on the same data scaled to [0, +1].

(b) Show that for the RBF kernel, training an SVM with parameter γ on data scaled to [−1, 1] is equivalent to training an SVM with parameter 4γ on the same data scaled to [0, +1].