Combining SVMs with Various Feature Selection Strategies


by Yi-Wei Chen

A dissertation submitted in partial fulfillment of the requirements for the degree of

Master of Science

(Computer Science and Information Engineering) in National Taiwan University



Yi-Wei Chen 2007 All Rights Reserved


Feature selection is an important issue in many research areas. Selecting important features can reduce the learning time and improve the accuracy, among other benefits. This thesis investigates the performance of combining support vector machines (SVM) and various feature selection strategies. The first part of the thesis mainly describes the existing feature selection methods and our experience using those methods in a competition. The second part studies more feature selection strategies using the SVM.






ABSTRACT

LIST OF TABLES

CHAPTER

I. Introduction

II. Basic Concepts of SVM
    2.1 Linear Separating Hyperplane with Maximal Margin
    2.2 Mapping Data to Higher Dimensional Spaces
    2.3 The Dual Problem
    2.4 Kernel and Decision Functions
    2.5 Multi-class SVM
        2.5.1 One-against-all Multi-class SVM
        2.5.2 One-against-one Multi-class SVM
    2.6 Parameter Selection

III. Existing Feature Selection Methods
    3.1 Feature Ranking
        3.1.1 Statistical Score
        3.1.2 Random Shuffle on Features
        3.1.3 Separating Hyperplane in SVM
    3.2 Feature Selection
        3.2.1 Forward/Backward Selection
        3.2.2 Feature Ranking and Feature Number Estimation
    3.3 Feature Scaling
        3.3.1 Radius-Margin Bound SVM
        3.3.2 Bayesian SVM

IV. Experience on NIPS Competition
        4.2.1 Balanced Error Rate (BER)
        4.2.2 Area Under Curve (AUC)
        4.2.3 Fraction of Features
        4.2.4 Fraction of Probes
    4.3 Data Sets Information
        4.3.1 Source of Data Sets
    4.4 Strategies in Competition
        4.4.1 No Selection: Direct Use of SVM
        4.4.2 F-score for Feature Selection: F-score + SVM
        4.4.3 F-score and Random Forest for Feature Selection: F-score + RF + SVM
        4.4.4 Random Forest and RM-bound SVM for Feature Selection
    4.5 Experimental Results
    4.6 Competition Results
    4.7 Discussion and Conclusions from the Competition

V. Other Feature Ranking Methods by SVM
    5.1 Normal Vector of the Decision Boundary in Nonlinear SVM
    5.2 Change of Decision Value in Nonlinear SVM
        5.2.1 Instances from Underlying Distribution
        5.2.2 Instances from Decision Boundary
    5.3 Random Shuffle on Features using Probability SVM
        5.3.1 SVM with Probability Output
        5.3.2 Random Shuffle on Validation Data Features

VI. Experiments
    6.1 Experiment Procedures
        6.1.1 Feature Selection by Ranking
        6.1.2 Feature Scaling using RM-bound SVM
    6.2 Data Sets
    6.3 Experimental Results
    6.4 Analysis

VII. Discussion and Conclusions





LIST OF FIGURES

2.1 Separating hyperplane
2.2 An example which is not linearly separable
2.3 Support vectors (marked as +) are important data from training data
3.1 An example of one-dimensional data that cannot be separated by only one boundary point.
3.2 A two-dimensional data that is linearly separable. While the separating plane is nearly vertical, the feature of x-axis is more important.
4.1 Curves of F-scores against features; features with F-scores below the horizontal line are dropped
5.1 A two-dimensional illustration about “pushing” an instance x to the decision boundary
6.1 Comparisons of CV accuracy and testing accuracy against log number of features between feature ranking methods.
6.2 Comparisons of CV accuracy and testing accuracy against the log number of features between feature ranking methods.
6.3 Comparisons of CV accuracy and testing accuracy against log number of features between feature ranking methods.
6.4 Comparisons of CV accuracy and testing accuracy against log number of features between feature ranking methods.
6.5 Comparisons of CV accuracy and testing accuracy against log number of features between feature ranking methods.




LIST OF TABLES

4.1 Statistics of competition data sets
4.2 Comparison of different methods during the development period: BERs of validation sets (in percentage); bold-faced entries correspond to approaches used to generate our final submission
4.3 CV BER on the training set (in percentage)
4.4 F-score threshold and the number of features selected in the approach F+SVM
4.5 NIPS 2003 challenge results of the development stage
4.6 NIPS 2003 challenge results of the final stage
6.1 Statistics of all data sets. The column “accuracy” is the testing accuracy of original problems. (log2 C, log2 γ) is the optimal parameter obtained by parameter selection.
6.2 The testing accuracy obtained by each feature selection/scaling method. The numbers of features selected are also reported in parentheses for those five feature selection methods. Some results are not available due to the limitation of the programs.




Support vector machines (SVM) [3, 12, 36] have been an effective technique for data classification. Not only do they have a solid theoretical foundation, but practical comparisons have also shown that SVM is competitive with existing methods such as neural networks and decision trees.

SVM uses a separating hyperplane to maximize the distance between two classes of data. For problems that cannot be linearly separated in the original feature space, SVMs employ two techniques. First, a soft margin hyperplane with a penalty function of training errors is introduced. Second, using the kernel technique, the original feature space is non-linearly transformed into a higher dimensional kernel space. In this kernel space it is more likely that a linear separating hyperplane can be found.

More details about basic concepts of SVM are described in Chapter II.

Feature selection is an important issue in many research areas, such as bioinformatics, chemistry [34, 15], and text categorization [27, 18]. There are several reasons for selecting important features. First, reducing the number of features decreases the learning time and storage requirements. Second, removing irrelevant features and keeping informative ones usually improves the accuracy and performance. Third, the subset of selected features may help to discover more knowledge about the data.



According to the relationship between the classifier and the selection strategy, a feature selection method can be categorized into one of two types: “filter” methods and “wrapper” methods [19]. Filter methods are defined as a preprocessing step that removes irrelevant or unimportant features before classifiers begin to work. On the other hand, “wrapper” methods have close relations to the classifiers and rely on them to select features.

There are some early studies on feature selection using SVMs. For example, [16] considers the normal vector of the decision boundary as feature weights. [39, 7] minimize generalization bounds with respect to feature weights. In addition, [9] treats the decision values as random variables and maximizes the posterior using a Bayesian framework. There are some other variants of SVMs which are designed to do feature selection [41, 30]. However, in general there is no common or best way to conduct feature selection with SVMs.

Since more and more people work on feature selection, a competition which was especially designed for this topic was held in the 16th annual conference on Neural Information Processing Systems (NIPS). We took part in this contest and combined different selection strategies with SVMs. The paper [8] is a preliminary study on how to use SVMs to conduct feature selection. Now in this thesis, comparisons between more methods with SVMs are investigated, and more extensive experiments are conducted.

This thesis is organized as follows. Chapter II introduces basic concepts of SVM. Some existing general feature selection methods are discussed in Chapter III. In Chapter IV, we describe the result of attending the NIPS 2003 feature selection competition. Chapter V discusses more strategies with SVMs. Experiments and comparisons are listed in Chapter VI. Finally, discussion and conclusions are in Chapter VII.


Basic Concepts of SVM

We introduce the basic concepts of the SVM in this chapter. Part of this chapter is modified from [24].

2.1 Linear Separating Hyperplane with Maximal Margin

The original idea of SVM classification is to use a linear separating hyperplane to create a classifier. Given training vectors $x_i$, $i = 1, \ldots, l$, of length $n$, and a vector $y$ defined as follows

\[
y_i =
\begin{cases}
1 & \text{if } x_i \text{ in class 1}, \\
-1 & \text{if } x_i \text{ in class 2},
\end{cases}
\]

the support vector technique tries to find the separating hyperplane with the largest margin between two classes, measured along a line perpendicular to the hyperplane.

For example, in Figure 2.1, two classes could be fully separated by a dotted line $w^Tx + b = 0$. We would like to find the line with the largest margin. In other words, intuitively we think that the distance between two classes of training data should be as large as possible. That means we find a line with parameters $w$ and $b$ such that the distance between $w^Tx + b = \pm 1$ is maximized.

The distance between $w^Tx + b = 1$ and $w^Tx + b = -1$ can be calculated in the following way. Consider a point $\bar{x}$ on $w^Tx + b = -1$.

[Illustration: the point $\bar{x}$ on the line $w^Tx + b = -1$ is moved by $tw$ to the point $\bar{x} + tw$ on the line $w^Tx + b = 1$.]

As $w$ is the “normal vector” of the line $w^Tx + b = -1$, $w$ and the line are perpendicular to each other. Starting from $\bar{x}$ and moving along the direction $w$, we assume $\bar{x} + tw$ touches the line $w^Tx + b = 1$. Thus,

\[
w^T(\bar{x} + tw) + b = 1 \quad \text{and} \quad w^T\bar{x} + b = -1.
\]

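Then, $t w^Tw = 2$, so the distance (i.e., the length of $tw$) is $\|tw\| = 2\|w\|/(w^Tw) = 2/\|w\|$. Note that $\|w\| = \sqrt{w_1^2 + \cdots + w_n^2}$. As maximizing $2/\|w\|$ is equivalent to minimizing $w^Tw/2$, we have the following problem:

\[
\begin{aligned}
\min_{w,b} \quad & \frac{1}{2} w^Tw \\
\text{subject to} \quad & y_i(w^Tx_i + b) \ge 1, \quad i = 1, \ldots, l.
\end{aligned}
\tag{2.1.1}
\]

The constraint $y_i(w^Tx_i + b) \ge 1$ means

\[
\begin{aligned}
(w^Tx_i) + b &\ge 1 && \text{if } y_i = 1, \\
(w^Tx_i) + b &\le -1 && \text{if } y_i = -1.
\end{aligned}
\]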

That is, data in class 1 must be on the right-hand side of $w^Tx + b = 0$ while data in the other class must be on the left-hand side. Note that the reason for maximizing the distance between $w^Tx + b = \pm 1$ is related to Vapnik’s Structural Risk Minimization [36].
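The margin calculation can be checked numerically. The following is a minimal sketch, assuming an illustrative two-dimensional $w$ and $b$ that are not taken from the thesis; `margin_width` is a hypothetical helper name.

```python
import math

def margin_width(w):
    """Distance 2 / ||w|| between the lines w^T x + b = 1 and w^T x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Illustrative values only (assumed, not from the thesis).
w, b = [3.0, 4.0], -1.0
x_bar = [-1.0, 0.75]                      # chosen so that w^T x_bar + b = -1

# From t * w^T w = 2, solve for the step t along the normal direction w.
t = 2.0 / sum(wi * wi for wi in w)
x_plus = [xi + t * wi for xi, wi in zip(x_bar, w)]

# x_bar + t w should land on w^T x + b = 1, and ||t w|| should equal 2 / ||w||.
value_at_x_plus = sum(wi * xi for wi, xi in zip(w, x_plus)) + b
step_length = t * math.sqrt(sum(wi * wi for wi in w))
```

Here $\|w\| = 5$, so both `step_length` and `margin_width(w)` come out as approximately 0.4.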


Figure 2.1: Separating hyperplane

Figure 2.2: An example which is not linearly separable

2.2 Mapping Data to Higher Dimensional Spaces

In practice, problems may not be linearly separable; an example is in Figure 2.2. In that case there is no $(w, b)$ which satisfies the constraints of (2.1.1), and we say (2.1.1) is “infeasible.” In [12] the authors introduced slack variables $\xi_i$, $i = 1, \ldots, l$, in the constraints:

\[
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2} w^Tw + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i(w^Tx_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \ldots, l.
\end{aligned}
\tag{2.2.1}
\]

That is, constraints (2.2.1) allow that training data may not be on the correct side of the separating hyperplane $w^Tx + b = 0$. This situation happens when $\xi_i > 1$.

[Illustration: a training point $x_i$ on the wrong side of the hyperplane, with $w^Tx_i + b = 1 - \xi_i < -1$.]

We require $\xi_i \ge 0$ because if $\xi_i < 0$, then $y_i(w^Tx_i + b) \ge 1 - \xi_i \ge 1$ and the training instance is already on the correct side. The new problem is always feasible since for any $(w, b)$,

\[
\xi_i \equiv \max(0, 1 - y_i(w^Tx_i + b)), \quad i = 1, \ldots, l,
\]

makes $(w, b, \xi)$ a feasible solution.
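As a small illustration of the feasible slack values, the sketch below computes $\max(0, 1 - y_i(w^Tx_i + b))$ for three hand-picked points. The hyperplane and the points are assumptions chosen for illustration, and `slack` is a hypothetical helper name.

```python
def slack(w, b, x, y):
    """Smallest slack value that makes y (w^T x + b) >= 1 - xi hold with xi >= 0."""
    decision = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * decision)

# Assumed toy hyperplane x_1 = 0 (w = (1, 0), b = 0); all points have label +1.
w, b = [1.0, 0.0], 0.0
outside_margin = slack(w, b, [2.0, 0.0], 1)    # margin constraint already satisfied
inside_margin = slack(w, b, [0.5, 0.0], 1)     # correct side, but inside the margin
wrong_side = slack(w, b, [-1.0, 0.0], 1)       # misclassified, so the slack exceeds 1
```

The three cases give slack 0, a value between 0 and 1, and a value greater than 1, which matches the statement that a training instance is on the wrong side when $\xi_i > 1$.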

Using this setting, we may worry that for linearly separable data, some $\xi_i > 1$ and hence the corresponding data are wrongly classified. For the case that most data except some noisy ones are separable by a linear function, we would like $w^Tx + b = 0$ to correctly classify the majority of points. Thus, in the objective function we add a penalty term $C \sum_{i=1}^{l} \xi_i$, where $C > 0$ is the penalty parameter. To have the objective value as small as possible, most $\xi_i$ should be zero, so the constraint goes back to its original form. Theoretically we can prove that if data are linearly separable and $C$ is larger than a certain number, problem (2.2.1) goes back to (2.1.1) and all $\xi_i$ are zero [25].

Unfortunately, such a setting is not enough for practical use. If data are distributed in a highly nonlinear way, employing only a linear function causes many training instances to be on the wrong side of the hyperplane. So underfitting occurs and the decision function does not perform well.

To fit the training data better, we may think of using a nonlinear curve like that in Figure 2.2. The problem is that it is very difficult to model nonlinear curves. All we are familiar with are elliptic, hyperbolic, or parabolic curves, which are far from enough in practice. Instead of using more sophisticated curves, another approach is to map data into a higher dimensional space. For example, suppose the height and weight of some people are available and medical experts have identified that some of them are overweight or underweight. We may consider two other attributes

height-weight, weight/(height$^2$).

Such features may provide more information for separating underweight/overweight people. Each new data instance is now in a four-dimensional space, so if the two new features are good, it should be easier to have a separating hyperplane so that most $\xi_i$ are zero.

Thus SVM non-linearly transforms the original input space into a higher dimensional feature space. More precisely, the training data $x$ is mapped into a (possibly infinite) vector in a higher dimensional space:

\[
\phi(x) = [\phi_1(x), \phi_2(x), \ldots].
\]

In this higher dimensional space, it is more likely that data can be linearly separated. An example of mapping $x$ from $R^3$ to $R^{10}$ is as follows:

\[
\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3).
\]

An extreme example is to map a data instance $x \in R^1$ to an infinite dimensional space:

\[
\phi(x) = \left[ 1, \frac{x}{1!}, \frac{x^2}{2!}, \frac{x^3}{3!}, \ldots \right].
\]

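We then try to find a linear separating plane in a higher dimensional space so (2.2.1) becomes

\[
\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2} w^Tw + C \sum_{i=1}^{l} \xi_i \\
\text{subject to} \quad & y_i(w^T\phi(x_i) + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \ldots, l.
\end{aligned}
\tag{2.2.2}
\]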

2.3 The Dual Problem

The remaining problem is how to effectively solve (2.2.2). Especially after data are mapped into a higher dimensional space, the number of variables (w, b) becomes very large or even infinite. We handle this difficulty by solving the dual problem of (2.2.2):


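\[
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \phi(x_i)^T\phi(x_j) - \sum_{i=1}^{l} \alpha_i \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \\
& \sum_{i=1}^{l} y_i \alpha_i = 0.
\end{aligned}
\tag{2.3.1}
\]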

This new problem of course is related to the original problem (2.2.2), and we hope that it can be more easily solved. Sometimes we write (2.3.1) in a matrix form for convenience:



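\[
\begin{aligned}
\min_{\alpha} \quad & \frac{1}{2} \alpha^TQ\alpha - e^T\alpha \\
\text{subject to} \quad & 0 \le \alpha_i \le C, \quad i = 1, \ldots, l, \\
& y^T\alpha = 0.
\end{aligned}
\tag{2.3.2}
\]

In (2.3.2), $e$ is the vector of all ones, $C$ is the upper bound, $Q$ is an $l$ by $l$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) \equiv \phi(x_i)^T\phi(x_j)$ is the kernel, which will be addressed in Section 2.4.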


If (2.3.2) is called the “dual” problem of (2.2.2), we refer to (2.2.2) as the “primal” problem. Suppose $(\bar{w}, \bar{b}, \bar{\xi})$ and $\bar{\alpha}$ are optimal solutions of the primal and dual problems, respectively. Then the following two properties hold:

\[
\bar{w} = \sum_{i=1}^{l} \bar{\alpha}_i y_i \phi(x_i), \tag{2.3.3}
\]

\[
\frac{1}{2} \bar{w}^T\bar{w} + C \sum_{i=1}^{l} \bar{\xi}_i = e^T\bar{\alpha} - \frac{1}{2} \bar{\alpha}^TQ\bar{\alpha}. \tag{2.3.4}
\]

In other words, if the dual problem is solved with a solution $\bar{\alpha}$, the optimal primal solution $\bar{w}$ is easily obtained from (2.3.3). If an optimal $\bar{b}$ can also be easily found, the decision function is hence determined.

Thus, the crucial point is whether the dual is easier to solve than the primal. The number of variables in the dual, which is the size of the training set $l$, is a fixed number. In contrast, the number of variables in the primal problem varies depending on how data are mapped to a higher dimensional space. Therefore, moving from the primal to the dual means that we solve a finite-dimensional optimization problem instead of a possibly infinite-dimensional problem.

The remaining issue of using the dual problem is about the inner product $\phi(x_i)^T\phi(x_j)$ that appears in $Q_{ij}$. If $\phi(x)$ is an infinitely long vector, there is no way to fully write it down and then calculate the inner product. Thus, even though the dual possesses the advantage of having a finite number of variables, we could not even write the problem down before solving it. This is resolved by using special mapping functions $\phi$ so that $\phi(x_i)^T\phi(x_j)$ is in a closed form. Details are in the next section.

2.4 Kernel and Decision Functions

Consider a special $\phi(x)$ mentioned earlier (assume $x \in R^3$):

\[
\phi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_3, x_1^2, x_2^2, x_3^2, \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \sqrt{2}x_2x_3).
\]

In this case it is easy to see that $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2$, which is easier to calculate than a direct inner product. To be more precise, a direct calculation of $\phi(x_i)^T\phi(x_j)$ takes 10 multiplications and 9 additions, but using $(1 + x_i^Tx_j)^2$, only four multiplications and three additions are needed. Therefore, if a special $\phi(x)$ is considered, even though it is a long vector, $\phi(x_i)^T\phi(x_j)$ may still be easily available.
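The identity $\phi(x_i)^T\phi(x_j) = (1 + x_i^Tx_j)^2$ can be verified numerically. The sketch below assumes the ten-component mapping from $R^3$ described in the text and uses arbitrary illustrative input vectors; the function names are hypothetical.

```python
import math

def phi(x):
    """Explicit degree-2 polynomial feature map from R^3 to R^10."""
    x1, x2, x3 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, r2 * x3,
            x1 * x1, x2 * x2, x3 * x3,
            r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(xi, xj):
    """Closed-form kernel: needs only the inner product in the original space."""
    return (1.0 + dot(xi, xj)) ** 2

# Arbitrary illustrative inputs (assumed, not from the thesis).
xi, xj = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
explicit = dot(phi(xi), phi(xj))     # inner product in the 10-dimensional space
closed_form = poly_kernel(xi, xj)    # same value without forming phi
```

For these inputs $x_i^Tx_j = 4.5$, so both quantities equal $5.5^2 = 30.25$ up to rounding.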

We call such inner products the “kernel function.” Some popular kernels are, for example,

1. $e^{-\gamma\|x_i - x_j\|^2}$ (Gaussian kernel or radial basis function (RBF) kernel),

2. $(x_i^Tx_j/\gamma + \delta)^d$ (polynomial kernel),

where $\gamma$, $d$, and $\delta$ are kernel parameters. The following calculation shows that the Gaussian (RBF) kernel indeed is an inner product of two vectors in an infinite dimensional space. Assume $x \in R^1$ and $\gamma > 0$.

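\[
\begin{aligned}
e^{-\gamma\|x_i - x_j\|^2} &= e^{-\gamma(x_i - x_j)^2} \\
&= e^{-\gamma x_i^2 + 2\gamma x_i x_j - \gamma x_j^2} \\
&= e^{-\gamma x_i^2 - \gamma x_j^2} \left( 1 + \frac{2\gamma x_i x_j}{1!} + \frac{(2\gamma x_i x_j)^2}{2!} + \frac{(2\gamma x_i x_j)^3}{3!} + \cdots \right) \\
&= e^{-\gamma x_i^2 - \gamma x_j^2} \left( 1 \cdot 1 + \sqrt{\frac{2\gamma}{1!}} x_i \cdot \sqrt{\frac{2\gamma}{1!}} x_j + \sqrt{\frac{(2\gamma)^2}{2!}} x_i^2 \cdot \sqrt{\frac{(2\gamma)^2}{2!}} x_j^2 + \sqrt{\frac{(2\gamma)^3}{3!}} x_i^3 \cdot \sqrt{\frac{(2\gamma)^3}{3!}} x_j^3 + \cdots \right) \\
&= \phi(x_i)^T\phi(x_j),
\end{aligned}
\]

where

\[
\phi(x) = e^{-\gamma x^2} \left[ 1, \sqrt{\frac{2\gamma}{1!}} x, \sqrt{\frac{(2\gamma)^2}{2!}} x^2, \sqrt{\frac{(2\gamma)^3}{3!}} x^3, \cdots \right]^T.
\]

Note that $\gamma > 0$ is used for the existence of terms such as $\sqrt{\frac{2\gamma}{1!}}$, $\sqrt{\frac{(2\gamma)^3}{3!}}$, etc.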
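The infinite expansion cannot be computed in full, but truncating it shows how quickly the explicit inner product approaches the closed-form kernel. This is a sketch under assumed values of $\gamma$, $x_i$, and $x_j$; `phi_truncated` is a hypothetical helper that keeps only the first `n_terms` components of $\phi(x)$.

```python
import math

def rbf_kernel(xi, xj, gamma):
    """Closed-form Gaussian (RBF) kernel for scalar inputs."""
    return math.exp(-gamma * (xi - xj) ** 2)

def phi_truncated(x, gamma, n_terms):
    """First n_terms components of the infinite feature map for scalar x."""
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(n_terms)]

# Assumed illustrative values.
gamma, xi, xj = 0.5, 0.8, -0.3
exact = rbf_kernel(xi, xj, gamma)

# Inner products of truncated maps with increasing numbers of terms.
approximations = []
for n_terms in (1, 3, 5, 20):
    a, c = phi_truncated(xi, gamma, n_terms), phi_truncated(xj, gamma, n_terms)
    approximations.append(sum(p * q for p, q in zip(a, c)))

errors = [abs(value - exact) for value in approximations]
```

The truncation error is the tail of the exponential series in $2\gamma x_i x_j$, so it shrinks rapidly when that product is small, as it is here.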


After (2.3.2) is solved with a solution $\alpha$, the vectors $x_i$ for which $\alpha_i > 0$ are called support vectors. Then, a decision function is written as

\[
f(x) = \text{sign}(w^T\phi(x) + b) = \text{sign}\left( \sum_{i=1}^{l} y_i \alpha_i \phi(x_i)^T\phi(x) + b \right). \tag{2.4.1}
\]

In other words, for a test vector $x$, if $\sum_{i=1}^{l} y_i \alpha_i \phi(x_i)^T\phi(x) + b > 0$, we classify it to be in class 1. Otherwise, we classify it to be in the second class. We can see that only support vectors affect results in the prediction stage. In general, the number of support vectors is not large. Therefore we can say SVM is used to find important data (support vectors) from training data.
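The decision rule in (2.4.1) only needs kernel values between the test vector and the support vectors. The sketch below evaluates it with an RBF kernel; the support vectors, $\alpha_i$, $b$, and $\gamma$ are made-up values for illustration and are not the output of an actual SVM solver.

```python
import math

def rbf_kernel(u, v, gamma):
    """Gaussian (RBF) kernel between two vectors given as lists."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(u, v)))

def decision_value(x, support_vectors, alphas, labels, b, gamma):
    """sum_i y_i alpha_i K(x_i, x) + b, summed over support vectors only."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

def predict(x, support_vectors, alphas, labels, b, gamma):
    """Return 1 for class 1 and -1 for the second class."""
    value = decision_value(x, support_vectors, alphas, labels, b, gamma)
    return 1 if value > 0 else -1

# Hypothetical model (assumed values; they satisfy sum_i y_i alpha_i = 0).
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas, labels = [1.0, 1.0], [1, -1]
b, gamma = 0.0, 0.5

near_first = predict([0.1, 0.2], support_vectors, alphas, labels, b, gamma)
near_second = predict([1.9, 2.1], support_vectors, alphas, labels, b, gamma)
```

Training instances with $\alpha_i = 0$ never enter the sum, which is why only support vectors need to be stored with the model.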

We use Figure 2.3 as an illustration. Two classes of training data are not linearly separable. Using the RBF kernel, we obtain a hyperplane $w^T\phi(x) + b = 0$. In the original space, it is indeed a nonlinear curve

\[
\sum_{i=1}^{l} y_i \alpha_i \phi(x_i)^T\phi(x) + b = 0. \tag{2.4.2}
\]

In the figure, all points in red color are support vectors and they are selected from both classes of training data. Clearly, support vectors, which are close to the nonlinear curve (2.4.2), are the more important points.


Figure 2.3: Support vectors (marked as +) are important data from training data

2.5 Multi-class SVM

The discussion so far assumes that data are in only two classes. Many practical applications involve more classes of data. For example, hand-written digit recognition considers data in 10 classes: digits 0 to 9. There are many ways to extend SVM for such cases. Here, we discuss two simple methods.

2.5.1 One-against-all Multi-class SVM

This commonly mis-named method should be called “one-against-the-rest.” It constructs binary SVM models so that each one is trained with one class as positive and the rest as negative. We illustrate this method by a simple situation of four classes. The four two-class SVMs are


$y_i = 1$    $y_i = -1$        Decision function
class 1      classes 2,3,4     $f^1(x) = (w^1)^Tx + b^1$
class 2      classes 1,3,4     $f^2(x) = (w^2)^Tx + b^2$
class 3      classes 1,2,4     $f^3(x) = (w^3)^Tx + b^3$
class 4      classes 1,2,3     $f^4(x) = (w^4)^Tx + b^4$

For any test data $x$, if it is in the $i$th class, we would expect that

\[
f^i(x) \ge 1 \quad \text{and} \quad f^j(x) \le -1, \text{ if } j \ne i.
\]

This “expectation” directly follows from our setting of training the four two-class problems and from the assumption that data are correctly separated. Therefore, $f^i(x)$ has the largest value among $f^1(x), \ldots, f^4(x)$ and hence the decision rule is

\[
\text{Predicted class} = \arg\max_{i=1,\ldots,4} f^i(x).
\]
2.5.2 One-against-one Multi-class SVM

This method also constructs several two-class SVMs but each one is trained on data from only two different classes. Thus, this method is sometimes called a “pairwise” approach. For the same example of four classes, six two-class problems are constructed:

$y_i = 1$    $y_i = -1$    Decision function
class 1      class 2       $f^{12}(x) = (w^{12})^Tx + b^{12}$
class 1      class 3       $f^{13}(x) = (w^{13})^Tx + b^{13}$
class 1      class 4       $f^{14}(x) = (w^{14})^Tx + b^{14}$
class 2      class 3       $f^{23}(x) = (w^{23})^Tx + b^{23}$
class 2      class 4       $f^{24}(x) = (w^{24})^Tx + b^{24}$
class 3      class 4       $f^{34}(x) = (w^{34})^Tx + b^{34}$

For any test data $x$, we put it into the six functions. If the problem of classes $i$ and $j$ indicates the data $x$ should be in $i$, the class $i$ gets one vote. For example, assume

Classes    winner
1  2       1
1  3       1
1  4       1
2  3       2
2  4       4
3  4       3

Then, we have

class      1  2  3  4
# votes    3  1  1  1

Thus, $x$ is predicted to be in the first class.

For a data set with $m$ different classes, this method constructs $m(m - 1)/2$ two-class SVMs. We may worry that sometimes more than one class obtains the highest number of votes. Practically this situation does not happen so often and there are some further strategies to handle it.
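The voting step can be sketched as follows, using the pairwise winners from the four-class example above. The tie-breaking rule here (take the smallest class index) is an assumption made for the sketch; the text only says that further strategies exist.

```python
from collections import Counter
from itertools import combinations

def one_against_one_vote(pairwise_winners):
    """Count one vote per pairwise problem and return (predicted class, votes)."""
    votes = Counter(pairwise_winners.values())
    top = max(votes.values())
    # Assumed tie-break: among classes with the most votes, pick the smallest index.
    predicted = min(c for c, v in votes.items() if v == top)
    return predicted, votes

# Winners of the six two-class problems from the example in the text.
winners = {(1, 2): 1, (1, 3): 1, (1, 4): 1, (2, 3): 2, (2, 4): 4, (3, 4): 3}
predicted, votes = one_against_one_vote(winners)

# Number of two-class problems for m = 4 classes: m (m - 1) / 2.
n_pairs = len(list(combinations(range(1, 5), 2)))
```

With these winners, class 1 receives three votes and each other class receives one, so class 1 is predicted.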

2.6 Parameter Selection

The performance of the SVM is sensitive to its kernel and parameters [17], so they must be chosen carefully when constructing a predicting model.

There are some commonly used kernels:

1. linear: $K(x_i, x_j) = x_i^Tx_j$.

2. polynomial: $K(x_i, x_j) = (\gamma x_i^Tx_j + r)^d$, $\gamma > 0$.

3. radial basis function (RBF): $K(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|^2)$, $\gamma > 0$.

[17] recommends using the RBF kernel in the SVM. First, it nonlinearly maps instances into a higher dimensional space, and it can handle data which is not linearly separable. Second, [20] shows that the linear kernel is a special case of the RBF kernel. In addition, the sigmoid kernel behaves like the RBF kernel for certain parameters [26]. Third, the RBF kernel suffers less from numerical difficulties.

With the RBF kernel, there are two parameters to be determined in an SVM model now: the penalty parameter C in (2.2.1), and γ in the RBF kernel. To get good generalization ability, a cross validation (CV) process is conducted to decide parameters. The procedure to construct a model for predicting is therefore like the following:

Algorithm 1 SVM Parameter Selection

1. Consider a list of (C, γ), C, γ > 0. For each hyperparameter pair (C, γ) in the search space, conduct five-fold cross validation on the training set.

2. Choose the parameter (C, γ) that leads to the lowest cross validation error rate.

3. Use the best parameter to create a model as the predictor.
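The structure of Algorithm 1 can be sketched as a grid search wrapped around five-fold cross validation. To keep the example self-contained, the learner `train_kernel_vote` is a simple kernel-weighted vote and is explicitly not an SVM: it accepts `C` only to match the interface and ignores it. The grid values, the toy data, and all function names are assumptions made for illustration.

```python
import itertools
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(train_fn, X, y, C, gamma, k=5):
    """k-fold cross validation error rate of train_fn for one (C, gamma)."""
    errors = 0
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = train_fn([X[i] for i in train], [y[i] for i in train], C, gamma)
        errors += sum(model(X[i]) != y[i] for i in held_out)
    return errors / len(X)

def grid_search(train_fn, X, y, Cs, gammas, k=5):
    """Return (error, C, gamma) with the lowest cross validation error."""
    best = None
    for C, gamma in itertools.product(Cs, gammas):
        err = cv_error(train_fn, X, y, C, gamma, k)
        if best is None or err < best[0]:
            best = (err, C, gamma)
    return best

def train_kernel_vote(X_train, y_train, C, gamma):
    """Stand-in learner (NOT an SVM): RBF-weighted vote; C is unused."""
    def predict(x):
        score = sum(yi * math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(xi, x)))
                    for xi, yi in zip(X_train, y_train))
        return 1 if score > 0 else -1
    return predict

# Assumed toy data: two well-separated one-dimensional clusters.
X = [[0.1 * i] for i in range(10)] + [[10.0 + 0.1 * i] for i in range(10)]
y = [1] * 10 + [-1] * 10
Cs = [2.0 ** e for e in (-1, 1, 3)]
gammas = [2.0 ** e for e in (-3, -1, 1)]
best_error, best_C, best_gamma = grid_search(train_kernel_vote, X, y, Cs, gammas)
final_model = train_kernel_vote(X, y, best_C, best_gamma)   # step 3 of Algorithm 1
```

In practice `train_fn` would be replaced by a real SVM trainer; the loop over the grid and the final retraining on the whole training set stay the same.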


Existing Feature Selection Methods

In this chapter we describe several common strategies for feature selection. Some of them give only a ranking of features; some are able to directly select proper features for classification. There are methods that can only be applied to SVMs, but most general feature selection techniques are independent of the classifier used.

3.1 Feature Ranking

Among feature selection techniques, some could give an importance order of all the features. In the following we describe some methods of this type.

3.1.1 Statistical Score

Several statistical criteria, such as the correlation coefficient and Fisher’s criterion [13, 28], could give feature rankings. They are independent of the choice of the classifier and are more like preprocessing steps. The advantage of such criteria is that they are simple and effective. However, effects of classifiers behind them are often ignored, and this is the main disadvantage of this type of approach [21]. For example, consider a one-dimensional two-class data set in Figure 3.1. Fisher’s criterion or the correlation coefficient may indicate that this feature is not discriminative. If the classifier used is a linear one, this is the case indeed. However, a decision tree will perform well on this data set.

Figure 3.1: An example of one-dimensional data that cannot be separated by only one boundary point.

Here we concentrate on using Fisher’s criterion in feature ranking. The Fisher’s criterion of a single feature is defined as follows. Given a data set $X$ with two classes, denote instances in class 1 as $X^1$, and those in class 2 as $X^2$. Assume $\bar{x}^k_j$ is the average of the $j$th feature in $X^k$. The Fisher score (F-score for short) of the $j$th feature is:

\[
F(j) \equiv \frac{\left( \bar{x}^1_j - \bar{x}^2_j \right)^2}{(s^1_j)^2 + (s^2_j)^2}, \tag{3.1.1}
\]

where

\[
(s^k_j)^2 = \sum_{x \in X^k} \left( x_j - \bar{x}^k_j \right)^2.
\]

The numerator indicates the discrimination between two classes, and the denominator indicates the scatter within each class. The larger the F-score is, the more likely this feature is to be discriminative. Therefore, this score can be a criterion for feature selection.
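A direct implementation of the two-class F-score is short. The sketch below uses a tiny made-up data set in which feature 0 separates the classes and feature 1 does not; `f_score` and `column_mean` are hypothetical names, and the function assumes the within-class scatter is not zero.

```python
def column_mean(rows, j):
    """Average of the j-th feature over a list of instances."""
    return sum(x[j] for x in rows) / len(rows)

def f_score(X1, X2, j):
    """Two-class F-score of feature j: squared mean gap over within-class scatter."""
    m1, m2 = column_mean(X1, j), column_mean(X2, j)
    s1 = sum((x[j] - m1) ** 2 for x in X1)
    s2 = sum((x[j] - m2) ** 2 for x in X2)
    return (m1 - m2) ** 2 / (s1 + s2)   # assumes s1 + s2 > 0

# Assumed toy data: rows are instances, columns are features.
X1 = [[1.0, 5.0], [2.0, 1.0], [3.0, 3.0]]   # class 1
X2 = [[7.0, 2.0], [8.0, 4.0], [9.0, 3.0]]   # class 2

scores = [f_score(X1, X2, j) for j in range(2)]
# Rank features from the largest score to the smallest.
ranking = sorted(range(2), key=lambda j: scores[j], reverse=True)
```

Feature 0 has class means 2 and 8 with scatter 2 in each class, giving $36/4 = 9$, while feature 1 has equal class means and therefore a score of 0.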

A disadvantage of using F-score to obtain feature ranking is that it does not reveal mutual information among features. Consider one simple example in the following figure:




Both features of this data have low F-scores as in (3.1.1) the denominator (the sum of variances of the positive and negative sets) is much larger than the numerator.

Despite this disadvantage, F-score is simple and generally quite effective.

For data with more than two classes, F-score can be extended as follows. Given a data set $X$ with $m$ classes, denote the set of instances in class $k$ as $X^k$, and $|X^k| = l_k$, $k = 1, \ldots, m$. Assume $\bar{x}^k_j$ and $\bar{x}_j$ are the average of the $j$th feature in $X^k$ and $X$, respectively. The Fisher score of the $j$th feature of this data set is defined as:

\[
\hat{F}(j) \equiv \frac{S_B(j)}{S_W(j)}, \tag{3.1.2}
\]

where

\[
S_B(j) = \sum_{k=1}^{m} l_k \left( \bar{x}^k_j - \bar{x}_j \right)^2, \tag{3.1.3}
\]

\[
S_W(j) = \sum_{k=1}^{m} \sum_{x \in X^k} \left( x_j - \bar{x}^k_j \right)^2. \tag{3.1.4}
\]

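Note that (3.1.1) is equivalent to (3.1.2) when $m = 2$. Since

\[
S_W(j) = \sum_{k=1}^{2} \sum_{x \in X^k} \left( x_j - \bar{x}^k_j \right)^2 = (s^1_j)^2 + (s^2_j)^2,
\]

and

\[
\begin{aligned}
S_B(j) &= \sum_{k=1}^{2} l_k \left( \bar{x}^k_j - \bar{x}_j \right)^2 \\
&= l_1 \left( \bar{x}^1_j - \frac{l_1 \bar{x}^1_j + l_2 \bar{x}^2_j}{l_1 + l_2} \right)^2 + l_2 \left( \bar{x}^2_j - \frac{l_1 \bar{x}^1_j + l_2 \bar{x}^2_j}{l_1 + l_2} \right)^2 \\
&= l_1 \left( \frac{l_2}{l_1 + l_2} \right)^2 \left( \bar{x}^1_j - \bar{x}^2_j \right)^2 + l_2 \left( \frac{l_1}{l_1 + l_2} \right)^2 \left( \bar{x}^2_j - \bar{x}^1_j \right)^2 \\
&= \frac{l_1 l_2}{l_1 + l_2} \left( \bar{x}^1_j - \bar{x}^2_j \right)^2,
\end{aligned}
\]

for a two-class data set

\[
\hat{F}(j) = \frac{l_1 l_2}{l_1 + l_2} F(j), \tag{3.1.5}
\]

and they both give the same feature ranking.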
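Relation (3.1.5) can be checked numerically on a small unbalanced two-class example. This is a sketch with made-up one-feature data; `multiclass_f_score` and `two_class_f_score` are hypothetical names following the definitions (3.1.1) to (3.1.4).

```python
def mean(values):
    return sum(values) / len(values)

def two_class_f_score(X1, X2, j):
    """F(j) from (3.1.1): squared mean gap over summed within-class scatter."""
    m1, m2 = mean([x[j] for x in X1]), mean([x[j] for x in X2])
    scatter = sum((x[j] - m1) ** 2 for x in X1) + sum((x[j] - m2) ** 2 for x in X2)
    return (m1 - m2) ** 2 / scatter

def multiclass_f_score(classes, j):
    """F-hat(j) = S_B(j) / S_W(j) for a list of classes, each a list of instances."""
    overall = mean([x[j] for Xk in classes for x in Xk])
    s_b = sum(len(Xk) * (mean([x[j] for x in Xk]) - overall) ** 2 for Xk in classes)
    s_w = sum((x[j] - mean([v[j] for v in Xk])) ** 2 for Xk in classes for x in Xk)
    return s_b / s_w

# Assumed toy data with l_1 = 3 and l_2 = 2 instances of a single feature.
X1 = [[1.0], [2.0], [3.0]]
X2 = [[7.0], [9.0]]
l1, l2 = len(X1), len(X2)

f_two = two_class_f_score(X1, X2, 0)
f_multi = multiclass_f_score([X1, X2], 0)
scaled = l1 * l2 / (l1 + l2) * f_two      # right-hand side of (3.1.5)
```

Because the factor $l_1 l_2/(l_1 + l_2)$ does not depend on $j$, scaling every feature's score by it leaves the ranking unchanged.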

3.1.2 Random Shuffle on Features

Feature rankings obtained from statistical scores do not take classifiers’ characteristics into account. Here we introduce a method which uses the ability of the underlying classifiers. It can be generally applied to all kinds of classifiers as well.

To introduce this idea, take random forest as an example. Random forest (RF) is a classification method, but it also provides feature importance [4]. A forest contains many decision trees. Each decision tree is constructed from instances randomly sampled with replacement. Therefore about one-third of training instances are left out. These are called out-of-bag (oob) data. This oob data can be used to estimate classification error.

The basic idea about giving feature importance is simple. It is achieved by using oob data. For every decision tree in the forest, put down its oob data and count the number of correct predictions. Then randomly permute the values of the $j$th feature among oob instances, and put these instances down the tree again. Subtract the number of correct predictions in permuted oob data from that in unpermuted oob data. The average of this number over all the trees in the forest is the raw importance score for the $j$th feature.

From the description above, it is easy to see that this idea can be extended to any other learning machines generally. First, determine the sort of hold-out technique to estimate a model’s performance, such as oob data in random forest, or cross-validation for general machines. Second, build a model by the learning algorithm and estimate its performance by the above technique. Finally, randomly shuffle the specific feature in hold-out data, and then estimate the performance again using the same model in the previous step. The difference of two estimations can be seen as an influence of that feature. If a feature plays an important role in classifying, shuffling this feature will make it irrelevant to the labels and the performance will decrease.

Note that random shuffling or even removing features in the whole data is also a possible way to know the influence of features. However, most learning algorithms spend a very large amount of time in training or even in hyper-parameter selection, while predicting is much faster. Therefore it is not practical to shuffle features of the whole data if the size of the data set is not small.
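The shuffle-based importance for a general classifier can be sketched as below. The model is a fixed rule standing in for an already trained classifier, the hold-out data are made up, and averaging the accuracy drop over several shuffles (`n_repeats`) is an assumption added here to reduce randomness; the description above uses a single shuffle.

```python
import random

def accuracy(model, X, y):
    """Fraction of instances for which the model predicts the given label."""
    return sum(model(x) == label for x, label in zip(X, y)) / len(X)

def shuffle_importance(model, X_val, y_val, j, n_repeats=10, seed=0):
    """Average drop in hold-out accuracy after shuffling feature j.

    The model is trained once beforehand and is never retrained here.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X_val, y_val)
    drops = []
    for _ in range(n_repeats):
        column = [x[j] for x in X_val]
        rng.shuffle(column)
        X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X_val, column)]
        drops.append(baseline - accuracy(model, X_perm, y_val))
    return sum(drops) / n_repeats

# Hypothetical "trained" model: uses only feature 0 and ignores feature 1.
model = lambda x: 1 if x[0] > 0 else -1

# Assumed hold-out data: feature 0 determines the label, feature 1 is noise.
X_val = [[-2.0, 0.3], [-1.0, -0.7], [-0.5, 0.9], [-1.5, -0.2],
         [0.5, 0.4], [1.0, -0.8], [1.5, 0.1], [2.0, -0.5]]
y_val = [-1, -1, -1, -1, 1, 1, 1, 1]

importance = [shuffle_importance(model, X_val, y_val, j) for j in range(2)]
```

Shuffling the noise feature cannot change any prediction of this model, so its importance is exactly zero, while shuffling feature 0 breaks its link to the labels and lowers the accuracy.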

3.1.3 Separating Hyperplane in SVM

For a linear SVM, the normal direction of the separating hyperplane may give the importance of features [15, 16]. A simple illustration by a two-dimensional binary-class problem is in Figure 3.2. It is obvious that the feature of the horizontal axis has good separating ability, while the feature of the vertical axis is more like noise. Furthermore, the projected length of $w$, the normal vector of the separating hyperplane, on the horizontal axis is longer than that on the vertical axis. Therefore, one may conjecture that if the absolute value of a component in $w$ is larger, its corresponding feature is more important.

Figure 3.2: A two-dimensional data that is linearly separable. While the separating plane is nearly vertical, the feature of x-axis is more important.

However, for non-linear cases the decision boundary is not a hyperplane anymore. That is, the influence of a feature on the decision boundary varies and is dependent on where it is in the feature space. Some solutions to the nonlinear case will be described in Chapter V.

3.2 Feature Selection

In this section, we describe several techniques which select important features for learning. In some situations, the rank of features must be available first.

3.2.1 Forward/Backward Selection

When the number of features is small, a greedy strategy can be considered. Suppose there are $n$ features in a data set. First we consider feature subsets that contain only one feature. For $i = 1, \ldots, n$, we select the $i$th feature only and conduct the cross validation on the selected data. Therefore, the CV accuracies for $n$ different feature subsets are obtained. The feature with the highest CV accuracy is regarded as the most important one. Now, if feature subsets with two features are considered, instead of checking all $n(n - 1)/2$ possible subsets, we only look at the subsets containing the most important feature selected previously. Then, only $n - 1$ feature subsets are of interest. For each subset, we check the CV accuracy of the selected data again and choose the best subset. This selection continues until the performance is not increased, or all the features are selected. Such a procedure is called forward (or incremental) selection, since one more feature is included every time. In contrast, the procedure that starts from all the features and greedily removes one feature at a time is called backward elimination [14].

However, this procedure is time consuming if the number of features is large. A faster procedure can be applied if the rank of features is given. Instead of checking all the remaining features during each step of forward selection, the top-ranked feature is chosen. Therefore, the execution time of the whole procedure is proportional to the number of features.

Note that greedily selecting the best feature in each step does not guarantee finding the globally optimal set of features.
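As an illustration, the forward selection described above can be sketched in Python. This is a minimal sketch, not the implementation used in this thesis: `cv_accuracy` is a hypothetical user-supplied function that returns the CV accuracy for a given list of feature indices (for example, by training an SVM on those features), and ties are broken in favor of the lower feature index, which is an assumption.

```python
from typing import Callable, List, Sequence


def forward_selection(n_features: int,
                      cv_accuracy: Callable[[Sequence[int]], float]) -> List[int]:
    """Greedy forward selection over feature indices 0..n_features-1.

    cv_accuracy is a hypothetical callback: it returns the cross-validation
    accuracy obtained when only the given features are used.
    """
    selected: List[int] = []
    best_acc = float("-inf")
    remaining = list(range(n_features))
    while remaining:
        # Evaluate every subset formed by adding one remaining feature.
        scored = [(cv_accuracy(selected + [j]), j) for j in remaining]
        acc, j_best = max(scored, key=lambda t: t[0])
        if acc <= best_acc:
            break  # the performance no longer increases
        selected.append(j_best)
        remaining.remove(j_best)
        best_acc = acc
    return selected


def toy_accuracy(features: Sequence[int]) -> float:
    """Toy score: features 0 and 2 help, every extra feature costs a little."""
    score = 0.5 - 0.01 * len(features)
    if 0 in features:
        score += 0.2
    if 2 in features:
        score += 0.1
    return score


chosen = forward_selection(3, toy_accuracy)
```

With the toy score, feature 0 is picked first, then feature 2, and the procedure stops because adding feature 1 lowers the score.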

3.2.2 Feature Ranking and Feature Number Estimation

If a feature ranking is available, one does not need to conduct a forward or backward selection, whose time complexity, as indicated earlier, is linear in the number of features. We can decide the number of features by recursively removing half of the remaining features (or doubling the number of features). The feature set with the highest CV accuracy is then selected. The number of CV steps is only logarithmic in the number of features. An example of using this procedure is in [34].

We summarize this procedure in the following.

Algorithm 2 Obtain Feature Size

1. Partition the whole training data for k-fold cross-validation.

2. For each CV training set:

(a) Obtain the feature ranking of this CV training set.

(b) Let the working features be the whole set of features.

(c) Under the working features, construct a model by the CV training set to predict the CV testing set.

(d) Update the working features by removing the less important half. Go to step 2(c), or stop if the number of working features is small.

3. Obtain the CV accuracy for each size of working features. Assign p to be the size with the maximal CV accuracy.

4. Calculate the feature ranking on the whole training data. Select the most important p features. These features are then used for the final model construction and prediction.

Though computationally more efficient, this procedure checks fewer feature subsets. For example, it considers at most ten sets in a data set with 1000 features.
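The count of feature sets can be checked with a short Python sketch. It assumes that "removing half" means integer halving rounded down and that the procedure stops at a single feature; Algorithm 2 only says to stop when the number of working features is small, so both choices are assumptions. The helper `best_size` (a hypothetical name) picks p in step 3, preferring the smaller size on ties.

```python
from typing import Dict, List


def halving_sizes(n_features: int, min_size: int = 1) -> List[int]:
    """Sizes of the working feature sets visited by repeated halving."""
    min_size = max(min_size, 1)  # guard against a non-terminating loop
    sizes = []
    size = n_features
    while size >= min_size:
        sizes.append(size)
        size //= 2  # keep the more important half, rounded down
    return sizes


def best_size(cv_accuracy_by_size: Dict[int, float]) -> int:
    """Size with the maximal CV accuracy; ties go to the smaller size."""
    return min(cv_accuracy_by_size,
               key=lambda s: (-cv_accuracy_by_size[s], s))


sizes = halving_sizes(1000)
p = best_size({1000: 0.80, 500: 0.85, 250: 0.85, 125: 0.70})
```

For 1000 features the visited sizes are 1000, 500, 250, 125, 62, 31, 15, 7, 3, and 1, that is, ten feature sets.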

Note that the ranking of features for each CV training set is obtained at the beginning, in step 2(a), and is not updated throughout the iterations. For the same data under different feature subsets, some feature ranking methods may obtain different rankings. Therefore, one may update the ranking after step 2(d), where half of the features are removed. An example of using this implementation is Recursive Feature Elimination (RFE) in [15]. However, [34] points out that such an aggressive update of feature rankings may more easily lead to overfitting.

3.3 Feature Scaling

Instead of selecting a subset of features, one may assign weights to features, so that important features play a more vital role in training and prediction. Such techniques are often associated with particular classification methods. Here we describe two techniques using SVM.

3.3.1 Radius-Margin Bound SVM

[7] considers the RBF kernel with feature-wise scaling factors:

\[ K(x, x') = \exp\Bigl( -\sum_{j=1}^{n} \gamma_j (x_j - x'_j)^2 \Bigr). \tag{3.3.1} \]

By minimizing an estimate of the generalization error, which is a function of γ1, . . . , γn, the feature importance is obtained.
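To make the role of the scaling factors concrete, the kernel (3.3.1), that is, the exponential of the negative γ-weighted sum of squared feature differences, can be sketched in Python as follows. The function name is hypothetical and the sketch is only illustrative; it shows that a feature whose scaling factor is zero has no influence on the kernel value.

```python
import math
from typing import Sequence


def scaled_rbf_kernel(x: Sequence[float],
                      z: Sequence[float],
                      gamma: Sequence[float]) -> float:
    """RBF kernel with one scaling factor gamma[j] per feature."""
    if not (len(x) == len(z) == len(gamma)):
        raise ValueError("x, z, and gamma must have the same length")
    weighted_sq_dist = sum(g * (a - b) ** 2 for g, a, b in zip(gamma, x, z))
    return math.exp(-weighted_sq_dist)


# The second feature differs, but its scaling factor is zero.
k_ignored = scaled_rbf_kernel([1.0, 2.0], [1.0, 5.0], [1.0, 0.0])
# Both features differ by one and both have scaling factor 0.5.
k_equal = scaled_rbf_kernel([0.0, 0.0], [1.0, 1.0], [0.5, 0.5])
```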

Leave-one-out (loo) error is a common estimate of the generalization error but is not a smooth function of γ1, . . . , γn. Hence it is hard to minimize the loo error directly. For a hard-margin SVM, the loo error is bounded by a smoother function [36, 37]:

\[ \text{loo} \le 4 \|w\|^2 R^2, \tag{3.3.2} \]

where $\|w\|$ is the inverse of the margin of this SVM, and $R$ is the radius of the smallest sphere containing all instances in the kernel space. Therefore we refer to this upper bound as the “radius margin (RM) bound.” Then, instead of minimizing the loo error, we can minimize the radius margin bound to obtain certain γ1, . . . , γn. [36] states that $R^2$ is the objective value of the following optimization problem:

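\[ \begin{aligned} \max_{\beta} \quad & 1 - \beta^T K \beta \\ \text{subject to} \quad & 0 \le \beta_i, \ i = 1, \dots, l, \\ & e^T \beta = 1, \end{aligned} \tag{3.3.3} \]

where $K_{ij} = K(x_i, x_j)$.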

However, hard-margin SVMs are not suitable for practical use as in general data are not linearly separable. Among soft-margin SVMs, an L2-SVM can be transformed to an equivalent hard-margin form. The formula of L2-SVMs is similar to (2.2.1) except the penalty term:



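\[ \begin{aligned} \min_{w, b, \xi} \quad & \frac{1}{2} w^T w + \frac{C}{2} \sum_{i=1}^{l} \xi_i^2 \\ \text{subject to} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \\ & \xi_i \ge 0, \ i = 1, \dots, l. \end{aligned} \tag{3.3.4} \]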

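(3.3.4) can be reduced to a hard-margin form by using

\[ \tilde{w} \equiv \begin{bmatrix} w \\ \sqrt{C}\, \xi \end{bmatrix} \quad \text{and the $i$th training data as} \quad \begin{bmatrix} \phi(x_i) \\ y_i e_i / \sqrt{C} \end{bmatrix}, \tag{3.3.5} \]

where $e_i$ is a zero vector of length l except that the ith component is one. The kernel function becomes

\[ \tilde{K}(x_i, x_j) \equiv K(x_i, x_j) + \frac{\delta_{ij}}{C}, \tag{3.3.6} \]

where $\delta_{ij} = 1$ if i = j and 0 otherwise. Therefore the RM bound for L2-SVMs is

\[ 4 \|\tilde{w}\|^2 \tilde{R}^2. \tag{3.3.7} \]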
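As a check on this reduction, the following short derivation verifies that the augmented vectors reproduce the L2-SVM objective, the hard-margin constraint, and the modified kernel. The symbol $\tilde{x}_i$ for the augmented ith instance is introduced here only for illustration, and $\tilde{w}$ is taken to be $w$ stacked on $\sqrt{C}\,\xi$.

```latex
\begin{aligned}
\tilde{w} &= \begin{bmatrix} w \\ \sqrt{C}\,\xi \end{bmatrix}, \qquad
\tilde{x}_i = \begin{bmatrix} \phi(x_i) \\ y_i e_i / \sqrt{C} \end{bmatrix}, \\
\tfrac{1}{2}\,\tilde{w}^T \tilde{w}
  &= \tfrac{1}{2}\, w^T w + \tfrac{C}{2} \sum_{i=1}^{l} \xi_i^2, \\
y_i \bigl( \tilde{w}^T \tilde{x}_i + b \bigr)
  &= y_i \bigl( w^T \phi(x_i) + b \bigr) + y_i^2\, \xi_i
   = y_i \bigl( w^T \phi(x_i) + b \bigr) + \xi_i \;\ge\; 1, \\
\tilde{x}_i^T \tilde{x}_j
  &= \phi(x_i)^T \phi(x_j) + \frac{y_i y_j\, e_i^T e_j}{C}
   = K(x_i, x_j) + \frac{\delta_{ij}}{C}.
\end{aligned}
```

The second line uses $y_i^2 = 1$, and the last line uses $e_i^T e_j = \delta_{ij}$, so the slack variables are absorbed into the weight vector and only the diagonal of the kernel matrix changes.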

$\|\tilde{w}\|$ and $\tilde{R}$ are obtained by solving optimization problems. According to [37, 10], these two functions are differentiable with respect to C and γ1, . . . , γn. Thus, gradient-based methods can be applied to minimize this bound. Using these parameters, an SVM model can be built for future prediction. We call this machine an RM-bound SVM.

3.3.2 Bayesian SVM

[9] proposes a Bayesian technique for support vector classification. For a binary classification problem with instances $x_i$, i = 1, . . . , l, and labels $y_i \in \{-1, 1\}$, they assume that the decision function f is the realization of random variables indexed by the training vectors $x_i$ in a stationary zero-mean Gaussian process. In this case, the covariance between $f(x_i)$ and $f(x_j)$ can be defined as:

\[ \operatorname{cov}(f(x_i), f(x_j)) = \kappa_0 \exp\Bigl( -\frac{1}{2} \kappa \|x_i - x_j\|^2 \Bigr) + \kappa_b. \tag{3.3.8} \]

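Thus, the prior probability of the random variables $\{f(x_i)\}$ can be written as:

\[ P(f \mid \theta) = \frac{1}{Z_f} \exp\Bigl( -\frac{1}{2} f^T \Sigma^{-1} f \Bigr), \tag{3.3.9} \]

where $f = (f(x_1), \dots, f(x_l))^T$, $Z_f = (2\pi)^{l/2} |\Sigma|^{1/2}$, $\Sigma$ is the covariance matrix defined in (3.3.8), and $\theta$ denotes the hyperparameters $\kappa_0$, $\kappa$, and $\kappa_b$. Furthermore, they define the “trigonometric likelihood” function:

\[ P(y_i \mid f(x_i)) = \begin{cases} 0 & \text{if } y_i f(x_i) \in (-\infty, -1], \\ \cos^2\bigl( \frac{\pi}{4} (1 - y_i f(x_i)) \bigr) & \text{if } y_i f(x_i) \in (-1, +1), \\ 1 & \text{if } y_i f(x_i) \in [+1, +\infty), \end{cases} \tag{3.3.10} \]

where $y_i$ is the label of the instance $x_i$. Therefore, by Bayes' rule,

\[ P(f \mid D, \theta) = \frac{1}{Z_S} \exp(-S(f)), \tag{3.3.11} \]

where

\[ S(f) = \frac{1}{2} f^T \Sigma^{-1} f + \sum_{i=1}^{l} \delta(y_i f(x_i)). \tag{3.3.12} \]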


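Here δ is the “trigonometric loss function” derived from the trigonometric likelihood function (3.3.10):

\[ \delta(y_i f(x_i)) = -\log P(y_i \mid f(x_i)) = \begin{cases} +\infty & \text{if } y_i f(x_i) \in (-\infty, -1], \\ 2 \log \sec\bigl( \frac{\pi}{4} (1 - y_i f(x_i)) \bigr) & \text{if } y_i f(x_i) \in (-1, +1), \\ 0 & \text{if } y_i f(x_i) \in [+1, +\infty). \end{cases} \tag{3.3.13} \]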

Therefore, maximizing the posterior amounts to solving the following optimization problem:

\[ \min_{f} \ S(f) = \frac{1}{2} f^T \Sigma^{-1} f + \sum_{i=1}^{l} \delta(y_i f(x_i)). \tag{3.3.14} \]

For feature selection, they also propose a feature-wise scaling kernel similar to (3.3.1):

\[ \operatorname{cov}(f(x), f(x')) = \kappa_0 \exp\Bigl( -\frac{1}{2} \sum_{j=1}^{n} \kappa_j (x_j - x'_j)^2 \Bigr) + \kappa_b. \tag{3.3.15} \]

Thus, by treating $\kappa_j$ as variables in (3.3.14), the scaling factor of each feature can also be calculated.

Both the RM-bound SVM and the Bayesian SVM try to maximize estimates of generalization ability. We will see in the next chapter that they have similar performance in some experiments.


Experience on NIPS Competition

This chapter discusses our experience of participating in a feature selection competition. It shows how to apply some feature selection techniques described in the previous chapter to the competition data sets.

4.1 Introduction of the Competition

The NIPS 2003 Feature Selection Challenge was a competition held in the 16th annual conference on Neural Information Processing Systems (NIPS). There were five two-class data sets designed for the competition, and each data set was split into a training, a validation, and a testing set. The aim was to have the best performance on the testing sets.

The competition was composed of two stages. In the “development stage,” only labels of training sets were available. Competitors submitted their predicted labels of the validation and the testing sets. An on-line judge reported the performance on the validation sets, such as the balanced error rate. However, the performance on the testing sets was not available, so competitors could not know directly from submissions which strategy was better for the testing sets. The development stage ended on December 1st, and then the “final stage” began. Labels of validation sets were additionally given in the final stage, so competitors could use this information to predict labels of the testing sets. The performance on the testing sets was still kept secret. Only five submissions were allowed in the final stage, and the deadline was December 8th.

In the contest, we used combinations of SVMs with various feature selection strategies. Some of them were “filters”: general feature selection methods independent of SVM. That is, these methods selected important features first and then an SVM was applied for classification. On the other hand, some were wrapper-type methods: modifications of SVM which chose important features while conducting training/testing. Overall we ranked third as a group and were the winner of one data set.

4.2 Performance Measures

The following introduces performance measures which are considered in this challenge.

4.2.1 Balanced Error Rate (BER)

In the NIPS 2003 Feature Selection Challenge, the main judging criterion was the balanced error rate (BER):

\[ \text{BER} \equiv \frac{1}{2} \Bigl( \frac{\#\ \text{positive instances predicted wrong}}{\#\ \text{positive instances}} + \frac{\#\ \text{negative instances predicted wrong}}{\#\ \text{negative instances}} \Bigr). \tag{4.2.1} \]


For example, assume that a testing data set contains 90 positive and 10 negative instances. If all instances are predicted as positive, then the BER is 50% since the first term of (4.2.1) is 0/90 but the second is 10/10. In short, if there are fewer negative examples, each error on a negative example counts more.
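The example above can be reproduced with a few lines of Python. This is a minimal sketch that assumes labels are encoded as +1 and −1; the function name is hypothetical. The BER is computed as the average of the error rate on the positive instances and the error rate on the negative instances.

```python
from typing import Sequence


def balanced_error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average of the error rates on the positive and the negative class."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    pos_error = sum(1 for p in pos if p != 1) / len(pos)
    neg_error = sum(1 for p in neg if p != -1) / len(neg)
    return 0.5 * (pos_error + neg_error)


# 90 positive and 10 negative instances, all predicted as positive.
labels = [1] * 90 + [-1] * 10
predictions = [1] * 100
ber = balanced_error_rate(labels, predictions)
```

Here the plain error rate would be only 10%, while the BER is 50%, which is why the BER is less forgiving of a classifier that ignores the smaller class.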

4.2.2 Area Under Curve (AUC)

Besides predicted labels, competitors could also submit a confidence value for each prediction. With the confidence of each prediction, an ROC curve can be drawn. The area under curve (AUC) is defined as the area under this ROC curve.
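One way to compute the AUC without drawing the curve is sketched below in Python. It relies on the standard fact that the area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher confidence than a randomly chosen negative one, with ties counted as one half. How the organizers computed the AUC is not described here, so this is an assumption for illustration; labels are again assumed to be +1 and −1, and `scores` are confidence values for the positive class.

```python
from typing import Sequence


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


auc_example = auc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1])
```

In this example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. The pairwise loop is quadratic in the number of instances and is meant only to make the definition explicit.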

4.2.3 Fraction of Features

The fraction of features is the ratio of the number of features used by the classifier to the total number of features in the data set.

4.2.4 Fraction of Probes

A certain number of features that were meaningless by design were introduced in generating the competition data. These features are called random probes. The ratio of random probes selected is also a criterion to judge the performance; the smaller it is, the more capable the selection approach is at filtering out irrelevant features.

The relative strength of classifiers was judged primarily on the BER. For methods whose performance differences were not statistically significant, the method using the smallest number of features would win. The fraction of probes was used to assess the relative strength of methods that were not significantly different in both the error rate and the number of features. In that case, the method with the smallest number of random probes in the feature set won.

Although there were several performance measures, throughout the competition we focused on how to achieve the smallest BER.


4.3 Data Sets Information

The organizers prepared the competition data by transforming some publicly available data sets. Table 4.1 contains some statistics of the competition data sets. However, the identity of the original data was not revealed until the submission deadline, so competitors could not use domain knowledge about the data. Moreover, irrelevant features were added as “random probes.”

Table 4.1: Statistics of competition data sets

Data set          ARCENE   DEXTER   DOROTHEA   GISETTE   MADELON
Feature Numbers    10000    20000     100000      5000       500
Training Size        100      300        800      6000      2000
Validation Size      100      300        350      1000       600
Testing Size         700     2000        800      6500      1800

4.3.1 Source of Data Sets

The sources of the five competition data sets are described in the following.

ARCENE came from the cancer data set of National Cancer Institute (NCI) and Eastern Virginia Medical School (EVMS) together with 3,000 random probes. The labels indicated whether patterns were cancer or not.

DEXTER was a subset of the Reuters text categorization benchmark. The labels indicated whether articles were about “corporate acquisitions.” 9,947 features represented word frequencies while 10,053 random probes were added.

DOROTHEA was the Thrombin data set, which was from KDD (Knowledge Discovery and Data Mining) Cup 2001. The last 50,000 features were randomly permuted to be random probes.

GISETTE was transformed from MNIST [22], a handwritten digits recognition problem. The task was to separate digits “4” and “9.” Features were created by a random selection of subsets of products of pairs of pixel values plus 2,500 random probes.

MADELON was an artificial data set with only five useful features. Clusters of different classes were placed on the vertices of a five-dimensional hypercube. Besides those five useful features, there were also five redundant features, ten repeated features, and 480 random probes.

4.4 Strategies in Competition

In this section, we discuss feature selection strategies tried during the competition. Each method is named as “A + B,” where A is a filter to select features and B is a classifier or a wrapper. If a method is “A + B + C,” it means that there are two filters A and B.

4.4.1 No Selection: Direct Use of SVM

The first strategy was to directly use SVMs without feature selection. Thus, the procedure in Section 2.6 was considered.

4.4.2 F-score for Feature Selection: F-score + SVM

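Introduced in Section 3.1.1, F-score is a simple technique which measures the discrimination of a feature. In the competition, a variant of F-score different from (3.1.2) was used:

\[ \tilde{F}(j) \equiv \frac{\tilde{S}_B(j)}{\tilde{S}_W(j)}, \tag{4.4.1} \]

where

\[ \tilde{S}_B(j) = \sum_{k} \bigl( \bar{x}^k_j - \bar{x}_j \bigr)^2, \tag{4.4.2} \]

\[ \tilde{S}_W(j) = \sum_{k} \frac{1}{l_k - 1} \sum_{x \in \text{class } k} \bigl( x_j - \bar{x}^k_j \bigr)^2, \tag{4.4.3} \]

with the same notation definition in (3.1.2).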
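The following Python sketch shows one reading of this F-score variant. It assumes that the numerator sums, over the classes, the squared difference between the class mean and the overall mean of feature j, and that the denominator sums, over the classes, the sample variance of feature j within the class (with divisor $l_k - 1$, where $l_k$ is the class size). The function name and the list-of-rows data layout are hypothetical, and the code is not the implementation used in the competition.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def f_score_variant(X: Sequence[Sequence[float]],
                    y: Sequence[Hashable],
                    j: int) -> float:
    """Between-class scatter over within-class scatter for feature j."""
    values_by_class = defaultdict(list)
    for row, label in zip(X, y):
        values_by_class[label].append(row[j])
    overall_mean = sum(row[j] for row in X) / len(X)

    between = 0.0
    within = 0.0
    for values in values_by_class.values():
        l_k = len(values)
        if l_k < 2:
            raise ValueError("each class needs at least two instances")
        class_mean = sum(values) / l_k
        between += (class_mean - overall_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in values) / (l_k - 1)
    if within == 0.0:
        # Degenerate case: the feature is constant inside every class.
        return math.inf if between > 0.0 else 0.0
    return between / within


X_toy: List[List[float]] = [[0.0, 5.0], [2.0, 5.5], [10.0, 4.5], [12.0, 5.0]]
y_toy = [1, 1, -1, -1]
score_0 = f_score_variant(X_toy, y_toy, 0)  # well separated feature
score_1 = f_score_variant(X_toy, y_toy, 1)  # overlapping feature
```

In the toy data the first feature separates the two classes clearly and obtains a score of 12.5, while the second feature overlaps between the classes and obtains 0.5.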

We selected features with high F-scores and then applied SVM for training/prediction.

The procedure is summarized below:

Algorithm 3 F-score + SVM

1. Calculate F-score of every feature.

2. Pick some possible thresholds by human eye to cut low and high F-scores.

3. For each threshold, do the following

(a) Drop features with F-score below this threshold.

(b) Randomly split the training data into Xtrain and Xvalid.

(c) Let Xtrain be the new training data. Use the SVM procedure in Chapter II to obtain a predictor; use the predictor to predict Xvalid.

(d) Repeat steps (a)-(c) five times, and then calculate the average validation error.

4. Choose the threshold with the lowest average validation error.

5. Drop features with F-score below the selected threshold. Then apply the SVM procedure in Chapter II.
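Steps 3 and 4 of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `validation_error` is a hypothetical callback that trains an SVM on the training indices with the given features and returns the error on the validation indices, and the 80/20 split ratio and the fixed random seed are illustrative choices that are not specified in the procedure above.

```python
import random
from typing import Callable, List, Optional, Sequence


def choose_threshold(
    f_scores: Sequence[float],
    thresholds: Sequence[float],
    n_instances: int,
    validation_error: Callable[[List[int], List[int], List[int]], float],
    n_repeats: int = 5,
    valid_fraction: float = 0.2,  # assumed split ratio
    seed: int = 0,
) -> Optional[float]:
    """Return the threshold with the lowest average validation error."""
    rng = random.Random(seed)
    best_threshold, best_error = None, float("inf")
    for threshold in thresholds:
        # Step 3(a): drop features whose F-score is below the threshold.
        features = [j for j, s in enumerate(f_scores) if s >= threshold]
        if not features:
            continue
        errors = []
        for _ in range(n_repeats):
            # Step 3(b): randomly split the training data.
            indices = list(range(n_instances))
            rng.shuffle(indices)
            n_valid = max(1, int(valid_fraction * n_instances))
            valid_idx, train_idx = indices[:n_valid], indices[n_valid:]
            # Step 3(c): train on one part and predict the other.
            errors.append(validation_error(train_idx, valid_idx, features))
        average = sum(errors) / len(errors)  # step 3(d)
        if average < best_error:
            best_threshold, best_error = threshold, average
    return best_threshold  # step 4


def toy_error(train_idx: List[int], valid_idx: List[int],
              features: List[int]) -> float:
    """Toy stand-in for SVM training: exactly two features is ideal."""
    return 0.1 * abs(len(features) - 2)


chosen_threshold = choose_threshold(
    [0.9, 0.5, 0.01, 0.02], [0.005, 0.1, 0.7], 10, toy_error)
```

In an actual run, the callback would wrap the SVM procedure of Chapter II, and the candidate thresholds would come from inspecting the F-score curve.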

In the above procedure, possible thresholds were identified by “human eye.” For data sets in this competition, there was a quite clear gap between high and low F-scores (see Figure 4.1, which will be described in Section 4.5).


4.4.3 F-score and Random Forest for Feature Selection: F-score + RF + SVM

Random Forest (RF) was involved in our third method. We followed a procedure similar to that in [35] but used SVMs as classifiers instead.

In practice, the RF code we used could not handle too many features. Thus, before using RF to select features, we obtained a subset of features using F-score selection first. This approach is thus called “F-score + RF + SVM” and is summarized below:

Algorithm 4 F-score + RF + SVM

1. F-score

(a) Consider the subset of features obtained in Section 4.4.2.

2. RF

(a) Initialize the RF working data set to include all training instances with the subset of features selected from Step 1. Use RF to obtain the rank of features.

(b) Use RF as a predictor and conduct five-fold CV on the working set.

(c) Update the working set by removing the less important half of the features and go to Step 2(b). Stop if the number of features is small.

(d) Among various feature subsets chosen above, select the one with the lowest CV error.

3. SVM


(a) Apply the SVM procedure in Chapter II on the training data with the selected features.

4.4.4 Random Forest and RM-bound SVM for Feature Selection

Our final strategy was to apply RM-bound SVMs described in Section 3.3.1 on competition data sets. When the number of features is large, minimizing the RM bound is time consuming. Thus, we applied this technique only on the problem MADELON, which contained 500 features. To further reduce the computational burden, we used RF to pre-select important features. Thus, this method is referred to as “RF + RM-SVM.”

4.5 Experimental Results

In the experiment, we used LIBSVM [6] for SVM classification. For feature selection methods, we used the randomForest [23] package in the software R for RF and modified the implementation in [10] for the RM-bound SVM. Data sets were scaled before doing experiments. With training, validation, and testing data together, each feature was linearly scaled to [0, 1]. Apart from scaling, there was no other data preprocessing.
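The linear scaling to [0, 1] can be sketched in Python as follows. The sketch assumes that `rows` already contains the training, validation, and testing instances together, so that the minimum and maximum of each feature are taken over all of them; mapping a constant feature to 0 is an assumption made here to avoid division by zero. The function name is hypothetical.

```python
from typing import List, Sequence


def scale_features(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Linearly scale every feature (column) to the range [0, 1]."""
    n_features = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_features)]
    highs = [max(r[j] for r in rows) for j in range(n_features)]
    scaled = []
    for r in rows:
        scaled.append([
            (r[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j]
            else 0.0  # constant feature: assumed to map to 0
            for j in range(n_features)
        ])
    return scaled


scaled_rows = scale_features([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
```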

In the development stage, only labels of training sets were known. An on-line judge returned the BER of competitors' predictions on the validation sets, but labels of validation sets and any information about the testing sets were kept unknown.

We mainly focused on three feature selection strategies discussed in Sections 4.4.1-4.4.3: SVM, F-score + SVM, and F-score + RF + SVM. For RF + RM-SVM, due to the large number of features, we only applied it on MADELON. The RF procedure in Section 4.4.3 selected 16 features and then an RM-SVM scaled them.

In all experiments we focused on getting the smallest BER.

For the strategy F-score + RF + SVM, after the initial selection by F-score, we found that RF retained all features. That is, by comparing cross-validation BER using different subsets of features, the one with all features was the best. Hence, F+RF+SVM was in fact the same as F+SVM for all the five data sets. Since our validation accuracy on DOROTHEA was not good when compared to some participants, we considered a heuristic by submitting results using the top 100, 200, and 300 features from RF. The BERs of the validation set were 0.1431, 0.1251, and 0.1498, respectively. Therefore, we considered “F-score + RF top 200 + SVM” for DOROTHEA.

Table 4.2 presents the BER on validation data sets by different feature selection strategies. It shows that no method is the best on all data sets.

Table 4.2: Comparison of different methods during the development period: BERs of validation sets (in percentage); bold-faced entries correspond to approaches used to generate our final submission

Data set      ARCENE   DEXTER   DOROTHEA   GISETTE   MADELON
SVM            13.31    11.67      33.98      2.10     40.17
F+SVM          21.43     8.00      21.38      1.80     13.00
F+RF+SVM       21.43     8.00      12.51      1.80     13.00
RF+RM-SVM§         –        –          –         –      7.50

§ Our implementation of RF+RM-SVM is applicable only to MADELON, which has a smaller number of features.

In Table 4.3 the CV BER on the training set is listed. Results of the first three problems are quite different from those in Table 4.2. Due to the small training sets or other reasons, CV did not accurately indicate the future performance.

In Table 4.4, the first row indicates the threshold of F-score. The second row is the number of selected features, which is compared to the total number of features in the third row. Figure 4.1 presents the curve of F-scores against features.

Table 4.3: CV BER on the training set (in percentage)

Data set      ARCENE   DEXTER   DOROTHEA   GISETTE   MADELON
SVM            11.04     8.33      39.38      2.08     39.85
F+SVM           9.25     4.00      14.21      1.37     11.60

Table 4.4: F-score threshold and the number of features selected in the approach F+SVM

Data set             ARCENE   DEXTER   DOROTHEA   GISETTE   MADELON
F-score threshold       0.1    0.015       0.05      0.01     0.005
#features selected      661      209        445       913        13
#total features       10000    20000     100000      5000       500

4.6 Competition Results

For each data set, we submitted the final result using the method that led to the best validation accuracy in Table 4.2. A comparison of competition results (ours and winning entries) is in Tables 4.5 and 4.6. The column “Score” means the overall performance among all the methods. The rest of the columns are performance measures mentioned in Section 4.2.

For the development stage submissions, we ranked 1st on GISETTE, 3rd on MADELON, and 5th on ARCENE. Overall we ranked 3rd as a group and our best entry was the 6th, using the criterion of the organizers. For the final stage submissions, we ranked 2nd as a group and our best entry was the 4th.

4.7 Discussion and Conclusions from the Competition

Usually SVMs suffer from a large number of features, but we found that a direct use of SVM worked well on GISETTE and ARCENE. After the competition, we realized that GISETTE came from an OCR problem MNIST [22] with 784 features of gray-level



