A New Method for Feature Subset Selection for Handling Classification Problems
Shyi-Ming Chen and Jen-Da Shie
Department of Computer Science and Information Engineering National Taiwan University of Science and Technology
Taipei, Taiwan, R. O. C.
Abstract- In this paper, we present a new method for dealing with feature subset selection for handling classification problems.
We discriminate numeric features to construct the membership function of each fuzzy subset of each feature. Then, we select the feature subset based on the proposed fuzzy entropy measure with boundary samples. The proposed feature subset selection method can select relevant features from sample data to get higher average classification accuracy rates than the ones selected by the existing methods.
I. INTRODUCTION
In [19], Tsang et al. pointed out that feature subset selection aims to reduce the number of features used in classification or recognition. A data set might have both irrelevant and relevant features. If we can properly select relevant features to deal with classification problems, we can increase the classification accuracy rates. In recent years, many feature subset selection methods have been proposed [1], [2], [4], [5], [7], [19]. Most feature subset selection methods focus on designing high performance algorithms based on features' quality measures or on feature subset searching. There are many methods for dealing with features' quality measures, such as similarity measures [19], gain-entropies [4], the relevance of features [1], decision tables [5], the overall feature evaluation index (OFEI) [7], the feature quality index (FQI) [7], the mutual information-based feature selector (MIFS) [2], etc. Several searching algorithms for reducing feature spaces have been proposed, such as heuristic algorithms [19], genetic algorithms [5], the greedy method [4], etc. The approaches for feature subset selection can be divided into the filter model and the wrapper model [9]. In the filter model, features are filtered independently of classification algorithms. The wrapper model includes a target classifier for performance evaluation. In this paper, we adopt the filter model to select the feature subset, due to the fact that we usually do not know which classifier is suitable in an unknown domain, and new classifiers with higher performance will be developed in a well-known domain.
In this paper, we focus on boundary samples instead of the full set of samples to propose a new method to calculate the fuzzy entropy of a feature subset, where boundary samples are usually the critical points for improving the classification accuracy rates of classification algorithms, such as Support Vector Machines (SVMs) [3] and Boosting [8]. The proposed feature subset selection method is based on the proposed fuzzy entropy measure to search the feature subset. In this paper, we use three different kinds of classifiers (i.e., LMT [12], Naive Bayes [10], and SMO [16]) to compare the average classification accuracy rates of the feature subset selected by the proposed method with the ones selected by the existing methods, i.e., OFFSS [19], OFEI [7], FQI [7] and MIFS [2]. The proposed feature subset selection method can select relevant features to get higher average classification accuracy rates than the ones selected by the existing methods.
II. FUZZY ENTROPY MEASURES
Entropy measures are commonly used in information theory, where Shannon's entropy [17] is the most widely used; it characterizes the impurity of a collection of samples.
Let X be a discrete random variable with a finite set containing n elements, where X = {x_1, x_2, …, x_n}. If an element x_i occurs with a probability p(x_i), then the amount of information I(x_i) associated with the known occurrence of x_i is defined as follows:

    I(x_i) = -log_2 p(x_i).    (1)

The entropy H(X) of X is defined as follows:

    H(X) = -Σ_{i=1}^{n} p(x_i) log_2 p(x_i).    (2)
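As a quick numeric illustration of formula (2) (the distributions below are our own, not from the paper):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), taking 0 * log2(0) as 0."""
    total = sum(p * math.log2(p) for p in probs if p > 0)
    return -total if total != 0 else 0.0

print(shannon_entropy([0.5, 0.5]))  # 1.0: a fair binary outcome is maximally impure
print(shannon_entropy([1.0]))       # 0.0: a certain outcome is pure
print(shannon_entropy([0.25] * 4))  # 2.0: four equally likely outcomes
```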
In [20], Zadeh defined a fuzzy entropy on a fuzzy subset Ã of a finite set X = {x_1, x_2, …, x_n} with respect to the probability distribution P = {p_1, p_2, …, p_n}, shown as follows:

    H(Ã) = -Σ_{i=1}^{n} μ_Ã(x_i) p_i log_2 p_i,    (3)

where μ_Ã denotes the membership function of Ã, μ_Ã(x_i) denotes the grade of membership of x_i in the fuzzy set Ã, p_i denotes the probability of x_i, and 1 ≤ i ≤ n.
In [14], De Luca et al. defined a fuzzy entropy measure based on Shannon's entropy [17]. In [11], Kosko defined a fuzzy entropy measure based on the geometry of the hypercube. In [13], Lee et al. presented a fuzzy entropy measure of an interval, based on Shannon's entropy and De Luca's axioms.
In this paper, we present a new fuzzy entropy measure of a fuzzy set, shown as follows.
Definition 2.1: Assume that a set X of samples is divided into a set C of classes. The class degree CD_c(Ã) of the sample data labeled as class c, where c ∈ C, in the fuzzy set Ã is defined by:

    CD_c(Ã) = ( Σ_{x∈X_c} μ_Ã(x) ) / ( Σ_{x∈X} μ_Ã(x) ),    (4)

where X_c denotes the samples belonging to class c, c ∈ C, and μ_Ã(x) denotes the membership grade of x in the fuzzy set Ã.
Definition 2.2: The fuzzy entropy FE_c(Ã) of the samples labeled as class c, where c ∈ C, in the fuzzy set Ã is defined as follows:

    FE_c(Ã) = -CD_c(Ã) log_2 CD_c(Ã).    (5)
Definition 2.3: The fuzzy entropy FE(Ã) of a fuzzy set Ã is defined by:

    FE(Ã) = Σ_{c∈C} FE_c(Ã).    (6)
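As an illustrative sketch of Definitions 2.1-2.3 (our own code; the membership grades and labels below are hypothetical, not from the paper):

```python
import math

def class_degree(memberships, labels, c):
    """CD_c(A): grades of the class-c samples over all grades in fuzzy set A (Definition 2.1)."""
    return sum(m for m, y in zip(memberships, labels) if y == c) / sum(memberships)

def fuzzy_entropy(memberships, labels):
    """FE(A): sum over classes of -CD_c(A) * log2 CD_c(A) (Definitions 2.2 and 2.3)."""
    fe = 0.0
    for c in set(labels):
        cd = class_degree(memberships, labels, c)
        if cd > 0:
            fe -= cd * math.log2(cd)
    return fe

# Four samples with membership grades in a fuzzy set A and their class labels.
mu = [0.9, 0.8, 0.7, 0.6]
y = ["pos", "pos", "neg", "neg"]
print(fuzzy_entropy(mu, y))  # near 1: the grades are split almost evenly between classes
```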
III. FUZZY ENTROPY MEASURE OF A FEATURE
In the following, we define the fuzzy entropy measure of a feature and present an algorithm to construct the membership function of each fuzzy set of each feature. Fuzzy entropy of a feature is defined as follows.
Definition 3.1: The fuzzy entropy FFE(f) of a feature f is defined by:

    FFE(f) = Σ_{v∈V} (s_v / s) FE(v),    (7)

where V denotes the set of fuzzy subsets of the feature f, FE(v) denotes the fuzzy entropy of the fuzzy set v, s denotes the total membership grade of the elements in all fuzzy subsets of the feature f, s_v denotes the total membership grade of the elements in the fuzzy subset v, and s_v / s denotes the "weight" of FE(v).
There are two categories of features: nominal and numeric. Both of them have corresponding membership functions of fuzzy sets. Each individual value of a nominal feature can be treated as a fuzzy subset, where the membership function is defined as follows:

    μ_u(x) = 1, if x = u;
    μ_u(x) = 0, otherwise,    (8)

where u ∈ U, U denotes the set of values of a nominal feature, and μ_u denotes the membership function of the fuzzy subset u.
For example, the set of values of the feature "Sex" is {male, female}. When the value of the feature "Sex" is "male", the membership grades are μ_male(male) = 1 and μ_female(male) = 0.
Fig. 1. A numeric feature A with fuzzy subsets v_1, v_2, and v_3, where the cluster centers of v_1, v_2, and v_3 are m_1, m_2, and m_3, respectively.
A numeric feature can be discriminated into finitely many fuzzy subsets. The number of fuzzy sets will affect the result of classification. Therefore, the discrimination of a numeric feature is an important process. Using unsupervised learning techniques to discriminate a numeric feature is a good method, and the K-means clustering algorithm [15] is widely used, where it uses the Euclidean distance measure to generate cluster centers. In this paper, we apply the K-means clustering algorithm to generate k cluster centers and then construct the corresponding membership functions, where the cluster centers are used as the centers of the fuzzy sets. For example, assume that m_1, m_2 and m_3 are the cluster centers of three clusters, respectively. Then, we can construct their corresponding membership functions as shown in Fig. 1, where μ_v1(0) = 0.5, μ_v1(m_1) = 1, μ_v1(m_2) = 0, μ_v2(m_1) = 0, μ_v2(m_2) = 1, μ_v2(m_3) = 0, μ_v3(m_2) = 0, μ_v3(m_3) = 1, and μ_v3(U_max) = 0.5.
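The membership grades listed above can be reproduced with a small sketch of the triangular functions in Fig. 1; the centers and the feature range below are hypothetical, and the boundary fuzzy sets follow the convention visible in Fig. 1 (grade 0.5 at the feature's extreme values):

```python
def triangular_mf(x, m_left, m_center, m_right):
    """Grade rises linearly from 0 at m_left to 1 at m_center, then falls to 0 at m_right."""
    if x <= m_center:
        return max(1 - (m_center - x) / (m_center - m_left), 0.0)
    return max(1 - (x - m_center) / (m_right - m_center), 0.0)

# Hypothetical cluster centers on a feature whose values range over [0, 10].
u_min, u_max = 0.0, 10.0
m1, m2, m3 = 2.0, 5.0, 8.0

# The boundary fuzzy sets mirror their center across the feature's extreme
# value, which makes the membership grade exactly 0.5 at u_min and u_max.
mu_v1 = lambda x: triangular_mf(x, u_min - (m1 - u_min), m1, m2)
mu_v2 = lambda x: triangular_mf(x, m1, m2, m3)
mu_v3 = lambda x: triangular_mf(x, m2, m3, u_max + (u_max - m3))

print(mu_v1(u_min), mu_v1(m1), mu_v1(m2))  # 0.5 1.0 0.0
print(mu_v2(m1), mu_v2(m2), mu_v2(m3))     # 0.0 1.0 0.0
print(mu_v3(m2), mu_v3(m3), mu_v3(u_max))  # 0.0 1.0 0.5
```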
The fuzzy entropy of a feature decreases when the number of clusters increases. However, too many clusters could cause the overfitting problem, which means that the models over-fit the training data set and their classification accuracy rates decrease when they classify new instances. In this paper, we define a threshold value T_c to avoid the overfitting problem, where T_c ∈ [0, 1]. When the decreasing rate of the fuzzy entropy of a feature is less than the given threshold value T_c, we stop increasing the number of clusters, where the decreasing rate of the fuzzy entropy is calculated by subtracting the fuzzy entropy of the feature divided into k clusters from the fuzzy entropy of the feature divided into k - 1 clusters. In the following, we present an algorithm to construct the membership functions of a feature:
Step 1: Initially, set the number k of clusters to 2.
Step 2: Use the K-means clustering algorithm to generate k cluster centers.
    /* Assign initial values to the k cluster centers. */
    for i = 1 to k do
        let m_i = x_{i × |X| / k};
    repeat {
        /* Assign each sample to the cluster whose center has the minimum Euclidean distance, where "arg min_{x∈X} f(x)" returns one of such x that minimizes the function f(x), and "‖x‖" denotes the Euclidean norm. */
        for all x ∈ X {
            let i = arg min_{1 ≤ j ≤ k} ‖x - m_j‖²;
            let Cluster_i = Cluster_i ∪ {x}
        };
        /* Calculate a new cluster center for each cluster, where n_i denotes the number of items in the ith cluster. */
        for i = 1 to k do
            let m_i = ( Σ_{x∈Cluster_i} x ) / n_i;
    } until each cluster set is not changed.
Step 3: Construct the membership functions corresponding to the k cluster centers.
    /* Assign neighbor cluster centers for the ith cluster center m_i, where m_L denotes the left "cluster center" of m_i, m_R denotes the right "cluster center" of m_i, "U_min" denotes the minimum value of the feature, and "U_max" denotes the maximum value of the feature. */
    let m_L = U_min - (m_i - U_min), if i = 1;
        m_L = m_{i-1}, otherwise;
    let m_R = U_max + (U_max - m_i), if i = k;
        m_R = m_{i+1}, otherwise;
    /* Construct the membership function of the fuzzy set v_i based on the ith cluster center m_i, where "Max" denotes the maximum operator. */
    let μ_{v_i}(x) = Max{1 - (m_i - x) / (m_i - m_L), 0}, if x ≤ m_i;
        μ_{v_i}(x) = Max{1 - (x - m_i) / (m_R - m_i), 0}, if x > m_i.
Step 4: /* Calculate the fuzzy entropy of the feature f by formulas (4)-(7). */
    for i = 1 to k do
        let FE(v_i) = Σ_{c∈C} FE_c(v_i);
    let FFE(f) = Σ_{v∈V} (s_v / s) FE(v).
Step 5: If the decreasing rate of the fuzzy entropy is larger than the threshold value T_c, then let k = k + 1 and go to Step 2. Otherwise, let k = k - 1 and stop.
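The five steps above can be sketched end to end in Python. This is our own simplified, one-dimensional reading of the algorithm (a single numeric feature, plain 1-D K-means, hypothetical data), not the authors' implementation:

```python
import math

def kmeans_1d(xs, k, iters=100):
    """Plain one-dimensional K-means (Step 2): alternate assignment and center update."""
    xs = sorted(xs)
    centers = [xs[min(i * len(xs) // k, len(xs) - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        new = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:  # stop when no cluster set changes
            break
        centers = new
    return sorted(centers)

def memberships(x, centers, u_min, u_max):
    """Triangular grades of x in each fuzzy set v_i (Step 3), with mirrored boundary centers."""
    grades = []
    for i, m in enumerate(centers):
        left = centers[i - 1] if i > 0 else u_min - (m - u_min)
        right = centers[i + 1] if i < len(centers) - 1 else u_max + (u_max - m)
        if x <= m:
            grades.append(max(1 - (m - x) / (m - left), 0.0) if m > left else 1.0)
        else:
            grades.append(max(1 - (x - m) / (right - m), 0.0) if right > m else 1.0)
    return grades

def feature_fuzzy_entropy(xs, labels, centers):
    """FFE(f): sum over fuzzy sets of (s_v / s) * FE(v), formulas (4)-(7) (Step 4)."""
    u_min, u_max = min(xs), max(xs)
    rows = [memberships(x, centers, u_min, u_max) for x in xs]
    s_v = [sum(row[i] for row in rows) for i in range(len(centers))]
    s = sum(s_v)
    ffe = 0.0
    for i in range(len(centers)):
        if s_v[i] == 0:
            continue
        fe = 0.0
        for c in set(labels):
            cd = sum(r[i] for r, y in zip(rows, labels) if y == c) / s_v[i]
            if cd > 0:
                fe -= cd * math.log2(cd)
        ffe += (s_v[i] / s) * fe
    return ffe

def choose_k(xs, labels, t_c=0.1, k_max=8):
    """Steps 1 and 5: start at k = 2 and grow k while the entropy decrease stays >= T_c."""
    k, prev = 2, feature_fuzzy_entropy(xs, labels, kmeans_1d(xs, 2))
    while k < k_max:
        cur = feature_fuzzy_entropy(xs, labels, kmeans_1d(xs, k + 1))
        if prev - cur < t_c:
            return k
        k, prev = k + 1, cur
    return k

# Hypothetical feature: two well-separated value ranges with opposite labels.
xs = [1.0, 1.2, 1.4, 5.0, 5.2, 5.4]
ys = ["neg", "neg", "neg", "pos", "pos", "pos"]
print(choose_k(xs, ys))  # 2: a third cluster no longer lowers the entropy by T_c
```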
IV. THE PROPOSED ALGORITHM FOR FEATURE SUBSET SELECTION
In this section, we present an algorithm for feature subset selection. The proposed algorithm uses boundary samples instead of the full set of samples to select the feature subset. First, we introduce the concept of boundary samples. Then, we define the fuzzy entropy of a feature subset. Finally, we propose an algorithm for feature subset selection.
In a dimension reduction problem [6], each feature might have incorrectly classified samples. Thus, an optimal feature subset is a set of correlated features [9]. It means that the samples incorrectly classified by one feature could be correctly classified by another feature. "Boundary samples" are the incorrectly classified samples of features, and we should focus on them for feature subset selection. For example, Table I shows an example data set. The incorrectly classified samples of feature A are {1, 2, 5}, because the labels of these samples with the same feature value are ambiguous; that is, the value of feature A with incorrectly classified samples is "black". In the same way, the incorrectly classified samples of feature B are {2, 5, 6}. Thus, we only need to use Samples 2 and 5 to calculate the entropy of the feature subset {A, B}. Because Sample 1 can be correctly classified by feature B, it can also be correctly classified by the feature subset {A, B}. Thus, Sample 1 could be omitted, and Samples 3, 4 and 6 could be omitted, too. Therefore, we can reduce the number of samples from 6 to 2.
TABLE I
AN EXAMPLE OF DATA SET

No. | Feature A | Feature B | Feature C | Label
----|-----------|-----------|-----------|---------
 1  | Black     | ocean     | summer    | positive
 2  | Black     | lake      | winter    | positive
 3  | White     | ocean     | fall      | positive
 4  | Red       | river     | winter    | negative
 5  | Black     | lake      | fall      | negative
 6  | Red       | lake      | fall      | negative
A feature subset can be regarded as a collection of multiple features. For example, in Table I, the values of the feature subset {A, B} are {(black, ocean), (black, lake), (black, river), (white, ocean), (white, lake), (white, river), (red, ocean), (red, lake), (red, river)}. In Table I, Sample 2 and Sample 5 are called boundary samples due to the fact that although the values of feature A and feature B of Sample 2 and Sample 5 are both "Black" and "lake", respectively, they get different labels, "positive" and "negative", respectively. Thus, we can calculate the fuzzy entropy of the feature subset {A, B} using only the boundary samples, i.e., Sample 2 and Sample 5.
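The boundary-sample computation on Table I can be sketched as follows (the helper function is our own; sample numbers follow the table):

```python
from collections import defaultdict

def ambiguous_samples(values, labels):
    """Return the 1-based sample numbers whose feature value occurs with more than one label."""
    classes_by_value = defaultdict(set)
    for v, y in zip(values, labels):
        classes_by_value[v].add(y)
    return {i + 1 for i, v in enumerate(values) if len(classes_by_value[v]) > 1}

# Table I: features A and B and the labels of Samples 1-6.
feature_a = ["Black", "Black", "White", "Red", "Black", "Red"]
feature_b = ["ocean", "lake", "ocean", "river", "lake", "lake"]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

amb_a = ambiguous_samples(feature_a, labels)  # {1, 2, 5}
amb_b = ambiguous_samples(feature_b, labels)  # {2, 5, 6}
print(sorted(amb_a & amb_b))                  # [2, 5]: the boundary samples of {A, B}
```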
However, we cannot use the boundary samples to evaluate the fuzzy entropy directly. We can use an indirect method to simplify the process of feature subset selection. Assume that the sample distribution is shown in Fig. 2, where the symbols "O" and "X" denote the positive samples and the negative samples, respectively. The corresponding membership functions of feature A are shown in Fig. 3, where the fuzzy entropy of the fuzzy set v_1 is 0. If we omit the samples whose values of feature A are less than u_A3, then it will affect the fuzzy entropy of the fuzzy set v_2.
Fig. 2. The sample distribution with two features and two classes.
Fig. 3. The corresponding membership functions of the feature A.
Therefore, we must use an indirect method to get the benefit of the fuzzy entropy measure focusing on boundary samples. When we evaluate the fuzzy entropy of a feature subset, we omit unambiguous fuzzy subsets instead of unambiguous samples. In the previous example, we could omit the fuzzy set v_1 when evaluating the fuzzy entropy of the feature subset {A, B}.
We define a threshold value T_r, where T_r ∈ [0, 1], to find the feature subset of a sample data set. If the "maximum class degree" of a set of samples in a fuzzy set is larger than or equal to the given threshold value T_r, then the fuzzy subset will be omitted to reduce the number of the values of the feature, where the "maximum class degree" of a set of samples in a fuzzy set is the maximum value of the class degrees of the set of samples in the fuzzy set, and the class degree is defined in Definition 4.3. Then, we can construct the combined extension matrix of the membership grades of two features. Before we construct the combined extension matrix, we have to construct the extension matrices of the features. The extension matrix of membership grades of a feature is defined as follows.
Definition 4.1: The extension matrix of membership grades EM_f of the values of a feature f belonging to its fuzzy sets is defined as follows:

    EM_f = [ μ_{v_1}(r_{1f})  ⋯  μ_{v_m}(r_{1f}) ]
           [        ⋮                  ⋮         ]
           [ μ_{v_1}(r_{nf})  ⋯  μ_{v_m}(r_{nf}) ]_{n×m},    (9)

where m denotes the number of fuzzy sets of the feature f, n denotes the number of samples, μ_{v_i}(r_{jf}) denotes the membership grade of the value r_{jf} of the feature f of the jth sample belonging to the fuzzy set v_i, 1 ≤ i ≤ m, and 1 ≤ j ≤ n.
Let i denote the number of fuzzy sets of the feature f_1 whose "maximum class degree" is smaller than the given threshold value T_r, and let j denote the number of fuzzy sets of the feature f_2 whose "maximum class degree" is smaller than the given threshold value T_r. We define the combined extension matrix of membership grades of two features f_1 and f_2 as follows.
Definition 4.2: The combined extension matrix CEM(f_1, f_2, T_r) of membership grades of the two features f_1 and f_2 with the maximum class degree threshold value T_r is defined as follows:

    CEM(f_1, f_2, T_r) =
        [ μ_{v_11}(r_{1f_1}) ∧ μ_{v_21}(r_{1f_2})  ⋯  μ_{v_1i}(r_{1f_1}) ∧ μ_{v_2j}(r_{1f_2}) ]
        [                  ⋮                                          ⋮                        ]
        [ μ_{v_11}(r_{nf_1}) ∧ μ_{v_21}(r_{nf_2})  ⋯  μ_{v_1i}(r_{nf_1}) ∧ μ_{v_2j}(r_{nf_2}) ]_{n×(i×j)},    (10)

where n denotes the number of samples, v_11, …, v_1i and v_21, …, v_2j denote the retained fuzzy sets of f_1 and f_2, respectively, each column corresponds to one pair of fuzzy sets, and the notation "∧" denotes the minimum operator.
The class degree CD_c(v) of a set of samples can be calculated from the extension matrix of membership grades, shown as follows.

Definition 4.3: The extension matrix of membership grades of a feature f is used to calculate the class degree CD_c(v) of the set of samples labeled as class c in the fuzzy subset v, shown as follows:

    CD_c(v) = ( Σ_{r∈R_c} EM_f(r, v) ) / ( Σ_{r∈R} EM_f(r, v) ),    (11)

where R denotes the set of rows (samples) of EM_f and R_c denotes the rows labeled as class c.
The fuzzy entropy of a feature f can then be calculated by Definitions 2.2, 2.3, 3.1 and 4.3, shown as follows:

    FFE(f) = Σ_{v∈V} (s_v / s) [ -Σ_{c∈C} CD_c(v) log_2 CD_c(v) ].
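Definition 4.3 amounts to a ratio of column sums over the rows of EM_f. A small sketch (the matrix entries and labels below are hypothetical, not from the paper):

```python
def class_degree_from_em(em, labels, col, c):
    """CD_c(v): the column sum over class-c rows divided by the full column sum (Definition 4.3)."""
    total = sum(row[col] for row in em)
    return sum(row[col] for row, y in zip(em, labels) if y == c) / total

# Hypothetical extension matrix EM_f (4 samples x 2 fuzzy sets), per Definition 4.1.
em = [
    [0.9, 0.1],  # sample 1
    [0.7, 0.3],  # sample 2
    [0.2, 0.8],  # sample 3
    [0.0, 1.0],  # sample 4
]
labels = ["pos", "pos", "neg", "neg"]

print(class_degree_from_em(em, labels, 0, "pos"))  # 1.6 / 1.8, about 0.889
```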
Definition 4.4: The fuzzy entropy measure BSFFE(f_1, f_2) of a feature subset {f_1, f_2} focusing on boundary samples is defined as follows:

    BSFFE(f_1, f_2) = (s_1B / s_1) × FFE(FS) + FFE_UB(f_1), if s_1B / s_1 ≤ s_2B / s_2;
    BSFFE(f_1, f_2) = (s_2B / s_2) × FFE(FS) + FFE_UB(f_2), otherwise,    (12)

where FS denotes the feature subset {f_1, f_2}, s_1 denotes the total membership grades of the elements in all fuzzy sets of the feature f_1, s_1B denotes the total membership grades of the elements in the fuzzy sets of the feature f_1 whose maximum class degree is larger than the threshold value T_r of the maximum class degree, and FFE_UB(f_1) denotes the fuzzy entropy of the fuzzy subsets of the feature f_1 whose maximum class degree is smaller than or equal to the threshold value T_r, where T_r ∈ [0, 1]. In the same way, s_2, s_2B and FFE_UB(f_2) are defined for the feature f_2.
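The split of a feature's total grades into s and s_B in Definition 4.4 can be sketched as follows; this follows the where-clause above literally, and the function and variable names (and data) are our own:

```python
def split_totals(grade_columns, labels, t_r):
    """Split a feature's fuzzy sets by maximum class degree against T_r (Definition 4.4):
    s sums every set's grades; s_B sums the sets whose maximum class degree exceeds T_r;
    the remaining sets are the ones whose fuzzy entropy enters FFE_UB."""
    s = s_b = 0.0
    low_cd_columns = []  # fuzzy sets with maximum class degree <= T_r
    for col in grade_columns:
        total = sum(col)
        if total == 0:
            continue
        max_cd = max(
            sum(g for g, y in zip(col, labels) if y == c) / total for c in set(labels)
        )
        s += total
        if max_cd > t_r:
            s_b += total
        else:
            low_cd_columns.append(col)
    return s, s_b, low_cd_columns

# Hypothetical grades of four samples in two fuzzy sets of one feature.
grade_cols = [[0.9, 0.8, 0.0, 0.0], [0.1, 0.2, 0.7, 0.6]]
labels = ["pos", "pos", "neg", "neg"]
s, s_b, low_cd = split_totals(grade_cols, labels, t_r=0.9)
print(s, s_b, len(low_cd))  # the first set is purely "pos", so only it counts toward s_B
```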
Let F be a set of candidate features and let FS be the selected feature subset. The proposed algorithm for feature subset selection is presented as follows:
Step 1: /* Generate the extension matrices of membership grades of all features and calculate the fuzzy entropy of each feature. */
    For each f ∈ F do {
        let EM_f = [ μ_{v_1}(r_{1f})  ⋯  μ_{v_m}(r_{1f}) ]
                   [        ⋮                  ⋮         ]
                   [ μ_{v_1}(r_{nf})  ⋯  μ_{v_m}(r_{nf}) ]_{n×m};
        let E(f) = FFE(f)
    }.
Step 2: /* Put the feature with the minimum fuzzy entropy into the initial feature subset and remove it from the original feature set, where "arg min_{x∈X} f(x)" returns one of such x that minimizes the function f(x). */
    let f̂ = arg min_{f∈F} E(f);
    let E_FS = E(f̂);
    let FS = FS ∪ {f̂};
    let F = F - {f̂}.
Step 3: /* Repeatedly put the feature which can reduce the fuzzy entropy of the feature subset into FS until no such feature exists. */
    repeat {
        For each f ∈ F {
            let EM_temp = CEM(FS, f, T_r);
            let E(f) = BSFFE(FS, f)
        };
        let f̂ = arg min_{f∈F} E(f);