A New Method for Feature Subset Selection for Handling Classification Problems
Shyi-Ming Chen and Jen-Da Shie
Department of Computer Science and Information Engineering National Taiwan University of Science and Technology
Taipei, Taiwan, R. O. C.
Abstract- In this paper, we present a new method for dealing with feature subset selection for handling classification problems.
We discriminate numeric features to construct the membership function of each fuzzy subset of each feature. Then, we select the feature subset based on the proposed fuzzy entropy measure with boundary samples. The proposed feature subset selection method can select relevant features from sample data to get higher average classification accuracy rates than the ones selected by the existing methods.
I. INTRODUCTION
In [19], Tsang et al. pointed out that feature subset selection aims to reduce the number of features used in classification or recognition. A data set might have both irrelevant and relevant features. If we can properly select relevant features to deal with classification problems, we can increase the classification accuracy rates. In recent years, many feature subset selection methods have been proposed [1], [2], [4], [5], [7], [19]. Most feature subset selection methods focus on designing high performance algorithms based on features' quality measures or on feature subset searching. There are many methods for dealing with features' quality measures, such as similarity measures [19], gain-entropies [4], the relevance of features [1], decision tables [5], the overall feature evaluation index (OFEI) [7], the feature quality index (FQI) [7], the mutual information-based feature selector (MIFS) [2], etc. Several searching algorithms for reducing feature spaces have been proposed, such as heuristic algorithms [19], genetic algorithms [5], the greedy method [4], etc. The approaches for feature subset selection can be divided into the filter model and the wrapper model [9]. In the filter model, features are filtered independently of classification algorithms. The wrapper model includes a target classifier for performance evaluation. In this paper, we adopt the filter model to select the feature subset, due to the fact that we usually do not know which classifier is suitable in an unknown domain, and new classifiers with higher performance will be developed in a well-known domain.
In this paper, we focus on boundary samples instead of the full set of samples to propose a new method to calculate the fuzzy entropy of a feature subset, where boundary samples are usually the critical points for improving the classification accuracy rates of classification algorithms, such as Support Vector Machines (SVMs) [3] and Boosting [8]. The proposed feature subset selection method is based on the proposed fuzzy entropy measure to search the feature subset. In this paper, we use three different kinds of classifiers (i.e., LMT [12], Naive Bayes [10], and SMO [16]) to compare the average classification accuracy rates of the feature subset selected by the proposed method with the ones selected by the existing methods, i.e., OFFSS [19], OFEI [7], FQI [7] and MIFS [2]. The proposed feature subset selection method can select relevant features to get higher average classification accuracy rates than the ones selected by the existing methods.
II. FUZZY ENTROPY MEASURES
Entropy measures are commonly used in information theory, where Shannon's entropy [17] is the most widely used; it characterizes the impurity of a collection of samples.
Let X be a discrete random variable with a finite set containing n elements, where X = {x_1, x_2, …, x_n}. If an element x_i occurs with a probability p(x_i), then the amount of information I(x_i) associated with the known occurrence of x_i is defined as follows:

    I(x_i) = -log_2 p(x_i).    (1)

The entropy H(X) of X is defined as follows:

    H(X) = -Σ_{i=1}^{n} p(x_i) log_2 p(x_i).    (2)
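As a quick numeric illustration of formula (2) (the distributions below are our own, not from the paper):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), taking 0 * log2(0) as 0."""
    total = sum(p * math.log2(p) for p in probs if p > 0)
    return -total if total != 0 else 0.0

print(shannon_entropy([0.5, 0.5]))  # 1.0: a fair binary outcome is maximally impure
print(shannon_entropy([1.0]))       # 0.0: a certain outcome is pure
print(shannon_entropy([0.25] * 4))  # 2.0: four equally likely outcomes
```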
In [20], Zadeh defined a fuzzy entropy on a fuzzy subset Ã of a finite set X = {x_1, x_2, …, x_n} with respect to the probability distribution P = {p_1, p_2, …, p_n}, shown as follows:

    H(Ã) = -Σ_{i=1}^{n} μ_Ã(x_i) p_i log_2 p_i,    (3)

where μ_Ã denotes the membership function of Ã, μ_Ã(x_i) denotes the grade of membership of x_i in the fuzzy set Ã, p_i denotes the probability of x_i, and 1 ≤ i ≤ n.
In [14], De Luca et al. defined a fuzzy entropy measure based on Shannon's entropy [17]. In [11], Kosko defined a fuzzy entropy measure based on the geometry of the hypercube. In [13], Lee et al. presented a fuzzy entropy measure of an interval, based on Shannon's entropy and De Luca's axioms.
In this paper, we present a new fuzzy entropy measure of a fuzzy set, shown as follows.
Definition 2.1: Assume that a set X of samples is divided into a set C of classes. The class degree CD_c(Ã) of the sample data labeled as class c, where c ∈ C, in the fuzzy set Ã is defined by:

    CD_c(Ã) = ( Σ_{x∈X_c} μ_Ã(x) ) / ( Σ_{x∈X} μ_Ã(x) ),    (4)

where X_c denotes the samples belonging to class c, c ∈ C, and μ_Ã(x) denotes the membership grade of x in the fuzzy set Ã.
Definition 2.2: The fuzzy entropy FE_c(Ã) of the samples labeled as class c, where c ∈ C, in the fuzzy set Ã is defined as follows:

    FE_c(Ã) = -CD_c(Ã) log_2 CD_c(Ã).    (5)
Definition 2.3: The fuzzy entropy FE(Ã) of a fuzzy set Ã is defined by:

    FE(Ã) = Σ_{c∈C} FE_c(Ã).    (6)
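As an illustrative sketch of Definitions 2.1-2.3 (our own code; the membership grades and labels below are hypothetical, not from the paper):

```python
import math

def class_degree(memberships, labels, c):
    """CD_c(A): grades of the class-c samples over all grades in fuzzy set A (Definition 2.1)."""
    return sum(m for m, y in zip(memberships, labels) if y == c) / sum(memberships)

def fuzzy_entropy(memberships, labels):
    """FE(A): sum over classes of -CD_c(A) * log2 CD_c(A) (Definitions 2.2 and 2.3)."""
    fe = 0.0
    for c in set(labels):
        cd = class_degree(memberships, labels, c)
        if cd > 0:
            fe -= cd * math.log2(cd)
    return fe

# Four samples with membership grades in a fuzzy set A and their class labels.
mu = [0.9, 0.8, 0.7, 0.6]
y = ["pos", "pos", "neg", "neg"]
print(fuzzy_entropy(mu, y))  # near 1: the grades are split almost evenly between classes
```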
III. FUZZY ENTROPY MEASURE OF A FEATURE
In the following, we define the fuzzy entropy measure of a feature and present an algorithm to construct the membership function of each fuzzy set of each feature. Fuzzy entropy of a feature is defined as follows.
Definition 3.1: The fuzzy entropy FFE(f) of a feature f is defined by:

    FFE(f) = Σ_{v∈V} (s_v / s) FE(v),    (7)

where V denotes the set of fuzzy subsets of the feature f, FE(v) denotes the fuzzy entropy of the fuzzy set v, s denotes the total membership grade of the elements in all fuzzy subsets of the feature f, s_v denotes the total membership grade of the elements in the fuzzy subset v, and s_v / s denotes the "weight" of FE(v).
There are two categories of features: nominal and numeric. Both of them have corresponding membership functions of fuzzy sets. Each individual value of a nominal feature can be treated as a fuzzy subset, where the membership function is defined as follows:

    μ_u(x) = 1, if x = u;
    μ_u(x) = 0, otherwise,    (8)

where u ∈ U, U denotes the set of values of a nominal feature, and μ_u denotes the membership function of the fuzzy subset u.
For example, the set of values of the feature "Sex" is {male, female}. When the value of the feature "Sex" is "male", the membership grades are μ_male(male) = 1 and μ_female(male) = 0.
Fig. 1. A numeric feature A with fuzzy subsets v_1, v_2, and v_3, where the cluster centers of v_1, v_2, and v_3 are m_1, m_2, and m_3, respectively.
A numeric feature can be discriminated into finitely many fuzzy subsets. The number of fuzzy sets will affect the result of classification. Therefore, the discrimination of a numeric feature is an important process. Using unsupervised learning techniques to discriminate a numeric feature is a good method, and the K-means clustering algorithm [15] is widely used, where it uses the Euclidean distance measure to generate cluster centers. In this paper, we apply the K-means clustering algorithm to generate k cluster centers and then construct the corresponding membership functions, where the cluster centers are used as the centers of the fuzzy sets. For example, assume that m_1, m_2 and m_3 are the cluster centers of three clusters, respectively. Then, we can construct their corresponding membership functions as shown in Fig. 1, where μ_v1(0) = 0.5, μ_v1(m_1) = 1, μ_v1(m_2) = 0, μ_v2(m_1) = 0, μ_v2(m_2) = 1, μ_v2(m_3) = 0, μ_v3(m_2) = 0, μ_v3(m_3) = 1, and μ_v3(U_max) = 0.5.
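The membership grades listed above can be reproduced with a small sketch of the triangular functions in Fig. 1; the centers and the feature range below are hypothetical, and the boundary fuzzy sets follow the convention visible in Fig. 1 (grade 0.5 at the feature's extreme values):

```python
def triangular_mf(x, m_left, m_center, m_right):
    """Grade rises linearly from 0 at m_left to 1 at m_center, then falls to 0 at m_right."""
    if x <= m_center:
        return max(1 - (m_center - x) / (m_center - m_left), 0.0)
    return max(1 - (x - m_center) / (m_right - m_center), 0.0)

# Hypothetical cluster centers on a feature whose values range over [0, 10].
u_min, u_max = 0.0, 10.0
m1, m2, m3 = 2.0, 5.0, 8.0

# The boundary fuzzy sets mirror their center across the feature's extreme
# value, which makes the membership grade exactly 0.5 at u_min and u_max.
mu_v1 = lambda x: triangular_mf(x, u_min - (m1 - u_min), m1, m2)
mu_v2 = lambda x: triangular_mf(x, m1, m2, m3)
mu_v3 = lambda x: triangular_mf(x, m2, m3, u_max + (u_max - m3))

print(mu_v1(u_min), mu_v1(m1), mu_v1(m2))  # 0.5 1.0 0.0
print(mu_v2(m1), mu_v2(m2), mu_v2(m3))     # 0.0 1.0 0.0
print(mu_v3(m2), mu_v3(m3), mu_v3(u_max))  # 0.0 1.0 0.5
```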
The fuzzy entropy of a feature decreases when the number of clusters increases. However, too many clusters could cause the overfitting problem, which means that the models over-fit the training data set and their classification accuracy rates decrease when they classify new instances. In this paper, we define a threshold value T_c to avoid the overfitting problem, where T_c ∈ [0, 1]. When the decreasing rate of the fuzzy entropy of a feature is less than the given threshold value T_c, we stop increasing the number of clusters, where the decreasing rate of the fuzzy entropy is calculated by subtracting the fuzzy entropy of the feature divided into k clusters from the fuzzy entropy of the feature divided into k - 1 clusters. In the following, we present an algorithm to construct the membership functions of a feature:
Step 1: Initially, set the number k of clusters to 2.
Step 2: Use the K-means clustering algorithm to generate k cluster centers.
    /* Assign initial values to the k cluster centers. */
    for i = 1 to k do
        let m_i = x_{i × |X| / k};
    repeat {
        /* Assign each sample to the cluster whose center has the minimum Euclidean distance, where "arg min_{x∈X} f(x)" returns one of such x that minimizes the function f(x), and "‖x‖" denotes the Euclidean norm. */
        for all x ∈ X {
            let i = arg min_{1 ≤ j ≤ k} ‖x - m_j‖²;
            let Cluster_i = Cluster_i ∪ {x}
        };
        /* Calculate a new cluster center for each cluster, where n_i denotes the number of items in the ith cluster. */
        for i = 1 to k do
            let m_i = ( Σ_{x∈Cluster_i} x ) / n_i;
    } until each cluster set is not changed.
Step 3: Construct the membership functions corresponding to the k cluster centers.
    /* Assign neighbor cluster centers for the ith cluster center m_i, where m_L denotes the left "cluster center" of m_i, m_R denotes the right "cluster center" of m_i, "U_min" denotes the minimum value of the feature, and "U_max" denotes the maximum value of the feature. */
    let m_L = U_min - (m_i - U_min), if i = 1;
        m_L = m_{i-1}, otherwise;
    let m_R = U_max + (U_max - m_i), if i = k;
        m_R = m_{i+1}, otherwise;
    /* Construct the membership function of the fuzzy set v_i based on the ith cluster center m_i, where "Max" denotes the maximum operator. */
    let μ_{v_i}(x) = Max{1 - (m_i - x) / (m_i - m_L), 0}, if x ≤ m_i;
        μ_{v_i}(x) = Max{1 - (x - m_i) / (m_R - m_i), 0}, if x > m_i.
Step 4: /* Calculate the fuzzy entropy of the feature f by formulas (4)-(7). */
    for i = 1 to k do
        let FE(v_i) = Σ_{c∈C} FE_c(v_i);
    let FFE(f) = Σ_{v∈V} (s_v / s) FE(v).
Step 5: If the decreasing rate of the fuzzy entropy is larger than the threshold value T_c, then let k = k + 1 and go to Step 2. Otherwise, let k = k - 1 and stop.
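The five steps above can be sketched end to end in Python. This is our own simplified, one-dimensional reading of the algorithm (a single numeric feature, plain 1-D K-means, hypothetical data), not the authors' implementation:

```python
import math

def kmeans_1d(xs, k, iters=100):
    """Plain one-dimensional K-means (Step 2): alternate assignment and center update."""
    xs = sorted(xs)
    centers = [xs[min(i * len(xs) // k, len(xs) - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        new = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:  # stop when no cluster set changes
            break
        centers = new
    return sorted(centers)

def memberships(x, centers, u_min, u_max):
    """Triangular grades of x in each fuzzy set v_i (Step 3), with mirrored boundary centers."""
    grades = []
    for i, m in enumerate(centers):
        left = centers[i - 1] if i > 0 else u_min - (m - u_min)
        right = centers[i + 1] if i < len(centers) - 1 else u_max + (u_max - m)
        if x <= m:
            grades.append(max(1 - (m - x) / (m - left), 0.0) if m > left else 1.0)
        else:
            grades.append(max(1 - (x - m) / (right - m), 0.0) if right > m else 1.0)
    return grades

def feature_fuzzy_entropy(xs, labels, centers):
    """FFE(f): sum over fuzzy sets of (s_v / s) * FE(v), formulas (4)-(7) (Step 4)."""
    u_min, u_max = min(xs), max(xs)
    rows = [memberships(x, centers, u_min, u_max) for x in xs]
    s_v = [sum(row[i] for row in rows) for i in range(len(centers))]
    s = sum(s_v)
    ffe = 0.0
    for i in range(len(centers)):
        if s_v[i] == 0:
            continue
        fe = 0.0
        for c in set(labels):
            cd = sum(r[i] for r, y in zip(rows, labels) if y == c) / s_v[i]
            if cd > 0:
                fe -= cd * math.log2(cd)
        ffe += (s_v[i] / s) * fe
    return ffe

def choose_k(xs, labels, t_c=0.1, k_max=8):
    """Steps 1 and 5: start at k = 2 and grow k while the entropy decrease stays >= T_c."""
    k, prev = 2, feature_fuzzy_entropy(xs, labels, kmeans_1d(xs, 2))
    while k < k_max:
        cur = feature_fuzzy_entropy(xs, labels, kmeans_1d(xs, k + 1))
        if prev - cur < t_c:
            return k
        k, prev = k + 1, cur
    return k

# Hypothetical feature: two well-separated value ranges with opposite labels.
xs = [1.0, 1.2, 1.4, 5.0, 5.2, 5.4]
ys = ["neg", "neg", "neg", "pos", "pos", "pos"]
print(choose_k(xs, ys))  # 2: a third cluster no longer lowers the entropy by T_c
```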
IV. THE PROPOSED ALGORITHM FOR FEATURE SUBSET SELECTION
In this section, we present an algorithm for feature subset selection. The proposed algorithm uses boundary samples instead of the full set of samples to select the feature subset. First, we introduce the concept of boundary samples. Then, we define the fuzzy entropy of a feature subset. Finally, we propose an algorithm for feature subset selection.
In a dimension reduction problem [6], each feature might have incorrectly classified samples. Thus, an optimal feature subset is a set of correlated features [9]. It means that the samples incorrectly classified by one feature could be correctly classified by another feature. "Boundary samples" are the incorrectly classified samples of features, and we should focus on them for feature subset selection. For example, Table I shows an example data set. The incorrectly classified samples of feature A are {1, 2, 5}, because the labels of these samples with the same feature value are ambiguous; that is, the value of feature A with incorrectly classified samples is "black". In the same way, the incorrectly classified samples of feature B are {2, 5, 6}. Thus, we only need to use Samples 2 and 5 to calculate the entropy of the feature subset {A, B}. Because Sample 1 can be correctly classified by feature B, it can also be correctly classified by the feature subset {A, B}. Thus, Sample 1 could be omitted, and Samples 3, 4 and 6 could be omitted, too. Therefore, we can reduce the number of samples from 6 to 2.
TABLE I
AN EXAMPLE OF DATA SET

No. | Feature A | Feature B | Feature C | Label
----|-----------|-----------|-----------|---------
 1  | Black     | ocean     | summer    | positive
 2  | Black     | lake      | winter    | positive
 3  | White     | ocean     | fall      | positive
 4  | Red       | river     | winter    | negative
 5  | Black     | lake      | fall      | negative
 6  | Red       | lake      | fall      | negative
A feature subset can be regarded as a collection of multiple features. For example, in Table I, the values of the feature subset {A, B} are {(black, ocean), (black, lake), (black, river), (white, ocean), (white, lake), (white, river), (red, ocean), (red, lake), (red, river)}. In Table I, Sample 2 and Sample 5 are called boundary samples due to the fact that although the values of feature A and feature B of Sample 2 and Sample 5 are both "Black" and "lake", respectively, they get different labels, "positive" and "negative", respectively. Thus, we can calculate the fuzzy entropy of the feature subset {A, B} using only the boundary samples, i.e., Sample 2 and Sample 5.
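The boundary-sample computation on Table I can be sketched as follows (the helper function is our own; sample numbers follow the table):

```python
from collections import defaultdict

def ambiguous_samples(values, labels):
    """Return the 1-based sample numbers whose feature value occurs with more than one label."""
    classes_by_value = defaultdict(set)
    for v, y in zip(values, labels):
        classes_by_value[v].add(y)
    return {i + 1 for i, v in enumerate(values) if len(classes_by_value[v]) > 1}

# Table I: features A and B and the labels of Samples 1-6.
feature_a = ["Black", "Black", "White", "Red", "Black", "Red"]
feature_b = ["ocean", "lake", "ocean", "river", "lake", "lake"]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

amb_a = ambiguous_samples(feature_a, labels)  # {1, 2, 5}
amb_b = ambiguous_samples(feature_b, labels)  # {2, 5, 6}
print(sorted(amb_a & amb_b))                  # [2, 5]: the boundary samples of {A, B}
```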
However, we cannot use the boundary samples to evaluate the fuzzy entropy directly. We can use an indirect method to simplify the process of feature subset selection. Assume that the sample distribution is shown in Fig. 2, where the symbols "O" and "X" denote the positive samples and the negative samples, respectively. The corresponding membership functions of feature A are shown in Fig. 3, where the fuzzy entropy of the fuzzy set v_1 is 0. If we omit the samples whose values of feature A are less than u_A3, then it will affect the fuzzy entropy of the fuzzy set v_2.
Fig. 2. The sample distribution with two features and two classes.
Fig. 3. The corresponding membership functions of the feature A.
Therefore, we must use an indirect method to get the benefit of the fuzzy entropy measure focusing on boundary samples. When we evaluate the fuzzy entropy of a feature subset, we omit unambiguous fuzzy subsets instead of unambiguous samples. In the previous example, we could omit the fuzzy set v_1 when evaluating the fuzzy entropy of the feature subset {A, B}.
We define a threshold value T_r, where T_r ∈ [0, 1], to find the feature subset of a sample data set. If the "maximum class degree" of a set of samples in a fuzzy set is larger than or equal to the given threshold value T_r, then the fuzzy subset will be omitted to reduce the number of the values of the feature, where the "maximum class degree" of a set of samples in a fuzzy set is the maximum value of the class degrees of the set of samples in the fuzzy set, and the class degree is defined in Definition 4.3. Then, we can construct the combined extension matrix of the membership grades of two features. Before we construct the combined extension matrix, we have to construct the extension matrices of the features. The extension matrix of membership grades of a feature is defined as follows.
Definition 4.1: The extension matrix of membership grades EM_f of the values of a feature f belonging to its fuzzy sets is defined as follows:

    EM_f = [ μ_{v_1}(r_{1f})  ⋯  μ_{v_m}(r_{1f}) ]
           [        ⋮                  ⋮         ]
           [ μ_{v_1}(r_{nf})  ⋯  μ_{v_m}(r_{nf}) ]_{n×m},    (9)

where m denotes the number of fuzzy sets of the feature f, n denotes the number of samples, μ_{v_i}(r_{jf}) denotes the membership grade of the value r_{jf} of the feature f of the jth sample belonging to the fuzzy set v_i, 1 ≤ i ≤ m, and 1 ≤ j ≤ n.
Let i denote the number of fuzzy sets of the feature f_1 whose "maximum class degree" is smaller than the given threshold value T_r, and let j denote the number of fuzzy sets of the feature f_2 whose "maximum class degree" is smaller than the given threshold value T_r. We define the combined extension matrix of membership grades of two features f_1 and f_2 as follows.
Definition 4.2: The combined extension matrix CEM(f_1, f_2, T_r) of membership grades of the two features f_1 and f_2 with the maximum class degree threshold value T_r is defined as follows:

    CEM(f_1, f_2, T_r) =
        [ μ_{v_11}(r_{1f_1}) ∧ μ_{v_21}(r_{1f_2})  ⋯  μ_{v_1i}(r_{1f_1}) ∧ μ_{v_2j}(r_{1f_2}) ]
        [                  ⋮                                          ⋮                        ]
        [ μ_{v_11}(r_{nf_1}) ∧ μ_{v_21}(r_{nf_2})  ⋯  μ_{v_1i}(r_{nf_1}) ∧ μ_{v_2j}(r_{nf_2}) ]_{n×(i×j)},    (10)

where n denotes the number of samples, v_11, …, v_1i and v_21, …, v_2j denote the retained fuzzy sets of f_1 and f_2, respectively, each column corresponds to one pair of fuzzy sets, and the notation "∧" denotes the minimum operator.
The class degree CD_c(v) of a set of samples can be calculated from the extension matrix of membership grades, shown as follows.

Definition 4.3: The extension matrix of membership grades of a feature f is used to calculate the class degree CD_c(v) of the set of samples labeled as class c in the fuzzy subset v, shown as follows:

    CD_c(v) = ( Σ_{r∈R_c} EM_f(r, v) ) / ( Σ_{r∈R} EM_f(r, v) ),    (11)

where R denotes the set of rows (samples) of EM_f and R_c denotes the rows labeled as class c.
The fuzzy entropy of a feature f can then be calculated by Definitions 2.2, 2.3, 3.1 and 4.3, shown as follows:

    FFE(f) = Σ_{v∈V} (s_v / s) [ -Σ_{c∈C} CD_c(v) log_2 CD_c(v) ].
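Definition 4.3 amounts to a ratio of column sums over the rows of EM_f. A small sketch (the matrix entries and labels below are hypothetical, not from the paper):

```python
def class_degree_from_em(em, labels, col, c):
    """CD_c(v): the column sum over class-c rows divided by the full column sum (Definition 4.3)."""
    total = sum(row[col] for row in em)
    return sum(row[col] for row, y in zip(em, labels) if y == c) / total

# Hypothetical extension matrix EM_f (4 samples x 2 fuzzy sets), per Definition 4.1.
em = [
    [0.9, 0.1],  # sample 1
    [0.7, 0.3],  # sample 2
    [0.2, 0.8],  # sample 3
    [0.0, 1.0],  # sample 4
]
labels = ["pos", "pos", "neg", "neg"]

print(class_degree_from_em(em, labels, 0, "pos"))  # 1.6 / 1.8, about 0.889
```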
Definition 4.4: The fuzzy entropy measure BSFFE(f_1, f_2) of a feature subset {f_1, f_2} focusing on boundary samples is defined as follows:

    BSFFE(f_1, f_2) = (s_1B / s_1) × FFE(FS) + FFE_UB(f_1), if s_1B / s_1 ≤ s_2B / s_2;
    BSFFE(f_1, f_2) = (s_2B / s_2) × FFE(FS) + FFE_UB(f_2), otherwise,    (12)

where FS denotes the feature subset {f_1, f_2}, s_1 denotes the total membership grades of the elements in all fuzzy sets of the feature f_1, s_1B denotes the total membership grades of the elements in the fuzzy sets of the feature f_1 whose maximum class degree is larger than the threshold value T_r of the maximum class degree, and FFE_UB(f_1) denotes the fuzzy entropy of the fuzzy subsets of the feature f_1 whose maximum class degree is smaller than or equal to the threshold value T_r, where T_r ∈ [0, 1]. In the same way, s_2, s_2B and FFE_UB(f_2) are defined for the feature f_2.
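The split of a feature's total grades into s and s_B in Definition 4.4 can be sketched as follows; this follows the where-clause above literally, and the function and variable names (and data) are our own:

```python
def split_totals(grade_columns, labels, t_r):
    """Split a feature's fuzzy sets by maximum class degree against T_r (Definition 4.4):
    s sums every set's grades; s_B sums the sets whose maximum class degree exceeds T_r;
    the remaining sets are the ones whose fuzzy entropy enters FFE_UB."""
    s = s_b = 0.0
    low_cd_columns = []  # fuzzy sets with maximum class degree <= T_r
    for col in grade_columns:
        total = sum(col)
        if total == 0:
            continue
        max_cd = max(
            sum(g for g, y in zip(col, labels) if y == c) / total for c in set(labels)
        )
        s += total
        if max_cd > t_r:
            s_b += total
        else:
            low_cd_columns.append(col)
    return s, s_b, low_cd_columns

# Hypothetical grades of four samples in two fuzzy sets of one feature.
grade_cols = [[0.9, 0.8, 0.0, 0.0], [0.1, 0.2, 0.7, 0.6]]
labels = ["pos", "pos", "neg", "neg"]
s, s_b, low_cd = split_totals(grade_cols, labels, t_r=0.9)
print(s, s_b, len(low_cd))  # the first set is purely "pos", so only it counts toward s_B
```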
Let F be a set of candidate features and let FS be the selected feature subset. The proposed algorithm for feature subset selection is presented as follows:
Step 1: /* Generate the extension matrices of membership grades of all features and calculate the fuzzy entropy of each feature. */
    For each f ∈ F do {
        let EM_f = [ μ_{v_1}(r_{1f})  ⋯  μ_{v_m}(r_{1f}) ]
                   [        ⋮                  ⋮         ]
                   [ μ_{v_1}(r_{nf})  ⋯  μ_{v_m}(r_{nf}) ]_{n×m};
        let E(f) = FFE(f)
    }.
Step 2: /* Put the feature with the minimum fuzzy entropy into the initial feature subset and remove it from the original feature set, where "arg min_{x∈X} f(x)" returns one of such x that minimizes the function f(x). */
    let f̂ = arg min_{f∈F} E(f);
    let E_FS = E(f̂);
    let FS = FS ∪ {f̂};
    let F = F - {f̂}.
Step 3: /* Repeatedly put the feature which can reduce the fuzzy entropy of the feature subset into FS until no such feature exists. */
    repeat {
        For each f ∈ F {
            let EM_temp = CEM(FS, f, T_r);
            let E(f) = BSFFE(FS, f)
        };
        let f̂ = arg min_{f∈F} E(f);