
Improved binary particle swarm optimization using catfish effect for feature selection

Li-Yeh Chuang a, Sheng-Wei Tsai b, Cheng-Hong Yang b,c,*

a Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung 80041, Taiwan
b Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80778, Taiwan
c Department of Network Systems, Toko University, Chiayi 61363, Taiwan

* Corresponding author. Address: Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80778, Taiwan. Tel.: +886 7 381 4526x5639; fax: +886 7 383 6844. E-mail addresses: chuang@isu.edu.tw (L.-Y. Chuang), 1096305108@cc.kuas.edu.tw (S.-W. Tsai), chyang@cc.kuas.edu.tw (C.-H. Yang).

Article info

Keywords: Feature selection; Catfish binary particle swarm optimization; K-nearest neighbor; Leave-one-out cross-validation

Abstract

The feature selection process constitutes a commonly encountered problem of global combinatorial optimization. This process reduces the number of features by removing irrelevant, noisy, and redundant data, thus resulting in acceptable classification accuracy. Feature selection is a preprocessing technique of great importance in the fields of data analysis, information retrieval processing, pattern classification, and data mining applications. This paper presents a novel optimization algorithm called catfish binary particle swarm optimization (CatfishBPSO), in which the so-called catfish effect is applied to improve the performance of binary particle swarm optimization (BPSO). This effect is the result of the introduction of new particles into the search space ("catfish particles"), which replace the particles with the worst fitness and are initialized at extreme points of the search space when the fitness of the global best particle has not improved for a number of consecutive iterations. In this study, the K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) was used to evaluate the quality of the solutions. CatfishBPSO was applied to 10 classification problems taken from the literature and compared with other feature selection methods. Experimental results show that CatfishBPSO simplifies the feature selection process effectively, and either obtains higher classification accuracy or uses fewer features than other feature selection methods.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Feature selection is a preprocessing technique for effective data analysis. The purpose of feature selection is the selection of optimal subsets, which are necessary and sufficient for solving the problem. Feature selection improves the predictive accuracy of algorithms by reducing the dimensionality, removing irrelevant features, and reducing the amount of data needed for the learning process. This can be done because not all available features are relevant for the classification process. Recently, feature selection has been successfully employed to solve classification problems in various areas, such as pattern recognition (Gunal & Edizkan, 2008), data mining (Martin-Bautista & Vila, 1999; Piramuthu, 1998), multimedia information retrieval (Lew, 2001; Liu & Dellaert, 1998), structure-activity correlation (Agrafiotis & Cedeño, 2002), and other areas where feature selection can be applied.

Filter methods and wrapper methods are the two general approaches to feature selection. Filter methods define the relevant features without prior classification of the data (Liu & Setiono, 1996). Wrapper methods, on the other hand, incorporate classification algorithms to search for and select relevant features (Kohavi & John, 1997). Wrapper methods generally outperform filter methods in terms of classification accuracy; the CatfishBPSO introduced in this paper belongs to the wrapper methods. A large number of selected features does not necessarily translate into high classification accuracy for many pattern classification problems. In some cases, the performance of algorithms devoted to the speed and predictive accuracy of data characterization can even decrease because features may be irrelevant or misleading, or due to spurious correlations. These factors can have a negative impact on the classification process during the learning stage. Ideally, a feature selection method reduces the cost of feature measurement and increases classifier efficiency and classification accuracy. Several methods have previously been used to perform feature selection on training and testing data, such as genetic algorithms (Raymer, Punch, Goodman, Kuhn, & Jain, 2000), branch and bound algorithms (Murphy & Aha, 1994; Yu & Yuan, 1993), sequential search algorithms (Pudil, Novovicova, & Kittler, 1994), mutual information (Battiti, 1994), neural networks (Brill, Brown, & Martin, 1992), tabu search (Zhang & Sun, 2002), hybrid genetic algorithms (Oh, Lee, & Moon, 2004) and binary particle swarm optimization (Bello, Gomez, Garcia, & Nowe, 2007; Tanaka, Kurita, & Kawabe, 2007; Wang, Yang, Teng, Xia, & Jensen, 2007). GAs have demonstrated the ability to reach


near-optimal solutions for large problems; however, they may require a long processing time to reach a near-optimal solution (Elbeltagi, Hegazy, & Grierson, 2005). Like GAs, BPSO is a population-based optimizer. BPSO has a memory, so knowledge of good solutions is retained by all the particles, and optimal solutions are found by the swarm following the best particle. Unlike GAs, BPSO does not contain crossover and mutation processes. It has a more profound intelligent background and can be implemented more easily (Shi, Liang, Lee, Lu, & Wang, 2005).

In general, classification problems fall into one of two groups: binary classification (two class labels) and multiclass classification (more than two class labels). Example methods applied to binary classification problems are discriminant analysis (Hastie, Tibshirani, & Friedman, 2001; Li, Zhu, & Ogihara, 2003), decision trees (Safavian & Landgrebe, 1991), the K-nearest neighbor method (Cover & Hart, 1967; Dasarathy, 1991), backpropagation neural networks (Mitchell, 1997), and support vector machines (Vapnik, 1998). Methods successfully applied to multiclass classification include support vector machines: (1) one-versus-rest and one-versus-one (Kreßel, 1999), (2) DAGSVM (Platt, Cristianini, & Shawe-Taylor, 2000), (3) the method by Weston and Watkins (Hsu & Lin, 2002; Weston & Watkins, 1999), and (4) the method by Crammer and Singer (Crammer & Singer, 2000; Hsu & Lin, 2002). Generally speaking, solving multiclass classification problems is not as trivial as solving binary ones.

The present study proposes a novel optimization algorithm for feature selection called catfish binary particle swarm optimization (CatfishBPSO). In CatfishBPSO, the so-called "catfish" effect introduces a competition function into a group of individuals. Catfish particles are introduced into the search space if the fitness of gbest is not improved (i.e. unchanged) for a number of consecutive iterations. These catfish particles are introduced at extreme positions of the search space and initialize a new search from these extreme positions. The catfish particles open up new opportunities for finding better solutions, and guide the entire swarm to promising new regions of the search space. The K-nearest neighbor method (K-NN) with leave-one-out cross-validation (LOOCV) based on Euclidean distance calculations was used to evaluate the quality of solutions for 10 classification problems taken from the literature. Experimental results show that CatfishBPSO simplifies the feature selection process effectively, and either obtains higher classification accuracy or uses fewer features than other feature selection methods.

2. Methods

2.1. Binary particle swarm optimization (BPSO)

In PSO, each particle is analogous to an individual "fish" in a school of fish. A swarm consists of N particles moving around a D-dimensional search space. The process of PSO is initialized with a population of random particles, and the algorithm then searches for optimal solutions by continuously updating generations. Each particle makes use of its own memory and knowledge gained by the swarm as a whole to find the best solution. The position of the ith particle can be represented by x_i = (x_{i1}, x_{i2}, ..., x_{iD}), and its velocity by v_i = (v_{i1}, v_{i2}, ..., v_{iD}). The positions and velocities of the particles are confined within [X_{min}, X_{max}]^D and [V_{min}, V_{max}]^D, respectively. The best previously visited position of the ith particle is denoted its individual best position p_i = (p_{i1}, p_{i2}, ..., p_{iD}), a value called pbest_i. The best of all individual pbest_i values is denoted the global best position g = (g_1, g_2, ..., g_D) and called gbest. At each generation, the position and velocity of the ith particle are updated by pbest_i and gbest in the swarm. However, many optimization problems occur in a space featuring discrete, qualitative distinctions between variables and between levels of variables. For this reason, Kennedy and Eberhart introduced binary PSO (BPSO), which can be applied to discrete binary variables. In a binary space, a particle may move to near corners of a hypercube by flipping various numbers of bits; thus, the overall particle velocity may be described by the number of bits changed per iteration (Fix & Hodges, 1951). In BPSO, each particle is updated based on the following equations:

v_{id}^{new} = w \cdot v_{id}^{old} + c_1 r_1 (pbest_{id} - x_{id}^{old}) + c_2 r_2 (gbest_d - x_{id}^{old})    (1)

if v_{id}^{new} \notin (V_{min}, V_{max}) then v_{id}^{new} = \max(\min(V_{max}, v_{id}^{new}), V_{min})    (2)

S(v_{id}^{new}) = \frac{1}{1 + e^{-v_{id}^{new}}}    (3)

if r_3 < S(v_{id}^{new}) then x_{id}^{new} = 1, else x_{id}^{new} = 0    (4)

In these equations, w is the inertia weight that controls the impact of the previous velocity of a particle on its current one, r_1, r_2, and r_3 are random numbers between (0, 1), and c_1 and c_2 are acceleration constants, which control how far a particle will move in a single generation. The velocities v_{id}^{new} and v_{id}^{old} denote the velocities of the new and old particle, respectively; x_{id}^{old} is the current particle position, and x_{id}^{new} is the new, updated particle position. In Eq. (2), the particle velocity in each dimension is limited to a maximum velocity V_{max}. If the sum of accelerations causes the velocity of a dimension to exceed V_{max}, then the velocity of that dimension is limited to V_{max}. V_{max} and V_{min} are user-specified parameters (in our case V_{max} = 6, V_{min} = -6). The position of a particle after updating is calculated by the function S(v_{id}^{new}) (Eq. (3)). If S(v_{id}^{new}) is larger than r_3, then the position value is set to 1 (meaning this position is selected for the next update). If S(v_{id}^{new}) is smaller than r_3, then the position value is set to 0 (meaning this position is not selected for the next update) (Kennedy & Eberhart, 1997).
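To make Eqs. (1)-(4) concrete, the following is a minimal Python sketch of a single BPSO update step. The vectorized NumPy form, the function name bpso_update, and the default parameter values are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def bpso_update(x, v, pbest, gbest, w=1.0, c1=2.0, c2=2.0, v_max=6.0, rng=None):
    """One BPSO update of binary positions x and velocities v (Eqs. (1)-(4)).

    x, v, pbest: arrays of shape (n_particles, n_dims); gbest: shape (n_dims,).
    Returns the new positions and velocities.
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)

    # Eq. (1): velocity update driven by the personal and global bests.
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)

    # Eq. (2): clamp velocities to [V_min, V_max] (here V_min = -V_max).
    v_new = np.clip(v_new, -v_max, v_max)

    # Eq. (3): sigmoid transfer function maps each velocity to a probability.
    s = 1.0 / (1.0 + np.exp(-v_new))

    # Eq. (4): each bit becomes 1 with probability S(v), otherwise 0.
    r3 = rng.random(x.shape)
    x_new = (r3 < s).astype(int)
    return x_new, v_new
```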

2.2. Catfish binary particle swarm optimization (CatfishBPSO)

The catfish effect derives its name from an effect Norwegian fishermen observed when they introduced catfish into a holding tank for caught sardines. The introduction of catfish, which are different from sardines, into the tank stimulated sardine movement, thus keeping the sardines alive and therefore fresh for a longer time. Similarly, CatfishBPSO introduces catfish particles to stimulate a renewed search by the "sardine" particles. In other words, these catfish particles guide particles that are trapped in a local optimum towards new regions of the search space, and thus to potentially better solutions. In CatfishBPSO, a particle swarm is randomly initialized in the first step, and the particles are distributed over the D-dimensional search space. The position of each particle is represented in binary string form; the bit values 0 and 1 represent a non-selected and a selected feature, respectively. Each particle is updated by tracking two values. The first, pbest, is the best solution (fitness) the particle has achieved so far. The other value tracked is the global best value, gbest, obtained so far by any particle in the population. The position and velocity of each particle are updated by Eqs. (1)-(4). If the distance between gbest and the surrounding particles is small, each particle is considered part of the cluster around gbest and will only move a very small distance in the next generation. To avoid this premature convergence, catfish particles are introduced and replace the 10% of original particles with the worst fitness values in the swarm. These catfish particles are essential for the success of a given optimization task. The introduction of catfish particles in CatfishBPSO is simple and can be done without increasing the computational complexity of the process. Catfish particles overcome the inherent defect (premature convergence) of BPSO by initializing a new search over the entire search space from its extreme points. A more detailed description of the entire CatfishBPSO process can be found in Chuang, Tsai, and Yang (2008).

2.3. K-nearest neighbor

The K-nearest neighbor (K-NN) method is one of the most popular nonparametric methods (Cover & Hart, 1967; Fix & Hodges, 1951) used for the classification of new objects based on attributes and training samples. K-NN is a supervised learning algorithm that classifies a query instance based on the majority category among its K nearest neighbors. K-NN classifiers do not build any model; classification is determined solely by the minimum distance from the query instance to the training samples. Any ties are resolved by a random procedure.

The advantage of the K-NN method is that it is simple and easy to implement. In this study, each feature subset was evaluated by the leave-one-out cross-validation (LOOCV) of the one-nearest-neighbor (1-NN) classifier. Neighbors are determined by their Euclidean distance. The 1-NN classifier does not require any user-specified parameters, and the classification results are implementation independent. In the LOOCV method, a single observation from the original sample is selected as the validation data, and the remaining observations constitute the training data. This process is repeated so that each observation in the sample is used once as the validation data. Essentially, the procedure is the same as K-fold cross-validation where K is equal to the number of observations in the original sample. The pseudo-codes for CatfishBPSO and 1-NN are shown below.

CatfishBPSO pseudo-code:

Begin
  Randomly initialize the particle swarm
  while (the number of iterations or the stopping criterion is not met)
    Evaluate the fitness of the particle swarm by 1-NN()
    for n = 1 to the number of particles
      Find pbest_n and gbest
      for d = 1 to the number of dimensions of the particle
        Update the position of the particle by Eqs. (1)-(4)
      next d
    next n
    if the fitness of gbest is the same three times then
      Sort the particle swarm by fitness, from best to worst
      for n = nine-tenths of the number of particles to the number of particles
        if a random number > 0.5 then
          for d = 1 to the number of dimensions of the particle
            position of the catfish particle in dimension d = 1 (maximum of the search space)
          next d
        else
          for d = 1 to the number of dimensions of the particle
            position of the catfish particle in dimension d = 0 (minimum of the search space)
          next d
        end if
      next n
    end if
  next generation, until the stopping criterion is met
end
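As a complement to the pseudo-code above, the catfish step can be sketched in Python as follows. The function name inject_catfish and the assumption that higher fitness is better (classification accuracy) are illustrative; the 10% replacement ratio and the all-ones/all-zeros extreme positions follow the description in Section 2.2.

```python
import numpy as np

def inject_catfish(positions, fitness, frac=0.10, rng=None):
    """Replace the worst-fitness particles with 'catfish' particles.

    Each catfish is initialized at an extreme point of the binary search
    space: all ones (all features selected) or all zeros (no features
    selected), chosen at random, as in the CatfishBPSO pseudo-code.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_particles, n_dims = positions.shape
    n_catfish = max(1, int(frac * n_particles))

    # Indices of the worst particles (assuming larger fitness is better).
    worst = np.argsort(fitness)[:n_catfish]
    for i in worst:
        positions[i, :] = 1 if rng.random() > 0.5 else 0
    return positions
```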

1-NN pseudo-code:

begin
  for i = 1 to the number of samples of the classification problem
    nearest = infinity
    for j = 1 to the number of samples of the classification problem, j ≠ i
      dist = 0
      for k = 1 to the number of dimensions of the classification problem
        dist = dist + (data_ik - data_jk)^2
      next k
      if dist < nearest then
        class_i = class_j
        nearest = dist
      end if
    next j
  next i
  correct = 0
  for i = 1 to the number of samples of the classification problem
    if class_i = the real class of sample i then
      correct = correct + 1
    end if
  next i
  Fitness value = correct / number of samples
end
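A minimal Python sketch of the 1-NN LOOCV fitness evaluation described above follows; it assumes data is a NumPy array restricted to the currently selected features and labels holds the class of each sample. The function name is illustrative.

```python
import numpy as np

def loocv_1nn_fitness(data, labels):
    """Leave-one-out 1-NN classification accuracy (the CatfishBPSO fitness).

    For every sample, its nearest neighbor (squared Euclidean distance)
    among the remaining samples predicts its class; the fitness is the
    fraction of correct predictions.
    """
    n = len(data)
    correct = 0
    for i in range(n):
        # Squared Euclidean distances from sample i to all other samples.
        dists = np.sum((data - data[i]) ** 2, axis=1, dtype=float)
        dists[i] = np.inf                      # exclude the held-out sample
        nearest = np.argmin(dists)
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / n
```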

2.4. Parameter setting

In our experiments, the same conditions were used to compare the performance of the SGA (Oh et al., 2004), HGAs (Oh et al., 2004), BPSO and CatfishBPSO algorithms, i.e. a population size of 20 (Oh et al., 2004) and K = 1 (Oh et al., 2004) for the K-nearest neighbor classifier. The other parameters of CatfishBPSO were set as follows: generations = 70 (maximum number of iterations), w = 1.0 (Kennedy, Eberhart, & Shi, 2001), and c1 = c2 = 2 (Kennedy et al., 2001). The number of generations for the SGA, HGAs, BPSO and CatfishBPSO is listed in Table 1.
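For reference, the reported settings can be collected in one place. The dictionary keys below are illustrative names, and the values simply restate those given above; the catfish fraction and stagnation limit come from Section 2.2 and the pseudo-code.

```python
# CatfishBPSO settings used in the experiments (key names are illustrative).
catfishbpso_params = {
    "population_size": 20,     # as in Oh et al. (2004)
    "generations": 70,         # maximum number of iterations
    "w": 1.0,                  # inertia weight
    "c1": 2.0,                 # cognitive acceleration constant
    "c2": 2.0,                 # social acceleration constant
    "v_max": 6.0,
    "v_min": -6.0,
    "k_neighbors": 1,          # K in the K-NN fitness evaluation
    "catfish_fraction": 0.10,  # worst particles replaced on stagnation
    "stagnation_limit": 3,     # gbest unchanged for three iterations
}
```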

3. Experimental results and discussion

In order to investigate the effectiveness and performance of the CatfishBPSO algorithm for classification problems, we chose 10 classification problems from the literature. These data sets were obtained from the UCI Repository (Murphy & Aha, 1994). The data format is shown in Table 2. Three types of classification problems, small-, medium-, and large-sized groups, were tested. If the number of features is between 10 and 19, the sample group can be considered small; the Glass, Vowel, Wine, Letter, Vehicle, and Segmentation problems constitute such small sample groups. If the number of features is between 20 and 49, the test problems are of medium size; groups in this category include the WDBC, Ionosphere, and Satellite problems. If the number of features is greater than 50, the test problems are large sample group problems; this group includes the Sonar problem. Furthermore, in order to compare our results to the results published for HGAs (Oh et al., 2004) under identical conditions, the 1-NN method with leave-one-out cross-validation (LOOCV) was used to evaluate all data sets. Employing the 1-NN method with LOOCV as a fitness function for CatfishBPSO has two distinct advantages: the calculation time can be decreased, and higher classification accuracy can be obtained.

3.1. Experimental results

In Table 3, the classification accuracy and the number of selected features are shown for the data sets tested with BPSO, CatfishBPSO


E-214-004-CC3, NSC95-2221-E-151-004-MY3, NSC95-2221-E-214-087, and NSC95-2622-E-214-004.

References

Agrafiotis, D. K., & Cedeño, W. (2002). Feature selection for structure-activity correlation using binary particle swarms. Journal of Medicinal Chemistry, 45(5), 1098–1107.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4), 537–550.
Bello, R., Gomez, Y., Garcia, M. M., & Nowe, A. (2007). Two-step particle swarm optimization to solve the feature selection problem. In Intelligent Systems Design and Applications (pp. 691–696).
Brill, F. Z., Brown, D. E., & Martin, W. N. (1992). Fast genetic selection of features for neural network classifiers. IEEE Transactions on Neural Networks, 3(2), 324–328.
Chuang, L. Y., Tsai, S. W., & Yang, C. H. (2008). Catfish particle swarm optimization. In IEEE Swarm Intelligence Symposium 2008 (SIS 2008), St. Louis, Missouri (p. 20).
Conover, W. J. (1999). Practical nonparametric statistics (3rd ed.). New York: Wiley & Sons Inc.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
Crammer, K., & Singer, Y. (2000). On the learnability and design of output codes for multiclass problems. In Proceedings of the 13th Annual Conference on Computational Learning Theory (COLT 2000), Stanford University, Palo Alto, CA, June 28–July 1.
Dasarathy, B. V. (1991). NN concepts and techniques, nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society Press (pp. 1–30).
Elbeltagi, E., Hegazy, T., & Grierson, D. (2005). Comparison among five evolutionary-based optimization algorithms. Advanced Engineering Informatics, 19, 43–53.
Fix, E., & Hodges, J. L. (1951). Discriminatory analysis – nonparametric discrimination: Consistency properties. Project 21-49-004, Report 4, US Air Force School of Aviation Medicine, Randolph Field (pp. 261–279).
Gunal, S., & Edizkan, R. (2008). Subspace based feature selection for pattern recognition. Information Sciences, 178(9), 3716–3726.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Hsu, C. W., & Lin, C.-J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 12, 415–425.
Kennedy, J., & Eberhart, R. C. (1997). A discrete binary version of the particle swarm algorithm. In IEEE International Conference on Systems, Man, and Cybernetics, Computational Cybernetics and Simulation (Vol. 5, pp. 4104–4108).
Kennedy, J., Eberhart, R. C., & Shi, Y. (2001). Swarm intelligence. San Francisco: Morgan Kaufmann Publishers.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Kreßel, U. (1999). Pairwise classification and support vector machines. In Advances in kernel methods: Support vector learning (pp. 255–268). Cambridge, MA: MIT Press.
Kudo, M., & Sklansky, J. (2000). Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33(1), 25–41.
Lew, M. S. (2001). Principles of visual information retrieval. London: Springer-Verlag.
Li, T., Zhu, S., & Ogihara, M. (2003). Efficient multi-way text categorization via generalized discriminant analysis. In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM 2003) (pp. 317–324). NY: ACM Press.
Liu, Y., & Dellaert, F. (1998). A classification based similarity metric for 3D image retrieval. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (pp. 800–805).
Liu, H., & Setiono, R. (1996). A probabilistic approach to feature selection – A filter solution. In Proceedings of the 13th International Conference on Machine Learning (pp. 319–327).
Martin-Bautista, M. J., & Vila, M.-A. (1999). A survey of genetic feature selection in mining issues. In Proceedings of the Congress on Evolutionary Computation (Vol. 2, pp. 1314–1321).
Mitchell, T. M. (1997). Machine learning. New York, NY, USA: McGraw-Hill.
Murphy, P. M., & Aha, D. W. (1994). UCI repository of machine learning databases. Technical report, Department of Information and Computer Science, University of California, Irvine, CA. Available at: <http://www.ics.uci.edu/~mlearn/MLRepository.html>.
Oh, I.-S., Lee, J.-S., & Moon, B.-R. (2004). Hybrid genetic algorithms for feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11), 1424–1437.
Piramuthu, S. (1998). Evaluating feature selection methods for learning in data mining applications. In Proceedings of the IEEE 31st Annual Hawaii International Conference on System Science (Vol. 5, pp. 294–301).
Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In Advances in Neural Information Processing Systems (Vol. 12, pp. 547–553). MIT Press.
Pudil, P., Novovicova, J., & Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15(11), 1119–1125.
Raymer, M. L., Punch, W. F., Goodman, E. D., Kuhn, L. A., & Jain, A. K. (2000). Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4(2), 164–171.
Safavian, S. R., & Landgrebe, D. (1991). A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 660–674.
Shi, X. H., Liang, Y. C., Lee, H. P., Lu, C., & Wang, L. M. (2005). An improved GA and a novel PSO-GA-based hybrid algorithm. Information Processing Letters, 93, 255–261.
Tanaka, K., Kurita, T., & Kawabe, T. (2007). Selection of import vectors via binary particle swarm optimization and cross-validation for kernel logistic regression. In Proceedings of the International Joint Conference on Neural Networks, Orlando, Florida, USA, August 12–17.
Vapnik, V. (1998). Statistical learning theory. New York, NY, USA: Wiley-Interscience.
Wang, X., Yang, J., Teng, X., Xia, W., & Jensen, R. (2007). Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28(4), 459–471.
Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of the Seventh European Symposium on Artificial Neural Networks (ESANN 99), Bruges (pp. 21–23).
Yu, B., & Yuan, B. (1993). A more efficient branch and bound algorithm for feature selection. Pattern Recognition, 26(6), 883–889.
Zhang, H., & Sun, G. (2002). Feature selection using tabu search method. Pattern Recognition, 35(3), 701–711.
