
Gene selection and classification using Taguchi chaotic binary particle swarm optimization

Li-Yeh Chuang (a), Cheng-San Yang (b,*), Kuo-Chuan Wu (c), Cheng-Hong Yang (d,e,*)

(a) Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung 80041, Taiwan
(b) Department of Plastic Surgery, Chia-Yi Christian Hospital, Chiayi 60002, Taiwan
(c) Department of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80708, Taiwan
(d) Department of Network Systems, Toko University, Chiayi 61363, Taiwan
(e) Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80708, Taiwan

* Corresponding authors. Addresses: Department of Plastic Surgery, Chia-Yi Christian Hospital, Chiayi 60002, Taiwan. Tel.: +886 5 276 5041; fax: +886 5 277 4511 (C.-S. Yang). Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 80708, Taiwan. Tel.: +886 7 381 4526x5639; fax: +886 7 383 6844 (C.-H. Yang).
E-mail addresses: chuang@isu.edu.tw (L.-Y. Chuang), p8896117@mail.ncku.edu.tw (C.-S. Yang), 1097308101@cc.kuas.edu.tw (K.-C. Wu), chyang@cc.kuas.edu.tw (C.-H. Yang).

Keywords: Microarray data; Correlation-based feature selection; Taguchi-binary particle swarm optimization; K-nearest neighbor

Abstract

The purpose of gene expression analysis is to discriminate between classes of samples, and to predict the relative importance of each gene for sample classification. Microarray data with reference to gene expression profiles have provided some valuable results related to a variety of problems and contributed to advances in clinical medicine. Microarray data characteristically have a high dimension and a small sample size. This makes it difficult for a general classification method to obtain correct data for classification. However, not every gene is potentially relevant for distinguishing the sample class. Thus, in order to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process, and an effective gene extraction method is necessary for eliminating irrelevant genes and decreasing the classification error rate.

In this paper, correlation-based feature selection (CFS) and Taguchi chaotic binary particle swarm optimization (TCBPSO) were combined into a hybrid method. The K-nearest neighbor (K-NN) with leave-one-out cross-validation (LOOCV) served as a classifier for ten gene expression profiles. Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The proposed method achieved the lowest classification error rate on all ten gene expression data set problems tested. For six of the gene expression profile data sets a classification error rate of zero could be reached. The introduced method outperformed five other methods from the literature in terms of classification error rate. It could thus constitute a valuable tool for gene expression analysis in future studies.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Microarray data characteristically have a high dimension and a small sample size, which makes it difficult for a general classification method to obtain correct data for classification. The purpose of classification is to build an efficient model for predicting the class membership of data. This model should produce a correct label on the training data and predict the label of any unknown data accurately. Determining an optimal feature (gene) subset is a very complex task, yet one which proves decisive for the outcome of the classification error rate. Classifying microarray data involves feature selection and classifier design. Feature selection is the process of choosing a subset of features from an original feature set and can thus be viewed as a principal pre-processing tool prior to solving classification problems (Wang, Yang, Teng, Xia, & Jensen, 2007). The goal of feature selection is to reduce the dimensionality of the problem and to retain the characteristics necessary for recognition, classification and/or the data mining process. A reliable selection method that obtains the relevant genes from the sample data is needed in order to decrease the classification error rates and to avoid incomprehensibility.

Performing an exhaustive search over the entire solution space is not practical since this would require a long computing time associated with high cost. To overcome these feature selection problems, irrelevant and redundant features have to be eliminated and only features that are relevant for the classification process should be considered. Deleting irrelevant features significantly




improves the computational efficiency and lowers the classification error rate. As many pattern recognition techniques were originally not designed to deal with large amounts of irrelevant or redundant features, combining feature selection techniques has become a necessity (Guyon & Elisseeff, 2003; Kohavi & John, 1997; Liu & Motoda, 1998). Existing methods for data reduction, or more specifically for feature selection in the context of microarray data analysis (Kim & Cho, 2008; Saeys, Inza, & Larranaga, 2007; Tang, Zhang, Huang, Hu, & Zhao, 2008; Wong & Hsu, 2008), can be classified into two major groups: filter and wrapper approaches. The filter approach separates data before the actual classification process and then calculates feature weight values. Thus, features that represent the original data set better can be identified. However, the filter approach does not account for interactions amongst the features. Methods in this category include correlation-based feature selection (CFS) (Hall, 1999), the t-test, information gain (Quinlan, 1986), mutual information (Battiti, 1994) and entropy-based methods (Liu, Krishnan, & Mondry, 2005). Wrapper models generally focus on improving the classification accuracy of pattern classification problems and typically perform better (i.e., reach higher classification accuracy) than filter models. However, wrapper approaches are computationally more expensive than filter methods (Kohavi & John, 1997; Liu et al., 2005). Several methods have previously been used to perform feature selection of training and testing data, such as genetic algorithms (Raymer, Punch, Goodman, Kuhn, & Jain, 2000), branch and bound algorithms (Narendra & Fukunaga, 1977), sequential search algorithms (Pudil, Novovicov, & Kittler, 1994), tabu search (Zhang & Sun, 2002), binary particle swarm optimization (BPSO) (Chuang, Chang, Tu, & Yang, 2008), and hybrid genetic algorithms (Oh, Lee, & Moon, 2004). Some novel evaluation criteria for wrapper and filter models are currently under development (Zhu, Ong, & Dash, 2007).

Many optimization algorithms produce locally optimal solutions and are thus combined with a local search process to improve the outcome. An example of such a local search process incorporated in a genetic algorithm can be found in Oh et al. (2004). In this study, we used the Taguchi method as a local search method in chaotic binary particle swarm optimization (CBPSO). The Taguchi method uses ideas from statistical experimental design to improve or optimize products, processes and equipment. It uses two major tools: the signal-to-noise ratio (SNR), which measures quality, and orthogonal arrays (OAs), which are used to study many design parameters simultaneously. This method has been successfully applied in machine learning and data mining, e.g., combined data mining and electrical discharge machining (Chang, Tsai, & Ke, 2006). Sohn and Shin used the Taguchi experimental design for the Monte Carlo simulation of classifier combination methods (Sohn & Shin, 2007). Kwak and Choi used the Taguchi method for feature selection for classification problems (Kwak & Choi, 2002). Chen et al. optimized neural network parameters using the Taguchi method (Chen, Tai, Wang, Deng, & Chen, 2008).

This paper presents a hybrid feature selection approach consisting of two stages, namely correlation-based feature selection and Taguchi chaotic binary particle swarm optimization (CFS-TCBPSO). In the first stage, the correlation-based feature selection value of each feature is calculated in order to select those features that can differentiate a variety of classes. This process is considered a filter approach. Taguchi chaotic binary particle swarm optimization, a wrapper approach, is then applied to the features that have been selected during the first stage, and evaluates whether the selected features affect the classification error rate using the K-nearest neighbor (KNN) classifier (Cover & Hart, 1967; Fix & Hodges, 1951; Tan, 2006) with leave-one-out cross-validation (LOOCV) (Cawley & Talbot, 2003; Stone, 1974). Ten classification microarray data sets taken from the literature (Diaz-Uriarte & Alvarez de Andres, 2006) were used for the feature selection and classification. The experimental results show that the proposed method achieved the lowest classification error rates and outperformed the other methods from the literature it was tested against.

2. Material and methods

2.1. Correlation-based feature selection

Correlation-based feature selection (CFS) is a filter feature selection method that uses a heuristic for evaluating the merit of a subset of features. The heuristic takes the usefulness of the individual features for labeling the class and their inter-correlation into account. The hypothesis of CFS is based on the statement "Good feature subsets contain features highly correlated with (i.e., predictive of) the class, yet uncorrelated with (i.e., not predictive of) each other" (Hall, 1999).

This hypothesis is incorporated into the correlation-based heuristic evaluation equation as:

$$\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \qquad (1)$$

where Merit_S is the heuristic merit of a feature subset S containing k features, $\overline{r_{cf}}$ is the average feature-class correlation, and $\overline{r_{ff}}$ is the average feature-feature intercorrelation (f ∈ S). Eq. (1) is Pearson's correlation, where all variables have been standardized. General filter methods estimate the significance of a feature individually. CFS is then used to determine the best combination of attribute subsets via score values from the original data sets. The attributes are combined since they would be poor predictors of the class individually. Redundant attributes are discriminated against because they would be highly correlated with one or more of the other attributes (Hall, 1999).
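To make Eq. (1) concrete, the following minimal sketch (ours, not the authors' implementation) scores a candidate feature subset from precomputed correlations; the inputs `r_cf` and `r_ff` are assumed to be supplied by the caller, e.g. as the SU values defined later in this section.

```python
import numpy as np

def cfs_merit(subset, r_cf, r_ff):
    """CFS heuristic merit of a feature subset (Eq. (1)).

    subset : sequence of feature indices (the set S, |S| = k)
    r_cf   : 1-D array, correlation of each feature with the class
    r_ff   : 2-D array, feature-feature inter-correlations
    """
    k = len(subset)
    if k == 0:
        return 0.0
    mean_rcf = np.mean([abs(r_cf[f]) for f in subset])
    # average pairwise feature-feature correlation within the subset
    pairs = [abs(r_ff[f, g]) for i, f in enumerate(subset)
             for g in subset[i + 1:]]
    mean_rff = np.mean(pairs) if pairs else 0.0
    return (k * mean_rcf) / np.sqrt(k + k * (k - 1) * mean_rff)
```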

Various heuristic search strategies, such as the best-first method (Rich & Knight, 1991), are often applied to search the feature subset space in a reasonable time frame. We applied the best-first method to calculate a matrix of feature-class and feature-feature correlation merits for CFS from the training data. The best-first search starts with an empty set of features and generates all possible single-feature expansions. Given enough time, a best-first search will explore the entire feature subset space, so CFS uses a stopping criterion: the search terminates once a number of consecutive fully expanded subsets show no improvement (Hall, 1999); a sketch of this search follows below.
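A compact sketch of the best-first subset search just described, reusing the `cfs_merit` function above. Treating the stopping criterion as a fixed number of consecutive non-improving expansions is our reading of Hall (1999), and `max_stale` is a hypothetical parameter.

```python
import heapq

def best_first_search(n_features, merit_fn, max_stale=5):
    """Best-first search over feature subsets.

    Starts from the empty set, expands each dequeued subset by one
    feature at a time, and stops after `max_stale` consecutive
    non-improving expansions. `merit_fn` maps a tuple of feature
    indices to a score, e.g. lambda s: cfs_merit(s, r_cf, r_ff).
    """
    start = ()
    open_list = [(-merit_fn(start), start)]      # max-heap via negated merit
    visited = {start}
    best_subset, best_merit, stale = start, merit_fn(start), 0
    while open_list and stale < max_stale:
        _, subset = heapq.heappop(open_list)
        improved = False
        for f in range(n_features):
            if f in subset:
                continue
            child = tuple(sorted(subset + (f,)))  # single-feature expansion
            if child in visited:
                continue
            visited.add(child)
            m = merit_fn(child)
            heapq.heappush(open_list, (-m, child))
            if m > best_merit:
                best_subset, best_merit, improved = child, m, True
        stale = 0 if improved else stale + 1
    return best_subset, best_merit
```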

In order to calculate the merit of a feature set, the correlation between features is computed using symmetrical uncertainty (SU):

$$SU = 2.0 \times \left[\frac{H(Y) + H(X) - H(X,Y)}{H(Y) + H(X)}\right] \qquad (2)$$

where H(Y) and H(X, Y) are defined via the entropy:

$$H(Y) = -\sum_{y \in Y} p(y)\log_2 p(y) \qquad (3)$$

where a probabilistic model of a feature Y can be formed by estimating the individual probabilities of the values y ∈ Y from the training data. If feature Y in the training data is partitioned according to another feature X, then the relationship between features Y and X is given by the conditional entropy:

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x)\log_2 p(y \mid x) \qquad (4)$$

SU compensates for the information gain's bias toward attributes with more values; the SU value lies in the range [0, 1].
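Written directly from Eqs. (2) and (3), a sketch of SU for two discrete feature vectors; continuous expression values would first need to be discretized, a preprocessing step not shown here.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """H(Y) = -sum p(y) log2 p(y), estimated from value counts (Eq. (3))."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """SU = 2 * [H(X) + H(Y) - H(X, Y)] / [H(X) + H(Y)]  (Eq. (2))."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    if hx + hy == 0:
        return 0.0                   # both features are constant
    return 2.0 * (hx + hy - hxy) / (hx + hy)
```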

2.2. Taguchi method

The Taguchi method (Taguchi, Chowdhury, & Taguchi, 2000; Tsai, Liu, & Chou, 2004; Wu & Wu, 2000) is a robust design approach that uses ideas from statistical experimental design for estimating and performing improvements in products, processes, and parameter settings. In a robust experimental design, processes or products can be analyzed and improved by altering relevant design factors. Robust experimental design is used to design experiments and analyze the solution efficiently and reliably. The Taguchi method provides two tools, an orthogonal array (OA) and the signal-to-noise ratio (SNR), for analysis and improvement (Taguchi et al., 2000; Tsai et al., 2004; Wu & Wu, 2000). These tools are used to determine the levels of factors and to minimize the sensitivity to noise. A general two-level OA can be defined as L_h(2^d), where d is the number of columns (i.e., the number of design parameters) in the orthogonal matrix, and h = 2^k (h > d, k > log_2(d), k an integer) denotes the number of experimental trials; base 2 denotes the number of levels for each design parameter. Hence, if a particular target (i.e., process or product) has d different design factors, 2^d possible experimental trials would have to be considered in a full factorial experimental design. The OA is principally utilized to decrease the experimental effort associated with the d design parameters. An OA can be considered a fractional factorial experimental design matrix that provides a comprehensive analysis of interactions among all design factors, and fair, balanced and systematic comparisons of the different levels (or options) of each design factor. Table 1 is an example of an L_8(2^7) orthogonal array.

The SNR is then utilized to analyze and optimize design parameters for a particular target. The Taguchi method classifies robust parameter design problems into different categories depending on the target of the problem. Typically, the smaller-the-better and larger-the-better SNR types are utilized (Wu & Wu, 2000). Consider a set of n observations {y_1, y_2, ..., y_n}.

For the smaller-the-better characteristic, the SNR is determined as

$$SNR = -10\log\left(\frac{1}{n}\sum_{t=1}^{n} y_t^2\right) \qquad (5)$$

For the larger-the-better characteristic, the SNR is determined as

$$SNR = -10\log\left(\frac{1}{n}\sum_{t=1}^{n} \frac{1}{y_t^2}\right) \qquad (6)$$

The SNR in the Taguchi method is used to determine the robustness of the levels of each design parameter. A "high quality" result for a particular target can be achieved by setting design parameters at the specific levels with a high SNR.
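Both SNR variants are direct translations of Eqs. (5) and (6); a minimal sketch, assuming the base-10 logarithm of the usual Taguchi convention:

```python
import numpy as np

def snr_smaller_the_better(y):
    """Eq. (5): SNR = -10 log10( (1/n) * sum(y_t^2) )."""
    y = np.asarray(y, dtype=float)
    return -10.0 * np.log10(np.mean(y ** 2))

def snr_larger_the_better(y):
    """Eq. (6): SNR = -10 log10( (1/n) * sum(1 / y_t^2) )."""
    y = np.asarray(y, dtype=float)
    return -10.0 * np.log10(np.mean(1.0 / y ** 2))
```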

2.3. Binary particle swarm optimization

In the original particle swarm optimization (PSO) (Kennedy & Eberhart, 1995), each particle is analogous to an individual "fish" in a school of fish. PSO is a population-based optimization technique, where the population is called a swarm. A swarm consists of N particles moving around in a D-dimensional search space. The position of the ith particle can be represented by x_i = (x_i1, x_i2, ..., x_iD). The velocity of the ith particle can be written as v_i = (v_i1, v_i2, ..., v_iD). The positions and velocities of the particles are confined within [X_min, X_max]^D and [V_min, V_max]^D, respectively. Each particle coexists and evolves simultaneously based on knowledge shared with neighboring particles; it makes use of its own memory and of knowledge gained by the swarm as a whole to find the best solution. The best previously encountered position of the ith particle is denoted its individual best position p_i = (p_i1, p_i2, ..., p_iD), a value called pbest_i. The best of all individual pbest_i values is denoted the global best position g = (g_1, g_2, ..., g_D) and called gbest. The PSO process is initialized with a population of random particles, and the algorithm then executes a search for optimal solutions by continuously updating generations. However, many optimization problems occur in a space featuring discrete, qualitative distinctions between variables and between levels of variables. For this reason, Kennedy and Eberhart (1997) introduced binary PSO (BPSO), which can be applied to discrete binary variables. In a binary space, a particle may move to near corners of a hypercube by flipping various numbers of bits; thus, the overall particle velocity may be described by the number of bits changed per iteration (Kennedy & Eberhart, 1997). In BPSO, each particle is updated based on the following equations:

$$v_{id}^{new} = w \cdot v_{id}^{old} + c_1 \cdot r_1 \cdot (pbest_{id} - x_{id}^{old}) + c_2 \cdot r_2 \cdot (gbest_d - x_{id}^{old}) \qquad (7)$$

$$\text{if } v_{id}^{new} \notin (V_{min}, V_{max}) \text{ then } v_{id}^{new} = \max(\min(V_{max}, v_{id}^{new}), V_{min}) \qquad (8)$$

$$S(v_{id}^{new}) = \frac{1}{1 + e^{-v_{id}^{new}}} \qquad (9)$$

$$\text{if } r_3 < S(v_{id}^{new}) \text{ then } x_{id}^{new} = 1 \text{ else } x_{id}^{new} = 0 \qquad (10)$$

In these equations, w is the inertia weight that controls the impact of the previous velocity of a particle on its current one, r_1, r_2 and r_3 are random numbers between [0, 1], and c_1 and c_2 are acceleration constants that control how far a particle moves in a single generation. The velocities v_id^new and v_id^old denote the velocities of the new and old particle, respectively. x_id^old is the current particle position, and x_id^new is the new, updated particle position. In Eq. (8), particle velocities in each dimension are restricted to a maximum velocity V_max. If the sum of accelerations causes the velocity in that dimension to exceed V_max, then the velocity in that dimension is limited to V_max. V_max and V_min are user-specified parameters (in our case V_max = 6, V_min = −6). The position of a particle after updating is calculated by the function S(v_id^new) (Eq. (9)). If S(v_id^new) is larger than r_3, then its position value is represented by {1} (meaning this position is selected for the next update). If S(v_id^new) is smaller than r_3, then its position value is represented by {0} (meaning this position is not selected for the next update).
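A sketch of one full BPSO generation implementing Eqs. (7)-(10) for the whole swarm at once. The default values for w, c1 and c2 are common settings rather than the paper's verified configuration, while Vmax = 6 follows the text above.

```python
import numpy as np

def bpso_update(x, v, pbest, gbest, w=0.9, c1=2.0, c2=2.0, vmax=6.0):
    """One BPSO iteration for a swarm of binary particles.

    x, v   : (N, D) arrays of current positions (0/1) and velocities
    pbest  : (N, D) array of each particle's best position
    gbest  : (D,) array, global best position
    """
    n, d = x.shape
    r1, r2 = np.random.rand(n, d), np.random.rand(n, d)
    # Eq. (7): velocity update from inertia, cognitive and social terms
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    # Eq. (8): clamp velocities to [-vmax, vmax]
    v_new = np.clip(v_new, -vmax, vmax)
    # Eq. (9): sigmoid transfer function
    s = 1.0 / (1.0 + np.exp(-v_new))
    # Eq. (10): stochastic binarization of the new position
    r3 = np.random.rand(n, d)
    x_new = (r3 < s).astype(int)
    return x_new, v_new
```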

2.4. Chaotic binary particle swarm optimization

Generating random sequences with a long period and good uniformity is very important for a heuristic algorithm. Since chaotic sequences are non-repetitive, they can be embedded in a heuristic algorithm. Chaos can be described as the complex behavior of a nonlinear deterministic system that has ergodic and stochastic properties (Schuster & Just, 2005). It is very sensitive to the initial conditions and the parameters used. In other words, the effects of chaos are not proportional to the small differences in the initial values. In what is called the "butterfly effect", small variations of an initial variable can result in huge differences in the solutions after a certain number of iterations. Optimization algorithms based on the chaos theory
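The particular chaotic map CBPSO substitutes for the random sequence is not shown in this excerpt; purely as an illustration of the properties described above (determinism, non-repetition, sensitivity to initial conditions), the sketch below uses the logistic map, a common choice in chaotic PSO variants.

```python
def logistic_map_sequence(x0, n, mu=4.0):
    """Generate n values of the logistic map x_{t+1} = mu * x_t * (1 - x_t).

    For mu = 4 the map is chaotic on (0, 1): deterministic, non-repetitive,
    and extremely sensitive to the initial value x0 (the "butterfly effect").
    x0 must lie in (0, 1) and avoid the points 0.25, 0.5 and 0.75,
    which fall into fixed points of the map.
    """
    seq, x = [], x0
    for _ in range(n):
        x = mu * x * (1.0 - x)
        seq.append(x)
    return seq

# Two nearby starting points diverge after a few dozen iterations:
a = logistic_map_sequence(0.3000, 50)
b = logistic_map_sequence(0.3001, 50)
print(abs(a[-1] - b[-1]))  # typically O(1), despite a 1e-4 initial gap
```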

Table 1
L_8(2^7) orthogonal array.

Experimental trial | Design factors (features)
                   | A  B  C  D  E  F  G
Column number      | 1  2  3  4  5  6  7
1                  | 1  1  1  1  1  1  1
2                  | 1  1  1  2  2  2  2
3                  | 1  2  2  1  1  2  2
4                  | 1  2  2  2  2  1  1
5                  | 2  1  2  1  2  1  2
6                  | 2  1  2  2  1  2  1
7                  | 2  2  1  1  2  2  1
8                  | 2  2  1  2  1  1  2
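To illustrate how the OA of Table 1 and the SNR are combined, here is a hedged sketch: it runs one trial per OA row through a user-supplied evaluation function and, for each factor, keeps the level whose trials show the higher mean SNR. The larger-the-better interpretation and the `evaluate` hook are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

# L8(2^7) orthogonal array from Table 1 (levels coded 1 and 2).
L8 = np.array([
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 2, 2, 2, 2],
    [1, 2, 2, 1, 1, 2, 2],
    [1, 2, 2, 2, 2, 1, 1],
    [2, 1, 2, 1, 2, 1, 2],
    [2, 1, 2, 2, 1, 2, 1],
    [2, 2, 1, 1, 2, 2, 1],
    [2, 2, 1, 2, 1, 1, 2],
])

def taguchi_best_levels(evaluate, oa=L8):
    """Run one trial per OA row and choose each factor's best level.

    evaluate : maps a row of factor levels to a quality value y > 0,
               treated here as larger-the-better (Eq. (6) with n = 1).
    Returns the level (1 or 2) chosen for each of the 7 factors.
    """
    y = np.array([evaluate(row) for row in oa], dtype=float)
    snr = -10.0 * np.log10(1.0 / y ** 2)         # per-trial SNR
    best = []
    for col in range(oa.shape[1]):
        snr1 = snr[oa[:, col] == 1].mean()        # mean SNR at level 1
        snr2 = snr[oa[:, col] == 2].mean()        # mean SNR at level 2
        best.append(1 if snr1 >= snr2 else 2)
    return best
```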


set of genes that can provide meaningful diagnostic information for disease prediction without diminishing accuracy. Feature selection uses relatively few features since only selective features need to be used. This does not affect the predictive error rate in a negative way. A general overall feature selection approach can be found in Saeys et al. (2007). Feature selection methods using a wrapper approach are very much dependent on the classifier or the pattern recognition approach used to assign the feature (gene) subset. On the other hand, filter approaches take only intrinsic features of the data into account. Finally, an embedded approach similar to a wrapper approach has the advantage that it includes interactions with the classification model, while at the same time being far less computationally intensive than a wrapper method (Saeys et al., 2007). Wang et al. (2005) indicate that filter approaches can select more relevant feature subsets faster than wrapper approaches. On the other hand, wrapper approaches generally tend to obtain better classification accuracies. Inza, Larranaga, Blanco, and Cerrolaza (2004) and Xiong, Fang, and Zhao (2001) used a wrapper approach to implement feature selection, and selected better feature subsets to boost classification accuracy. Nevertheless, optimal solutions are difficult to find due to the large size of the search space if only a wrapper approach is used. In this paper, we combined a filter and a wrapper approach. CFS is a filter method that searches the entire feature space efficiently, and TCBPSO is a wrapper method that uses an induction algorithm to evaluate the feature subsets directly. As stated above, wrapper methods generally outperform filter methods in terms of prediction error rate. Since the individual advantages of a wrapper and a filter method complement each other well (Zhu et al., 2007), we used a hybrid two-stage strategy to increase the classification accuracy. The Taguchi method implemented within the CBPSO procedure is responsible for the local search. The Taguchi principle is used to improve the quality of a product by minimizing the effect of the causes of variation without eliminating these causes (Tsai et al., 2004). The two-level orthogonal array and the SNR of the Taguchi method are used for exploitation. The optimum particles can easily be found by using both experimental runs and SNRs instead of executing all combinations of factor levels. Consequently, a superior candidate feature subset with high classification performance for the classification task at hand can be obtained in a subsequent iteration.

Many classifiers (e.g., KNN, linear and quadratic discriminant analysis, support vector machines, etc.) show good performance on microarray data. Each approach has its advantages and disadvantages, so no single approach can be considered ideal. As a classifier, KNN performs well for cancer classification compared to more sophisticated classifiers. It is an easily implemented method that has a single parameter (the number of nearest neighbors) to be pre-defined, given that the distance metric is Euclidean (Okun & Priisalu, 2009). Given a fixed dimension, a semi-definite positive norm, and n points in this space, the KNN of every point can be found in O(kn log n) time (Vaidya, 1989). The KNN method is easy to implement by computing the distances from the test sample to all stored vectors, and its time complexity is superior to that of other methods. On the other hand, the KNN parameter K, the best choice of the number of neighbors, depends upon the data. Generally, larger values of K reduce the effect of noise on the classification, but make boundaries between classes less distinct. Ghosh (2006) indicated that "the optimum value K depends on the specific data set and is to be estimated using the available training sample observations". Since the time complexity of KNN is O(kn log n), the parameter K directly influences the performance efficiency. In classification problems, overfitting appears when computationally intensive search algorithms are used. Estimates may be overfitted and yield biased predictions under these circumstances (Reunanen, Guyon, & Elisseeff, 2003). If the training data lie too closely together, the classifier predictions are of poor quality. This occurs when there is insufficient data to train the classifier and the data does not fully cover the concept being learned. This problem is common in many real-world samples where the available data may be rather noisy (Loughrey & Cunningham, 2005). In order to avoid overfitting, additional techniques have been discussed, such as cross-validation, regularization, and early termination or resampling (Schaffer, 1993; Wolpert, 1993). However, the best way to avoid overfitting is to use an abundant amount of training data. In this paper, the microarray data characteristically have a high dimension and small sample size, which is subsequently reduced by a filter feature selection method. After feature reduction, the LOOCV technique enhances the training data for classification in a wrapper-based feature selection method.
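One way to realize the K-NN with LOOCV evaluation described above, using scikit-learn for brevity (an implementation choice of this sketch, not of the paper). The mask `selected` would come from a particle's bit string, and the neighbor count `k` is left as a free parameter since the paper's value is not given in this excerpt.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut

def loocv_error_rate(X, y, selected, k=1):
    """Leave-one-out error of a K-NN classifier on the selected genes.

    X        : (n_samples, n_genes) expression matrix
    y        : (n_samples,) class labels
    selected : boolean mask or index array over genes (a particle's bits)
    """
    Xs = X[:, selected]
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(Xs):
        clf = KNeighborsClassifier(n_neighbors=k)  # Euclidean by default
        clf.fit(Xs[train_idx], y[train_idx])
        errors += int(clf.predict(Xs[test_idx])[0] != y[test_idx][0])
    return errors / len(y)
```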

The results show that the Taguchi method plays an important role in the local search of a GA. PSO has many advantages over GA (Yang, Huang, Wu, & Chang, 2008). In order to enhance the efficiency of BPSO, we designed the local search in such a way that it occurs when gbest is unchanged k times; a sketch of this trigger follows below. Fig. 3 shows how the proposed Taguchi method helps CBPSO escape the local optimum. In Yang, Chuang, Li, and Yang (2008), we used a fitness design called relative difference fitness (RDF) to determine if a local optimum occurred. In this case, new particles were generated to escape the entrapment in a local optimum. The IBPSO introduced in Chuang et al. (2008) avoids getting trapped in a local optimum by resetting gbest. The concepts behind Yang et al. (2008) and Chuang et al. (2008) are similar with regard to avoiding a local optimum. However, the various search strategies may lead to different results.
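A sketch of the trigger just described: the Taguchi local search fires only after gbest has failed to improve for k consecutive generations. The `step`, `fitness` and `taguchi_local_search` hooks are hypothetical stand-ins, since the OA/SNR refinement details are not shown in this excerpt.

```python
def run_tcbpso(init_swarm, step, fitness, taguchi_local_search,
               max_iter=100, k_stale=3):
    """CBPSO main loop with a stagnation-triggered Taguchi local search.

    step                 : advances the swarm one generation, returns (swarm, gbest)
    fitness              : classification error of gbest (lower is better)
    taguchi_local_search : refines gbest via OA trials and SNR analysis
    """
    swarm, gbest = init_swarm()
    best_fit, stale = fitness(gbest), 0
    for _ in range(max_iter):
        swarm, gbest = step(swarm, gbest)
        f = fitness(gbest)
        if f < best_fit:                # gbest improved: reset the counter
            best_fit, stale = f, 0
        else:
            stale += 1
        if stale >= k_stale:            # gbest unchanged k times: local search
            gbest = taguchi_local_search(gbest)
            best_fit = min(best_fit, fitness(gbest))
            stale = 0
    return gbest, best_fit
```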

6. Conclusions

In this paper, we have described a two-stage feature selection approach: a filter (CFS) and a wrapper (TCBPSO) feature selection method were combined in a hybrid method, and KNN with the LOOCV method served as a classifier for pattern identification in gene expression data. We compared our approach against Random Forest, shrunken centroids and nearest neighbor methods with variable selection that have been used for classification and feature selection of large-dimensional microarray data sets. Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The classification error rate obtained by the proposed method was the lowest for all of the ten gene expression data set problems tested. Six gene expression profile data sets even reached a classification error rate of zero. The proposed method can conceivably be used in other research projects that implement feature selection.

References

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., et al. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96, 6745–6750.
Battiti, R. (1994). Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5, 537–550.
Cawley, G. C., & Talbot, N. L. C. (2003). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36, 2585–2592.
Chang, T. C., Tsai, F. C., & Ke, J. H. (2006). Data mining and Taguchi method combination applied to the selection of discharge factors and the best interactive factor combination under multiple quality properties. The International Journal of Advanced Manufacturing Technology, 31, 164–174.
Chen, W.-C., Tai, P.-H., Wang, M.-W., Deng, W.-J., & Chen, C.-T. (2008). A neural network-based approach for dynamic quality prediction in a plastic injection molding process. Expert Systems with Applications, 35, 843–849.
Chuang, L.-Y., Chang, H.-W., Tu, C.-J., & Yang, C.-H. (2008). Improved binary PSO for feature selection using gene expression data. Computational Biology and Chemistry, 32, 29–38.
Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21–27.
Deb, K., & Raji Reddy, A. (2003). Reliable classification of two-class cancer data using evolutionary algorithms. Biosystems, 72, 111–129.
Diaz-Uriarte, R., & Alvarez de Andres, S. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7.
Fix, E., & Hodges, J. (1951). Discriminatory analysis. Nonparametric discrimination: Consistency properties. Technical report, USAF School of Aviation Medicine, Randolph Field, TX.
Frank, E., Hall, M., Trigg, L., Holmes, G., & Witten, I. H. (2004). Data mining in bioinformatics using Weka. Bioinformatics, 20, 2479–2481.
Gao, H., Zhang, Y., Liang, S., & Li, D. (2006). A new chaotic algorithm for image encryption. Chaos, Solitons & Fractals, 29, 393–399.
Ghosh, A. K. (2006). On optimum choice of k in nearest neighbor classification. Computational Statistics and Data Analysis, 50, 3113–3123.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157–1182.
Hall, M. A. (1999). Correlation-based feature subset selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato.
Huang, H.-L., Lee, C.-C., & Ho, S.-Y. (2007). Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers. Biosystems, 90, 78–86.
Huerta, E. B., Duval, B., & Hao, J. (2006). A hybrid GA/SVM approach for gene selection and classification of microarray data. Lecture Notes in Computer Science, 3907, 34–44.
Inza, I., Larranaga, P., Blanco, R., & Cerrolaza, A. J. (2004). Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine, 31, 91–103.
Kennedy, J., & Eberhart, R. C. (1995). Particle swarm optimization. In IEEE international conference on neural networks, Perth, WA (pp. 1942–1948).
Kennedy, J., & Eberhart, R. C. (1997). A discrete binary version of the particle swarm algorithm. In IEEE international conference on systems, man, and cybernetics, Orlando, FL (pp. 4104–4108).
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673–679.
Kim, K.-J., & Cho, S.-B. (2008). An evolutionary algorithm approach to optimal ensemble classifiers for DNA microarray data analysis. IEEE Transactions on Evolutionary Computation, 12, 377–388.
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.
Kuo, D. (2005). Chaos and its computing paradigm. IEEE Potentials, 24, 13–15.
Kwak, N., & Choi, C.-H. (2002). Input feature selection for classification problems. IEEE Transactions on Neural Networks, 13, 143–159.
Liu, X., Krishnan, A., & Mondry, A. (2005). An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics, 6.
Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic Publishers.
Loughrey, J., & Cunningham, P. (2005). Overfitting in wrapper-based feature subset selection: The harder you try the worse it gets. In Research and development in intelligent systems (Vol. XXI, pp. 33–43).
Narendra, P. M., & Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26, 917–922.
Oh, I.-S., Lee, J.-S., & Moon, B.-R. (2004). Hybrid genetic algorithms for feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 1424–1437.
Okun, O., & Priisalu, H. (2009). Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artificial Intelligence in Medicine, 45, 151–162.
Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436–442.
Pudil, P., Novovicov, J., & Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15, 1119–1125.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Ramaswamy, S., Ross, K. N., Lander, E. S., & Golub, T. R. (2002). A molecular signature of metastasis in primary solid tumors. Nature Genetics, 33, 49–54.
Raymer, M. L., Punch, W. F., Goodman, E. D., Kuhn, L. A., & Jain, A. K. (2000). Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation, 4, 164–171.
Reunanen, J., Guyon, I., & Elisseeff, A. (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research, 3, 1371–1382.
Rich, E., & Knight, K. (1991). Artificial intelligence (2nd ed.). New York: McGraw-Hill.
Ross, D. T., Scherf, U., Eisen, M. B., Perou, C. M., Rees, C., Spellman, P., et al. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24, 227–235.
Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517.
Schaffer, C. (1993). Overfitting avoidance as bias. Machine Learning, 10, 153–178.
Schuster, H. G., & Just, W. (2005). Deterministic chaos: An introduction (4th ed.). Weinheim: Wiley-VCH.
Shi, Y., & Eberhart, R. (1998). A modified particle swarm optimizer. In IEEE international conference on evolutionary computation, Anchorage, AK (pp. 69–73).
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203–209.
Sohn, S. Y., & Shin, H. W. (2007). Experimental study for the comparison of classifier combination methods. Pattern Recognition, 40, 33–40.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B (Methodological), 36, 111–147.
Taguchi, G., Chowdhury, S., & Taguchi, S. (2000). Robust engineering. New York: McGraw-Hill.
Tan, S. (2006). An effective refinement strategy for KNN text classifier. Expert Systems with Applications, 30, 290–298.
Tang, Y., Zhang, Y.-Q., Huang, Z., Hu, X., & Zhao, Y. (2008). Recursive fuzzy granulation for gene subsets extraction and cancer classification. IEEE Transactions on Information Technology in Biomedicine, 12, 723–730.
Trelea, I. C. (2003). The particle swarm optimization algorithm: Convergence analysis and parameter selection. Information Processing Letters, 85, 317–325.
Tsai, J.-T., Liu, T.-K., & Chou, J.-H. (2004). Hybrid Taguchi-genetic algorithm for global numerical optimization. IEEE Transactions on Evolutionary Computation, 8, 365–377.
Vaidya, P. M. (1989). An O(n log n) algorithm for the all-nearest-neighbors problem. Discrete and Computational Geometry, 4, 101–115.
van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
Wang, Y., Tetko, I. V., Hall, M. A., Frank, E., Facius, A., Mayer, K. F. X., et al. (2005). Gene selection from microarray data for cancer classification – A machine learning approach. Computational Biology and Chemistry, 29, 37–46.
Wang, X., Yang, J., Teng, X., Xia, W., & Jensen, R. (2007). Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28, 459–471.
Wolpert, D. H. (1993). On overfitting avoidance as bias. Santa Fe Institute, Technical Report SFI-TR-92-03-5001.
Wong, T.-T., & Hsu, C.-H. (2008). Two-stage classification methods for microarray data. Expert Systems with Applications, 34, 375–383.
Wu, Y., & Wu, A. (2000). Taguchi methods for robust design. New York: ASME.
Xiong, M., Fang, X., & Zhao, J. (2001). Biomarker identification by feature wrappers. Genome Research, 11, 1878–1887.
Yang, C.-S., Chuang, L.-Y., Li, J.-C., & Yang, C.-H. (2008). A novel BPSO approach for gene selection and classification of microarray data. In IEEE international joint conference on neural networks, Hong Kong (pp. 2147–2152).
Yang, C. H., Huang, C. C., Wu, K. C., & Chang, H. Y. (2008). A novel GA-Taguchi-based feature selection method. In Intelligent data engineering and automated learning, Daejeon, South Korea (pp. 112–119).
Zhang, H., & Sun, G. (2002). Feature selection using tabu search method. Pattern Recognition, 35, 701–711.
Zhu, Z., Ong, Y.-S., & Dash, M. (2007). Wrapper-filter feature selection algorithm using a memetic framework. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37, 70–76.


Fig. 3. Taguchi method effect in microarray data.
