Abstract—Single nucleotide polymorphisms (SNPs) hold much promise as a basis for disease-gene association. However, research is limited by the cost of genotyping the tremendous number of SNPs. Therefore, it is important to identify a small subset of informative SNPs, the so-called tag SNPs. This subset consists of selected SNPs of the genotypes, and accurately represents the rest of the SNPs. Furthermore, an effective evaluation method is needed to evaluate prediction accuracy of a set of tag SNPs. In this paper, a genetic algorithm (GA) is applied to tag SNP problems, and the K-nearest neighbor (K-NN) serves as a prediction method of tag SNP selection. The experimental data used was taken from the HapMap project; it consists of genotype data rather than haplotype data. The proposed method consistently identified tag SNPs with considerably better prediction accuracy than methods from the literature. At the same time, the number of tag SNPs identified was smaller than the number of tag SNPs in the other methods. The run time of the proposed method was much shorter than the run time of the SVM/STSA method when the same accuracy was reached.
Keywords—Genetic Algorithm (GA), Genotype, Single nucleotide polymorphism (SNP), tag SNPs.
I. INTRODUCTION
Single nucleotide polymorphisms (SNPs) are the most common variants amongst species. The number of identified SNPs is very high and is currently estimated to be about 10 million [1]. With genome-wide SNP discovery, genome-wide association (GWA) studies are likely to identify multiple genetic variants that are associated with complex diseases [2], [3]. However, genotyping all existing SNPs for a large number of samples remains a challenge. Therefore, it is essential to select informative SNPs that represent the original SNP distribution in the genome (tag SNP selection) for genome-wide association studies. These SNPs are usually chosen from haplotype data and are thus called haplotype tag SNPs (htSNPs). Accordingly, the scale and cost of genotyping can be significantly decreased. Recently, some hybrid algorithms, such as HAPLO-IHP [4] and ISHAPE [5], have been developed that are capable of improving the performance of haplotyping.
Li-Yeh Chuang is with the Department of Chemical Engineering, I-Shou University, Kaohsiung, Taiwan (e-mail: chuang@isu.edu.tw).
Yu-Jen Hou is with the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan (e-mail: 1096305143@cc.kuas.edu.tw).
Cheng-Hong Yang is with the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan (e-mail: chyang@cc.kuas.edu.tw).
Many algorithms have been developed to select the most informative tag SNPs. Tag SNP selection can follow two different strategies: block-based and block-free methods. A block-based method relies on the haplotype block structure of the human genome. The rationale is that the human genome can be partitioned into discrete blocks [6] and that most of the population shares a very small subset of common haplotypes within each block. Haplotype diversity is limited and conserved in the haplotype blocks of the whole genome [7], [8]. Many algorithms first partition genomes into haplotype blocks [8]–[11] and then select the tag SNP subset within each block. This approach focuses on finding a set of tag SNPs that distinguishes all the common haplotypes [6], [12]. The main problem with the block-based method is that the definition of the blocks is not always straightforward, and there is no consensus on how the blocks should be formed. Moreover, tag SNP selection based only on the local correlations between markers within each block ignores inter-block correlations [13]. Numerous block-free methods are also available [13]–[18].
In a block-free method, the tag SNPs are regarded as a subset of all SNPs from which the remaining SNPs can be reconstructed with minimal error [14], [15]. Block-free methods do not assume prior block partitioning or limit the diversity of haplotypes. Block-free tagging SNP methods are based on weak correlations that occur across nearby blocks [15]. They make use of the proximity of potentially predictive SNPs and are less limiting than methods involving a rigid notion of haplotype blocks. A natural measure for evaluating the prediction accuracy of a set of tag SNPs was developed for these methods [16]. The same study introduced an algorithm called STAMPA (selection of tag SNPs to maximize prediction accuracy) to find a minimum set of tag SNPs and minimize their prediction error. Dynamic programming was applied in STAMPA to select tag SNPs and maximize prediction accuracy. STAMPA was found to provide higher prediction accuracy than ldSelect [19] and HapBlock [20] when tested on a variety of data sets [16]. He and Zelikovsky have introduced two approaches for informative SNP prediction based on multiple linear regression (MLR-tagging) [21] and support vector machines (SVM/STSA) [22]. These prediction algorithms were combined with a stepwise tag selection algorithm (STSA) to select a tag SNP set of minimal size. In a direct comparison of MLR-tagging and SVM/STSA, SVM/STSA proved more effective than MLR-tagging, but also more time-consuming.
A Novel Prediction Method for Tag SNP Selection using Genetic Algorithm based on KNN
Li-Yeh Chuang, Yu-Jen Hou, Jr., and Cheng-Hong Yang
World Academy of Science, Engineering and Technology Vol:3 2009-05-28
Yet another approach to tag SNP selection is based on the correlation between each pair of SNPs, such as linkage disequilibrium (LD). LD describes the correlation between genotypes at a pair of polymorphic sites and is usually higher when the two SNPs are closer together. Two statistics are used to describe LD, D′ and $r^2$ [26]. $r^2$ is most frequently used for pairwise SNP correlation because it is directly related to the statistical power to detect disease associations [19]. Some studies try to identify a minimum set of LD bins among existing SNPs with high LD ($r^2 \geq 0.8$) [19]. To do this, SNPs are partitioned into different bins according to their pairwise correlation [19], [23]–[25]. SNPs within a bin are denoted tag SNPs, and only one tag is genotyped per bin. However, the disadvantage of this method is that it cannot exclude the possibility that SNPs with low LD also enhance prediction accuracy.
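To make the two LD statistics concrete, the following Python sketch computes D, D′ and $r^2$ for a single pair of bi-allelic SNPs from phased haplotypes coded as 0/1. The function name `ld_stats` and the toy data are assumptions made for illustration only; they are not taken from the methods cited here.

```python
def ld_stats(snp_a, snp_b):
    """Return (D, D_prime, r2) for two bi-allelic SNPs.

    snp_a, snp_b: equal-length sequences of 0/1 alleles, one entry per
    phased haplotype (hypothetical input format for this sketch).
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("inputs must be non-empty and of equal length")
    p_a = sum(snp_a) / n                                  # freq. of allele 1 at SNP A
    p_b = sum(snp_b) / n                                  # freq. of allele 1 at SNP B
    p_ab = sum(a & b for a, b in zip(snp_a, snp_b)) / n   # freq. of haplotype 1-1
    d = p_ab - p_a * p_b                                  # raw disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom
    # D' normalizes |D| by its maximum attainable value given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r2


# Two perfectly correlated SNPs observed on four haplotypes
print(ld_stats([0, 0, 1, 1], [0, 0, 1, 1]))
```

Under a binning scheme with a threshold of 0.8, a pair like the one above would fall into the same bin, whereas a pair with $r^2$ near zero would not.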
In this paper, a genetic algorithm (GA) is applied to the tag SNP problem, and the K-nearest neighbor (K-NN) method serves as an evaluator of the GA; it is used to evaluate the prediction accuracy of a set of tag SNPs. GAs are randomized search and optimization techniques that derive their working principles from natural genetics; they have been successfully applied to a variety of optimization problems. The results of our study were compared to state-of-the-art studies and indicate that the proposed method can effectively select a minimum number of tag SNPs with higher prediction accuracy.
II. PROBLEM FORMULATION
In a haplotype sequence, SNPs are generally bi-allelic, meaning that a single SNP has only two alleles: a major allele and a minor allele. The allele information is formed by a sequence of bases from {A, T, C, G}. For bi-allelic SNPs, each haplotype can be represented by a binary string in which 0 represents the major allele and 1 represents the minor allele. Thus, a haplotype h with m SNPs can be written as $h = \{h_1, h_2, \ldots, h_m\}$, $h_i \in \{0, 1\}$, where

$$h_i = \begin{cases} 0, & \text{the allele of the } i\text{th SNP is major} \\ 1, & \text{the allele of the } i\text{th SNP is minor} \end{cases} \tag{1}$$

In a genotype sequence, the allele information is formed by {A/A, A/T, A/C, A/G, …, G/C, G/T}. A genotype g with m SNPs can be represented by $g = \{g_1, g_2, \ldots, g_m\}$, $g_i \in \{0, 1, 2\}$. We use 0 and 1 to represent the homozygous types ({0,0} or {1,1}) and 2 to represent the heterozygous types ({0,1} or {1,0}):

$$g_i = \begin{cases} 0, & \text{the two alleles of the } i\text{th SNP are major homozygous} \\ 1, & \text{the two alleles of the } i\text{th SNP are minor homozygous} \\ 2, & \text{the two alleles of the } i\text{th SNP are heterozygous} \end{cases} \tag{2}$$
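The genotype coding of (2) can be illustrated with a short Python sketch that converts one SNP column of genotype strings such as "A/G" into the values 0, 1, and 2. The function name `encode_genotypes`, the "X/Y" input format, and the rule that the major allele is the most frequent allele in the column are assumptions of this example.

```python
from collections import Counter


def encode_genotypes(genotypes):
    """Encode one SNP column of genotypes such as "A/G" as 0, 1, or 2.

    0: homozygous for the major allele
    1: homozygous for the minor allele
    2: heterozygous
    Assumption of this sketch: the major allele is the most frequent
    allele observed in the column.
    """
    pairs = [tuple(g.split("/")) for g in genotypes]
    counts = Counter(allele for pair in pairs for allele in pair)
    major = counts.most_common(1)[0][0]
    codes = []
    for first, second in pairs:
        if first != second:
            codes.append(2)      # heterozygous
        elif first == major:
            codes.append(0)      # major homozygous
        else:
            codes.append(1)      # minor homozygous
    return codes


# One SNP observed in four individuals; "A" is the major allele here
print(encode_genotypes(["A/A", "A/G", "G/G", "A/A"]))  # [0, 2, 1, 0]
```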
Given a sample S of a population P of genotype (or haplotype) individuals on m SNPs, our goal is to find a minimum set of tag SNPs $T = \{t_1, t_2, \ldots, t_k\}$, where k represents the number of tag SNPs (k < m). T consists of SNPs selected from the genotypes and should predict the remaining unselected SNPs with minimum error. The two major processes involved are the tag selection algorithm and the SNP prediction algorithm.
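The formulation above leaves the SNP prediction algorithm abstract. As a rough sketch of how a candidate tag set could be scored, the Python example below estimates prediction accuracy with a leave-one-out K-NN vote: each individual's non-tag SNPs are predicted from the k other individuals that are closest on the tag SNPs. The Hamming distance, the majority vote, the leave-one-out scheme, the default k = 3, and the function name `knn_prediction_accuracy` are all assumptions of this example and may differ from the procedure used in the experiments.

```python
from collections import Counter


def knn_prediction_accuracy(genotypes, tags, k=3):
    """Leave-one-out accuracy of predicting non-tag SNPs from tag SNPs.

    genotypes: list of equal-length lists with values in {0, 1, 2}.
    tags: 0-based indices of the selected tag SNPs.
    Assumptions of this sketch: Hamming distance on the tag SNPs and a
    majority vote among the k nearest individuals.
    """
    if len(genotypes) < 2:
        raise ValueError("need at least two individuals")
    m = len(genotypes[0])
    tag_set = set(tags)
    non_tags = [j for j in range(m) if j not in tag_set]
    if not non_tags:
        return 1.0  # every SNP is a tag, so nothing has to be predicted
    correct = 0
    total = 0
    for i, target in enumerate(genotypes):
        others = [g for idx, g in enumerate(genotypes) if idx != i]
        # Rank the other individuals by Hamming distance on the tag SNPs
        others.sort(key=lambda g: sum(g[t] != target[t] for t in tags))
        neighbors = others[:k]
        for j in non_tags:
            votes = Counter(g[j] for g in neighbors)
            predicted = votes.most_common(1)[0][0]
            correct += int(predicted == target[j])
            total += 1
    return correct / total


# Toy sample: SNP 1 always equals SNP 0, so SNP 0 is a perfect tag
sample = [[0, 0], [0, 0], [1, 1], [1, 1], [2, 2], [2, 2]]
print(knn_prediction_accuracy(sample, tags=[0], k=1))  # 1.0
```

A score of this kind can then be traded off against the number of tags k when comparing candidate tag sets.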
III. METHODS FOR TAG SNP SELECTION
The purpose of tag SNP selection is to find a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the genome sequence. In this paper, a GA was applied to the tag SNP selection problem, and the K-nearest neighbor (K-NN) method served as an evaluator of the GA.
A. Genetic Algorithm (GA)
The concept underlying genetic algorithms (GAs) was proposed by Alan Turing in 1950 and further developed by John Holland in the 1970s [26]. The main components of the GA used in our study are the encoding scheme, population initialization, fitness evaluation, selection, the crossover operator, the mutation operator, and chromosome amendment. The flowchart of the proposed method is shown in Figure 1. The components are explained in detail below.
Fig. 1 Flowchart of the proposed method
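The components listed above can be read as one loop. The following Python sketch shows a generic binary GA with population initialization, fitness evaluation, selection, crossover, and mutation. Tournament selection, one-point crossover, bit-flip mutation, the toy fitness function, and all parameter values are assumptions chosen for illustration; the chromosome amendment step is omitted because its details are not given in this section.

```python
import random


def run_ga(fitness, m, pop_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Generic binary GA sketch; returns the best chromosome found.

    All operators and default parameters are illustrative assumptions,
    not the settings of the proposed method.
    """
    rng = random.Random(seed)
    # Population initialization: random bit strings of length m
    pop = [[rng.randint(0, 1) for _ in range(m)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament():
        # Binary tournament selection (assumed selection scheme)
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if m > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, m - 1)           # one-point crossover
                c1 = p1[:cut] + p2[cut:]
                c2 = p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for child in (c1, c2):
                for j in range(m):
                    if rng.random() < mutation_rate:  # bit-flip mutation
                        child[j] = 1 - child[j]
                children.append(child)
        pop = children[:pop_size]
        best = max(pop + [best], key=fitness)         # keep the best seen so far
    return best


# Toy fitness (hypothetical): prefer few selected SNPs, but at least one
def toy_fitness(chromosome):
    return -sum(chromosome) if any(chromosome) else -(len(chromosome) + 1)


print(run_ga(toy_fitness, m=8))
```

In the setting of this paper, the toy fitness would be replaced by a score that reflects the prediction accuracy of the tag SNPs encoded by the chromosome.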
B. Encoding schemes
Fundamental to the GA's structure is the encoding scheme. In this paper, a binary encoding is used in which each chromosome corresponds to a candidate solution of the tag SNP selection problem, as shown in Figure 2. The population consists of p chromosomes, each containing m SNPs (the dimension). Each chromosome of length m is a sequence in $\{0, 1\}^m$, where 0 represents a non-selected SNP and 1 represents a selected SNP. The binary encoding can be described by $C_i = \{c_{i1}, c_{i2}, \ldots, c_{im}\}$ with $c_{ij} \in \{0, 1\}$, $i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, m$, where p represents the population size. $c_{ij} = 1$ means that the jth SNP on the ith chromosome is selected. For example, for a chromosome $C_i = \{1, 0, 1, 0, 0, 1, 0\}$, SNP1, SNP3, and SNP6 are selected as tag SNPs.
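The decoding of a chromosome into a tag SNP set can be sketched in a few lines of Python. The helper name `selected_snps` is hypothetical; it simply lists the 1-based positions whose bit is 1, reproducing the example above.

```python
def selected_snps(chromosome):
    """Return the 1-based positions of the SNPs marked 1 in a binary chromosome."""
    return [j for j, bit in enumerate(chromosome, start=1) if bit == 1]


# The example chromosome from the text selects SNP1, SNP3, and SNP6
print(selected_snps([1, 0, 1, 0, 0, 1, 0]))  # [1, 3, 6]
```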
ACKNOWLEDGMENT
This work is partly supported by the National Science Council in Taiwan under grants NSC96-2622-E-151-019-CC3, NSC96-2221-E-214-050-MY3, and NSC95-2221-E-214-087.
REFERENCES
[1] D. Brinza and A. Zelikovsky, "2SNP: scalable phasing based on 2-SNP haplotypes," Bioinformatics, vol. 22, pp. 371-3, Feb 1 2006.
[2] S. Buch, C. Schafmayer, H. Volzke, C. Becker, A. Franke, H. von Eller-Eberstein, C. Kluck, I. Bassmann, M. Brosch, F. Lammert, J. F. Miquel, F. Nervi, M. Wittig, D. Rosskopf, B. Timm, C. Holl, M. Seeger, A. ElSharawy, T. Lu, J. Egberts, F. Fandrich, U. R. Folsch, M. Krawczak, S. Schreiber, P. Nurnberg, J. Tepel, and J. Hampe, "A genome-wide association scan identifies the hepatic cholesterol transporter ABCG8 as a susceptibility factor for human gallstone disease," Nat Genet, vol. 39, pp. 995-9, Aug 2007.
[3] B. W. Zanke, C. M. Greenwood, J. Rangrej, R. Kustra, A. Tenesa, S. M. Farrington, J. Prendergast, S. Olschwang, T. Chiang, E. Crowdy, V. Ferretti, P. Laflamme, S. Sundararajan, S. Roumy, J. F. Olivier, F. Robidoux, R. Sladek, A. Montpetit, P. Campbell, S. Bezieau, A. M. O'Shea, G. Zogopoulos, M. Cotterchio, P. Newcomb, J. McLaughlin, B. Younghusband, R. Green, J. Green, M. E. Porteous, H. Campbell, H. Blanche, M. Sahbatou, E. Tubacher, C. Bonaiti-Pellie, B. Buecher, E. Riboli, S. Kury, S. J. Chanock, J. Potter, G. Thomas, S. Gallinger, T. J. Hudson, and M. G. Dunlop, "Genome-wide association scan identifies a colorectal cancer susceptibility locus on chromosome 8q24," Nat Genet, vol. 39, pp. 989-94, Aug 2007.
[4] Y. J. Yoo, J. Tang, R. A. Kaslow, and K. Zhang, "Haplotype inference for present absent genotype data using previously identified haplotypes and haplotype patterns." vol. 23: Oxford Univ Press, 2007, p. 2399.
[5] O. Delaneau, C. Coulonges, P. Y. Boelle, G. Nelson, J. L. Spadoni, and J. F. Zagury, "ISHAPE: new rapid and accurate software for haplotyping," BMC Bioinformatics, vol. 8, p. 205, 2007.
[6] S. B. Gabriel, S. F. Schaffner, H. Nguyen, J. M. Moore, J. Roy, B. Blumenstiel, J. Higgins, M. DeFelice, A. Lochner, M. Faggart, S. N. Liu-Cordero, C. Rotimi, A. Adeyemo, R. Cooper, R. Ward, E. S. Lander, M. J. Daly, and D. Altshuler, "The structure of haplotype blocks in the human genome," Science, vol. 296, pp. 2225-9, Jun 21 2002.
[7] M. J. Daly, J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander, "High-resolution haplotype structure in the human genome," Nat Genet, vol. 29, pp. 229-32, Oct 2001.
[8] N. Patil, A. J. Berno, D. A. Hinds, W. A. Barrett, J. M. Doshi, C. R. Hacker, C. R. Kautzer, D. H. Lee, C. Marjoribanks, D. P. McDonough, B. T. Nguyen, M. C. Norris, J. B. Sheehan, N. Shen, D. Stern, R. P. Stokowski, D. J. Thomas, M. O. Trulson, K. R. Vyas, K. A. Frazer, S. P. Fodor, and D. R. Cox, "Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21," Science, vol. 294, pp. 1719-23, Nov 23 2001.
[9] X. Ke and L. R. Cardon, "Efficient selective screening of haplotype tag SNPs," Bioinformatics, vol. 19, pp. 287-8, Jan 22 2003.
[10] K. Zhang, M. Deng, T. Chen, M. S. Waterman, and F. Sun, "A dynamic programming algorithm for haplotype block partitioning," Proc Natl Acad Sci U S A, vol. 99, pp. 7335-9, May 28 2002.
[11] K. Zhang and L. Jin, "HaploBlockFinder: haplotype block analyses," Bioinformatics, vol. 19, pp. 1300-1, Jul 1 2003.
[12] G. C. Johnson, L. Esposito, B. J. Barratt, A. N. Smith, J. Heward, G. Di Genova, H. Ueda, H. J. Cordell, I. A. Eaves, F. Dudbridge, R. C. Twells, F. Payne, W. Hughes, S. Nutland, H. Stevens, P. Carr, E. Tuomilehto-Wolf, J. Tuomilehto, S. C. Gough, D. G. Clayton, and J. A. Todd, "Haplotype tagging for the identification of common disease genes," Nat Genet, vol. 29, pp. 233-7, Oct 2001.
[13] T. U. M. Phuong, Z. Lin, and R. B. Altman, "CHOOSING SNPs USING FEATURE SELECTION." vol. 4, 2006, pp. 241-257.
[14] V. Bafna, B. V. Halldorsson, R. Schwartz, A. G. Clark, and S. Istrail, "Haplotypes and informative SNP selection algorithms: don't block out information," ACM New York, NY, USA, 2003, pp. 19-27.
[15] B. V. Halldorsson, V. Bafna, R. Lippert, R. Schwartz, F. M. De La Vega, A. G. Clark, and S. Istrail, "Optimal haplotype block-free selection of tagging SNPs for genome-wide association studies," Genome Res, vol. 14, pp. 1633-40, Aug 2004.
[16] E. Halperin, G. Kimmel, and R. Shamir, "Tag SNP selection in genotype data for maximizing SNP prediction accuracy," Bioinformatics, vol. 21 Suppl 1, pp. i195-203, Jun 2005.
[17] P. H. Lee and H. Shatkay, "BNTagger: improved tagging SNP selection using Bayesian networks," Bioinformatics, vol. 22, pp. e211-9, Jul 15 2006.
[18] Z. Liu, S. Lin, and M. Tan, "Genome-wide tagging SNPs with entropy-based Monte Carlo method," J Comput Biol, vol. 13, pp. 1606-14, Nov 2006.
[19] C. S. Carlson, M. A. Eberle, M. J. Rieder, Q. Yi, L. Kruglyak, and D. A. Nickerson, "Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium," Am J Hum Genet, vol. 74, pp. 106-20, Jan 2004.
[20] K. Zhang, Z. Qin, T. Chen, J. S. Liu, M. S. Waterman, and F. Sun, "HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms," Bioinformatics, vol. 21, pp. 131-4, Jan 1 2005.
[21] J. He and A. Zelikovsky, "MLR-tagging: informative SNP selection for unphased genotypes based on multiple linear regression," Bioinformatics, vol. 22, pp. 2558-61, Oct 15 2006.
[22] J. He and A. Zelikovsky, "Informative SNP selection methods based on SNP prediction," IEEE Trans Nanobioscience, vol. 6, pp. 60-7, Mar 2007.
[23] K. Zhang, T. Chen, M. S. Waterman, and F. Sun, "A Set of Dynamic Programming Algorithms for Haplotype Block Partitioning and Tag SNP Selection via Haplotype Data or Genotype Data," pp. 1–26.
[24] B. Devlin and N. Risch, "A comparison of linkage disequilibrium measures for fine-scale mapping," Genomics, vol. 29, pp. 311-22, Sep 20 1995.
[25] H. I. Avi-Itzhak, X. Su, and F. M. De La Vega, "Selection of minimum subsets of single nucleotide polymorphisms to capture haplotype block diversity," Pac Symp Biocomput, pp. 466-77, 2003.
[26] J. H. Holland, Adaptation in natural and artificial systems: MIT Press Cambridge, MA, USA, 1992.
[27] E. Fix and J. Hodges, "Discriminatory Analysis-Nonparametric Discrimination: Consistency Properties," Storming Media, 1951.
[28] D. E. Goldberg and K. Deb, "A comparative analysis of selection schemes used in genetic algorithms." vol. 1, 1991, pp. 69-93.
[29] G. A. Thorisson, A. V. Smith, L. Krishnan, and L. D. Stein, "The International HapMap Project Web site," Genome Res, vol. 15, pp. 1592-3, Nov 2005.
[30] J. He, K. Westbrooks, and A. Zelikovsky, "Linear reduction method for predictive and informative tag SNP selection," Int J Bioinform Res Appl, vol. 1, pp. 249-60, 2005.