Novel generating protective single nucleotide polymorphism barcode for breast
cancer using particle swarm optimization
Cheng-Hong Yang a, Hsueh-Wei Chang b,c,d,*, Yu-Huei Cheng a, Li-Yeh Chuang e,**
a Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
b Department of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Kaohsiung, Taiwan
c Graduate Institute of Natural Products, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan
d Center of Excellence for Environmental Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
e Department of Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
1. Introduction
Single-nucleotide polymorphisms (SNPs) are the most abundant DNA variations in the human genome [1]. SNPs are widely applied in association studies [2–4] and have become prime candidates for the field of personalized medicine. Improvements in high-throughput SNP genotyping methods, such as SNP arrays [5–7], also generate huge quantities of SNP genotype data. The International HapMap Project (http://www.hapmap.org) [8] was developed to provide representative tagSNPs so that less informative SNPs can be ruled out. Both SNP arrays and HapMap therefore provide large and effective SNP data sets for association studies; however, the evaluation of SNP–SNP interactions is less often addressed. Because most complex diseases are influenced by polygenic interactions, many SNPs from different genes should be considered simultaneously. Accordingly, many studies have focused on the combined effect of multiple SNPs on the risk of cancers and other diseases [9–17]. However, association studies of multiple SNP candidates remain computationally challenging.
New computational methodologies for solving this multiple SNP interaction problem are required. In this study, we introduce discrete binary particle swarm optimization (BPSO) [18] to deal with the optimization problems associated with SNP–SNP interactions in the association study of breast cancer. BPSO is a randomized search and optimization technique that can find patterns that discriminate between groups. The “SNP barcode” profile generated by BPSO can separate cases from controls (e.g., breast cancer vs. control). “SNP barcodes” are combined SNPs with their corresponding genotypes. Each genotype has three possible SNP
Cancer Epidemiology 33 (2009) 147–154
ARTICLE INFO
Article history: Accepted 1 July 2009
Keywords: Single nucleotide polymorphism; Odds ratio; Binary particle swarm optimization; SNP interactions
ABSTRACT
Background: High-throughput single nucleotide polymorphism (SNP) genotyping generates a huge amount of SNP data in genome-wide association studies. Simultaneous analyses for multiple SNP interactions associated with many diseases and cancers are essential; however, these analyses are still computationally challenging. Methods: In this study, we propose an odds ratio-based binary particle swarm optimization (OR-BPSO) method to evaluate the risk of breast cancer. Results: BPSO provides the combinational SNPs with their corresponding genotype, called SNP barcodes, with the maximal difference of occurrence between the control and breast cancer groups. A specific SNP barcode with an optimized fitness value was identified among seven SNP combinations within the space of one minute. The identified SNP barcodes with the best performance between control and breast cancer groups were found to be control-dominant, suggesting that these SNP barcodes may prove protective against breast cancer. After statistical analysis, these control-dominant SNP barcodes were processed for odds ratio analysis for quantitative measurement with regard to the risk of breast cancer. Conclusion: This study proposes an effective high-speed method to analyze the SNP–SNP interactions for breast cancer association study.
© 2009 Elsevier Ltd. All rights reserved.
* Corresponding author at: Department of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Kaohsiung, Taiwan.
Tel.: +886 7 312 1101 ext. 2691; fax: +886 7 312 5339.
** Corresponding author. Tel.: +886 7 657 7711; fax: +886 7 383 6844. E-mail addresses: [email protected] (C.-H. Yang), [email protected] (H.-W. Chang), [email protected] (Y.-H. Cheng), [email protected] (L.-Y. Chuang).
Cancer Epidemiology
The International Journal of Cancer Epidemiology, Detection, and Prevention
journal homepage: www.elsevier.com/locate/canep
1877-7821/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.canep.2009.07.001
combinations: AA, AC, and CC for an SNP with an A/C polymorphism. BPSO can identify the SNP barcodes with the maximal difference in occurrence between the control and breast cancer groups. When an SNP barcode occurs more often in cases than in controls, it is regarded as a “risk” factor. In contrast, when an SNP barcode occurs more often in controls than in cases, it is regarded as a “protective” factor. The risk and protective roles of BPSO-generated SNP barcodes are evaluated according to these criteria. Subsequently, the odds ratio of each SNP barcode is calculated to quantify the risk of breast cancer. We therefore propose the odds ratio-based BPSO (OR-BPSO) method to generate SNP barcodes for statistically predicting breast cancer susceptibility.
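The risk/protective criterion described above can be illustrated with a minimal Python sketch. It assumes that occurrence is compared as a relative frequency within each group; the genotype codes (1, 2, 3), the `Barcode` type, the function names, and the toy samples are hypothetical and are not taken from the study data.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical encoding: a barcode maps an SNP index to a genotype code (1, 2 or 3).
Barcode = Dict[int, int]


def count_matches(samples: Sequence[Sequence[int]], barcode: Barcode) -> int:
    """Count the samples whose genotypes match the barcode at every selected SNP."""
    return sum(all(s[i] == g for i, g in barcode.items()) for s in samples)


def classify_barcode(cases: Sequence[Sequence[int]],
                     controls: Sequence[Sequence[int]],
                     barcode: Barcode) -> Tuple[str, float]:
    """Return the barcode's role and the control-minus-case frequency difference."""
    case_freq = count_matches(cases, barcode) / len(cases)
    control_freq = count_matches(controls, barcode) / len(controls)
    diff = control_freq - case_freq
    if diff > 0:
        role = "protective"   # more frequent in controls
    elif diff < 0:
        role = "risk"         # more frequent in cases
    else:
        role = "neutral"
    return role, diff


# Toy data (invented): each sample lists genotype codes for two SNPs.
toy_cases: List[List[int]] = [[1, 3], [2, 3], [1, 1], [2, 2]]
toy_controls: List[List[int]] = [[1, 3], [1, 3], [1, 3], [2, 1]]
print(classify_barcode(toy_cases, toy_controls, {0: 1, 1: 3}))
```

A search procedure such as BPSO would then look for the barcode that maximizes the absolute value of this difference.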
2. Method
In this paper, we introduce the OR-BPSO method to generate the best SNP barcodes, i.e., combinational SNPs with their corresponding genotypes, to predict breast cancer susceptibility in women. The procedure has two stages. Stage 1 is the BPSO method. We retrieved seven SNP data sets from our previous study [19] that are linked to breast cancer, drawn from several hundred samples. BPSO converges quickly and can find optimal solutions in a short time within a wide solution space, so we can search for the SNP barcode combinations with the highest correlation and the highest proportion of occurrence. In stage 2, the odds ratio for the best combination of SNP barcodes is used as a quantitative evaluation of breast cancer risk.
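As a rough illustration of stage 2, the following Python sketch computes an odds ratio and an approximate 95% confidence interval from a 2×2 table. It assumes the common log-based (Woolf) interval; the excerpt does not state which interval method the authors used, and the function name and argument names are hypothetical. The example counts are taken from the two-SNP row of Table 3.

```python
import math
from typing import Tuple


def odds_ratio_ci(barcode_cases: int, barcode_total: int,
                  ref_cases: int, ref_total: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio of the barcode group vs. the reference group, with a Woolf-type CI.

    Assumption: the interval is exp(ln(OR) +/- z * SE), where
    SE = sqrt(1/a + 1/b + 1/c + 1/d).
    """
    a = barcode_cases                  # cases carrying the barcode
    b = barcode_total - barcode_cases  # controls carrying the barcode
    c = ref_cases                      # cases in the reference ("Other") group
    d = ref_total - ref_cases          # controls in the reference group
    if 0 in (a, b, c, d):
        # With an empty cell the odds ratio cannot be estimated this way.
        raise ValueError("odds ratio is not estimable with an empty cell")
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Two-SNP row of Table 3: barcode group 206 subjects (69 cases),
# reference group 348 subjects (151 cases).
or_value, ci_low, ci_high = odds_ratio_ci(69, 206, 151, 348)
print(f"OR = {or_value:.2f}, 95% CI = {ci_low:.2f}-{ci_high:.2f}")
```

Under these assumptions the result is approximately 0.66 (0.46–0.94), in line with the values reported for that row. A group with no cases, such as the six-SNP barcode in Table 3, raises an error here, which corresponds to the "not estimable" entry.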
2.1. Binary particle swarm optimization
Particle swarm optimization (PSO) is a population-based stochastic optimization technique, developed by Kennedy and Eberhart in 1995 [18], which was inspired by the social behavior of birds in a flock or fish in a school, to describe an automatically evolving system. In PSO, each single candidate solution can be considered “an individual bird of the flock”, that is, a particle in the search space. Each particle makes use of its own memory and of knowledge gained by the swarm as a whole to find the best (optimal) solution. All of the particles have fitness values, which are evaluated by the fitness function to be optimized, as well as velocities that direct their movement. During movement, each particle adjusts its position according to its own experience and according to the experience of a neighboring particle, thus making use of the best position encountered by itself and its neighbor. The particles move through the problem space by following the current optimum particles. The process is then iterated a predefined number of times or until a minimum error rate is achieved [20].
PSO was originally introduced as an optimization technique for real-number spaces. It has been employed successfully in many application areas and has been reported to obtain good results faster and more cheaply than other methods. Although a comprehensive survey of PSO algorithms and their applications is available (www.particleswarm.info/), many optimization problems are set in a space featuring discrete, qualitative distinctions between variables and between levels of variables. Kennedy and Eberhart introduced discrete binary PSO (BPSO), which can be applied to discrete binary variables. In a binary space, a particle may move to nearer or farther corners of a hypercube by flipping various numbers of bits; thus, the overall particle velocity may be described by the number of bits changed per iteration [20]. In this paper, BPSO is chosen because the position of each particle can be given in binary string form (0 or 1), which adequately reflects the straightforward yes/no choice as to whether or not a feature needs to be selected. The changes in particle velocity and trajectory can be interpreted as a change in the probability of finding the particle in one state or another; since this quantity is a probability, it is limited to the range [0.0, 1.0].
Based on the principles of PSO, we set the required particle number first, and then the initial coding string for each particle is generally created in such a way that the population of the particles is distributed randomly over the search space. In this study, we coded each particle to imitate an SNP barcode in a BPSO. Each particle was coded to a binary string in which the bit value {1} or {0} represents a selected or non-selected feature, respectively.
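The random initialization described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each bit is drawn independently with equal probability, and the function name, swarm size, and string length are hypothetical. The excerpt does not specify how bits map to SNPs and genotypes, so no such mapping is implied here.

```python
import random
from typing import List, Optional


def init_swarm(n_particles: int, n_bits: int,
               seed: Optional[int] = None) -> List[List[int]]:
    """Create n_particles random binary strings of length n_bits.

    A bit value of 1 marks a selected feature and 0 a non-selected feature.
    """
    rng = random.Random(seed)  # local generator so runs can be reproduced
    return [[rng.randint(0, 1) for _ in range(n_bits)]
            for _ in range(n_particles)]


# Illustrative sizes only; they are not the settings used in the study.
swarm = init_swarm(n_particles=5, n_bits=7, seed=1)
for particle in swarm:
    print("".join(str(bit) for bit in particle))
```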
At every iteration, each particle’s trajectory is updated using two “best” values, called pbest and gbest. Each particle keeps track of the coordinates associated with the best solution (fitness) it has achieved so far; this value is stored and called pbest. When a particle takes the whole population as its topological neighbors, the best solution is a global “best” called gbest. Once pbest and gbest are obtained, the features of the pbest and gbest particles can be tracked with regard to their position and velocity. Each particle is updated according to the following equations.
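$$v_{id}^{new} = w \times v_{id}^{old} + c_1 \times rand_1 \times (pbest_{id} - x_{id}^{old}) + c_2 \times rand_2 \times (gbest_{i} - x_{id}^{old}) \tag{1}$$

$$\text{if } v_{id}^{new} \notin (V_{min}, V_{max}) \text{ then } v_{id}^{new} = \max(\min(V_{max}, v_{id}^{new}), V_{min}) \tag{2}$$

$$S(v_{id}^{new}) = \frac{1}{1 + e^{-v_{id}^{new}}} \tag{3}$$

$$\text{if } (S(v_{id}^{new}) > rand) \text{ then } x_{id}^{new} = 1; \text{ else } x_{id}^{new} = 0 \tag{4}$$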
where $w$ is the inertia weight, $c_1$ and $c_2$ are acceleration (learning) factors, and $rand$, $rand_1$ and $rand_2$ are random numbers in the interval (0, 1). $v_{id}^{new}$ and $v_{id}^{old}$ are the velocities of the updated particle and of the particle before the update, respectively; $x_{id}^{old}$ is the original particle position (solution), and $x_{id}^{new}$ is the updated particle position (solution). In Eq. (2), the particle velocity in each dimension is limited to the range between $V_{min}$ and $V_{max}$. If the sum of accelerations causes the velocity in a dimension to exceed $V_{max}$, the velocity in that dimension is set to $V_{max}$; likewise, it is set to $V_{min}$ if it falls below $V_{min}$. $V_{max}$ and $V_{min}$ are user-specified parameters (in our case $V_{max} = 6$, $V_{min} = -6$).
After updating, the features are calculated by the function $S(v_{id}^{new})$ (Eq. (3)) [20], in which $v_{id}^{new}$ is the updated velocity value. If $S(v_{id}^{new})$ is larger than a randomly generated number between 0.0 and 1.0, then the position value $x_{id}^{new}$ is set to 1 (meaning this feature is selected as a required feature for the next update). Otherwise, the position value $x_{id}^{new}$ is set to 0 (meaning this feature is not selected as a required feature for the next update) [20].
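The update rules in Eqs. (1)–(4) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, gbest is indexed per dimension, and the values of w, c1, and c2 in the usage lines are placeholders because the excerpt does not report them. The velocity bounds of −6 and 6 follow the text.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(v: float) -> float:
    """Eq. (3): map a velocity to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(x: Sequence[int], v: Sequence[float],
                    pbest: Sequence[int], gbest: Sequence[int],
                    w: float, c1: float, c2: float,
                    rng: random.Random,
                    v_min: float = -6.0,
                    v_max: float = 6.0) -> Tuple[List[int], List[float]]:
    """Apply one BPSO update to a single particle, dimension by dimension."""
    new_x: List[int] = []
    new_v: List[float] = []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        # Eq. (1): inertia term plus attraction to pbest and gbest.
        vel = (w * v[d]
               + c1 * r1 * (pbest[d] - x[d])
               + c2 * r2 * (gbest[d] - x[d]))
        # Eq. (2): clamp the velocity to [v_min, v_max].
        vel = max(min(v_max, vel), v_min)
        # Eq. (4): set the bit to 1 with probability S(vel).
        bit = 1 if sigmoid(vel) > rng.random() else 0
        new_v.append(vel)
        new_x.append(bit)
    return new_x, new_v


# Placeholder parameter values, chosen only to make the sketch runnable.
rng = random.Random(42)
position, velocity = update_particle(
    x=[0, 1, 0, 1], v=[0.0, 0.0, 0.0, 0.0],
    pbest=[1, 1, 0, 0], gbest=[1, 0, 0, 1],
    w=0.8, c1=2.0, c2=2.0, rng=rng)
print(position, velocity)
```

With the bounds at −6 and 6, the probability from Eq. (3) stays between roughly 0.0025 and 0.9975, so every bit keeps a small chance of flipping even when the velocity is saturated.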
In BPSO, four main components were included: encoding schemes, the population initialization, a fitness evaluation, and a particle update operator. These components are explained in detail below.
2.2. Encoding schemes
First, each particle was designed in a format that enabled us to express a particular number of SNP and genotype combinations.
be performed satisfactorily, and associations with diseases can be clearly established.
3.6. Analysis of SNP–SNP interaction
In complex diseases such as cancers, multiple genes and their SNPs are involved. Traditional methods ignore the possibility that the combined effect of multiple SNPs plays a larger role in association studies than the effect of any single SNP. Most methods address the individual SNP effect of each gene first; only once an individual SNP effect has been demonstrated is the SNP–SNP interactive effect determined. However, each SNP from a single susceptibility gene may have only a marginal effect and may not be detected easily using traditional statistical analysis.
Since multiple genes and their SNPs are involved in or associated with most diseases and cancers, it has become popular to perform association studies involving multiple SNPs from multiple disease-related genes. However, a large number of SNPs makes it difficult to analyze them simultaneously in association studies. Here, we propose the BPSO method to provide computationally representative and protective SNP barcodes with regard to breast cancer. The relative association power of each SNP barcode is listed in order to help determine the SNP barcode with the best performance, i.e., the maximal difference in occurrence between controls and cases. Based on the difference between control-dominant and breast cancer-dominant SNP barcodes, the SNP barcode with the maximum difference is chosen as the risk or protective biomarker for breast cancer in this study. Coupled with the odds ratio, the risk or protective features of the SNP barcode for breast cancer in women are evaluated quantitatively. Moreover, the OR-BPSO method uses ternary genotypes rather than the commonly used binary allele types, thereby improving the resolution of the analysis of multiple SNP–SNP interactions. Results obtained with the proposed BPSO method should be interpreted carefully, because we focus only on evaluating the association of SNP–SNP combinations with breast cancer predisposition rather than on providing direct evidence of breast cancer risk. The proposed BPSO method has the potential to handle SNP interactions in SNP data from SNP arrays [5–7] and the HapMap website [8].
4. Conclusion
This study demonstrates that the proposed method is efficient and a powerful analysis tool for seven SNP–SNP interactions in an association study on breast cancer. The method could also identify the protective SNP barcodes with the best performance. The strong performance of BPSO in generating SNP barcodes with the best fitness between case and control groups can potentially be applied to determine the complex SNP–SNP interactions among the huge numbers of SNPs involved in genome-wide association studies.
5. Conflict of interest statement
No conflicts of interest.
Acknowledgements
This work is partly supported by the National Science Council in Taiwan under grant NSC97-2311-B-037-003-MY3, NSC97-2622-E-151-008-CC2, NSC96-2221-E-214-050-MY3, NSC96-2311-B037-002, NSC96-2622-E214-004-CC3, NSC96-2622-E-151-019-CC3, and the grant KMU-EM-98-1.4.
References
[1] Brookes AJ. The essence of SNPs. Gene 1999;234(2):177–86.
[2] Cantor CR. The use of genetic SNPs as new diagnostic markers in preventive medicine. Ann N Y Acad Sci 2005;1055:48–57.
[3] Wang Y, Armstrong SA. Genome-wide SNP analysis in cancer: leukemia shows the way. Cancer Cell 2007;11(4):308–9.
[4] Roses AD, Saunders AM, Huang Y, Strum J, Weisgraber KH, Mahley RW. Complex disease-associated pharmacogenetics: drug efficacy, drug safety, and confirmation of a pathogenetic hypothesis (Alzheimer’s disease). Pharmacogenomics J 2007;7(1):10–28.
[5] Xing J, Watkins WS, Zhang Y, Witherspoon DJ, Jorde LB. High fidelity of whole-genome amplified DNA on high-density single nucleotide polymorphism arrays. Genomics 2008;92(6):452–6.
[6] Nishida N, Koike A, Tajima A, Ogasawara Y, Ishibashi Y, Uehara Y, et al. Evaluating the performance of Affymetrix SNP Array 6.0 platform with 400 Japanese individuals. BMC Genomics 2008;9:431.
[7] Hao K, Schadt EE, Storey JD. Calibrating the performance of SNP arrays for whole-genome association studies. PLoS Genet 2008;4(6):e1000109.
[8] Thorisson GA, Smith AV, Krishnan L, Stein LD. The International HapMap Project Web site. Genome Res 2005;15(11):1592–3.
[9] Kang G, Yue W, Zhang J, Huebner M, Zhang H, Ruan Y, et al. Two-stage designs to identify the effects of SNP combinations on complex diseases. J Hum Genet 2008;53(8):739–46.
[10] Zabaleta J, Lin HY, Sierra RA, Hall MC, Clark PE, Sartor OA, et al. Interactions of cytokine gene polymorphisms in prostate cancer risk. Carcinogenesis 2008;29(3):573–8.
[11] Lin HY, Wang W, Liu YH, Soong SJ, York TP, Myers L, et al. Comparison of multivariate adaptive regression splines and logistic regression in detecting SNP-SNP interactions and their application in prostate cancer. J Hum Genet 2008;53(9):802–11.
[12] Zheng SL, Sun J, Wiklund F, Smith S, Stattin P, Li G, et al. Cumulative association of five genetic variants with prostate cancer. N Engl J Med 2008;358(9):910–9.

Table 3
BPSO-based selection of the best SNP combination and the estimated effects of SNP barcode on the occurrence of breast cancer.

| Combined SNP no. | SNP genotype | Subject no. | Breast cancer no. (%) | Odds ratio (P value) | 95% CI |
|---|---|---|---|---|---|
| Two SNPs (a) | Other (b) | 348 | 151 (43.39%) | 1.00 | |
| SNPs (3,4) | 1-3 | 206 | 69 (33.50%) | 0.66 (0.02) | 0.46–0.94 |
| Three SNPs (a) | Other (b) | 449 | 189 (42.09%) | 1.00 | |
| SNPs (1,3,5) | 2-1-1 | 105 | 31 (29.52%) | 0.58 (0.02) | 0.36–0.91 |
| Four SNPs (a) | Other (b) | 501 | 207 (41.32%) | 1.00 | |
| SNPs (1,2,3,4) | 2-2-1-3 | 53 | 13 (24.53%) | 0.46 (0.02) | 0.24–0.89 |
| Five SNPs (a) | Other (b) | 525 | 215 (40.95%) | 1.00 | |
| SNPs (1,2,3,4,5) | 2-2-1-3-1 | 29 | 5 (17.24%) | 0.30 (0.02) | 0.11–0.80 |
| Six SNPs (a) | Other (b) | 544 | 220 (40.44%) | 1.00 | |
| SNPs (1,2,3,4,6,7) | 2-2-1-3-3-2 | 10 | 0 (0%) | Not estimable | NE |
| Seven SNPs (a) | Other (b) | 544 | 219 (40.26%) | 1.00 | |
| SNPs (1,2,3,4,5,6,7) | 2-2-1-3-1-2-3 | 10 | 1 (10.00%) | 0.17 (0.09) | 0.02–1.31 |

(a) The selected SNP combination is determined by the same analysis shown in Table 2, although only the example with two SNP combinations is provided. It is the pattern with the best performance and the maximal occurrence difference between breast cancer and normal control, rather than an arbitrary selection.
(b) “Other” is the reference group, indicating the union of all other possible two to seven SNP combinations.

[13] Schabath MB, Wu X, Wei Q, Li G, Gu J, Spitz MR. Combined effects of the p53 and p73 polymorphisms on lung cancer risk. Cancer Epidemiol Biomarkers Prev 2006;15(1):158–61.
[14] Miao X, Zhang X, Zhang L, Guo Y, Hao B, Tan W, et al. Adenosine diphosphate ribosyl transferase and x-ray repair cross-complementing 1 polymorphisms in gastric cardia cancer. Gastroenterology 2006;131(2):420–7.
[15] Ma H, Hu Z, Zhai X, Wang S, Wang X, Qin J, et al. Joint effects of single nucleotide polymorphisms in P53BP1 and p53 on breast cancer risk in a Chinese population. Carcinogenesis 2006;27(4):766–71.
[16] Yen CY, Liu SY, Chen CH, Tseng HF, Chuang LY, Yang CH, et al. Combinational polymorphisms of four DNA repair genes XRCC1, XRCC2, XRCC3, and XRCC4 and their association with oral cancer in Taiwan. J Oral Pathol Med 2008;37(5):271–7.
[17] Lin GT, Tseng HF, Chang CK, Chuang LY, Liu CS, Yang CH, et al. SNP combinations in chromosome-wide genes are associated with bone mineral density in Taiwanese women. Chin J Physiol 2008;91(1):1–10.
[18] Particle swarm optimization. Proceedings IEEE International Conference on Neural Networks, Perth, Australia; 1995.
[19] Lin GT, Tseng HF, Yang CH, Hou MF, Chuang LY, Tai HT, et al. Combinational polymorphisms of seven CXCL12-related genes are protective against breast cancer in Taiwan. OMICS 2009;13(2):165–72.
[20] A discrete binary version of the particle swarm algorithm. Computational Cybernetics and Simulation, Proceedings IEEE Intl. Conf. on Systems, Man, and Cybernetic; 1997.
[21] Defining a Standard for Particle Swarm Optimization. Proceedings of the 2007 IEEE Swarm Intelligence Symposium (SIS 2007); 2007.
[22] Arya M, Ahmed H, Silhi N, Williamson M, Patel HR. Clinical importance and therapeutic implications of the pivotal CXCL12-CXCR4 (chemokine ligand-receptor) interaction in cancer cell migration. Tumour Biol 2007;28(3):123–31.
[23] Tilton B, Ho L, Oberlin E, Loetscher P, Baleux F, Clark-Lewis I, et al. Signal transduction by CXC chemokine receptor 4. Stromal cell-derived factor 1 stimulates prolonged protein kinase B and extracellular signal-regulated kinase 2 activation in T lymphocytes. J Exp Med 2000;192(3):313–24.
[24] Chinni SR, Sivalogan S, Dong Z, Filho JC, Deng X, Bonfil RD, et al. CXCL12/CXCR4 signaling activates Akt-1 and MMP-9 expression in prostate cancer cells: the role of bone microenvironment-associated CXCL12. Prostate 2006;66(1):32–48.
[25] Brand S, Dambacher J, Beigel F, Olszak T, Diebold J, Otte JM, et al. CXCR4 and CXCL12 are inversely expressed in colorectal cancer cells and modulate cancer cell migration, invasion and MMP-9 activation. Exp Cell Res 2005;310(1): 117–30.
[26] Zhou Y, Yu C, Miao X, Tan W, Liang G, Xiong P, et al. Substantial reduction in risk of breast cancer associated with genetic polymorphisms in the promoters of the matrix metalloproteinase-2 and tissue inhibitor of metalloproteinase-2 genes. Carcinogenesis 2004;25(3):399–404.
[27] Qiao D, Yi L, Hua L, Xu Z, Ding Y, Shi D, et al. Cystic fibrosis transmembrane conductance regulator (CFTR) gene 5T allele may protect against prostate cancer: a case-control study in Chinese Han population. J Cyst Fibros 2008;7(3):210–4.
[28] Liu Z, Calderon JI, Zhang Z, Sturgis EM, Spitz MR, Wei Q. Polymorphisms of vitamin D receptor gene protect against the risk of head and neck cancer. Pharmacogenet Genomics 2005;15(3):159–65.
[29] Flugge J, Krusekopf S, Goldammer M, Osswald E, Terhalle W, Malzahn U, et al. Vitamin D receptor haplotypes protect against development of colorectal cancer. Eur J Clin Pharmacol 2007;63(11):997–1005.
[30] Aberle J, Hopfer I, Beil FU, Seedorf U. Association of peroxisome proliferator-activated receptor delta +294T/C with body mass index and interaction with peroxisome proliferator-activated receptor alpha L162V. Int J Obes (Lond) 2006;30(12):1709–13.
[31] Baier RJ, Loggins J, Yanamandra K. IL-10, IL-6 and CD14 polymorphisms and sepsis outcome in ventilated Very Low Birth Weight infants. BMC Med 2006;4(1):10.