Particle Swarm Optimization with Reinforcement Learning for the Prediction of CpG Islands in the Human Genome
Li-Yeh Chuang1, Hsiu-Chen Huang2,3*, Ming-Cheng Lin4, Cheng-Hong Yang4,5*
1 Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan, 2 Institute of Biomedical Engineering, National Cheng Kung University, Tainan, Taiwan, 3 Department of Physical Medicine and Rehabilitation, Chia-Yi Christian Hospital, Chia-Yi, Taiwan, 4 Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, 5 Department of Network Systems, Toko University, Chiayi, Taiwan
Abstract
Background: Regions with abundant GC nucleotides, a high CpG number, and a length greater than 200 bp in a genome are often referred to as CpG islands. These islands are usually located at the 5′ end of genes. Recently, several algorithms for the prediction of CpG islands have been proposed.
Methodology/Principal Findings: We propose here a new method called CPSORL, which combines a complementary particle swarm optimization algorithm with reinforcement learning to predict CpG islands more reliably. Several CpG island prediction tools equipped with the sliding window technique have been developed previously. However, the quality of their results depends strongly on the window sizes chosen, and thus these methods leave room for improvement.
Conclusions/Significance: Experimental results indicate that CPSORL provides results of a higher sensitivity and a higher correlation coefficient in all selected experimental contigs than the other methods it was compared to (CpGIS, CpGcluster, CpGProD and CpGPlot). A higher number of CpG islands were identified in chromosomes 21 and 22 of the human genome than with the other methods from the literature. CPSORL also achieved the highest coverage rate (3.4%). CPSORL is an application for identifying promoter and TSS regions associated with CpG islands in the entire human genome. When compared to CpGcluster, the islands predicted by CPSORL covered a larger region in the TSS (12.2%) and promoter (26.1%) region. If Alu sequences are considered, the islands predicted by CPSORL (Alu) covered a larger TSS (40.5%) and promoter (67.8%) region than CpGIS. Furthermore, CPSORL was used to verify that the average methylation density was 5.33% for CpG islands in the entire human genome.
Citation: Chuang L-Y, Huang H-C, Lin M-C, Yang C-H (2011) Particle Swarm Optimization with Reinforcement Learning for the Prediction of CpG Islands in the Human Genome. PLoS ONE 6(6): e21036. doi:10.1371/journal.pone.0021036
Editor: Vladimir Brusic, Dana-Farber Cancer Institute, United States of America
Received February 1, 2011; Accepted May 16, 2011; Published June 28, 2011
Copyright: © 2011 Chuang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work is partly supported by the National Science Council in Taiwan under grants NSC96-2221-E-214-050-MY3, NSC2221-E-151-040-, NSC 98-2622-E-151-001-CC2 and 98-2622-E-151-024-CC3. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. No additional external funding received for this study.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: [email protected] (H-CH); [email protected] (C-HY)
Introduction
CpG islands are short sequences with a high concentration of the two nucleotides cytosine (C) and guanine (G). The letter ‘p’ in CpG represents the phosphodiester bond between the C and G nucleotides. CpG islands were first identified by Tykocinski and Max as small regions that contain recognition sites for the restriction enzyme HpaII in the genome and were thus originally called HpaII Tiny Fragment (HTF) islands [1].
A definition of CpG islands was first offered by Gardiner-Garden and Frommer (GGF) in 1987 [2]. The original description included the length of the suspected region, which has to exceed 200 bp, the GC content in that region, which has to be higher than 50%, and the observed/expected (O/E) ratio, which has to surpass a value of 0.6. Since biological experiments have proven that there could be two Alu sequences in a CpG island, Takai and Jones revised the GGF criteria of CpG islands in 2002 [3]. Their modified definition
requires that the minimum length of the suspected region is 500 bp and that the required GC content and O/E ratio are 55% and 0.65, respectively. The Alu endonuclease is so named because it was first isolated from Arthrobacter luteus. Alu sequences are highly repetitive short interspersed elements with a consensus sequence of about 280 bp. Some of these sequences have a relatively high GC content and O/E ratio [2,3]. Recently, various algorithms have been adopted in the literature to predict CpG islands, e.g., CpGIS [3], CpGPlot [4], CpGProD [5] and CpGcluster [6]. Most of these tools use the sliding window technique with the GC content, O/E ratio and length thresholds as the main parameters; CpGcluster instead uses the distance between CpG dinucleotides.
Particle swarm optimization (PSO) is a population-based stochastic optimization technique developed by Kennedy and Eberhart [7]. The main advantage of PSO is its fast convergence. The individual memory of each particle can be used to compare information during the search process. To date, PSO has been
successfully applied in many fields, including operon prediction [8] and biomarker selection [9], amongst others.
In this study we propose a new prediction method called CPSORL, which combines complementary particle swarm optimization (CPSO) with the reinforcement learning (RL) method to predict CpG islands in the human genome. Reinforcement learning [10] is applied to extend the shorter CpG islands or even combine neighboring CpG islands if prescribed requirements are met (an example comparison of CpG island predictions with and without a reinforcement learning process is shown in Figure S1).
The proposed CPSORL method adopts the GGF criteria (GC content ≥50%, O/E ratio ≥0.6, length ≥200 bp) as guidelines for the search for CpG islands. CPSORL is composed of two major steps. First, the input sequence is cut into windows, and the PSO algorithm is then used to search for DNA sequences that satisfy the GGF criteria. The PSO mechanism is updated iteratively to search for optimal results and identifies the best performing particles in the swarm population [11]. If the PSO particles fall into a local search pattern, the complementary concept enables them to leave this local region and participate in the global search again. In a second step, the length of the predicted CpG island is extended by RL; islands are combined with neighboring islands until the length definition parameters are met [10,12]. Experimental results indicate that CPSORL provides results of a higher sensitivity and a higher correlation coefficient in all selected experimental contigs than CpGIS, CpGcluster, CpGProD and CpGPlot.
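To make the GGF criteria concrete, the following minimal Python sketch tests whether a candidate window satisfies them. It is an illustration only and not the authors' implementation; the function names are hypothetical, and the sketch assumes the sequence contains only the letters A, C, G and T.

```python
def gc_content(seq: str) -> float:
    """Fraction of C and G bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def obs_exp_ratio(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG / L) / ((#C / L) * (#G / L))."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG dinucleotide is possible without both C and G
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return (cpg / n) / ((c / n) * (g / n))


def meets_ggf(seq: str, min_len: int = 200,
              min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """True if seq satisfies the length, GC content and O/E thresholds."""
    return (len(seq) >= min_len
            and gc_content(seq) >= min_gc
            and obs_exp_ratio(seq) >= min_oe)
```

The default thresholds are the GGF values quoted above; passing `min_len=500, min_gc=0.55, min_oe=0.65` would give the stricter Takai and Jones variant.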
Results
Parameter settings
In PSO, four different parameters need to be set: the population size, the number of iterations, and the C1 and C2 constants of the update function. The population size in our study was set to 300 [13], the number of iterations was set to 100, and C1 and C2 were set to 2 [11]. The CpGIS parameters were: length set to 200 bp, GC content set to 50%, O/E ratio set to 0.6, and the gap between adjacent islands set to 100 bp (http://cpgislands.usc.edu/). The CpGcluster parameters used were a p-value threshold of 1E-5 and a distance threshold (percentile) of 50. CpGProD and CpGPlot were used directly from the internet (http://pbil.univ-lyon1.fr/software/cpgprod_query.html and http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html).
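For orientation, the sketch below shows where these four parameters enter a textbook PSO update step [7,11]. It is not the CPSORL update itself (the pseudo-code for that is in Text S1); the function name `pso_step`, the inertia weight `w` and the encoding of a particle as a list of floats are assumptions made for illustration.

```python
import random

POPULATION_SIZE = 300  # number of particles, as stated in the text
ITERATIONS = 100       # number of update rounds, as stated in the text
C1 = C2 = 2.0          # acceleration constants, as stated in the text


def pso_step(position, velocity, personal_best, global_best, w=1.0, rng=random):
    """One textbook PSO update for a single particle.

    v <- w*v + C1*r1*(pbest - x) + C2*r2*(gbest - x);  x <- x + v
    The inertia weight w is an assumption (w = 1 recovers the 1995 form).
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        v_next = w * v + C1 * r1 * (p - x) + C2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity
```

In a full run, this step would be applied to each of the `POPULATION_SIZE` particles in each of the `ITERATIONS` rounds, with the personal and global bests refreshed from the fitness values after every round.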
Performance measurement
We used five common criteria to determine the prediction accuracy, namely the sensitivity (SN), specificity (SP), accuracy (ACC), performance coefficient (PC) and correlation coefficient (CC) [14]. The five criteria are defined in Eqs. (1–5). Through these five evaluation criteria the superiority of an algorithm was determined. The calculation processes are shown in detail in Figure S2.
$$\mathrm{SN} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3}$$

$$\mathrm{PC} = \frac{TP}{TP + FN + FP} \tag{4}$$

$$\mathrm{CC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \tag{5}$$
where TP is a true positive, FN is a false negative, TN is a true negative and FP is a false positive. We predicted CpG islands under the GGF criteria. Subsequently, we used five evaluation criteria to assess the CpG island prediction performance of all methods.
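As a worked illustration of Eqs. (1–5), the following Python sketch computes the five criteria from the four counts. The function name and the dictionary layout are hypothetical, and the sketch assumes that none of the denominators is zero.

```python
import math


def prediction_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute SN, SP, ACC, PC and CC (Eqs. 1-5) as fractions in [-1, 1] or [0, 1]."""
    sn = tp / (tp + fn)                    # Eq. (1): sensitivity
    sp = tn / (tn + fp)                    # Eq. (2): specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # Eq. (3): accuracy
    pc = tp / (tp + fn + fp)               # Eq. (4): performance coefficient
    cc = (tp * tn - fp * fn) / math.sqrt(  # Eq. (5): correlation coefficient
        (tp + fn) * (tp + fp) * (tn + fn) * (tn + fp)
    )
    return {"SN": sn, "SP": sp, "ACC": acc, "PC": pc, "CC": cc}
```

Multiplying each value by 100 gives percentages of the kind reported in Table 1. Because PC ignores true negatives, it is not inflated by the large non-island background of a contig in the way ACC and SP are.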
In addition, the receiver operating characteristic (ROC) curve was used to compare the methods by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate). Hanson has pointed out that the area under the ROC curve can be used to assess the accuracy of a risk scale [15]. Since TPR equals the sensitivity and FPR equals 1 − specificity, the ROC curve plots the sensitivity against 1 − specificity; the sensitivity and specificity express the accuracy of the CpG island prediction.
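The relationship between the counts and a point on the ROC curve, and the area under a curve built from such points, can be sketched as follows. The helper names are hypothetical, and the trapezoidal rule is one common way to approximate the area; the source does not state how the area was obtained.

```python
def roc_point(tp: int, tn: int, fp: int, fn: int) -> tuple:
    """Return (FPR, TPR) = (1 - specificity, sensitivity) for one setting."""
    return fp / (fp + tn), tp / (tp + fn)


def roc_auc(points) -> float:
    """Area under a ROC curve given (FPR, TPR) points, by the trapezoidal rule.

    The caller is assumed to include the end points (0, 0) and (1, 1).
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between neighbours
    return area
```

A diagonal curve from (0, 0) to (1, 1) gives an area of 0.5, the value expected for random guessing, while a curve that passes through (0, 1) gives the maximum area of 1.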
Experimental results
We propose an effective hybrid method of CPSO and RL, called CPSORL, to identify CpG islands in the human genome. In CPSORL, CPSO supplies the updating function used to find potential CpG island regions, and RL is used to extend and combine CpG islands in order to improve the prediction quality. The CPSO proposed in this study prevents the entrapment of particles in a local optimum. Table 1 compares the performance of different methods from the literature for CpG island prediction in terms of SN, SP, ACC, PC and CC. CPSORL provides SN, PC and CC results that are higher than those of the other methods it was compared to. Table S1 shows the results in the contig NT_113954.1 for different CpG island prediction tools. Table 2 contains the number of CpG islands located in gene regions identified with CPSORL. A comparison of the number of CpG islands identified in the human genome with different methods is shown in Table 3. Table 4 shows the number of methylation sites identified with CPSORL in chromosomes 21 and 22 of the human genome, and also includes the chromosome length, the total length of the CpG islands, the number of methylation sites in the entire genome, the number of methylation sites selected by CPSORL, and the methylation density of the CpG islands. CPSORL predicted CpG islands with an average methylation density of 5.33% in the entire chromosome; the results are shown in Table 5. Table 6 shows the prediction performance for the entire human chromosome by the proposed method and the methods from the literature.
Discussion
CpG island prediction performance in the contigs
We compared CPSORL with four other methods reported in the literature, namely CpGIS [3], CpGPlot [4], CpGProD [5] and CpGcluster [6], as well as with PSO. Table 1 shows that the SN of the proposed method was highest on the NT_113952.1 (84.88%), NT_113955.2 (87.38%), NT_113958.2 (84.11%), NT_113953.1 (75.65%), NT_113954.1 (77.68%) and NT_028395.3 (77.02%) datasets (sensitivity bar graphs in Figure S3). The proposed method obtained better prediction results for CpG islands than the other methods tested. The accuracies (ACC) of CPSO and CPSORL are higher than the accuracies of the other methods. However, even though the ACC of CPSORL is lower than the ACC of CpGPlot in contig NT_113954.1, the SN, PC and CC of CPSORL are superior to those of CpGPlot. The reason for this is that CpGPlot yields hardly any FPs in the search process, but many FNs. It therefore obtains high SP and ACC values and a lower SN. In addition, the performance of CPSO is better than that of CPSORL in the NT_113952.1 and NT_028395.3 contigs, because RL yields more FPs and thus a lower SP. Hence, CPSO can obtain a higher CC in these contigs. As shown in Table 1, the SP obtained in this study is lower than the SP of CpGPlot in all contigs. CPSORL also showed the best PC and CC prediction performance on the chromosome 21 and 22 contigs shown in Table 1, e.g., NT_113955.2 (87.89%), NT_113958.2 (79.31%), NT_113953.1 (73.10%) and NT_113954.1 (68.53%) have the highest PC and CC values (CC values in parentheses). The PC can be viewed as a criterion to determine the method performance. The CC can be viewed as a combination of sensitivity and specificity [15]. In addition, we used the ROC curves for comparison in order to show that CPSORL is superior to the other methods. An ROC curve is a plot of the false positive (FP) rate versus the true positive (TP) rate [16]. Figure S4 shows the ROC curves for all methods. Based on these plots it can be stated that the performance of
Table 1. Comparison of different methods for CpG island prediction.
| Contig | Performance | CpGPlot | CpGcluster | CpGProD | CpGIS | PSO without RL | PSO with RL | CPSO without RL | CPSO with RL |
|---|---|---|---|---|---|---|---|---|---|
| NT_113952.1 (Length = 184355) | SN (%) | 56.43 | 50.46 | 58.07 | 83.98 | 69.22 | 75.58 | 77.43 | 84.88 |
| | SP (%) | 100.0 | 99.95 | 99.50 | 99.05 | 99.61 | 99.02 | 99.58 | 99.05 |
| | ACC (%) | 98.09 | 97.78 | 97.69 | 98.39 | 98.28 | 97.99 | 98.61 | 98.43 |
| | PC (%) | 56.42 | 49.92 | 52.36 | 69.59 | 63.77 | 62.27 | 70.91 | 70.34 |
| | CC (%) | 74.38 | 69.41 | 68.83 | 81.25 | 77.66 | 75.71 | 82.49 | 81.80 |
| NT_113955.2 (Length = 281920) | SN (%) | 47.19 | 67.15 | 68.51 | 85.12 | 54.47 | 59.63 | 77.80 | 87.38 |
| | SP (%) | 100.0 | 99.72 | 99.63 | 99.30 | 99.96 | 99.88 | 99.50 | 99.61 |
| | ACC (%) | 98.08 | 98.54 | 98.50 | 98.79 | 98.31 | 98.42 | 98.71 | 99.16 |
| | PC (%) | 47.14 | 62.47 | 62.35 | 71.78 | 53.87 | 57.74 | 68.67 | 79.08 |
| | CC (%) | 67.94 | 77.03 | 76.65 | 82.96 | 72.41 | 74.51 | 80.85 | 87.89 |
| NT_113958.2 (Length = 209483) | SN (%) | 51.29 | 27.16 | 46.41 | 82.13 | 79.27 | 81.65 | 81.08 | 84.11 |
| | SP (%) | 99.99 | 99.94 | 98.93 | 98.26 | 98.13 | 97.90 | 98.17 | 98.34 |
| | ACC (%) | 96.90 | 95.32 | 95.60 | 97.24 | 96.93 | 96.87 | 97.08 | 97.43 |
| | PC (%) | 51.24 | 26.92 | 40.10 | 65.36 | 62.10 | 62.33 | 63.80 | 67.51 |
| | CC (%) | 70.38 | 49.96 | 56.80 | 77.63 | 75.03 | 75.28 | 76.41 | 79.31 |
| NT_113953.1 (Length = 131056) | SN (%) | 22.80 | 57.32 | 29.79 | 74.05 | 60.20 | 64.80 | 70.53 | 75.65 |
| | SP (%) | 100.0 | 99.74 | 99.56 | 98.83 | 99.27 | 99.23 | 99.22 | 99.13 |
| | ACC (%) | 97.76 | 98.51 | 97.53 | 98.11 | 98.13 | 98.23 | 98.38 | 98.45 |
| | PC (%) | 22.80 | 52.74 | 25.96 | 53.23 | 48.39 | 51.59 | 55.91 | 58.57 |
| | CC (%) | 47.21 | 69.89 | 43.61 | 68.64 | 64.50 | 67.25 | 70.90 | 73.10 |
| NT_113954.1 (Length = 129889) | SN (%) | 31.24 | 29.86 | 52.01 | 76.31 | 56.92 | 63.58 | 70.54 | 77.68 |
| | SP (%) | 100.0 | 99.46 | 98.72 | 97.62 | 98.40 | 98.13 | 98.34 | 98.23 |
| | ACC (%) | 97.47 | 96.90 | 97.00 | 96.83 | 96.87 | 96.86 | 97.32 | 97.48 |
| | PC (%) | 31.24 | 26.19 | 38.94 | 47.05 | 40.12 | 42.74 | 49.22 | 53.15 |
| | CC (%) | 55.17 | 43.81 | 54.68 | 63.29 | 55.65 | 58.36 | 64.72 | 68.53 |
| NT_028395.3 (Length = 647850) | SN (%) | 27.11 | 44.89 | 54.18 | 76.68 | 68.97 | 72.79 | 72.52 | 77.02 |
| | SP (%) | 100.0 | 99.47 | 99.45 | 98.93 | 99.27 | 98.99 | 99.18 | 98.90 |
| | ACC (%) | 97.98 | 97.53 | 98.19 | 98.14 | 98.19 | 98.06 | 98.24 | 98.12 |
| | PC (%) | 27.10 | 39.26 | 45.36 | 59.36 | 57.49 | 57.17 | 59.36 | 59.25 |
| | CC (%) | 51.51 | 57.21 | 62.26 | 73.57 | 72.21 | 71.75 | 73.61 | 73.48 |

RL: Reinforcement Learning. SN: Sensitivity. SP: Specificity. ACC: Accuracy. PC: Performance coefficient. CC: Correlation coefficient.
Underlined values represent the best results.
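$$\mathrm{CpGlength}(\max) = 2000, \quad \mathrm{CpGlength}(\min) = 200$$

$$\mathrm{GC}(P_i) = \frac{\#C + \#G}{\#A + \#T + \#C + \#G} \tag{11}$$

$$\frac{\mathrm{ObsCpG}}{\mathrm{ExpCpG}}(P_i) = \frac{\#\mathrm{CpG} / \mathrm{CpGlength}}{(\#C / \mathrm{CpGlength}) \times (\#G / \mathrm{CpGlength})} \tag{12}$$

$$\mathrm{Fitness}(P_i) = \mathrm{GC}(P_i) + \frac{\mathrm{ObsCpG}}{\mathrm{ExpCpG}}(P_i) + \mathrm{CpGlength}(P_i) \tag{13}$$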
where #A, #T, #C and #G are the numbers of A (adenine), T (thymine), C (cytosine) and G (guanine) nucleotides in the CpG island represented by particle Pi, #CpG is the number of CpG dinucleotides, and CpGlength is the length of the CpG island.
A fitness function is used to evaluate the performance of CPSORL. A high fitness value means that CpG islands are predicted with a high correlation coefficient and sensitivity. In general, the length of CpG islands is 200 bp–2000 bp. To keep the length term from dominating the fitness value, a normalization function was used to adjust the fitness function: the length value is adjusted to a range of 0 to 1. A step-by-step description of the calculations performed by the algorithm is shown in Figure S9.
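The fitness calculation of Eqs. (11–13) can be sketched in Python as follows. The text states only that the length is adjusted to the range 0 to 1; the min-max normalization between 200 bp and 2000 bp used here is an assumption, as are the function names, and the sketch assumes a sequence made up of A, C, G and T only.

```python
L_MIN, L_MAX = 200, 2000  # CpGlength(min) and CpGlength(max) from the text


def normalized_length(length: int) -> float:
    """Map a length to [0, 1]; min-max scaling is an assumption of this sketch."""
    clipped = min(max(length, L_MIN), L_MAX)
    return (clipped - L_MIN) / (L_MAX - L_MIN)


def fitness(seq: str) -> float:
    """Eq. (13): GC content + observed/expected CpG ratio + normalized length."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    c, g = seq.count("C"), seq.count("G")
    gc = (c + g) / n                                        # Eq. (11)
    cpg = seq.count("CG")                                   # CpG dinucleotides
    oe = (cpg / n) / ((c / n) * (g / n)) if c and g else 0.0  # Eq. (12)
    return gc + oe + normalized_length(n)
```

Without the normalization, a raw length of several hundred base pairs would swamp the other two terms, which are both of order 1.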
Supporting Information
Figure S1 Comparison of CpG island prediction with and
without reinforcement learning. The short bars indicate the CpG islands. (A) Without reinforcement learning, the known CpG island is divided into two segments by the CPSO-RL prediction. (B) With reinforcement learning, a single CpG island is predicted by CPSO-RL that matches a real CpG island.
(DOC)
Figure S2 Illustration of calculating TP, TN, FP and FN. (TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively.)
(DOC)
Figure S3 Bar graphs illustrating the different sensitivities for each method on chromosome 21 and chromosome 22 contigs. (DOC)
Figure S4 ROC curves plotted for all methods to evaluate the
data sets. (DOC)
Figure S5 Distribution of CpG islands in the entire human
genome. The blue dots indicate the CpG islands, and the x and y axes are the GC% and the CpGs o/e ratio, respectively. Most CpG islands lie in the region of 50–70% GC, and an o/e ratio of between 0.6 and 1.0.
(DOC)
Figure S6 Analysis of predicted CpG islands by CPSORL in the
entire human genome. (DOC)
Figure S7 Illustration of calculating methylation densities. (DOC)
Figure S8 Length distribution of CPSORL and other methods
in the human genome. (DOC)
Figure S9 A description of the step-by-step procedures for the algorithm.
(DOC)
Text S1 The pseudo-codes for PSO and CPSO.
(DOC)
Table S1 Comparison of different CpG island prediction tools
for contig NT_113954.1. (DOC)
Acknowledgments
We thank H-W Chang for critical reading of the manuscript. We also thank the National Science Council for providing part of the equipment in Taiwan.
Author Contributions
Analyzed the data: L-YC H-CH. Contributed reagents/materials/analysis tools: L-YC H-CH. Wrote the paper: C-HY M-CL. Coordinated and oversaw this study, and modified the manuscript: C-HY H-CH. Participated in the design of the algorithm and wrote the program: M-CL. Provided the biochemistry background and introduced the bioinformatics needed: L-YC.
References
1. Tykocinski M, Max E (1984) CG dinucleotide clusters in MHC genes and in 5′ demethylated genes. Nucleic acids research 12: 4385.
2. Gardiner-Garden M, Frommer M (1987) CpG islands in vertebrate genomes. Journal of molecular biology 196: 261.
3. Takai D, Jones P (2003) The CpG island searcher: a new WWW resource. In silico biology 3: 235–240.
4. Rice P, Longden I, Bleasby A (2000) EMBOSS: the European molecular biology open software suite. Trends in Genetics 16: 276–277.
5. Ponger L, Mouchiroud D (2002) CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics 18: 631.
6. Hackenberg M, Previti C, Luque-Escamilla P, Carpena P, Martinez-Aroza J, et al. (2006) CpGcluster: a distance-based algorithm for CpG-island detection. BMC Bioinformatics 7: 446.
7. Kennedy J, Eberhart R (1995) Particle swarm optimization. IEEE International Conference on Neural Networks. pp 1942–1948.
8. Chuang LY, Tsai JH, Yang CH (2010) Binary particle swarm optimization for operon prediction. Nucleic acids research 38: e128.
9. Ressom HW, Varghese RS, Abdel-Hamid M, Eissa SAL, Saha D, et al. (2005) Analysis of mass spectral serum profiles for biomarker selection. Bioinformatics 21: 4039.
10. Whitehead S, Sutton R, Ballard D (1990) Advances in reinforcement learning and their implications for intelligent control. Proceedings of the 5th IEEE Int. Symposium on Intelligent Control Citeseer. pp 1289–1297.
11. Poli R, Kennedy J, Blackwell T (2007) Particle swarm optimization. Swarm Intelligence 1: 33–57.
12. Barto AG, Sutton RS (1997) Reinforcement learning in artificial intelligence. Advances in Psychology 121: 358–386.
13. Gudise VG, Venayagamoorthy GK (2003) Evolving digital circuits using particle swarm. IEEE. pp 468–472 vol. 461.
14. Fang F, Fan S, Zhang X, Zhang M (2006) Predicting methylation status of CpG islands in the human brain. Bioinformatics 22: 2204.
15. Hanson RK, Canada CSG (1997) The development of a brief actuarial risk scale for sexual offense recidivism: Solicitor General Canada.
16. Egan JP (1975) Signal detection theory and ROC analysis.
17. Sujuan Y, Asaithambi A, Liu Y (2008) CpGIF: an algorithm for the identification of CpG islands. Bioinformation 2: 335.
18. Lai H, Chiang Y, Hsu C, Wu F (2008) A Recognition Machine for CpG-islands Based on Boltzmann Model. Journal of Medical and Biological Engineering 28: 23–30.
19. Hackenberg M, Barturen G, Carpena P, Luque-Escamilla P, Previti C, et al. (2010) Prediction of CpG-island function: CpG clustering vs. sliding-window methods. BMC Genomics 11: 327.
20. Han L, Zhao Z (2009) CpG islands or CpG clusters: how to identify functional GC-rich regions in a genome? BMC Bioinformatics 10: 65.
21. Jiang C, Han L, Su B, Li W, Zhao Z (2007) Features and trend of loss of promoter-associated CpG islands in the human and mouse genomes. Molecular biology and evolution 24: 1991.
22. Lin Y, Kuo M, Yu J, Kuo H, Lin R, et al. (2008) c-Myb is an evolutionary conserved miR-150 target and miR-150/c-Myb interaction is important for embryonic development. Molecular biology and evolution 25: 2189.
23. Hancock J, Worthey E, Santibanez-Koref M (2001) A role for selection in regulating the evolutionary emergence of disease-causing and other coding CAG repeats in humans and mice. Molecular biology and evolution 18: 1014.
24. Yegnasubramanian S, Haffner MC, Zhang Y, Gurel B, Cornish TC, et al. (2008) DNA hypomethylation arises later in prostate cancer progression than CpG island hypermethylation and contributes to metastatic tumor heterogeneity. Cancer research 68: 8954.
25. Illingworth R, Bird A (2009) CpG islands—‘A rough guide’. FEBS letters 583: 1713–1720.
26. Lister R, Pelizzola M, Dowen R, Hawkins R, Hon G, et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462: 315–322.
27. Kane MF, Loda M, Gaida GM, Lipman J, Mishra R, et al. (1997) Methylation of the hMLH1 promoter correlates with lack of expression of hMLH1 in sporadic colon tumors and mismatch repair-defective human tumor cell lines. Cancer research 57: 808.
28. Davis CD, Uthus EO (2004) DNA methylation, cancer susceptibility, and nutrient interactions. Experimental Biology and Medicine 229: 988.