Operon Prediction Using Chaos Embedded Particle Swarm Optimization
Li-Yeh Chuang, Cheng-Huei Yang, Jui-Hung Tsai, and Cheng-Hong Yang
Abstract—Operons contain valuable information for drug design and determining protein functions. Genes within an operon are co-transcribed to a single-strand mRNA and must be coregulated. The identification of operons is, thus, critical for a detailed
understanding of the gene regulations. However, currently used experimental methods for operon detection are generally difficult to implement and time consuming. In this paper, we propose a chaotic binary particle swarm optimization (CBPSO) to predict operons in bacterial genomes. The intergenic distance, participation in the same metabolic pathway and the cluster of orthologous groups (COG) properties of the Escherichia coli genome are used to design a fitness function. Furthermore, the Bacillus subtilis, Pseudomonas aeruginosa PA01, Staphylococcus aureus and Mycobacterium tuberculosis genomes are tested and evaluated for accuracy, sensitivity, and specificity. The computational results indicate that the proposed method works effectively in terms of enhancing the performance of the operon prediction. The proposed method also achieved a good balance between sensitivity and specificity when compared to methods from the literature.
Index Terms—Operon, particle swarm optimization, chaos
1 INTRODUCTION
VALUABLE information for drug design and protein
functions can be acquired from the operons of bacterial
genomes [1]. Operons in prokaryote organisms contain one
or more consecutive genes on the same strand. The genes
are cotranscribed into a single-strand mRNA sequence and
are, thus, likely to have the same biological functions, as
well as affect each other directly. Understanding the gene
regulations is, thus, critical for improving the operon
prediction process. However, information regarding
operons is scarce, and experimental methods for predicting
operons are generally difficult to implement. To gain a
deeper insight, operon-related research projects have to be
investigated in further detail. In recent years, many operon
prediction features have been proposed in the literature.
Features commonly used to determine the existence of an
operon are the intergenic distance, metabolic pathway,
homologous genes, terminator, gene order conservation,
clusters of orthologous groups, and the gene length ratio
[2]. Among the genome sequence features, the promoter and
the terminator are the most representative properties [3].
The intergenic
distance is the simplest and most widely used prediction
property. It is used to observe whether the distance between
gene pairs within an operon (WO pairs) is shorter than the
distance between gene pairs at the transcription unit
borders (TUB pairs) [4].
Scientists have proposed many methods to predict
operons, including the Bayesian method [5], [6], [7], [8],
machine learning [9], clustering approaches [10], logistic
regression method [11], and graphical theoretic approaches
[12], [13]. Some advanced techniques use artificial
intelligence to predict operons; genetic algorithms [1], [3], [14],
particle swarm optimization (PSO) [2], and neural networks
[15], [16] fall into this category. These artificial intelligence
methods have shown a high operon prediction accuracy.
Some databases that fall into neither of the above two
categories have been constructed and made available,
for example, RegulonDB [17], DBTBS [18], DOOR (Database
of prOkaryotic OpeRons) [19], MicrobesOnline [20], and the
ODB (Operon Database) [21].
In this study, we propose a chaotic binary particle swarm
optimization (CBPSO) to predict operons. PSO constitutes a
randomized search and optimization technique that derives
its working principles from the social behavior of
organisms. Chaos is a nonlinear system with deterministic
dynamic behavior. It has stochastic and regularity
properties, as well as ergodicity, and is very sensitive to the initial
conditions and parameters. Small differences in the initial
conditions result in great differences after many iterations
[22]. These characteristics of a chaotic system can be used to
enhance the search ability of PSO. The Escherichia coli
(E. coli) genome was selected as the training genome, based
on the intergenic distance, the participation in the same
metabolic pathway, and the clusters of orthologous groups
(COG). The Bacillus subtilis (B. subtilis), Pseudomonas
aeruginosa PA01 (P. aeruginosa PA01), Staphylococcus aureus
(S. aureus), and Mycobacterium tuberculosis (M. tuberculosis)
genomes were selected as target genomes. The CBPSO
computational results demonstrate that the prediction
ability of CBPSO is superior to that of the methods from
the literature it was compared to [8], [9], [10], [11], [12], [13],
[14], [15], [16].
L.-Y. Chuang is with the Department of Chemical Engineering, Institute of Biotechnology and Chemical Engineering, I-Shou University, No. 1, Sec. 1, Syuecheng Road, Dashu District, Kaohsiung 84001, Taiwan, R.O.C. E-mail: chuang@isu.edu.tw.
C.-H. Yang is with the Department of Electronic Communication Engineering, National Kaohsiung Institute of Marine Technology, No. 142, Haijhuan, Road, Nandh District, Kaohsiung 81157, Taiwan, R.O.C. E-mail: chyang@mail.nkmu.edu.tw.
J.-H. Tsai is with the Department of Electronic and Communication Engineering, National Kaohsiung Marine University, Kaohsiung, Taiwan, R.O.C. E-mail: bigblack918@hotmail.com.
C.-H. Yang is with the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, R.O.C. E-mail: chyang@cc.kuas.edu.tw.
Manuscript received 15 June 2012; revised 10 May 2013; accepted 13 May 2013; published online 20 May 2013.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2012-06-0147. Digital Object Identifier no. 10.1109/TCBB.2013.63.
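The sensitivity of a chaotic system to its initial conditions, as described in the Introduction, can be illustrated with a short sketch. The logistic map used here is only an assumed example of a chaotic system, since this section does not state which map CBPSO embeds. The function name `logistic_map`, the parameter r = 4.0, and the starting values are illustrative choices.

```python
def logistic_map(x0, r=4.0, steps=50):
    """Iterate x_{n+1} = r * x_n * (1 - x_n) and return the trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two starting points that differ by only 1e-6 (illustrative values).
a = logistic_map(0.300000)
b = logistic_map(0.300001)

# The gap between the two trajectories grows from 1e-6 to order one.
max_gap = max(abs(p - q) for p, q in zip(a, b))
print(f"initial gap: {abs(a[0] - b[0]):.1e}, largest gap: {max_gap:.3f}")
```

Because the sequence is deterministic yet visits the whole unit interval, such a map is a plausible source of diverse but reproducible values for a search algorithm.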
2 METHODS
2.1 Data Set Preparation
The entire genome data of E. coli, B. subtilis, P. aeruginosa
PA01, S. aureus, and M. tuberculosis were downloaded from
the GenBank database (http://www.ncbi.nlm.nih.gov/).
The related genomic information contains the gene name,
the gene ID, the position, the strand, and the product. The
experimental operon data sets of the E. coli and B. subtilis
genomes were obtained from RegulonDB
(http://regulondb.ccg.unam.mx/) [17] and DBTBS
(http://dbtbs.hgc.jp/) [18], respectively, which contain highly
reliable data on experimentally validated operons of these
genomes [19]. The experimental operon data sets of the
P. aeruginosa PA01, S. aureus, and M. tuberculosis genomes
were obtained from ODB
(http://www.genome.sk.ritsumei.ac.jp/odb/) [21]. The
metabolic pathway and COG data of the genomes were
obtained from KEGG
(http://www.genome.ad.jp/kegg/pathway.html) and NCBI
(http://www.ncbi.nlm.nih.gov/COG/), respectively.
2.2 Operon Pairs
An operon is defined as a sequence of one or more genes
that, under certain conditions, are transcribed as a unit.
Adjacent genes in the same operon are called a WO pair. If
the operon contains a single gene and the downstream gene
is of unknown status, the gene pair is called a TUB pair.
However, if the upstream gene is the last gene of an operon,
then the downstream gene is of uncertain status, and thus
the gene pair cannot be labeled a TUB pair [5]. Fig. 1 shows
a simple illustration of WO and TUB pairs.
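The labeling rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact procedure of [5]: the genes are assumed to be listed in the direction of transcription on one strand, an operon identifier of `None` marks an experimentally unclassified gene, and a pair that spans two different known operons is assumed to be a TUB pair. The function name `label_pairs` and the toy gene and operon names are hypothetical.

```python
from collections import Counter

def label_pairs(genes):
    """Label adjacent gene pairs as 'WO', 'TUB', or None (not labeled).

    genes: list of (gene_name, operon_id) tuples in transcription order;
    operon_id is None for an experimentally unclassified gene.
    """
    operon_size = Counter(op for _, op in genes if op is not None)
    labels = []
    for (g1, op1), (g2, op2) in zip(genes, genes[1:]):
        if op1 is not None and op1 == op2:
            label = "WO"    # adjacent genes in the same operon
        elif op1 is not None and op2 is not None:
            label = "TUB"   # assumption: border between two known operons
        elif op1 is not None and operon_size[op1] == 1:
            label = "TUB"   # singleton operon followed by an unclassified gene
        else:
            label = None    # status uncertain, pair is left unlabeled
        labels.append((g1, g2, label))
    return labels

# Toy example with hypothetical genes and operons.
genes = [("a", None), ("b", "op1"), ("c", "op1"),
         ("d", None), ("e", "op2"), ("f", None)]
for pair in label_pairs(genes):
    print(pair)
```

In this toy input, the pair (c, d) stays unlabeled because c is the last gene of a multigene operon and d is unclassified, whereas (e, f) is labeled TUB because e forms a singleton operon.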
2.3 Operon Properties
2.3.1 Features Selected for Operon Prediction
Five properties were originally considered for the
prediction of operons, i.e., the intergenic distance, the metabolic
pathway, the COG gene function, the gene length ratio, and
the operon length. However, Fig. 2 indicates that the gene
length ratio and the operon length are not as suitable for
operon prediction as the other three features. Thus, we
selected the intergenic distance, the metabolic pathway, and
the COG gene function to predict operons. The intergenic
distance property not only plays an important role in the
initial step but also yields good prediction results [3], [16],
[17], [23], [24]. This property can be used to universally
predict operons in bacterial genomes with a completed
chromosomal sequence. In the functional relations category,
we used the metabolic pathway and the COG gene function
to predict operons. The metabolic pathway property has a
high prediction accuracy on the E. coli data set, as indicated
in the literature [3]. When adjacent genes have the same
pathway, the probability of a pair being within the same
operon is very high. The reason we selected the COG gene
function is that genes which belong to the same first-level
functional category or fall into the fourth category have a
probability of 83.5 percent of being within the same operon
on the E. coli genome [25]. However, since the metabolic
pathway and the COG gene function both belong to the
functional relations category, the method only searches
regions where these properties overlap [3]. Since the same
prediction results were obtained when either one of these
properties was used, it should be noted that the metabolic
pathway property is more efficient for operon prediction.
However, the metabolic pathway property only determines
whether adjacent genes have the same pathway or not, and
thus COG must be used to estimate whether a gene is within a
functional category. A detailed description of the
above-mentioned three properties is given below.
Fig. 1. WO and TUB pairs. The white arrows represent genes that are experimentally unclassified, and the gray arrow represents a singleton operon. In addition, the black arrows represent operons that consist of several genes.
Fig. 2. ROC curves of operon prediction. The false-positive rate is plotted as the abscissa and the true-positive rate as the ordinate. (a) B. subtilis genome. (b) P. aeruginosa genome. (c) S. aureus genome. (d) M. tuberculosis genome.
2.3.2 Intergenic Distance
As shown in (1), the intergenic distance is calculated based
on the base pairs of adjacent genes. In general, the distance
of WO pairs is shorter than the distance of TUB pairs [23].
The maximum frequency of the WO pairs distance is 4 bps
[26]. The distribution frequency of the TUB pairs increases
with the distance. The intergenic distance distributions of
WO and TUB pairs of the E. coli, B. subtilis, P. aeruginosa
PA01, and S. aureus genomes are shown in Figs. 3a, 3b, 3c,
and 3d. The figures indicate that this property can be
effectively used for operon prediction.
\[
\text{Distance} = \text{Gene2}_{\text{start}} - (\text{Gene1}_{\text{end}} + 1). \tag{1}
\]
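Equation (1) translates directly into code. The sketch below assumes 1-based, inclusive gene coordinates with Gene1 upstream of Gene2. The function name `intergenic_distance` and the coordinates are hypothetical and are not taken from any of the genomes studied.

```python
def intergenic_distance(gene1_end, gene2_start):
    """Equation (1): Distance = Gene2_start - (Gene1_end + 1)."""
    return gene2_start - (gene1_end + 1)

# Hypothetical coordinates: 50 bases lie between the two genes.
print(intergenic_distance(gene1_end=1000, gene2_start=1051))  # 50

# Overlapping genes give a negative distance.
print(intergenic_distance(gene1_end=1000, gene2_start=997))   # -4
```

With this definition, two genes that are directly adjacent (Gene2 starting one base after Gene1 ends) have a distance of zero.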
2.3.3 Metabolic Pathway
Three levels of biological functions commonly used in
gene ontology are the biological process, the molecular
function, and the cellular component [27]. Genes within an
operon often have the same biological function [1].
Therefore, if adjacent genes are annotated with the same
metabolic pathway, we can infer that the gene pair is from
the same operon.
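The pathway criterion can be expressed as a simple set test. This is a minimal sketch, assuming each gene carries a collection of pathway identifiers. The function name `share_pathway` and the identifiers `path_A`, `path_B`, and `path_C` are hypothetical placeholders, not real KEGG entries.

```python
def share_pathway(pathways_a, pathways_b):
    """Return True if two adjacent genes share at least one pathway."""
    return bool(set(pathways_a) & set(pathways_b))

# Hypothetical pathway annotations for three adjacent genes.
gene1 = {"path_A", "path_B"}
gene2 = {"path_B"}
gene3 = {"path_C"}

print(share_pathway(gene1, gene2))  # True: both are annotated with path_B
print(share_pathway(gene2, gene3))  # False: no pathway in common
```

A gene without any pathway annotation never satisfies the test, so such pairs must rely on the other properties.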
2.3.4 COG Gene Function
COG consists of three main levels. The first level contains
four classes, namely, information storage and processing,
cellular processes and signaling, metabolism, and poorly
characterized. Each of the classes is divided into multiple
functional categories, and adjacent genes often remain in the
same class [15]. Hence, we consider a gene pair to be in the
same operon when the adjacent genes are within the same class.
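The class comparison can be sketched as a lookup from the one-letter COG functional category of each gene to its first-level class. The letter-to-class table below follows the commonly used COG category codes and should be checked against the COG release in use. Treating the poorly characterized class like any other class is an assumption of this sketch, and the function name `same_cog_class` is hypothetical.

```python
# First-level COG classes and their one-letter functional categories
# (assumed from the standard COG scheme; verify against the release used).
COG_CLASSES = {
    "information storage and processing": "JAKLB",
    "cellular processes and signaling": "DYVTMNZWUO",
    "metabolism": "CGEFHIPQ",
    "poorly characterized": "RS",
}
LETTER_TO_CLASS = {
    letter: name
    for name, letters in COG_CLASSES.items()
    for letter in letters
}

def same_cog_class(cog_a, cog_b):
    """Return True if both category letters fall in the same class."""
    class_a = LETTER_TO_CLASS.get(cog_a)
    class_b = LETTER_TO_CLASS.get(cog_b)
    return class_a is not None and class_a == class_b

print(same_cog_class("J", "K"))  # True: both information storage and processing
print(same_cog_class("J", "C"))  # False: different first-level classes
```

Unknown or missing category letters return False, so an unannotated gene is never counted as evidence for a shared operon.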
2.4 Binary Particle Swarm Optimization
Particle swarm optimization is a population-based
evolutionary computation technique developed by Kennedy and
Eberhart [28]. The concept of PSO was developed through
the observation of the social behavior of birds in a flock or
fish in a school. Each individual is affected by its past
experience and the swarm behavior. In PSO, each solution
can be considered an individual particle in a given search
space, which has its own position and velocity. During
movement, each particle adjusts its position by changing
its velocity based on its own experience, as well as the
experience of its companions, until an optimum position is
reached by itself and its companions [29]. All of the
particles have fitness values based on the calculations of a
fitness function. Particles are updated by following two
parameters called pbest and gbest at each iteration. Each
Fig. 3. Intergenic distance distribution diagram. The diagram shows the intergenic distance distribution of WO and TUB pairs of the (a) B. subtilis genome, (b) P. aeruginosa PA01 genome, (c) S. aureus genome, and (d) M. tuberculosis genome.
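The update of particles by pbest and gbest described in Section 2.4 can be sketched as follows. This is a common textbook form of binary PSO with an inertia weight and a sigmoid transfer function, given here as an assumption. It is not necessarily the exact update used in CBPSO, and it omits the chaotic component. The function name `bpso_step`, the parameter values, and the toy bit strings are illustrative.

```python
import math
import random

def bpso_step(positions, velocities, pbest, gbest,
              w=0.9, c1=2.0, c2=2.0, v_max=6.0, rng=random):
    """Apply one binary PSO update to all particles, in place."""
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            r1, r2 = rng.random(), rng.random()
            # Pull toward the particle's own best and the swarm's best.
            v[d] = (w * v[d]
                    + c1 * r1 * (pbest[i][d] - x[d])
                    + c2 * r2 * (gbest[d] - x[d]))
            v[d] = max(-v_max, min(v_max, v[d]))      # clamp the velocity
            prob_one = 1.0 / (1.0 + math.exp(-v[d]))  # sigmoid transfer
            x[d] = 1 if rng.random() < prob_one else 0

# Toy swarm of two particles with four bits each (illustrative values).
rng = random.Random(0)
positions = [[0, 1, 0, 1], [1, 1, 0, 0]]
velocities = [[0.0] * 4 for _ in positions]
pbest = [p[:] for p in positions]
gbest = [1, 1, 0, 1]
bpso_step(positions, velocities, pbest, gbest, rng=rng)
print(positions)
```

In a full optimizer, this step would be repeated, with pbest and gbest refreshed from the fitness function after every iteration.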
prediction, but the method suffers from a low
prediction sensitivity and accuracy. In other words,
these methods only achieve either a high sensitivity
or a high specificity, but not both. We used the
intergenic distance, the metabolic pathway, and the
COG properties to identify the WO and TUB pairs.
The metabolic pathway is used with a higher
frequency than the other properties, since the genes of
an operon often carry out highly specific activities in a
biochemical metabolic pathway [1], [3], [36]. The COG is used
somewhat less frequently than the other properties,
but literature reports [11], [15] have demonstrated the
strong identification ability of this property, because the
genes of one operon often have the same or similar COG
functions. The accuracy, sensitivity, and specificity
results that our method achieved are the highest among the
compared methods for operon prediction, even though the
method uses only three properties on all bacterial genomes.
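The three evaluation measures discussed above can be computed from predicted and experimentally known pair labels. The sketch below uses the standard definitions and assumes that WO pairs are the positive class (1) and TUB pairs the negative class (0). The function name `evaluate` and the toy labels are hypothetical and do not represent results from this study.

```python
def evaluate(predicted, actual):
    """Return sensitivity, specificity, and accuracy for 0/1 labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # WO predicted as WO
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # TUB predicted as TUB
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # TUB predicted as WO
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # WO predicted as TUB
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }

# Toy labels for six gene pairs (1 = WO pair, 0 = TUB pair).
predicted = [1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 0, 1, 1]
scores = evaluate(predicted, actual)
print(scores)
```

A method that labels every pair as WO would reach a sensitivity of 1 but a specificity of 0, which is why a balance between the two measures is reported alongside the accuracy.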
4 CONCLUSIONS
This study proposes CBPSO for the prediction of operons in
bacterial genomes. The embedded chaos enhances the
random diversity and thus improves the probability of
finding optimal results. In addition, the log-likelihood
method was employed to design a fitness function. The
evaluation accuracy of the fitness function was further
increased through the use of statistical theory. The
experimental results show that the use of only three properties in
CBPSO was sufficient to obtain the highest accuracy on the
four target genomes. The proposed method also achieved a
good balance between sensitivity and specificity. In the
future, we intend to construct an operon prediction system
for bioinformatics research to obtain further valuable
information about operons.
ACKNOWLEDGMENTS
This work was partly supported by the National Science
Council in Taiwan under grants 102-2221-E-151-024-MY3,
2622-E-151-003-CC3, 101-2622-E-151-027-CC3, and
102-2221-E-214-039.
REFERENCES
[1] S. Wang, Y. Wang, W. Du, F. Sun, X. Wang, C. Zhou, and Y. Liang, “A Multi-Approaches-Guided Genetic Algorithm with Application to Operon Prediction,” Artificial Intelligence in Medicine, vol. 41, pp. 151-159, Oct. 2007.
[2] L.Y. Chuang, J.H. Tsai, and C.H. Yang, “Binary Particle Swarm Optimization for Operon Prediction,” Nucleic Acids Research, vol. 38, article e128, 2010.
[3] E. Jacob, R. Sasikumar, and K.N.R. Nair, “A Fuzzy Guided Genetic Algorithm for Operon Prediction,” Bioinformatics, vol. 21, pp. 1403-1407, Apr. 2005.
[4] L. Wang, J.D. Trawick, R. Yamamoto, and C. Zamudio, “Genome-Wide Operon Prediction in Staphylococcus aureus,” Nucleic Acids Research, vol. 32, pp. 3689-3702, 2004.
[5] C. Sabatti, L. Rohlin, M.K. Oh, and J.C. Liao, “Co-Expression Pattern from DNA Microarray Experiments as a Tool for Operon Prediction,” Nucleic Acids Research, vol. 30, pp. 2886-2893, July 2002.
[6] J. Bockhorst, M. Craven, D. Page, J. Shavlik, and J. Glasner, “A Bayesian Network Approach to Operon Prediction,” Bioinformatics, vol. 19, pp. 1227-35, July 2003.
[7] M.J. De Hoon, S. Imoto, K. Kobayashi, N. Ogasawara, and S. Miyano, “Predicting the Operon Structure of Bacillus subtilis Using Operon Length, Intergene Distance, and Gene Expression Information,” Proc. Pacific Symp. Biocomputing, pp. 276-87, 2004.
[8] B.P. Westover, J.D. Buhler, J.L. Sonnenburg, and J.I. Gordon, “Operon Prediction without a Training Set,” Bioinformatics, vol. 21, pp. 880-888, Apr. 2005.
[9] M. Craven, D. Page, J. Shavlik, J. Bockhorst, and J. Glasner, “A Probabilistic Learning Approach to Whole-Genome Operon Prediction,” Proc. Int’l Conf. Intelligent Systems for Molecular Biology, vol. 8, pp. 116-27, 2000.
[10] G.Q. Zhang, Z.W. Cao, Q.M. Luo, Y.D. Cai, and Y.X. Li, “Operon Prediction Based on SVM,” Computational Biology and Chemistry, vol. 30, pp. 233-240, June 2006.
[11] M.N. Price, K.H. Huang, E.J. Alm, and A.P. Arkin, “A Novel Method for Accurate Operon Predictions in All Sequenced Prokaryotes,” Nucleic Acids Research, vol. 33, pp. 880-892, 2005.
[12] M.T. Edwards, S.C. Rison, N.G. Stoker, and L. Wernisch, “A Universally Applicable Method of Operon Map Prediction on Minimally Annotated Genomes Using Conserved Genomic Context,” Nucleic Acids Research, vol. 33, pp. 3253-3262, 2005.
[13] G. Li, D. Che, and Y. Xu, “A Universal Operon Predictor for Prokaryotic Genomes,” Bioinformatics and Computational Biology, vol. 7, pp. 19-38, Feb. 2009.
[14] P. Dam, V. Olman, K. Harris, Z. Su, and Y. Xu, “Operon Prediction Using Both Genome-Specific and General Genomic Information,” Nucleic Acids Research, vol. 35, pp. 288-298, 2007.
[15] X. Chen, Z. Su, Y. Xu, and T. Jiang, “Computational Prediction of Operons in Synechococcus sp. WH8102,” Genome Informatics, vol. 15, pp. 211-222, 2004.
[16] B. Taboada, C. Verde, and E. Merino, “High Accuracy Operon Prediction Method Based on STRING Database Scores,” Nucleic Acids Research, vol. 38, article e130, 2010.
[17] S. Gama-Castro, V. Jimenez-Jacinto, M. Peralta-Gil, A. Santos-Zavaleta, M. Penaloza-Spinola, B. Contreras-Moreira, J. Segura-Salazar, L. Muniz-Rascado, I. Martinez-Flores, and H. Salgado, “RegulonDB (Version 6.0): Gene Regulation Model of Escherichia coli K-12 Beyond Transcription, Active (Experimental) Annotated Promoters and Textpresso Navigation,” Nucleic Acids Research, vol. 36, pp. D120-D124, 2007.
[18] N. Sierro, Y. Makita, M. De Hoon, and K. Nakai, “DBTBS: A Database of Transcriptional Regulation in Bacillus subtilis Containing Upstream Intergenic Conservation Information,” Nucleic Acids Research, vol. 36, p. D93, 2008.
[19] F. Mao, P. Dam, J. Chou, V. Olman, and Y. Xu, “DOOR: A Database for Prokaryotic Operons,” Nucleic Acids Research, vol. 37, p. D459, 2009.
[20] P.S. Dehal, M.P. Joachimiak, M.N. Price, J.T. Bates, J.K. Baumohl, D. Chivian, G.D. Friedland, K.H. Huang, K. Keller, and P.S. Novichkov, “MicrobesOnline: An Integrated Portal for Comparative and Functional Genomics,” Nucleic Acids Research, vol. 38, p. D396, 2010.
[21] S. Okuda, T. Katayama, S. Kawashima, S. Goto, and M. Kanehisa, “ODB: A Database of Operons Accumulating Known Operons across Multiple Genomes,” Nucleic Acids Research, vol. 34, p. D358, 2006.
[22] H.G. Schuster and W. Just, Deterministic Chaos. Wiley, 1988.
[23] H. Salgado, G. Moreno-Hagelsieb, T.F. Smith, and J. Collado-Vides, “Operons in Escherichia coli: Genomic Analyses and Predictions,” Proc. Nat’l Academy of Sciences USA, vol. 97, pp. 6652-6657, June 2000.
[24] G. Moreno-Hagelsieb and J. Collado-Vides, “A Powerful Non-Homology Method for the Prediction of Operons in Prokaryotes,” Bioinformatics, vol. 18, pp. S329-S336, 2002.
[25] P.R. Romero and P.D. Karp, “Using Functional and Organizational Information to Improve Genome-Wide Computational Prediction of Transcription Units on Pathway-Genome Databases,” Bioinformatics, vol. 20, pp. 709-717, Mar. 2004.
[26] Y. Yan and J. Moult, “Detection of Operons,” Proteins, vol. 64, pp. 615-28, Aug. 2006.
[27] T.T. Tran, P. Dam, Z. Su, F.L. Poole, M.W.W. Adams, G.T. Zhou, and Y. Xu, “Operon Prediction in Pyrococcus furiosus,” Nucleic Acids Research, vol. 35, p. 11, 2006.
[28] J. Kennedy and R. Eberhart, “Particle Swarm Optimization,” Proc. IEEE Int’l Joint Conf. Neural Network, vol. 4, pp. 1942-1948, 1995.
[29] J. Kennedy, “The Particle Swarm: Social Adaptation of Knowledge,” Proc. IEEE Int’l Conf. on Evolutionary Computation, pp. 303-308, 1997.
[30] D. Kuo, “Chaos and Its Computing Paradigm,” IEEE Potentials, vol. 24, no. 2, pp. 13-15, Apr./May 2005.
[31] J. Kennedy and R. Eberhart, “A Discrete Binary Version of the Particle Swarm Algorithm,” Proc. IEEE Int’l Conf. System, Man, and Cybernetics, pp. 4104-4108, 1997.
[32] X. Chen, Z. Su, P. Dam, B. Palenik, Y. Xu, and T. Jiang, “Operon Prediction by Comparative Genomics: An Application to the Synechococcus sp. WH8102 Genome,” Nucleic Acids Research, vol. 32, pp. 2147-2157, 2004.
[33] J. Kennedy, “Swarm Intelligence,” Handbook of Nature-Inspired and Innovative Computing, pp. 187-219, Springer, 2006.
[34] P. Roback, J. Beard, D. Baumann, C. Gille, K. Henry, S. Krohn, H. Wiste, M.I. Voskuil, C. Rainville, and R. Rutherford, “A Predicted Operon Map for Mycobacterium tuberculosis,” Nucleic Acids Research, vol. 35, pp. 5085-5095, 2007.
[35] R.W. Brouwer, O.P. Kuipers, and S.A. van Hijum, “The Relative Value of Operon Predictions,” Briefings in Bioinformatics, vol. 9, pp. 367-75, Sept. 2008.
[36] Y. Zheng, J.D. Szustakowski, L. Fortnow, R.J. Roberts, and S. Kasif, “Computational Identification of Operons in Microbial Genomes,” Genome Research, vol. 12, pp. 1221-1230, Aug. 2002.
[37] E. Laing, K. Sidhu, and S.J. Hubbard, “Predicted Transcription Factor Binding Sites as Predictors of Operons in Escherichia coli and Streptomyces coelicolor,” BMC Genomics, vol. 9, article 79, Feb. 2008.
[38] M.D. Ermolaeva, O. White, and S.L. Salzberg, “Prediction of Operons in Microbial Genomes,” Nucleic Acids Research, vol. 29, pp. 1216-1221, Mar. 2001.
[39] T. Yada, M. Nakao, Y. Totoki, and K. Nakai, “Modeling and Predicting Transcriptional Units of Escherichia coli Genes Using Hidden Markov Models,” Bioinformatics, vol. 15, pp. 987-993, Dec. 1999.
Li-Yeh Chuang received the MS degree from the Department of Chemistry, University of North Carolina, in 1989 and the PhD degree from the Department of Biochemistry, North Dakota State University, in 1994. She is a professor and director of the Department of Chemical Engineering and the Institute of Biotechnology and Chemical Engineering at I-Shou University, Kaohsiung, Taiwan. Her main areas of research include bioinformatics, biochemistry, and genetic engineering.
Cheng-Huei Yang received the BS degree from the National Taipei Institute of Technology, Taiwan, in 1978, the MS degree from Northeastern University, Boston, Massachusetts, in 1987, and the PhD degree in electrical engineering from the National Cheng Kung University, Tainan, Taiwan, in 2001. Currently, he is a professor in the Department of Telecommunication and Computer Engineering, National Kaohsiung Institute of Marine Technology, Taiwan. His research interests include network communication, electronic instrument systems, and image processing.
Jui-Hung Tsai received the MS degrees from the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan, in 2008 and 2010, respectively. He has rich experience in computer programming, database design and management, and systems programming and design. His main areas of research include bioinformatics and computational biology.
Cheng-Hong Yang received the MS and PhD degrees in computer engineering from North Dakota State University in 1988 and 1992, respectively. He is a professor in the Department of Electronic Engineering at the National Kaohsiung University of Applied Sciences and serves as president of the university. His main areas of research include evolutionary computation, bioinformatics, and assistive tool implementation.