Operon Prediction Using Chaos Embedded Particle Swarm Optimization


Li-Yeh Chuang, Cheng-Huei Yang, Jui-Hung Tsai, and Cheng-Hong Yang

Abstract—Operons contain valuable information for drug design and for determining protein functions. Genes within an operon are co-transcribed to a single-strand mRNA and must be coregulated. The identification of operons is, thus, critical for a detailed understanding of gene regulation. However, currently used experimental methods for operon detection are generally difficult to implement and time consuming. In this paper, we propose a chaotic binary particle swarm optimization (CBPSO) method to predict operons in bacterial genomes. The intergenic distance, participation in the same metabolic pathway, and the cluster of orthologous groups (COG) properties of the Escherichia coli genome are used to design a fitness function. Furthermore, the Bacillus subtilis, Pseudomonas aeruginosa PA01, Staphylococcus aureus, and Mycobacterium tuberculosis genomes are tested and evaluated for accuracy, sensitivity, and specificity. The computational results indicate that the proposed method effectively enhances operon prediction performance. The proposed method also achieved a good balance between sensitivity and specificity when compared to methods from the literature.

Index Terms—Operon, particle swarm optimization, chaos


1 INTRODUCTION

Valuable information for drug design and protein

functions can be acquired from the operons of bacterial

genomes [1]. Operons in prokaryote organisms contain one

or more consecutive genes on the same strand. The genes

are cotranscribed into a single-strand mRNA sequence and

are, thus, likely to have the same biological functions, as

well as affect each other directly. Understanding the gene

regulations is, thus, critical for improving the operon

prediction process. However, information regarding

operons is scarce, and experimental methods for predicting

operons are generally difficult to implement. To gain a

deeper insight, operon-related research projects have to be

investigated in further detail. In recent years, many operon

prediction features have been proposed in the literature.

Features commonly used to determine the existence of an

operon are the intergenic distance, metabolic pathway,

homologous genes, terminator, gene order conservation,

clusters of orthologous groups, and the gene length ratio

[2]. Out of the above features, the promoter and the

terminator property in the genome sequence feature are

the most representative properties [3]. The intergenic

distance is the simplest and most widely used prediction

property. It is used to observe whether the distance between

gene pairs within an operon (WO pairs) is shorter than the

distance between gene pairs at the transcription unit

borders (TUB pairs) [4].

Scientists have proposed many methods to predict

operons, including the Bayesian method [5], [6], [7], [8],

machine learning [9], clustering approaches [10], logistic

regression method [11], and graphical theoretic approaches

[12], [13]. Some advanced techniques use artificial

intelligence to predict operons; genetic algorithms [1], [3], [14],

particle swarm optimization (PSO) [2], and neural networks

[15], [16] fall into this category. These artificial intelligence

methods have shown a high operon prediction accuracy.

Some databases that fall into neither of the above two

categories have been constructed and made available,

for example, RegulonDB [17], DBTBS [18], DOOR (Database

of prOkaryotic OpeRons) [19], MicrobesOnline [20], and the

ODB (Operon Database) [21].

In this study, we propose a chaotic binary particle swarm

optimization (CBPSO) to predict operons. PSO constitutes a

randomized search and optimization technique that derives

its working principles from the social behavior of

organisms. Chaos is a nonlinear system with deterministic

dynamic behavior. It has stochastic and regularity

properties, as well as ergodicity, and is very sensitive to the initial

conditions and parameters. Small differences in the initial

conditions result in great differences after many iterations

[22]. These characteristics of a chaotic system can be used to

enhance the search ability of PSO. The Escherichia coli

(E. coli) genome was selected for training the genome based

on the intergenic distance, the participation in the same

metabolic pathway, and the clusters of orthologous groups

(COG). The Bacillus subtilis (B. subtilis), Pseudomonas

• L.-Y. Chuang is with the Department of Chemical Engineering, Institute of Biotechnology and Chemical Engineering, I-Shou University, No. 1, Sec. 1, Syuecheng Road, Dashu District, Kaohsiung 84001, Taiwan, R.O.C. E-mail: chuang@isu.edu.tw.

• C.-H. Yang is with the Department of Electronic Communication Engineering, National Kaohsiung Institute of Marine Technology, No. 142, Haijhuan Road, Nandh District, Kaohsiung 81157, Taiwan, R.O.C. E-mail: chyang@mail.nkmu.edu.tw.

• J.-H. Tsai is with the Department of Electronic and Communication Engineering, National Kaohsiung Marine University, Kaohsiung, Taiwan, R.O.C. E-mail: bigblack918@hotmail.com.

• C.-H. Yang is with the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, R.O.C. E-mail: chyang@cc.kuas.edu.tw.

Manuscript received 15 June 2012; revised 10 May 2013; accepted 13 May 2013; published online 20 May 2013.

For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-2012-06-0147. Digital Object Identifier no. 10.1109/TCBB.2013.63.


aeruginosa PA01 (P. aeruginosa PA01), Staphylococcus aureus

(S. aureus), and Mycobacterium tuberculosis (M. tuberculosis)

genomes were selected as target genomes. The CBPSO

computational results demonstrate that the prediction

ability of CBPSO is superior to the other methods from

the literature it was compared to [8], [9], [10], [11], [12], [13],

[14], [15], [16].

2 METHODS

2.1 Data Set Preparation

The entire genome data of E. coli, B. subtilis, P. aeruginosa

PA01, S. aureus, and M. tuberculosis were downloaded from

the GenBank database (http://www.ncbi.nlm.nih.gov/).

The related genomic information contains the gene name,

the gene ID, the position, the strand, and the product. The

experimental operon data sets of the E. coli and B. subtilis genomes were obtained from RegulonDB (http://regulondb.ccg.unam.mx/) [17] and DBTBS (http://dbtbs.hgc.jp/) [18], respectively, which contain highly reliable data on experimentally validated operons of the E. coli and B. subtilis genomes [19]. The experimental operon data sets of the P. aeruginosa PA01, S. aureus, and M. tuberculosis genomes were obtained from ODB (http://www.genome.sk.ritsumei.ac.jp/odb/) [21]. The metabolic pathway and COG data of the genomes were obtained from KEGG (http://www.genome.ad.jp/kegg/pathway.html) and NCBI (http://www.ncbi.nlm.nih.gov/COG/), respectively.

2.2 Operon Pairs

An operon is defined as a sequence of one or more genes

that, under certain conditions, are transcribed as a unit.

Adjacent genes in the same operon are called a WO pair. If

the operon contains a single gene and the downstream gene

is of unknown status, the gene pair is called a TUB pair.

However, if the upstream gene is the last gene of an operon,

then the downstream gene is of uncertain status, and thus

the gene pair cannot be labeled a TUB pair [5]. Fig. 1 shows

a simple illustration of WO and TUB pairs.

2.3 Operon Properties

2.3.1 Features Selected for Operon Prediction

Five properties were originally considered for the

prediction of operons, i.e., the intergenic distance, the metabolic

pathway, the COG gene function, the gene length ratio, and

the operon length. However, Fig. 2 indicates that the gene

length ratio and the operon length are not as suitable for

operon prediction as the other three features. Thus, we

selected the intergenic distance, the metabolic pathway, and

the COG gene function to predict operons. The intergenic

distance property not only plays an important role in the

initial step but also yields good prediction results [3], [16],

[17], [23], [24]. This property can be used to universally

predict operons in bacterial genomes with a completed

chromosomal sequence. In the functional relations category,

we used the metabolic pathway and the COG gene function

Fig. 1. WO and TUB pairs. The white arrows represent genes that are experimentally unclassified, and the gray arrow represents a singleton operon. In addition, the black arrows represent operons that consist of several genes.

Fig. 2. ROC curves of operon prediction. The false-positive rate is plotted as the abscissa and the true-positive rate as the ordinate. (a) B. subtilis genome. (b) P. aeruginosa genome. (c) S. aureus genome. (d) M. tuberculosis genome.


to predict operons. The metabolic pathway property has a

high prediction accuracy on the E. coli data set, as indicated

in the literature [3]. When adjacent genes have the same

pathway, the probability of a pair being within the same

operon is very high. The reason we selected the COG gene

function is that genes which belong to the same first-level

functional category or fall into the fourth category have a

probability of 83.5 percent of being within the same operon

on the E. coli genome [25]. However, since the metabolic

pathway and the COG gene function belong to the

functional relations category, the method only searches

regions where these properties overlap [3]. Since the same

prediction results were obtained when either one of these

properties was used, it should be noted that the metabolic

pathway property is more efficient for operon prediction.

However, the metabolic pathway property only determines

whether adjacent genes have the same pathway or not, and

thus COG must be used to estimate if a gene is within a

functional category. A detailed description of the above

mentioned three properties is given below.

2.3.2 Intergenic Distance

As shown in (1), the intergenic distance is calculated based

on the base pairs of adjacent genes. In general, the distance

of WO pairs is shorter than the distance of TUB pairs [23].

The maximum frequency of the WO pairs distance is 4 bps

[26]. The distribution frequency of the TUB pairs increases

with the distance. The intergenic distance distribution of

WO and TUB pairs of the E. coli, B. subtilis, P. aeruginosa

PA01, and S. aureus genomes are shown in Figs. 3a, 3b, 3c,

and 3d. The figures indicate that this property can be effectively

used for operon prediction.

$$\text{Distance} = \text{Gene2}_{\text{start}} - (\text{Gene1}_{\text{end}} + 1) \qquad (1)$$
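As a minimal sketch of (1), the distance can be computed from the end coordinate of the upstream gene and the start coordinate of the downstream gene. The function name and the example coordinates are illustrative assumptions, not part of the original method description; a negative value indicates that the two genes overlap.

```python
def intergenic_distance(gene1_end: int, gene2_start: int) -> int:
    """Number of base pairs between two adjacent genes, following (1).

    gene1_end   -- last base position of the upstream gene
    gene2_start -- first base position of the downstream gene
    A negative result means the genes overlap.
    """
    return gene2_start - (gene1_end + 1)


# Hypothetical coordinates for illustration only.
print(intergenic_distance(100, 105))  # four bases (101..104) lie between the genes
print(intergenic_distance(100, 97))   # overlapping genes give a negative distance
```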

2.3.3 Metabolic Pathway

Three levels of biological functions commonly used in

gene ontology are the biological process, the molecular

function, and the cellular component [27]. Genes within an

operon often have the same biological function [1].

Therefore, if adjacent genes are annotated with the same

metabolic pathway, we can infer that the gene pair is from

the same operon.
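The pathway criterion can be sketched as a set-intersection test. This is only an illustration under the assumption that each gene is annotated with a collection of pathway identifiers; the function name and the identifiers in the example are hypothetical.

```python
def share_pathway(pathways_a, pathways_b) -> bool:
    """Return True if two adjacent genes are annotated with at least one
    common metabolic pathway identifier."""
    return bool(set(pathways_a) & set(pathways_b))


# Hypothetical pathway annotations for three adjacent genes.
gene_a = {"eco00010", "eco00620"}
gene_b = {"eco00010"}
gene_c = {"eco00230"}

print(share_pathway(gene_a, gene_b))  # common pathway: candidate WO pair
print(share_pathway(gene_b, gene_c))  # no common pathway
```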

2.3.4 COG Gene Function

COG consists of three main levels. The first level contains

four classes, namely, information storage processing,

cellular processing, signaling, and metabolism. Each of the

classes is divided into multiple functional categories, and

adjacent genes often remain in the same class [15]. Hence,

we consider a gene pair to be in the same operon when the adjacent

genes are within the same class.
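The class comparison can be sketched as a lookup from a gene's COG functional-category letter to its first-level class. The mapping below is a partial, illustrative table based on the NCBI COG category letters and is an assumption of this sketch, as are the function name and the choice to treat unannotated genes as not matching.

```python
# Partial, illustrative mapping from COG category letters to first-level
# classes (assumed here; a real table would cover every category letter).
COG_CLASS = {
    "J": "information storage and processing",
    "K": "information storage and processing",
    "L": "information storage and processing",
    "D": "cellular processes and signaling",
    "M": "cellular processes and signaling",
    "T": "cellular processes and signaling",
    "C": "metabolism",
    "E": "metabolism",
    "G": "metabolism",
}


def same_cog_class(category_a: str, category_b: str) -> bool:
    """Return True if both category letters map to the same first-level class.

    Genes whose category is missing from the table never match.
    """
    class_a = COG_CLASS.get(category_a)
    class_b = COG_CLASS.get(category_b)
    return class_a is not None and class_a == class_b


print(same_cog_class("C", "E"))  # both metabolism
print(same_cog_class("J", "C"))  # different classes
```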

2.4 Binary Particle Swarm Optimization

Particle swarm optimization is a population-based

evolutionary computation technique developed by Kennedy and

Eberhart [28]. The concept of PSO was developed through

the observation of the social behavior of birds in a flock or

fish in a school. Each individual is affected by its past

experience and the swarm behavior. In PSO, each solution

can be considered an individual particle in a given search

space, which has its own position and velocity. During

movement, each particle adjusts its position by changing

its velocity based on its own experience, as well as the

experience of its companions, until an optimum position is

reached by itself and its companions [29]. All of the

particles have fitness values based on the calculations of a

fitness function. Particles are updated by following two

parameters called pbest and gbest at each iteration. Each

Fig. 3. Intergenic distance distribution diagram. The diagram shows the intergenic distance distribution of WO and TUB pairs of the (a) B. subtilis genome, (b) P. aeruginosa PA01 genome, (c) S. aureus genome, and (d) M. tuberculosis genome.
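The pbest/gbest-driven update described in Section 2.4 can be sketched with the standard discrete binary PSO rule of Kennedy and Eberhart [31], in which the velocity is passed through a sigmoid and used as the probability of a bit being 1. This is a generic sketch, not the authors' CBPSO: the inertia weight `w`, the acceleration constants `c1` and `c2`, the velocity bound `v_max`, and the function names are all assumed values chosen for illustration, and no chaotic map is included.

```python
import math
import random


def sigmoid(v: float) -> float:
    """Map a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_step(position, velocity, pbest, gbest,
              w=0.7, c1=2.0, c2=2.0, v_max=6.0, rng=random):
    """One binary-PSO update of a single particle.

    position, pbest, gbest -- lists of 0/1 bits of equal length
    velocity               -- list of floats of the same length
    Returns the new (position, velocity).
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # Pull the velocity toward the particle's own best and the swarm's best.
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        # Clamp so the sigmoid does not saturate.
        v = max(-v_max, min(v_max, v))
        new_velocity.append(v)
        # The bit becomes 1 with probability sigmoid(v).
        new_position.append(1 if rng.random() < sigmoid(v) else 0)
    return new_position, new_velocity


rng = random.Random(0)
pos, vel = bpso_step([0, 1, 0, 1], [0.0] * 4, [1, 1, 0, 0], [1, 0, 0, 1], rng=rng)
print(pos, vel)
```

In an operon-prediction setting, each bit could, for example, encode whether an adjacent gene pair is treated as a WO pair; that encoding is likewise an assumption of this sketch.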


prediction, but the method suffers from a low

prediction sensitivity and accuracy. In other words,

these methods only achieve either a high sensitivity

or a high specificity, but not both. We used the

intergenic distance, the metabolic pathway, and the

COG properties to identify the WO and TUB pairs.

The metabolic pathway is used with a higher

frequency than the other properties; it often carries

out highly specific activities in a biochemical

metabolic pathway [1], [3], [36]. The COG is used

somewhat less frequently than the other properties,

but literature reports [11], [15] have proved the

powerful identification ability of this property due to

one operon often having the same or similar COG

functions. The accuracy, sensitivity, and specificity

results that our method achieved are the highest for

operon prediction even though the method only uses

three properties on all bacterial genomes.
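For reference, the three evaluation measures can be computed from a confusion matrix using their conventional definitions. The sketch below assumes that WO pairs are counted as positives and TUB pairs as negatives; the function name and the counts in the example are hypothetical and are not results from this study.

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Conventional sensitivity, specificity, and accuracy.

    tp -- WO pairs predicted as WO      fn -- WO pairs predicted as TUB
    tn -- TUB pairs predicted as TUB    fp -- TUB pairs predicted as WO
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only.
print(evaluate(tp=90, fp=20, tn=80, fn=10))
```

A method that labels nearly every pair as WO would score a high sensitivity but a low specificity, which is why the balance between the two is reported alongside accuracy.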

4 CONCLUSIONS

This study proposes CBPSO for the prediction of operons in

bacterial genomes. The embedded chaos enhances the

random diversity and thus improves the probability of

finding optimal results. In addition, the log-likelihood

method was employed to design a fitness function. The

evaluation accuracy of the fitness function was further

increased through the use of statistical theory. The

experimental results show that the use of only three properties in

CBPSO was sufficient to obtain the highest accuracy on the

four target genomes. The proposed method also achieved a

good balance between sensitivity and specificity. In the

future, we intend to construct an operon prediction system

for bioinformatics research to obtain further valuable

information about operons.

ACKNOWLEDGMENTS

This work was partly supported by the National Science

Council in Taiwan under grants 102-2221-E-151-024-MY3,

2622-E-151-003-CC3, 101-2622-E-151-027-CC3, and

102-2221-E-214-039.

REFERENCES

[1] S. Wang, Y. Wang, W. Du, F. Sun, X. Wang, C. Zhou, and Y. Liang, “A Multi-Approaches-Guided Genetic Algorithm with Application to Operon Prediction,” Artificial Intelligence in Medicine, vol. 41, pp. 151-159, Oct. 2007.

[2] L.Y. Chuang, J.H. Tsai, and C.H. Yang, “Binary Particle Swarm Optimization for Operon Prediction,” Nucleic Acids Research, vol. 38, article e128, 2010.

[3] E. Jacob, R. Sasikumar, and K.N.R. Nair, “A Fuzzy Guided Genetic Algorithm for Operon Prediction,” Bioinformatics, vol. 21, pp. 1403-1407, Apr. 2005.

[4] L. Wang, J.D. Trawick, R. Yamamoto, and C. Zamudio, “Genome-Wide Operon Prediction in Staphylococcus aureus,” Nucleic Acids Research, vol. 32, pp. 3689-3702, 2004.

[5] C. Sabatti, L. Rohlin, M.K. Oh, and J.C. Liao, “Co-Expression Pattern from DNA Microarray Experiments as a Tool for Operon Prediction,” Nucleic Acids Research, vol. 30, pp. 2886-2893, July 2002.

[6] J. Bockhorst, M. Craven, D. Page, J. Shavlik, and J. Glasner, “A Bayesian Network Approach to Operon Prediction,” Bioinformatics, vol. 19, pp. 1227-35, July 2003.

[7] M.J. De Hoon, S. Imoto, K. Kobayashi, N. Ogasawara, and S. Miyano, “Predicting the Operon Structure of Bacillus subtilis Using Operon Length, Intergene Distance, and Gene Expression Information,” Proc. Pacific Symp. Biocomputing, pp. 276-87, 2004.

[8] B.P. Westover, J.D. Buhler, J.L. Sonnenburg, and J.I. Gordon, “Operon Prediction without a Training Set,” Bioinformatics, vol. 21, pp. 880-888, Apr. 2005.

[9] M. Craven, D. Page, J. Shavlik, J. Bockhorst, and J. Glasner, “A Probabilistic Learning Approach to Whole-Genome Operon Prediction,” Proc. Int’l Conf. Intelligent Systems for Molecular Biology, vol. 8, pp. 116-27, 2000.

[10] G.Q. Zhang, Z.W. Cao, Q.M. Luo, Y.D. Cai, and Y.X. Li, “Operon Prediction Based on SVM,” Computational Biology and Chemistry, vol. 30, pp. 233-240, June 2006.

[11] M.N. Price, K.H. Huang, E.J. Alm, and A.P. Arkin, “A Novel Method for Accurate Operon Predictions in All Sequenced Prokaryotes,” Nucleic Acids Research, vol. 33, pp. 880-892, 2005.

[12] M.T. Edwards, S.C. Rison, N.G. Stoker, and L. Wernisch, “A Universally Applicable Method of Operon Map Prediction on Minimally Annotated Genomes Using Conserved Genomic Context,” Nucleic Acids Research, vol. 33, pp. 3253-3262, 2005.

[13] G. Li, D. Che, and Y. Xu, “A Universal Operon Predictor for Prokaryotic Genomes,” Bioinformatics and Computational Biology, vol. 7, pp. 19-38, Feb. 2009.

[14] P. Dam, V. Olman, K. Harris, Z. Su, and Y. Xu, “Operon Prediction Using Both Genome-Specific and General Genomic Information,” Nucleic Acids Research, vol. 35, pp. 288-298, 2007.

[15] X. Chen, Z. Su, Y. Xu, and T. Jiang, “Computational Prediction of Operons in Synechococcus sp. WH8102,” Genome Informatics, vol. 15, pp. 211-222, 2004.

[16] B. Taboada, C. Verde, and E. Merino, “High Accuracy Operon Prediction Method Based on STRING Database Scores,” Nucleic Acids Research, vol. 38, article e130, 2010.

[17] S. Gama-Castro, V. Jimenez-Jacinto, M. Peralta-Gil, A. Santos-Zavaleta, M. Penaloza-Spinola, B. Contreras-Moreira, J. Segura-Salazar, L. Muniz-Rascado, I. Martinez-Flores, and H. Salgado, “RegulonDB (Version 6.0): Gene Regulation Model of Escherichia coli K-12 Beyond Transcription, Active (Experimental) Annotated Promoters and Textpresso Navigation,” Nucleic Acids Research, vol. 36, pp. D120-D124, 2007.

[18] N. Sierro, Y. Makita, M. De Hoon, and K. Nakai, “DBTBS: A Database of Transcriptional Regulation in Bacillus subtilis Containing Upstream Intergenic Conservation Information,” Nucleic Acids Research, vol. 36, p. D93, 2008.

[19] F. Mao, P. Dam, J. Chou, V. Olman, and Y. Xu, “DOOR: A Database for Prokaryotic Operons,” Nucleic Acids Research, vol. 37, p. D459, 2009.

[20] P.S. Dehal, M.P. Joachimiak, M.N. Price, J.T. Bates, J.K. Baumohl, D. Chivian, G.D. Friedland, K.H. Huang, K. Keller, and P.S. Novichkov, “MicrobesOnline: An Integrated Portal for Comparative and Functional Genomics,” Nucleic Acids Research, vol. 38, p. D396, 2010.

[21] S. Okuda, T. Katayama, S. Kawashima, S. Goto, and M. Kanehisa, “ODB: A Database of Operons Accumulating Known Operons across Multiple Genomes,” Nucleic Acids Research, vol. 34, p. D358, 2006.

[22] H.G. Schuster and W. Just, Deterministic Chaos. Wiley, 1988.

[23] H. Salgado, G. Moreno-Hagelsieb, T.F. Smith, and J. Collado-Vides, “Operons in Escherichia coli: Genomic Analyses and Predictions,” Proc. Nat’l Academy of Sciences USA, vol. 97, pp. 6652-6657, June 2000.

[24] G. Moreno-Hagelsieb and J. Collado-Vides, “A Powerful Non-Homology Method for the Prediction of Operons in Prokaryotes,” Bioinformatics, vol. 18, pp. S329-S336, 2002.

[25] P.R. Romero and P.D. Karp, “Using Functional and Organizational Information to Improve Genome-Wide Computational Prediction of Transcription Units on Pathway-Genome Databases,” Bioinformatics, vol. 20, pp. 709-717, Mar. 2004.

[26] Y. Yan and J. Moult, “Detection of Operons,” Proteins, vol. 64, pp. 615-28, Aug. 2006.

[27] T.T. Tran, P. Dam, Z. Su, F.L. Poole, M.W.W. Adams, G.T. Zhou, and Y. Xu, “Operon Prediction in Pyrococcus furiosus,” Nucleic Acids Research, vol. 35, p. 11, 2006.

[28] J. Kennedy and R. Eberhart, “Particle Swarm Optimization,” Proc. IEEE Int’l Joint Conf. Neural Network, vol. 4, pp. 1942-1948, 1995.


[29] J. Kennedy, “The Particle Swarm: Social Adaptation of Knowledge,” Proc. IEEE Int’l Conf. on Evolutionary Computation, pp. 303-308, 1997.

[30] D. Kuo, “Chaos and Its Computing Paradigm,” IEEE Potentials, vol. 24, no. 2, pp. 13-15, Apr./May 2005.

[31] J. Kennedy and R. Eberhart, “A Discrete Binary Version of the Particle Swarm Algorithm,” Proc. IEEE Int’l Conf. System, Man, and Cybernetics, pp. 4104-4108, 1997.

[32] X. Chen, Z. Su, P. Dam, B. Palenik, Y. Xu, and T. Jiang, “Operon Prediction by Comparative Genomics: An Application to the Synechococcus sp. WH8102 Genome,” Nucleic Acids Research, vol. 32, pp. 2147-2157, 2004.

[33] J. Kennedy, “Swarm Intelligence,” Handbook of Nature-Inspired and Innovative Computing, pp. 187-219, Springer, 2006.

[34] P. Roback, J. Beard, D. Baumann, C. Gille, K. Henry, S. Krohn, H. Wiste, M.I. Voskuil, C. Rainville, and R. Rutherford, “A Predicted Operon Map for Mycobacterium tuberculosis,” Nucleic Acids Research, vol. 35, pp. 5085-5095, 2007.

[35] R.W. Brouwer, O.P. Kuipers, and S.A. van Hijum, “The Relative Value of Operon Predictions,” Briefings in Bioinformatics, vol. 9, pp. 367-75, Sept. 2008.

[36] Y. Zheng, J.D. Szustakowski, L. Fortnow, R.J. Roberts, and S. Kasif, “Computational Identification of Operons in Microbial Genomes,” Genome Research, vol. 12, pp. 1221-1230, Aug. 2002.

[37] E. Laing, K. Sidhu, and S.J. Hubbard, “Predicted Transcription Factor Binding Sites as Predictors of Operons in Escherichia coli and Streptomyces coelicolor,” BMC Genomics, vol. 9, article 79, Feb. 2008.

[38] M.D. Ermolaeva, O. White, and S.L. Salzberg, “Prediction of Operons in Microbial Genomes,” Nucleic Acids Research, vol. 29, pp. 1216-1221, Mar. 2001.

[39] T. Yada, M. Nakao, Y. Totoki, and K. Nakai, “Modeling and Predicting Transcriptional Units of Escherichia coli Genes Using Hidden Markov Models,” Bioinformatics, vol. 15, pp. 987-993, Dec. 1999.

Li-Yeh Chuang received the MS degree from the Department of Chemistry, University of North Carolina, in 1989 and the PhD degree from the Department of Biochemistry, North Dakota State University, in 1994. She is a professor and director of the Department of Chemical Engineering and the Institute of Biotechnology and Chemical Engineering at I-Shou University, Kaohsiung, Taiwan. Her main areas of research include bioinformatics, biochemistry, and genetic engineering.

Cheng-Huei Yang received the BS degree from the National Taipei Institute of Technology, Taiwan, in 1978, the MS degree from Northeastern University, Boston, Massachusetts, in 1987, and the PhD degree in electrical engineering from the National Cheng Kung University, Tainan, Taiwan, in 2001. Currently, he is a professor in the Department of Telecommunication and Computer Engineering, National Kaohsiung Institute of Marine Technology, Taiwan. His research interests include network communication, electronic instrument systems, and image processing.

Jui-Hung Tsai received the MS degrees from the Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Taiwan, in 2008 and 2010, respectively. He has rich experience in computer programming, database design and management, and systems programming and design. His main areas of research include bioinformatics and computational biology.

Cheng-Hong Yang received the MS and PhD degrees in computer engineering from North Dakota State University in 1988 and 1992, respectively. He is a professor in the Department of Electronic Engineering at the National Kaohsiung University of Applied Sciences and serves as president of the university. His main areas of research include evolutionary computation, bioinformatics, and assistive tool implementation.
