Binary particle swarm optimization for operon prediction


Li-Yeh Chuang1, Jui-Hung Tsai2 and Cheng-Hong Yang3,4,*

1Department of Chemical Engineering, I-Shou University, 2Department of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, 3Department of Network Systems, Toko University, Chiayi and 4Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan

Received October 3, 2009; Revised February 26, 2010; Accepted March 9, 2010

ABSTRACT

An operon is a fundamental unit of transcription and contains specific functional genes for the construction and regulation of networks at the entire genome level. The correct prediction of operons is vital for understanding gene regulations and functions in newly sequenced genomes. As experimental methods for operon detection tend to be nontrivial and time consuming, various methods for operon prediction have been proposed in the literature. In this study, a binary particle swarm optimization is used for operon prediction in bacterial genomes. The intergenic distance, participation in the same metabolic pathway, the cluster of orthologous groups, the gene length ratio and the operon length are used to design a fitness function. We trained the proper values on the Escherichia coli genome, and used the above five properties to implement feature selection. Finally, our study used the intergenic distance, metabolic pathway and the gene length ratio property to predict operons. Experimental results show that the prediction accuracy of this method reached 92.1%, 93.3% and 95.9% on the Bacillus subtilis genome, the Pseudomonas aeruginosa PA01 genome and the Staphylococcus aureus genome, respectively. This method has enabled us to predict operons with high accuracy for these three genomes, for which only limited data on the properties of the operon structure exists.

INTRODUCTION

Operons in prokaryote organisms contain one or more consecutive genes on the same strand, although a few eukaryotic organisms also have operon-like structures, e.g. Caenorhabditis elegans (1). These genes are co-transcribed into a single-strand mRNA sequence. Co-transcribed genes likely have the same biological functions and directly affect each other. Operon prediction can therefore be used to infer the function of putative proteins if the functions of other genes in the same operon are known. A well-known example is the lactose operon in Escherichia coli. This operon contains the three consecutive structural genes, lacZ, lacY and lacA, which all share the same promoter and terminator.

Operons of bacterial genomes contain information valuable for drug design and determining protein functions (2). The Gram-positive Staphylococcus bacterium, for example, is a human pathogen that is responsible for community-acquired and nosocomial infections (3). Operon prediction on this bacterium can facilitate drug target identification and the development of antibiotic drugs. However, knowledge of operons is scarce, and experimental methods for predicting operons are generally difficult to implement (4). To gain better insight, the number and organization of operons in bacterial genomes have to be studied in greater detail. A detailed understanding of the transcription rules is critical, as it would allow scientists to accurately predict operons based on an organism’s genomic sequence.

A number of scientists have proposed properties that can accurately predict operons. These properties can be divided into the following five categories (5): intergenic distance, conserved gene clusters, functional relations, genome sequence and experimental evidence. In each of the aforementioned categories, it is pivotal to detect the promoter and the terminator at the operon boundaries to identify the biologically most representative properties (4). The simplest and most important prediction property is to observe whether the distance between gene pairs within an operon (WO pairs) is shorter than the distance between gene pairs at the borders of the transcription units (TUB pairs) (3). The distance property yields very good operon prediction results.

*To whom correspondence should be addressed. Tel: +886 7 3814526; Fax: +886 7 3836844; Email: chyang@cc.kuas.edu.tw

Published online 12 April 2010. Nucleic Acids Research, 2010, Vol. 38, No. 12, e128. doi:10.1093/nar/gkq204

© The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Many computational algorithms have been proposed to properly balance the sensitivity and specificity of operon prediction. Jacob et al. (4) proposed an algorithm guided by fuzzy logic. Fuzzy logic does not rely on complex mathematical formulas to calculate the fitness values of a chromosome. Genetic algorithms (GA) (2) use the intergenic distance, metabolic pathways, cluster of orthologous groups (COG) and microarray expression data to predict operons. Zhang et al. (6) presented a support vector machine (SVM) algorithm to predict operons. This method uses four biological properties as SVM input vectors and divides gene pairs into operon pairs (OPs) and non-operon pairs (NOPs). The experimental accuracy of prediction was 0.9. In addition to the above-mentioned methods, our study compares further predictors: genome-specific (7), DVDA (8), FGENESB, ODB (9), OFS (10), OPERON (11), JPOP (12), VIMSS (13), UNIPOP (1) and genome-wide operon prediction in Staphylococcus aureus (3).

In this paper, we propose an effective binary particle swarm optimization (BPSO) for operon prediction. To validate the feasibility of the method, we calculated the logarithmic likelihood of each property in the E. coli (NC_000913) genome as a fitness value of each gene in the particle. Three bacterial genomes [Bacillus subtilis (NC_000964), Pseudomonas aeruginosa PA01 (NC_002516) and S. aureus (NC_002952)] were selected as benchmark genomes of known operon structure. In a first step, a restriction was introduced in the strand form to initialize a basis for the intergenic distance property. In order to select the best possible combination of properties, we employed the concept of feature selection to implement operon prediction. The five features investigated were the intergenic distance, metabolic pathways, COG, gene length ratio and operon length. Based on the experimental results and our analysis thereof, the intergenic distance, metabolic pathways and gene length ratio were selected after the feature selection process to calculate the fitness value of each gene in a particle. The particle was subsequently updated by an update formula at each generation. The detailed updating process is described in the next section. The experimental results indicate that the proposed method obtained a higher accuracy, sensitivity and specificity on the test data sets when compared to other methods from the literature.

MATERIALS AND METHODS

Data set preparation

The complete microbial genome data were downloaded from the GenBank database (http://www.ncbi.nlm.nih.gov/). The data contain a total of 4225, 5651 and 2845 genes in the B. subtilis genome, P. aeruginosa PA01 genome and S. aureus genome, respectively. The related genomic information consists of the gene name, gene ID, position, strand and product. The operon databases of E. coli and B. subtilis were obtained from RegulonDB (http://regulondb.ccg.unam.mx/) (14) and DBTBS (http://dbtbs.hgc.jp/) (15), respectively. The operon databases of the P. aeruginosa PA01 genome and the S. aureus genome were obtained from ODB (http://odb.kuicr.kyoto-u.ac.jp/) (9). The genomes’ metabolic pathway data and COG data were obtained from KEGG (http://www.genome.ad.jp/kegg/pathway.html) and NCBI (http://www.ncbi.nlm.nih.gov/COG/), respectively.

Definition of a potential operon pair

In order to gain valuable information pertaining to drug and protein functions, operons have to be predicted based on an organism’s genomic sequence. The entire genome is scanned for adjacent gene pairs on the same strand, and each gene pair is then classified into one of three types: (i) adjacent; (ii) WO pair; or (iii) TUB pair. The latter two are defined as positive and negative, respectively, before the accuracy for a putative operon map is calculated. The WO pairs of adjacent genes shown in Supplementary Figure S1 are in the same operon. If the operon contains a single gene and the downstream gene is of unknown status, the gene pair is called a TUB pair. However, if the gene at the end of the border of the transcription unit is of uncertain status (16), the gene pair cannot be labeled a TUB pair. In addition, the first gene of an operon and the upstream gene form a TUB pair.
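The labeling rules above can be sketched in a few lines of Python. This is only one reading of the text, not the authors' implementation: the `Gene` record, the `classify_pairs` function and the toy gene names are hypothetical, genes are assumed to be listed in genome order, and each operon is assumed to occupy a contiguous run of genes. A pair with one annotated and one unannotated gene is oriented in the direction of transcription before the boundary rules are applied.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Gene:
    name: str
    strand: str             # "+" or "-"
    operon: Optional[str]   # operon label from a curated database, None if unknown


def classify_pairs(genes: List[Gene]) -> List[Tuple[str, str, str]]:
    """Label adjacent same-strand pairs as 'WO', 'TUB' or 'adjacent' (unlabeled)."""
    operon_size = Counter(g.operon for g in genes if g.operon is not None)
    labels = []
    for a, b in zip(genes, genes[1:]):
        if a.strand != b.strand:
            continue  # only pairs on the same strand are candidates
        # Orient the pair in the direction of transcription.
        up, down = (a, b) if a.strand == "+" else (b, a)
        if up.operon is not None and up.operon == down.operon:
            label = "WO"        # both genes lie in the same operon
        elif up.operon is not None and down.operon is not None:
            label = "TUB"       # two different known operons meet here
        elif down.operon is not None:
            label = "TUB"       # first gene of an operon and its upstream gene
        elif up.operon is not None and operon_size[up.operon] == 1:
            label = "TUB"       # single-gene operon followed by an unknown gene
        else:
            label = "adjacent"  # status too uncertain to serve as a negative
        labels.append((a.name, b.name, label))
    return labels


# Hypothetical toy genome, for illustration only.
toy = [
    Gene("g1", "+", None), Gene("g2", "+", "op1"), Gene("g3", "+", "op1"),
    Gene("g4", "+", None), Gene("g5", "-", "op2"), Gene("g6", "-", None),
]
for pair in classify_pairs(toy):
    print(pair)
```

The pair g3/g4 stays unlabeled because g3 ends a multi-gene operon and the status of g4 is unknown, which mirrors the caveat about uncertain transcription unit borders.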

Binary particle swarm optimization

Overview. The particle swarm optimization (PSO) technique is a population-based evolutionary algorithm developed by Kennedy and Eberhart in 1995 (17). PSO has been developed through simulation of the social behavior of organisms, e.g. fish in a school or birds in a flock. The method is similar to a genetic algorithm, in which particles are initialized within a random population and search for global optimal solutions at each generation. However, the original PSO is not suitable for optimization problems in a discrete feature space. Hence, Kennedy and Eberhart developed binary PSO (BPSO) to overcome this problem (18). The basic elements of BPSO are briefly introduced below:

(i) Population: A swarm (population) consists of N particles.

(ii) Particle position, $x_i$: Each candidate solution can be represented by a D-dimensional vector; the ith particle can be described as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, where $x_{iD}$ is the position of the ith particle with respect to the Dth dimension.

(iii) Particle velocity, $v_i$: The velocity of the ith particle is represented by $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$, where $v_{iD}$ is the velocity of the ith particle with respect to the Dth dimension. In addition, the velocity of a particle is limited within $[V_{\min}, V_{\max}]^D$.

(iv) Inertia weight, w: The inertia weight is used to control the impact of the previous velocity of a particle on the current velocity. This control parameter affects the trade-off between the exploration and exploitation abilities of the particles.
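To make these elements concrete, the following minimal sketch shows one update step of the standard binary PSO of Kennedy and Eberhart (18): the velocity is pulled toward the particle's own best position and the swarm's best position, clamped to the allowed range, and then turned into a bit through a logistic function. The acceleration constants `c1` and `c2`, the velocity bounds of -6 and 6, and the function names are assumptions chosen for illustration; this section does not state the values used in the study, and the exact update formula of the paper may differ.

```python
import math
import random
from typing import List, Tuple


def sigmoid(v: float) -> float:
    """Logistic function mapping a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def update_particle(
    x: List[int],
    v: List[float],
    pbest: List[int],
    gbest: List[int],
    w: float = 1.0,       # inertia weight
    c1: float = 2.0,      # assumed cognitive acceleration constant
    c2: float = 2.0,      # assumed social acceleration constant
    v_min: float = -6.0,  # assumed lower velocity bound
    v_max: float = 6.0,   # assumed upper velocity bound
    rng=random,
) -> Tuple[List[int], List[float]]:
    """Return the new binary position and velocity of one particle."""
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()
        vel = w * v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        vel = max(v_min, min(v_max, vel))             # clamp to [v_min, v_max]
        bit = 1 if rng.random() < sigmoid(vel) else 0  # stochastic bit assignment
        new_v.append(vel)
        new_x.append(bit)
    return new_x, new_v


# Usage with a seeded generator so that runs are repeatable.
rng = random.Random(0)
x, v = update_particle([0, 1, 0, 1], [0.0] * 4, [1, 1, 0, 0], [1, 0, 0, 1], rng=rng)
print(x, v)
```

Passing a seeded `random.Random` instance keeps the sketch repeatable; a full swarm would call `update_particle` for each of the N particles and refresh `pbest` and `gbest` after each generation.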


(i) Most methods predict operons based on the properties of adjacent genes, which they try to identify as either OP or NOP. However, this procedure does not take the properties of nearby genes into account, and thus generally results in lower accuracies for operon prediction. The BPSO used in this study evaluates the properties of all genes, and thereby increases the probability of finding an optimal solution. In order to raise the BPSO prediction performance, we set the inertia weight to 1, and limit the velocity of BPSO to between $V_{\min}$ and $V_{\max}$. If the velocity is close to 0, the probability of a state changing is increased, and vice versa. Hence, BPSO has global and local search capabilities. The probability of obtaining the best solution is thus increased.
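The link between velocity and the probability of a state change can be made explicit with the logistic mapping of the standard binary PSO (18). The worked values below assume bounds of $V_{\min} = -6$ and $V_{\max} = 6$, which are illustrative and not taken from this section.

```latex
S(v_{id}) = \frac{1}{1 + e^{-v_{id}}}, \qquad
x_{id} =
\begin{cases}
1 & \text{if } r < S(v_{id}),\\
0 & \text{otherwise,}
\end{cases}
\qquad r \sim U(0, 1)

S(0) = 0.5, \qquad S(6) \approx 0.9975, \qquad S(-6) \approx 0.0025
```

At a velocity of 0 each bit is set to 0 or 1 with equal probability, so the state is most likely to change; near either bound the bit is almost fixed, which favors local refinement.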

(ii) Operon prediction accuracy can be increased if better particles are selected in the initial step since the benefits of the initially superior particle are multiplied through the repeated updating process at each generation. In our study, the intergenic distance and the gene strand condition were evaluated in the initiation step. As shown in Table 2, we obtained a higher specificity and lower sensitivity when the initiation threshold was set to 300 bp. When the threshold was adjusted to 600 bp, the sensitivity was raised, but the specificity was reduced. A sensitivity and specificity value of higher than 0.8 represents a good balance between the two parameters (6). In order to obtain a good balance between sensitivity and specificity and increase the accuracy of operon prediction, proper settings at the initiation step are of critical importance. By boosting the quality of particles at the initiation, the best particles can be obtained by successive progression through the generations.
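The effect of the initiation threshold can be illustrated with a short sketch. It assumes a hypothetical encoding in which one bit per adjacent gene pair is set to 1 when both genes lie on the same strand and their intergenic distance is below the threshold, and it assumes the distance convention start of the downstream gene minus end of the upstream gene minus 1. The gene coordinates and reference labels are invented toy data, not results from the study.

```python
from typing import List, Sequence, Tuple

Gene = Tuple[int, int, str]  # (start, end, strand), 1-based inclusive coordinates


def intergenic_distance(first: Gene, second: Gene) -> int:
    """Bases between two adjacent genes; negative when the genes overlap."""
    return second[0] - first[1] - 1


def initial_bits(genes: Sequence[Gene], threshold_bp: int) -> List[int]:
    """One bit per adjacent pair: 1 if same strand and closer than threshold."""
    bits = []
    for a, b in zip(genes, genes[1:]):
        same_strand = a[2] == b[2]
        close = intergenic_distance(a, b) < threshold_bp
        bits.append(1 if same_strand and close else 0)
    return bits


def sensitivity_specificity(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


# Invented toy data: five genes, four adjacent pairs.
genes = [(100, 400, "+"), (450, 900, "+"), (1400, 2000, "+"),
         (2500, 3000, "+"), (3100, 3500, "-")]
truth = [1, 1, 0, 0]  # 1 = WO pair, 0 = TUB pair

for threshold in (300, 600):
    bits = initial_bits(genes, threshold)
    print(threshold, bits, sensitivity_specificity(bits, truth))
```

On this toy input the 300 bp threshold misses one true WO pair, while the 600 bp threshold recovers it at the cost of one false positive, the same direction of trade-off described above.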

(iii) Generally, the fitness value of a particle is proportional to the prediction accuracy. Although adjacent genes have related properties, they still differ in their probability of belonging to different operons. This necessitates the implementation of a fitness function in the proposed method. We calculate the fitness value of each particle based on the logarithmic likelihood ratio test since this method is designed on the basis of statistics. Therefore, the fitness value of a putative operon is directly proportional to the prediction accuracy. The experimental results indicate that this fitness function identifies better particles.
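As a rough illustration of a log-likelihood ratio score, the sketch below bins a numeric property (for example the intergenic distance) observed in training WO pairs and TUB pairs, and scores each bin by the natural logarithm of the ratio of the two relative frequencies. The bin width, the additive pseudocount and the function names are assumptions added so that the example is well defined; the exact formula and the way pair scores are combined into a particle fitness are not given in this section.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def llr_table(
    wo_values: Iterable[int],
    tub_values: Iterable[int],
    bin_width: int = 10,       # assumed bin size
    pseudocount: float = 1.0,  # assumed smoothing to avoid log(0)
) -> Dict[int, float]:
    """Map each bin index to ln(P(bin | WO) / P(bin | TUB))."""
    wo_bins = Counter(v // bin_width for v in wo_values)
    tub_bins = Counter(v // bin_width for v in tub_values)
    n_wo = sum(wo_bins.values())
    n_tub = sum(tub_bins.values())
    bins = set(wo_bins) | set(tub_bins)
    k = len(bins)
    table = {}
    for b in bins:
        p_wo = (wo_bins[b] + pseudocount) / (n_wo + pseudocount * k)
        p_tub = (tub_bins[b] + pseudocount) / (n_tub + pseudocount * k)
        table[b] = math.log(p_wo / p_tub)
    return table


def llr_score(value: int, table: Dict[int, float], bin_width: int = 10) -> float:
    """Score one observation; bins never seen in training score 0 (no evidence)."""
    return table.get(value // bin_width, 0.0)


# Invented training distances, for illustration only.
table = llr_table(wo_values=[5, 8, 12, 3], tub_values=[150, 200, 7, 180])
print(llr_score(6, table), llr_score(155, table))
```

A positive score means the value is more typical of WO pairs than of TUB pairs, a negative score the opposite; a fitness function could then aggregate such scores over the gene pairs of a putative operon.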

(iv) Experimental data on the E. coli genome can be downloaded from the RegulonDB database, but for other genomes extensive experimental data are not readily available. In order to apply the proposed method to other genomes with fewer attributes, only five common properties for operon prediction were used. Theoretically, methods using more properties for operon prediction achieve a higher accuracy. Some of the methods in Table 1 use numerous properties, yet our BPSO method only uses three such properties and still achieves better results. The simplicity of our method can thus be considered an advantage for operon prediction. When we used the five original properties to predict operons, the prediction accuracy did not improve, but the prediction time was increased (data not shown). Table 1 shows that the intergenic distance, homologous genes and pathway property are frequently used. ODB uses four properties for operon prediction, but the method suffers from a low prediction sensitivity (1). In addition, the WO pair and TUB pair performance of DVDA was <0.5 in the gene pair analyses performed, and the operon prediction performance based on the literature (5) was <0.2 based on the complete operons of E. coli and B. subtilis. We thus omitted the homologous gene property, and used two properties more suitable for identification of the WO and TUB pairs. The gene length ratio is used somewhat less frequently than other properties, but the literature (7) hints at the powerful identification ability of this property. Our method achieved the highest accuracy for operon prediction on all bacterial genomes tested even though it only uses three properties. The contributions to operon prediction are thus self-evident.
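With only five candidate properties, every non-empty subset (31 in total) can be compared directly. The sketch below shows such an exhaustive comparison; it is a generic illustration, not the feature selection procedure of the study, and the `evaluate` callback and the stand-in scorer are hypothetical placeholders for a real training-set accuracy.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

FEATURES = ("intergenic_distance", "metabolic_pathway", "cog",
            "gene_length_ratio", "operon_length")


def all_subsets(features: Sequence[str]) -> List[Tuple[str, ...]]:
    """Every non-empty subset, smallest subsets first."""
    return [s for k in range(1, len(features) + 1)
            for s in combinations(features, k)]


def best_subset(
    features: Sequence[str],
    evaluate: Callable[[Tuple[str, ...]], float],
) -> Tuple[Tuple[str, ...], float]:
    """Return the highest-scoring subset; ties go to the smaller subset."""
    best, best_score = None, float("-inf")
    for subset in all_subsets(features):
        score = evaluate(subset)
        if score > best_score:  # strict '>' keeps the earlier, smaller subset on ties
            best, best_score = subset, score
    return best, best_score


# Stand-in scorer for demonstration only: it simply counts how many members
# of an arbitrary 'useful' set are present, so extra features add nothing.
useful = {"intergenic_distance", "metabolic_pathway", "gene_length_ratio"}
subset, score = best_subset(FEATURES, lambda s: float(len(useful & set(s))))
print(len(all_subsets(FEATURES)), subset, score)
```

Preferring the smaller subset on ties reflects the observation above that extra properties can increase prediction time without improving accuracy.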

CONCLUSION

We propose a novel BPSO-based method for operon prediction in bacterial genomes. The intergenic distance and strand are applied at the initiation step, so that superior particles are used to initialize the population. We used the intergenic distance, metabolic pathway, COG gene functions, gene length ratio and the operon length of the E. coli genome for feature selection and designed a fitness function. Finally, BPSO was used to predict operons based on the intergenic distance, metabolic pathway and gene length ratio properties. The experimental results show that the proposed method not only increases the accuracy of operon prediction on the three genome data sets tested, but also reduces the computation time needed for the prediction. In the future, we intend to investigate different properties and other algorithms for operon prediction in order to increase the prediction performance further.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

The National Science Council in Taiwan (grants NSC96-2221-E-214-050-MY3, NSC96-2622-E-214-004-CC3 and NSC97-2622-E-151-008-CC2). Funding for open access charge: National Science Council in Taiwan (NSC96-2221-E-214-050-MY3).

Conflict of interest statement. None declared.

REFERENCES

1. Li,G., Che,D. and Xu,Y. (2009) A universal operon predictor for prokaryotic genomes. J. Bioinform. Comput. Biol., 7, 19–38.

2. Wang,S., Wang,Y., Du,W., Sun,F., Wang,X., Zhou,C. and Liang,Y. (2007) A multi-approaches-guided genetic algorithm with application to operon prediction. Artif. Intell. Med., 41, 151–159.

3. Wang,L., Trawick,J.D., Yamamoto,R. and Zamudio,C. (2004) Genome-wide operon prediction in Staphylococcus aureus. Nucleic Acids Res., 32, 3689–3702.


4. Jacob,E., Sasikumar,R. and Nair,K.N. (2005) A fuzzy guided genetic algorithm for operon prediction. Bioinformatics, 21, 1403–1407.

5. Brouwer,R.W., Kuipers,O.P. and van Hijum,S.A. (2008) The relative value of operon predictions. Brief Bioinform., 9, 367–375.

6. Zhang,G.Q., Cao,Z.W., Luo,Q.M., Cai,Y.D. and Li,Y.X. (2006) Operon prediction based on SVM. Comput. Biol. Chem., 30, 233–240.

7. Dam,P., Olman,V., Harris,K., Su,Z. and Xu,Y. (2007) Operon prediction using both genome-specific and general genomic information. Nucleic Acids Res., 35, 288–298.

8. Edwards,M.T., Rison,S.C., Stoker,N.G. and Wernisch,L. (2005) A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context. Nucleic Acids Res., 33, 3253–3262.

9. Okuda,S., Katayama,T., Kawashima,S., Goto,S. and Kanehisa,M. (2006) ODB: a database of operons accumulating known operons across multiple genomes. Nucleic Acids Res., 34, D358–D362.

10. Westover,B.P., Buhler,J.D., Sonnenburg,J.L. and Gordon,J.I. (2005) Operon prediction without a training set. Bioinformatics, 21, 880–888.

11. Ermolaeva,M.D., White,O. and Salzberg,S.L. (2001) Prediction of operons in microbial genomes. Nucleic Acids Res., 29, 1216–1221.

12. Chen,X., Su,Z., Dam,P., Palenik,B., Xu,Y. and Jiang,T. (2004) Operon prediction by comparative genomics: an application to the Synechococcus sp. WH8102 genome. Nucleic Acids Res., 32, 2147–2157.

13. Price,M.N., Huang,K.H., Alm,E.J. and Arkin,A.P. (2005) A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res., 33, 880–892.

14. Gama-Castro,S., Jimenez-Jacinto,V., Peralta-Gil,M., Santos-Zavaleta,A., Penaloza-Spinola,M., Contreras-Moreira,B., Segura-Salazar,J., Muniz-Rascado,L., Martinez-Flores,I. and Salgado,H. (2007) RegulonDB (version 6.0): gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res., 36, D120–D124.

15. Sierro,N., Makita,Y., de Hoon,M. and Nakai,K. (2008) DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res., 36, D93–D96.

16. Sabatti,C., Rohlin,L., Oh,M.K. and Liao,J.C. (2002) Co-expression pattern from DNA microarray experiments as a tool for operon prediction. Nucleic Acids Res., 30, 2886–2893.

17. Kennedy,J. and Eberhart,R. (1995) Particle swarm optimization. Proceedings of the IEEE International Conference on Neural Networks, Vol. 4, pp. 1942–1948.

18. Kennedy,J. and Eberhart,R. (1997) A discrete binary version of the particle swarm algorithm. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Vol. 5, pp. 4104–4108.

19. Crammer,K. and Singer,Y. (2002) On the learnability and design of output codes for multiclass problems. Machine Learn., 47, 201–233.

20. Salgado,H., Moreno-Hagelsieb,G., Smith,T.F. and Collado-Vides,J. (2000) Operons in Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA, 97, 6652–6657.

21. Yan,Y. and Moult,J. (2006) Detection of operons. Proteins, 64, 615–628.

22. Romero,P.R. and Karp,P.D. (2004) Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics, 20, 709–717.

23. Tran,T.T., Dam,P., Su,Z., Poole,F.L. II, Adams,M.W., Zhou,G.T. and Xu,Y. (2007) Operon prediction in Pyrococcus furiosus. Nucleic Acids Res., 35, 11–20.

24. Bockhorst,J., Craven,M., Page,D., Shavlik,J. and Glasner,J. (2003) A Bayesian network approach to operon prediction. Bioinformatics, 19, 1227–1235.

25. De Hoon,M.J., Imoto,S., Kobayashi,K., Ogasawara,N. and Miyano,S. (2004) Predicting the operon structure of Bacillus subtilis using operon length, intergene distance, and gene expression information. Pac. Symp. Biocomput., 276–287.

26. Kennedy,J., Eberhart,R. and Shi,Y. (2001) Swarm Intelligence. Springer, New York.

27. Roback,P., Beard,J., Baumann,D., Gille,C., Henry,K., Krohn,S., Wiste,H., Voskuil,M.I., Rainville,C. and Rutherford,R. (2007) A predicted operon map for Mycobacterium tuberculosis. Nucleic Acids Res., 35, 5085–5095.

28. Moreno-Hagelsieb,G. and Collado-Vides,J. (2002) A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics, 18, S329–S336.
