Hui-Ling Huang*1, Yi-Fan Liou*1, Hua-Chin Lee*1, Wen-Lin Huang*2, and Shinn-Ying Ho*1,*3
*1 Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
*2 Department of Management Information System, Asia Pacific Institute of Creativity, Miaoli, Taiwan
*3 Corresponding author: 886-3-571-2121, ext: 56909; e-mail: syho@mail.nctu.edu.tw
Abstract— Bioluminescence proteins are becoming increasingly important in a variety of research fields, such as in situ imaging and the study of protein-protein interactions in vivo, and an increased spectral variety of bioluminescent reporters is needed for further progress. The existing method BLProt uses a support vector machine (SVM) and physicochemical properties to predict bioluminescence proteins. BLProt identified the most prominent features using three filter approaches, ReliefF, infogain, and mRMR, and utilized 100 features to achieve a training accuracy of 80% and a test accuracy of 80.06%.
Physicochemical properties are well recognized to be effective in designing various predictors for understanding the functions and characteristics of proteins. In this study, we propose an efficient method for designing predictors of bioluminescence proteins using a small set of informative physicochemical properties obtained by using an inheritable bi-objective genetic algorithm.
The benchmark datasets were used to evaluate the proposed method, which uses SVM with the informative physicochemical properties as features. The prediction accuracy on the independent test set is 81.79% using 15 properties. Analysis of the informative physicochemical properties reveals some knowledge about bioluminescence proteins. The proposed physicochemical property mining method can conveniently serve as the core for designing predictors for various types of bioluminescence-related problems.
Keywords — Bioluminescent protein, genetic algorithm, SVM, physicochemical properties, prediction
I. INTRODUCTION
Bioluminescence is a light-producing process. The two basic factors in this process are luciferase and luciferin, which are the catalytic enzyme and its substrate, respectively. Because of its unusual characteristics, bioluminescence is actively studied at all levels, from naturalists to photochemists. The visible light generated by luciferase is emitted at room temperature, whereas light is otherwise often generated at extremely high temperatures that cause violent oxidation of some objects. The actual emission of bioluminescence is the extremely rapid final step of a usually multistep reaction. Most often, the excited state of luciferin is produced by an electron or a photon [1].
Bioluminescence provides an ideal tool to solve scientific problems. Previous studies [2] are already renowned for the preparation and application of an extended series of ratiometric ion-sensitive indicators and a number of sophisticated reporter molecules based on fluorescence resonance energy transfer (FRET). To generate genetically encoded FRET probes that are suitable for ratiometric measurements, more fluorophores need to be discovered or generated.
Although the biological functions of bioluminescence proteins are quite alike, these proteins do not share strong homology. Many organisms use different proteins with different mechanisms to generate light [3]. Bioluminescence proteins are becoming increasingly important in a variety of research fields, such as in situ imaging and the study of protein-protein interactions in vivo, and an increased spectral variety of bioluminescent reporters is needed for further progress.
Besides their bioluminescent character, these proteins have other interesting characteristics. First, luciferins are extremely hydrophobic macromolecules. To catalyze these molecules, the catalytic sites must differ considerably in order to tune the catalytic orientation between the enzymes and their substrates. Second, the bioluminescent light in some living organisms, such as the firefly, is regulated. GFP does not have a prominent regulatory structure like the C-terminal ball-and-chain structure of voltage-dependent gated channels in neurons, but some regulation mechanisms still exist for this purpose [4]. Third, bioluminescence proteins do not share homology, yet they have a similar function.
Understanding physicochemical properties of the bioluminescence proteins may help improve the applications of bioluminescence proteins.
Kandaswamy et al. [5] proposed an accurate prediction method, BLProt, that uses a support vector machine (SVM) and physicochemical properties to predict bioluminescence proteins. BLProt used a training dataset consisting of 300 bioluminescence proteins and 300 non-bioluminescence proteins, and an independent test dataset consisting of 141 bioluminescence proteins and 18202 non-bioluminescence proteins. To identify the most prominent features, they carried out feature selection with three different filter approaches: ReliefF, infogain, and mRMR. In designing accurate prediction methods, the major concern is to identify feature vectors with high discrimination ability for classifying positive and negative samples. Their feature selection method suffers from a large set of candidate features.
We investigate the optimal design of predictors for bioluminescence proteins from amino acid sequences using both informative features and an appropriate classifier.
Furthermore, we obtain a set of informative physicochemical properties which can advance prediction performance.
Physicochemical properties extracted from protein sequences have been utilized as effective features in recent years. Our previous work, Auto-IDPCPs [6], is an SVM-based classifier with automatic feature selection from a large set of physicochemical composition features for predicting DNA-binding domains and proteins. The POPI method used physicochemical properties as efficient features to predict peptide immunogenicity [7]. The prediction method UbiPred [8] mined informative physicochemical properties from protein sequences to identify promising ubiquitylation sites.
The informative physicochemical properties selected from the amino acid indices in this study were used as features in designing SVM classifiers. An efficient inheritable bi-objective genetic algorithm (IBCGA) was used to select significant features that discriminate between the two classes of proteins.
The feature sets selected by IBCGA were analyzed carefully to reveal the fundamental differences that exist between bioluminescence proteins and non-bioluminescence proteins. In conclusion, we propose a novel prediction method that combines the informative physicochemical properties of amino acids with SVM to solve the prediction problem of bioluminescence proteins.
II. METHOD
We propose a novel method using the physicochemical properties for predicting bioluminescence proteins (PBLP).
The effective feature set of physicochemical properties is identified mainly by using an inheritable bi-objective genetic algorithm (IBCGA) [9]. The IBCGA mines informative physicochemical properties and tunes the parameter settings of SVM simultaneously while maximizing the 5-fold cross-validation (5-CV) accuracy.
A. Datasets
The bioluminescence proteins (BLPs) extracted from the Pfam database by Martinetz et al. are used as the seed proteins of BLPs. To enrich the dataset, PSI-BLAST with a stringent threshold (E-value 0.01) is carried out to search against the non-redundant sequence database. Then, CD-HIT is performed to remove the sequences with identity >= 40% in the collected dataset. In total, 441 bioluminescence proteins are kept as the positive dataset. The statistics of the training and test sets are shown in Table 1.
From the 441 positive samples, 300 BLPs are randomly selected to serve as training samples, and the others serve as test samples. Likewise, 300 non-BLPs are randomly picked from the seed proteins of Pfam protein families. These proteins, which serve as negative samples, are unrelated to BLPs.
The negative test dataset is composed of the seed proteins of non-BLP Pfam protein families. All sequences contained in the training dataset that have fewer than 40 residues are removed. Finally, the test dataset is composed of 141 BLPs and 18202 non-BLPs.
Table 1. Statistics of the training and test sets.
Dataset     Number of BLPs    Number of non-BLPs
Training    300               300
Test        141               18202
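The random division of the 441 positive samples into 300 training and 141 test proteins can be sketched as follows. This is only an illustration under stated assumptions: the identifiers in `blp_ids`, the helper name `split_positives`, and the fixed seed are hypothetical and not taken from the original study.

```python
import random


def split_positives(ids, n_train=300, seed=0):
    """Randomly split sample identifiers into a training and a test subset."""
    rng = random.Random(seed)   # fixed seed only to make the sketch repeatable
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical identifiers standing in for the 441 collected BLP sequences.
blp_ids = [f"BLP_{i:03d}" for i in range(441)]
train_ids, test_ids = split_positives(blp_ids)
print(len(train_ids), len(test_ids))
```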
B. Support Vector Machine
The support vector machine (SVM) is a learning model for binary classification problems. SVM constructs a binary classifier by finding a hyperplane that separates two classes with a maximal distance between the margins of the two classes, which consist of support vectors. To make linear separation of samples easier, SVM uses one of various kernel functions to transform the samples into a high-dimensional search space. In this work, the commonly used radial basis function is applied to nonlinearly transform the feature space, defined as follows:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$
where x_i and x_j are feature vectors and the kernel parameter γ>0 determines how the samples are transformed into a high-dimensional search space. The cost parameter C>0 of SVM adjusts the penalty of the total error.
The two parameters C and γ must be tuned to get the best prediction performance.
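As a small numerical illustration of the radial basis function kernel in its standard form, K(x, y) = exp(-γ‖x − y‖²), the following sketch evaluates the kernel for two feature vectors. The function name `rbf_kernel` is chosen here for illustration and is not part of LIBSVM.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give 1; the value decays toward 0 as the vectors move apart.
print(rbf_kernel([0.2, -0.5], [0.2, -0.5], gamma=0.5))
print(rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0))
```

A larger γ makes the kernel value fall off more quickly with distance, which is why γ has to be tuned together with C.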
For multi-class classification problems, the ‘one-against-one’ strategy is applied to transform the multi-class problem into several binary classification problems. Given h classes, h(h−1)/2 classifiers are constructed, each trained on the samples from two classes. A voting strategy is applied to give a final prediction for test samples. In this study, h=2 and the SVM used is obtained from the LIBSVM package, version 2.81 [10].
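The counting and voting behind the one-against-one strategy can be sketched in a few lines. This is a generic illustration, not LIBSVM's internal code; the helper names `pairwise_problems` and `vote` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_problems(classes):
    """List every pair of classes; there are h(h-1)/2 pairs for h classes."""
    return list(combinations(classes, 2))


def vote(pair_predictions):
    """Return the class label predicted most often by the pairwise classifiers."""
    return Counter(pair_predictions).most_common(1)[0][0]


# With h = 2 (BLP vs. non-BLP) only a single binary classifier is needed.
print(len(pairwise_problems(["BLP", "non-BLP"])))
# With h = 4 there would be 4 * 3 / 2 = 6 classifiers.
print(len(pairwise_problems(["a", "b", "c", "d"])))
```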
C. Inheritable Bi-objective Genetic Algorithm
Selecting a minimal number of informative features while maximizing prediction accuracy is a bi-objective 0/1 combinatorial optimization problem. An efficient inheritable bi-objective genetic algorithm (IBCGA) [11] is utilized to solve this optimization problem. IBCGA consists of an intelligent genetic algorithm [12] with an inheritable mechanism. The intelligent genetic algorithm uses a divide-and-conquer strategy and an orthogonal array crossover to efficiently solve large-scale parameter optimization problems. In this study, the intelligent genetic algorithm can efficiently explore and exploit the search space of C(n, r). IBCGA can efficiently search the space of C(n, r±1) by inheriting a good solution in the space of C(n, r) [11]. Therefore, IBCGA can economically obtain a complete set of high-quality solutions in a single run, where r is specified in a range of interest such as [5, 20].
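To give a feeling for why exhaustive search is impractical, the size of the search space C(n, r) can be computed directly. The sketch below assumes n = 531 candidate properties, the value used in the chromosome encoding of this paper; the helper name `search_space_size` is illustrative.

```python
from math import comb


def search_space_size(n, r):
    """Number of distinct feature subsets of size r drawn from n candidates."""
    return comb(n, r)


n_candidates = 531  # number of candidate physicochemical properties
for r in (5, 10, 15, 20):
    print(r, search_space_size(n_candidates, r))
```

The count grows very rapidly with r over the range [5, 20], which motivates reusing a good solution of size r when searching for one of a neighboring size.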
The proposed chromosome encoding scheme of IBCGA consists of both binary genes for feature selection and parametric genes for tuning SVM parameters. Because gene and chromosome are commonly used terms of the genetic algorithm (GA), they are named GA-gene and GA-chromosome in this paper for discrimination. The GA-chromosome consists of n=531 binary GA-genes b_i for selecting informative properties and two 4-bit GA-genes for tuning the parameters C and γ of SVM. If b_i=0, the ith property is excluded from the SVM classifier; otherwise, the ith property is included. This encoding method maps the 16 values of γ and C into {2^-7, 2^-6, …, 2^8}.
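A minimal sketch of how such a GA-chromosome could be decoded is given below. It assumes, for illustration only, that the 531 binary GA-genes come first, followed by the 4-bit gene for C and then the 4-bit gene for γ, and that each 4-bit gene is read as an unsigned integer k in 0..15 that maps to 2^(k-7). The gene order and bit order are assumptions not stated in the text.

```python
N_PROPERTIES = 531  # number of binary GA-genes


def decode_4bit(bits):
    """Map a 4-bit gene (most significant bit first) to 2**-7 ... 2**8."""
    if len(bits) != 4:
        raise ValueError("expected exactly 4 bits")
    k = int("".join(str(b) for b in bits), 2)  # integer in 0..15
    return 2.0 ** (k - 7)


def decode_chromosome(chromosome):
    """Return (selected property indices, C, gamma) from a GA-chromosome."""
    if len(chromosome) != N_PROPERTIES + 8:
        raise ValueError("unexpected chromosome length")
    selected = [i for i, b in enumerate(chromosome[:N_PROPERTIES]) if b == 1]
    cost = decode_4bit(chromosome[N_PROPERTIES:N_PROPERTIES + 4])
    gamma = decode_4bit(chromosome[N_PROPERTIES + 4:])
    return selected, cost, gamma


# Toy chromosome: properties 0 and 530 selected, C gene 1111, gamma gene 0000.
toy = [0] * N_PROPERTIES
toy[0] = toy[530] = 1
toy += [1, 1, 1, 1] + [0, 0, 0, 0]
print(decode_chromosome(toy))
```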
The feature vector for training the SVM classifier is obtained by decoding a GA-chromosome using the following steps. Consider a given protein sequence. First, the index vectors for all selected physicochemical properties are constructed from AAindex for each amino acid. The feature vector of the sequence then consists of the selected features, whose values are obtained by averaging the values in their corresponding index vectors. Finally, all values of the feature vectors are normalized into [-1, 1] for applying SVM.
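The averaging and normalization steps described above can be sketched as follows. The property values in `toy_index` are invented for illustration and are not real AAindex entries; per-feature min-max scaling over the sample set is assumed as the way of normalizing into [-1, 1], since the text does not specify the exact procedure.

```python
def average_property(sequence, index):
    """Average one amino acid index over all residues of a sequence."""
    values = [index[aa] for aa in sequence if aa in index]
    if not values:
        raise ValueError("no residue of the sequence is covered by the index")
    return sum(values) / len(values)


def scale_columns(rows):
    """Min-max scale each feature (column) of a sample matrix into [-1, 1]."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        if high == low:
            scaled_columns.append([0.0] * len(column))  # constant feature
        else:
            scaled_columns.append([2 * (v - low) / (high - low) - 1 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


# Hypothetical index values for three amino acids (illustration only).
toy_index = {"A": 1.0, "G": 0.0, "L": 2.0}
sequences = ["GGG", "AGL", "LLL"]
features = [[average_property(s, toy_index)] for s in sequences]
print(scale_columns(features))
```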
The fitness function is the only guide for IBCGA to obtain desirable solutions. The fitness function of IBCGA is the 5-CV overall accuracy. IBCGA with the fitness function f(X) can simultaneously obtain a set of solutions, X_r, where r = r_start, r_start+1, …, r_end, in a single run. The algorithm of IBCGA with the given values r_start and r_end is described as follows:
Step 1) (Initiation) Randomly generate an initial population of N_pop individuals. All the n binary GA-genes have r 1’s and n-r 0’s, where r = r_start.
Step 2) (Evaluation) Evaluate the fitness values of all individuals using f(X).
Step 3) (Selection) Use the traditional tournament selection that selects the winner from two randomly selected individuals to form a mating pool.
Step 4) (Crossover) Select p_c·N_pop parents from the mating pool […]
Step 5) (Mutation) […] To prevent the best fitness value from deteriorating, mutation is not applied to the best individual.
Step 6) (Termination test) If the stopping condition for […] the binary GA-genes for each individual from 0 to 1; increase the number r by one, and go to Step 2). Otherwise, stop the algorithm.
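Three of the steps above lend themselves to a compact sketch: initialization with exactly r selected features, binary tournament selection, and the final step that raises r by one. Crossover and mutation are omitted. The helper names are hypothetical, and the assumption that the last step switches a single randomly chosen 0-gene to 1 is an interpretation of the text, not a confirmed detail of IBCGA.

```python
import random


def init_individual(n, r, rng):
    """Create a binary vector of length n with exactly r ones."""
    ones = set(rng.sample(range(n), r))
    return [1 if i in ones else 0 for i in range(n)]


def tournament(population, fitness, rng):
    """Binary tournament: pick two individuals at random, keep the fitter one."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if fitness[a] >= fitness[b] else population[b]


def inherit(individual, rng):
    """Assumed inheritance step: flip one random 0-gene to 1, so r grows by one."""
    zeros = [i for i, bit in enumerate(individual) if bit == 0]
    child = list(individual)
    if zeros:
        child[rng.choice(zeros)] = 1
    return child


rng = random.Random(1)
parent = init_individual(531, 5, rng)
child = inherit(parent, rng)
print(sum(parent), sum(child))
```

Because the child keeps every feature of its parent, a good solution of size r serves directly as the starting point for the search at size r+1.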
D. Prediction Method PBLP
The m selected physicochemical properties and the associated SVM parameter set obtained by PBLP are used to implement the computational system and to analyze the physicochemical properties for a further understanding of BLPs.
Since PBLP is a non-deterministic method, additional effort is required to identify an efficient and robust feature set of informative physicochemical properties. The procedure consists of the following steps:
Step 1: Prepare the independent data sets, each of which is used as the training data set of 5-CV.
Step 2: Perform R independent runs of PBLP on each of the independent data sets. In this study, R = 30, giving a total of 30 sets of m physicochemical properties for each independent data set.
Step 3: Choose the set of selected physicochemical properties with the maximal accuracy.
PBLP automatically determines a set of informative physicochemical properties and an SVM model for predicting bioluminescence and non-bioluminescence proteins.
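Step 3 of the procedure amounts to keeping the run with the highest accuracy among the R independent runs. A minimal sketch is shown below; the run records, their field names, and all numbers are invented for illustration and do not reproduce results from this study.

```python
def select_best_run(runs):
    """Return the run record with the highest cross-validation accuracy."""
    return max(runs, key=lambda run: run["cv_accuracy"])


# Hypothetical outcomes of three independent runs (illustration only).
runs = [
    {"run": 1, "cv_accuracy": 0.80, "features": [3, 17, 42]},
    {"run": 2, "cv_accuracy": 0.83, "features": [3, 21, 42]},
    {"run": 3, "cv_accuracy": 0.81, "features": [5, 17, 99]},
]
best = select_best_run(runs)
print(best["run"], best["features"])
```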
III. RESULTS
A. Results of training and test datasets
The training data set contains 300 positive and 300 negative samples. The sequence similarity within the training data set is smaller than 40%. We performed 30 independent runs of PBLP to select a robust feature set that improves the performance of the SVM classifier in discriminating the two classes of proteins. The highest training accuracy of the 30 PBLP runs was 84.11%, and its corresponding test accuracy was 81.79% (Table 2).
Table 2. Results of the training and independent test by BLProt and PBLP.
B. Selection of a small set of physicochemical properties
The quantified effectiveness of individual physicochemical properties on prediction is useful to characterize the PBLP mechanism by physicochemical properties. Orthogonal experimental design with factor analysis can be used to estimate the individual effects of physicochemical properties according to the value of main effect difference (MED) [7, 12].
The property with the largest value of MED is the most effective in predicting BLPs.
According to MED, the 15 informative properties are ranked and their descriptions are shown in Table 3 and Fig. 1.
The most effective property with MED=16.16668 is RACS820111 denoting “Differential geometry and polymer conformation. Conformational and nucleation properties of individual amino acids”.
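The text does not spell out how MED is computed. One common formulation for two-level orthogonal experimental designs, assumed here, takes for each factor the absolute difference between the summed responses of the experiments in which the factor is included and those in which it is excluded. The design matrix and response values below are toy numbers, not data from this study.

```python
def main_effect_differences(design, responses):
    """MED per factor: |sum of responses with factor on - sum with factor off|."""
    n_factors = len(design[0])
    meds = []
    for j in range(n_factors):
        included = sum(y for row, y in zip(design, responses) if row[j] == 1)
        excluded = sum(y for row, y in zip(design, responses) if row[j] == 0)
        meds.append(abs(included - excluded))
    return meds


# Two-level orthogonal array with 4 experiments and 3 factors (1 = property used).
design = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
responses = [70, 72, 80, 82]  # hypothetical accuracies (%) of the 4 experiments
print(main_effect_differences(design, responses))
```

In this toy case the first factor has the largest MED, so it would be ranked as the most effective property.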
Table 3. The selected m = 15 feature set with the highest accuracy.
Feature ID   AAindex ID    Description
[…]          […]           Positional flexibilities of amino acid residues in globular proteins
13           BROC820102    The isolation of peptides by high-performance liquid chromatography using predicted elution positions
18           BUNA790103    1H-nmr parameters of the common amino acid residues measured in aqueous solutions of the linear tetrapeptides H-Gly-Gly-X-L-Ala-OH
95           FINA910104    Physical reasons for secondary structure stability: alpha-helices in short peptides
107          GEIM800111    Amino acid preferences for secondary structure vary with protein class
202          NAKH920101    The amino acid composition is different between the cytoplasmic and extracellular sides in membrane proteins
223          PALJ810101    Protein secondary structure
310          RACS820111    Differential geometry and polymer conformation. 4. Conformational and nucleation properties of individual amino acids
380          VENT840101    Hydrophobicity parameters and the bitter taste of L-amino acids
439          PARS000102    Protein thermal stability: insights from atomic displacement parameters (B values)
473          MITS020101    Amphiphilicity index of polar amino acids as an aid in the characterization of amino acid preference at membrane-water interfaces
475          TSAJ990102    The packing density in proteins: standard radii and volumes
489          PUNT030101    A knowledge-based scale for amino acid membrane propensity
491          GEOR030101    An analysis of protein domain linkers: their classification and role in protein folding
502          ZHOH040103    Quantifying the effect of burial of amino acid residues on protein stability
IV. DISCUSSION
The merits of the proposed method are twofold: 1) a small set of informative physicochemical properties is identified for predicting bioluminescence proteins with promising accuracy, and 2) this small set of informative physicochemical properties is more easily interpretable. The existing method BLProt, which uses 100 features and has a test accuracy of 80.06%, has been shown to be more accurate than BLAST and HMM. The proposed method PBLP achieves a higher test accuracy of 81.79% using only 15 physicochemical properties for predicting bioluminescence proteins.
The feature sets identified in the 30 independent runs of PBLP are very robust. The appearance frequency of each identified cluster in the 30 runs is shown in Fig. 3. According to these statistics, clusters 7, 9, 10 and 16, which have very high selection frequencies, are the most informative for predicting bioluminescence proteins. The clusters selected in the 30 runs are very similar in terms of cluster ID among the 20 clusters. The most effective property, RACS820111, belongs to the 10th cluster, with Beta propensity in six groups.
PBLP is an efficient approach to selecting informative physicochemical properties for an SVM classifier. With the IBCGA-selected features, the prediction accuracy of our method is better than that of the existing method. This method can also be applied to other sequence-based prediction problems.
Figure 1. The rank of the selected feature set with the highest training accuracy is analyzed by MED analysis.
REFERENCES
[1] Wilson T. 1995. Comments on the mechanisms of chemi- and bioluminescence. Photochem. Photobiol. 62:601–6
[2] Heim, R., and Tsien, R.Y. (1996). Engineering green fluorescent protein for improved brightness, longer wavelengths and fluorescence resonance energy transfer. Curr. Biol. 6, 178–182.
[3] Cubitt AB, Heim R, Adams SR, Boyd AE, Gross LA, Tsien RY. 1995. Understanding, improving and using green fluorescent proteins. Trends Biochem. Sci. 20:448–55
[4] Johnson CH, Knight MR, Kondo T, Masson P, Sedbrook J, et al. 1995. Circadian oscillations of cytosolic and chloroplastic free calcium in plants. Science 269:1863–65
[5] K. K. Kandaswamy, G. Pugalenthi, M. K. Hazrati, K.-U. Kalies, and T. Martinetz. BLProt: Prediction of bioluminescent proteins based on Support Vector Machine and Relief feature selection. BMC Bioinformatics, 2011.
[6] Huang, H.-L., Lin, I.-C., Liou, Y.-F., Tsai, C.-T., et al., Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. BMC Bioinformatics 2011, 12 Suppl 1.
[7] Chun-Wei Tung and Shinn-Ying Ho, “POPI: predicting immunogenicity of MHC class I binding peptides by mining informative physicochemical properties,” Bioinformatics, vol. 23, no. 8, pp. 942–949, 2007.
[8] Chun-Wei Tung and Shinn-Ying Ho, “Computational identification of ubiquitylation sites from protein sequences,” BMC Bioinformatics, vol. 9:310, July 2008.
[9] J. R. Quinlan. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[10] C. C. Chang and C. J. Lin (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[11] S.-Y. Ho, et al., “Inheritable genetic algorithm for bi-objective 0/1 combinatorial optimization problems and its applications,” IEEE Trans. Syst. Man Cybern. Part B-Cybern., vol. 34, pp. 609-620, 2004.
[12] Ho, S.Y., Shu, L.S., Chen, J.H. 2004. Intelligent evolutionary algorithms for large parameter optimization problems. IEEE Transactions on Evolutionary Computation 8, 522–541