Methods - A Method for Predicting B-Cell Epitopes Based on Protein Free Energy

Chapter 2 Manuscript

2.2 Methods

Dataset

B-cell epitopes were selected from the Bcipep database [12]. We used a total of 200 epitopes distributed across 145 protein sequences for training and testing our classification methods. For energy estimation, we retrieved antigenic sequences from UniProt based on the accession number associated with each source antigen [29]. Non-epitope peptides were generated by randomly extracting segments from the 145 protein sequences while ensuring that none of them was present in the epitope data set. This approach only approximates true non-epitope sequences, because the proteins involved have not been fully mapped for epitopes. To avoid classification bias resulting from the choice of non-epitope peptides, I generated a pool of 1000 non-epitope peptides. For each epitope in the training or testing dataset, a non-epitope peptide of the same length was randomly selected from this pool.
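
The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the retry limit, the fixed random seed, and the choice to draw segment lengths from a user-supplied list are all assumptions.

```python
import random


def sample_non_epitopes(proteins, epitopes, lengths, n, seed=0):
    """Draw up to n distinct random segments that are not known epitopes.

    proteins: list of antigen sequences (strings)
    epitopes: list of epitope peptides to exclude (exact matches only)
    lengths:  candidate segment lengths (assumption: taken from the epitopes)
    """
    rng = random.Random(seed)
    epitope_set = set(epitopes)
    pool = set()
    attempts = 0
    # The retry limit is a hypothetical safeguard against endless loops.
    while len(pool) < n and attempts < 100 * n:
        attempts += 1
        seq = rng.choice(proteins)
        k = rng.choice(lengths)
        if k > len(seq):
            continue
        start = rng.randrange(len(seq) - k + 1)
        segment = seq[start:start + k]
        if segment not in epitope_set:
            pool.add(segment)
    return sorted(pool)


def match_by_length(epitope, pool, rng):
    """Pick a random non-epitope of the same length, or None if absent."""
    candidates = [p for p in pool if len(p) == len(epitope)]
    return rng.choice(candidates) if candidates else None
```

Pairing each epitope with a non-epitope of equal length keeps peptide length from becoming a feature that trivially separates the two classes.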

Blind dataset

To evaluate the energy-related features on an independent dataset, a set of 85 epitopes distributed across 45 protein sequences was retrieved from the AntiJen database [27]. As a reference, a set of 100 non-epitope peptides was randomly selected from the protein sequences with the criterion that the selected non-epitopes were not the same as any one of the epitopes.

Dataset for current continuous B-cell epitope prediction methods

A number of current B-cell epitope prediction methods, such as BCPred, ABCPred, and the AAP method, require input peptides of specific lengths [11, 15, 16]. These methods demonstrated superior performance with peptides of 16 amino acids in length. To compare the performance of the energy-related features to that of current prediction methods, the peptides in the aforementioned datasets were fixed to 16 amino acids in length. If a peptide is shorter than 16 amino acids, it is extended by adding an equal number of amino acids to both ends based on the protein sequence of the source antigen. If a peptide is longer than 16 amino acids, it is shortened by trimming amino acids from both ends.
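
A minimal sketch of this length adjustment is given below. The function name `fix_length`, the handling of odd differences (the extra residue goes to the right end when extending), and the behavior near the antigen termini (the window is shifted inward) are assumptions; the text does not specify them.

```python
def fix_length(antigen, start, end, target=16):
    """Return a window of `target` residues centred on antigen[start:end].

    A positive difference extends the peptide on both sides using the
    antigen sequence; a negative difference trims it from both ends.
    """
    diff = target - (end - start)
    add_left = diff // 2           # floor division also works for trimming
    add_right = diff - add_left
    new_start = start - add_left
    new_end = end + add_right
    # Assumption: near a terminus, shift the window instead of padding.
    if new_start < 0:
        new_end -= new_start
        new_start = 0
    if new_end > len(antigen):
        new_start = max(0, new_start - (new_end - len(antigen)))
        new_end = len(antigen)
    return antigen[new_start:new_end]
```

For example, a 4-residue peptide at positions 10 to 13 of a 26-residue antigen gains six residues on each side, while a 20-residue peptide loses two from each end.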

Energy estimation

Protein structure modeling was performed using SWISS-MODEL [30-32], an automated protein structure homology-modeling server. First, antigenic protein sequences were submitted to identify known protein structures that resemble the structure of the antigen. Antigenic sequences for which we did not find suitable templates in the structural database were excluded from our data set. For sequences with suitable templates, the template with the lowest E-value was chosen. Subsequently, the template ID and antigenic protein sequence were submitted to the server for homology modeling, and the resultant PDB file was saved locally for further operations.

Mutations in the PDB structures were introduced using Deepview [30], a molecular visualization package designed to interact with the SWISS-MODEL server, and each point-mutated structure was energy minimized and equilibrated using the Deepview implementation of the GROMOS96 force field [33]. Each site in the antigenic structure was mutated to each of the 20 naturally occurring amino acids. Following energy minimization, the protein free energy (FE) associated with a particular point-mutated structure was recorded. Thus, a total of 20L FE values were obtained for an antigen with sequence length L. Note that we minimized FE with respect to bond lengths, bond angles, torsion energies, and improper angles. While minimizing non-bonded interaction energy would certainly yield more accurate energy estimates, the search for a global energy minimum is computationally expensive, given the vast number of conformational variants analyzed in this study. Therefore, throughout this analysis, we assume that the conformation of a stable protein resulting from a single-site substitution closely resembles that of the parent protein, with any tertiary structural changes localized in the neighborhood of the substitution.

As a consequence of protein energy estimation, epitopes in the data set were selected based on the following conditions: (i) the accession number of source antigen is provided, (ii) there exists a template structure for the antigenic sequence, and (iii) the antigen-template alignment provides information about the spatial arrangement of the epitope.

Figure 2-1 – System control flow for energy-related features.


Energy features

Three types of energy features were proposed in this study: FE_avg, FE_diff, and FE_ss. I used Deepview [30] to calculate the FE used to generate the three types of features. For a sequence of length L, a total of 20L FE values were determined. Based on the 20 FE values associated with the 20 possible point mutations at a particular site, the minimum FE was determined and assigned to that site. As a result, an amino acid sequence is transformed into a numerical sequence, with each value representing the optimal stability that results from inducing a point mutation at the corresponding site. The minimum FE of each sequence is then subtracted from all FE values in that sequence. Given a peptide whose length is defined by a window size w, a 1D FE_avg feature was constructed by averaging the FE associated with each amino acid, denoted as $FE_i$, where $i = 1, \dots, w$.

$$FE^{1D}_{avg} = \frac{1}{w} \sum_{i=1}^{w} FE_i \qquad (2.1)$$
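
The construction of the per-site FE profile and the 1D FE_avg feature can be sketched as follows. This is an illustration under stated assumptions: the input is taken to be a list with one row per site holding the FE values of the point mutants at that site, and the function names are hypothetical.

```python
def site_fe_profile(fe_matrix):
    """Collapse per-mutant FE values into one shifted value per site.

    fe_matrix[i] holds the FE values obtained for the point mutations
    at site i (20 values per site in the study).
    """
    site_min = [min(row) for row in fe_matrix]    # most stable mutant per site
    baseline = min(site_min)                      # minimum FE of the sequence
    return [value - baseline for value in site_min]


def fe_avg_1d(profile, start, w):
    """Average the shifted FE over the window profile[start:start + w]."""
    window = profile[start:start + w]
    return sum(window) / len(window)
```

After the shift, the most stable site of each sequence has FE equal to zero, which makes profiles from different antigens comparable on a common baseline.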

I used w of 6, 10, 14, 18, 22, 26, and 30 amino acids to generate different 1D FE_avg features. The idea of 1D FE_avg was further extended to three dimensions. For 3D FE_avg, I considered a sphere S whose center is the coordinate associated with the mid-index of the peptide in three-dimensional space. The size of S is defined by a specified radius r. By averaging the FE associated with each residue $a_j$, where $a_j \in S$, I obtained 3D FE_avg, with $|S|$ denoting the number of residues inside S.

$$FE^{3D}_{avg} = \frac{1}{|S|} \sum_{a_j \in S} FE_j \qquad (2.2)$$
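
A minimal sketch of the 3D averaging is shown below. It assumes one representative coordinate per residue (for example the alpha carbon, which the text does not specify), takes the mid-index as the floor of the midpoint of the peptide, and uses the hypothetical function name `fe_avg_3d`.

```python
import math


def fe_avg_3d(profile, coords, start, end, radius):
    """Average FE over residues within `radius` of the peptide's mid residue.

    profile: per-site FE values
    coords:  per-site (x, y, z) tuples, same order as profile
    start, end: peptide span as a half-open interval [start, end)
    """
    mid = (start + end - 1) // 2          # assumption: floor of the midpoint
    center = coords[mid]
    inside = [fe for fe, xyz in zip(profile, coords)
              if math.dist(xyz, center) <= radius]
    # The centre residue is always inside, so `inside` is never empty.
    return sum(inside) / len(inside)
```

Unlike the 1D window, the sphere can include residues that are far apart in sequence but close in the folded structure.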


The radii used to generate 3D FE_avg features were 3.0 Å, 5.0 Å, and 10.0 Å. Examples of 1D and 3D FE_avg features are illustrated in Figure 2-2a.

A 1D FE_diff feature describes the difference between 1D FE_avg and the weighted average of FE in upstream and downstream regions, where the weight of each equals 0.5. A large difference suggests an energy fluctuation between neighboring regions in proteins, whereas a minor difference indicates a consistent pattern in energy. I used window sizes of 3, 5, and 10 amino acids in length to define 1D upstream or downstream neighborhood.

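$$FE^{1D}_{diff} = FE^{1D}_{avg}(\text{peptide}) - \left[ 0.5\, FE^{1D}_{avg}(\text{upstream}) + 0.5\, FE^{1D}_{avg}(\text{downstream}) \right] \qquad (2.3)$$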
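
The 1D FE_diff feature described above can be sketched as follows. The function name and the decision to raise an error when a flanking region is empty (peptides at a sequence terminus) are assumptions; the text does not state how such cases were treated.

```python
def fe_diff_1d(profile, start, end, n):
    """Peptide mean FE minus the equally weighted mean FE of both flanks.

    profile: per-site FE values
    start, end: peptide span as a half-open interval [start, end)
    n: number of residues in each flanking neighborhood
    """
    def mean(values):
        return sum(values) / len(values)

    peptide = profile[start:end]
    upstream = profile[max(0, start - n):start]
    downstream = profile[end:end + n]
    if not upstream or not downstream:
        # Assumption: terminal peptides without both flanks are rejected.
        raise ValueError("peptide lacks an upstream or downstream flank")
    return mean(peptide) - (0.5 * mean(upstream) + 0.5 * mean(downstream))
```

A value near zero indicates that the peptide and its neighborhood have similar FE, while a large magnitude marks a local energy fluctuation.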

In three-dimensional space, the peptide is extended to a sphere, as described previously, and the upstream and downstream peptides are extended to a shell surrounding the sphere. Given spheres $S_1$ and $S_2$ with radii $r_1$ and $r_2$ respectively, where $r_2$ is greater than $r_1$, $S_2 \setminus S_1$ is their difference in volume. 3D FE_diff is defined by

$$FE^{3D}_{diff} = FE^{3D}_{avg}(S_1) - FE^{3D}_{avg}(S_2 \setminus S_1) \qquad (2.4)$$

Table 2-1 summarizes the selected lengths for $r_1$ and $r_2$. Examples of 1D and 3D FE_diff features are illustrated in Figure 2-2b.
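
The sphere-and-shell comparison can be illustrated with the sketch below. It assumes one coordinate per residue, treats the shell as the set of residues farther than the inner radius but within the outer radius of the center, and raises an error for an empty shell; the function name and these choices are hypothetical.

```python
import math


def fe_diff_3d(profile, coords, center_index, r_inner, r_outer):
    """Mean FE inside the inner sphere minus mean FE in the surrounding shell."""
    center = coords[center_index]
    sphere, shell = [], []
    for fe, xyz in zip(profile, coords):
        distance = math.dist(xyz, center)
        if distance <= r_inner:
            sphere.append(fe)
        elif distance <= r_outer:
            shell.append(fe)
    if not shell:
        # Assumption: a feature without shell residues is undefined.
        raise ValueError("no residues found between the two radii")
    return sum(sphere) / len(sphere) - sum(shell) / len(shell)
```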

In contrast to the previous two types of features, which analyze FE confined to the area surrounding the peptide, FE_ss features consider the FE_avg of the peptide with respect to the full antigenic sequence. Using a sliding window approach, a total of $L - w - 5$ FE_avg values can be collected from a full sequence. Note that I removed three FE values from each end of the sequence, where the peptide is relatively unstructured. Given the distribution of the set of FE_avg values collected from the full sequence, FE_ss is defined as the standard score of the FE_avg associated with the peptide with respect to that set. In other words, a 1D FE_ss feature compares the 1D FE_avg of the peptide to the mean of the set of 1D FE_avg values successively averaged within a sliding window along the protein sequence. The comparison is expressed as the number of standard deviations by which the 1D FE_avg differs from the mean.

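$$FE^{1D}_{ss} = \frac{FE^{1D}_{avg} - \mu_{1D}}{\sigma_{1D}} \qquad (2.5)$$

where $\mu_{1D}$ and $\sigma_{1D}$ are the mean and standard deviation of the set of 1D FE_avg values collected from the full sequence.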
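
The standard-score construction can be sketched as follows. The use of the population standard deviation is an assumption (the text does not say whether the population or sample form was used), as are the function names.

```python
import statistics


def sliding_window_means(profile, w, trim=3):
    """Window averages over the sequence after dropping `trim` sites per end."""
    core = profile[trim:len(profile) - trim]
    # len(core) - w + 1 equals L - w - 5 when trim is 3.
    return [sum(core[i:i + w]) / w for i in range(len(core) - w + 1)]


def fe_ss_1d(profile, start, w, trim=3):
    """Standard score of the peptide's window average against all windows."""
    means = sliding_window_means(profile, w, trim)
    mu = statistics.mean(means)
    sigma = statistics.pstdev(means)      # assumption: population form
    peptide_avg = sum(profile[start:start + w]) / w
    return (peptide_avg - mu) / sigma     # undefined if all windows are equal
```

For a sequence of 12 sites and a window of 2, the sketch collects 12 - 2 - 5 = 5 window averages, which matches the count given in the text.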

Similar to 1D FE_avg features, I used window sizes of 6, 10, 14, 18, 22, 26, and 30 amino acids to generate 1D FE_ss features. From a three-dimensional perspective, I successively set the coordinate associated with each site along the full antigenic sequence as the center coordinate of S and averaged the FE associated with each residue $a_j$, where $a_j \in S$, thus generating a set of 3D FE_avg values. The mean $\mu_{3D}$ and standard deviation $\sigma_{3D}$ of this set were determined. The 3D FE_ss feature is defined as

$$FE^{3D}_{ss} = \frac{FE^{3D}_{avg} - \mu_{3D}}{\sigma_{3D}} \qquad (2.6)$$

The radii used to define 3D FE_ss features were 3.0 Å, 5.0 Å, and 10.0 Å. In total, 44 energy-related features were constructed. Table 2-1 summarizes the choice of parameters used to define the features.


Figure 2-2. a) 1D FE_diff feature based on a neighborhood of 10 amino acids on both sides. The value of this 1D FEdiff feature is the difference in FE between the peptide and its neighborhood on both sides. b) 3D FEdiff feature defined by a central sphere with radius r1 and a shell with radius r2. The feature value is the difference in FE between the sphere and the shell.

[Figure 2-2a panel labels: Upstream Neighborhood, Peptide, Downstream Neighborhood]


Table 2-1. a) Parameters used to define 1D energy-related features. b) Radii used to define 3D energy-related features.


Existing methods

We also implemented a number of existing methods for continuous B-cell epitope prediction to determine how these compare with the energy descriptors that we have developed. These parameters can be grouped into amino acid propensity scales, word probabilities [13], sequence complexity [13, 34], and the amino acid pair (AAP) antigenicity scale [15].

Fifty-eight amino acid propensity scales were obtained from ProtScale (http://us.expasy.org/cgi-bin/protscale.pl; as of May 2012). These scales reflect physicochemical properties such as hydrophobicity and secondary structure. Based on each propensity scale, the average value for a peptide was determined. Additionally, the pairwise difference between each amino acid and its neighbor was determined and then averaged over the length of the peptide.
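
The two propensity-based features can be sketched as below. The scale values are toy numbers chosen for illustration and do not come from ProtScale. Taking the absolute difference between neighbors and dividing by the number of adjacent pairs are assumptions, since the text does not specify the sign convention or the exact divisor.

```python
# Toy scale for illustration only; not taken from ProtScale.
TOY_SCALE = {"A": 1.0, "C": 2.0, "D": -3.0, "E": -3.0}


def scale_average(peptide, scale):
    """Mean propensity value over all residues of the peptide."""
    return sum(scale[aa] for aa in peptide) / len(peptide)


def neighbor_difference(peptide, scale):
    """Mean absolute difference between each residue and its successor."""
    diffs = [abs(scale[a] - scale[b]) for a, b in zip(peptide, peptide[1:])]
    return sum(diffs) / len(diffs)
```

Applying both functions to each of the 58 scales yields the 116 propensity-derived features counted in Table 2-2.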

Word probabilities were calculated as described by Sollner and Mayer [13]. These features estimate if successions of certain amino acid patterns, or words, exhibit a higher prevalence in one of the two sequence sets considered. Specifically, a neighborhood matrix, which describes the probability of each possible amino acid pattern in the neighborhood, was created based on a set of training sequences. The matrix held frequencies for patterns of length 1-3 amino acids.

Subsequently, the matrix was used to classify a peptide by assigning matrix values to neighborhoods surrounding the peptide of interest.

Sequence complexity was calculated as described by Wootton and Federhen [34]. It describes the amino acid frequencies f(x) in epitope and control peptides, and the complexity C of a peptide is given by

$$C = -\sum_{x} f(x) \log f(x) \qquad (2.7)$$


Finally, we also implemented the AAP antigenicity scale developed by Chen et al. [15]. It has been reported that the AAP composition for epitopes is different from that of non-epitopes.

Specifically, the AAP antigenicity is defined by

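$$R_{AAP} = \log\left(\frac{f^{+}_{AAP}}{f^{-}_{AAP}}\right) \qquad (2.8)$$

where $f^{+}_{AAP}$ and $f^{-}_{AAP}$ are the observed frequencies of a given AAP in epitopes and non-epitopes, respectively. In my implementation, both $f^{+}_{AAP}$ and $f^{-}_{AAP}$ were derived from the training data set.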
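
A minimal sketch of deriving an AAP scale from training peptides is given below. It assumes the log-ratio form of the scale from Chen et al. [15]; the additive pseudocount, which avoids division by zero for pairs seen in only one class, and the function names are assumptions that do not come from the text.

```python
import math
from collections import Counter


def aap_counts(peptides):
    """Count overlapping amino acid pairs across a list of peptides."""
    counts = Counter(p[i:i + 2] for p in peptides for i in range(len(p) - 1))
    return counts, sum(counts.values())


def aap_scale(epitopes, non_epitopes, pseudocount=1.0):
    """Log ratio of smoothed pair frequencies in epitopes vs non-epitopes."""
    pos, n_pos = aap_counts(epitopes)
    neg, n_neg = aap_counts(non_epitopes)
    pairs = set(pos) | set(neg)
    k = len(pairs)
    scale = {}
    for pair in pairs:
        f_pos = (pos[pair] + pseudocount) / (n_pos + pseudocount * k)
        f_neg = (neg[pair] + pseudocount) / (n_neg + pseudocount * k)
        scale[pair] = math.log(f_pos / f_neg)
    return scale
```

Pairs that are more frequent in epitopes receive positive values and pairs enriched in non-epitopes receive negative values.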

In total, 178 parameters were constructed based on existing methods for continuous B-cell epitope prediction.

Table 2-2. Features used in published continuous B-cell epitope prediction methods.

Feature type                                          Number of features
Amino acid propensity scale                           58
Amino acid propensity scale (pair-wise difference)    58
Word probability                                      60
Sequence complexity                                   1
AAP                                                   1
Total                                                 178


Feature normalization

Because the novel FE features are based on protein free energy and are therefore on a different scale from the other features, the input matrix to the classifiers was normalized column-wise so that each column has zero mean and unit variance. That is, each feature is expressed as a standard score after normalization.
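
Column-wise normalization of this kind can be sketched as follows. The use of the population standard deviation and the mapping of constant columns to zero are assumptions, and the function name is hypothetical.

```python
import statistics


def standardize_columns(rows):
    """Rescale each column of a row-major matrix to zero mean, unit variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]   # population form
    return [
        # Assumption: a constant column carries no information, so use 0.0.
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]
```

This step matters most for distance-based classifiers such as k-nearest neighbor, where a feature with a large numerical range would otherwise dominate the distance.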

Classifier implementation

I applied the WEKA [35] implementations of k-nearest neighbor (IBk), SVM with an RBF kernel (LibSVM), and ANN (MultilayerPerceptron). In each ten-fold cross-validation, the ten classifiers used the same set of parameters for learning. In other words, the classifiers were not optimized for individual test sets; rather, the parameters were chosen to give the best average accuracy across the folds.

