Chapter 1 Literature review

1.6 Protein energy

Most physicochemical properties, and thus the machine learning methods derived from them, depend on the structure of the antigenic protein. A protein must fold into a specific three-dimensional conformation to carry out its biological role. Protein folding is organized at several levels. The linear sequence of amino acids constitutes the primary structure. The primary sequence is held together by covalent peptide bonds, which are formed during protein biosynthesis. The primary structure of a protein is encoded by its corresponding gene: a specific sequence of nucleotides in a gene segment is transcribed into mRNA, which is translated into protein by ribosomes. The primary sequence is characteristic of a protein and determines its three-dimensional structure and function.

Secondary structure refers to local substructures, such as alpha helices, beta strands, and beta sheets. These substructures are held together by hydrogen bonds, which are one of the main factors stabilizing secondary structure in proteins. Depending on the primary structure, hydrogen bonds form at specific places between the peptide groups of the main chain. The pattern and arrangement of these hydrogen bonds define the local secondary structure.

Tertiary structure is the three-dimensional structure of a single polypeptide chain, created by packing the local secondary structures into a compact globule. Folding is driven by the hydrophobic effect, in which nonpolar amino acids, such as alanine, valine, leucine, isoleucine, phenylalanine, tryptophan and methionine, cluster together within the protein to form a hydrophobic core. Excluding the hydrophobic core from water while exposing charged and polar side chains at the protein surface, where they interact with surrounding water molecules, stabilizes the folded tertiary structure. Hydrogen bonds also help to define the shape of a protein's tertiary structure.

Some proteins also possess a quaternary structure, which is the assembly of multiple polypeptide chains into a multi-subunit complex. The subunits are held together by non-covalent interactions and disulfide bonds. Different subunits in a complex may have distinct functions. For instance, in an enzyme complex such as DNA polymerase, some subunits carry out regulatory functions, whereas others carry out catalytic activity. These bonds and forces play a very important role in maintaining the shape of proteins, which must fold into specific three-dimensional conformations in order to perform their biological functions.

While the correct three-dimensional structure is essential to function, the macromolecule is usually flexible and dynamic. It can rearrange its shape in response to local perturbations such as mutations. Current continuous B-cell epitope prediction methods classify a peptide as epitope or non-epitope based on features extracted from its sequence composition. However, genetic variability exists as a result of mutations, and two or more phenotypes of an antigen may exist simultaneously in a population. None of the published methods has systematically combined and compared protein properties associated with antigenic mutants. Because current prediction methods depend on sequence composition, mutations in the antigenic sequence may affect prediction performance. Since computational epitope prediction aims to identify candidate peptides for vaccine development, the dependence of prediction results on a single antigenic phenotype is particularly impractical.

Statement of the Problem

One of the major challenges in vaccine design is to identify continuous B-cell epitopes in an ever-evolving virus. Current prediction algorithms mostly rely on amino acid propensity scales and their variants. Because current prediction results depend on a particular sequence composition, genetic variation in nature leads to variable prediction outcomes. Each unique primary sequence gives rise to a unique set of intermolecular forces, which combine to produce a total free energy that reflects the overall protein structure. While the majority of physicochemical properties are also related to protein structure, the performance of protein total free energy in continuous B-cell epitope prediction has not been evaluated. Features based on combining or comparing the total free energy associated with antigenic mutants may be important in continuous B-cell epitope prediction and vaccine development.

Hypothesis

This study critically assesses point mutations and resultant protein energy in the prediction of continuous B-cell epitopes. It is proposed that features based on protein free energy associated with point mutations can be used to identify continuous B-cell epitopes through machine learning methods.

2.1 Introduction

B-cell epitopes are antigenic determinants that are recognized and bound by antibodies on the surface of B-cells, also known as B lymphocytes. When an antibody binds to cognate antigens on the surface of invading microbes, the antibody can tag the microbe or infected cell for attack by other parts of the immune system, or can neutralize the microbe directly. Consequently, understanding and identification of B-cell epitopes is critical for the design of effective vaccines.

Based on structure and interaction with the antibody, B-cell epitopes can be classified into two types, continuous epitopes and discontinuous epitopes. A continuous epitope is a short peptide that corresponds to a contiguous amino acid sequence fragment of a protein. A discontinuous epitope is composed of amino acids that are not contiguous in the antigenic sequence, but are close together in the folded antigenic structure. That an antibody and its cognate antigen possess complementary geometric shapes implies antibody-antigen interactions are also conformation dependent in the case of continuous B-cell epitopes.

Current machine learning methods for continuous B-cell epitope prediction are mostly based on physicochemical properties and their variants. However, a major problem with these methods is that they are based on a single sequence composition, so their results are affected by the occurrence of mutations. This is impractical considering that epitope regions are prone to mutations, possibly as a means to escape immune detection. In this study, I systematically introduced mutations along the length of proteins and determined the protein free energy associated with each point-mutated sequence in the folded state. The resulting set of free energies was used to construct a novel type of feature based on combining or comparing the protein free energy. This study is primarily concerned with patterns in protein free energy that contribute to the learning process of classifiers in continuous B-cell epitope prediction.

A total of 44 energy-related features were proposed. The performance of k-NN, SVM, and ANN classifiers trained with these features was analyzed. To identify features that are particularly relevant to continuous B-cell epitope prediction, the classification performance of subsets of the features was also analyzed. Since continuous B-cell epitopes exist in various lengths, I also checked the sensitivity of prediction performance to epitope length.

In addition, the performance of publicly available B-cell epitope prediction methods was compared with each other and with my method. I report a direct comparison of my method with ABCPred, BCPred, and the AAP method implemented by El-Manzalawy et al. The performance of the classifiers was validated on an independent dataset.

2.2 Methods

Dataset

B-cell epitopes were selected from the Bcipep database [12]. We used a total of 200 epitopes distributed across 145 protein sequences for training and testing our classification methods. For energy estimation purposes, we retrieved antigenic sequences from UniProt based on the accession number associated with each source antigen [29]. Non-epitope peptides were generated by randomly extracting segments from the 145 protein sequences while ensuring that the non-epitope peptides so obtained were not present in the epitope data set. This approach is only an approximation for non-epitope sequences, as the proteins involved have not been fully mapped for epitopes. To avoid classification bias resulting from non-epitope peptides, I generated a total of 1000 non-epitope peptides. For each epitope in the training or testing dataset, a non-epitope peptide of the same length was randomly selected from the pool of 1000 non-epitope peptides.
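
The negative-sampling step described above can be sketched as follows. This is a minimal illustration, not the code used in this study: the function name `sample_non_epitopes`, the exact-match exclusion rule, the retry limit, and the fixed random seed are all assumptions made for the example.

```python
import random

def sample_non_epitopes(proteins, epitopes, lengths, n, seed=0):
    """Draw up to n distinct random segments that are not known epitopes.

    proteins: list of source antigen sequences
    epitopes: list of known epitope peptides (excluded by exact match; assumption)
    lengths:  candidate segment lengths, e.g. the lengths of the epitopes
    """
    rng = random.Random(seed)
    epitope_set = set(epitopes)
    negatives = set()
    attempts = 0
    # Bounded retry loop so that a tiny input cannot spin forever (assumption).
    while len(negatives) < n and attempts < 100 * n:
        attempts += 1
        seq = rng.choice(proteins)
        length = rng.choice(lengths)
        if len(seq) < length:
            continue
        start = rng.randrange(len(seq) - length + 1)
        segment = seq[start:start + length]
        if segment not in epitope_set:
            negatives.add(segment)
    return sorted(negatives)
```

Pairing each epitope with a negative of the same length then amounts to filtering the returned pool by `len(peptide)` before drawing.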

Blind dataset

To evaluate the energy-related features on an independent dataset, a set of 85 epitopes distributed across 45 protein sequences was retrieved from the AntiJen database [27]. As a reference, a set of 100 non-epitope peptides was randomly selected from the protein sequences with the criterion that the selected non-epitopes were not the same as any one of the epitopes.

Dataset for current continuous B-cell epitope prediction methods

A number of current B-cell epitope prediction methods, such as BCPred, ABCPred, and the AAP method, require input peptides of specific lengths [11, 15, 16]. These methods demonstrated superior performance with peptides of 16 amino acids in length. To compare the performance of the energy-related features with that of current prediction methods, the peptides in the aforementioned datasets were fixed to 16 amino acids in length. If the peptide is shorter than 16 amino acids, it is extended by adding an equal number of amino acids to both ends based on the protein sequence of the source antigen. If the peptide is longer than 16 amino acids, it is shortened by trimming amino acids from both ends.
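
A minimal sketch of this length-fixing step is given below. The function name `fix_length` is hypothetical, and two details that the text does not specify are assumptions here: when the length difference is odd, the extra residue is added or removed at the C-terminal end, and when the window would run past either end of the antigen it is shifted back inside the sequence.

```python
def fix_length(protein, start, end, target=16):
    """Resize the peptide protein[start:end] (0-based, end exclusive) to `target` residues.

    Short peptides are extended and long peptides are trimmed symmetrically.
    Odd remainders go to the C-terminal side (assumption).
    """
    if len(protein) < target:
        raise ValueError("source antigen is shorter than the target length")
    delta = target - (end - start)      # > 0: extend, < 0: trim
    half = abs(delta) // 2
    new_start = start - half if delta >= 0 else start + half
    # Keep the window inside the antigen sequence (assumption).
    new_start = max(0, min(new_start, len(protein) - target))
    return protein[new_start:new_start + target]
```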

Energy estimation

Protein structure modeling was performed using SWISS-MODEL [30-32], an automated protein structure homology-modeling server. First, antigenic protein sequences were submitted to identify known protein structures that resemble the structure of the antigen. Antigenic sequences for which no suitable template was found in the structural database were excluded from our data set. For sequences with suitable templates, the template with the lowest E-value was chosen. Subsequently, the template ID and antigenic protein sequence were submitted to the server for homology modeling, and the resultant PDB file was saved locally for further operations.

Mutations were introduced into the PDB structures using Deepview [30], a molecular visualization package designed to interact with the SWISS-MODEL server, and each point-mutated structure was energy minimized and equilibrated using the Deepview implementation of the GROMOS96 force field [33]. Each site in the antigenic structure was mutated to each of the 20 naturally occurring amino acids. Following energy minimization, the protein free energy (FE) associated with each point-mutated structure was recorded. Thus, a total of 20L FE values was obtained for an antigen of sequence length L. Note that we minimized FE with respect to bond lengths, bond angles, torsion energies, and improper angles. While also minimizing the non-bonded interaction energy would certainly yield a more accurate energy estimate, the search for a global energy minimum is computationally expensive, given the vast number of conformational variants analyzed in this study. Therefore, throughout this analysis, we assume that the conformation of a stable protein resulting from a single-site substitution closely resembles that of the parent protein, with any tertiary structural changes localized in the neighborhood of the substitution.
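
To make the 20L count concrete, the following sketch enumerates the point-mutated sequences for a given antigen. It only generates sequences; the structural mutation and energy minimization were carried out in Deepview and are not reproduced here. The function name `point_mutants` is hypothetical. Note that, under this reading, one of the 20 substitutions at each site reproduces the wild-type residue.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 naturally occurring amino acids

def point_mutants(sequence):
    """Yield (site, residue, mutated_sequence) for every site and every residue type."""
    for site in range(len(sequence)):
        for residue in AMINO_ACIDS:
            yield site, residue, sequence[:site] + residue + sequence[site + 1:]
```

For a sequence of length L the generator yields exactly 20L entries, one per FE value to be recorded.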

As a consequence of protein energy estimation, epitopes in the data set were selected based on the following conditions: (i) the accession number of source antigen is provided, (ii) there exists a template structure for the antigenic sequence, and (iii) the antigen-template alignment provides information about the spatial arrangement of the epitope.

Figure 2-1 – System control flow for energy-related features.

Energy features

Three types of energy features were proposed in this study: FEavg, FEdiff, and FEss. I used Deepview [30] to calculate the FE values used to generate the three types of features. For a sequence of length L, a total of 20L FE values was determined. From the 20 FE values obtained by introducing each of the 20 possible point mutations at a particular site, the minimum FE was determined and assigned to that site. As a result, an amino acid sequence is transformed into a numerical sequence, with each value representing the optimal stability that results from a point mutation at the corresponding site. The minimum FE in each sequence is then subtracted from all FE values in that sequence. Given a peptide whose length is delineated by a window of size w, a 1D FEavg feature was constructed by averaging the FE associated with each amino acid, denoted as $FE_i$, where $i = 1, \dots, w$.

$$FE_{avg}^{1D} = \frac{1}{w} \sum_{i=1}^{w} FE_i \tag{2.1}$$

I used w of 6, 10, 14, 18, 22, 26, and 30 amino acids to generate different 1D FEavg features. The idea of 1D FEavg was further extended to a 3D perspective. For 3D FEavg, I considered a sphere S whose center is the three-dimensional coordinate of the residue at the mid-index of the peptide. The size of S is defined by a specified radius r. By averaging the FE associated with each residue $i$, where $i \in S$, I obtained 3D FEavg, with $|S|$ denoting the number of residues in S.

$$FE_{avg}^{3D} = \frac{1}{|S|} \sum_{i \in S} FE_i \tag{2.2}$$

The radii used to generate 3D FEavg features were 3.0 Å, 5.0 Å, and 10.0 Å. Examples of 1D and 3D FEavg features are illustrated in Figure 2-2a.
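
The FEavg features described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the original implementation: the function names are hypothetical, each residue is represented by a single coordinate (for example its C-alpha atom, which the text does not specify), and a residue lying exactly on the sphere boundary is counted as inside.

```python
import math

def site_profile(fe_matrix):
    """Collapse a list of per-site FE lists (20 values each) into one value per site.

    Each site receives its minimum FE, and the sequence-wide minimum is then
    subtracted so that the lowest value in the profile is zero.
    """
    site_min = [min(row) for row in fe_matrix]
    base = min(site_min)
    return [value - base for value in site_min]

def fe_avg_1d(profile, start, w):
    """Mean FE over the window profile[start:start + w]."""
    window = profile[start:start + w]
    return sum(window) / len(window)

def fe_avg_3d(profile, coords, center_index, radius):
    """Mean FE over residues whose coordinate lies within `radius` of the center residue."""
    center = coords[center_index]
    inside = [fe for fe, xyz in zip(profile, coords)
              if math.dist(center, xyz) <= radius]
    return sum(inside) / len(inside)
```

Because the center residue is always at distance zero from itself, the sphere is never empty and the 3D average is always defined.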

A 1D FEdiff feature describes the difference between the 1D FEavg of the peptide and the weighted average of the FEavg of its upstream and downstream regions, where each region has a weight of 0.5. A large difference suggests an energy fluctuation between neighboring regions in the protein, whereas a minor difference indicates a consistent pattern in energy. I used window sizes of 3, 5, and 10 amino acids to define the 1D upstream and downstream neighborhoods.

$$FE_{diff}^{1D} = FE_{avg}^{peptide} - \left(0.5\, FE_{avg}^{up} + 0.5\, FE_{avg}^{down}\right) \tag{2.3}$$

In three-dimensional space, the peptide is extended to a sphere, as described previously, and the upstream and downstream peptides are extended to a shell surrounding the sphere. Given spheres $S_1$ and $S_2$ with radii $r_1$ and $r_2$ respectively, where $r_2$ is greater than $r_1$, $S_2 \setminus S_1$ is their difference in volume. 3D FEdiff is defined by

$$FE_{diff}^{3D} = FE_{avg}(S_1) - FE_{avg}(S_2 \setminus S_1) \tag{2.4}$$

Table 2-1 summarizes the selected lengths for $r_1$ and $r_2$. Examples of 1D and 3D FEdiff features are illustrated in Figure 2-2b.
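
A minimal sketch of the two FEdiff features follows, assuming a per-site FE profile and one coordinate per residue are already available. The function names are hypothetical. Two choices are assumptions of this sketch: a peptide without an upstream or downstream neighborhood raises an error rather than being padded, and the shell contains residues at distances greater than the inner radius and up to the outer radius.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def fe_diff_1d(profile, start, w, n):
    """Peptide FEavg minus the equally weighted FEavg of its n-residue flanks."""
    peptide = profile[start:start + w]
    upstream = profile[max(0, start - n):start]
    downstream = profile[start + w:start + w + n]
    if not upstream or not downstream:
        raise ValueError("peptide has no upstream or downstream neighborhood")
    return _mean(peptide) - (0.5 * _mean(upstream) + 0.5 * _mean(downstream))

def fe_diff_3d(profile, coords, center_index, r1, r2):
    """FEavg of the inner sphere (radius r1) minus FEavg of the shell between r1 and r2."""
    center = coords[center_index]
    dists = [math.dist(center, xyz) for xyz in coords]
    sphere = [fe for fe, d in zip(profile, dists) if d <= r1]
    shell = [fe for fe, d in zip(profile, dists) if r1 < d <= r2]
    if not shell:
        raise ValueError("no residues fall inside the shell")
    return _mean(sphere) - _mean(shell)
```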

In contrast to the previous two types of features, which analyze FE confined to the area surrounding the peptide, FEss features consider the FEavg of the peptide with respect to the full antigenic sequence. Using a sliding window approach, a total of $L - w - 5$ FEavg values can be collected from a full sequence. Note that I removed three FE values from each end of the sequence, where the chain is relatively unstructured. If there exists a distribution for the set of FEavg collected from the full sequence, FEss is defined as the standard score of the FEavg associated with the peptide with respect to the set of FEavg collected from the full sequence. In other words, a 1D FEss feature compares the 1D FEavg of the peptide to the mean of the set of 1D FEavg successively averaged within a sliding window along the protein sequence. The comparison is expressed as the number of standard deviations of the 1D FEavg from the mean.

$$FE_{ss}^{1D} = \frac{FE_{avg}^{1D} - \mu^{1D}}{\sigma^{1D}} \tag{2.5}$$

Here $\mu^{1D}$ and $\sigma^{1D}$ are the mean and standard deviation of the set of 1D FEavg collected from the full sequence. Similar to 1D FEavg features, I used window sizes of 6, 10, 14, 18, 22, 26, and 30 amino acids to generate 1D FEss features. From a three-dimensional perspective, I collected a set of 3D FEavg by successively setting the coordinate associated with each site along the full antigenic sequence as the center coordinate of $S$ and averaging the FE associated with each residue $i$, where $i \in S$. The mean $\mu^{3D}$ and standard deviation $\sigma^{3D}$ of the set of 3D FEavg were determined. The 3D FEss feature is defined as

$$FE_{ss}^{3D} = \frac{FE_{avg}^{3D} - \mu^{3D}}{\sigma^{3D}} \tag{2.6}$$

The radii used to define 3D FEss features were 3.0 Å, 5.0 Å, and 10.0 Å. In total, 44 energy-related features were constructed. Table 2-1 summarizes the choice of parameters used to define the features.
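
The 1D FEss computation can be sketched as below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say whether the population or sample form was used. Trimming three values from each end of a profile of length L leaves L - 6 sites, which gives L - w - 5 sliding windows of size w, in agreement with the count stated above.

```python
import statistics

def window_averages(profile, w, trim=3):
    """FEavg of every window of size w, after dropping `trim` values from each end."""
    core = profile[trim:len(profile) - trim]
    return [sum(core[i:i + w]) / w for i in range(len(core) - w + 1)]

def fe_ss_1d(profile, start, w, trim=3):
    """Standard score of the peptide FEavg relative to all sliding-window FEavg values."""
    averages = window_averages(profile, w, trim)
    mu = statistics.mean(averages)
    sigma = statistics.pstdev(averages)  # population form (assumption)
    peptide_avg = sum(profile[start:start + w]) / w
    return (peptide_avg - mu) / sigma
```

The 3D variant would follow the same pattern, replacing the sliding-window averages with the sphere averages centered on each site.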

Figure 2-2. a) 1D FEdiff feature based on a neighborhood of 10 amino acids on both sides. The value of this 1D FEdiff feature is the difference in FE between the peptide and its neighborhood on both sides. b) 3D FEdiff feature defined by a central sphere with radius r1 and a shell with radius r2. The feature value is the difference in FE between the sphere and the shell.

[Figure labels: panel a shows the peptide with its upstream and downstream neighborhoods; panel b shows the radii r1 and r2.]

Table 2-1. a) Parameters used to define 1D energy-related features. b) Radii used to define 3D energy-related features.

Existing methods

We also implemented a number of existing methods for continuous B-cell epitope prediction to determine how these compare with the energy descriptors that we have developed. These parameters can be grouped into amino acid propensity scales, word probabilities [13], sequence complexity [13, 34], and the amino acid pair (AAP) antigenicity scale [15].

Fifty-eight amino acid propensity scales were obtained from ProtScale (http://us.expasy.org/cgi-bin/protscale.pl; as of May 2012). These scales reflect physicochemical properties such as hydrophobicity and secondary structure. For each propensity scale, the average value over a peptide was determined. Additionally, the pairwise difference between each amino acid and its neighbor was determined and then averaged over the length of the peptide.
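
The two propensity-scale features can be illustrated with a short sketch. The function names are hypothetical, and the scale in the usage note is a toy mapping rather than a real ProtScale table. The pairwise feature is computed here with absolute differences, which is an assumption: the text does not say whether the difference is signed, and a signed difference would reduce to a quantity that depends only on the first and last residues.

```python
def scale_average(peptide, scale):
    """Mean propensity value over all residues of the peptide."""
    return sum(scale[residue] for residue in peptide) / len(peptide)

def mean_neighbor_difference(peptide, scale):
    """Mean absolute difference in propensity between adjacent residues (assumption)."""
    diffs = [abs(scale[b] - scale[a]) for a, b in zip(peptide, peptide[1:])]
    return sum(diffs) / len(diffs)
```

With the toy scale `{"A": 1.0, "G": 0.0, "I": 4.0}`, the peptide `"AGI"` has adjacent differences of 1.0 and 4.0, so `mean_neighbor_difference` returns 2.5.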

Word probabilities were calculated as described by Sollner and Mayer [13]. These features estimate whether successions of certain amino acid patterns, or words, are more prevalent in one of the two sequence sets considered. Specifically, a neighborhood matrix, which describes the probability of each possible amino acid pattern in the neighborhood, was created from a set of training sequences. The matrix held frequencies for patterns of 1 to 3 amino acids in length. Subsequently, the matrix was used to classify a peptide by assigning matrix values to the neighborhoods surrounding the peptide of interest.

Sequence complexity was calculated as described by Wootton and Federhen [34]. It describes the amino acid frequencies f(x) in epitope and control peptides, and the complexity C of a peptide is given by

∑ (2.7)

Finally, we also implemented the AAP antigenicity scale developed by Chen et al. [15]. It has been reported that the AAP composition of epitopes differs from that of non-epitopes. Specifically, the AAP antigenicity is defined by

$$R_{AAP} = \log\left(\frac{f^{+}_{AAP}}{f^{-}_{AAP}}\right) \tag{2.8}$$

where $f^{+}_{AAP}$ and $f^{-}_{AAP}$ are the observed frequencies of a given AAP in epitopes and non-epitopes, respectively. In my implementation, both $f^{+}_{AAP}$ and $f^{-}_{AAP}$ were derived from the training data set.
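
A minimal sketch of an AAP scale follows, assuming the scale is the log ratio of the frequency of each amino acid pair in epitopes to its frequency in non-epitopes. The function names are hypothetical. Two further choices are assumptions of this sketch and are not taken from the text: a small pseudocount avoids division by zero for pairs seen in only one set, and a peptide is scored by averaging the scale over its overlapping pairs.

```python
import math
from collections import Counter

def aap_frequencies(peptides):
    """Relative frequency of each overlapping amino acid pair in a peptide set."""
    counts = Counter(p[i:i + 2] for p in peptides for i in range(len(p) - 1))
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}

def aap_scale(epitopes, non_epitopes, pseudo=1e-6):
    """Log ratio of pair frequencies in epitopes versus non-epitopes."""
    f_pos = aap_frequencies(epitopes)
    f_neg = aap_frequencies(non_epitopes)
    pairs = set(f_pos) | set(f_neg)
    return {pair: math.log((f_pos.get(pair, 0.0) + pseudo) /
                           (f_neg.get(pair, 0.0) + pseudo))
            for pair in pairs}

def aap_score(peptide, scale):
    """Mean scale value over the overlapping pairs of a peptide; unseen pairs count as 0."""
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    return sum(scale.get(pair, 0.0) for pair in pairs) / len(pairs)
```

Deriving both frequency tables from the training folds only, as stated above, keeps the scale from leaking information about the test peptides.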

In total, 178 parameters were constructed based on existing methods for continuous B-cell epitope prediction.

Table 2-2. Features used in published continuous B-cell epitope prediction methods.

Feature type                                         Number of features
Amino acid propensity scale                          58
Amino acid propensity scale (pair-wise difference)   58
Word probability                                     60
Sequence complexity                                  1
AAP                                                  1
Total                                                178

Feature normalization

Since the novel FE features are based on protein free energy, the input matrix to the classifiers was normalized column-wise so that each column has zero mean and unit variance. That is, each feature was converted to standard scores.
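
Column-wise standardization can be sketched as follows. The function name `standardize_columns` is hypothetical, the population standard deviation is an assumption, and mapping constant columns to zero is a choice made here to avoid division by zero. Standardization rescales each feature but does not change the shape of its distribution.

```python
import statistics

def standardize_columns(rows):
    """Rescale each column of a row-major matrix to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    stds = [statistics.pstdev(column) for column in columns]
    return [[(value - m) / s if s > 0 else 0.0
             for value, m, s in zip(row, means, stds)]
            for row in rows]
```

In a cross-validation setting, the means and standard deviations would normally be estimated on the training folds and then applied to the test fold; whether this was done in the original study is not stated.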

Classifier implementation

I applied the WEKA [35] implementations of k-nearest neighbor (IBk), SVM (LibSVM) with an RBF kernel, and ANN (Multilayer Perceptron). In each tenfold cross-validation, the ten classifiers used the same set of learning parameters. In other words, the classifiers were not optimized for individual test sets; rather, the parameters were chosen to give the best average accuracy.

2.3 Results

Tenfold cross-validation

I used stratified tenfold cross-validation. The data set was randomly divided into ten equal subsets such that the ratio of epitopes to non-epitopes in each subset was 1:1. Nine of the ten subsets were used for training the classifier, and the tenth subset was used for testing the classifier. This procedure was repeated ten times, with each subset used exactly once as the testing data. Results from five tenfold runs were averaged to produce a single value, which represents the estimated performance of the classifier.
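
The splitting procedure can be sketched without any machine learning library. The experiments in this study used WEKA; the code below is only an illustration of stratified fold assignment, and the function names and the seeding scheme are assumptions.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so that each class is spread evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, k, seed)
    for t in range(k):
        train = [i for f in range(k) if f != t for i in folds[f]]
        yield train, folds[t]
```

Repeating the procedure with five different seeds and averaging the resulting scores corresponds to the five tenfold runs described above.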

Performance of energy-related features in selected classifiers

In this study, 44 energy-related features were developed for continuous B-cell epitope prediction.

First, the energy-related features were tested on learning algorithms that have previously demonstrated prominent performance in the prediction of continuous B-cell epitopes, namely k-NN, SVM, and ANN. Performance of the classifiers trained with energy-related features is
