

Chapter 1 Literature review

1.5 Machine learning approaches for the prediction of continuous B-cell epitopes

A number of continuous B-cell epitope prediction methods based on machine learning approaches have been developed.

1.5.1 Prediction method based on artificial neural networks

ABCPred uses artificial neural networks to predict continuous B-cell epitopes [11]. The method was trained and tested on epitopes derived from the Bcipep database [12] and reference peptides from the Swiss-Prot database. One of the constraints associated with machine learning techniques is that peptides need to be adjusted to a fixed length, whereas the length of B-cell epitopes varies from 5 to 30 amino acids. The authors tested a number of fixed lengths (10, 12, 14, 16, 18, and 20 amino acids). When a peptide was shorter than the specified length, it was extended by adding amino acids on both sides, taken from the corresponding complete antigen sequence. Conversely, if the peptide was longer than the specified length, an equal number of amino acids was removed from both sides. The dataset was divided into five parts, where three were used for training, one for minimizing the error during learning, and one for testing. The best accuracy, 66%, was obtained using a recurrent neural network with 35 neurons in the hidden layer, trained with a window length of 16 amino acids. The authors suggested that this method demonstrated improved accuracy, sensitivity, and specificity compared with the scale-based methods.
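To make the fixed-length adjustment concrete, the following Python sketch widens a short epitope with flanking residues from its parent antigen and trims a long one equally from both ends. The function name, the symmetric rounding, and the 'X' padding used when the antigen terminus is reached are illustrative assumptions rather than ABCPred's published implementation.

def adjust_to_fixed_length(antigen, start, end, target_len=16):
    # Return a peptide of exactly target_len residues centred on antigen[start:end].
    length = end - start
    if length < target_len:
        # Extend symmetrically using residues from the parent antigen sequence.
        extra = target_len - length
        left = extra // 2
        new_start = max(0, start - left)
        new_end = min(len(antigen), new_start + target_len)
        peptide = antigen[new_start:new_end]
        return peptide.ljust(target_len, "X")   # pad only if the antigen runs out
    if length > target_len:
        # Trim an equal number of residues from both sides.
        left = (length - target_len) // 2
        return antigen[start + left:start + left + target_len]
    return antigen[start:end]

# Example: a 10-residue epitope widened to a 16-residue window
antigen = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
print(adjust_to_fixed_length(antigen, 8, 18))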

1.5.2 Prediction method based on C4.5 and k-NN

In the method by Sollner and Mayer, amino acid scales, neighborhood matrices, and the respective probability and likelihood values were combined and used in decision tree and nearest neighbor approaches to derive a classification algorithm [13]. The training dataset was derived from public domain sources (the Bcipep and FIMM [14] databases) and a proprietary dataset of experimentally determined epitopes. For each peptide in the epitope dataset, a non-epitope peptide of the same length was selected randomly. The mean epitope length was 13 amino acids, with a minimum of 6 and a maximum of 20 amino acids. The peptides were transformed into a parameter space. The parameters considered were grouped into three classes: amino acid propensity scales, sequence complexities, and neighborhood word probabilities. Based on the distribution of each parameter in the epitope and non-epitope datasets, a parameter was selected as a feature if it yielded correct class assignment for at least 60% of the peptides. The selected parameters were then used as input for C4.5 decision tree and k-NN classifiers. The best performance, 72% accuracy, was attained using the nearest neighbor approach.
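A minimal sketch of how the 60% single-parameter selection criterion described above could be applied before nearest neighbor classification is given below; the thresholding scheme, the toy data, and the choice of k = 3 are assumptions for illustration, not the published procedure of Sollner and Mayer.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def single_parameter_accuracy(values, labels):
    # Best class assignment obtainable by thresholding one parameter,
    # trying both orientations (values above or below the cut-off = epitope).
    best = 0.0
    for t in np.unique(values):
        for pred in (values >= t, values < t):
            best = max(best, float(np.mean(pred == labels)))
    return best

# Toy parameter matrix: rows = peptides, columns = candidate parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200).astype(bool)
X[y, 0] += 1.5                          # make parameter 0 weakly informative

selected = [j for j in range(X.shape[1])
            if single_parameter_accuracy(X[:, j], y) >= 0.60]
knn = KNeighborsClassifier(n_neighbors=3).fit(X[:, selected], y)
print("selected parameters:", selected)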

While this approach demonstrated improved accuracy over previous methods, the procedure used for representing peptides as input to the classifiers is not publicly available.

1.5.3 Prediction method based on SVM

In 2007, Chen et al. attempted to improve prediction quality using a novel scale called the amino acid pair (AAP) antigenicity scale [15]. The authors used epitopes derived from the Bcipep database and non-epitopes derived from Swiss-Prot. The peptides were adjusted to different window sizes. Initial analysis of AAPs demonstrated that the frequencies of some pairs differed significantly between the epitope and non-epitope datasets. Therefore, the average of all the AAP scores in a peptide, together with hydrophilicity, accessibility, flexibility, and antigenicity, was projected as a vector into feature space. An SVM classifier was used to assign each example to one of the classes {-1, +1}. The SVM, using a radial basis function kernel, produced a prediction accuracy of 71%.
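The idea of averaging AAP scores over a peptide can be sketched as follows; the log-ratio used to build the scale and the tiny example peptide sets are simplifying assumptions, whereas the published AAP scale is derived from the full Bcipep and Swiss-Prot datasets and further rescaled before use.

from collections import Counter
from itertools import product
import math

AA = "ACDEFGHIKLMNPQRSTVWY"

def pair_frequencies(peptides):
    # Relative frequency of each of the 400 amino acid pairs across a peptide set.
    counts = Counter()
    for p in peptides:
        counts.update(p[i:i + 2] for i in range(len(p) - 1))
    total = sum(counts.values()) or 1
    return {a + b: counts[a + b] / total for a, b in product(AA, repeat=2)}

def aap_scale(epitopes, non_epitopes, eps=1e-6):
    # Log-ratio of pair frequencies in epitopes versus non-epitopes
    # (a simplified stand-in for the published, rescaled AAP scale).
    fe, fn = pair_frequencies(epitopes), pair_frequencies(non_epitopes)
    return {pair: math.log((fe[pair] + eps) / (fn[pair] + eps)) for pair in fe}

def aap_feature(peptide, scale):
    # Average AAP score over all adjacent pairs; in the full method this value
    # is combined with hydrophilicity, accessibility, flexibility, and
    # antigenicity scales before being passed to the SVM.
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    return sum(scale[p] for p in pairs) / len(pairs)

scale = aap_scale(["KDEARS", "NNTRKD"], ["LLVAGI", "AVILMF"])
print(round(aap_feature("KDEARSNN", scale), 3))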

BCPred is also based on an SVM, combined with a variety of kernel methods, namely string kernels and the widely used radial basis function kernel, for continuous B-cell epitope prediction [16]. String kernels [17-20] are a class of kernel methods that have been used in a variety of text classification tasks [19-24]. Among them, the authors selected four string kernels for building the SVM: the spectrum kernel, mismatch kernel, local alignment kernel, and subsequence kernel. The spectrum kernel maps an input example to feature space using the function $\Phi_k(x) = (\phi_\alpha(x))_{\alpha \in A^k}$, where $\phi_\alpha(x)$ is the number of occurrences of the k-length subsequence α in the peptide x, defined on the alphabet A (e.g. the 20 amino acids). The kernel captures the degree of similarity between two peptides by determining the number of common subsequences in them. The subsequence kernel and the mismatch kernel are both variants of the spectrum kernel. The subsequence kernel considers a feature space generated by all contiguous and non-contiguous subsequences, where gaps in non-contiguous subsequences are penalized.
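The spectrum kernel value itself can be computed directly from k-mer counts, as the short sketch below shows; the choice of k = 3 and the example peptides are arbitrary, and BCPred's actual implementation is not reproduced here.

from collections import Counter

def spectrum_features(peptide, k=3):
    # Counts of every length-k subsequence (k-mer) occurring in the peptide.
    return Counter(peptide[i:i + k] for i in range(len(peptide) - k + 1))

def spectrum_kernel(x, y, k=3):
    # Inner product of the two k-spectrum feature vectors: shared k-mers
    # weighted by how often they occur in each peptide.
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    return sum(fx[s] * fy[s] for s in fx if s in fy)

print(spectrum_kernel("ACDEFGHIK", "CDEFGMHIK"))  # 4 shared 3-mers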

The mismatch kernel considers inexact matching in the comparison of substrings. The local alignment kernel is a string kernel specific to biological sequences [21]; it determines the level of similarity between two peptides by summing the scores obtained from gapped local alignments between them. The authors used peptides 20 amino acids in length derived from the Bcipep database, and found that the maximum prediction accuracy, 74.57%, was obtained using an SVM trained with the subsequence kernel.

BEOracle [25] is an SVM-based method that combines evolutionary information with various structural properties to predict B-cell epitopes. In total, the authors evaluated evolutionary conservation information, compositional and per-residue probabilities for secondary structure, solvent accessibility, disorder, low complexity, and structural properties as potential learning features for the SVM. The majority of the features were calculated using the Open Life Science Gateway (OLSGW) [26], a grid computing resource that facilitates the computation of complex biological problems. The dataset used in this study was retrieved from the Bcipep, AntiJen [27], and Immune Epitope Database (IEDB) [28] databases, and the peptides were extended to a final length of 100 amino acids in order to obtain accurate predictions of structural properties. Features extracted from the dataset were used as input to SVMs trained with different kernel functions, including linear kernels, polynomial kernels of degrees 2 and 3, RBF kernels, and sigmoid kernels. The authors found that the best accuracy, 82.16%, was achieved with an SVM trained on evolutionary information combined with secondary structure information, using a polynomial kernel of degree 3.
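Training an SVM with a polynomial kernel of degree 3, as reported to perform best for BEOracle, can be sketched with scikit-learn as follows; the feature matrix is a random placeholder standing in for the evolutionary and secondary structure features, so the printed accuracy is meaningless and only the kernel configuration reflects the text above.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix: each row would hold the evolutionary conservation
# and secondary structure features computed for one 100-residue window.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 40))
y = rng.integers(0, 2, size=300)
X[y == 1, :10] += 0.8                   # inject a weak artificial signal

clf = SVC(kernel="poly", degree=3, C=1.0, gamma="scale")
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())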
