Chapter 5 Predicting immunogenicity of MHC binding peptides
5.8 Independent test performance of POPI-MHC2
To test the informative physicochemical properties mined from the PEPMHCII dataset, we extracted an additional independent test dataset, IEDB1500, from the IEDB database [94], which is one of the largest collections of immune epitopes. IEDB1500 consists of all T-cell response data obtained using proliferation assays, a human host, and naturally processed peptides restricted by HLA class II molecules. Peptides from human source proteins were removed because this study aims to model normal immune systems rather than hosts with autoimmune disease. Note that the T-cell response data are qualitative: a peptide is annotated as either immunogenic or non-immunogenic. After removing duplicate and inconsistent records, IEDB1500 contains 1301 immunogenic and 199 non-immunogenic peptides.
All peptides in IEDB1500 were encoded using the 21 informative physicochemical properties. Because the immunogenicity levels and dataset sizes of PEPMHCII and IEDB1500 differ greatly, it is difficult and unfair to test the PEPMHCII-derived model directly on IEDB1500. To evaluate the prediction performance of the 21 informative physicochemical properties, a jackknife test was therefore applied to the peptides in IEDB1500 with the default SVM parameters C=1 and γ=1/21.
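The jackknife (leave-one-out) procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the nearest-centroid scorer is a stand-in for the SVM, and all function names and the toy feature vectors are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def centroid(rows: List[Vector]) -> Vector:
    """Column-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def centroid_score(train_x: List[Vector], train_y: List[int], x: Vector) -> float:
    """Stand-in scorer (NOT an SVM): positive when x is closer to the
    centroid of the positive class than to that of the negative class."""
    pos = centroid([v for v, y in zip(train_x, train_y) if y == 1])
    neg = centroid([v for v, y in zip(train_x, train_y) if y == 0])
    d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
    d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
    return d_neg - d_pos


def jackknife_scores(
    xs: List[Vector],
    ys: List[int],
    scorer: Callable[[List[Vector], List[int], Vector], float] = centroid_score,
) -> List[float]:
    """Score every sample with a model built from all the other samples."""
    scores = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # leave sample i out
        train_y = ys[:i] + ys[i + 1:]
        scores.append(scorer(train_x, train_y, xs[i]))
    return scores


# Toy data: two features per sample, label 1 = immunogenic (hypothetical values).
xs = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
ys = [0, 0, 1, 1]
scores = jackknife_scores(xs, ys)
```

Each sample is scored exactly once by a model that never saw it, so the resulting scores can be passed to any ranking-based measure such as the AUC.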
The area under the receiver operating characteristic (ROC) curve (AUC) is a robust, nonparametric performance measure for binary classification problems and is widely used to compare prediction methods. With the jackknife test, POPI-MHC2 achieved a reasonably high AUC of 0.67 on this highly unbalanced dataset, which differs from PEPMHCII.
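For readers who want to reproduce the measure, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is illustrative only; the function name and the toy IC50 values are assumptions, not data from this study. It also shows why predicted IC50 values are negated before being used as scores.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted IC50 values: a large IC50 means a weak binder,
# so the values are negated to make "higher score" mean "stronger binder".
ic50 = [50.0, 500.0, 5000.0, 20000.0]
labels = [1, 0, 1, 0]
affinity_scores = [-v for v in ic50]
affinity_auc = auc(affinity_scores, labels)  # 3 of 4 pairs ranked correctly
```

The pairwise form is quadratic in the number of samples, which is acceptable for datasets of a few thousand peptides; a sort-based version would be preferable for much larger sets.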
Section 5.4 already showed the poor performance of the affinity-based method AFFIPRE on PEPMHCII. However, an additional comparison between POPI-MHC2 and affinity-based methods on IEDB1500 is desirable to show the robustness of POPI-MHC2. Because the peptides in IEDB1500 lack annotated MHC binding affinities, two state-of-the-art methods, the ARB method [95] in the IEDB analysis resource [96] and NetMHCIIpan [97], were applied to predict the binding affinity of each peptide-MHC complex. The ARB method is based on an average relative binding matrix and directly predicts IC50 values. The matrix is trained on a large number of quantitative peptide binding data from IEDB and is regularly updated with new data. It has also been benchmarked as one of the best methods for binding affinity prediction [96, 98]. NetMHCIIpan, which is based on neural networks, allows pan-specific prediction of peptide binding affinity to many HLA-DR molecules and is ranked as the best individual predictor [99]. Therefore, comparing POPI-MHC2 with ARB and NetMHCIIpan can provide meaningful results.
Table 5.8 Predicted levels of peptides to induce both CTL and HTL responses.

Predicted level                          High  Moderate  Little  None  Total
Peptides with high-level CTL response       0        17      14    70    101
Peptides with high-level HTL response      12        16       8    21     57

Because most peptides are not annotated with complete supertype and subtype information for their restricting MHC alleles, ARB and NetMHCIIpan cannot predict their binding affinity. Therefore, two datasets, TEST163 and TEST320, consisting of only the 163 and 320 peptides restricted by alleles supported by ARB and NetMHCIIpan, respectively, were extracted from IEDB1500. For each peptide in TEST163, the binding affinity was predicted by ARB. The scores used to calculate the ROC curve are the negated predicted IC50 values, because a large IC50 value indicates a weak binder. The binding score predicted by NetMHCIIpan represents the binding strength of each peptide in TEST320 and is used directly to calculate the ROC curve. For POPI-MHC2, the jackknife test using the 21 informative physicochemical properties and default SVM parameters was again used to evaluate the prediction performance on TEST163 and TEST320.
Owing to the small number of peptides in TEST163 and TEST320, POPI-MHC2 performs slightly worse than on IEDB1500. However, POPI-MHC2 with AUC=0.60 is still much better than the affinity prediction method ARB with AUC=0.34 on TEST163. Similarly, NetMHCIIpan with AUC=0.43 is worse than POPI-MHC2 with AUC=0.59 on TEST320. The poor performance of the affinity prediction methods is understandable because they are not designed to predict T-cell responses directly. The results support the idea that binding affinity alone is not sufficient for predicting T-cell responses.
5.9 Follow-up works
A recently published study used our POPI prediction server to analyze the secretome of Candida albicans [100]. Candida albicans is a pathogenic fungus that secretes a large number of proteins. To select candidates for vaccine development, the authors applied mass spectrometry to identify secreted proteins and used our POPI server to predict peptide immunogenicity. In total, 29 highly immunogenic peptides originating from 18 proteins were identified as candidates for vaccine development.
A study conducted at the University of Tübingen, Germany, sought to improve on our work by constructing a larger dataset and by replacing the averaged values of the informative physicochemical properties with a representation that accounts for position effects [101].
5.10 Summary
The effectiveness of peptide-based vaccines depends on peptide immunogenicity. Accurate prediction of peptide immunogenicity can substantially reduce experimental effort. This study investigates the problem of predicting peptide immunogenicity and proposes two efficient prediction systems, POPI and POPI-MHC2, to predict the immunogenicity of peptides of variable length. POPI and POPI-MHC2 are SVM-based classifiers that use a set of informative features selected by the proposed informative physicochemical property mining algorithm (IPMA).
In this study, two datasets, PEPMHCI and PEPMHCII, of peptides associated with human MHC class I and class II molecules, respectively, were extracted from MHCPEP. Considering the correlated effects among physicochemical properties and their interaction with the SVM classifier, feature selection and parameter tuning were optimized simultaneously using IPMA. Feature sets of 23 and 21 physicochemical properties were selected to implement the prediction systems POPI and POPI-MHC2, respectively.
To our knowledge, POPI and POPI-MHC2 are the first computational systems for predicting peptide immunogenicity based on physicochemical properties. The feature selection method was compared with a rank-based selection method, and the selected properties were analyzed using the factor analysis of orthogonal experimental design. Simulation results show that, compared with the rank-based method, IPMA can select a small set of informative properties while accounting for their correlated effects.
Three prediction methods were tested for comparison: the alignment-based methods ALIGN and PSI-BLAST, and the affinity-driven prediction method AFFIPRE. Because the reference dataset is not sufficiently large, ALIGN and PSI-BLAST do not work well. The poor performance of AFFIPRE shows that affinity is not suitable for predicting peptide immunogenicity directly. This result is consistent with previous findings that peptide immunogenicity does not strongly correlate with affinity for the MHC molecule [76, 77].
To cope with the small size of the training dataset when mining informative physicochemical properties, the proposed method reports, for each selected property, its effectiveness, measured by its main effect difference in discriminating immunogenicity levels, and its robustness, measured by its selection frequency. This information helps in determining the best set of features for an accurate prediction system and in further understanding immunogenicity through the informative physicochemical properties.
Chapter 6
Identification of T-cell receptor recognition sites
Compared with the knowledge of anchor positions of peptides for MHC binding, previous studies identifying T-cell receptor (TCR) recognition positions were based on small-scale analyses of only a few peptides and reached different conclusions about the recognition positions. Large-scale analyses are necessary to better characterize and predict a peptide's T-cell reactivity (and thus immunogenicity). Identifying and characterizing the positions that influence T-cell reactivity will provide insights into the underlying mechanism of immunogenicity. In Chapter 5, the POPI prediction systems were proposed to predict peptide immunogenicity with reasonably high accuracy. However, the effect of MHC alleles on immunogenicity was not considered. Moreover, the averaged features make it hard to identify T-cell receptor recognition sites. In this chapter, a weighted degree string kernel is proposed to identify T-cell receptor recognition sites and to improve prediction performance by considering the effects of positions and MHC alleles.
6.1 Motivation
POPI [59] (Chapter 5) was the first published predictor for T-cell reactivity. POPI is a support vector machine (SVM)-based method trained on 23 informative physicochemical properties of MHC class I binding peptides. While POPI performs reasonably well, it uses averaged physicochemical properties to represent peptides independent of their length. It therefore cannot identify the positions of the peptide that are relevant for T-cell reactivity, and it cannot yield structural insights into T-cell reactivity.
In previous studies on the formation of the TCR-peptide-MHC complex, crystal structures were analyzed [102-104] to correlate structural features of the TCR with immunogenicity and to identify TCR recognition positions. However, because few crystal structures of the ternary complex are available, these are only case studies with limited potential for generalization. For example, two studies found different important positions of HLA-A2 binding peptides for TCR recognition (position 8 [104]; positions 4 and 6 [102]). As an alternative approach to T-cell reactivity, experiments with substitutions and cytotoxicity assays have been performed for HLA-B27 [105]. However, the results so far are based on only a few peptides. Large-scale analyses are thus desirable to better characterize the important positions of MHC binding peptides for immunogenicity.
In this work, a systematic statistical approach is proposed for the prediction of T-cell reactivity. This study presents a more advanced machine learning approach that considers the effects of MHC restriction on immunogenicity. To better characterize the immunogenicity induced by MHC class I binding peptides, we employ support vector machines (SVMs) with string kernels (SK), which have been successfully applied in many classification tasks [106-110]. This method was applied (1) to predict peptide immunogenicity and (2) to identify important positions of MHC binding peptides for immunogenicity. The present study is based on a large dataset, IMMA2, which contains data from the databases MHCPEP [83], SYFPEITHI [111, 112] and IEDB [94].
The prediction system POPISK for predicting the immunogenicity of HLA-A2 binding peptides was built on this machine learning approach. POPISK performs well, achieving an overall accuracy (ACC) of 0.68 and an area under the receiver operating characteristic curve (AUC) of 0.74. This is significantly better than POPI on the same IMMA2 dataset (0.60 for ACC and 0.64 for AUC). In an analysis of seven HLA-A2-binding peptides with known crystal structures, POPISK accurately predicts the immunogenicity of the majority of peptides and successfully predicts the immunogenicity changes caused by single-residue modifications reported in previous studies [113, 114]. We also analyzed the importance of amino acid positions of the peptides by selecting positions whose deletion significantly decreases prediction performance. This technique shows that six positions (1, 4, 5, 6, 8 and 9) of HLA-A2 binding peptides are the most important for T-cell reactivity and thus immunogenicity. Three of these positions were reported in previous studies (position 8 [104]; positions 4 and 6 [102]). As a confirmation, graphical analyses using two sample logos [115] identified nearly identical important positions 4, 6, 8 and 9.
6.2 Datasets
We first extracted peptide binders of length 9 with associated human MHC class I alleles and the corresponding immunogenicity data from MHCPEP [83], SYFPEITHI [111, 112] and IEDB [94]. For the MHCPEP database, the peptide sequences and their associated MHC alleles, binding data and immunogenicity data were extracted from the fields 'SEQUENCE', 'MHC MOLECULE', 'BINDING' and 'ACTIVITY', respectively. The 'BINDING' field annotates a peptide as either a binder or a non-binder. Peptide immunogenicity in MHCPEP is defined by the PD50 value, which is the peptide concentration giving 50% of the maximal specific lysis by cytotoxic T-cells of target cells displaying the MHC-peptide complex. According to MHCPEP, a peptide with a PD50 value (obtained from the field 'ACTIVITY') larger than 10 μM is considered non-immunogenic; all others are considered immunogenic. For the SYFPEITHI database, the data on binders and immunogenic peptides associated with various MHC alleles were extracted from the fields 'Natural ligands' and 'T-Cell epitopes', respectively. For the IEDB database, the peptide sequences and their associated MHC alleles, qualitative binding data and qualitative immunogenicity data were extracted from the fields 'Epitope', 'MHC Restriction', 'MHC binding' and 'T cell response', respectively.

Figure 6.1 Comparison of nested 10-CV performances of POPISK, POPI-modified and POPI-IPMA.
Only peptides with a positive binding annotation were selected for analysis. These peptide sequences were grouped into allele-specific datasets according to their associated HLA supertypes [116]. To use all available data, peptides with contradictory annotations (both immunogenic and non-immunogenic) were regarded as immunogenic. After removing duplicate entries, the dataset for allele HLA-A2 (named IMMA2) consists of 558 immunogenic and 527 non-immunogenic peptides. The IMMA2 dataset is available at http://iclab.life.nctu.edu.tw/POPISK/download.php. This study focuses on HLA-A2 because it is one of the best-known alleles, which makes it easy to compare the results of this study with previous knowledge. In addition, because only a small number of peptides are associated with the other alleles, it is hard to create robust models for them.
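The labeling and merging rules described above can be sketched in a few lines. This is an illustrative reading of the text, not the script used to build IMMA2; the function names and the toy peptide strings are hypothetical, and the length filter of 9 and the 10 μM cutoff are taken from the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

PD50_CUTOFF_UM = 10.0  # from the text: PD50 larger than 10 uM means non-immunogenic


def label_from_pd50(pd50_um: float) -> int:
    """Return 1 (immunogenic) unless the PD50 value exceeds the cutoff."""
    return 0 if pd50_um > PD50_CUTOFF_UM else 1


def merge_records(records: Iterable[Tuple[str, int]], length: int = 9) -> Dict[str, int]:
    """Deduplicate (peptide, label) records from several sources.

    Peptides with contradictory labels are kept as immunogenic (label 1),
    following the rule stated in the text. Peptides of other lengths are dropped.
    """
    seen: Dict[str, Set[int]] = defaultdict(set)
    for peptide, label in records:
        if len(peptide) == length:
            seen[peptide].add(label)
    return {peptide: (1 if 1 in labels else 0) for peptide, labels in seen.items()}


# Toy records (placeholder sequences, not real epitopes).
records = [
    ("AAAAAAAAA", 1),
    ("AAAAAAAAA", 0),  # contradictory annotation, kept as immunogenic
    ("CCCCCCCCC", 0),
    ("CCCCCCCCC", 0),  # exact duplicate
    ("DDDDDDDDD", label_from_pd50(0.5)),
    ("EEEE", 1),       # wrong length, dropped
]
merged = merge_records(records)
```

Resolving contradictions in favor of the immunogenic label keeps every peptide in the dataset, at the cost of possibly adding some label noise to the positive class.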
6.3 Weighted degree string kernel
An effective weighted degree string kernel [109, 117], which counts the matched subsequences of length p at corresponding positions of two sequences, is applied to map samples into a high-dimensional space in which linear separation is easier. Given two sequences s_i and s_j of equal length L and a degree d, the weighted degree string kernel computes the total number of matched subsequences of length p ∈ {1, …, d} at corresponding positions l of the two sequences, defined as follows:
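In the string kernel literature, the weighted degree kernel is commonly written as

k(s_i, s_j) = sum over p = 1..d of beta_p * sum over l = 1..L-p+1 of I(u_{p,l}(s_i) = u_{p,l}(s_j)),

where u_{p,l}(s) is the subsequence of length p starting at position l of s, I(.) is the indicator function, and beta_p = 2(d - p + 1) / (d(d + 1)) is the usual default weighting. The weights beta_p are an assumption here, since the exact weighting used in this study is not restated in this section. Under that assumption, a minimal sketch of the kernel (the function name is hypothetical) is:

```python
def wd_kernel(s_i: str, s_j: str, d: int) -> float:
    """Weighted degree string kernel for two sequences of equal length.

    Counts substrings of length p = 1..d that match at the same start
    position in both sequences, weighted by
    beta_p = 2 * (d - p + 1) / (d * (d + 1))  (assumed default weights).
    """
    if len(s_i) != len(s_j):
        raise ValueError("sequences must have equal length")
    if d < 1:
        raise ValueError("degree d must be at least 1")
    length = len(s_i)
    total = 0.0
    for p in range(1, d + 1):
        beta = 2.0 * (d - p + 1) / (d * (d + 1))
        # Number of start positions l where the length-p substrings agree.
        matches = sum(
            1 for l in range(length - p + 1) if s_i[l:l + p] == s_j[l:l + p]
        )
        total += beta * matches
    return total


# Toy strings, not real peptides.
k_same = wd_kernel("ABC", "ABC", 1)   # three single-residue matches, beta_1 = 1
k_close = wd_kernel("ABC", "ABD", 2)  # 2 * (2/3) + 1 * (1/3)
```

Because matches are only counted at identical positions, the kernel is position-specific, which is what makes a per-position analysis of 9-mer peptides possible.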
LIBSVM [40] was chosen for the implementation of the predictor.
6.4 Prediction of peptide immunogenicity
To accurately predict the immunogenicity of HLA-A2 binding peptides, it is necessary to tune two parameters (the cost parameter C of the SVM and the degree d of the weighted degree kernel) to build an accurate SVM classifier. In this study, a nested 10-fold cross-validation (10-CV) procedure was adopted to evaluate the prediction performance of our string kernel-based SVM classifier, as it provides an almost unbiased estimate of the prediction error [119].
The nested 10-CV consists of two cross-validation loops: an inner loop for tuning the SVM parameters and an outer loop for evaluating the prediction performance of the tuned SVM classifiers. First, the IMMA2 dataset was randomly divided into ten subsets of approximately equal size. In each iteration m of the outer loop, the m-th subset is left out for testing. The classifier tested on it is trained on the remaining data using the parameters that give the highest AUC in a 10-CV on that remaining data (inner loop). A grid search is applied to tune the parameters C ∈ {2⁻⁴, 2⁻³, …, 2⁴} and d ∈ {1, 2, …, 9}.
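The structure of the two loops can be sketched as below. This is a schematic under stated assumptions rather than the code used in this study: `evaluate` is a hypothetical callback that would train an SVM on the given training indices with the given parameters and return a score such as the AUC on the held-out indices; here a toy callback that ignores the data is used so that the sketch runs on its own.

```python
import math
import random
from itertools import product
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]
Evaluate = Callable[[List[int], List[int], Params], float]


def k_folds(n: int, k: int, rng: random.Random) -> List[List[int]]:
    """Randomly split the positions 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv(n: int, grid: List[Params], evaluate: Evaluate,
              k: int = 10, seed: int = 0) -> List[Tuple[float, Params]]:
    """Return (outer test score, selected parameters) for each outer fold."""
    rng = random.Random(seed)
    outer = k_folds(n, k, rng)
    results = []
    for m, test_idx in enumerate(outer):
        train_idx = [i for f, fold in enumerate(outer) if f != m for i in fold]
        inner = k_folds(len(train_idx), k, rng)  # positions into train_idx

        def inner_score(params: Params) -> float:
            # Mean validation score of one grid point over the inner folds.
            scores = []
            for j, val_pos in enumerate(inner):
                val = [train_idx[i] for i in val_pos]
                fit = [train_idx[i] for f, fold in enumerate(inner)
                       if f != j for i in fold]
                scores.append(evaluate(fit, val, params))
            return sum(scores) / len(scores)

        best = max(grid, key=inner_score)  # tune on the inner loop only
        results.append((evaluate(train_idx, test_idx, best), best))
    return results


# Grid from the text: C in {2^-4, ..., 2^4}, d in {1, ..., 9}.
grid = [{"C": 2.0 ** c, "d": d} for c, d in product(range(-4, 5), range(1, 10))]


def toy_evaluate(fit_idx: List[int], val_idx: List[int], params: Params) -> float:
    """Hypothetical score that ignores the data and peaks at C=1, d=9."""
    return -abs(math.log2(params["C"])) - abs(params["d"] - 9)


results = nested_cv(50, grid, toy_evaluate, k=5)
```

The key property is that the outer test fold is never seen while the parameters are selected, which is why the outer scores give an almost unbiased performance estimate.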
To obtain a robust statistical estimate of prediction performance, 20 runs of the nested 10-CV procedure were performed, and the mean values of the performance measures are reported as the final prediction performance. The values of C and d with the highest AUC in the inner 10-CV loop were always 1 and 9, respectively. The mean and standard deviation (SD) of the nested 10-CV performance on the IMMA2 dataset are 0.68 and 0.007 for ACC, 0.74 and 0.004 for AUC, and 0.37 and 0.013 for MCC, respectively (Figure 6.1). All nine string kernels and five complex string kernels provided by Shogun were also evaluated. Most of them perform similarly to or slightly worse than the weighted degree string kernel. Except for the cost parameter C and the degree parameter d, these results were obtained using default parameter values. All kernels might thus perform better after careful tuning of their respective parameters.
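For reference, ACC and the Matthews correlation coefficient (MCC) follow directly from the confusion matrix, and the reported mean and SD are summaries over repeated runs. The sketch below uses hypothetical function names and toy numbers; it also assumes the sample standard deviation, since the text does not say whether the sample or population form was used.

```python
import math
import statistics
from typing import Sequence, Tuple


def confusion(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy predictions (not results from this study).
counts = confusion([1, 1, 0, 0], [1, 0, 0, 0])
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)

# Summarizing repeated runs: mean and sample SD of per-run scores (toy values).
run_scores = [0.70, 0.72, 0.74]
mean_score = statistics.mean(run_scores)
sd_score = statistics.stdev(run_scores)
```

Unlike ACC, MCC uses all four cells of the confusion matrix, which makes it more informative when the two classes differ in size.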
6.5 Comparison to POPI
POPI is an SVM-based method that uses a radial basis function kernel and 23 informative physicochemical properties mined with an inheritable bi-objective genetic algorithm. It is not fair to compare the results of POPISK directly with POPI because POPI is a four-class prediction method that predicts a peptide as highly, moderately, little or not immunogenic. Furthermore, POPI is based on a smaller dataset. To enable a comparison, a modified POPI method (POPI-modified) was constructed using the same IMMA2 dataset and the 23 informative physicochemical properties for the binary prediction problem of immunogenic versus non-immunogenic peptides.
The evaluation procedure of POPI-modified is as follows. First, the 23 informative physicochemical properties were used to encode the peptides of the IMMA2 dataset. Subsequently, 20 runs of nested 10-CV were applied. A grid search was used to tune the cost parameter C ∈ {2⁻⁴, 2⁻³, …, 2⁴} and the kernel parameter γ ∈ {2⁻⁴, 2⁻³, …, 2⁴} in the inner 10-CV loop. The SVM classifiers trained with the parameters giving the highest AUC in the inner 10-CV loop were used to evaluate the prediction performance in the outer 10-CV loop.
Because the datasets and the assays for measuring immunogenicity differ between the original POPI method and POPISK, a further comparison that uses IPMA to reselect the informative physicochemical properties can provide better insight into the advantage of the string kernel method used in POPISK. However, owing to the time-consuming nature of the genetic algorithm, it is difficult to perform 200 runs of IPMA. To balance the need for preliminary comparison results against the experimental effort, 20 runs of IPMA were applied to give a rough performance estimate for comparison with POPISK. The evaluation procedure of POPI-IPMA is similar to that of POPI-modified. The only difference is that POPI-IPMA reselects the informative physicochemical properties according to the validation performance instead of using the 23 informative physicochemical properties selected by the previous POPI method.

Figure 6.2 The decrease in MCC performances evaluated on datasets without using residues in specific positions.
The comparison of the nested 10-CV performances of POPISK, POPI-modified and POPI-IPMA is shown in Figure 6.1. POPISK clearly outperforms POPI-modified, with improvements of about 10% in ACC and AUC. Although POPISK performs 2-5% better than POPI-IPMA, note that POPI-IPMA uses averaged features and could be further improved by replacing these position-independent features with features that account for the position effects of physicochemical properties. The nested 10-CV performances and corresponding SD values of POPI-modified are 0.60 and 0.009 for ACC, 0.64 and 0.009 for AUC, and 0.19 and 0.018 for MCC, respectively. The nested 10-CV performances and corresponding SD values of POPI-IPMA are 0.65 and 0.017 for ACC, 0.68 and 0.147 for AUC, and 0.30 and 0.033 for MCC, respectively. With more data, POPISK is expected to perform better and could be applied to analyze the immunogenicity of peptides associated with other MHC alleles.
6.6 Identification of important positions for immunogenicity
Compared with the well-known MHC binding motifs, the T-cell recognition positions of MHC binding peptides are still not fully understood. Some studies have aimed to identify the T-cell recognition positions. However, these studies were based on only a few crystal structures and identified different recognition positions [102-104]. The computational identification of important positions for immunogenicity will shed light on the mechanism of T-cell recognition and accelerate the development of peptide-based vaccines. To assess the contribution of each position of MHC-binding peptides to the prediction performance, we propose an efficient method for estimating the importance of positions, described as follows.
The proposed method uses the decrease in prediction performance that results from removing the sequence information at a specific position within the peptide to quantify the importance of that position. The larger the decrease in performance, the more important the position. The change in prediction performance is evaluated as follows. First, nine additional datasets, one for each of the nine positions, were created by removing the residues at the corresponding position from the IMMA2 dataset.
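The construction of the nine reduced datasets can be sketched as follows. This is an illustrative reading of the description, with hypothetical function names and placeholder sequences; it assumes that "removing" a residue means deleting it, so that each 9-mer becomes an 8-mer, rather than masking it with a placeholder symbol.

```python
from typing import Dict, List


def drop_position(peptide: str, pos: int) -> str:
    """Delete the residue at 1-based position pos."""
    if not 1 <= pos <= len(peptide):
        raise ValueError("position out of range")
    return peptide[:pos - 1] + peptide[pos:]


def position_datasets(peptides: List[str], length: int = 9) -> Dict[int, List[str]]:
    """One reduced dataset per position, each lacking that position's residue."""
    return {
        pos: [drop_position(p, pos) for p in peptides]
        for pos in range(1, length + 1)
    }


def importance(baseline: float, reduced: Dict[int, float]) -> Dict[int, float]:
    """Decrease in performance per position; larger means more important."""
    return {pos: baseline - score for pos, score in reduced.items()}


# Toy 9-mers (placeholder sequences, not real epitopes).
toy_peptides = ["ABCDEFGHI", "IHGFEDCBA"]
datasets = position_datasets(toy_peptides)
```

In practice each reduced dataset would be evaluated with the same nested 10-CV procedure as the full dataset, and the resulting scores passed to `importance` together with the baseline score.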