
Chapter 4 Prediction of ubiquitylation sites

4.8 Summary

Ubiquitylation plays many important regulatory roles in the physiology of eukaryotic cells. Many experimental studies currently aim to identify ubiquitylated proteins and their ubiquitylation sites, and accurate computational prediction of ubiquitylation sites can save experimental effort. In this study, an SVM-based method is presented to assess three kinds of features, namely amino acid identity, evolutionary information and physicochemical properties, for predicting ubiquitylation sites. Ubiquitylation datasets extracted from the UbiProt database are established to evaluate the proposed methods. The results show that physicochemical properties are the best kind of feature for the SVM-based prediction method.

It is well recognized that irrelevant information interferes with classifiers. This study proposes an algorithm, IPMA, for mining a small set of informative physicochemical properties to advance the prediction performance. The 31 informative physicochemical properties improve the prediction accuracy from 72.19% to 84.44%, and their individual effectiveness is ranked to further the understanding of the ubiquitylation mechanism. Finally, the system UbiPred for predicting ubiquitylation sites is designed by using the 31 informative physicochemical properties. The web server of UbiPred has been implemented and is available at http://iclab.life.nctu.edu.tw/ubipred.

Chapter 5 Predicting immunogenicity of MHC binding peptides

Both modeling of antigen processing and presentation pathways and immunogenicity prediction of MHC-binding peptides are essential to developing a computer-aided vaccine design system, which is one goal of immunoinformatics. Numerous studies have dealt with modeling the immunogenic pathway, but not with the intractable problem of immunogenicity prediction, owing to the complex effects of many intrinsic and extrinsic factors. Moderate affinity of the MHC-peptide complex is essential to induce immunogenicity, but the relationship between affinity and peptide immunogenicity is too weak to use for predicting immunogenicity.

This study focuses on mining informative physicochemical properties from known experimental immunogenicity data to understand immunogenicity and predict immunogenicity of MHC-binding peptides accurately.

5.1 Motivations

After predicting peptides that bind to cytotoxic T lymphocytes (CTL) and helper T lymphocytes (HTL), it is desirable to define peptide immunogenicity so that the immunogenicity of epitopes (i.e. CTL and HTL responses) can be accurately predicted for vaccine design. Peptide immunogenicity is influenced by many factors, including intrinsic physicochemical properties and extrinsic factors such as the host immunoglobulin repertoire [74, 75]. Several studies aimed to clarify the relationship between the binding affinity of a peptide to the MHC molecule and its immunogenicity [76, 77]. These studies revealed that moderate binding affinity of peptide-MHC molecules is essential to induce immunogenicity, but that the ability of peptides to induce cytotoxic T lymphocyte and helper T lymphocyte responses does not strongly correlate with their affinity for the MHC molecule. In some extreme cases, a peptide with nearly undetectable binding affinity for MHC class II molecules can induce strong T-cell responses [78]. Furthermore, peptide-flanking residues other than MHC anchor residues were identified as important factors for MHC class II-restricted T-cell responses [79, 80]. These studies show the great importance of modeling T-cell responses.

Physicochemical properties of amino acids have been used extensively and successfully in sequence-based prediction methods [33-37]. Because of the weak correlation between peptide immunogenicity and peptide-MHC binding affinity, mining informative physicochemical properties is a potentially good approach to designing a classifier for predicting immunogenicity. Because more than 500 physicochemical properties are available, the properties used in previous studies are usually selected according to domain knowledge [36] or by a rank-based method [81]. These methods therefore cannot be effectively applied to the investigated intractable problems, because of limited knowledge or neglect of the correlated effects among multiple properties [33]. This study aims to design an accurate predictor by efficiently selecting a small set of informative physicochemical properties while considering the correlated effects.

It is well recognized that feature selection and classifier design should be optimized simultaneously to maximize prediction accuracy [82]. SVM-based learning methods have been shown to be effective for various predictions from protein sequences [12, 15]. However, conventional SVMs offer no internal detection of correlations among relevant features, and the appropriate setting of their control parameters is often treated as a separate, independent problem [40]. Let there be n candidate physicochemical properties of amino acids. Maximizing the accuracy of the investigated prediction problem by selecting a small number m out of n properties, while simultaneously cooperating with the SVM, is equivalent to solving a binary combinatorial optimization problem with a huge search space of size C(n, m) = n!/(m!(n-m)!). To solve this problem, an informative physicochemical property mining algorithm (IPMA) capable of simultaneous feature selection and classifier design (described in Chapter 3) is proposed to mine informative physicochemical properties for predicting CTL and HTL responses.
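To give a sense of scale, the size of this search space can be evaluated directly. The sketch below is illustrative only: it uses n=531, the number of candidate properties stated in Section 5.2, together with a few example values of m chosen here for illustration; the function name is hypothetical.

```python
import math


def search_space_size(n: int, m: int) -> int:
    """Number of ways to choose m properties out of n: C(n, m) = n!/(m!(n-m)!)."""
    return math.factorial(n) // (math.factorial(m) * math.factorial(n - m))


if __name__ == "__main__":
    n = 531  # candidate physicochemical properties (Section 5.2)
    for m in (5, 23, 45):  # example subset sizes, chosen for illustration
        size = search_space_size(n, m)
        print(f"C({n}, {m}) has {len(str(size))} decimal digits")
```

Even for a fixed m, exhaustive enumeration is out of reach, which is why a heuristic search such as IPMA is used instead.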

5.2 The proposed prediction systems

Two prediction systems, POPI and POPI-MHC2, were proposed to predict the immunogenicity of MHC class I and II binding peptides, respectively. The high performance of POPI and POPI-MHC2 arises mainly from the inheritable bi-objective genetic algorithm, which aims to automatically determine the best number m out of 531 physicochemical properties, identify these m properties, and tune the SVM parameters simultaneously. The datasets PEPMHCI and PEPMHCII consist of 428 human MHC class I binding peptides and 226 human MHC class II binding peptides, respectively. All the peptides belong to four classes of immunogenicity and are extracted from MHCPEP, a database of MHC-binding peptides [83]. Table 5.1 shows the datasets PEPMHCI and PEPMHCII of peptides associated with human MHC class I and II molecules, respectively. By applying the proposed IPMA to the experimental datasets, the two prediction systems POPI and POPI-MHC2 were constructed using the selected informative physicochemical properties.

IPMA is performed to mine informative physicochemical properties using the whole datasets PEPMHCI and PEPMHCII. In this study, the parameters of IPMA are set as N_pop=50, P_c=0.8, P_m=0.05, r_start=5 and r_end=45. For each feature-set size r, IPMA selects a small set of physicochemical properties and the parameter values of the SVM. Figure 5.1 shows a potentially good result for PEPMHCI in terms of averaged accuracy (AA) and the number of used features, obtained from a single run of IPMA using 10-fold cross-validation (10-CV). The result reveals that the best number of selected features is m=23, where the SVM classifier with C=2 and γ=2 has the best averaged accuracy AA=63.67% and overall accuracy OA=66.12%.

Table 5.1 The datasets PEPMHCI and PEPMHCII of peptides associated with human MHC class I and II molecules

Immunogenicity class   PEPMHCI   PEPMHCII
None                   144       45
Little                 83        60
Moderate               100       64
High                   101       57
Total                  428       226

To further evaluate the feature selection of IPMA, a traditional rank-based method that evaluates the performance of each single feature is also implemented for comparison. The rank-based method is incapable of finding appropriate values of C and γ for training SVM classifiers, so two parameter settings of the SVM were tested in order to achieve high performance. The first rank-based method, RankD, uses the default values of the SVM parameters, C=1 and γ=1/r. The best performance of RankD is AA=36.08% with 21 features. The second rank-based method, RankI, uses the values C=2 and γ=2 obtained from IPMA. The best performance of RankI is AA=48.87% with 18 features. Figure 5.1 shows that RankI performs better than RankD, revealing that the SVM parameter setting derived from IPMA is effective.
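The rank-based baseline can be sketched as follows. This is a minimal illustration, assuming each property has already been given a single-feature score (for example, the cross-validation accuracy of an SVM trained on that property alone); the scores, property names and helper functions are hypothetical, and only the default setting C=1, γ=1/r is taken from the text.

```python
def top_r_by_rank(single_feature_scores, r):
    """Return the r properties with the highest individual scores.

    Each property is scored on its own, so correlated effects among
    properties are ignored by construction.
    """
    ranked = sorted(single_feature_scores, key=single_feature_scores.get, reverse=True)
    return ranked[:r]


def rankd_parameters(r):
    """Default SVM parameters used for RankD: C=1 and gamma=1/r."""
    return {"C": 1.0, "gamma": 1.0 / r}


if __name__ == "__main__":
    # Hypothetical single-feature scores, for illustration only.
    scores = {"prop_a": 0.31, "prop_b": 0.42, "prop_c": 0.28, "prop_d": 0.39}
    selected = top_r_by_rank(scores, r=2)
    print(selected, rankd_parameters(len(selected)))
```

Because the ranking never looks at pairs or groups of properties, two highly ranked but redundant properties can both be selected, which is the limitation that IPMA is designed to avoid.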

Furthermore, the feature selection of IPMA performs much better than that of the rank-based method. This result agrees with the well-recognized observation that feature selection that additionally considers the correlated effects among physicochemical properties can advance prediction performance. The results of mining informative physicochemical properties for PEPMHCII are similar to those for PEPMHCI and are shown in Figure 5.2.

Figure 5.1 Averaged accuracies (AAs) of 10-CV for IPMA, rank-based methods (RankD and RankI) and the alignment-based method (ALIGN) for MHC class I binding peptides.

5.3 POPI for predicting immunogenicity of MHC class I binding peptides

Table 5.2 Performance comparisons of ALIGN, PSI-BLAST and POPI using LOOCV on the whole dataset PEPMHCI.

Table 5.3 Performance comparisons between AFFIPRE and POPI.

The immunogenicity of a peptide is determined by measuring the concentration of peptide giving 50% of maximum specific lysis by CTLs of target cells displaying the peptide, and is given a descriptive value belonging to one of four classes: None, Little, Moderate and High. POPI, utilizing the 23 selected properties, performs well with an accuracy of 64.72% using leave-one-out cross-validation (LOOCV). For comparison, sequence alignment-based and affinity-based methods were implemented and their LOOCV performances evaluated.

Sequence alignment may be an efficient approach to predicting peptide immunogenicity because similar sequences may have similar immunogenicity. In order to compare alignment-based prediction with POPI, two methods were applied to search for similar sequences: the global sequence alignment tool ALIGN [84] and the advanced sequence comparison method PSI-BLAST, which is capable of detecting remote homologues [60]. For each tested peptide, ALIGN and PSI-BLAST (using three iterations) were applied separately to search for its homologues. The results are shown in Table 5.2.
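The idea behind the alignment-based baselines, that a test peptide inherits the class of its most similar known peptide, can be sketched under leave-one-out cross-validation. The sketch below is not ALIGN or PSI-BLAST: as a stated simplification it scores similarity by counting identical residues at matching positions, and the peptides and labels are invented for illustration.

```python
def identity(a: str, b: str) -> int:
    """Count positions with identical residues (a crude stand-in for an alignment score)."""
    return sum(x == y for x, y in zip(a, b))


def predict_by_nearest(query: str, known: list[tuple[str, str]]) -> str:
    """Assign the class of the most similar known peptide."""
    _, label = max(known, key=lambda item: identity(query, item[0]))
    return label


def loocv_accuracy(data: list[tuple[str, str]]) -> float:
    """Leave each peptide out in turn and predict it from the remaining ones."""
    hits = 0
    for i, (peptide, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += predict_by_nearest(peptide, rest) == label
    return hits / len(data)


if __name__ == "__main__":
    # Invented 9-mer peptides and labels, for illustration only.
    toy = [
        ("AAAAAAAAA", "High"),
        ("AAAAAAAAC", "High"),
        ("CCCCCCCCC", "None"),
        ("CCCCCCCCA", "None"),
    ]
    print(f"LOOCV accuracy on toy data: {loocv_accuracy(toy):.2f}")
```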

In the past, affinity was considered an important index for predicting peptide immunogenicity. To evaluate the affinity-driven prediction method, an additional dataset was established by extracting MHC class I binding peptides with known activity levels in both the 'BINDING' and 'IMMUNOGENICITY' fields of the MHCPEP database. The 'IMMUNOGENICITY' field has four levels, whereas the 'BINDING' field has only three levels and lacks the level 'none'. To evaluate the affinity-driven prediction fairly, the immunogenicity class None was therefore combined with the class Little. The resulting dataset contains 160 peptides belonging to three classes.

To evaluate the affinity-driven prediction method, a prediction system named AFFIPRE was implemented to predict peptide immunogenicity using the following criterion. If the immunogenicity level and the affinity level of a peptide are identical, the test is regarded as a successful prediction; otherwise, the prediction is regarded as a failure. AFFIPRE was evaluated using the same four measurements as those used for IPMA.
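The AFFIPRE criterion amounts to a direct comparison of two annotated levels. The following sketch assumes lowercase level names ('none', 'little', 'moderate', 'high') and a simple list of (binding, immunogenicity) pairs; these names, the record layout and the function names are illustrative assumptions, not the MHCPEP schema.

```python
# Assumed level names; the text only states that BINDING has three levels
# and that the immunogenicity class None is merged with Little.
MERGED_IMMUNOGENICITY = {
    "none": "little",
    "little": "little",
    "moderate": "moderate",
    "high": "high",
}


def affipre_predict(binding_level: str) -> str:
    """AFFIPRE predicts the immunogenicity level to be the binding-affinity level."""
    return binding_level


def is_successful(binding_level: str, immunogenicity_level: str) -> bool:
    """A prediction succeeds when the two levels are identical after merging."""
    return affipre_predict(binding_level) == MERGED_IMMUNOGENICITY[immunogenicity_level]


def overall_accuracy(records) -> float:
    """Fraction of (binding, immunogenicity) records predicted successfully."""
    return sum(is_successful(b, i) for b, i in records) / len(records)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    records = [("high", "high"), ("little", "none"), ("moderate", "high"), ("high", "little")]
    print(f"Overall accuracy: {overall_accuracy(records):.2f}")
```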

The results shown in Table 5.2 reveal that POPI performs well compared with the two sequence alignment-based prediction methods ALIGN (54.91%) and PSI-BLAST (53.23%). The poor performance of AFFIPRE (Table 5.3) reveals that affinity alone cannot be used directly to predict peptide immunogenicity. This result is consistent with previous studies showing that the affinity of peptide-MHC molecules is not the main factor for predicting peptide immunogenicity [76, 77].

In contrast to the existing affinity-based methods, which predict immunogenicity by way of predicting MHC-binding peptides, POPI is the first computational system based on physicochemical properties to predict peptide immunogenicity using epitopes associated with human MHC class I molecules, and it has been implemented as a web server (http://iclab.life.nctu.edu.tw/POPI). To date, there have been >18,690 visits from >20 countries, and >20,000 sequences have been analyzed.

Table 5.4 Performance comparisons of ALIGN, PSI-BLAST and POPI-MHC2.

Immunogenicity   ALIGN              PSI-BLAST          POPI-MHC2
                 ACC (%)   MCC      ACC (%)   MCC      ACC (%)   MCC
None             68.89     0.74     66.67     0.69     86.67     0.81
Little           46.67     0.34     23.21     0.29     68.33     0.54
Moderate         50.00     0.22     75.86     0.22     57.81     0.53
High             71.93     0.56     38.00     0.31     85.96     0.73
OA               58.41              49.75              73.45
AA               59.37              50.94              74.69

Table 5.5 Performance comparison between AFFIPRE and POPI-MHC2.

Immunogenicity class   Peptides   AFFIPRE            POPI-MHC2
                                  ACC (%)   MCC      ACC (%)   MCC
None and Little        21         23.81     0.30     42.86     0.49
Moderate               6          33.33     -0.08    0.00      -0.07
High                   42         50.00     0.16     92.86     0.41
OA                                40.58              69.57
AA                                35.71              45.24

5.4 POPI-MHC2 for predicting immunogenicity of MHC class II binding peptides

The 21 informative physicochemical properties and SVM parameters selected by IBCGA are used to construct POPI-MHC2, an SVM-based prediction system for the immunogenicity of MHC class II binding peptides. The web server has also been implemented and is available at http://iclab.life.nctu.edu.tw/POPI. POPI-MHC2 performs well, with an accuracy of 73.45% using leave-one-out cross-validation, compared with the two alignment-based methods ALIGN (58.41%) and PSI-BLAST (49.75%) shown in Table 5.4.
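The OA and AA rows of Table 5.4 can be reproduced from the per-class accuracies. The sketch below assumes that AA is the unweighted mean of the per-class accuracies and that OA is the accuracy over all peptides, using the PEPMHCII class sizes from Table 5.1; these definitions are inferred from the reported numbers rather than quoted from the text, and the function names are hypothetical.

```python
# PEPMHCII class sizes (Table 5.1) and POPI-MHC2 per-class accuracies in % (Table 5.4).
CLASS_SIZES = {"None": 45, "Little": 60, "Moderate": 64, "High": 57}
POPI_MHC2_ACC = {"None": 86.67, "Little": 68.33, "Moderate": 57.81, "High": 85.96}


def averaged_accuracy(acc):
    """Assumed definition of AA: unweighted mean of the per-class accuracies."""
    return sum(acc.values()) / len(acc)


def overall_accuracy(acc, sizes):
    """Assumed definition of OA: correctly predicted peptides over all peptides.

    Per-class counts are recovered by rounding, since the table reports
    accuracies to two decimals.
    """
    correct = sum(round(acc[c] / 100 * sizes[c]) for c in sizes)
    return 100 * correct / sum(sizes.values())


if __name__ == "__main__":
    print(f"AA = {averaged_accuracy(POPI_MHC2_ACC):.2f}%")
    print(f"OA = {overall_accuracy(POPI_MHC2_ACC, CLASS_SIZES):.2f}%")
```

AA weights every class equally, so it is less sensitive than OA to the uneven class sizes of the dataset.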

For comparison with affinity-based prediction, another dataset consisting of 69 peptides with annotated binding and immunogenicity levels was constructed. POPI-MHC2 (69.57%) performs better than the affinity-based method (40.58%), as shown in Table 5.5. The poor performance of AFFIPRE (OA=40.58% and AA=35.71%) implies that affinity is not the determining factor for the immunogenicity of MHC class II binding peptides. Instead, physicochemical properties might play more important roles in determining immunogenicity.

Figure 5.2 Averaged accuracies (AAs) of 10-CV for IPMA, rank-based methods (RankD and RankI) and the alignment-based method (ALIGN) for MHC class II binding peptides.

Users can use POPI-MHC2 by entering either a sequence or a file of sequences of MHC binding peptides. The predicted immunogenicity levels are shown on the web page. POPI-MHC2 is publicly available at http://iclab.life.nctu.edu.tw/POPI.

5.5 Analysis of informative physicochemical properties

After the informative physicochemical properties have been identified, it is desirable to analyze and interpret the obtained knowledge. Revealing the individual effects of the identified physicochemical properties on the immunogenicity of MHC class II-restricted peptides is important for immunologists who wish to investigate immunogenic problems further. Factor analysis of the orthogonal experimental design used in IPMA can efficiently estimate the effect of an individual feature by evaluating its main effect difference (MED). The property with the largest MED value is the most effective property.
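As a rough illustration of the MED calculation, the sketch below applies a common definition for two-level orthogonal arrays: the main effect of a factor at a level is the sum of the responses of the experiments run at that level, and MED is the absolute difference between the two levels. This definition, the L4 array and the response values are assumptions made for illustration; the formulation actually used by IPMA is the one given in Chapter 3.

```python
# Standard L4(2^3) orthogonal array: 4 experiments, 3 two-level factors.
L4 = [
    (1, 1, 1),
    (1, 2, 2),
    (2, 1, 2),
    (2, 2, 1),
]


def main_effect_differences(array, responses):
    """Return |S_j1 - S_j2| for every factor j.

    S_jk is the sum of the responses of the experiments in which
    factor j is set to level k.
    """
    meds = []
    for j in range(len(array[0])):
        s1 = sum(y for row, y in zip(array, responses) if row[j] == 1)
        s2 = sum(y for row, y in zip(array, responses) if row[j] == 2)
        meds.append(abs(s1 - s2))
    return meds


if __name__ == "__main__":
    # Hypothetical responses (e.g. accuracies in %) of the four experiments.
    responses = [60, 62, 70, 72]
    print(main_effect_differences(L4, responses))
```

In this toy case the first factor has by far the largest MED, so it would be ranked as the most effective one.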

Because IPMA is a non-deterministic algorithm and the SVM parameter values slightly affect prediction accuracy, the feature sets with the highest accuracy obtained from multiple independent runs may not be the same. In order to obtain a robust feature set, 60 independent runs of IPMA were performed to identify informative physicochemical properties. The largest, mean and smallest numbers m of selected features are 45, 28.63 and 12, respectively. The highest, mean and lowest AA values in the training phase are 76.84%, 73.64% and 69.68%, respectively. The statistical results reveal that a small set of effective properties is relatively stable across the runs of IPMA.

Table 5.6 and Table 5.7 show the typical feature sets with MED values, chosen by considering both training accuracy and selection frequency, for MHC class I and II binding peptides, respectively. For the CTL immune response, the property with AAindex identity GEIM800103 is the most effective, with MED=33.29; it corresponds to 'Alpha-helix indices for beta-proteins' [85]. The least effective property is MIYS850101, with MED=0.80, which corresponds to 'Effective partition energy' [86]. For the HTL immune response, the AAindex identity KUHL950101 (denoting 'Hydrophilicity scale') is the most effective property, with MED=46.06 [87]. The AAindex identity DESM900102, denoting 'Average membrane preference: AMP07', has the smallest MED value of 4.11 [88].

Table 5.6 Individual effects of identified properties for CTL responses in terms of main effect difference (MED).

ID of AAindex  Description                                          MED    Class
GEIM800103     Alpha-helix indices for beta-proteins                33.29  S
OOBM770104     Average non-bonded energy per residue                31.97  O
PALJ810115     Normalized frequency of turn in alpha+beta class     24.91  S
QIAN880132     Weights for coil at the window position of -1        23.90  S
OOBM850102     Optimized propensity to form reverse turn            17.09  S
NADH010106     Hydropathy scale based on self-information values in the two-state model (36% accessibility)  14.79  H
RADA880106     Accessible surface area                              11.64  V
QIAN880112     Weights for alpha-helix at the window position of 5  10.71  S
WEBA780101     RF value in high salt chromatography                 10.65  O
QIAN880125     Weights for beta-sheet at the window position of 5   10.63  S
JOND750101     Hydrophobicity                                       9.27   H
QIAN880124     Weights for beta-sheet at the window position of 4   9.06   S
MUNV940101     Free energy in alpha-helical conformation            7.44   S
HUTJ700102     Absolute entropy                                     6.62   V
MITS020101     Amphiphilicity index                                 5.10   H
KARP850103     Flexibility parameter for two rigid neighbors        4.63   O
FAUJ880113     pK-a(RCOOH)                                          4.37   S
ISOY800106     Normalized relative frequency of helix end           4.31   S
RACS820113     Value of theta(i)                                    3.25   S
GEOR030105     Linker propensity from small dataset (linker length is less than six residues)  3.05   S
QIAN880114     Weights for beta-sheet at the window position of -6  2.99   S
DIGM050101     Hydrostatic pressure asymmetry index, PAI            1.60   O
MIYS850101     Effective partition energy                           0.80   H

H: hydrophobicity; S: structure; V: volume; O: others

Table 5.7 Individual effects of identified properties for HTL responses in terms of main effect difference (MED).

ID of AAindex  Description                                                  MED    Class
KUHL950101     Hydrophilicity scale                                         46.06  H
WERD780103     Free energy change of alpha(Ri) to alpha(Rh)                 37.10  O
KHAG800101     The Kerr-constant increments                                 32.78  O
VHEG790101     Transfer free energy to lipophilic phase                     31.92  H
BIOV880102     Information value for accessibility; average fraction 23%    31.20  H
ENGD860101     Hydrophobicity index                                         27.79  H
WOLR810101     Hydration potential                                          26.18  H
JOND750102     pK (-COOH)                                                   25.03  H
GEIM800109     Aperiodic indices for alpha-proteins                         23.66  O
AURR980103     Normalized positional residue frequency at helix termini N"  22.46  S
ROBB760111     Information measure for C-terminal turn                      16.96  S
YUTK870104     Activation Gibbs energy of unfolding, pH9.0                  15.93  O
PALJ810113     Normalized frequency of turn in all-alpha class              15.36  S
RACS820114     Value of theta(i-1)                                          14.21  S
MAXF760104     Normalized frequency of left-handed alpha-helix              12.83  S
KUMS000103     Distribution of amino acid residues in the alpha-helices in thermophilic proteins  11.13  S

H: hydrophobicity; S: structure; V: volume; O: others

5.6 Comparison of physicochemical properties responsible for CTL and HTL responses

It is interesting to examine the similarities and differences between the two property sets responsible for HTL and CTL responses. To analyze the composition of the informative physicochemical properties, the properties of each set are categorized into four classes: hydrophobicity, structure, volume and others. Properties whose annotations contain obvious hydrophobicity-, secondary structure- or volume-related words are categorized first. For each uncategorized property, its correlation coefficients (CCs) with the categorized properties are then measured, and the uncategorized property is assigned the class of a categorized property with which its CC value is larger than or equal to 0.85.
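The correlation-based assignment can be sketched as follows. This is a minimal illustration under several assumptions not stated in the text: the CC is taken to be the Pearson correlation coefficient, the best-correlated categorized property decides the class, a property with no CC of at least 0.85 falls into 'others', and the short value vectors are invented (real AAindex entries have one value per amino acid).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def assign_class(values, categorized, threshold=0.85):
    """Inherit the class of the best-correlated categorized property.

    `categorized` is a list of (values, class_name) pairs. Falling back to
    "others" below the threshold is an assumption of this sketch.
    """
    best_cc, best_class = max((pearson(values, ref), cls) for ref, cls in categorized)
    return best_class if best_cc >= threshold else "others"


if __name__ == "__main__":
    # Invented value vectors, for illustration only.
    categorized = [
        ([1, 2, 3, 4, 5], "hydrophobicity"),
        ([5, 1, 4, 2, 3], "structure"),
    ]
    print(assign_class([2, 4, 6, 8, 11], categorized))
    print(assign_class([1, 3, 1, 3, 1], categorized))
```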

Figure 5.3 shows pie-chart representations of the property compositions in terms of the four classes for CTL and HTL responses. As expected, hydrophobicity-related properties play an important role in both HTL (43%) and CTL (17%) immune responses, which is consistent with our knowledge that hydrophobicity is important for biomolecular recognition [88, 89]. Recent studies [90, 91] have reported the importance of antigen structures in influencing T-cell dominance. Consistent with this, structure propensity-related properties account for a large proportion for both HTL (33%) and CTL (57%) immune responses (Figure 5.3).

In both cases, the hydrophobicity- and structure-related properties together account for a large proportion (close to 75%) of all properties. The major difference is that the classes with the largest proportion for HTL (43%) and CTL (57%) responses are hydrophobicity and structure, respectively. In other words, hydrophobicity-related properties are more important for HTL responses than for CTL responses, whereas structure-related properties are more important for CTL responses than for HTL responses.
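The CTL percentages quoted above can be recomputed from the Class column of Table 5.6. The sketch below simply tallies the class letters of the 23 listed properties in table order; the function name is hypothetical.

```python
from collections import Counter

# Class column of Table 5.6 (CTL responses), in table order.
# H: hydrophobicity; S: structure; V: volume; O: others
CTL_CLASSES = "S O S S S H V S O S H S S V H O S S S S S O H".split()


def composition(classes):
    """Return the share of each class as a percentage rounded to an integer."""
    counts = Counter(classes)
    total = len(classes)
    return {cls: round(100 * count / total) for cls, count in counts.items()}


if __name__ == "__main__":
    for cls, share in sorted(composition(CTL_CLASSES).items()):
        print(f"{cls}: {share}%")
```

With 13 structure-related and 4 hydrophobicity-related properties out of 23, the two classes together cover 17 of the 23 properties, in line with the proportion of close to 75% noted above.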

The great importance of structure- and hydrophobicity-related properties for CTL and HTL responses, respectively, can also be observed in the MED-based analysis that ranks the individual effects of the informative physicochemical properties. For CTL responses, the most effective property, with AAindex identity GEIM800103 and MED=33.29, is 'Alpha-helix indices for beta-proteins' [85] (Table 5.6). In contrast, the property with AAindex identity KUHL950101, denoting 'Hydrophilicity scale' [87], is the most effective property for HTL responses, with MED=46.06 (Table 5.7).

From the perspective of similarity, the CTL response-related property with AAindex identity MIYS850101, denoting 'Effective partition energy', correlates highly with two HTL response-related properties, with AAindex identities BIOV880102 and DESM900102, denoting 'Information value for accessibility; average fraction 23%' and 'Average membrane preference: AMP07', with CC values of 0.93 and 0.83, respectively. All three are hydrophobicity-related properties. Both volume-related properties for CTL responses, with AAindex identities RADA880106 and HUTJ700102, denoting 'Accessible surface area' and 'Absolute entropy', respectively, correlate highly with the volume-related property for HTL responses, with AAindex identity CHOC750101, denoting 'Average volume of buried residue' (CC=0.87 and 0.80, respectively). The structure-related properties RACS820114 and MUNV940101 for HTL and CTL responses, denoting 'Value of theta(i-1)' and 'Free energy in alpha-helical conformation', respectively, also show high correlation, with CC=0.83. Altogether, the informative physicochemical properties for CTL and HTL responses share a few similar properties in all three major classes, that is, all classes except 'others'.

Figure 5.3 Pie-chart representations of compositions of categorized physicochemical properties of peptides responsible for CTL and HTL responses.

5.7 Peptides capable of inducing both CTL and HTL responses

An epitope capable of inducing both CTL and HTL responses is considered as a
