Organization - Predicting Adaptive T-Cell Immune Responses

Chapter 1 Introduction

1.3 Organization

In summary, this dissertation focuses on predicting adaptive T-cell immune responses for computer-aided vaccine design. To solve the optimization problem of mining informative physicochemical properties for peptide immunogenicity, efficient evolutionary algorithms are proposed and used to develop an efficient vaccine design system.

The rest of this dissertation is organized as follows. Chapter 2 reviews related work. The proposed algorithm for mining informative physicochemical properties is presented in Chapter 3. Chapter 4 presents the prediction system for ubiquitylation sites. Chapter 5 describes the proposed systems for predicting the immunogenicity of MHC class I and class II binding peptides. The improved prediction of peptide immunogenicity using string kernels is presented in Chapter 6. Finally, conclusions are given in Chapter 7.

Chapter 2 Related Work

This chapter reviews related work on predicting adaptive T-cell immune responses, including the prediction of ubiquitylation sites, immunogenicity prediction of MHC class I binding peptides, and immunogenicity prediction of MHC class II binding peptides.

2.1 Highly Ubiquitylated Proteins as Antigen Sources

The ubiquitin-proteasome system is an important mechanism for protein degradation in which ubiquitylated proteins are degraded by the proteasome. Ubiquitin acts as a specific tag that marks proteins for degradation. The proteasome is a major mechanism by which cells regulate the concentration of particular proteins and degrade misfolded proteins. The degradation process produces short peptides of about 7 to 8 amino acids. The resulting short peptides can be further degraded into amino acids that can be used in protein synthesis [2, 3].

The proteasome plays an important role in the function of the adaptive immune system. The peptide antigens presented on the surface of antigen-presenting cells are produced by proteasomal degradation of pathogen proteins and displayed by MHC class I molecules [4]. A previous study investigated the role of the ubiquitin-dependent proteolytic pathway in MHC class I-restricted antigen presentation and concluded that ubiquitin conjugation (also called ubiquitylation) plays an important role in the presentation of a cytosolic antigen with MHC [5]. Another study found that an amino-terminal modification of a viral protein promotes ubiquitin-dependent degradation and enhances presentation with MHC class I [6].

Some recent studies reported similar results: ubiquitin conjugation enhances the efficacy of polynucleotide viral vaccines [7] and of vaccines against tuberculosis [8]. Another study claimed that the low frequency of memory cytotoxic T lymphocytes and the inefficient antiviral protection of DNA immunization with minigenes can be rectified by ubiquitylation [9]. Therefore, accurate prediction of ubiquitylation sites can provide a better understanding of the ubiquitylation mechanism, and the selection of highly ubiquitylated peptides can improve the effectiveness of vaccines. In Chapter 4, three kinds of features and three classifiers are assessed for their prediction performance. Subsequently, the informative physicochemical property mining algorithm is applied to select informative physicochemical properties and improve the prediction performance. Finally, a prediction system, UbiPred, is constructed to predict ubiquitylation sites.

2.2 Immunogenic Pathway of MHC class I

Developing a computer-aided system to design peptide vaccines is one goal of immunoinformatics. The major work of previous studies on peptide vaccine design is to identify cytotoxic T lymphocyte (CTL) epitopes and investigate their corresponding immunogenicity. CTLs play a critical role in protective immunity by recognizing and eliminating self-altered cells; they recognize short peptides derived from the intracellular degradation of foreign proteins in combination with major histocompatibility complex (MHC) class I molecules. The immunogenicity of MHC class I binding peptides is their ability to induce CTL responses. Accurate predictions of CTL epitopes and their corresponding immunogenicity are critical in developing a computer-aided system for vaccine design.

The direct approach of predicting CTL epitopes was studied initially, but its accuracy is fairly low [10]. Instead, the indirect approach of predicting MHC-binding peptides is useful because peptides must be processed before they can induce cellular immunogenicity. Recent bioinformatics studies have used information about the antigen processing pathway to predict CTL epitopes. First, the peptides are cleaved by the proteasome. Several studies elucidating the specificity of the proteasome have been presented. To predict proteasomal cleavage sites, NetChop uses a neural network method [11] and Pcleavage is based on a support vector machine (SVM) learning model [12].

After cleavage, peptide fragments are transported into the endoplasmic reticulum by TAP, the transporter associated with antigen processing. Studies of TAP transport efficiency include the affinity prediction of TAP binding peptides using a cascade SVM [13] and the prediction of the TAP transport efficiency of epitope precursors using a simple scoring matrix [14]. Finally, the peptide fragments bound to MHC class I molecules are translocated to the cell surface, where these complexes may activate CTLs. Several methods have been developed to predict MHC class I binding affinity, such as the SVM-based SVMHC [15] and a Gibbs sampling method [16]. Moreover, hybrid approaches integrate the above-mentioned methods, namely the prediction of proteasomal cleavage, TAP transport efficiency and MHC binding, to improve prediction performance [17, 18].

Predicting the immunogenicity of MHC class I binding peptides is crucial for further identifying highly immunogenic peptides. Selecting highly immunogenic peptides can save considerable experimental effort and accelerate development. In Chapter 5, a prediction system, POPI, is developed to predict the immunogenicity of MHC class I binding peptides. POPI performs better than alignment-based and affinity-based methods.

In Chapter 6, an improved prediction system, POPISK, is constructed to predict T-cell responses induced by HLA-A2-restricted peptides. POPISK, which uses string kernels, is useful for predicting peptide immunogenicity and the immunogenicity changes caused by single-residue modifications, which is especially useful for optimizing peptide-based vaccines.

2.3 Immunogenic Pathway of MHC class II

The immunogenic pathway of MHC class II includes four steps. First, antigens are engulfed by endocytosis, forming an endosome. Second, the endosome fuses with a lysosome and the antigens are cleaved by peptidases in the lysosome. Third, the peptide fragments bound to MHC class II are translocated to the cell surface. Finally, immune responses (also called immunogenicity) are triggered when helper T lymphocytes (HTLs) recognize non-self antigens presented by antigen-presenting cells (APCs). The activated HTLs induce resting HTLs to proliferate and differentiate into memory cells or effector cells, and provide specific help to CTLs, B lymphocytes and phagocytic cells [19, 20].

Previous studies of the immunogenic pathway of MHC class II focus on predicting MHC class II-restricted peptides (qualitative methods) and the binding affinity of peptide-MHC complexes (quantitative methods). Many methods have been proposed to predict MHC class II binding peptides. Evolutionary approaches, including ant colony algorithms [21], evolutionary algorithms combined with artificial neural networks [22] and multi-objective evolutionary algorithms [23], have been developed to optimize a matrix for predicting binding affinity. Other methods include neural network based methods [22, 24, 25], Bayesian neural networks [26], fuzzy neural networks [27], the hidden Markov model [28], Gibbs samplers [16], support vector machines [29-31] and SMM-align, a stabilization matrix alignment method for predicting MHC class II binding affinity [32].

However, predicting the immunogenicity of MHC class II binding peptides is also important for understanding immunogenicity and designing effective vaccines. In Chapter 5, a prediction system, POPI-MHC2, based on informative physicochemical properties is developed to predict the immunogenicity of MHC class II binding peptides. The informative physicochemical properties are mined using the informative physicochemical property mining algorithm described in Chapter 3. As with POPI, the results show that the traditional affinity-based and alignment-based methods are less effective than the proposed method POPI-MHC2.

Chapter 3 Informative Physicochemical Property Mining Algorithm

To mine informative physicochemical properties from experimental data, a genetic algorithm based method is proposed that simultaneously determines an optimal subset of physicochemical properties and designs a support vector machine classifier.

3.1 Physicochemical properties

Physicochemical properties of amino acids have been extensively and successfully used in sequence-based prediction methods [33-37]. A total of 544 physicochemical properties of amino acids were extracted from the amino acid index database (AAindex) version 9.0, a collection of published amino acid indices representing different physicochemical and biological properties of amino acids [38, 39]. Each physicochemical property consists of a set of 20 numerical values, one for each amino acid. Properties having the value 'NA' in their value set were discarded. Finally, 531 properties were used in the following mining method.

To encode an input vector from peptide sequences for machine learning classifiers, a two-step method is used. The first step determines, for each of the 531 properties, the index values of all amino acids in the peptide. A peptide of length l thus has 531 l-dimensional vectors, defined as:

$$D_t = (d_{t1}, d_{t2}, \ldots, d_{tl}), \quad t = 1, \ldots, 531,$$

where t denotes the t-th physicochemical property. In the second step, a vector V of 531 mean values is obtained by averaging the l attributes in each vector D_t, defined as follows:

$$V = (v_1, v_2, \ldots, v_{531}), \quad v_t = \frac{1}{l} \sum_{i=1}^{l} d_{ti}.$$
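The two-step encoding can be sketched in a few lines of Python. This is a minimal illustration rather than the dissertation's implementation: the table `TOY_INDEX`, its two property names and all numerical values are hypothetical placeholders, not real AAindex entries, and only two properties are used instead of 531.

```python
# Hypothetical index values for two properties; real values would come from AAindex.
TOY_INDEX = {
    "volume": {"L": 4.0, "A": 1.0},
    "weight": {"L": 131.0, "A": 89.0},
}

def encode_peptide(peptide, index_table):
    """Return one mean value per property (the vector V)."""
    features = []
    for prop_name in sorted(index_table):
        values = index_table[prop_name]
        d_t = [values[aa] for aa in peptide]    # step 1: vector D_t of length l
        features.append(sum(d_t) / len(d_t))    # step 2: v_t is the mean of D_t
    return features

v = encode_peptide("LAL", TOY_INDEX)
print(v)
```

Because each D_t is averaged over the peptide length, peptides of different lengths all map to feature vectors of the same dimension.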

Support vector machines (SVMs) are powerful tools in the field of machine learning. SVMs cope well with the over-fitting problem arising from a small training dataset by finding a linear separating hyperplane that maximizes the distance between two classes. SVMs can efficiently deal with classification, prediction, and regression problems. Given training vectors x_i ∈ R^n and their class values y_i, the SVM is trained with slack variables that allow for some misclassifications. The cost parameter C > 0 controls the trade-off between the margin and the training error. Larger values of C lead to a higher error penalty.
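For reference, a standard soft-margin SVM formulation that is consistent with the slack variables and the cost parameter C described above is sketched below. The symbols w, b, ξ_i, φ and N, as well as the label set {−1, +1}, are assumptions of this sketch and are not notation taken from this dissertation.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \quad
y_i \left( w^{T} \phi(x_i) + b \right) \ge 1 - \xi_i, \quad
\xi_i \ge 0, \quad y_i \in \{-1, +1\}, \quad i = 1, \ldots, N.
```

In this form, a larger C puts more weight on the sum of slack variables, which matches the statement that larger values of C lead to a higher error penalty.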

To make linear separation of the samples easier, the SVM uses one of various kernel functions to transform the samples into a high-dimensional space. In this work, the commonly used radial basis function is applied to nonlinearly transform the feature space, defined as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \| x_i - x_j \|^2\right), \quad \gamma > 0. \qquad (3\text{-}3)$$

The kernel parameter γ determines how the samples are transformed into a high-dimensional search space. These two parameters C and γ must be tuned to get the best prediction performance.
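As a small illustration of the radial basis function in (3-3), the sketch below evaluates the kernel for two pairs of toy vectors. The function name `rbf_kernel` and the sample values are chosen here for illustration only.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2) with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)   # identical vectors
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)    # squared distance 25
print(k_same, k_far)
```

Identical samples give a kernel value of 1, and the value decays toward 0 as the distance grows; a larger γ makes this decay faster.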

For multi-class classification problems, the 'one-against-one' strategy is applied to transform the multi-class problem into several binary classification problems. Given h classes, h(h−1)/2 classifiers are constructed, each trained on the samples from two classes. A voting strategy is applied to give the final prediction for test samples. In this study, the SVM implementation is taken from the LIBSVM package version 2.81 [40].
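The one-against-one voting scheme can be sketched as follows. This is an illustration under stated assumptions: `toy_binary_predict` is a hypothetical stand-in for a trained binary SVM, the class names and centres are invented, and ties are broken by class order, which the text does not specify.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(classes, binary_predict, sample):
    """Collect one vote per pair of classes and return the class with most votes."""
    votes = Counter()
    for a, b in combinations(classes, 2):          # h(h-1)/2 pairs
        votes[binary_predict(a, b, sample)] += 1
    # Ties are broken by the order of `classes` (an assumption of this sketch).
    return max(classes, key=lambda c: votes[c]), votes

def toy_binary_predict(a, b, sample):
    """Hypothetical binary classifier: choose the class whose centre is closer."""
    centres = {"low": 0.0, "mid": 5.0, "high": 10.0}
    return a if abs(sample - centres[a]) <= abs(sample - centres[b]) else b

classes = ["low", "mid", "high"]
label, votes = one_against_one_predict(classes, toy_binary_predict, 6.0)
n_classifiers = len(list(combinations(classes, 2)))
print(label, dict(votes), n_classifiers)
```

With h = 3 classes the sketch evaluates 3 pairwise classifiers, in agreement with h(h−1)/2.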

3.3 Orthogonal experimental design

Statistical design of experiments is a process of planning experiments. Orthogonal experimental design with an orthogonal array and factor analysis is an efficient method for analyzing the effects of several factors simultaneously [41, 42]. The factors are the parameters that affect the response variables, and a discriminative value of a factor is regarded as a level of the factor. A “complete factorial” experiment would make measurements at every possible level combination. However, the number of level combinations is often so large that this is impractical, and a subset of level combinations must be judiciously selected, resulting in a “fractional factorial” experiment. Orthogonal experimental design uses the properties of fractional factorial experiments to efficiently determine the best combination of factor levels to use in design problems.

An orthogonal array is a fractional factorial array that assures a balanced comparison of the levels of any factor. An orthogonal array can reduce the number of level combinations needed for factor analysis. Each row of an orthogonal array represents the levels of the factors in one combination, and each column represents a specific factor whose level can change from one combination to another. The “main effect” of a factor designates the effect on the response variables that can be traced to that design parameter; it does not interfere with the estimation of the main effect of another factor. After proper tabulation of the experimental results, the summarized data are analyzed using factor analysis to determine the relative level effects of the factors.

Factor analysis can evaluate the effects of individual factors on the evaluation function, rank the most effective factors, and determine the best level for each factor such that the evaluation function is optimized. Table 3.1 shows an illustrative example of orthogonal experimental design using a two-level orthogonal array L_M(2^{M-1}) with M rows and M-1 columns. In this example with M = 8, there are 7 factors, each corresponding to a physicochemical property, and the two levels of a factor correspond to the exclusion and inclusion of the feature in the proposed feature selection. Let f_t denote the function value (the prediction accuracy of 10-fold cross-validation, 10-CV, in this study) of combination t. Define the main effect of factor j with level k as S_jk, where j = 1, …, M-1 and k = 1, 2:

$$S_{jk} = \sum_{t=1}^{M} f_t \cdot F_t, \qquad (3\text{-}4)$$

where F_t = 1 if the level of factor j of combination t is k; otherwise, F_t = 0. Since the objective function is to be maximized, level 1 of factor j makes a better contribution to the function than level 2 does when S_j1 > S_j2. The main effect reveals the individual effect of a factor. After the better of the two levels of each factor is determined, a good combination consisting of all factors with their better levels can be easily reasoned [43].
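The factor analysis in (3-4) can be sketched as below. The array construction (parity of shared bits) is one common way to build a two-level L_8(2^7) array and is an assumption here, since the rows of Table 3.1 are not reproduced in this text; the accuracy values in `f` are hypothetical and serve only as illustration.

```python
def orthogonal_array_l8():
    """Build a two-level L8(2^7) array: 8 rows, 7 columns, levels 1 and 2."""
    rows = []
    for t in range(8):
        # Column j holds the parity of the bits shared by j and t, which makes
        # every pair of columns balanced over the four level combinations.
        rows.append([bin(j & t).count("1") % 2 + 1 for j in range(1, 8)])
    return rows

def main_effects(array, f):
    """effects[j][k] = sum of f_t over the combinations t where factor j has level k."""
    effects = [{1: 0.0, 2: 0.0} for _ in range(len(array[0]))]
    for row, f_t in zip(array, f):
        for j, level in enumerate(row):
            effects[j][level] += f_t
    return effects

array = orthogonal_array_l8()
# Hypothetical 10-CV accuracies (%) of the 8 combinations, for illustration only.
f = [60.0, 62.0, 70.0, 71.0, 64.0, 66.0, 75.0, 78.0]
effects = main_effects(array, f)
best_levels = [1 if s[1] > s[2] else 2 for s in effects]   # reasoned combination
med = [abs(s[1] - s[2]) for s in effects]                  # main effect differences
most_effective = med.index(max(med)) + 1                   # 1-based factor index
print(best_levels, med, most_effective)
```

Only 8 of the 128 possible combinations are evaluated, yet each factor's two levels are compared on equal footing because every level appears in exactly 4 rows.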

The Rank in Table 3.1 shows the rank of combination t among all 128 (= 2^7) possible combinations. In this example, the reasoned combination achieves the best accuracy, with Rank 1. Notably, the reasoned combination is not guaranteed to be the best one in general. The most effective factor j has the largest main effect difference MED = |S_j1 − S_j2|. The 6th factor, which has the largest main effect difference of 36.3, is the most effective factor.

Table 3.1 An illustrative example of orthogonal array L_8(2^7) and factor analysis. (Column headings: t, Factors, Accuracy (%).)

3.4 Inheritable bi-objective genetic algorithm

Selecting a minimal number of informative features while maximizing prediction accuracy is a bi-objective 0/1 combinatorial optimization problem. An efficient inheritable bi-objective genetic algorithm (IBCGA) [43] is used to solve this optimization problem. IBCGA consists of an intelligent genetic algorithm [44] with an inheritable mechanism. The intelligent genetic algorithm uses a divide-and-conquer strategy and an orthogonal array crossover to efficiently solve large-scale parameter optimization problems. In this study, the intelligent genetic algorithm can efficiently explore and exploit the search space of C(n, r), that is, the selection of r out of n features. IBCGA can efficiently search the space of C(n, r±1) by inheriting a good solution from the space of C(n, r) [43]. Therefore, IBCGA can economically obtain a complete set of high-quality solutions in a single run, where r is specified in a range of interest such as [5, 45].

The proposed chromosome encoding scheme of IBCGA consists of binary genes for feature selection and parametric genes for tuning the SVM parameters. Because gene and chromosome are also commonly used terms in genetic algorithms (GAs), they are named GA-gene and GA-chromosome here for distinction. The GA-chromosome consists of n = 531 binary GA-genes b_i for selecting informative properties and two 4-bit GA-genes for tuning the SVM parameters C and γ. If b_i = 0, the i-th property is excluded from the SVM classifier; otherwise, the i-th property is included. This encoding maps the 16 values of each of γ and C onto {2^-7, 2^-6, …, 2^8}. Figure 3.1 shows the encoding scheme of the GA-chromosome and the process of constructing feature vectors for fitness function evaluation using a concise example.
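The chromosome layout described above can be decoded as in the following sketch. Several details are assumptions made for illustration: the gene order (binary genes, then C, then γ), the bit order within a 4-bit gene, and the mapping of code 0 to 2^-7 are not stated in the text, and the toy chromosome uses n = 6 instead of 531.

```python
def decode_parameter(bits):
    """Map a 4-bit GA-gene such as [1, 0, 0, 1] to a power of two in 2^-7 ... 2^8."""
    if len(bits) != 4 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected four binary digits")
    code = int("".join(str(b) for b in bits), 2)   # integer 0 ... 15
    return 2.0 ** (code - 7)                       # assumed mapping: code 0 -> 2^-7

def decode_chromosome(chromosome, n_features):
    """Split a GA-chromosome into selected feature indices, C and gamma."""
    selected = [i for i in range(n_features) if chromosome[i] == 1]
    c = decode_parameter(chromosome[n_features:n_features + 4])
    gamma = decode_parameter(chromosome[n_features + 4:n_features + 8])
    return selected, c, gamma

# Toy chromosome: 6 binary GA-genes, then the 4-bit genes for C and gamma.
toy = [0, 1, 0, 0, 1, 0] + [1, 0, 0, 0] + [0, 1, 0, 1]
selected, c, gamma = decode_chromosome(toy, n_features=6)
print(selected, c, gamma)
```

Encoding the exponents rather than the raw values lets 4 bits cover a range of C and γ that spans more than four orders of magnitude.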

The feature vector for training the SVM classifier is obtained by decoding a GA-chromosome in the following steps. Consider a given peptide sequence, e.g., LAL. First, the index vectors of all selected physicochemical properties (residue volume and molecular weight in this example) are constructed from AAindex for each amino acid. The feature vector of a peptide consists of the selected features, whose values are obtained by averaging the values in their corresponding index vectors. Finally, all values of the feature vectors are normalized into [-1, 1] before applying the SVM.
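The final normalization step can be sketched as follows. The text does not state which normalization is used, so linear min-max scaling per feature is assumed here; the input rows are hypothetical feature vectors of three peptides.

```python
def scale_features(vectors):
    """Linearly scale each feature column to [-1, 1] (assumed min-max scaling)."""
    n_features = len(vectors[0])
    lows = [min(v[j] for v in vectors) for j in range(n_features)]
    highs = [max(v[j] for v in vectors) for j in range(n_features)]
    scaled = []
    for v in vectors:
        row = []
        for j, value in enumerate(v):
            span = highs[j] - lows[j]
            # A constant column carries no information and is mapped to 0.
            row.append(0.0 if span == 0 else 2.0 * (value - lows[j]) / span - 1.0)
        scaled.append(row)
    return scaled

# Hypothetical feature vectors (mean property values) of three peptides.
raw = [[3.0, 117.0], [1.0, 89.0], [2.0, 103.0]]
scaled = scale_features(raw)
print(scaled)
```

Scaling matters for the radial basis function kernel because features with large numeric ranges would otherwise dominate the distance between samples.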

The fitness function is the only guide for IBCGA to obtain desirable solutions. To avoid prediction bias toward some classes, the averaged accuracy (AA) of all classes, defined in (3-8), is adopted as the fitness function. The performance of the selected properties, together with the associated SVM parameter values, is measured by 10-CV. Therefore, the fitness value of a GA-chromosome is obtained by computing the mean accuracy of the 10 runs.

Figure 3.1 An illustrative example of fitness function evaluation by decoding a GA-chromosome.

IBCGA with the fitness function f(X) can simultaneously obtain a set of solutions X_r, where r = r_start, r_start+1, …, r_end, in a single run. The algorithm of IBCGA with the given values r_start and r_end is described as follows:

Step 1) (Initiation) Randomly generate an initial population of N_pop individuals. In each individual, the n binary GA-genes have r 1's and n-r 0's, where r = r_start.

Step 2) (Evaluation) Evaluate the fitness values of all individuals using f(X).

Step 3) (Selection) Use the traditional tournament selection, which selects the winner from two randomly selected individuals, to form a mating pool.

Step 4) (Crossover) Select p_c·N_pop parents from the mating pool and perform orthogonal array crossover on the selected pairs of parents, where p_c is the crossover probability.

Step 5) (Mutation) Apply the swap mutation operator to p_m·N_pop randomly selected individuals in the new population, where p_m is the mutation probability. To prevent the best fitness value from deteriorating, mutation is not applied to the best individual.

Step 6) (Termination test) If the stopping condition for obtaining the solution X_r is satisfied, output the best individual as X_r. Otherwise, go to Step 2).

Step 7) (Inheritance) If r < r_end, randomly change one bit in the binary GA-genes of each individual from 0 to 1, increase r by one, and go to Step 2). Otherwise, stop the algorithm.
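Three of the operators in the steps above can be sketched as follows. This is not a full IBCGA implementation: the orthogonal array crossover of Step 4 is omitted, and the reading of swap mutation as exchanging one 1 with one 0 (so that the number r of selected features is unchanged) is an assumption of this sketch. All function names are chosen here for illustration.

```python
import random

def tournament(population, fitness, rng):
    """Step 3: return the fitter of two randomly selected individuals."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def swap_mutation(bits, rng):
    """Step 5 (assumed form): swap one random 1 with one random 0, keeping r fixed."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    zeros = [i for i, b in enumerate(bits) if b == 0]
    child = list(bits)
    if ones and zeros:
        child[rng.choice(ones)] = 0
        child[rng.choice(zeros)] = 1
    return child

def inherit(bits, rng):
    """Step 7: change one randomly chosen 0 to 1, so that r becomes r + 1."""
    zeros = [i for i, b in enumerate(bits) if b == 0]
    child = list(bits)
    if zeros:
        child[rng.choice(zeros)] = 1
    return child

rng = random.Random(0)
individual = [1, 0, 0, 1, 0, 0, 1, 0]        # toy example with n = 8, r = 3
mutated = swap_mutation(individual, rng)
inherited = inherit(individual, rng)
winner = tournament([[1, 0], [0, 0], [1, 1]], sum, rng)
print(mutated, inherited, winner)
```

Keeping r fixed during mutation and raising it by exactly one during inheritance means each stage searches a single space C(n, r) while starting from good solutions found at the previous r.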

3.5 Performance evaluations

Four measurements are used to evaluate the developed prediction systems: accuracy (ACC) and Matthews correlation coefficient (MCC) for each class, and overall accuracy (OA) and averaged accuracy (AA) for all classes, defined as follows:

$$ACC_i = \frac{TP_i}{TP_i + FN_i} \times 100\%, \qquad (3\text{-}5)$$

$$MCC_i = \frac{TP_i \times TN_i - FP_i \times FN_i}{\sqrt{(TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i)}}, \qquad (3\text{-}6)$$

$$OA = \frac{\sum_{i=1}^{h} TP_i}{N}, \qquad (3\text{-}7)$$

$$AA = \frac{\sum_{i=1}^{h} ACC_i}{h}, \qquad (3\text{-}8)$$

where i is the index of the class and TP_i, TN_i, FP_i and FN_i are the numbers of true positives, true negatives, false positives and false negatives of class i, respectively. N is the total number of sequences and h is the number of classes.

Chapter 4 Prediction of ubiquitylation sites

Ubiquitylation plays an important role in regulating protein functions. Recently, experimental methods have been developed for the effective identification of ubiquitylation sites. To efficiently explore undiscovered ubiquitylation sites, this study aims to develop an accurate sequence-based prediction method to identify promising ubiquitylation sites.

4.1 Motivation

Ubiquitylation (also called ubiquitination) is an important post-translational modification in which ubiquitin is linked to specific lysine residues of target proteins by forming isopeptide bonds. Three enzymes, namely the activating enzyme (E1), the conjugating enzyme (E2), and the ubiquitin ligase (E3), are involved in the ubiquitylation process. Another enzyme, E4, can help to stabilize and extend the polyubiquitin chain [45, 46]. The first discovered function of ubiquitylation is to target proteins for subsequent degradation by the ATP-dependent ubiquitin-proteasome system. Subsequently, many regulatory functions of ubiquitylation were discovered, including the regulation of DNA repair and transcription, the control of signal transduction, and involvement in endocytosis and sorting [45, 46].

Because of the important regulatory roles of ubiquitylation, numerous methods have been developed to purify ubiquitylated proteins [47]. Also, the growing number of studies on the large-scale identification of ubiquitylated proteins and the analysis of the ubiquitin-related proteome reflects the importance of identifying ubiquitylated proteins and sites [48-53]. Three steps, namely affinity purification, proteolytic digestion, and analysis using mass spectrometry, were applied in most of these studies [54]. This work requires considerable experimental effort. Therefore, developing a prediction system using informative features from protein sequences can not only save experimental effort but also provide insights into the mechanism of ubiquitylation.

4.2 Assessment of features and classifiers

This study focuses on the sequence-based prediction of ubiquitylation sites. Therefore, three kinds of useful features that can be extracted from protein sequences and are widely used in bioinformatics studies are evaluated for the prediction of ubiquitylation sites: conventional amino acid identity [55], evolutionary information [56, 57],

Figure 4.1 The sequence logo of the 151 positive samples with w=21. (a) Information content and (b) frequency plot.
