
Chapter 1 Introduction

1.4. Thesis organization

In this thesis, we study the relationships between genetic and antigenic evolution along three dimensions; the thesis is organized as follows (Fig. 1.1). In Chapter 2, we develop a method for identifying antigenically critical amino acid positions, rules, and co-mutated positions for antigenic variants. The rules describe, based on HA sequences, when one (e.g. circulating) strain will not be recognized by antibodies against another (e.g. vaccine) strain. Co-mutated positions are two positions that mutate simultaneously on HA. We first identify the co-mutated positions and discuss their relation to antigenic drift.

The critical positions are widely distributed on the HA structure; however, antibody recognition of HA is highly correlated with conformational changes in the antigenic sites (epitopes). In Chapter 3, we develop an antigenic-site-based method to identify the antigenic drift of influenza A using the conformational changes in epitopes. We address two issues in this chapter: first, how to quantify the degree of conformational change in a changed epitope; second, what the relationships are between changed epitopes and antigenic drift.

From the previous two dimensions, we observed that some amino acid mutations cause antigenic variants while other mutations have little effect. In addition, we noticed that mutations on epitopes A and B seem more likely to cause antigenic variants. These observations raise the question of whether amino acid positions are antigenically equivalent. In Chapter 4, we developed a Bayesian method to identify the antigenic drift

the accumulated HI assay during the last 40 years, we utilized the likelihood ratio (LR) to quantify the antigenic distance of an amino acid position. We discuss the relationships between the LR values of positions and antigenic drift. Moreover, we developed an index, ADLR, to quantify the antigenic distance of a given pair of HA sequences based on a naïve Bayesian network and LR. We evaluated ADLR for predicting antigenic variants, explaining vaccine-to-vaccine transitions, and selecting WHO vaccine strains on 2,789 circulating strains. Finally, Chapter 5 presents the conclusions and future work.


Figure 1.1 Overview of this thesis for studying the relationships between genetic and antigenic evolution. (A) The vaccine strain and circulating strains. (B) Chapters 2, 3 and 4 of this thesis. (C) The applications of our methods.

Chapter 2 Co-evolution Positions and Rules for Antigenic Variants of Influenza A (H3N2) Viruses

2.1. Introduction

Pathogenic avian and influenza viruses often cause significant damage to human society and the economy [23]. Influenza viruses are divided into subtypes based on differences in the surface proteins HA and NA, which are the main targets of the human immune system. In circulating influenza viruses, gradually accumulated mutations on HA give rise to immunologically distinct strains (termed antigenic variants), which leads to antigenic drift. Antigenic drift often implies that vaccines should be updated to match the dominant epidemic strains [23]. Mapping genetic evolution to the antigenic drift of influenza viruses is one of the key issues in public health. Many methods have been proposed to study antigenic drift and vaccine development [15, 33, 38-40].

Retrospective quantitative analyses of genetic data have revealed important insights into the evolution of influenza viruses [31, 33, 41]. In the current global influenza surveillance system, the ferret serum HI assay is the primary method for defining antigenic variants. Several studies have used statistical models to predict whether a given pair of HA sequences is an antigenic variant, based on known HI assays and their respective HA sequences [15, 40]. Furthermore, Smith et al. demonstrated that antigenic evolution is more punctuated than genetic evolution [15], and that a genetic change sometimes has a disproportionately large antigenic effect. Recently, few studies have discussed the relationship between evolution and co-mutated positions on influenza

2.2. Motivation and aim

The current trivalent vaccine contains seasonal H1N1, H3N2 and influenza B virus strains [23]. Among the influenza viruses, the H3N2 subtype causes higher mortality [43] and evolves more rapidly [44]. In addition, the large amount of genetic and antigenic data available for the H3N2 virus provides a valuable opportunity to understand the relationships between the genetic and antigenic evolution of influenza A viruses.

Here, we propose a method to predict the antigenic variants of A (H3N2) viruses by identifying critical positions and rules that describe when one (e.g. circulating) strain will not be recognized by antibodies against another (e.g. vaccine) strain. Our method can also detect co-mutated positions for predicting antigenic variants. The critical positions and rules were evaluated on two data sets consisting of 181 and 31,878 pairs, respectively. The results demonstrate that our model is biologically meaningful and achieves high prediction accuracy.

Figure 2.1 Overview of our method for predicting the antigenic variants of influenza A (H3N2) viruses.

2.3. Materials and Methods

Figure 2.1 shows the overview of our method for predicting the antigenic variants of influenza A (H3N2) viruses by identifying critical positions, rules and their co-evolution on the HA.

2.3.1. Data sets

We collected an HI assay data set from related work [40]. It contains 181 pairs of HA sequences formed from 45 HA sequences of H3N2 viruses (329 amino acids each) collected from 1971 to 2002. On this data set, we applied the decision tree C4.5 [45] to predict antigenic variants by identifying critical positions and discovering the rules and co-mutated positions. The main samples (65%, 122 pairs among 181 pairs) are pairs of vaccine and circulating strains, and for each pair it is known whether the circulating strain is inhibited by antibodies against the vaccine strain (i.e. whether the pair is labeled "antigenic variants" or "similar viruses"). Vaccine strains are selected by the World Health Organization (WHO) and are often the dominant strains of influenza seasons.

Each pair includes the HI assay value (i.e. antigenic distance) and a string of 329 binary bits obtained by aligning the two HA sequences (329 amino acids). At a given position, the bit is "1" (mutation) if the residue types of the two sequences differ and "0" (no mutation) otherwise. In general, an influenza vaccine should be updated if the HI assay value between the current vaccine strain and the strains expected to circulate in the next season is more than 4.0 [15]. The antigenic distance is defined as the reciprocal of the geometric mean of two ratios between the heterologous and homologous antibody titers [40]. Among the 181 pairs of HA sequences, 125 pairs with antigenic distance ≥ 4 are considered "antigenic variants" and 56 pairs with antigenic distance < 4 are classified as "similar viruses". For example, the antigenic distance of the pair A/Port_Chalmers/1/73 and A/Victoria/3/75 is 16, so this pair is considered "antigenic variants". Conversely, the antigenic distance of the pair A/Wuhan/359/95 and A/Nanchang/933/95 is 1, so this pair is considered "similar viruses".
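The encoding and labeling of a pair described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, the titers and the four-residue fragments in the usage lines are made-up toy values, and the two sequences are assumed to be already aligned and of equal length.

```python
import math

def antigenic_distance(hom_a, hom_b, het_ab, het_ba):
    """Reciprocal of the geometric mean of the two heterologous/homologous
    titer ratios (parameter names are assumptions made for this sketch)."""
    ratio_ab = het_ab / hom_a
    ratio_ba = het_ba / hom_b
    return 1.0 / math.sqrt(ratio_ab * ratio_ba)

def mutation_bits(seq_a, seq_b):
    """Return 1 where the aligned residues differ (mutation), else 0."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [int(a != b) for a, b in zip(seq_a, seq_b)]

def label_pair(distance, threshold=4.0):
    """Label a pair using the distance >= 4 criterion stated in the text."""
    return "antigenic variants" if distance >= threshold else "similar viruses"

# Toy usage with hypothetical titers and sequence fragments.
d = antigenic_distance(hom_a=1280, hom_b=1280, het_ab=80, het_ba=80)
bits = mutation_bits("NKSG", "NNSG")
label = label_pair(d)
```

In the thesis setting the bit string would have 329 entries, one per HA position.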

Furthermore, we prepared another HI assay data set proposed by Smith et al. to independently evaluate our model and compare with other methods for predicting the antigenic variants [15]. This data set consists of 253 H3N2 viruses which are clustered into 11 antigenic

viruses" pair, and a virus pair in different groups is considered an "antigenic variants" pair. In total, we obtained 31,878 HI measurements; the corresponding sequences were extracted from the supporting materials of [15].

2.3.2. Identifying critical positions on HA

In this study, positions with both a high antigenic discriminating score and high genetic diversity are considered critical positions. We first evaluate the genetic diversity of each amino acid position on HA, which is commonly believed to relate to immune selection [33]. Here, Shannon entropy was used to measure the genetic diversity of an amino acid position i (i = 1, ..., 329) with 20 amino acid types and is defined as

$$H(A_i) = -\sum_{T=1}^{20} P(A_i = T)\,\log_2 P(A_i = T) \qquad (1)$$

where P(Ai=T) is the probability of the position i with amino acid type T. The information gain (IG) [45] measures the score of an amino acid position on HA for discriminating between antigenic variants and similar viruses. A high IG at a specific position implies that a mutation at this position is highly correlated with antigenic variants. The IG of position i with respect to the antigenic type Y (i.e. antigenic variants (V) and similar viruses (S)) is defined as

$$IG(i, Y) = H(Y) - H(Y \mid i) \qquad (2)$$

where H(Y| i) is the conditional entropy of Y given the position i. The two states of position i are mutation (M) and non-mutation (N). H(Y| i) is defined as

$$H(Y \mid i) = P(A_i = M)\,H(Y \mid A_i = M) + P(A_i = N)\,H(Y \mid A_i = N)$$

For example, for position 145, the numbers of "mutation" and "non-mutation" pairs are 62 and 119, respectively, among the 181 pair-wise HA sequences in the training data set. For the 62 mutation pairs, the numbers of "antigenic variants" and "similar viruses" are 61 and 1, respectively. For the 119 non-mutation pairs, the numbers of "antigenic variants" and "similar viruses" are 64 and 55, respectively, in agreement with the totals of 125 antigenic variants and 56 similar viruses in the data set. From these data, we calculate that P(A145=M) is 0.34 and H(Y|A145=M) is 0.12 for the mutation state; P(A145=N) is 0.66 and H(Y|A145=N) is 1.0 for the non-mutation state. Finally, we obtain H(Y| i)=0.70. The values of information gain and entropy of the 329 HA positions are normalized to the range from 0 to 1.
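The arithmetic of the position-145 example can be checked with a short Python sketch. It assumes base-2 logarithms, which reproduce the quoted values of 0.12 and 0.70; the function names are hypothetical. Because the entropy of a two-class group does not depend on the order of the two counts, only the group sizes and their split matter here.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a tuple of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def conditional_entropy(groups):
    """Weighted entropy over groups, i.e. H(Y | i) for the states of i."""
    total = sum(sum(g) for g in groups)
    return sum(sum(g) / total * entropy(g) for g in groups)

# Class counts inside each state of position 145 (order within a tuple
# does not change the entropy).
mutated = (61, 1)        # 62 pairs mutated at position 145
unmutated = (55, 64)     # 119 pairs not mutated at position 145

h_mut = entropy(mutated)                             # about 0.12
h_non = entropy(unmutated)                           # about 1.0
h_cond = conditional_entropy([mutated, unmutated])   # about 0.70

# Raw (un-normalized) information gain, using the data set totals of
# 125 antigenic variants and 56 similar viruses for H(Y).
raw_ig = entropy((125, 56)) - h_cond
```

The raw gain is later rescaled to the range 0 to 1 across all positions, which is why position 145 is reported with an IG of 1.0 rather than the raw value.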

2.3.3. Discovering the rules of antigenic variants

After identifying critical positions, we discovered the rules for predicting antigenic variants by applying the decision tree C4.5 [45]. The antigenic amino acid positions are used as the attributes (features). An amino acid position with high IG was selected as an internal node of the tree to discriminate "antigenic variants" from "similar viruses". Given the selected positions and the constructed tree, the rules can be read directly from the paths from the root to the leaves of the tree.

2.3.4. Predicting antigenic variants

To evaluate our model and compare it with other methods [9, 40] for predicting antigenic variants, we collected two data sets. The first data set consists of 181 pair-wise HI measurements, and the second, independent data set contains 31,878 HI measurements proposed by Smith et al. [15]. Wilson & Cox [9] suggested that a drift viral variant of epidemiologic importance usually contains more than 4 residue changes located on at least 2 of the five epitopes on the HA. Lee & Chen [40] proposed a model based on the Hamming distance (HD) over the 131 positions on the five epitopes of HA to predict antigenic variants. Their model predicts a pair of HA sequences as antigenic variants if there are more than 6 amino acid mutations between the two sequences.
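The two baseline criteria, as paraphrased above, can be written as simple predicates. The sketch below is illustrative only: the function names are hypothetical, the epitope map is a partial one assembled from the position-epitope labels that appear in this chapter (the full map has 131 positions), and it assumes that the Wilson & Cox count refers to changes falling inside epitopes.

```python
# Partial, illustrative epitope map taken from labels used in this chapter.
EPITOPE_OF = {
    126: "A", 135: "A", 137: "A", 142: "A", 145: "A",
    156: "B", 158: "B", 160: "B", 164: "B", 186: "B", 189: "B", 193: "B",
    278: "C",
    121: "D", 174: "D", 201: "D", 226: "D",
    63: "E", 78: "E", 260: "E",
}

def wilson_cox_variant(mutated_positions, epitope_of=EPITOPE_OF):
    """More than 4 epitope changes spread over at least 2 epitopes."""
    hits = [p for p in mutated_positions if p in epitope_of]
    return len(hits) > 4 and len({epitope_of[p] for p in hits}) >= 2

def lee_chen_variant(mutated_positions, epitope_of=EPITOPE_OF, cutoff=6):
    """Hamming distance over epitope positions greater than the cutoff."""
    return sum(1 for p in mutated_positions if p in epitope_of) > cutoff

# Toy usage: five changes across epitopes A and B.
pair = {137, 145, 156, 160, 193}
wc = wilson_cox_variant(pair)
lc = lee_chen_variant(pair)
```

Both baselines depend only on how many epitope positions differ, not on which ones, which is the main contrast with the position-specific rules of the decision tree.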

2.3.5. Identifying co-mutated positions for antigenic variants

Here, we used the decision tree hierarchy to identify the co-mutation of two amino acid positions. To identify all potential co-mutated pairs on HA, each of the positions that mutate in the 181 pairs of HA sequences (101 of the 329 positions) is selected in turn to identify its co-mutated positions. Based on these 101 positions, the total number of ordered two-position combinations is 10,100. For each amino acid position i, the co-mutation score S(i,j) between position i and its partner position j is defined as

$$S(i,j) = IG_W(j,Y) - IG_{R_i}(j,Y)$$

where IGW(j,Y) is the IG value of position j derived from the whole data set (i.e. the 181 pairs of HA sequences in the training set) using Equation (2), and IGRi(j,Y) is the IG value of position j derived from the data set R obtained by removing from the whole data set the pairs in which position i is mutated. The z-score of S(i,j) for a pair of co-mutated positions is derived from the 10,100 pairs and is defined as

$$Z(i,j) = \frac{S(i,j) - \mu}{\sigma} \qquad (7)$$

where μ and σ are the mean and standard deviation of S(i,j) over all 10,100 position pairs. For example, position 145 (IG is 1.0) is selected as the first node in the tree. Among the 181 pairs, 62 pairs are mutated at position 145. An amino acid position is considered a co-mutated position of position 145 if its IG value decreases significantly after these 62 pairs are removed from the data set.

For example, the z-score of the S(145, 137) of the pair-positions 145 and 137 is 3.58
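The co-mutation procedure can be illustrated on toy data. The sketch below assumes that the score is the drop in IG of position j after the pairs mutated at position i are removed, which matches the verbal description above. The function names, the three-position toy data set, and the use of the population standard deviation are all assumptions of this sketch.

```python
import math
from statistics import mean, pstdev

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(pairs, j):
    """IG of position j; each pair is (bits, label), label 'V' or 'S'."""
    if not pairs:
        return 0.0
    def class_counts(rows):
        return (sum(1 for _, y in rows if y == "V"),
                sum(1 for _, y in rows if y == "S"))
    groups = [[p for p in pairs if p[0][j] == state] for state in (0, 1)]
    h_cond = sum(len(g) / len(pairs) * entropy(class_counts(g))
                 for g in groups if g)
    return entropy(class_counts(pairs)) - h_cond

def comutation_score(pairs, i, j):
    """Drop in IG of j once the pairs mutated at i are removed."""
    reduced = [p for p in pairs if p[0][i] == 0]
    return info_gain(pairs, j) - info_gain(reduced, j)

def z_scores(scores):
    """Standardize a dict of scores (population standard deviation)."""
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    return {k: (s - mu) / sigma for k, s in scores.items()}

# Toy data: positions 0 and 1 always mutate together; position 2 is noise.
toy = [([1, 1, 0], "V"), ([1, 1, 1], "V"), ([1, 1, 0], "V"),
       ([0, 0, 1], "S"), ([0, 0, 0], "S"), ([0, 0, 1], "V"),
       ([0, 0, 0], "S"), ([0, 0, 1], "S")]
s01 = comutation_score(toy, 0, 1)
s02 = comutation_score(toy, 0, 2)
```

In this toy set the score for the pair (0, 1) is clearly larger than for (0, 2), since position 1 loses all of its discriminating power once the pairs mutated at position 0 are removed.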

2.4. Results

2.4.1. Critical positions on HA

In this study, we used the information gain (IG) and Shannon entropy to score each amino acid position on HA for discriminating between antigenic variants and similar viruses. Both IG and entropy range from 0 to 1. A position with high IG is highly correlated with antigenic variants, and a position with high entropy is often mutated in the data set. Figure 2.2 shows the relationship between the IG values and entropies of HA positions, and a summary of selected amino acid positions is listed in Table 2.1. Of the 329 amino acids of HA, 131 positions are considered to lie in or near the five antibody-combining sites (termed epitopes), which are labeled A through E [9]. The top-ranked position (145-A) is located in epitope A of HA; its IG and entropy are 1.0 and 0.87, respectively. Among the 181 pairs of HA sequences in the training set, position 145-A is mutated in 62 pairs, 61 of which are antigenic variants. This result implies that a mutation at this position is very likely to induce antigenic drift. This observation is consistent with previous results [15]: the single amino acid substitution N145K can be responsible for an antigenic cluster transition. Other positions with high IG values behave similarly.

The relationship between the IG values and entropies of 101 positions in HA is shown in Fig. 2.2; the 228 positions for which both IG and entropy are zero are excluded. All positions can be classified into four groups according to their IG (antigenic degree) and entropy (genetic diversity). The 19 positions with high IG and high entropy (Area I) are considered critical positions in this work. According to the HA structure obtained from the Protein Data Bank (PDB code 1HGF [46]), 18 of these positions are located in the five epitopes and 15 of them are on the surface (Fig. 2.3, rendered using PyMOL [47]). The positions in Area II (high entropy and low antigenic degree) show that high genetic diversity does not necessarily imply a high antigenic discriminating score. For example, positions 226-D, 135-A, 121-D, 142-A and 186-B have high entropies and low IG values (Table 2.1 and Fig. 2.2). Among the 181 pairs of HA sequences, position 226-D is mutated in 61 pairs, and only 34 of these pairs are antigenic variants. A low IG indicates that a mutation at this position is less likely to yield an antigenic variant. Our method thus avoids the drawback of considering only genetic data, as was common in previous work.

Figure 2.2 The relationship between entropies and information gains of 329 amino acids on HA. The positions in Area I (e.g. 145-A, 189-B and 278-C), with both high entropy and high IG values, are highly correlated with antigenic variants. 145-A denotes the amino acid position 145 located in epitope A.

Table 2.1 The entropy, information gain, and co-mutated positions of 15 amino acid positions on HA sequences

Position-epitope (1) | Entropy | IG | Number of co-mutated positions | Co-mutated positions | Positive selection (2) / Cluster transition (3)
145-A | 0.87 | 1.00 | 12 | 9,31,63,78,83,126,137,160,193,197,242,278 | + +
137-A | 0.68 | 0.41 | 23 | 9,31,53,54,62,63,83,126,143,145,146,158,160,164,174,189,193,201,213,217,244,260,278 | +
193-B | 0.86 | 0.23 | 17 | 9,31,63,78,83,126,137,145,158,160,164,174,201,217,242,260,278 | + +
160-B | 0.58 | 0.28 | 16 | 2,31,54,62,126,137,143,146,156,158,164,197,217,244,260,278 | +
156-B | 0.80 | 0.43 | 8 | 54,62,143,146,160,197,244,260 | + +
226-D | 1.00 | 0.15 | 2 | 145,189 | +
135-A | 0.83 | 0.07 | 1 | 165 | +
121-D | 0.72 | 0.00 | 0 | | +
142-A | 0.47 | 0.00 | 0 | | +
186-B | 0.41 | 0.00 | 0 | | +
164-B | 0.24 | 0.46 | 6 | 126,137,158,174,201,217 | +
201-D | 0.27 | 0.36 | 4 | 137,164,174,217 | +
78-E | 0.14 | 0.29 | 4 | 31,63,126,242 |
174-D | 0.32 | 0.47 | 4 | 137,164,201,217 | +
63-E | 0.19 | 0.39 | 6 | 78,83,126,137,242,278 |

1 The epitope of the position on the HA sequence.

2 The position is under positive selection as defined by Bush et al. [33].

3 The position is a cluster-difference substitution as defined by Smith et al. [15].

The relationship between the IG values and structural locations of the 329 positions is shown in Fig. 2.3A. The positions with the four highest IG values (145-A, 189-B, 278-C, and 158-B) are colored blue, and the other positions shade toward gray according to their IG values. The positions with high IG values are located on the protein surface. Three (145-A, 189-B and 158-B) of the top four IG positions are located around the receptor-binding site, which is key to neutralizing the influenza virus. In addition, high-IG positions tend to be located on the top of the HA head, which is more exposed and more readily recognized by antibodies, and on the interface between HA monomers.

Figure 2.3 The distribution of IG values and co-mutation scores on the HA structure. (A) The distribution of the IG values of 329 amino acids on the HA structure (PDB code 1HGF [46]); R indicates the receptor-binding site. Blue and gray indicate the highest and the lowest IG values, respectively. (B) The structural locations and scores of the 12 co-mutated positions of position 145. The structures are rendered using PyMOL [47].

2.4.2. The rules of antigenic variants and predicting accuracies

We used the decision tree (Fig. 2.4) to build a model for predicting antigenic variants of the influenza A (H3N2) virus. Based on the IG values of the 329 amino acid positions derived from the 181 pairs in the training data set, six amino acid positions are selected as internal nodes of the tree. The first rule of the tree is that a pair is predicted as antigenic variants if position 145 is mutated, that is, if the residue types of the two sequences at position 145 differ. Among the 181 pairs of sequences in the training set, this rule applies to 62 pairs and predicts 61 of them correctly. The last rule of the tree is that a pair is predicted as similar viruses if none of the six positions (145, 189, 62, 155, 213, and 214) is mutated.
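The two rules spelled out in the text can be sketched as a small function. This is deliberately incomplete: only the first and last rules are stated in prose, so the sketch returns `None` for pairs that would be decided by the intermediate rules shown only in Fig. 2.4. The function name is hypothetical.

```python
# The six positions used as internal nodes of the tree, as listed in the text.
TREE_POSITIONS = (145, 189, 62, 155, 213, 214)

def apply_stated_rules(mutated_positions):
    """Apply only the first and last rules described in the text.

    Returns None when the outcome depends on the intermediate rules,
    which are given in Fig. 2.4 and are not reproduced here.
    """
    if 145 in mutated_positions:
        return "antigenic variants"      # first rule
    if not any(p in mutated_positions for p in TREE_POSITIONS):
        return "similar viruses"         # last rule
    return None

# Toy usage: a pair that differs at positions 145 and 226.
prediction = apply_stated_rules({145, 226})
```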

Figure 2.4 The decision tree and rules for predicting antigenic variants. Each internal node (circle) is represented as an amino acid position. The leaf node (square) includes the predicted antigenic type (i.e. "antigenic variants" and "similar viruses"), the numbers of total pairs (the first value) and predicted error pairs (the second value) by applying this rule in this node.

Based on this model, we derived seven rules; the prediction accuracies are 91.2% (165/181) on the training data set and 96.2% (30,675/31,878) on the independent data set. As shown in Fig. 2.5 and Table 2.2, our method outperformed the two comparative methods, Wilson & Cox (89.7%) [9] and Lee & Chen (92.4%) [40], on the independent data set. On this data set, the accuracies of the Wilson & Cox method for predicting antigenic variants and similar viruses are 99.71% and 32.74%, respectively. In contrast, our model performed well for both antigenic variants (99.73%) and similar viruses (76.34%).

Figure 2.5 Comparison of our method with two other methods (Wilson & Cox [9]; Lee & Chen [40]) for predicting antigenic variants on two data sets

Table 2.2 Comparison of our method with other methods for predicting the antigenic variants on 31,878 pairs

Class | Measure | Wilson & Cox, 1990 [9] | Lee & Chen [40] | Our method
Antigenic variants | Number of predicted pairs | 27020 | 26113 | 27026
Antigenic variants | Accuracy | 99.71% | 96.37% | 99.73%
Similar viruses | Number of predicted pairs | 1565 | 3337 | 3649
Similar viruses | Accuracy | 32.74% | 69.81% | 76.34%

1 the number of the pairs in the cluster.
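The overall accuracies quoted in the text follow from the per-class counts in Table 2.2. The short check below assumes that "number of predicted pairs" means correctly predicted pairs in each class, an interpretation supported by 27,026 + 3,649 = 30,675 for our method; the variable and function names are hypothetical.

```python
TOTAL_PAIRS = 31_878

# (correct antigenic-variant pairs, correct similar-virus pairs) per method.
CORRECT = {
    "Wilson & Cox": (27_020, 1_565),
    "Lee & Chen": (26_113, 3_337),
    "Our method": (27_026, 3_649),
}

def overall_accuracy(correct_variants, correct_similar, total=TOTAL_PAIRS):
    """Fraction of all pairs whose antigenic type is predicted correctly."""
    return (correct_variants + correct_similar) / total

accuracy_pct = {name: round(100 * overall_accuracy(*counts), 1)
                for name, counts in CORRECT.items()}
```

The comparison also shows that the three methods differ mainly on the similar-virus class, where the counts range from 1,565 to 3,649.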

2.4.3. Co-mutated positions for antigenic variants

Two amino acid positions may mutate simultaneously to cause antigenic drift, or may be highly co-evolved, in the H3N2 virus. Understanding the co-mutation of amino acid position pairs is one of the key steps toward understanding antigen-antibody interactions. Here, we used the co-mutation score, S(i,j), between position i and its co-mutated position j to assess the co-mutated pair (i,j) for predicting antigenic variants. We calculated all of the co-mutation combinations (i.e. 10,100 pairs) of the 101 amino acid positions which mutated more than once in the 181 pairs of HA sequences in the training data set.

Table 2.1 shows the co-mutated positions of some HA positions. In this work, position j is considered a co-mutated position of position i when the co-mutation z-score Z(i,j), defined in Equation (7), is more than 2.3, because such a score is significant (p-value of 0.01) with respect to the 10,100 pairs. Among the 329 positions of the HA sequences, 40 positions have co-mutated positions. The number of co-mutated positions per position ranges from 0 to 23, and the total number of significant pairs is 308 among the 10,100 pairs.
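The link between the z-score cutoff of 2.3 and a p-value of about 0.01 can be checked numerically. The sketch below assumes a one-sided test under a standard normal approximation, which the text does not state explicitly; the function name is hypothetical.

```python
import math

def upper_tail_p(z):
    """One-sided p-value P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p_at_cutoff = upper_tail_p(2.3)        # roughly 0.011
significant_fraction = 308 / 10_100    # share of pairs above the cutoff
```

Under these assumptions the cutoff corresponds to a tail probability close to 0.01, while about 3% of the 10,100 pairs exceed it in the data.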

In the tree model (Fig. 2.4), the position 145-A is selected as first node and has 12 significant

145-A are (145-A, 126-A), (145-A, 278-C) and (145-A, 137-A). Positions 145-A, 278-C, and 137-A are residues that cause the transition from cluster EN72 to cluster VI75 [15]. In addition to position 145-A, residue 156-B has 8 significant co-mutated positions (Table 2.1). Seven of these 8 positions (all except position 260-E) co-mutate with 156-B to cause the transition from cluster TX77 to cluster BK79 [15].

Table 2.3 The number of co-mutation positions of five epitopes and the other area on HA

 | Epitope A | Epitope B | Epitope C | Epitope D | Epitope E | Other area | Sum
Epitope A | 15 | 24 | 8 | 11 | 16 | 8 | 82
Epitope B | 19 | 15 | 6 | 13 | 13 | 5 | 71
Epitope C | 15 | 11 | 3 | 5 | 9 | 4 | 47
Epitope D | 12 | 13 | 3 | 8 | 6 | 4 | 46
Epitope E | 13 | 11 | 4 | 6 | 7 | 3 | 44
Other area | 4 | 2 | 1 | 3 | 4 | 4 | 18

Table 2.3 shows the numbers of significant co-mutation positions in six blocks: the five epitopes and the other area of HA. The numbers of co-mutation pairs between epitopes A and B (24 and 19 pairs, respectively) are markedly higher than those of the other blocks. This result implies that mutations on epitopes A and B have a high probability of causing antigenic drift. Moreover, residues in epitopes A and B form 82 and 71, respectively,

