
The ultimate aim of influenza research is to answer two questions: when should the vaccine be updated, and how should future vaccine candidates be chosen [1]? Many approaches, both experimental and computational, have been taken to answer them. Scientists have collected abundant data in order to discover patterns or evolutionary trends of the influenza virus [7-10]. Depending on whether HI titer values are considered, these studies fall into two groups. The first focuses on the genetic evolution of the HA protein [8, 9]; the second incorporates experimental data and can therefore also consider the antigenic evolution of the HA protein [11, 12].

At the genetic level, it has been shown that the HA1 sites involved in antigen determination exhibit significantly more non-synonymous than synonymous nucleotide substitutions [8], whereas the remaining sites show the more common pattern of primarily synonymous variation. These observations indicate that HA is undergoing positive Darwinian selection for new antigenic variants [13]. Bush et al. [8] identified 18 HA1 codon sites with significantly higher non-synonymous to synonymous ratios.

To analyze the evolutionary pattern of the influenza virus, a clustering method was proposed [10] that groups 560 HA protein sequences into 174 clusters, several of which are representative. By comparing the genetic variation within and between the representative clusters, the authors identified evolutionary trends of the influenza virus. They also proposed a method for predicting future vaccine candidates.

Before 2004, most work focused on the genetic level of HA. From 2004 onward, more effort was devoted to comparing genetic and antigenic evolution. The results show that genetic evolution is gradual, whereas antigenic evolution is punctuated [11]. Genetic evolution therefore does not correspond directly to antigenic evolution: a genetic change sometimes has a disproportionately large antigenic effect. The next question is what relation holds between genetic and antigenic evolution.

A global prediction model was built from historical WHO vaccine HI titer tables and HA sequences from 1968 to 2002 [12]. The best-performing model predicts an antigenic variant when there are more than 7 amino acid changes at the epitope sites (agreement = 83%). However, the importance of the individual positions in affecting cross-reactive antibodies remains unclear.

To find which key position changes affect cross-reactive antibody interaction, we apply an index from information theory. Information gain evaluates the relation between two variables, here genetic and antigenic evolution, and we use it as an index of the relation between them.

We hope to identify antigenically important positions and to understand the relation between genetic and antigenic evolution.

Chapter 2

Materials and Methods

2.1 Overview of Research Steps

The research workflow is divided into two parts (Fig 4). In the first part, we calculate the information gain of the 329 HA positions on a representative training dataset and evaluate how well information gain represents the relation between genetic and antigenic evolution. In the second part, we apply the important positions selected by information gain to predict antigenic variants on two unseen and meaningful application sets (test sets).

In the first part, we first select a representative training set that was used in a published work [12]. We then extract features from the sequences and the HA protein structure. The HI titers are transformed from fold changes of serial dilution into an antigenic distance between two influenza viruses. A larger antigenic distance means a greater antigenic difference between the two viruses.

Once we have the two variables (genetic features and antigenic distance), we can calculate the information gain for each of the 329 HA positions. Using a well-known method based on information gain (the decision tree C4.5), we select several clusters of important positions and obtain a trained model for predicting antigenic variants. After finding the important positions, we discuss how well information gain represents the relation between genetic and antigenic evolution. The selected positions are then used to predict antigenic variants, and the predictive performance is compared with related work.

In the second part, we use two unseen test sets that carry antigenic information. The first, smaller set (51 cases) consists of vaccine strains extracted from the WER (1968~2006), each case with known HI titer values. The second, larger set (5928 cases) contains 181 influenza viruses from 1968 to 2003, each with an antigenic cluster label [11]. We then apply the positions and rules from the trained model to these two test sets.

In the following sections, we first describe how the materials were prepared and then give the details of the methods.

2.2 Influenza Sequence Database

The Influenza Sequence Database (ISD) [14] is a well-known and frequently cited database that collects nucleotide sequences of influenza viruses. It covers all 3 influenza species and 8 segments from various hosts (Appendix ). The difference between the NCBI database and the ISD is that the ISD holds not only published but also unpublished sequences. The ISD also provides useful information such as the vaccine selections from 1999 to 2006 (Appendix I) and influenza virus activity in the United States from 1981 to the present (Appendix I). Since all sequences are stored in nucleotide format, translation is required. The EBI translation tool is recommended (http://www.ebi.ac.uk/emboss/transeq/).
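The translation step can also be sketched locally. The following is a minimal illustration, not the EBI tool itself: it assumes the standard genetic code and a known reading frame, and the function name `translate` and the short input fragment are made up for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in
# T, C, A, G order for each of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides: str, frame: int = 0) -> str:
    """Translate a nucleotide sequence into one-letter amino acid codes.

    Stop codons are written as '*'. Codons containing any symbol other than
    A, C, G, T (for example N) are written as 'X'. A trailing partial codon
    is ignored.
    """
    seq = nucleotides.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# Illustrative fragment only, not a database record.
print(translate("ATGAAGACTATCATTGCT"))
```

Choosing the correct reading frame and trimming the result to the region of interest are separate steps that this sketch does not attempt.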

2.3 Training Set

We need a representative and robust training set: it should include representative influenza virus strains and should be as complete and balanced as possible. From a literature search we chose a set used in a related work [12]. It consists of six sets of ferret serum HI cross-reactivity data comprising 45 influenza virus strains and 181 pairwise ferret serum HI titers (Table 2). From 1968 to 2005, 21 influenza virus strains were used as WHO vaccine components (Table 1), and this set covers 17 of them.

The first set included 11 viruses (55 pairwise comparisons, virus ID: A to K) isolated from 1971 to 1979 [15]. The second set included 8 viruses (28 pairwise comparisons, virus ID: J, L to R) isolated from 1979 to 1987 [16]. The third set included 10 viruses (45 pairwise comparisons, virus ID: S to AB) isolated from 1989 to 1994 [17]. The fourth set included 8 viruses (28 pairwise comparisons, virus ID: AC to AJ) isolated from 1994 to 1996 [18]. The fifth set included 5 viruses (10 pairwise comparisons, virus ID: AE, AK to AN) isolated from 1995 to 1999 [19]. The sixth set included 6 viruses (15 pairwise comparisons, virus ID: AN to AT) isolated from 1999 to 2002 [20]. (Note: the residue "x" at position 226 of strain TOK75 is assigned to leucine, which is identical to the residue of the other strains in Table 1. Sequences in Table 1 that had to be entered manually use J02135 as the template.) Information on all the sequences is listed in Table 3.

After feature extraction and calculation of the antigenic distance, the training set has 181 cases: 125 are of the variant type (antigenic distance ≧ 4) and the other 56 are of the equal type (antigenic distance < 4). Of the 329 residues of the HA protein, 101 positions change at least once in this set.

2.4 Test Set

The purpose of the test sets is to evaluate the correctness of the positions and rules learned from the training set. Since our method integrates both genetic and antigenic evolution, the test sets should also carry antigenic information.

The first set was extracted from the WER (1968~2006), in which we found 62 reference pairs with both homologous and heterologous HI titer values and available HA sequences. We further filtered these 62 cases with one more criterion: at least one strain in the pair must be a vaccine strain. This left 50 cases that satisfy the condition.

The second dataset includes 253 sequences, grouped into 11 clusters according to the K-means result obtained from antigenic distances transformed from HI titers [11] (Fig 5). After an all-against-all pairwise comparison, we identified 181 non-identical sequences for use in the test set (Table 4). Since the clustering was based on antigenic data, we assume that viruses in two different clusters have different antigenic properties (considered variant) and that members within a cluster have similar antigenic properties (considered equal) (Fig 6).
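The labeling assumption can be written as a short sketch. The virus names and cluster labels below are placeholders, and which pairs enter the test set is not decided by this code: the pair list is simply taken as an input.

```python
from typing import Dict, Iterable, List, Tuple


def label_pairs(
    cluster_of: Dict[str, str],
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Label each virus pair from its cluster membership.

    Two viruses in the same antigenic cluster are labeled 'equal';
    two viruses in different clusters are labeled 'variant'.
    """
    labeled = []
    for a, b in pairs:
        label = "equal" if cluster_of[a] == cluster_of[b] else "variant"
        labeled.append((a, b, label))
    return labeled


# Placeholder names and clusters, for illustration only.
clusters = {"virus_1": "C1", "virus_2": "C1", "virus_3": "C2"}
pairs = [("virus_1", "virus_2"), ("virus_1", "virus_3")]
for a, b, label in label_pairs(clusters, pairs):
    print(a, b, label)
```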

According to the article there are 273 isolates, but the final grouped table in the supporting material contains only 253 sequences. Using the query conditions given in the reference and its supporting material we obtained 255 sequences, including 3 sequences of A/SP/1/96 (AY661200, AY661199, AY661198) and 1 outlier, Dk/33/80. Two of the three A/SP/1/96 sequences are identical, and we adopt the first one, AY661200. One sequence, A/Sydney/5/97, appears in the grouped table but not in the supporting material. The number of sequences is therefore 255 - 2 (two identical) - 1 (Dk/33/80) + 1 (A/Sydney/5/97) = 253.

The test set has 181 non-identical sequences. After the antigenic types are assigned, there are 2118 equal cases and 3810 variant cases.

2.5 The ISD set

This set was downloaded from the Influenza Sequence Database on 2005/07/10. The query “A type, HA, Human, H3” returned 1744 sequences. The sequences downloaded from the ISD are in nucleotide format and differ in length, so we translate them into protein sequences and adjust their length to 329 residues. The procedure is shown in the flowchart (Fig. 7). For some virus strains the isolation date is recorded, so those sequences can be grouped by influenza season.

2.6 Feature Extraction from HA Sequence and 3D Protein Structure

The input to this problem is the HA protein sequences of two influenza virus strains, from which a pairwise comparison is generated. The most common way to compare two influenza virus strains is the Hamming distance (HD), which counts the total number of changed amino acids [12].

The HD, however, cannot account for the different importance of each position in determining antigenic properties. We therefore apply position-specific change (PSC) coding, in which the change at each position is recorded independently as a feature (Fig 8). For example, A/Panama/2007/99 and A/Fujian/411/2002 differ at 13 positions, so the HD is 13, whereas PSC coding records individually which 13 positions changed (Fig 8A).
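The difference between the two codings can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to the same length; the function names and the two 10-residue strings are invented for the example and are not real HA sequences.

```python
from typing import List


def psc_features(seq_a: str, seq_b: str) -> List[int]:
    """Position-specific change coding: 1 where the residues differ, else 0."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [int(a != b) for a, b in zip(seq_a, seq_b)]


def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Total number of changed positions (the sum of the PSC features)."""
    return sum(psc_features(seq_a, seq_b))


def changed_positions(seq_a: str, seq_b: str) -> List[int]:
    """1-based positions at which the two sequences differ."""
    return [i + 1 for i, bit in enumerate(psc_features(seq_a, seq_b)) if bit]


# Toy sequences, for illustration only.
strain_a = "QKLPGNDNST"
strain_b = "QKIPGNDKST"
print(hamming_distance(strain_a, strain_b))   # one number per pair
print(changed_positions(strain_a, strain_b))  # which positions changed
```

The HD collapses a pair into a single count, while the PSC vector keeps one feature per position, which is what allows a per-position statistic to be computed later.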

Since the protein structure of HA has been determined and deposited in the Protein Data Bank [21], we further want to use information about the structural environment to find important regions on the HA structure. Here we apply contact map coding, which takes the environment of each position into account. A region is defined as a sphere centered at an amino acid position (Fig 8B). Since there are 329 positions in HA, there are 329 regions on the 3D structure of HA. If any position in a region changes, the region is considered changed. The radius of the sphere was tested on the training set from 3 to 12 Å to determine which distance gives the best performance.
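A minimal sketch of contact map coding follows. It assumes one representative 3D coordinate per position (which atom is used is an assumption here, since the text does not specify it), and the coordinates below are invented for illustration rather than taken from a PDB entry.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def neighbors_within(coords: Sequence[Coord], radius: float) -> List[List[int]]:
    """For each position, list the indices of all positions whose coordinate
    lies within `radius` of it (the position itself is always included)."""
    return [
        [j for j, other in enumerate(coords) if math.dist(center, other) <= radius]
        for center in coords
    ]


def contact_map_features(changed: Sequence[int],
                         neighbors: List[List[int]]) -> List[int]:
    """Region i is coded 1 if any position inside its sphere changed, else 0."""
    return [int(any(changed[j] for j in region)) for region in neighbors]


# Four hypothetical positions on a line (coordinates in angstroms).
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
changed = [0, 1, 0, 0]  # PSC vector: only the second position changed

for radius in (2.0, 4.0):
    regions = neighbors_within(coords, radius)
    print(radius, contact_map_features(changed, regions))
```

With a small radius each region contains only its own center, so the coding reduces to PSC; larger radii spread a single change over the neighboring regions.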

2.7 Antigenic Distance

We want to find which position changes affect the HI titer value, so we need to define the degree of difference at which the HI titer is considered changed. In this work, we divide the degree of HI difference into two categories: antigenic variant and antigenic equal.

Raw HI values from experiments are not convenient for analysis, so they are usually transformed into antigenic distances for large-scale analysis. We apply the equation used in related work [12, 22, 23] to define antigenic variants. The equation calculates the antigenic distance between two virus strains I and J as follows:

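$$\text{antigenic distance} = \sqrt{\frac{(\text{homologous } I\_I)\,(\text{homologous } J\_J)}{(\text{heterologous } I\_J)\,(\text{heterologous } J\_I)}} \qquad (1)$$

This equation needs four HI values, which means that both antisera are required for the cross test: I_I and J_J are the homologous titers (each virus against its own antiserum), and I_J and J_I are the heterologous titers (each virus against the other's antiserum). An antigenic variant is defined when the antigenic distance is ≧ 4. That is, the homologous and heterologous HI titers should differ by a factor of 4 or more in both tests. An example is illustrated in Figure 9.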
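The calculation can be sketched as follows, assuming the antigenic distance is the square root of the product of the two homologous titers divided by the product of the two heterologous titers, with the cutoff of 4 used in this work. The function names and titer values are hypothetical.

```python
import math


def antigenic_distance(h_ii: float, h_jj: float, h_ij: float, h_ji: float) -> float:
    """Antigenic distance from the four HI titers of a cross test.

    h_ii, h_jj: homologous titers (each virus against its own antiserum)
    h_ij, h_ji: heterologous titers (each virus against the other's antiserum)
    """
    return math.sqrt((h_ii * h_jj) / (h_ij * h_ji))


def antigenic_type(distance: float, threshold: float = 4.0) -> str:
    """'variant' if the distance reaches the threshold, otherwise 'equal'."""
    return "variant" if distance >= threshold else "equal"


# Hypothetical titers: both heterologous titers are 4-fold below the homologous ones.
d = antigenic_distance(h_ii=640, h_jj=640, h_ij=160, h_ji=160)
print(d, antigenic_type(d))

# Hypothetical titers with only a 2-fold drop in each direction.
d = antigenic_distance(h_ii=640, h_jj=640, h_ij=320, h_ji=320)
print(d, antigenic_type(d))
```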

2.8 Entropy

Entropy measures the degree of disorder of a system. We use it here as a genetic-level index of the disorder at each position. Entropy is calculated as follows:

$$H(X) = -\sum_{r=1}^{20} P_r \log(P_r) \qquad (2)$$

H(X) is the entropy of position X, and P_r is the probability of amino acid type r at this position. The entropy of position X sums over all 20 amino acid types. The higher the entropy, the more genetically diverse the position.

2.9 Information Gain

Information gain is an index from information theory with statistical meaning. It measures the association between two variables: the higher the information gain, the stronger the association. In this case, a position with very high information gain means that an antigenic variant is expected if that position changes. We can therefore use information gain to relate genetic and antigenic evolution.

Here we use information gain to measure how strongly a change at each position affects antigenic change. The information gain of a given attribute X with respect to the class attribute Y is the reduction in uncertainty about the value of Y when we know the value of X. The equation is as follows:

$$IG(Y; X) = H(Y) - H(Y \mid X) \qquad (3)$$

The uncertainty about the value of Y is measured by its entropy, H(Y). The uncertainty about the value of Y when we know the value of X is given by the conditional entropy of Y given X, H(Y|X). Equation (3) can be written in the following form:

$$IG(Y; X) = -\sum_{i=1}^{k} P(y_i) \log P(y_i) + \sum_{j=1}^{l} P(x_j) \sum_{i=1}^{k} P(y_i \mid x_j) \log P(y_i \mid x_j) \qquad (4)$$

Equation (4) holds when Y and X are discrete variables that take values in {y1...yk} and {x1...xl}.
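The entropy and information gain definitions can be sketched for discrete data as below. The logarithm base is not stated in the text, so base 2 is an assumption here, and the function names and toy data are illustrative only.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2, an assumption) of the empirical distribution."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(y: Sequence[Hashable], x: Sequence[Hashable]) -> float:
    """H(Y|X): entropy of y inside each group of equal x, weighted by group size."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    n = len(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(y: Sequence[Hashable], x: Sequence[Hashable]) -> float:
    """IG(Y; X) = H(Y) - H(Y|X)."""
    return entropy(y) - conditional_entropy(y, x)


# Genetic-level entropy of one alignment column (toy residues).
print(entropy(list("AAAT")))

# Toy cases: x = 1 if the position changed, y = antigenic type of the pair.
y = ["variant", "variant", "equal", "equal"]
print(information_gain(y, [1, 1, 0, 0]))  # change always coincides with variant
print(information_gain(y, [1, 0, 1, 0]))  # change is unrelated to the type
```

In the second toy call the position changes in half of the cases, so it is genetically diverse, yet knowing whether it changed says nothing about the antigenic type.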

2.10 Selecting Important Positions by Information Gain

The key idea for selecting important positions is as follows. Suppose that there are many possible HA mutation patterns by which the influenza virus can escape immune selection. We can then classify these different HA mutations into several groups, each of which explains part of the antigenic change from 1968 to 2002.

The process is illustrated in Figure 18. We adopt a greedy method to select important positions. At level 1 we have the full training dataset (181 cases) and select the position P1 with the highest antigenic association (highest information gain). The cases at level 1 that have a mutation at P1 are considered explained by P1 and are removed from the dataset. At level 2, none of the remaining cases has a mutation at P1, so we find the position P2 with the highest information gain among the remaining cases.

By recursively selecting the position with the highest information gain and removing the explained cases, we finally obtain several positions that explain all cases.

Decision trees are sophisticated data mining tools for discovering patterns and using them to make predictions. The core of the decision tree methodology is information gain. Here we adopt the decision tree C4.5 [24] to select the position with the highest information gain at each level.
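The greedy selection described above can be sketched as follows. This is a simplified illustration, not C4.5 itself (C4.5 ranks splits by gain ratio and adds pruning); the base-2 logarithm, the stopping rule (stop when no position has positive information gain) and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(values: Sequence) -> float:
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(y: Sequence, x: Sequence) -> float:
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    n = len(y)
    return entropy(y) - sum(len(g) / n * entropy(g) for g in groups.values())


def greedy_select(cases: List[List[int]], labels: List[str]) -> List[int]:
    """Select positions level by level.

    cases:  one PSC vector per virus pair (1 = position changed)
    labels: antigenic type of each pair ('variant' or 'equal')
    Returns the 0-based indices of the selected positions, in selection order.
    """
    remaining = list(range(len(cases)))
    selected: List[int] = []
    while remaining:
        y = [labels[i] for i in remaining]
        best_pos, best_gain = None, 0.0
        for pos in range(len(cases[0])):
            gain = information_gain(y, [cases[i][pos] for i in remaining])
            if gain > best_gain:
                best_pos, best_gain = pos, gain
        if best_pos is None:
            break  # no position is informative for the remaining cases
        selected.append(best_pos)
        # Cases with a change at the selected position count as explained.
        remaining = [i for i in remaining if not cases[i][best_pos]]
    return selected


# Toy data: 6 pairs, 3 positions.
cases = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
labels = ["variant", "variant", "variant", "variant", "equal", "equal"]
print(greedy_select(cases, labels))
```

A position already selected cannot be chosen again, because every remaining case is unchanged at it and its information gain is therefore zero.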

Chapter 3

Results and Discussions

The results are divided into two main parts. The first part evaluates the suitability of information gain for representing the relation between genetic and antigenic evolution, and also shows the process and result of selecting important positions via information gain.

The second part uses the important positions selected by information gain to make predictions on two unseen test sets, and discusses the predictive performance and results.

3.1 Results and Meaning of the Information Gain Value

To find which position changes affect the HI value, we calculate the information gain of the 329 HA positions from the 181 cases. We first evaluate whether information gain is a proper index of the association between genetic and antigenic evolution.

The process is illustrated in Figure 10. Table A of Figure 10 lists eight cases of virus comparison: the leftmost column records the antigenic type of the two viruses, and the rightmost column records the positions that differ between their HA protein sequences. In Table B, we count the change frequency of each of the 329 positions over the 181 cases, separately for changes that occur in variant-type cases and in equal-type cases. In Table C, we calculate the entropy and information gain of each of the 329 positions. The three positions with the highest information gain, 145, 189 and 278, all have a high association with antigenic type. For example, position 145 changes in 62 of the 181 cases, and 61 of these changes occur in variant-type cases. We may conclude that a position with high information gain has a high association between position change and antigenic type change. The three positions with high entropy, 226, 135 and 124, show a low association between genetic and antigenic change.

For example, position 226 changes in 61 of the 181 cases; 34 of these changes occur in variant-type cases and 27 in equal-type cases. Position 226 therefore has a very low association between genetic and antigenic change. Notably, position 145, which has the highest information gain, has been verified experimentally to be able to cause a cluster transition [11].
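The contrast between positions 145 and 226 can be reproduced from the counts quoted in the text (125 variant and 56 equal cases in the training set; 61 and 1 changes for position 145; 34 and 27 for position 226). The sketch below assumes a base-2 logarithm and a binary changed/unchanged coding, so its values need not match those of Table C or Appendix II exactly.

```python
import math
from typing import Sequence


def entropy_from_counts(counts: Sequence[int]) -> float:
    """Shannon entropy (base 2, an assumption) of a distribution given as counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)


def information_gain_from_counts(changed_variant: int, changed_equal: int,
                                 total_variant: int = 125,
                                 total_equal: int = 56) -> float:
    """IG of antigenic type given whether a position changed, from a 2x2 table."""
    changed = (changed_variant, changed_equal)
    unchanged = (total_variant - changed_variant, total_equal - changed_equal)
    n = total_variant + total_equal
    h_y = entropy_from_counts((total_variant, total_equal))
    h_y_given_x = sum(sum(group) / n * entropy_from_counts(group)
                      for group in (changed, unchanged))
    return h_y - h_y_given_x


ig_145 = information_gain_from_counts(changed_variant=61, changed_equal=1)
ig_226 = information_gain_from_counts(changed_variant=34, changed_equal=27)
print(f"position 145: {ig_145:.4f}")
print(f"position 226: {ig_226:.4f}")
```

Both positions change about equally often (62 and 61 cases), but only at position 145 are the changes concentrated in variant-type cases, which is what the information gain rewards.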

The information gain of each position is plotted in Fig. 11, and the entropy of each position in Fig. 12. We also map the information gain onto the HA structure (Figure 13), where the information gain of the 329 positions is shown by color: the redder the color, the higher the information gain. The five positions with the highest information gain are labeled. Panel (A) is the front view of the HA monomer and panel (B) is the top view of the HA trimer. Comparing the red regions in the two views shows that more positions with high information gain are visible in the top view.

The comparison between information gain and entropy is plotted in Fig 14. From the genetic point of view, residues with high entropy may seem important, but from the point of view of information gain, positions with high entropy may carry zero information (e.g., position 124).

From this figure we may conclude that the positions with the highest information gain are those with a high association between genetic and antigenic evolution. The top 10 positions by information gain and the top 10 by codon diversity are listed in Table 5. The information gain of all positions is listed in Appendix II.

3.2 Discussion of Information Gain

Since information gain associates genetic and antigenic evolution, we discuss here its relationship with each of them.

The relation between information gain and genetic evolution is plotted in Figure 15.

For each of the 181 cases, we compare the number of genetic changes with the information gain, for both all positions (329 positions) and the epitope sites (131 positions). The linear regression correlation coefficient shows a good relation between genetic change and information gain (R > 0.9), and the epitope sites fit the genetic change better. For the same value of information gain, however, the number of changed positions may vary widely. For example, for information gain values near 0.5, the number of changed positions ranges from 7 to 19. This shows that information gain weights each position change differently rather than equally.

The relation between information gain and antigenic distance is plotted in Figure 16.

For each of the 181 cases, we compare the antigenic distance with the information gain, for both all positions (329 positions) and the epitope sites (131 positions). The sum of information gain fits a linear relation to antigenic distance (R > 0.74), and the epitope sites fit the antigenic distance better than all positions.

Antigenic variants are defined by an antigenic distance ≧ 4. In this figure, predicting an antigenic variant when the sum of information gain is > 0.1835 gives the best predictive performance, with an agreement of 87%.

3.3 The Ability of Information Gain to Predict Antigenic Variants

From Figure 15 we found that information gain has the potential to predict antigenic variants. For each pair of viruses we calculate the sum of the information gain of the changed positions, and compare the result with a related work [12] based on the Hamming distance (the number of differing amino acid positions). The result is illustrated in Figure 17, where the sum of information gain and the number of sequence mutations at the epitope sites are plotted for each of the 181 cases. When a case is predicted as an antigenic variant if its sum of information gain is > 0.1835, the agreement is 87% (158/181). When a case is predicted as an antigenic variant if its number of sequence mutations is ≧ 7, the agreement is 83%.
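The two prediction rules and the agreement measure can be sketched as below. The thresholds 0.1835 and 7 are the ones quoted in the text; the per-position information gain values, the changed-position lists and the observed labels are made-up placeholders, not the values of Appendix II or the training set.

```python
from typing import Dict, List, Sequence


def sum_information_gain(changed: Sequence[int], info_gain: Dict[int, float]) -> float:
    """Sum of the information gain over the changed positions of one pair."""
    return sum(info_gain.get(pos, 0.0) for pos in changed)


def predict_by_information_gain(changed: Sequence[int], info_gain: Dict[int, float],
                                threshold: float = 0.1835) -> str:
    return "variant" if sum_information_gain(changed, info_gain) > threshold else "equal"


def predict_by_mutation_count(changed_epitope: Sequence[int], min_changes: int = 7) -> str:
    return "variant" if len(changed_epitope) >= min_changes else "equal"


def agreement(predicted: List[str], observed: List[str]) -> float:
    """Fraction of cases whose predicted type matches the observed type."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Placeholder information gain values (hypothetical).
info_gain = {145: 0.20, 189: 0.15, 226: 0.03, 124: 0.0}
cases = [[145], [226, 124], [189, 226]]       # changed positions per pair
observed = ["variant", "equal", "variant"]    # hypothetical antigenic types

by_ig = [predict_by_information_gain(c, info_gain) for c in cases]
by_count = [predict_by_mutation_count(c) for c in cases]
print(by_ig, agreement(by_ig, observed))
print(by_count, agreement(by_count, observed))
```

The first rule can call a pair a variant on the strength of a single heavily weighted position, whereas the count rule needs at least seven changes regardless of where they fall.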
