
Chapter 4 Conclusions and Future Perspectives

4.2 Major contributions and Future Perspectives

By integrating genetic and antigenic evolution data, we have successfully applied information gain to represent the association between genetic and antigenic evolution. We also found that information gain is a good index for predicting antigenic variants.

Compared with the traditional Hamming distance, which weights every position equally, information gain gives each position a different weight, which helps us identify the positions that are important for antigenic evolution.
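The contrast between the two measures can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the helper names are hypothetical, and the weight table holds only the two information gain values quoted in the caption of Table 6 (positions 145 and 189), so the sum it produces is partial.

```python
import re

# Information gain of the two top-ranked positions, as quoted in the Table 6 caption.
# All other positions are omitted here (assumption), so sums are only partial.
IG_BY_POSITION = {145: 0.1969, 189: 0.1286}

def parse_positions(changes):
    """Turn substitutions such as 'N145K' into a list of integer positions."""
    positions = []
    for token in changes.split():
        match = re.fullmatch(r"[A-Z](\d+)[A-Z]", token)
        if match is None:
            raise ValueError(f"not a substitution: {token!r}")
        positions.append(int(match.group(1)))
    return positions

def hamming_distance(changes):
    """Every changed position counts as 1, whatever the position is."""
    return len(parse_positions(changes))

def sum_of_information_gain(changes, ig_by_position):
    """Every changed position counts with its own information gain weight."""
    return sum(ig_by_position.get(p, 0.0) for p in parse_positions(changes))

# A/Hong_Kong/23/92 vs A/Madrid/252/93, substitutions as listed in Table 6.
pair = "G135K N145K R208K T214I"
print(hamming_distance(pair))                                   # 4
print(round(sum_of_information_gain(pair, IG_BY_POSITION), 4))  # 0.1969 (partial sum)
```

With only four changes this pair has a small Hamming distance, yet the single change at position 145 already carries a large weight.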

Recently, the threat of avian influenza has emerged, and little is known about it.

Our method could also be applied to avian influenza virus and has high potential to identify its important positions.

The interaction between antigen and antibody raises many interesting questions. Which positions are immunodominant, and what kinds of residue changes are important? Can the protein structure help to explain some of these problems? So far we have only tried to find which position changes are important, and many interesting problems remain. As a next step, we want to know which types of amino acid changes are important, and more effort is needed to study the co-evolution between positions.

The methodology applied here to influenza virus could also be applied to other kinds of molecular interactions for which experimental data represent the degree of binding affinity.

Table 1. The influenza vaccine components recommended by WHO from 1968 to 2006

Year Vaccine Strain a Influenza Season

1968~1973 A/Hong_Kong/1/68 1968/10/01~1973/09/30

1971~1974 A/England/42/72 1972/10/01~1974/09/30

1975~1976 A/Port_Chalmers/1/73 1974/10/01~1976/09/30

1975 A/Scotland/840/74 1975/10/01~1976/09/30

1977~1978 A/Victoria/3/75 1976/10/01~1978/09/30

1979~1980 A/Texas/1/77 1978/10/01~1980/09/30

1981~1983 A/Bangkok/1/79 1980/10/01~1983/09/30

1984~1986 A/Philippines/2/82 1983/10/01~1986/09/30

1987 A/Mississippi/1/85, A/Christchurch/4/85 1986/10/01~1987/09/30

1988 A/Leningrad/360/86 1987/10/01~1988/09/30

1989 A/Sichuan/2/87 1988/10/01~1989/09/30

1990 A/Shanghai/11/87 1989/10/01~1990/09/30

1991 A/Guizhou/54/89 1990/10/01~1991/09/30

1992~1993 A/Beijing/353/89 1991/10/01~1993/09/30

1994 A/Beijing/32/92 1993/10/01~1994/09/30

1995 A/Shangdong/9/93 1994/10/01~1995/09/30

1996 A/Johannesburg/33/94 1995/10/01~1996/09/30

1997~1998 A/Wuhan/359/95 1996/10/01~1998/09/30

1999~2000 A/Sydney/5/97 1999/05/01~2000/04/30

2000~2004 A/Moscow/10/99 2000/05/01~2004/04/30

2004~2005 A/Fujian/411/2002 2004/05/01~2005/04/30

2005_2 A/Wellington/1/2004 2005/05/01~2005/10/31

2006 A/California/7/2004 2005/11/01~2006/10/30

2007_1 A/Wisconsin/67/2005 2006/11/01~2007/04/30

a. Influenza virus strains in bold and blue font are present in the training set.

Table 2. The six HI titer tables adopted in the training set

Table From To

Seq Pairs Variant a Equal b Period Years

a. The antigenic type is variant when antigenic distance ≧ 4.

b. The antigenic type is equal when antigenic distance < 4.

c. The influenza virus strain names are in abbreviated form, and names in blue and bold font are vaccine strains.

Table 3. The list of influenza virus strains in training set

Table 4. The sequence number and name of the 11 clusters in the test set [11].

Group Name Sequence number Non-identical

1 HK68 14 12
2 EN72 15 10
3 VI75 9 8
4 TX77 3 3
5 BA79 16 14
6 SI87 25 18
7 BE89 64 29
8 BE92 57 43
9 WU95 28 25
10 SY97 16 16
11 FU02 6 3

total 253 181

Table 5. The positions with the top 10 information gain and the positions with the top 10 codon diversity. Residues that tend to change in a specific antigenic type have higher information gain. Residues with a high codon diversity rank may have zero information gain.

a. The number of times the residue changes in the training set.

b. The number of times the residue changes in the variant type.

c. Information gain.

d. The rank of codon diversity of each residue [7].

e. Whether the residue is in the receptor binding sites.

f. Whether the residue is under positive selection [8].

g. Whether this residue leads to cluster transition [11].

Table 6. The 14 cases predicted successfully by information gain but falsely by Hamming distance. Eight of the nine variant cases have changes at the top two information gain positions (145: 0.1969 and 189: 0.1286). Only A/England/42/72 vs A/Port_Chalmers/1/73 does not have any position with information gain > 0.1 but still reaches the 0.1836 threshold. The five equal cases all have changes at positions with low information gain.

Virus1 Virus2 Antigenic Type HD (Epitope) IG (Epitope) Amino acid changes
A/Leningrad/360/86 A/Shanghai/11/87 Equal 8 0.166 I88V F94Y G124D T138A Y155H E188D K189R S247R
A/Hong_Kong/1/94 A/Guangdong/25/93 Equal 8 0.0578 P47S K92E N96S N124D D216N Y219S L226Q R299K
A/Alaska/10/95 A/South_Africa/1147/96 Equal 10 0.0962 T121N G124S D133N K135T G142R D165N D190V I226V N262S D275L
A/Wuhan/359/95 A/South_Africa/1147/96 Equal 8 0.1178 T121N G124S D133N G142R D190V I194L I226V G275L
A/Fujian/47/96 A/South_Africa/1147/96 Equal 8 0.1029 T121N G124S D133N G142R D190V R193S I226V D275L
A/England/42/72 A/Port_Chalmers/1/73 Variant 6 0.2926 L3F D63N T160A N188D S193N A198T G208R
A/Allegheny_County/29/76 A/Victoria/112/76 Variant 4 0.2112 S157L R189K S209N R240G
A/Hong_Kong/34/90 A/Hong_Kong/23/92 Variant 5 0.2732 S157L V174F R189S I214T T276N
A/Hong_Kong/34/90 A/Shangdong/9/93 Variant 6 0.3139 D53G S157L V174F R189S I214T T276N
A/Beijing/32/92 A/Scotland/142/93 Variant 6 0.2815 H75N I121T S157L R189S R201K Q226L
A/Hong_Kong/23/92 A/Madrid/252/93 Variant 4 0.2569 G135K N145K R208K T214I
A/Shangdong/9/93 A/Madrid/252/93 Variant 5 0.2976 G53D G135K N145K R208K T214I
A/Scotland/160/93 A/Madrid/252/93 Variant 4 0.2866 N145K R208K F219S L226Q

Table 7.

This table analyzes the 91 cases to which the relations found by the contact map apply.

The positions are divided into two groups according to their structure neighbors. Each value is the number of cases in which both positions change. For example, the marked value 31 means that position 133 and position 145 change at the same time in 31 cases. The average and standard deviation of the frequencies are 7.6 and 8.5, respectively. Values higher than the average plus one standard deviation are underlined. The result shows that positions 133 and 145 in region 1 co-evolve strongly with positions 156, 158, 160, and 197 in region 2.

Region 1 Region 2

Position 131 132 133 135 145 146 155 129 156 157 158 159 160 196 197

131 5 0 5 0 0 0 0 0 0

132 10 10 3 1 4 10 4 0 2

133 32 2 31 9 18 2 19 4 19

135 23 0 9 15 0 0 0 0 6

145 53 7 28 14 22 7 24 0 19

146 18 2 18 2 14 2 18 0 18

155 28 10 14 1 4 22 4 0 2

129 0 10 2 0 7 2 10 10

156 5 3 31 9 28 18 14 50

157 0 1 9 15 14 2 1 23

158 0 4 18 0 22 14 4 30

159 0 10 2 0 7 2 22 22

160 0 4 19 0 24 18 4 32

196 0 0 4 0 0 0 0 4

197 0 2 19 6 19 18 2 25

sum 5 34 104 30 121 74 57 31 108 42 62 43 69 4 66

average 7.6

STD 8.5

AVE+1STD 16
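The counting behind Table 7 can be sketched as below. This is only an illustration under stated assumptions: each case is represented as a set of changed positions, the function names are hypothetical, the toy cases are invented, and the population standard deviation is used because the text does not say which form of standard deviation was applied.

```python
from itertools import product
from statistics import mean, pstdev

def co_change_counts(cases, region1, region2):
    """Count, for each (region 1, region 2) position pair, the cases where both change."""
    counts = {pair: 0 for pair in product(region1, region2)}
    for changed in cases:
        changed = set(changed)
        for p, q in counts:
            if p in changed and q in changed:
                counts[(p, q)] += 1
    return counts

def above_cutoff(counts):
    """Keep the pairs whose count exceeds average + 1 standard deviation."""
    values = list(counts.values())
    cutoff = mean(values) + pstdev(values)  # population SD is an assumption
    return {pair: n for pair, n in counts.items() if n > cutoff}

# Toy data (hypothetical): each set lists the positions changed in one case.
toy_cases = [{133, 156}, {133, 156}, {145, 160}, {131}]
counts = co_change_counts(toy_cases, region1=[131, 133, 145], region2=[156, 160])
print(above_cutoff(counts))  # {(133, 156): 2}
```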

Table 8. The comparison between our models and related works.

                       Wilson & Cox, 1990   Lee & Chen, 2004   Sum of Information Gain   IG Sets            Contact Map
Coverage of Positions  131/329              131/329            131/329                   101/329            229/329
Method                 Hamming Distance     Hamming Distance   Information Gain          Information Gain   Information Gain
Number of Rules        1                    1                  1                         7                  10
Error Cases            48                   31                 23                        16                 7
Performance            73.5%                82.9%              87.2%                     91.2%              96.1%

Table 9. The cases falsely predicted by the four methods (Lee, 2004; Sum of IG; IG Sets; Contact Map). The column “Antigenic Type” gives the true class. In the four method columns (Lee, 2004; Sum of IG; IG Sets; Contact Map), the value “1” means a successful prediction and “0” means a false prediction. Rows in orange are cases where our three models perform better than Lee, 2004, and rows in pink are cases where IG Sets and Contact Map perform better than the other two models.

Virus1 Virus2 Antigenic Type

L3F D63N T160A N188D S193N A198T G208R

A/Mayo_Clinic/1/75 A/England/864/75 Variant 45.25 1 1 1 0 8 8

K50R E82K N137Y I145N Q189K K193N M260I Y308N

A/Victoria/3/75 A/Allegheny_County/29/76 Equal 1.89 0 0 0 1 7 7

N53D S145N K189R N193D I213V I230V G240R

A/Allegheny_County/29/76 A/Victoria/112/76 Variant 9.24 0 1 1 1 4 4 S157L R189K S209N R240G

A/Bangkok/1/79 A/Bangkok/2/79 Variant 9.24 0 0 0 0 3 3 D188Y N193K S278I

A/Bangkok/1/79 A/Philippines/2/82 Variant 11.31 1 0 1 1 7 7

A138T D144N N173K V182I I213V N248T K307R

A/Bangkok/1/79 A/Leningrad/360/86 Variant 11.31 1 0 1 1 10 11

N2K V88I A138T D144V S159Y V163A N173K D188E I213V N248T K307R

A/Mississippi/1/85 A/Sydney/1/87 Equal 1.41 0 0 0 0 9 9

F94Y G124D A138S Y155H K156E S159Y K189R N193K Q226L

A/Leningrad/360/86 A/Victoria/7/87 Variant 5.66 1 0 1 0 7 7

I88V F94Y G124D T138A Y155H E188D K189R

A/Leningrad/360/86 A/Shanghai/11/87 Equal 2 0 1 0 1 8 8

I88V F94Y G124D T138A Y155H E188D K189R S247R

A/Leningrad/360/86 A/Sydney/1/87 Equal 2 0 0 0 1 8 8

I88V F94Y G124D T138S Y155H E188D K189R N193K

A/Victoria/7/87 A/Sichuan/2/87 Variant 5.66 0 0 0 0 2 2 E156K S186V

A/Hong_Kong/34/90 A/Hong_Kong/23/92 Variant 5.66 0 1 1 1 5 5 S157L V174F R189S I214T T276N

A/Hong_Kong/34/90 A/Shangdong/9/93 Variant 5.66 0 1 1 1 6 6 D53G S157L V174F R189S I214T T276N

A/Beijing/32/92 A/Hong_Kong/23/92 Equal 2 1 1 0 1 3 3 S157L R189S T276N

A/Beijing/32/92 A/Shangdong/9/93 Equal 2 1 0 0 1 4 4 D53G S157L R189S T276N

A/Beijing/32/92 A/Scotland/142/93 Variant 5.66 0 1 1 1 6 6 H75N I121T S157L R189S R201K Q226L

A/Hong_Kong/23/92 A/Scotland/160/93 Equal 2.83 1 1 0 1 4 4 G135K T214I S219F Q226L

A/Hong_Kong/23/92 A/Hong_Kong/1/94 Equal 2.83 0 1 0 1 7 7

S47P D124N G135K T214I N216D S219Y Q226L

A/Hong_Kong/23/92 A/Madrid/252/93 Variant 22.63 0 1 1 1 4 4 G135K N145K R208K T214I

A/Hong_Kong/23/92 A/Guangdong/25/93 Variant 5.66 0 0 1 1 5 5 K92E N96S G135K T214I R299K

A/Shangdong/9/93 A/Scotland/160/93 Variant 4 0 0 1 1 5 5 G53D G135K T214I S219F Q226L

A/Shangdong/9/93 A/Hong_Kong/1/94 Variant 4 1 0 1 1 8 8

S47P G53D D124N G135K T214I N216D S219Y Q226L

A/Shangdong/9/93 A/Madrid/252/93 Variant 16 0 1 1 1 5 5 G53D G135K N145K R208K T214I

A/Shangdong/9/93 A/Guangdong/25/93 Variant 11.31 0 0 1 1 6 6 G53D K92E N96S G135K T214I R299K

A/Scotland/160/93 A/Scotland/142/93 Variant 4 1 0 1 1 7 7

H75N I121T K135G R201K I214T F219S N276T

A/Scotland/160/93 A/Madrid/252/93 Variant 8 0 1 1 1 4 4 N145K R208K F219S L226Q

A/Hong_Kong/1/94 A/Scotland/142/93 Variant 5.66 1 0 1 1 10 10

P47S H75N I121T N124D K135G R201K I214T D216N Y219S N276T

A/Hong_Kong/1/94 A/Guangdong/25/93 Equal 2 0 1 1 1 8 8

P47S K92E N96S N124D D216N Y219S L226Q R299K

A/Scotland/142/93 A/Guangdong/25/93 Variant 11.31 1 0 1 1 10 10

N75H K92E N96S T121I G135K K201R T214I L226Q T276N R299K

A/Alaska/10/95 A/South_Africa/1147/96 Equal 2 0 1 1 1 10 10

T121N G124S D133N K135T G142R D165N D190V I226V N262S D275L


T121N G124S D133N G142R D190V I194L I226V G275L

A/New_York/37/96 A/South_Africa/1147/96 Equal 2 0 1 1 1 7 7

T121N G124S D133N G142R D190V I229R D275L

A/Fujian/47/96 A/South_Africa/1147/96 Equal 2.83 0 1 1 1 8 8

T121N G124S D133N G142R D190V R193S I226V D275L

A/South_Africa/1147/96 A/Auckland/5/96 Equal 1 0 1 1 1 7 7

N96D N121T S124G N133D R142G V190D L275G

A/Sydney/5/97 A/Moscow/10/99 Equal 1.41 1 0 1 1 6 8

I3L R57Q Y137S S142R K160R I194L A196T H233Y

A/Sydney/5/97 A/Panama/2007/99 Equal 1.41 0 0 1 1 8 12

I3L P21S R57Q Y137S S142R I144N D172E H183L T192I I194L I226V H233Y

A/Sydney/5/97 A/Ireland/10586/99 Equal 1.41 0 0 1 1 7 12

I3L P21S R57Q Y137S S142R D172E H183L T192I I194L I226V H233Y D271N

A/Fujian/140/2000 A/Chile/6416/2001 Variant 4 1 0 0 1 7 12

G14C A43V A106V N144D S186G V194L P199S P221H I226V N246K C247S S273P

A/Fujian/140/2000 A/Hong_Kong/1550/2002 Equal 2 0 1 1 1 8 14

G14C A43V R50G E83K N96S S186V V194I P199S V202I W222R G225D I226V C247S S273P

A/Fujian/140/2000 A/New_York/55/2001 Variant 5.66 1 0 0 1 7 12

G14C A43V G49S A106V N144D S186G V194I P199S I226V R229G C247S S273P

A/Chile/6416/2001 A/Hong_Kong/1550/2002 Equal 2 0 1 1 1 7 12

R50G E83K N96S V106A D144N G186V L194I V202I H221P W222R G225D K246N

False Cases 31 23 16 7

Accuracy Rate 0.83 0.9 0.91 0.961

Table 10.

Comparison of the prediction accuracy of the various methods. The result shows that our three methods have the best accuracy. However, the contact map, which performs best on the training set, does not have the best accuracy on either test set 1 or test set 2.

                                         Wilson & Cox, 1990   Lee & Chen, 2004   Sum of IG   IG Sets   Contact Map
Training Set                Right Cases  136                  150                158         165       174
                            Accuracy     73.5%                82.9%              87.2%       91.2%     96.1%
Test Set 1 (WER)            Right Cases  29                   32                 38          37        34
                            Accuracy     58%                  64%                76%         74%       68%
Test Set 2 (Smith, 2004)    Right Cases  4191                 4710               5180        5201      4504
                            Accuracy     70.7%                79.5%              87.4%       87.7%     76.0%

Table 11

Details of the 50 test cases from WER. The values in columns 4, 5, and 6 indicate whether the prediction is successful; the value “1” means a successful prediction. The 13 cases falsely predicted by IG Sets are listed in the top 13 rows. The first 7 antigenic equal cases all have a position change at 145, so level one applies to them. The first three cases, A/Hong_Kong/1/68 vs A/England/42/72, A/Shanghai/31/80 vs A/Bangkok/1/79, and A/Texas/1/77 vs A/Belgium/2/81, all have changes at more than 11 epitope positions, whereas the maximum number of position changes for an antigenic change in the training set is only 10.

Virus1 Virus2 Antigenic Type

A/Shanghai/24/90 A/Beijing/353/89 Equal 2.83 0 0 0 1 8 8 G135E,N145K,S186I,D190E,S193N,Q226L,E62K,N262T
A/Wyoming/3/2003 A/Wellington/1/2004 Equal 2.00 0 0 0 2 9 9 A128T,Y159F,V186G,S189N,S193N,Y219S,I226V,S227P,N246S
A/Texas/1/77 A/Bangkok/1/79 Variant 5.66 1 1 0 6 7 7 N133S,P143S,G146S,K156E,T160K,Q197R,V217I
A/Texas/1/77 A/Bangkok/2/79 Variant 11.31 1 1 0 6 10 10 N133S,P143S,G146S,K156E,T160K,D188Y,N193K,Q197R,S278I,V217I
A/Caen/1/84 A/Philippines/2/82 Variant 5.66 0 0 0 6 6 7 A138T,V144N,K156E,Y279S,V182I,Y94F,K2N
A/Caen/1/84 A/Christchurch/4/85 Variant 4.00 0 0 0 6 6 6 K156G,S159Y,V163A,Y279S,N246S,Y94F
A/Christchurch/4/85 A/Philippines/2/82 Variant 5.66 1 0 0 6 7 8 A138T,V144N,G156E,Y159S,A163V,V182I,S246N,K2N
A/Washington/15/91 A/Beijing/353/89 Variant 4.00 0 0 0 6 2 2 D158E,S193N
A/Hong_Kong/1/68 A/England/878/69 Variant 5.33 1 1 1 1 10 13 D133N,G144D,K145S,R146G,R50K,S54N,K62I,D63N,K80Q,N81D,V31N,F139C,P199S
A/Wellington/4/85 A/Philippines/2/82 Equal 2.83 1 1 1 6 5 6 A138T,V144N,Y159S,A163V,V182I,K2N
A/Wellington/4/85 A/Mississippi/1/85 Equal 2.00 1 1 1 6 3 3 E156K,Y159S,L226Q
A/England/427/88 A/Shanghai/11/87 Equal 1.89 1 1 1 6 5 6 A131T,R299K,S247R,K82E,E83K,S15L
A/England/427/88 A/Guizhou/54/89 Equal 1.63 1 1 1 6 4 5 V144I,Y159H,S186I,Q44H,S15L
A/Guizhou/54/89 A/Beijing/353/89 Variant 22.63 0 1 1 1 5 5 G135E,I144V,N145K,H159Y,H44Q
A/Guandong/39/89 A/Shanghai/11/87 Equal 3.46 0 1 1 6 8 8 A131T,H159Y,I186S,R299K,S247R,K82E,E83K,M260I
A/Guandong/39/89 A/Beijing/353/89 Variant 13.06 0 1 1 1 4 4 G135E,N145K,H159Y,M260I
A/Shanghai/11/87 A/Beijing/353/89 Variant 7.54 1 1 1 1 8 8 T131A,G135E,N145K,S186I,K299R,R247S,E82K,K83E
A/Shanghai/16/89 A/Beijing/353/89 Variant 5.66 0 1 1 1 3 3 G135E,N145K,H159Y
A/England/261/91 A/Beijing/353/89 Equal 1.41 1 1 1 6 5 5 T138A,S186I,S193N,I196V,D172G
A/Washington/15/91 A/Beijing/32/92 Variant 8.00 1 1 1 1 10 10 S133D,E135G,K145N,E156K,D158E,I186S,E190D,I214T,L226Q,T262N
A/Shangdong/9/93 A/Johannesburg/33/94 Variant 5.66 1 0 1 7 7 7 D124N,G135K,S47P,G53D,T214I,N216D,S219Y
A/Guangdong/25/93 A/Johannesburg/33/94 Equal 1.00 0 1 1 6 7 7 D124N,S47P,K299R,S96N,N216D,S219Y,E92K
A/Wuhan/359/95 A/Sydney/5/97 Variant 16.00 1 1 1 3 10 12
A/California/7/2004 A/Wyoming/3/2003 Variant 4.00 1 1 1 1 8 8 N145K,T128A,F159Y,G186V,N188D,N189S,S219Y,P227S
A/California/7/2004 A/Wellington/1/2004 Variant 4.00 0 1 1 1 5 5 N145K,N188D,S193N,I226V,N246S
A/Wisconsin/67/2005 A/California/7/2004 Equal 2.83 1 1 1 6 3 6 D122N,D188N,F193S,H195Y,I223V,N225D

Table 12. The result of applying the trained model to test set 2. The overall performance is 87.7%. The left part of the table evaluates the cluster transitions, and the right part evaluates whether strains in the same cluster have close antigenic properties. Most false cases are of the [b][a] type, which means that pairs in the same cluster are predicted as antigenic variants. The majority of the false cases are in the “BE92” cluster.

Cluster-Transition [a][a] [a][b] [b][a] [b][b] Cluster

HK68-EN72 120 0 44 22 HK68 1

EN72-VI75 80 0 29 16 EN72 2

VI75-TX77 24 0 13 15 VI75 3

TX77-BA79 40 2 2 1 TX77 4

BA79-SI87 252 0 45 46 BA79 5

SI87-BE89 522 0 17 136 SI87 6

BE89-BE92 1247 0 0 406 BE89 7

BE92-WU95 1048 27 486 417 BE92 8

WU95-SY97 400 0 47 253 WU95 9

SY97-FU02 48 0 15 105 SY97 10

0 0 0 3 FU02 11

Rate 99.24% 0.76% 32.96% 67.04%

[a][a]: Variant type, predicted Variant
[a][b]: Variant type, predicted Equal
[b][a]: Equal type, predicted Variant
[b][b]: Equal type, predicted Equal
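The rates in the last row of Table 12 and the overall performance can be re-derived from the counts in the table. The sketch below only repeats that arithmetic; the variable names are illustrative, and each tuple pairs a cluster label with the four counts on its row.

```python
# (cluster, [a][a], [a][b], [b][a], [b][b]) copied row by row from Table 12.
ROWS = [
    ("HK68", 120, 0, 44, 22),
    ("EN72", 80, 0, 29, 16),
    ("VI75", 24, 0, 13, 15),
    ("TX77", 40, 2, 2, 1),
    ("BA79", 252, 0, 45, 46),
    ("SI87", 522, 0, 17, 136),
    ("BE89", 1247, 0, 0, 406),
    ("BE92", 1048, 27, 486, 417),
    ("WU95", 400, 0, 47, 253),
    ("SY97", 48, 0, 15, 105),
    ("FU02", 0, 0, 0, 3),
]

aa = sum(r[1] for r in ROWS)  # variant pairs predicted as variant
ab = sum(r[2] for r in ROWS)  # variant pairs predicted as equal
ba = sum(r[3] for r in ROWS)  # equal pairs predicted as variant
bb = sum(r[4] for r in ROWS)  # equal pairs predicted as equal

def percent(part, whole):
    return round(100.0 * part / whole, 2)

print(percent(aa, aa + ab), percent(ab, aa + ab))  # 99.24 0.76
print(percent(ba, ba + bb), percent(bb, ba + bb))  # 32.96 67.04
print(aa + bb, percent(aa + bb, aa + ab + ba + bb))  # 5201 87.74
```

The 5201 correctly predicted pairs match the "Right Cases" entry for IG Sets on the second test set in Table 10.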

Table 13.

Analysis of the rules and positions that lead to cluster transitions.

The result shows that position 145 plays an important role in 6 cluster transitions. Position 189 accounts only for the BA79-SI87 cluster transition. Position 62 accounts for the WU95-SY97 cluster transition.

Level1 Level2 Level3 Level4 Level5 Level6 Level7
145 189 62 155 213 214 214 sum

HK68-EN72 67 0 7 46 0 0 0 120

EN72-VI75 73 7 0 0 0 0 0 80

VI75-TX77 3 0 8 0 13 0 0 24

TX77-BA79 0 0 28 0 12 2 0 42

BA79-SI87 0 252 0 0 0 0 0 252

SI87-BE89 522 0 0 0 0 0 0 522

BE89-BE92 1247 0 0 0 0 0 0 1247

BE92-WU95 1032 2 2 0 0 27 12 1075

WU95-SY97 39 0 361 0 0 0 0 400

SY97-FU02 3 0 0 45 0 0 0 48

Total 2986 261 406 91 25 29 12 3810

Table 14. Cluster-difference amino acid substitutions, and distances between antigenic clusters [11].

Table 15. This table lists all 39 cases that appear in both the training set and the test set. “TrainingType” is the true class and “TestType” is our assumption. If TrainingType and TestType differ, the Conflict column is marked “X”.

Index Virus1 Virus2 Cluster1 Cluster2 TrainingType TestType Conflict
1 A/HK/107/71 A/EN/42/72 HK68 EN72 Variant Variant
2 A/HK/107/71 A/PC/1/73 HK68 EN72 Variant Variant
11 A/EN/42/72 A/PC/1/73 EN72 EN72 Variant Equal X
13 A/EN/42/72 A/VIC/3/75 EN72 VI75 Variant Variant
21 A/PC/1/73 A/VIC/3/75 EN72 VI75 Variant Variant
64 A/PH/2/82 A/LE/360/86 BA79 BA79 Equal Equal
65 A/PH/2/82 A/VI/7/87 BA79 SI87 Variant Variant
66 A/PH/2/82 A/SI/2/87 BA79 SI87 Variant Variant
67 A/PH/2/82 A/SH/11/87 BA79 SI87 Variant Variant
74 A/LE/360/86 A/VI/7/87 BA79 SI87 Variant Variant
75 A/LE/360/86 A/SI/2/87 BA79 SI87 Variant Variant
76 A/LE/360/86 A/SH/11/87 BA79 SI87 Equal Variant X
78 A/VI/7/87 A/SI/2/87 SI87 SI87 Variant Equal X
79 A/VI/7/87 A/SH/11/87 SI87 SI87 Variant Equal X
81 A/SI/2/87 A/SH/11/87 SI87 SI87 Equal Equal
85 A/BE/353/89 A/BE/32/92 BE89 BE92 Variant Variant
89 A/BE/353/89 A/HK/1/94 BE89 BE92 Variant Variant
90 A/BE/353/89 A/SC/142/93 BE89 BE92 Variant Variant
92 A/BE/353/89 A/GU/25/93 BE89 BE92 Variant Variant
104 A/BE/32/92 A/HK/1/94 BE92 BE92 Variant Equal X
105 A/BE/32/92 A/SC/142/93 BE92 BE92 Variant Equal X
106 A/BE/32/92 A/MA/252/93 BE92 WU95 Variant Variant
123 A/HK/1/94 A/SC/142/93 BE92 BE92 Variant Equal X
124 A/HK/1/94 A/MA/252/93 BE92 WU95 Variant Variant
125 A/HK/1/94 A/GU/25/93 BE92 BE92 Equal Equal
126 A/SC/142/93 A/MA/252/93 BE92 WU95 Variant Variant
127 A/SC/142/93 A/GU/25/93 BE92 BE92 Variant Equal X
128 A/MA/252/93 A/GU/25/93 WU95 BE92 Variant Variant
130 A/JO/33/94 A/WU/359/95 BE92 WU95 Variant Variant
131 A/JO/33/94 A/NA/933/95 BE92 WU95 Variant Variant
142 A/WU/359/95 A/NA/933/95 WU95 WU95 Equal Equal
157 A/NA/933/95 A/SY/5/97 WU95 SY97 Variant Variant
158 A/NA/933/95 A/MO/10/99 WU95 SY97 Variant Variant
159 A/NA/933/95 A/PA/2007/99 WU95 SY97 Variant Variant
161 A/SY/5/97 A/MO/10/99 SY97 SY97 Equal Equal
162 A/SY/5/97 A/PA/2007/99 SY97 SY97 Equal Equal
164 A/MO/10/99 A/PA/2007/99 SY97 SY97 Equal Equal
171 A/PA/2007/99 A/FU/411/2002 SY97 FU02 Variant Variant

Figure 1. Viruses recommended for inclusion in the influenza H3N2 virus vaccines, 1968-2000 [25]. The influenza HA protein is the major influenza vaccine component.

Figure 2. The 3D structure of the HA monomer (Protein Data Bank ID 1HGF:A): (A) the front view and (B) the back view. The five epitope sites are shown as space-filling models in different colors.

Figure 3. The hemagglutination inhibition test table.

The hemagglutination inhibition test results are presented in table form. The first column lists the antigens and the first row lists the antisera. The HI test always needs reference tests when it is applied to new isolates. The reference antigens are always in the top rows of the table. The HI titer values represent serial dilution folds. The antigenic distance is calculated from the HI titer values.

[Flowchart. Input: collect HI titer tables and HA sequences. Left part (training set): calculating information gain and evaluating IG for representing relations between genetic and antigenic evolution. Right part (test sets): practical applications.]

Figure 4. The flowchart of this research.

The research flowchart can be divided into two parts. In the left part, we calculate the information gain of the 329 HA positions from a representative training dataset and evaluate how suitable information gain is for representing the relations between genetic and antigenic evolution. In the right part, we apply the important positions selected by information gain to predict antigenic variants on two unseen and meaningful application sets (test sets).

In the first part, we first select representative influenza strains from the WHO influenza vaccine strains. We then extract the genetic data from the sequences and the antigenic data from the HI titer values. The performance in predicting antigenic variants is compared with related works. In the second part, the model is applied to practical applications and the results are analyzed.

Figure 5. The antigenic map of influenza A(H3N2) virus from 1968 to 2003.

The clustering is based on K-means applied to the antigenic distances transformed from HI titers. There are 11 clusters, shown in different colors, and different clusters have different antigenic properties. Every cluster is named after the first vaccine strain in the cluster; the two letters refer to the location of isolation [11].

[Panel label: Inter (between clusters): 3810 pairs, Variant]

Figure 6. This figure shows how the antigenic type in the application set is defined. (B) (C) There are 11 clusters, obtained by the K-means method using the antigenic distances from HI titer values. (A) The equal type is defined within a cluster, and the variant type is defined when a cluster transition occurs [11].


Figure 7. The flowchart for processing query sequences from the ISD. The sequences deposited in the ISD are in nucleotide format, and the HA sequence is not always long enough ( > 987 nucleotides). If a sequence contains “?” nucleotides or cannot be translated into a protein sequence, it is removed.
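The filtering step described for Figure 7 might look like the sketch below. It is an assumption-laden illustration rather than the pipeline actually used: the function names are hypothetical, "cannot be translated" is interpreted here as "contains a codon outside the standard genetic code", and reading frame handling and stop codons are ignored.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(nucleotides):
    """Translate codon by codon; return None if any codon is not in the table."""
    seq = nucleotides.upper()
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3])
        if amino_acid is None:
            return None
        protein.append(amino_acid)
    return "".join(protein)

def keep_sequence(nucleotides, min_length=987):
    """Keep a sequence only if it is longer than min_length, has no '?', and translates."""
    if "?" in nucleotides or len(nucleotides) <= min_length:
        return False
    return translate(nucleotides) is not None

print(translate("ATGAAATTT"))                    # MKF
print(keep_sequence("ATG?AATTT", min_length=6))  # False
```

The default of 987 nucleotides corresponds to the 329 HA positions used throughout (329 × 3 = 987).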

Sequence and Hamming distance (positions 150 to 170):
>A/Panama/2007/99 RLNWLHQLKYKYPALNVTMPN
>A/Fujian/411/2002 RLNWLTHLKYKYPALNVTMPN
Binary string 000001100000000000000
Hamming Distance: Total 2 changes

Position change and contact map coding (axis marks 150, 170, 190, 200):
>A/Panama/2007/99 RLNWLHQLKYKYPALNVTMPNNEKF DQISLYAQAS
>A/Fujian/411/2002 RLNWLTHLKYKYPALNVTMPNNEKF DQISLYAQAS
Binary string (position change) 0000011000000000000000000 0000000000
Binary string (contact map) 0000011111100000000000000 0001101000
Position change 150:0 151:0 152:0 153:0
Contact map (structure neighbors) 150:0 151:0 152:0 153:0 154:0 155:1 156:1 157:1 158:1 159:1 160:1 193:1 194:1 196:1

Figure 8. This figure shows how the position-specific change coding and the contact map coding work.

(A) If a position changes, the binary string denotes it as “1”. Each position change is treated as an individual feature. (B) In the contact map coding, if any single position in a region changes, all members of that region are considered changed.
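The two codings described for Figure 8 can be sketched as follows, using the sequence fragment shown in the figure (positions 150 to 170). The function names are hypothetical, and the region used for the contact map coding is only the set of positions marked "1" in the figure, taken here as a single structure-neighbor region for illustration.

```python
def position_changes(seq1, seq2, start):
    """Position-specific coding: 1 where the aligned residues differ, else 0."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return {start + i: int(a != b) for i, (a, b) in enumerate(zip(seq1, seq2))}

def contact_map_coding(changes, regions):
    """If any position of a region changed, mark every member of that region as changed."""
    coded = dict(changes)
    for region in regions:
        if any(changes.get(p, 0) for p in region):
            for p in region:
                coded[p] = 1
    return coded

panama = "RLNWLHQLKYKYPALNVTMPN"  # A/Panama/2007/99, positions 150 to 170
fujian = "RLNWLTHLKYKYPALNVTMPN"  # A/Fujian/411/2002, positions 150 to 170
changes = position_changes(panama, fujian, start=150)
print("".join(str(changes[p]) for p in sorted(changes)))  # 000001100000000000000

# Assumed region, read off the positions marked "1" in the contact map of the figure.
region = {155, 156, 157, 158, 159, 160, 193, 194, 196}
coded = contact_map_coding(changes, [region])
print(sorted(p for p, flag in coded.items() if flag))
# [155, 156, 157, 158, 159, 160, 193, 194, 196]
```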


Figure 9. This figure demonstrates how the antigenic distance is calculated. Two kinds of transformation equations are currently used in related works [12, 23]; we adopt equation one in this work.
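The equation itself did not survive in this copy of the figure. As an illustration only, the sketch below uses one transformation that is widely used for HI data, the Archetti-Horsfall ratio d_ij = sqrt((H_ii × H_jj) / (H_ij × H_ji)); whether this is the "equation one" adopted here is an assumption, although the distance values listed in the tables (1.41, 2, 2.83, 4, 5.66, ...) are powers of √2, which is what this formula yields for two-fold dilution titers. The function name and the titer values are hypothetical.

```python
from math import sqrt

def antigenic_distance(h_ii, h_jj, h_ij, h_ji):
    """Archetti-Horsfall style distance from homologous and heterologous HI titers.

    h_ii: titer of antigen i against antiserum i (homologous)
    h_jj: titer of antigen j against antiserum j (homologous)
    h_ij, h_ji: the two cross (heterologous) titers
    """
    if min(h_ii, h_jj, h_ij, h_ji) <= 0:
        raise ValueError("titers must be positive")
    return sqrt((h_ii * h_jj) / (h_ij * h_ji))

# Hypothetical titers: both cross titers are four-fold lower than the homologous ones.
print(antigenic_distance(640, 640, 160, 160))  # 4.0
```

Under this formula a pair would sit exactly at the ≧ 4 cutoff used to define an antigenic variant.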

Table A (source data)
Virus1 Virus2 Antigenic Type Genetic Sequence (Different Positions on HA)
A/Hong_Kong/23/92 A/Madrid/252/93 Variant G135K N145K R208K T214I
A/Hong_Kong/23/92 A/Hong_Kong/1/94 Equal S47P D124N G135K T214I N216D S219Y Q226L
A/Beijing/353/89 A/Beijing/32/92 Variant S133D E135G K145N E156K I186S E190D N193S I214T L226Q T262N
A/Mississippi/1/85 A/Victoria/7/87 Variant F94Y G124D Y155H K156E S159Y K189R Q226L
A/Philippines/2/82 A/Mississippi/1/85 Equal N2K T138A N144V E156K V163A I182V L226Q
A/Mayo_Clinic/1/75 A/England/864/75 Variant K50R E82K N137Y I145N Q189K K193N M260I Y308N
A/Port_Chalmers/1/73 A/Mayo_Clinic/1/75 Variant F3L S9N N31D T83K T126N S145I G158E A160T N193K T198A I217V I278S
A/England/42/72 A/Mayo_Clinic/1/75 Variant S9N N31D D63N T83K T126N S145I G158E N188D S193K G208R I217V I278S

Figure 10. This diagram shows how to find immunodominant positions by calculating information gain. Table A is the source data. We want to find which position changes affect the antigenic type. Table B gives some simple statistics. For example, position 145 changes 62 times in total in the full 181 cases, and 61 of those changes occur in the variant type. Table C compares the entropy and information gain of selected positions. Note in particular that position 135 has high entropy but zero information gain.

a. The total number of mutations in the 181 training cases.

b. The total number of mutations in the variant type.

d. Whether the residue is in the receptor binding sites.

e. Whether the residue is under positive selection [8].

f. The rank of codon diversity of each residue [7].

g. Whether this residue leads to cluster transition [11].
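The information gain calculation outlined for Figure 10 can be sketched as below, treating each position as a binary feature (changed or not) and the antigenic type as the class. This is a minimal sketch under assumptions: the function names are hypothetical, the logarithm base is taken to be 2, and the toy data are invented rather than taken from the 181 training cases.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2, an assumption) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(changed, labels):
    """Entropy of the antigenic type minus its entropy after splitting on 'changed'."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [label for flag, label in zip(changed, labels) if flag == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

labels = ["Variant", "Variant", "Equal", "Equal"]
# A position that changes only in variant pairs separates the two types perfectly.
print(information_gain([1, 1, 0, 0], labels))  # 1.0
# A position that changes equally often in both types carries no information.
print(information_gain([1, 0, 1, 0], labels))  # 0.0
```

The second case mirrors the remark about position 135: a position can change often and still have zero information gain when its changes are spread evenly over both antigenic types.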

[Plot: entropy (vertical axis, 0 to 1.4) against HA positions (horizontal axis, 1 to 321). Labeled positions: 278, 189, 145, 226, 124, 135.]

Figure 11. This figure shows the entropy of all 329 positions. The entropy is an index to evaluate the genetic evolution.

[Plot: information gain (vertical axis, 0 to 0.25) against HA positions (horizontal axis, 1 to 321). Labeled positions: 126, 158, 145, 189, 278.]

Figure 12. This figure shows the information gain of all 329 positions. Information gain may be a good index of the degree of association between genetic and antigenic evolution. Note in particular that position 145 has been verified by experiment to lead to cluster transition [11].

[Structure views, panels (A) and (B), with the labeled positions 145, 189, 278, 158, and 126.]

Figure 13.

This figure shows the information gain of the 329 positions on the HA protein structure as color. The redder the color, the higher the information gain; the top five information gain positions are labeled. (A) is the front view of the HA monomer. (B) is the top view of the HA trimer. Comparing the red regions in the front view and the top view shows that the top view exposes more positions with high information gain.

Figure 14. Evaluating the importance of each position by entropy and information gain. The upper three positions (145, 278, 189), which have both high entropy and high information gain, are potential immunodominant positions. The lower three positions (135, 124, 226), which have only high entropy, are considered important sites at the genetic level but show low information gain. Note in particular that position 145 has been verified by experiment to lead to cluster transition [11].

[Scatter plot: sequence mutation (Hamming distance, 0 to 40) against information gain (sum of IG, 0.0 to 2.5). Series: epitope, all positions, fit of epitope, fit of all positions. R = 0.92 (epitope), R = 0.90 (all positions).]

Figure 15. The relation between information gain and genetic evolution.

For each of the 181 cases, we compare the genetic changes and the information gain for both all positions (329 positions) and the epitope sites (131 positions). The linear regression R shows a good relation (R > 0.9) between genetic change and information gain, and the epitope sites fit the genetic change better.

However, for the same value of information gain, the number of sequence changes can vary widely. For example, when the information gain is near 0.5, the number of position changes ranges from 7 to 19. The result shows that information gain weights each position change differently rather than equally.

[Scatter plot: log (antigenic distance) (-2 to 8) against information gain (sum of IG, 0.0 to 2.5, with 0.1835 marked). Series: epitope, all positions, fit of epitope, fit of all positions. R = 0.75 (epitope), R = 0.74 (all positions). Agreement is 87%.]

Figure 16.

The relation between information gain and antigenic distance.

For each of the 181 cases, we compare the antigenic distance and the information gain for both all positions (329 positions) and the epitope sites (131 positions). The result shows that the sum of information gain fits a linear relation to the antigenic distance (R > 0.74). The result also shows that the epitope sites fit the antigenic distance better than all positions.

Antigenic variants are defined by antigenic distance ≧ 4. In this figure, the best performance for predicting antigenic variants is obtained with the threshold sum of information gain > 0.1835. The agreement is 87%.

Figure 17.

For each of the 181 cases, the information gain and the number of sequence mutations at the epitope sites are plotted in the figure. When the sum of information gain is > 0.1835, the case is predicted as an antigenic variant, and the agreement is 87% (158/181). When the number of sequence mutations is ≧ 7, the case is predicted as an antigenic variant, and the agreement is 83% (150/181).

The cases predicted differently are illustrated in Figure 18.

[Scatter plot: sequence mutations (Hamming distance, 0 to 35) against information gain (sum of IG, 0.0 to 2.0, with 0.1835 marked). Series: antigenic equal, antigenic variant. Agreement = 83% (HD); agreement = 87% (IG).]
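The two threshold rules compared in this figure can be written down directly. The sketch below is illustrative: the function names are hypothetical, and the four example pairs are a small subset of Table 6 (antigenic type, epitope Hamming distance, epitope sum of information gain), which by construction contains only cases where the two rules disagree.

```python
IG_THRESHOLD = 0.1835  # sum of information gain above this value predicts a variant
HD_THRESHOLD = 7       # this many sequence mutations or more predicts a variant

def predict_by_ig(sum_ig):
    return "Variant" if sum_ig > IG_THRESHOLD else "Equal"

def predict_by_hd(hamming_distance):
    return "Variant" if hamming_distance >= HD_THRESHOLD else "Equal"

def agreement(predicted, actual):
    """Fraction of cases in which the prediction matches the true antigenic type."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# (true type, HD, sum of IG) for four pairs taken from Table 6.
cases = [
    ("Equal", 8, 0.166),     # A/Leningrad/360/86 vs A/Shanghai/11/87
    ("Equal", 8, 0.0578),    # A/Hong_Kong/1/94 vs A/Guangdong/25/93
    ("Variant", 4, 0.2112),  # A/Allegheny_County/29/76 vs A/Victoria/112/76
    ("Variant", 4, 0.2569),  # A/Hong_Kong/23/92 vs A/Madrid/252/93
]
actual = [c[0] for c in cases]
print(agreement([predict_by_ig(c[2]) for c in cases], actual))  # 1.0
print(agreement([predict_by_hd(c[1]) for c in cases], actual))  # 0.0
```

On the full training set the corresponding agreements reported above are 158/181 and 150/181.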

[Scatter plot: sequence mutations (Hamming distance, 0 to 10) against information gain (sum of IG, axis marks 0.0 and 0.1835). Series: antigenic equal, antigenic variant. Circled groups: A1, A2, B1, B2, B3.]

Figure 18.

A detailed view of Figure 17. Cases predicted successfully by information gain but falsely by Hamming distance are marked with big circles. The cases in circle A have few sequence mutations but are antigenic variant pairs. The cases in circle B have many sequence mutations but are still antigenic equal pairs. The result shows that when fewer than 11 positions are mutated, which positions actually change matters more than the total number of mutations. The 14 successfully predicted cases are listed in Table 6.


We start with 181 cases and find the position with the highest information gain.

Position 145 has the highest information gain in level one, so it is the first selected position. In level one, 62 cases have a position change at 145, so those 62 cases are considered explained by position 145 and are removed from the original set. In each level we further consider the positions with information gain > average + 2*standard deviation.

In level one, 4 other positions satisfy this condition. Positions 126 and 278 are considered to co-evolve with position 145 because, when the 62 cases are removed from the dataset, their information gain drops significantly in level two. Positions 189 and 158 are considered independent important positions, because their information gain does not drop when the 62 cases are removed from the dataset. When the 62 cases are removed from the dataset, we use

[Plot: information gain of the selected positions across levels 1 to 6.]

This figure plots the information gain, across 6 levels, of the 21 positions selected by information gain. In each level, the selected position is the one with the highest information gain. Positions whose information gain behaves similarly are considered co-evolution groups and are shown in the same color. For example, the first group, which includes positions 145, 278, and 126, is shown in green.

Note in particular that once the position with the highest information gain is selected, the cases with a mutation at that position are removed from the dataset, so the information gain of the selected position drops to zero. As a consequence, some positions (e.g., 155) do not have high information gain in level one, but their information gain gradually increases from level one to level four.
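The level-by-level selection described above can be sketched as a simple loop: pick the position with the highest information gain, remove the cases that change at that position, and repeat. This is a simplified illustration, not the procedure as actually run: the function names and toy data are hypothetical, each case is a set of changed positions, and the additional "average + 2*standard deviation" screening and the co-evolution grouping are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(changed, labels):
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [label for flag, label in zip(changed, labels) if flag == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def select_levels(cases, labels, max_levels=6):
    """Return [(position, information gain)] for each level, in selection order."""
    remaining = list(zip(cases, labels))
    selected = []
    for _ in range(max_levels):
        positions = set().union(*(case for case, _ in remaining)) if remaining else set()
        if not positions:
            break
        current = [label for _, label in remaining]
        gains = {
            p: information_gain([int(p in case) for case, _ in remaining], current)
            for p in positions
        }
        best = max(sorted(gains), key=gains.get)  # ties go to the lowest position
        if gains[best] <= 0:
            break
        selected.append((best, gains[best]))
        # Cases changed at the selected position are considered explained and removed.
        remaining = [(case, label) for case, label in remaining if best not in case]
    return selected

# Toy data (hypothetical): changed positions per case and the antigenic type.
toy_cases = [{145, 278}, {145}, {145}, {189}, {189, 135}, {135}, set(), {124}]
toy_labels = ["Variant"] * 5 + ["Equal"] * 3
print([position for position, _ in select_levels(toy_cases, toy_labels)])  # [145, 189]
```

In the toy data, position 278 only changes together with 145, so it disappears once the cases explained by 145 are removed, while position 189 gains weight in the next level.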

145A > 0 : Variant (62/1)

Figure (A) is the decision tree model obtained by using the decision tree tool C4.5 to select the positions with the highest information gain. The nodes of the decision tree are the positions with the highest information gain in each level. The root is at the top of the tree and we should read the tree