Physiochemical Constraints in Influenza A Hemagglutinin

Jiunn-I Shieh

Department of Information Science and Applications, Asia University, Taichung, Taiwan, R.O.C.

[email protected]

Kuei-Jen Lee

Department of Health and Nutrition Biotechnology, Asia University, Taichung, Taiwan, R.O.C.

[email protected]

Jing-Doo Wang

Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan, R.O.C.

[email protected]

I-Chun Chen, Am-Chou Chen, Pei-Chun Chang, and Hsiang-Chuan Liu

Department of Bioinformatics, Asia University, Taichung, Taiwan, R.O.C.

[email protected], [email protected], {pcchang,lhc}@asia.edu.tw

Abstract

Influenza A viruses are negative-strand RNA viruses. The hemagglutinin (HA) gene in the viral genome encodes the major molecule that determines the host range. Mutation of the HA gene may allow infection to cross species. In this paper, we study the physicochemical constraints on variation of the HA gene. A fuzzy measure and the Choquet integral were used to estimate the combined effect of different physicochemical properties of each residue in the HA protein in relation to infection events. With this method, an HA sequence was quantified residue by residue to produce a value series. Finally, the Hurst exponent was used to infer the constraints in the series. We found that the physicochemical constraints in HA sequences mainly fall into two classes of interdependence strength during gene variation, and that these classes are distinct from the diversity of clusters in the phylogenetic analysis.

1. Introduction

Influenza A viruses are negative-strand RNA viruses that infect a wide variety of animals in nature. Infection of humans can cause significant mortality and morbidity worldwide [1]. The hemagglutinin (HA) gene in the viral genome encodes the major molecule that determines the host range. Influenza viruses in the natural reservoir, such as avian flu, may give rise to strains infectious to humans through mutation of the HA gene [2,3]. It is therefore important to understand the nature of variation in the HA gene. Past research in this field has mainly focused on phylogenetic reconstruction [4,5]. As the rapidly growing body of HA sequence data shows, reconstructing a phylogenetic tree can provide abundant evolutionary information and help in understanding the drift of influenza hosts [6]. However, the features and tendencies of the physicochemical properties of gene variations for a specific host have not been discussed.

Fuzzy measure theory considers a number of special classes of measures, each of which is characterized by a special property. In fuzzy measure theory the conditions are precise, but the information about an element alone is insufficient to determine which special class of measure should be used. The fuzzy measure estimates the possible interactions among the special classes of measures [7]. The Choquet integral is closely related to the fuzzy measure. It assesses the integrated effect on some issue based on the concept of a fuzzy measure [7,8]. The Hurst exponent (H) is a statistical measure used to classify time series [9]. For example, H = 0.5 indicates a random series, while H > 0.5 indicates a constrained, reinforcing series. The larger the H value, the stronger the constraint.

In this paper, we studied the physicochemical constraints of the HA protein of Influenza A viruses for serotypes H1, H3, and H5. We considered three physicochemical properties of each residue: acidity, van der Waals volume, and hydrophobicity [10]. Pearson's correlation coefficient was used to quantify the dependence of the physicochemical properties on the infection host, human or avian. For each residue, there were three values of Pearson's correlation coefficient, corresponding to the three physicochemical properties. Based on these coefficients, the Sugeno λ-measure [11] was adopted to calculate the fuzzy measure. Subsequently, the Choquet integral was applied to assess the integrated effect of the physicochemical properties on the infection host for each residue. A protein sequence thus yields a series of integral values. Finally, we used the Hurst exponent to analyze the value series in order to explore the integrated physicochemical constraints in the protein sequence.

2. Methods

2.1. Sequence data collection

The sequence data of Influenza A viruses used in this research were obtained from a public database, the Influenza Sequence Database (http://www.flu.lanl.gov). All HA nucleotide sequences from human and avian hosts in this database were downloaded on October 16, 2006. Sequences shorter than 900 nucleotides were considered partial and were excluded from this study. Identically coded sequences were considered duplicates, and only the earliest isolated strain among the duplicates was used as the representative sequence of the group. In total, we had 831 H1 sequences, 3018 H3 sequences, and 1376 H5 sequences for our analysis. All sequences were isolated between 1963 and 2006 from locations around the globe. The isolation time (calendar year), host type, and location can be found in the strain names.
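The filtering rules above (drop partial sequences, keep the earliest isolate among identical sequences) can be illustrated with a minimal Python sketch. The function names, the record layout of (strain name, nucleotide sequence) pairs, and the assumption that the isolation year appears as a four-digit number in the strain name are all hypothetical and are not taken from the original pipeline.

```python
import re

MIN_LENGTH = 900  # nucleotides; shorter sequences are treated as partial


def isolation_year(strain_name):
    """Return the last four-digit year found in a strain name, or None.

    Assumption: the year is written with four digits (19xx or 20xx).
    """
    years = re.findall(r"(?<!\d)(?:19|20)\d{2}(?!\d)", strain_name)
    return int(years[-1]) if years else None


def select_representatives(records):
    """Keep one strain per distinct sequence: the earliest isolate.

    records: iterable of (strain_name, nucleotide_sequence) pairs.
    Returns the sorted names of the representative strains.
    """
    earliest = {}  # sequence -> (year, strain_name)
    for name, seq in records:
        seq = seq.upper()
        if len(seq) < MIN_LENGTH:
            continue  # partial sequence, excluded
        year = isolation_year(name)
        # Strains without a recognizable year never displace a dated one.
        key_year = year if year is not None else float("inf")
        if seq not in earliest or key_year < earliest[seq][0]:
            earliest[seq] = (key_year, name)
    return sorted(name for _, name in earliest.values())
```

Strain names that use two-digit years would need a different parsing rule; the sketch leaves such names undated rather than guessing.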

2.2. Residue coding

Sequence alignment was performed with ClustalX 3.14 [12] separately for H1, H3, and H5. After alignment, the sequence lengths for H1, H3, and H5 were 565, 567, and 583 amino acids, respectively. Each residue was coded according to the values of acidity, van der Waals volume, and hydrophobicity of the corresponding single amino acid [10, 13], as shown in Table 1.

For every physicochemical property, we had a matrix of size 831 × 565 for the H1 group, 3018 × 567 for the H3 group, and 1376 × 583 for the H5 group.

Table 1. The residue codes for acidity, van der Waals volume, and hydrophobicity.^a

| Amino acid | Acidity | Van der Waals volume | Hydrophobicity |
|---|---|---|---|
| Alanine | 7.0 | 67 | 0.616 |
| Cysteine | 8.4 | 86 | 0.68 |
| Aspartic acid | 3.9 | 67 | 0.028 |
| Glutamic acid | 4.1 | 109 | 0.043 |
| Phenylalanine | 7.0 | 135 | 1.0 |
| Glycine | 7.0 | 48 | 0.501 |
| Histidine | 6.0 | 118 | 0.165 |
| Isoleucine | 7.0 | 124 | 0.943 |
| Lysine | 10.5 | 135 | 0.283 |
| Leucine | 7.0 | 124 | 0.943 |
| Methionine | 7.0 | 124 | 0.738 |
| Asparagine | 7.0 | 148 | 0.236 |
| Proline | 7.0 | 90 | 0.711 |
| Glutamine | 7.0 | 114 | 0.251 |
| Arginine | 12.5 | 167 | 0.0 |
| Serine | 7.0 | 73 | 0.359 |
| Threonine | 7.0 | 93 | 0.45 |
| Valine | 7.0 | 105 | 0.825 |
| Tryptophan | 10.5 | 163 | 0.878 |
| Tyrosine | 7.0 | 141 | 0.88 |

^a Gaps in the aligned sequences were coded as 7.0, 0, and 0.5 for acidity, van der Waals volume, and hydrophobicity, respectively.
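As an illustration of the residue coding, the following Python sketch maps an aligned protein sequence to three numeric series using the values of Table 1. The one-letter amino acid codes, the name `RESIDUE_CODES`, and the function `encode_sequence` are assumptions introduced here; the original work was carried out in Matlab.

```python
# (acidity, van der Waals volume, hydrophobicity), values copied from Table 1.
RESIDUE_CODES = {
    "A": (7.0, 67.0, 0.616),   # Alanine
    "C": (8.4, 86.0, 0.68),    # Cysteine
    "D": (3.9, 67.0, 0.028),   # Aspartic acid
    "E": (4.1, 109.0, 0.043),  # Glutamic acid
    "F": (7.0, 135.0, 1.0),    # Phenylalanine
    "G": (7.0, 48.0, 0.501),   # Glycine
    "H": (6.0, 118.0, 0.165),  # Histidine
    "I": (7.0, 124.0, 0.943),  # Isoleucine
    "K": (10.5, 135.0, 0.283), # Lysine
    "L": (7.0, 124.0, 0.943),  # Leucine
    "M": (7.0, 124.0, 0.738),  # Methionine
    "N": (7.0, 148.0, 0.236),  # Asparagine
    "P": (7.0, 90.0, 0.711),   # Proline
    "Q": (7.0, 114.0, 0.251),  # Glutamine
    "R": (12.5, 167.0, 0.0),   # Arginine
    "S": (7.0, 73.0, 0.359),   # Serine
    "T": (7.0, 93.0, 0.45),    # Threonine
    "V": (7.0, 105.0, 0.825),  # Valine
    "W": (10.5, 163.0, 0.878), # Tryptophan
    "Y": (7.0, 141.0, 0.88),   # Tyrosine
    "-": (7.0, 0.0, 0.5),      # alignment gap (table footnote)
}


def encode_sequence(aligned):
    """Return three parallel lists: acidity, volume, hydrophobicity.

    Raises KeyError for symbols not listed in Table 1 (for example X or B).
    """
    acidity, volume, hydrophobicity = [], [], []
    for residue in aligned.upper():
        a, v, h = RESIDUE_CODES[residue]
        acidity.append(a)
        volume.append(v)
        hydrophobicity.append(h)
    return acidity, volume, hydrophobicity
```

Stacking the encoded rows of all aligned strains of one serotype gives, for each property, a matrix with one row per strain and one column per aligned position.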

2.3. Inference of physicochemical constraints

The Choquet integral is defined to integrate functions with respect to a fuzzy measure [7]. It is very useful for assessing effects that result from nonlinear interactions. The definitions of the fuzzy measure and the Choquet integral are as follows:

Definition 1.

Let $N$ be a finite set of criteria. A discrete fuzzy measure on $N$ is a set function $v: 2^N \to [0, 1]$ which satisfies the following axioms:

(i) $v(\emptyset) = 0$ and $v(N) = 1$ (boundary conditions)

(ii) $A \subseteq B$ implies $v(A) \le v(B)$ (monotonicity)

for $A, B \in 2^N$.

For each subset of criteria $S \subseteq N$, $v(S)$ can be interpreted as the weight of the coalition $S$.
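The two axioms of Definition 1 (boundary conditions and monotonicity) can be checked mechanically for a small set of criteria. The sketch below is only illustrative: it represents a set function as a Python dictionary keyed by `frozenset`, and the names `all_subsets` and `is_fuzzy_measure` are hypothetical.

```python
from itertools import combinations


def all_subsets(criteria):
    """Enumerate every subset of the criteria as a frozenset."""
    items = sorted(criteria)
    return [
        frozenset(group)
        for size in range(len(items) + 1)
        for group in combinations(items, size)
    ]


def is_fuzzy_measure(v, criteria, tol=1e-12):
    """Check boundary conditions and monotonicity of a set function v.

    v: dict mapping each frozenset subset of `criteria` to a number.
    """
    subsets = all_subsets(criteria)
    # Boundary conditions: v(empty set) = 0 and v(N) = 1.
    if abs(v[frozenset()]) > tol or abs(v[frozenset(criteria)] - 1.0) > tol:
        return False
    # Monotonicity: A subset of B implies v(A) <= v(B).
    return all(
        v[a] <= v[b] + tol for a in subsets for b in subsets if a <= b
    )
```

The check enumerates all pairs of subsets, which is affordable here because only three criteria (the three physicochemical properties) are involved.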

The Sugeno λ-measure is a special case of fuzzy measures. It has the following definition.

Definition 2.

Let $N = \{X_1, X_2, \dots, X_n\}$ be a finite set and $\lambda \in (-1, \infty)$. A Sugeno λ-measure is a function $\nu$ from $2^N$ to $[0, 1]$ with the properties:

(i) $\nu(N) = 1$.

(ii) if $A, B \in 2^N$ with $A \cap B = \emptyset$, then

$$\nu(A \cup B) = \nu(A) + \nu(B) + \lambda\,\nu(A)\,\nu(B). \qquad (1)$$

As a convention, the value of $\nu$ at a singleton set $\{X_i\}$ is called a density and is denoted by $\nu_i = \nu(\{X_i\})$.

Tahani and Keller [14] as well as Wang and Klir [15] have shown that, once the densities are known, the value of λ is obtained uniquely from the polynomial equation

$$\lambda + 1 = \prod_{i=1}^{n} \bigl(1 + \lambda\,\nu(\{X_i\})\bigr), \qquad \lambda > -1,\ \lambda \neq 0.$$
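To make the role of the densities concrete, the following Python sketch solves the standard normalization condition $\lambda + 1 = \prod_i (1 + \lambda g_i)$ for λ by bisection, where $g_i$ denotes the density of the i-th element, and then evaluates the λ-measure of a subset in closed form. It is a sketch under stated assumptions rather than the authors' Matlab implementation: the function names are hypothetical, the densities are assumed to lie strictly between 0 and 1, and sums of densities within 1e-6 of 1 are treated as the additive case λ = 0.

```python
from math import prod


def sugeno_lambda(densities):
    """Solve lambda + 1 = prod(1 + lambda * g_i) for the nonzero root > -1."""
    if len(densities) < 2 or not all(0.0 < g < 1.0 for g in densities):
        raise ValueError("need at least two densities, each in (0, 1)")
    total = sum(densities)
    if abs(total - 1.0) < 1e-6:
        return 0.0  # densities are already additive

    def f(lam):
        return prod(1.0 + lam * g for g in densities) - (1.0 + lam)

    if total > 1.0:
        lo, hi = -1.0, -0.5  # root lies in (-1, 0); f(-1) > 0
        while f(hi) >= 0.0:
            hi /= 2.0        # move toward 0 until f becomes negative
    else:
        lo, hi = 1.0, 1.0    # root lies in (0, infinity)
        while f(lo) >= 0.0:
            lo /= 2.0
        while f(hi) <= 0.0:
            hi *= 2.0
    for _ in range(200):     # plain bisection on the sign change
        mid = (lo + hi) / 2.0
        if (f(mid) > 0.0) == (f(hi) > 0.0):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0


def sugeno_measure(subset_densities, lam):
    """Measure of a subset, given the densities of its elements."""
    if lam == 0.0:
        return sum(subset_densities)
    return (prod(1.0 + lam * g for g in subset_densities) - 1.0) / lam
```

With three densities, `sugeno_measure` applied to any two of them gives the measure of the corresponding pair, and applied to all three it returns 1 up to rounding, as required by the normalization.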

Definition 3.

Let $v$ be a fuzzy measure on $N$, with $n = |N|$. The discrete Choquet integral of a function $x: N \to \mathbb{R}$ with respect to $v$ is defined by

$$C_v(x) = \sum_{i=1}^{n} \bigl(x_{(i)} - x_{(i-1)}\bigr)\, v\bigl(A_{(i)}\bigr), \qquad (2)$$

where $(\cdot)$ indicates a permutation on $N$ such that $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, $A_{(i)} = \{(i), \dots, (n)\}$, and $x_{(0)} = 0$.

The discrete Choquet integral takes into account the interaction among criteria by means of the fuzzy measure $v$. If the criteria are independent, the fuzzy measure is additive. Then the discrete Choquet integral coincides with the weighted arithmetic mean. That is,

$$C_v(x) = \sum_{i=1}^{n} v(\{i\})\, x_i.$$

The correlation-based method proposed by Hsiang-Chuan Liu in 2006 [16,17] was used to construct the fuzzy measures in the discrete Choquet integral.
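The discrete Choquet integral can be sketched in a few lines of Python using the standard sorted-increment form: sort the criteria by value, and weight each increment by the measure of the set of criteria whose values are at least that large. The dictionary-based representation and the names `choquet_integral` and `additive_measure` are assumptions made for this illustration only.

```python
from itertools import combinations


def additive_measure(weights):
    """Build an additive measure from per-criterion weights that sum to 1."""
    names = sorted(weights)
    return {
        frozenset(group): sum(weights[name] for name in group)
        for size in range(len(names) + 1)
        for group in combinations(names, size)
    }


def choquet_integral(values, measure):
    """Discrete Choquet integral of `values` with respect to `measure`.

    values:  dict mapping each criterion to a real score.
    measure: dict mapping each frozenset of criteria to its weight.
    """
    order = sorted(values, key=values.get)  # criteria by ascending score
    total, previous = 0.0, 0.0
    for index, criterion in enumerate(order):
        upper_set = frozenset(order[index:])  # criteria scoring at least this
        total += (values[criterion] - previous) * measure[upper_set]
        previous = values[criterion]
    return total
```

For an additive measure the result equals the weighted arithmetic mean of the scores, which gives a simple sanity check; a measure that is 1 only on the full set returns the minimum score, and a measure that is 1 on every non-empty set returns the maximum.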

The Hurst exponent occurs in several areas of applied mathematics, including fractals and chaos theory, long-memory processes, and spectral analysis. Hurst exponent estimation has been applied in areas ranging from biophysics to computer networking. Estimation of the Hurst exponent was originally developed in hydrology; the modern techniques for estimating it, however, come from fractal mathematics.

Estimating the Hurst exponent for a data set provides a measure of whether the data are a pure random walk or have underlying trends. Put another way, a random process with an underlying trend has some degree of autocorrelation. When the autocorrelation has a very long (or mathematically infinite) decay, this kind of Gaussian process is sometimes referred to as a long-memory process.

The Hurst exponent (H) is a statistical measure used to classify time series. H = 0.5 indicates a random series, while H > 0.5 indicates a trend-reinforcing series. The larger the H value, the stronger the trend. In this paper we use the Hurst exponent to classify the value series derived from HA protein sequences. Experiments with backpropagation neural networks have shown that series with a large Hurst exponent can be predicted more accurately than those with an H value close to 0.5. Thus the Hurst exponent provides a measure of predictability.

Three methods are most often used to estimate the Hurst exponent: the R/S method, the roughness-length (R-L) method, and the variogram. The R/S method [18] is commonly regarded as the most suitable for time series analysis in settings such as the stock market or the sizing of water reservoirs, because it relates the rescaled range of a signal and its local statistical properties to the scale factor. The R/S method is used in this study. It is based on empirical observations by Hurst [19], and the estimate of H is based on the R/S statistic. It indicates (asymptotically) second-order self-similarity. H is roughly estimated as the slope of the straight line in a log-log plot of the R/S statistic against the number of points of the aggregated series. That is, given a time sequence of observations $w_t$, $t = 1, \dots, \tau$, with mean $\bar{w}_\tau = \frac{1}{\tau}\sum_{t=1}^{\tau} w_t$, define the series

$$W(t, \tau) = \sum_{u=1}^{t} \bigl(w_u - \bar{w}_\tau\bigr), \qquad t = 1, \dots, \tau,$$

and

$$R(\tau) = \max_{1 \le t \le \tau} W(t, \tau) - \min_{1 \le t \le \tau} W(t, \tau), \qquad S(\tau) = \sqrt{\frac{1}{\tau}\sum_{t=1}^{\tau} \bigl(w_t - \bar{w}_\tau\bigr)^2}.$$

The ratio $R(\tau)/S(\tau)$ scales as $\tau^H$, so in the plot of $\log(R/S)$ against $\log \tau$ the slope determines the Hurst exponent.
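A minimal rescaled-range estimator can be sketched in Python as follows. This is not the Matlab Central routine used in the study: the choice of non-overlapping windows whose length doubles from `min_window`, the averaging of R/S over windows of equal length, and the ordinary least-squares fit are assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random
from statistics import fmean, pstdev


def rescaled_range(chunk):
    """R/S statistic of one window, or None if the window is constant."""
    mean = fmean(chunk)
    spread = pstdev(chunk)  # population standard deviation S
    if spread == 0:
        return None
    cumulative, running = [], 0.0
    for value in chunk:
        running += value - mean  # cumulative deviation from the mean
        cumulative.append(running)
    return (max(cumulative) - min(cumulative)) / spread


def hurst_rs(series, min_window=8):
    """Estimate H as the slope of log(R/S) against log(window length)."""
    n = len(series)
    points = []
    window = min_window
    while window <= n // 2:
        ratios = []
        for start in range(0, n - window + 1, window):
            rs = rescaled_range(series[start:start + window])
            if rs is not None:
                ratios.append(rs)
        if ratios:
            points.append((math.log(window), math.log(fmean(ratios))))
        window *= 2
    if len(points) < 2:
        raise ValueError("series too short for an R/S fit")
    mean_x = fmean([x for x, _ in points])
    mean_y = fmean([y for _, y in points])
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    return numerator / denominator


# Two synthetic series for comparison: uncorrelated noise and its running sum.
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(2048)]
walk, level = [], 0.0
for step in noise:
    level += step
    walk.append(level)
```

On such synthetic data the estimate for the noise series should lie in the neighborhood of 0.5, while the running sum, which is strongly persistent, should give a value much closer to 1. For short series such as a single HA sequence (under 600 positions) only a handful of window lengths are available, so the estimate is correspondingly rough.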

The Hurst exponent analysis consists of seven steps:

Step 1.

By quantifying the three properties of each amino acid in each protein sequence, we obtain three series for each protein sequence.

Step 2.

For each property, normalize the data at each position of the aligned protein sequences over the strains infecting humans and birds. That is, label the elements of the sample by $l$ and treat each position in the aligned protein sequence as a random variable. Assume the size of the sample is $k$. For element $l$, let the $i$-th position of the aligned protein sequence for property $m$ be a random variable $X_i^{l,m}$, where $1 \le l \le k$, $1 \le m \le 3$, $1 \le i \le n$, and $n$ is the length of the aligned protein sequences. Let $Y^l = 1$ if element $l$ infects humans and $Y^l = 0$ otherwise. Let $X_i^m = (X_i^{1,m}, \dots, X_i^{k,m})'$ and $Y = (Y^1, \dots, Y^k)'$.

Step 3.

For each position $i$ and each property $m$, compute Pearson's correlation coefficient between $X_i^m$ and $Y$.

Step 4.

Then, for each $i$, based on these coefficients, compute $v(X_i^1, X_i^2)$, $v(X_i^1, X_i^3)$, and $v(X_i^2, X_i^3)$ by the Sugeno λ-measure. Note that $v(X_i^1, X_i^2, X_i^3) = 1$.

Step 5.

Combine the three properties into one by computing the Choquet integral at each position using equation (2). This yields one series for each aligned protein sequence.

Step 6.

Calculate the Hurst exponent for each aligned protein sequence.

Step 7.

Analyze the results.

The above steps were carried out in Matlab; the code for the Hurst exponent was obtained from http://www.mathworks.com/matlabcentral/.
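The correlation step of this procedure, in which each property at each aligned position is related to the host label, might be sketched in Python as below. The row-per-strain matrix layout, the 1 = human and 0 = avian labels, the function names, and the convention of returning 0.0 for positions with no variation are assumptions for illustration; how the resulting coefficients are converted into densities follows the correlation-based method of [16,17] and is not reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences.

    Returns 0.0 when either sequence has zero variance (an assumed
    convention, since the coefficient is undefined in that case).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def positionwise_correlation(property_matrix, host_labels):
    """Correlate each aligned position with the host label.

    property_matrix: k rows (strains) by n columns (aligned positions),
                     holding one physicochemical property.
    host_labels:     k values, 1 for human strains and 0 for avian strains.
    Returns a list of n correlation coefficients.
    """
    return [
        pearson(list(column), host_labels)
        for column in zip(*property_matrix)
    ]
```

Running this once per property gives the three coefficients per position from which the densities of the fuzzy measure are then derived.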

2.4. Results

We calculated the Hurst exponent for H1, H3, and H5 to infer the physicochemical interdependency among the residues in the HA protein. The results for serotype H1 are shown in Fig. 1: there are two clusters in the frequency distributions of Hurst exponents for both human strains and avian strains. The Hurst exponent is near 1 for one cluster and near 0.5 for the other. This means that some variations are strongly constrained, while others are random-like. The tendency for H3, shown in Fig. 2, is similar to that for H1, but the Hurst exponents of the two clusters are closer together and lie away from 1 and 0.5. The results for H5 are shown in Fig. 3; for avian strains the distribution pattern differs from that of H1 and H3, with three clusters in the frequency distribution.

The phylogenetic analysis is based on the mutation frequency between residues of homologous proteins. How quantitative properties evolve as residues change remains ambiguous. In this study, we proposed a method, based on the quantitative properties of residues in relation to infection by Influenza A viruses, to estimate the constraint strength in HA proteins. The distribution of constraint strength is distinct from the diversity of clusters in the phylogenetic analysis.

Figure 1. The frequency distribution of H1 Hurst exponents for human strains and avian strains.

Figure 2. The frequency distribution of H3 Hurst exponents for human strains and avian strains.

Figure 3. The frequency distribution of H5 Hurst exponents for human strains and avian strains.

2.5. Discussion

The HA gene in the viral genome encodes the major molecule that determines the host range. Fundamentally, infection is a physicochemical interaction between the host receptor and the HA protein. For infection to succeed, gene variations must follow certain physicochemical rules. A higher Hurst exponent implies more constraints, or more intra-structure, in the sequence properties. Consequently, gene variations are more apt to destroy the intra-structure of sequences with a high Hurst exponent. Within the same HA serotype, the tolerance to variation therefore differs between the clusters of Hurst exponents.

4. Conclusions

The constraints in HA sequences mainly fall into two classes of Hurst strength during gene variation. This implies that the variation tolerance of the HA gene is diverse within the same HA serotype.

Acknowledgements

This work was supported by the National Science Council, grant no. NSC 95-2221-E-468-006-.

References

[1] P. Palese. Influenza: old and new threats. Nat. Med., Vol 10, pp. S82–S87, 2004.

[2] R.G. Webster, W.J. Bean, O.T. Gorman, T.M. Chambers, and Y. Kawaoka. Evolution and ecology of influenza A viruses. Microbiol. Rev., Vol 56, pp. 152–179, 1992.

[3] N.J. Cox and K. Subbarao. Global epidemiology of influenza: Past and present. Annu. Rev. Med., Vol 51, pp. 407–421, 2000.

[4] W.M. Fitch, R.M. Bush, C.A. Bender, and N.J. Cox. Long term trends in the evolution of H(3) HA1 human influenza type A. Proc. Nat. Acad. Sci., Vol 94, pp. 7712–7718, 1997.

[5] R.M. Bush, W.M. Fitch, C.A. Bender, and N.J. Cox. Positive selection on the H3 hemagglutinin gene of human influenza virus A. Mol. Biol. Evol., Vol 16, pp. 1457–1465, 1999.

[6] R.M. Bush, C.A. Bender, K. Subbarao, N.J. Cox, and W.M. Fitch. Predicting the evolution of human influenza A. Science, Vol 286, pp. 1921–1925, 1999.

[7] T. Murofushi and M. Sugeno. An interpretation of fuzzy measure and the Choquet integral as an integral with respect to a fuzzy measure. Fuzzy Sets and Systems, Vol 29, pp. 201–227, 1989.

[8] T. Calvo, A. Kolesarova, M. Komornikova, and R. Mesiar. Aggregation operators: New trends and applications. Physica-Verlag, Springer, 2002.

[9] T. Di Matteo, T. Aste, and M.M. Dacorogna. Long term memories of developed and emerging markets: using the scaling analysis to characterize their stage of development. Journal of Banking & Finance, Vol 29, pp. 827–851, 2005.

[10] D. Whitford. Proteins: structure and function. John Wiley & Sons Ltd., 2005.

[11] T. Murofushi and M. Sugeno. An interpretation of fuzzy measure and the Choquet integral as an integral with respect to a fuzzy measure. Fuzzy Sets and Systems, Vol 29, pp. 201–227, 1989.

[12] J.D. Thompson, T.J. Gibson, F. Plewniak, F. Jeanmougin, and D.G. Higgins. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucl. Acid Res., Vol 25, pp. 4876–4882, 1997.

[13] S.D. Black and D.R. Mould. Development of Hydrophobicity Parameters to Analyze Proteins Which Bear Post- or Cotranslational Modifications. Anal. Biochem., Vol 193, pp. 72–82, 1991.

[14] H. Tahani and J. Keller. Information Fusion in Computer Vision Using the Fuzzy Integral. IEEE Transactions on Systems, Man and Cybernetics, Vol 20, pp. 733–741, 1990.

[15] Z. Wang and G.J. Klir. Fuzzy Measure Theory.
