MATERIALS AND METHODS - 蛋白質結構與動態之關係

2.1 Dataset

The dataset used in the present study consists of 972 protein chains from PDB-REPRDB¹⁷. According to the same criteria used in the research of WCN¹⁴, we chose X-ray crystallographic structures with resolution under 2.0Å and R-factor less than 0.2. All chains have at least 60 residues, and share pair-wise identity less than 25%. Detailed criteria can be found in Table 2.

2.2 Features and models

2.2.1 B-factor

X-ray diffraction data provide the average structure of a protein in a crystal as well as the scale of its atomic fluctuations normally expressed as the isotropic Debye-Waller factors, or B-factors. We retrieved the B-factor from PDB records.

2.2.2 Centroid model

With structural information, CM¹³ managed to reproduce protein dynamics and correlation of the fluctuations in proteins. It simply calculates the square of the atomic distance from the center of the mass of the protein. Let X0 be the center of mass of the protein, i.e.

∑

= km Xk k kmk

X₀ , where mk and Xk are the mass and the crystallographic position of atom k, respectively. The distance between atom i and the center of mass can be obtained by:

) )(

( ₀ ₀

2 = X_i −X X_i−X

ri (1) This value is found to be linearly related to the average atomic fluctuation in proteins.

Unlike Molecular dynamics (MD)5; 6; 7; 8; 9 or Elastic network model (ENM) which based on mechanical model, either sophisticated or simplified, CM depends only on the information of protein geometrical shape. CM involves no computation of proteins’ trajectories or matrix operation and thus is simple and efficient.

2.2.3 Weighted contact number (WCN) model

A recent study¹⁹ showed that the atomic mean-square displacement (or B-factor) is closely related to the number of neighboring atoms. This method was referred to as the protein contact number (CN) model. The CN model was able to predict B-factor profiles directly from protein structures without either trajectory integration or matrix diagonalization.

However, the CN model can be further improved by introducing a weight being the square of reciprocal distance between the contact pair.

The weighted contact number vi of residue i is given by The WCN profile of a protein of N residues is defined as:

)

2.2.4 Gaussian network model (GNM)

The Gaussian Network Model (GNM)¹⁰ simplifies a protein by modeling it with its Cα atoms only and attaching springs with uniform constants to contacted Cα pairs. Cα pairs are

considered to be in contact when their separation distance is within a cutoff distance rc , typically set to 7.0±0.5Å. In GNM, all the dynamic properties can be obtained from the Kirchhoff matrix T whose elements are given by:

if i≠j and Rij r≦ c

if i≠j and Rij > rc (4) if i=j

where Rij is the distance between atom i and j, and rc is the cutoff distance. The mean square fluctuations of each atom and the theoretical B-factors can be expressed as:

[ ]

2.2.5 Solvent accessible surface (SAS)

The surface area is an important structural characteristic because interaction with other molecules happens on the surface. The solvent accessible surfaces of residues were assigned by using the program DSSP¹⁸. The data listed in column “ACC” of each residue was retrieved and divided by the maximum area of that amino acid. The maximum accessible area of each amino acid is listed in Table 1. The relative SAS were listed in the order of residue number to obtain SAS profiles for each protein.

2.3 Methods

2.3.1 The relationship between protein structure and dynamics

For the convenience of comparison between features with quite different ranges, we first calculated the z-scores of the values in five feature/model profiles. Z-scores is also called

⎪⎪

standard score, z-value, normal score and standardized score. It indicates how many standard deviations an observation is above or below the mean. The z-score of the ith(i=residue number) value in the profile X is defined as:

σ , where μ is the mean and σ is the standard deviation of the distribution of values in profile X.

The calculations are described by the following formulas:

After z-score normalization, we calculated the correlation coefficient (CC) of two feature/model profiles X and Y by:

If X and Y are independent then the correlation is 0. The closer the coefficient is to either

−1 or 1, the stronger the correlation between the variables.

2.3.2 The relationship between B-factor and smoothed SAS

To understand how smoothing would influence the correlation coefficient between SAS and B-factor, different smoothing processes were performed on SAS profiles and predicted SAS profiles.

SAS prediction is accomplished by running the Protein Solvent AcceSsibility Prediction Server (http://140.113.239.214/~weilun/) developed by Shu WL and Hwang JK. We used the FASTA sequences of all 972 chains in our dataset as the input and chose the real-value model of this server. As a result, predicted SAS profiles (PSAS profile) could be obtained.

Two types of smoothing processes combined with different window-sizes were

implemented. The basic idea of both methods is the same. They simply took the average SAS of residue i along with its neighboring (in sequence order) residues within a certain window size. The smoothed SAS (SASi’) of residue i can be calculated by:

(1) WSn(window-size n, where n=2k+1, k=1,2,3,4)

(2) mWSn(modified window-size n, where n=2k+1, k=1,2,3,4)

The only difference between the two types of methods is that, in modified version, each value is multiplied with a weight before calculating the average. The weight reduces as the distance from the center of the current window (residue i) increases. Take mWDW5 as an example, SASi’=(SASi-2 +2*SASi-1 +2²* SASi +2* SASi+1 + SASi+2)/(1+2+2²+2+1).

2.3.3 The effect of SCOP classification and protein length

The dataset was classified into 11 groups according to SCOP database¹⁶. The following two cases were not assigned to any class and were ignored in this part of analysis: (1) chains with no SCOP entries (noted as N/A in this work) and (2) chains that are separated into segments being assigned to different classes (noted as M).

The notation and description of each class can be found in Table 3. The PDBID assigned to each class were listed in Appendix 1. The averages of ten types of correlation coefficient were also calculated on the basis of SCOP grouping.

On the other hand, our dataset was divided into 11 groups according to protein length. The range and number of each group were listed in Table 4. The members’ PDBID were listed in Appendix 2.

To understand the influence of protein length, we first consider one feature/model at a time.

For a protein with size N, the mean B-factor was computed by:

N B B

∑

= =¹ (13)

Mean CM/SAS/WCN was derived in the same way for each protein. These values were also classified according to protein length.

The result in part 1 was ten types of correlation coefficients (CCB-CM, CCB-GNM, CCB-SAS, CCB-WCN, CCCM-GNM, CCCM-SAS, CCCM-WCN, CCGNM-SAS, CCGNM-WCN, CCSAS-WCN) for 972 proteins. For each type of correlation coefficient, we calculated the average of each group. As a result, each type of correlation coefficient would yield 11 values. Each value represents the general condition of proteins with length in the same range.

在文檔中蛋白質結構與動態之關係 (頁 12-17)