2.1 Dataset
The dataset used in the present study consists of 972 protein chains from PDB-REPRDB17. According to the same criteria used in the research of WCN14, we chose X-ray crystallographic structures with resolution under 2.0Å and R-factor less than 0.2. All chains have at least 60 residues, and share pair-wise identity less than 25%. Detailed criteria can be found in Table 2.
2.2 Features and models
2.2.1 B-factor
X-ray diffraction data provide the average structure of a protein in a crystal as well as the scale of its atomic fluctuations normally expressed as the isotropic Debye-Waller factors, or B-factors. We retrieved the B-factor from PDB records.
2.2.2 Centroid model
With structural information, CM13 managed to reproduce protein dynamics and correlation of the fluctuations in proteins. It simply calculates the square of the atomic distance from the center of the mass of the protein. Let X0 be the center of mass of the protein, i.e.
∑
∑
= km Xk k kmk
X0 , where mk and Xk are the mass and the crystallographic position of atom k, respectively. The distance between atom i and the center of mass can be obtained by:
) )(
( 0 0
2 = Xi −X Xi−X
ri (1) This value is found to be linearly related to the average atomic fluctuation in proteins.
Unlike Molecular dynamics (MD)5; 6; 7; 8; 9 or Elastic network model (ENM) which based on mechanical model, either sophisticated or simplified, CM depends only on the information of protein geometrical shape. CM involves no computation of proteins’ trajectories or matrix operation and thus is simple and efficient.
2.2.3 Weighted contact number (WCN) model
A recent study19 showed that the atomic mean-square displacement (or B-factor) is closely related to the number of neighboring atoms. This method was referred to as the protein contact number (CN) model. The CN model was able to predict B-factor profiles directly from protein structures without either trajectory integration or matrix diagonalization.
However, the CN model can be further improved by introducing a weight being the square of reciprocal distance between the contact pair.
The weighted contact number vi of residue i is given by The WCN profile of a protein of N residues is defined as:
)
2.2.4 Gaussian network model (GNM)
The Gaussian Network Model (GNM)10 simplifies a protein by modeling it with its Cα atoms only and attaching springs with uniform constants to contacted Cα pairs. Cα pairs are
considered to be in contact when their separation distance is within a cutoff distance rc , typically set to 7.0±0.5Å. In GNM, all the dynamic properties can be obtained from the Kirchhoff matrix T whose elements are given by:
if i≠j and Rij r≦ c
if i≠j and Rij > rc (4) if i=j
where Rij is the distance between atom i and j, and rc is the cutoff distance. The mean square fluctuations of each atom and the theoretical B-factors can be expressed as:
[ ]
ii2.2.5 Solvent accessible surface (SAS)
The surface area is an important structural characteristic because interaction with other molecules happens on the surface. The solvent accessible surfaces of residues were assigned by using the program DSSP18. The data listed in column “ACC” of each residue was retrieved and divided by the maximum area of that amino acid. The maximum accessible area of each amino acid is listed in Table 1. The relative SAS were listed in the order of residue number to obtain SAS profiles for each protein.
2.3 Methods
2.3.1 The relationship between protein structure and dynamics
For the convenience of comparison between features with quite different ranges, we first calculated the z-scores of the values in five feature/model profiles. Z-scores is also called
⎪⎪
standard score, z-value, normal score and standardized score. It indicates how many standard deviations an observation is above or below the mean. The z-score of the ith(i=residue number) value in the profile X is defined as:
σ , where μ is the mean and σ is the standard deviation of the distribution of values in profile X.
The calculations are described by the following formulas:
X
After z-score normalization, we calculated the correlation coefficient (CC) of two feature/model profiles X and Y by:
2
If X and Y are independent then the correlation is 0. The closer the coefficient is to either
−1 or 1, the stronger the correlation between the variables.
2.3.2 The relationship between B-factor and smoothed SAS
To understand how smoothing would influence the correlation coefficient between SAS and B-factor, different smoothing processes were performed on SAS profiles and predicted SAS profiles.
SAS prediction is accomplished by running the Protein Solvent AcceSsibility Prediction Server (http://140.113.239.214/~weilun/) developed by Shu WL and Hwang JK. We used the FASTA sequences of all 972 chains in our dataset as the input and chose the real-value model of this server. As a result, predicted SAS profiles (PSAS profile) could be obtained.
Two types of smoothing processes combined with different window-sizes were
implemented. The basic idea of both methods is the same. They simply took the average SAS of residue i along with its neighboring (in sequence order) residues within a certain window size. The smoothed SAS (SASi’) of residue i can be calculated by:
(1) WSn(window-size n, where n=2k+1, k=1,2,3,4)
(2) mWSn(modified window-size n, where n=2k+1, k=1,2,3,4)
The only difference between the two types of methods is that, in modified version, each value is multiplied with a weight before calculating the average. The weight reduces as the distance from the center of the current window (residue i) increases. Take mWDW5 as an example, SASi’=(SASi-2 +2*SASi-1 +22* SASi +2* SASi+1 + SASi+2)/(1+2+22+2+1).
2.3.3 The effect of SCOP classification and protein length
The dataset was classified into 11 groups according to SCOP database16. The following two cases were not assigned to any class and were ignored in this part of analysis: (1) chains with no SCOP entries (noted as N/A in this work) and (2) chains that are separated into segments being assigned to different classes (noted as M).
The notation and description of each class can be found in Table 3. The PDBID assigned to each class were listed in Appendix 1. The averages of ten types of correlation coefficient were also calculated on the basis of SCOP grouping.
On the other hand, our dataset was divided into 11 groups according to protein length. The range and number of each group were listed in Table 4. The members’ PDBID were listed in Appendix 2.
To understand the influence of protein length, we first consider one feature/model at a time.
For a protein with size N, the mean B-factor was computed by:
N B B
N
i
∑
i= =1 (13)
Mean CM/SAS/WCN was derived in the same way for each protein. These values were also classified according to protein length.
The result in part 1 was ten types of correlation coefficients (CCB-CM, CCB-GNM, CCB-SAS, CCB-WCN, CCCM-GNM, CCCM-SAS, CCCM-WCN, CCGNM-SAS, CCGNM-WCN, CCSAS-WCN) for 972 proteins. For each type of correlation coefficient, we calculated the average of each group. As a result, each type of correlation coefficient would yield 11 values. Each value represents the general condition of proteins with length in the same range.