3.1 The relationship between protein structure and dynamics
Ten types of correlation coefficients (CCB-CM, CCB-GNM, CCB-SAS, CCB-WCN, CCCM-GNM, CCCM-SAS, CCCM-WCN, CCGNM-SAS, CCGNM-WCN, CCSAS-WCN) were calculated. The correlation coefficients averaged over 972 proteins are listed in Table 5.
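The averaging step can be sketched as follows. This is a minimal illustration that assumes each coefficient is the Pearson correlation between two per-residue profiles of the same protein, which this section does not state explicitly. The function names and the toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length per-residue profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / (sx * sy)

def mean_correlation(profile_pairs):
    """Average the per-protein coefficient over a list of (x, y) profile pairs."""
    values = [pearson(x, y) for x, y in profile_pairs]
    return sum(values) / len(values)

# Toy data (made up): two proteins, each with two per-residue profiles.
toy = [
    ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),  # perfectly correlated
    ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]),  # perfectly anti-correlated
]
toy_mean = mean_correlation(toy)
```

Computing the coefficient per protein and then averaging, rather than pooling all residues, keeps each protein's contribution independent of its length.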
The lowest value was the correlation coefficient between B-factor and SAS (CCB-SAS), 0.523. Even so, it shows that these two features are positively correlated: the more exposed a residue is, the more it fluctuates.
The highest value was CCGNM-WCN (0.879). Note that the inverse of the original weighted contact number was taken when building the WCN profile, so a residue located in a more crowded region yields a lower value in the WCN profile. From the fairly high value of CCGNM-WCN, one can presume that such a residue would exhibit higher autocorrelation of atomic fluctuation.
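The inversion described above can be sketched as follows. The 1/r^2 distance weighting is the commonly used form of the weighted contact number and is an assumption here, since this section does not restate the definition. The use of one representative atom per residue and the toy coordinates are also hypothetical.

```python
def inverse_wcn_profile(coords):
    """Return 1 / w_i per residue, where w_i = sum over j != i of 1 / r_ij**2.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    Requires at least two residues with distinct coordinates.
    """
    profile = []
    for i, ci in enumerate(coords):
        w = 0.0
        for j, cj in enumerate(coords):
            if i == j:
                continue
            r2 = sum((a - b) ** 2 for a, b in zip(ci, cj))
            w += 1.0 / r2
        # Crowded residues have a large w, hence a small inverted value.
        profile.append(1.0 / w)
    return profile

# Toy chain (made up): three residues on a line, one unit apart.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
toy_profile = inverse_wcn_profile(toy_coords)
```

In this toy chain the middle residue has two close neighbours, so it receives the lowest value in the inverted profile, as the text describes for residues in crowded regions.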
Among the three models built for interpreting dynamic properties from structural information (CM, GNM and WCN), WCN correlated best with B-factor in this dataset (CCB-CM = 0.528, CCB-GNM = 0.556, CCB-WCN = 0.61).
Although CM, GNM and WCN are based on different theoretical hypotheses, they share high mutual correlations (CCCM-GNM = 0.72, CCCM-WCN = 0.870 and CCGNM-WCN = 0.879). A possible reason is that the same underlying physical property of a protein can be expressed in different ways, so there may be some interchangeability among these three models.
B-factor, CM and GNM depict protein dynamic properties, while SAS and WCN represent structural characteristics. Nevertheless, all five features were positively correlated with one another (mutual correlations of 0.523-0.879). This agrees with our basic premise that protein structure is related to dynamics.
3.2 The relationship between B-factor and smoothed SAS/PSAS
As mentioned earlier, CCB-SAS has the lowest value (0.523) among the ten types of correlation coefficients. Comparing the profiles of B-factor, SAS, GNM and WCN (see Figure 1(A)-(C)), the SAS profile is noticeably rougher than the others. However, if the SAS of each residue is replaced by the average over that residue and its adjacent residues, the correlation with B-factor improves considerably (see Figure 1(D)). These observations gave rise to the idea of smoothing SAS.
Picture a protein segment {residue i-1, residue i, residue i+1} that is exposed at the surface, but in which residue i is bent or pushed toward the center during folding. Under such circumstances, it is likely that part of the accessible surface near residue i is attributed to residue i-1 and residue i+1 instead. Consequently, sharp peaks are seen throughout the plot of SAS. Although residue i is also close to the surface and might exhibit high flexibility, this is hard to tell from the SAS of residue i alone.
Take 1A1IA as an example. Its SAS, plotted in residue-number order, is shown in Figure 2(A).
The B-factor z-scores were 2.684, 2.149 and 2.656 for SER-111, CYS-112 and ASP-113, respectively. 1A1IA is drawn in B-factor putty representation in Figure 2(B). The proportional SAS of CYS-112 was 0.134, while that of SER-111 was 0.806 and that of ASP-113 was 0.878. In terms of z-score, CYS-112 was -1.375, while SER-111 was 1.553 and ASP-113 was 1.863. The surface of 1A1IA is shown in Figure 2(C), with the surface area belonging to CYS-112 colored red.
Overall, SER-111, CYS-112 and ASP-113 have similar B-factors, but the SAS of CYS-112 is much smaller than that of the other two.
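The z-scores quoted in this example standardise each profile within a protein. A minimal sketch of that normalisation is given below. Whether the population or the sample standard deviation was used is not stated in the text, so the population form is an assumption, and the input values are made up rather than taken from 1A1IA.

```python
import math

def z_scores(values):
    """Standardise a per-residue profile: (value - mean) / standard deviation."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (assumption: the text does not say
    # whether the population or the sample form was used).
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("a constant profile cannot be standardised")
    return [(v - mean) / sd for v in values]

# Made-up profile, for illustration only.
toy_z = z_scores([1.0, 2.0, 3.0])
```

Working in z-scores puts B-factor and SAS on a common scale, which is what allows the two profiles of neighbouring residues to be compared directly.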
Hence the need to smooth SAS. Averaging SAS over a window of a certain size gives a better picture of how exposed or buried a residue is.
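Window averaging of this kind can be sketched as a centered moving average. This is only one plausible form: the eight smoothing processes evaluated in this work are not specified in this section, and the `half_window` parameter and the truncation at the chain termini are assumptions. The three input values are the proportional SAS of SER-111, CYS-112 and ASP-113 quoted above. Their other neighbours are omitted, so only the central value is meaningful.

```python
def smooth_profile(values, half_window=1):
    """Centered moving average; the window is truncated at the chain termini."""
    if half_window < 0:
        raise ValueError("half_window must be non-negative")
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Proportional SAS of SER-111, CYS-112 and ASP-113 as quoted in the text.
segment = [0.806, 0.134, 0.878]
smoothed_segment = smooth_profile(segment, half_window=1)
# With a three-residue window, CYS-112 moves from 0.134 to about 0.606.
```

After smoothing, the value for CYS-112 is close to those of its exposed neighbours, which is consistent with the three residues having similar B-factors.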
The average CCB-SAS’ and CCB-PSAS’ obtained from eight different smoothing processes, together with the unsmoothed CCB-SAS and CCB-PSAS, are listed in Table 6.
Smoothing either experimental or predicted SAS meaningfully improves the correlation with B-factor. For predicted SAS (CCB-PSAS = 0.435), smoothing raised the correlation to 0.525, which is comparable to the unsmoothed experimental CCB-SAS (0.523). Higher correlations were obtained with experimental SAS: all eight smoothing processes improved on CCB-SAS (0.523) by 0.08-0.12.
The results of these simple and intuitive smoothing processes support our postulate that smoothed SAS correlates better with B-factor.
3.3 The effect of SCOP classification and protein length
The results by SCOP class are shown in Table 7. Some patterns may be present in the data. For example, coiled coil proteins have not only the highest CCB-CM, CCB-GNM, CCB-WCN and CCCM-GNM, but also the lowest CCCM-SAS, CCCM-WCN, CCGNM-SAS and CCGNM-WCN among all classes. Designed proteins, on the other hand, have the lowest CCB-GNM and CCB-SAS and the highest CCCM-SAS, CCGNM-SAS, CCGNM-WCN and CCSAS-WCN. Nevertheless, the numbers of coiled coil proteins and designed proteins in our dataset are too small to be statistically meaningful (5 and 1, respectively). Studying the characteristics of each type (SCOP class) of structure would help us understand the underlying correlation. However, over 1/3 of the proteins in the dataset (357/972) lack SCOP entries, which leaves uncertainty in the final result.
The effect of structural classification can be investigated further as SCOP entries become more complete.
The distribution of proteins over length intervals is shown in Figure 2. The average protein length in the dataset is 294.16.
Figure 4(A)-(D) shows the average of the per-protein mean B-factor, CM, SAS and WCN for each length interval. Unlike CM, SAS and WCN, the B-factor curve shows neither a linear nor an exponential relationship with protein length. Figure 4(B) shows that in larger proteins each residue is, on average, more distant from the protein centroid. Conversely, residues in smaller proteins are more exposed at the surface. Figure 4(D) suggests that residues tend to be more tightly packed as proteins grow larger.
The average correlation coefficients (CCB-CM, CCB-GNM, CCB-SAS, CCB-WCN, CCCM-GNM, CCCM-SAS, CCCM-WCN, CCGNM-SAS, CCGNM-WCN, CCSAS-WCN) for each length interval are displayed in Figure 5(A)-(J). No obvious trends were found.
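Grouping proteins by length and averaging a coefficient within each interval can be sketched as follows. The interval width of 100 residues, the function name and the toy records are hypothetical, since the intervals actually used for Figure 5 are not given in this section.

```python
from collections import defaultdict

def mean_by_length_bin(records, bin_width=100):
    """Group (length, value) records into length intervals and average each.

    Returns {bin_start: mean_value}; a bin covers
    [bin_start, bin_start + bin_width).
    """
    groups = defaultdict(list)
    for length, value in records:
        bin_start = (length // bin_width) * bin_width
        groups[bin_start].append(value)
    return {start: sum(v) / len(v) for start, v in sorted(groups.items())}

# Toy records (made up): (protein length, correlation coefficient).
toy_records = [(85, 0.50), (120, 0.60), (180, 0.40), (310, 0.55)]
toy_bins = mean_by_length_bin(toy_records)
```

Intervals that contain no protein are simply absent from the result, so sparsely populated length ranges should be read with the same caution as the small SCOP classes.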