Bovine pancreatic trypsin inhibitor (BPTI) is a 58‐residue protein composed of two α‐helices and a two‐stranded anti‐parallel β‐sheet.
Hydrogen exchange study43 showed that the slowest exchanging residues are R20‐Y23, Q31, F33, and F45.
Figure 1.19 Conformational entropy profile of BPTI. Residues with the slowest exchanging protons (blue) and the lowest conformational entropy (red) are labeled.
Figure 1.19 shows the conformational entropy profile of BPTI. The residues with the lowest conformational entropy are I18‐F22 and C30‐V34 which are located on the two β‐strands. The slow exchanging residues and low conformational entropy residues overlap well. Figure 1.20 shows the spatial distribution of residues with the slowest exchanging protons and the lowest conformational entropy on the ribbon diagram of BPTI.
Figure 1.20 Spatial distribution of residues with the slowest exchanging protons (blue) and the lowest conformational entropy (red) on the ribbon diagram of
BPTI.
27
Discussion
We developed a method to compute structure conservation from protein sequence. The basic idea is to know whether the secondary structure of a residue is predictable from its context. Although chameleon sequences are usually thought to be a problem in the prediction of secondary structures using sequence homology1,6,44, we instead use this kind of property to develop our method to compute structure conservation. The residual property of the predictability is quantified to a value, the conformational entropy. The secondary structure of the structure‐conserved residues can be predicted more easily than that of the structurally variable residues. As a consequence, the structure‐conserved residues have lower conformational entropy.
We applied our method to a set of proteins with known hydrogen‐exchange rate data and found a strong correlation between structure conservation and slow hydrogen‐exchange rate. The hydrogen‐exchange rates are measurement of the exchanging rate of backbone amide protons. They are involved in the formation of secondary structures and related to the local structure stability. For example, in the typical 4‐helices, strong hydrogen bonds are formed between the backbone oxygen atom of residue i and the backbone nitrogen atom of residue i+3. The atoms which are located in rigid environment or involving in more
hydrogen‐exchange rates. Our results show that these atoms are more structurally conserved.
Our method only considered the local interactions. On the other hand, structure conservation is also affected by long‐range interactions. It would be interesting to know how long‐range interactions contribute to the property of structure conservation. In addition to secondary structures, there are other attributes of protein structure can be used to describe the structure conservation, such as backbone geometry and solvent accessible surface area.
We are also interested in finding the relationship between structural conservation and the dynamics of protein structure. If the dynamics of protein structure are directly related to the structural conservation, our method can be used to predict protein dynamics from the sequence.
29
CHAPTER TWO
NMR Order Parameter Prediction Using
Protein‐fixed‐point Model and Weighted Contact Number Model
Introduction
One of the most important topics in biological science is to understand the protein function. It is well‐known that protein dynamics is closely related to the function of protein. For example, the Aquifex adenylate kinase changes its conformation for the phosphotransfer function. Figure 2.1 shows the structure of the protein before (red) and after (green) the conformational change. The two lids must close to exclude bulk water from the active site and to bring the substrate into position for phosphotransfer.
Figure 2.1 Superposition of apo Aquifex Adk (red) and Aquifex Adk in complex (green)
The advance of X‐ray diffraction technology gives us the information about how the protein moves, i.e. B‐factor or temperature factor, in the experiment environment. However, the development of computational theories to predict protein dynamics directly from structure is still needed:
1. Only a small fraction of protein structures are solved (52013 solved structures in PDB5 due to July 31, 2008).
2. The X‐ray B‐factors are heavily affected by the experimental conditions:
temperature, solvent, or biological unit.
Several computational methods have been developed to determine the protein dynamics. Molecular dynamics (MD) simulation has been widely used in the study of protein function and dynamics. It simulates the interactions between each atom, bonding force, van der Waals force, charge‐charge interaction, etc. The computation time is extremely long when the size of the protein is large and the selection of appropriate parameters of force field itself is a complicated problem. Gaussian network model (GNM)45 transfers the protein structure into a network in which each Cα atom pair is connected together if their distance is smaller than a given cutoff value. Using this protein‐converted network, GNM can compute the theoretical thermal fluctuation of each atom and the correlation of motions between each atom pair. Recently we have developed a model to predict the thermal fluctuation from protein structure, which is called protein‐fixed‐point (PFP) model46. The PFP model only uses the coordinates of Cα atoms and simply determines the center of mass of the protein. We found that the thermal fluctuation is
31
proportional to the squared distance from the atom to the center of mass of the structure. Another model called weighted contact number (WCN) model47 computes the number of neighboring atoms weighted by the inverse distance between each atom pair. The PFP and WCN models show that the protein dynamics can be extracted directly from the intrinsic property of protein structure without the use of any mechanical model.
The order parameter according to the NMR experiment is widely used to study the dynamic‐related protein functions. Zhang and Brushweiler48 used a contact model to predict the order parameter from protein structure. They expressed the backbone S2 order parameter as a function of close contacts between the amide proton and the carbonyl oxygen of the preceding amino acid and the surrounding protein atoms, i.e.,
Si2 = tanh[α( e−ri−1,kO /ρ
k
∑
+βe−ri ,kH/ρ)]+γ (2.1)where is the distance between the carbonyl oxygen of residue and heavy atom and is the distance between the amide proton and heavy atom into account of motional correlation effects, Bruschweiler and coworker developed a hybrid between the CM and the elastic network model (ENM)
45,49referred to as reorientational contact‐weighted elastic network model (rCENM)50.
Here, we use the PFP and WCN model to predict the N‐H S2 order parameter directly from the protein structure. Our results show that the WCN
model can more accurately reproduce the experimental order parameter than the previous published results.
33
Methods
Protein‐fixed‐point Model (PFP)
Let X0 be the center of mass of the protein,
∑
where mk and Xk are the mass and the crystallographic position of atom k, respectively. The squared distance of atom i from the center of mass of the protein is computed by
(
0)(
0)
2 X X X X
ri = i − i − (2.3)
Each protein of size N will have its distinct distribution given by (r , r , , r ), referred as the r2 profile. The squared distance is directly related to the predicted order parameter. In order to confine the computed order parameter
to lying between 0 and 1, we apply the hyperbolic tangent function to .
∑
≠where rij is the distance between Cα atoms of residue i and j. This equation defines the number of neighboring Cα atoms surrounding that of the ith residue ‐‐ the contribution of each surrounding atom j to the central atom i is scaled down by the factor . In order to confine the computed order parameter
1/ rij2
S2 to lying between 0 and 1, we apply the hyperbolic tangent function to νi.
Si2 ~ tanh2νi (2.6)
We will refer this method as the weighted protein contact‐number (WCN) model.
Dataset
We used the datasets of Zhang and Bruschweiler48 and Ming and Bruschweiler50. However, our dataset is not completely identical with the original one, since we deleted some of the order parameters that are not consistent with those of the original dataset, and we also added some new ones from current literatures. Our current dataset is larger than the original dataset and comprises βARK1 PH domain (1BAK), calbindin (4ICB), CspA (3MEF), frenolicin acyl carrier protein (1OR5), lysozyme (1JEF), P85α SH2 domain (1BFJ), ubiquitin (1UBQ), ketosteroid isomerase (8CHO), tautomerase (4OTA), and interleukin‐4 (1HIK).
35
Results
We computed values for the following proteins: βARK1 PH domain (1BAK), calbindin (4ICB), CspA (3MEF), frenolicin acyl carrier protein (1OR5), lysozyme (1JEF), P85_SH2 domain (1BFJ), ubiquitin (1UBQ), ketosteroid isomerase (8CHO), tautomerase (4OTA), and interleukin‐4 (1HIK). Table 2.1 summarizes the Pearson correlation coefficients between the experimental and the computed N‐H order parameters by the WCN, PFP, CM, and rCENM.
S2
S2
Table 2.1 The experimental backbone N-H order parameter data: βARK1 PH domain51, Calbindin52, CspA53, Frenolicin acyl carrier protein54, Lysozyme55, P85α SH2 domain of phosphoinostide 3-kinase56, Ubiquitin57 , Ketosteroid isomerase58, Taotomerase59, and Interleukin-460. * The average correlation coefficient over the 7 structures for WCN, PFP, CM and rCENM are 0.82, 0.74, 0.73 and 0.81, respectively.
The average correlation coefficient of WCN model is 0.79, while that of the CM is 0.71. The average correlation coefficient of rCENM is 0.81 (for the 7 available proteins), while that of WCN is 0.82. In general, the WCN model
performs better than the CM, except two cases, 1OR5 and 1HIK. The results of WCN is comparable to the rCENM, however, there is no available software for rCENM which can be used to make a complete comparison. The following sections discuss each case in the dataset.
βARK1 PH domain
β ARK1 PH domain has the same topology as other PH domains, which are characterized by several β ‐strands forming a β ‐sandwich flanked on one side by an extended C‐terminal α ‐helix that behaves as a molten helix51. The Pearson correlation coefficient between the computed N‐H order parameter SWCN2 and the experimental one is 0.83. Figure 2.2 shows the experimental and predicted order parameters for
SNMR2
β ARK1 PH domain by using WCN, CM, and PFP models.
Figure 2.2 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
37
Calbindin Calbindin D9k52
is composed of four α ‐helices, the N‐terminal (E17‐S24) and C‐terminal Ca2+‐binding loops (D54‐S62), and the linker loop. Our prediction correctly identifies the most mobile linker loop and the C‐terminal Ca2+‐binding loop which have significant lower S2 values. The rigid helical regions are also predicted to have higher order parameters. The experimental data does not produce the S2 value of the residue P20 on the N‐terminal Ca2+‐binding loop because of the limit of NMR relaxation experiment.
However P20 shows higher temperature factor than its neighboring residues in the X‐ray 61, which is consistent with our prediction. Despite the missing data of P20, the correlation coefficient is still high (r=0.79). Figure 2.3 shows the experimental and predicted order parameters for calbindin by using WCN, CM, and PFP model.
Figure 2.3 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line)
Cold‐shock protein A from E‐coli
Cold‐shock protein A from E‐coli (CspA) 53 is a Greek‐key β‐barrel protein.
The segment of residues N39 to Y42 between two β‐strands is identified to be partially disordered in the crystallization environment53. Our method successfully predicts it to be the most mobile region in the protein except the N‐terminal loop. However there is an disagreement between the predicted value and the experimental value on residue D46 ( : 0.58, : 0.84) which did not fit well with any model in the NMR experiment53. The correlation coefficient increases to 0.78 (from r=0.74) if the data of residue D46, which is less reliable, is removed. Figure 2.4 shows the experimental and predicted order parameters for CspA by using WCN, CM, and PFP model.
Sexp2 SWCN2
Figure 2.4 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
39
Frenolicin acyl carrier protein
Frendicin acyl carrier protein (frendicin ACP)54 is comprised of a three‐helix bundle structure and have a high correlation coefficient between prediction and experiment (r=0.81). The average value of the order parameters of the three helices is 0.844, which is consistent with our prediction having high S2 values. We also correctly predict that the C‐terminal residues and the long loop (G17‐D23) connecting two helices have the first and second lowest average S2 values respectively (0.358 and 0.492). Figure 2.5 shows the experimental and predicted order parameters for frendicin ACP by using WCN, CM, and PFP model.
Figure 2.5 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
Lysozyme
The correlation coefficient of the prediction for lysozyme55 is 0.83. The order parameter of residue P70 located on the most flexible loop is not available from NMR relaxation. According to the X‐ray structure of lysozyme, P70 has the sixth highest temperature factor (28.37) in the whole protein62 (the four residues having the highest temperature factors are on the C‐terminal region).
Our prediction also shows that P70 is the most flexible except a few residues located on the C‐terminal. Figure 2.6 shows the experimental and predicted order parameters for lysozyme by using WCN, CM, and PFP model.
Figure 2.6 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
41
SH2 domain of p85α subunit of phosphoinositide 3‐kinase
The SH2 domain of the p85α subunit of phosphoinositide 3‐kinase contains 111 residues forming a structure with 3 α‐helices and a 3‐strand anti‐parallel β‐sheet. The correlation coefficient between the experimental value and predicted value of WCN model is 0.86, which is higher than that of the PFP model (r=0.76) and the CM (r=0.79). Figure 2.7 shows the experimental and predicted order parameters for the protein by using WCN, CM, and PFP model. The result for this protein predicted with the WCN model is basically excellent.
Figure 2.7 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
Ubiquitin
Ubiquitin57 is a small single‐domain protein with 76 residues containing both an α ‐helix and a β ‐sheet. The agreement between SWCN
2 and of ubiquitin is excellent (r=0.96). Figure 2.8 shows the experimental and predicted order parameters for ubiquitin by using the WCN, CM, and PFP model. The CM and PFP model also yield excellent results with correlation coefficients of 0.96 and 0.92, respectively.
SNMR2
Figure 2.8 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
43
Ketosteroid isomerase
The ketosteroid isomerase is a 125‐residue protein comprising 2 α‐helices and 2 anti‐parallel β‐strands. The correlation coefficient between experimental value and predicted value of the WCN model is 0.82, which is higher than that of the PFP model (r=0.43) and CM (r=0.57). Figure 2.9 shows the experimental and predicted order parameters for ketosteroid isomerase by using the WCN, CM, and PFP model. Both the PFP and WCN model predict the Y88‐K92 region to be flexible (even the CM). The region is actually an isolated loop connecting two β‐strands. The reason for the disagreement between experimental and predicted value may be that the protein forms a dimer under experimental condition and the loop is located and stabilized at the interface between two subunits.
Figure 2.9 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line)
4‐oxalocrotonate tautomerase
The 4‐oxalocrotonate tautomerase is a homohexamer with 62 residues per unit, consisting of an α‐helix, 2 β‐strands, a β‐hairpin, 2 loops, 2 turns, and a C‐terminal coil. The correlation coefficient between the experimental and predicted value of the WCN model is 0.51, which is higher than that of the PFP model (r=0.28) and CM (r=0.44). The PFP model predicts the S28‐R39 loop region to be flexible, however, the region is actually highly ordered according to the NMR experiment. The reason may be that the loop is away from the center of the protein but the environment around it is crowded. The WCN model correctly predicts the loop region to be highly ordered. Figure 2.10 shows the experimental and predicted order parameters for 4‐oxalocrotonate tautomerase by using the WCN, CM, and PFP model.
Figure 2.10 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
45
Interleukin‐4
In the original paper for the experimental NMR data of interleukin‐460, the authors mentioned an uncommon loop region (H59‐A70) which has high order parameters. The observed average order parameter for the loop is 0.87 which is significantly higher than those of other loops and is only slightly lower than that observed for the secondary structure regions. The uncommon loop is predicted to have low order parameter for the WCN model, however, the correlation coefficient increases to 0.78 if the loop region is removed.
Figure 2.11 shows the experimental and predicted order parameters for Interleukin‐4 by using the WCN, CM, and PFP model.
Figure 2.11 Upper part: computed S2 values of WCN model (solid line) and CM (dotted line) and experimental S2 values (circle). Lower part: computed S2 values of PFP model (solid line) and experimental S2 values (circle).
of any additional mechanical model. Since we can compute the order parameters directly from the topological properties (such as protein contact number) of protein structures, our study underscores a very direct link between protein topological structure and its dynamics. In addition, since our model uses only Cα atoms, our results indicate that protein dynamics (such as the order parameters) can be determined without the knowledge of protein sequences. As increasing numbers of protein structures are solved, this method offers an efficient way to determine backbone motions with high accuracy and is practical in the study of protein function‐dynamics relationship and structural genomics.
To check the effects of the additional information of side‐chain groups on the computed order parameters, we compared the computed S2 values with the side‐chain information. However, the inclusion of the side‐chain groups deteriorates the performance of the WCN model (r goes down from 0.79 to 0.73). The reason for this is not clear. It may be that the flexible side‐chain conformations, though conveying more detailed information about the atomic environments, introduce undesirable noises that overshadow the supposedly useful information of the former in the computation of the order parameters.
47
A recent study by Halle63 found a linear correlation between the contact number of non‐covalent neighboring atoms and B‐factors. Their result shows that the B‐factor is inversely proportional to the contact density, i.e., the number of non‐covalent neighbors. They assume that the contact density of covalent neighbors for each Cα atom is similar. Therefore, only the non‐covalent neighbors are needed to be considered. Our method has the advantage of not using the cut‐off distance, while Halle’s method needs to determine the optimal cut‐off distance when computing contact number. In addition, Halle’s method seems to be sensitive to the number of atoms included in the cut‐off radius. A recent study shows that the WCN model performs better than the naïve contact number model, which needs a cut‐off distance, in computing B‐factors47.
The biological function of proteins is closely related to the cooperative motions and the correlated fluctuations which involve large portions of the structure64‐67. Normal Mode Analysis (NMA) had been used to study biomolecules since early 1980s68,69. It decomposes the protein dynamics into a collection of motions which include large scale/low frequency and small scale/high frequency motions. Biologists usually focus on the large scale/low frequency motions that are relevant to protein functions. The major contribution of NMA to the biological research field is the ability to provide the information of large, domain‐scale protein motions which is hard to compute by other methods. The classical approach of NMA is to diagonalize the Hessian matrix, i.e. the second derivative of the potential function of a molecular dynamics (MD) simulation. The major shortcoming of the classical NMA is that the sampling time increases dramatically with the size of the protein.
The Elastic Network Model (ENM)49, which describes protein dynamics without amino acid sequence and atomic coordinates, has been widely used in the studies of protein dynamics and structure‐function relationship. The ENM views the protein structure as an elastic network, the nodes of which are
49
the Cα atoms of individual residues. Residue pairs within a cutoff distance are connected by springs which have a uniform force constant in the network.
Based on ENM, a coarse‐grained version of NMA is developed and widely
Based on ENM, a coarse‐grained version of NMA is developed and widely