Chapter 4. Results
4.5 Discussion
Dan et al. [42] reported that conformational changes are commonly observed in
DNA-binding proteins. To understand how common conformational changes are in
protein-DNA interactions and how large they usually are, we collected the available
structure pairs of unbound and bound states for DNA-binding proteins from the PDB
database. Since a protein may have multiple unbound-bound structure pairs, we
adopted a strict criterion that a
protein has transitions if at least one of its associated unbound-bound structure pairs
has transitions. The definition of a transition between a structure pair is identical to that
of Dan et al. (the DSSP program was used to assign secondary structures, and only
segments in which the same transition persisted for at least five consecutive
residues were considered). The results show that 40.2% of the 132 proteins underwent
SSE transitions (changes in secondary structure) and 53.8% underwent D2O transitions.
These high ratios agree with the observations of Dan et al.
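The five-residue rule and the protein-level criterion described above can be sketched as follows. This is a minimal illustration, not the procedure of Dan et al.: it assumes that the unbound and bound structures have already been reduced to equal-length strings of per-residue state labels for the same residues (for example, DSSP codes collapsed to H, E and C), and the function names are hypothetical.

```python
from itertools import groupby

MIN_RUN = 5  # minimum number of consecutive residues, as stated in the text


def find_transitions(unbound, bound, min_run=MIN_RUN):
    """Return (start, end, from_state, to_state) for each qualifying segment.

    `unbound` and `bound` are equal-length strings of per-residue state
    labels for the same residues (an assumed input format).
    """
    if len(unbound) != len(bound):
        raise ValueError("state strings must be aligned and of equal length")
    segments = []
    pos = 0
    # Group consecutive residues that share the same (unbound, bound) states.
    for (before, after), run in groupby(zip(unbound, bound)):
        length = sum(1 for _ in run)
        if before != after and length >= min_run:
            segments.append((pos, pos + length - 1, before, after))
        pos += length
    return segments


def protein_has_transition(pairs, min_run=MIN_RUN):
    """Strict criterion: True if any unbound-bound pair shows a transition."""
    return any(find_transitions(u, b, min_run) for u, b in pairs)
```

In this sketch a run of four changed residues is ignored, whereas a run of five or more residues with the same change is reported as one segment with zero-based, inclusive positions.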
On the other hand, the RMSD values were not large: all structure pairs had an RMSD
below 4 Å (data not shown). If the criterion ‘RMSD ≤ 2 Å’, generally a rigorous
threshold, is taken to indicate small structural variation, 93.2% of the proteins have
at least one structure pair with small structural variation. In Table 7, the ratios of
proteins that underwent SSE (0.0%) and D2O (14.3%, one among the seven test cases)
transitions were much lower than those of the overall distribution (40.2% SSE and
53.8% D2O transitions). The main difference between
Table 7 and the analysis in this section is that in Table 7 the structure pair was selected
by the structure alignment score. This suggests that, in practice, using the best structure
alignment score helps to find structure pairs with few transitions for PWM prediction.
When the structure pair with the best RMSD was chosen to investigate the
conformational changes of a protein upon binding DNA, the ratios of proteins that
underwent SSE and D2O transitions dropped to 13.8% and 39.4%, respectively. These
results suggest that the proposed method will benefit the study of the large number of
DNA-binding proteins that have only unbound structures in the PDB database.
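The two ways of summarizing a protein used in this paragraph (any pair versus the best-RMSD pair) can be written down as a short sketch. The record layout and field names below are assumptions made for illustration, and the code only shows how such percentages could be derived from per-pair annotations; it does not reproduce the analysis pipeline of this study.

```python
from dataclasses import dataclass

RMSD_SMALL = 2.0  # Å; the 'RMSD ≤ 2 Å' criterion quoted in the text


@dataclass
class StructurePair:
    protein: str    # protein identifier (hypothetical field name)
    rmsd: float     # RMSD between unbound and bound structures, in Å
    has_sse: bool   # this pair shows an SSE transition
    has_d2o: bool   # this pair shows a D2O transition


def summarize(pairs):
    """Return protein-level percentages for the criteria discussed above."""
    by_protein = {}
    for pair in pairs:
        by_protein.setdefault(pair.protein, []).append(pair)
    if not by_protein:
        raise ValueError("no structure pairs given")

    groups = list(by_protein.values())
    # One representative pair per protein: the one with the lowest RMSD.
    best = [min(group, key=lambda p: p.rmsd) for group in groups]

    def percent(flags):
        flags = list(flags)
        return 100.0 * sum(flags) / len(flags)

    return {
        "small_variation": percent(any(p.rmsd <= RMSD_SMALL for p in g) for g in groups),
        "sse_any_pair": percent(any(p.has_sse for p in g) for g in groups),
        "d2o_any_pair": percent(any(p.has_d2o for p in g) for g in groups),
        "sse_best_rmsd": percent(p.has_sse for p in best),
        "d2o_best_rmsd": percent(p.has_d2o for p in best),
    }
```

Under the any-pair rule a single pair with a transition is enough to flag the protein, so its percentages can never be lower than those obtained from the best-RMSD pair alone.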
To shift the problem of structure-based PWM prediction from native complexes to
unbound protein structures, the most challenging issue may be the construction of a
reliable synthetic protein-DNA complex on which physics- or knowledge-based
scoring functions can be applied to perform prediction. In this study, we observed that
in many cases, template structures with low structure similarity were selected by SVM
and then used to generate the synthetic complexes. An example is shown for
ATF2_HUMAN in Figure 9, where 1K79:A, 1K7A:A, 1GXP:B and 3G6T:B are the
top four templates selected by SVM. The structure similarities (TM-scores) between the
query structure (1T2K:D) and these template structures are 0.15502, 0.15535, 0.18413 and 0.22657,
respectively. However, the template with the highest structure similarity is 2H7H:A
(TM-score=0.88661). Although the templates selected by SVM had lower structure
similarity, they yielded better PWM prediction performance. This is because both the
query structure (1T2K:D) and the template structure (2H7H:A) contain many
alpha-helices, and the structure alignment generated by TM-align does not perform
well in this situation. For this reason, the TM-score is not the only criterion used in this
study for selecting appropriate template structures; the additional features improve
template selection and thus PWM prediction performance.
In some cases, the best PWM is the third- or fourth-ranked prediction. DBD2BS
therefore considers all PWMs simultaneously to decide which positions in the PWMs
are important. An example is given in Figure 10, in which the two PWMs generated
from templates 1MJ2:C and 1MJO:C are aligned against each other. Their consistency
implies that these two PWMs are more important than the others. Thus, the final PWM
prediction for this protein is the position-wise average of the two PWMs.
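The final averaging step can be illustrated with a minimal sketch. It assumes that the two PWMs have already been aligned to the same length with no offset and that each column is stored as a mapping from base to probability; how DBD2BS aligns the PWMs and judges their consistency is not reproduced here, and the function name is hypothetical.

```python
BASES = "ACGT"


def average_pwms(pwm_a, pwm_b):
    """Position-wise average of two aligned PWMs of equal length.

    Each PWM is a list of columns; each column is a dict mapping a base
    (A, C, G, T) to its probability at that position (assumed layout).
    """
    if len(pwm_a) != len(pwm_b):
        raise ValueError("PWMs must be aligned to the same length")
    return [
        {base: (col_a[base] + col_b[base]) / 2.0 for base in BASES}
        for col_a, col_b in zip(pwm_a, pwm_b)
    ]
```

If both input columns sum to one, the averaged column also sums to one, so no renormalization is needed in this simple case.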
Figure 9 An example where templates with low structure similarity were selected.
This case demonstrates that templates with low structure similarity can produce good predictions. 2H7H:A (top-left) is the template with the highest structure similarity, shown for comparison with the templates selected by SVM (1K79:A, 1K7A:A, 1GXP:B and 3G6T:B). Three of the four templates selected by SVM resulted in better Ψ-scores despite having lower structure similarity as measured by the TM-score.
This is attributed to the abundance of alpha-helices in the query structure. The query protein is shown in green and the template protein in brown.
Figure 10 An example where DBD2BS decides the final PWM prediction.
This case demonstrates how DBD2BS decides the final PWM prediction from the four predictions for “CRP_ECOLI”. The SVM model suggests four templates (1NFK:A, 1YTF:C, 1MJ2:C and 1MJO:C), which were then used for PWM prediction.
The first column shows each template and its similarity with the query protein. The second column shows the PWM predicted from that template. The third column illustrates the comparison among the templates. Two of the four predicted PWMs were highly consistent, which lends greater confidence that the two PWMs generated from templates 1MJ2:C and 1MJO:C are important.
Two concluding remarks are provided here. The DNA sequence in the selected
template is probably not the native DNA sequence to which the query protein binds.
Thus, the ability of the adopted potential function to handle mutations of the DNA
sequence embedded in the synthetic complex is critical to the success of the proposed
framework. Regarding this issue, we concluded that the selected atomic
knowledge-based potential function is generally able to predict the most favorable base
type without being affected by the original sequence present in the synthetic complex.
Three examples are shown in Figure 11 to illustrate this observation. Another
important issue related to the development of structure-based methods is their
applicability. In the PDB release of July 30, 2011, there are 114 DNA-binding proteins
that do not have native complexes but have unbound structures with potential
templates available from homologues. An unbound structure and a potential template
are paired if the sequence alignment has an e-value < 0.001 and the structure alignment
has a TM-score > 0.5. Currently, the public version of the TRANSFAC database
contains 398 annotated PWMs for 133 proteins, most of which were determined via
sequence-based methods. However, the overlap between the 114 DNA-binding
proteins, which are the targets of this study, and the 133 proteins with known PWMs is
only 16. This small overlap is consistent with the fact that the currently curated PWMs
were mainly derived by sequence-based methods. It also reveals the distinctness and
potential of structure-based methods, since the abundance of available structure
information has so far not been widely exploited to enhance our understanding of
the interactions between DNA-binding proteins and their binding sites.
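The pairing criterion and the overlap count described above amount to a threshold filter followed by a set intersection. The sketch below uses the two thresholds stated in the text; the input layout, the function names and the identifiers in the usage example are hypothetical and are not taken from the PDB or TRANSFAC.

```python
E_VALUE_MAX = 0.001   # sequence alignment threshold from the text
TM_SCORE_MIN = 0.5    # structure alignment threshold from the text


def is_potential_template(e_value, tm_score):
    """True if a candidate satisfies both pairing thresholds."""
    return e_value < E_VALUE_MAX and tm_score > TM_SCORE_MIN


def targets_with_templates(candidates):
    """Return the query proteins that have at least one potential template.

    `candidates` is an iterable of (query_protein, template_id, e_value,
    tm_score) tuples (an assumed layout).
    """
    return {
        query
        for query, _template, e_value, tm_score in candidates
        if is_potential_template(e_value, tm_score)
    }


def overlap(targets, annotated):
    """Proteins that are both prediction targets and have a known PWM."""
    return set(targets) & set(annotated)
```

For example, with made-up identifiers, `targets_with_templates([("Q1", "T1", 1e-5, 0.7), ("Q2", "T2", 0.01, 0.9)])` keeps only `Q1`, because the second candidate fails the e-value threshold.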
Figure 11 Demonstration of base substitution.
This case demonstrates the ability of the employed all-atom potential function to substitute base types when the DNA sequence in the selected template differs from the target DNA sequence to which the query protein binds. A position is marked ‘●’ if its most favorable base type was correctly predicted and ‘−’ otherwise. In addition, the symbol ‘↑’ denotes a successful substitution. The sequence shown is the DNA sequence in the selected template, where red nucleotides indicate positions whose bases differ from the most favorable base types in the annotated PWMs.