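Discussion - Using Machine Learning Algorithms to Select Appropriate Template Structures to Improve the Accuracy of Predicting Transcription Factor Binding Sequence Features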

Chapter 4. Results

4.5 Discussion

Dan et al. [42] reported that conformational changes are commonly observed in DNA-binding proteins. To understand how common conformational changes are in protein-DNA interactions and how large they usually are, we collected the available pairs of unbound and bound structures of DNA-binding proteins from the PDB database. Since a protein may have multiple unbound-bound structure pairs, we adopted a strict criterion: a protein is considered to have transitions if at least one of its unbound-bound structure pairs has transitions. The definition of a transition between the two structures of a pair is identical to that of Dan et al. (the DSSP program was used to assign secondary structures, and only segments in which the same transition was consistent for at least five consecutive residues were considered). The results show that 40.2% of the 132 proteins underwent SSE transitions (changes in secondary structure) and 53.8% underwent D2O transitions. These high ratios concur with the observations of Dan et al.
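
The segment rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function name `find_transitions`, the one-letter state alphabet, and the reading of "same transition" as an identical (unbound state, bound state) pair are all hypothetical, and the two state strings are assumed to be already aligned residue by residue.

```python
from itertools import groupby

MIN_RUN = 5  # minimum number of consecutive residues, as stated in the text


def find_transitions(unbound, bound, min_run=MIN_RUN):
    """Return (start, end, from_state, to_state) for each qualifying segment.

    `unbound` and `bound` are per-residue state strings of equal length,
    assumed to be aligned residue by residue (an assumption of this sketch).
    """
    if len(unbound) != len(bound):
        raise ValueError("state strings must have the same length")
    segments = []
    pos = 0
    # Group consecutive residues that share the same (unbound, bound) state pair.
    for (u, b), run in groupby(zip(unbound, bound)):
        length = sum(1 for _ in run)
        # Keep the run only if the state changed and the run is long enough.
        if u != b and length >= min_run:
            segments.append((pos, pos + length - 1, u, b))
        pos += length
    return segments


# Hypothetical state strings (C = coil, H = helix, E = strand).
example = find_transitions("CCCCCCHHHHEEEE", "HHHHHHHHHHEECC")
```

In this made-up example only the first six residues qualify (coil to helix); the two-residue strand-to-coil change at the end is shorter than five residues and is ignored.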

On the other hand, the RMSD values were not large: all structure pairs had an RMSD below 4 Å (data not shown). If the criterion ‘RMSD ≤ 2 Å’, generally a rigorous threshold, is taken to indicate small structural variation, 93.2% of the proteins have at least one structure pair with small structural variation. In Table 7, the ratios of proteins that underwent SSE (0.0%) and D2O (14.3%, one of the seven test cases) transitions were much lower than those of the overall distribution (40.2% SSE and 53.8% D2O transitions). The main difference between Table 7 and the analysis in this section is that in Table 7 the structure pair was selected by the structure alignment score. This suggests that, in practice, using the best structure alignment score helps to find structure pairs with few transitions for PWM prediction. When the structure pair with the best RMSD was chosen to investigate the conformational changes of a protein upon binding DNA, the ratios of proteins that underwent SSE and D2O transitions dropped to 13.8% and 39.4%, respectively. These results suggest that the proposed method will benefit the study of the large number of DNA-binding proteins that have only unbound structures in the PDB database.
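
The two protein-level criteria compared in this section (at least one pair with transitions versus only the best-RMSD pair) can be contrasted in a short sketch. The `StructurePair` class, the function names, and the RMSD values below are hypothetical and serve only to illustrate the counting rules; they are not data from this study.

```python
from dataclasses import dataclass


@dataclass
class StructurePair:
    rmsd: float           # RMSD between unbound and bound structures, in angstroms
    has_transition: bool  # whether this pair shows a transition


def has_transition_any(pairs):
    """Strict criterion: the protein counts if any of its pairs has a transition."""
    return any(p.has_transition for p in pairs)


def has_transition_best_rmsd(pairs):
    """Alternative criterion: only the pair with the lowest RMSD is inspected."""
    best = min(pairs, key=lambda p: p.rmsd)
    return best.has_transition


def has_small_variation(pairs, threshold=2.0):
    """True if at least one pair satisfies RMSD <= threshold."""
    return any(p.rmsd <= threshold for p in pairs)


# Hypothetical protein with three unbound-bound structure pairs.
pairs = [
    StructurePair(rmsd=3.1, has_transition=True),
    StructurePair(rmsd=0.9, has_transition=False),
    StructurePair(rmsd=1.7, has_transition=True),
]
```

For this made-up protein the strict criterion reports a transition, whereas the best-RMSD criterion does not, which mirrors why the ratios drop when only the best pair is considered.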

To shift the problem of structure-based PWM prediction from native complexes to unbound protein structures, the most challenging issue may be the construction of a reliable synthetic protein-DNA complex to which physics- or knowledge-based scoring functions can be applied to perform prediction. In this study, we observed that in many cases template structures with low structure similarity were selected by the SVM and then used to generate the synthetic complexes. An example is shown for ATF2_HUMAN in Figure 9, where 1K79:A, 1K7A:A, 1GXP:B and 3G6T:B are the top four templates selected by the SVM. The structure similarities (TM-scores) between the query structure (1T2K:D) and these template structures are 0.15502, 0.15535, 0.18413 and 0.22657, respectively, whereas the template with the highest structure similarity is 2H7H:A (TM-score=0.88661). Although the selected template structures had lower structure similarity, the PWM prediction performance was better. This is because the query structure (1T2K:D) and the template structure (2H7H:A) both contain many alpha-helices, and the structure alignment generated by TM-align does not perform well in this situation. In this study, the TM-score is not the only criterion for selecting appropriate template structures. The extra features improve template selection and thus improve PWM prediction performance.

In some cases, the best PWM is the third- or fourth-ranked prediction. Such cases can be identified by using DBD2BS to consider all PWMs simultaneously and decide which positions in the PWMs are important. Figure 10 shows an example in which the two PWMs generated from templates 1MJ2:C and 1MJO:C are aligned against each other. Their consistency implies that these two PWMs are more important than the others. Thus, the final PWM prediction for this protein is the position-wise average of the two PWMs.
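
The position-wise averaging described above can be sketched as follows. This is a minimal illustration, assuming the two PWMs have already been aligned to the same length and that each column is stored as a mapping from base to probability; the function name `average_pwms` and the example values are hypothetical and do not reproduce the DBD2BS implementation.

```python
BASES = "ACGT"


def average_pwms(pwm_a, pwm_b):
    """Average two aligned PWMs column by column.

    Each PWM is a list of columns; each column maps a base to its probability.
    Both PWMs are assumed to be aligned and of equal length.
    """
    if len(pwm_a) != len(pwm_b):
        raise ValueError("PWMs must be aligned to the same length")
    return [
        {base: (col_a[base] + col_b[base]) / 2 for base in BASES}
        for col_a, col_b in zip(pwm_a, pwm_b)
    ]


# Two hypothetical two-column PWMs.
pwm_1 = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
pwm_2 = [
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
final_pwm = average_pwms(pwm_1, pwm_2)
```

Because each input column sums to one, each averaged column also sums to one, so the result remains a valid probability matrix.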

Figure 9 An example where templates with low structure similarity were selected.

This case demonstrates that templates with low structure similarity can produce good predictions. 2H4H:A (top-left) is the template with the highest structure similarity, shown for comparison with the templates selected by the SVM (1K79:A, 1K7A:A, 1GXP:B and 3G6T:B). Three of the four templates selected by the SVM resulted in better Ψ-scores despite having lower structure similarity as measured by the TM-score.

This behavior arises because the query structure contains an abundance of alpha-helices. The query protein is shown in green and the template protein in brown.

Figure 10 An example where DBD2BS decides the final PWM prediction.

This case demonstrates how DBD2BS can decide the final PWM prediction from the four predictions for “CRP_ECOLI”. The SVM model suggested four templates (1NFK:A, 1YTF:C, 1MJ2:C and 1MJO:C), which were then used for PWM prediction.

The first column shows each template and its similarity to the query protein. The second column shows the PWM predicted from that template. The third column illustrates the comparison for each template. Two of the four predicted PWMs were highly consistent, which lends greater confidence that the two PWMs generated from templates 1MJ2:C and 1MJO:C are important.

Two concluding remarks are provided here. First, the DNA sequence in the selected template is probably not the native DNA sequence to which the query protein binds. Thus, the ability of the adopted potential function to handle mutations of the DNA sequence embedded in the synthetic complex is critical to the success of the proposed framework. Regarding this issue, we concluded that the selected atomic knowledge-based potential function is generally able to predict the most favorable base type without being affected by the original sequence present in the synthetic complex. Three examples are shown in Figure 11 to illustrate this observation. Second, another important issue in the development of structure-based methods is their applicability. In the PDB release of July 30, 2011, there are 114 DNA-binding proteins that do not have native complexes but have unbound structures with potential templates available from homologues. An unbound structure and a potential template are paired if the e-value of the sequence alignment is <0.001 and the TM-score of the structure alignment is >0.5. Currently, the public version of the TRANSFAC database contains 398 annotated PWMs for 133 proteins, most of which were determined via sequence-based methods. However, the overlap between the 114 DNA-binding proteins, which are the targets of this study, and the 133 proteins with known PWMs is only 16. This small overlap concurs with the fact that the currently curated PWMs were mainly contributed by sequence-based methods. It also reveals the distinctness and potential of structure-based methods, since the abundance of structure information has so far not been widely exploited to enhance our understanding of the interactions between DNA-binding proteins and their binding sites.
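
The pairing thresholds and the overlap count mentioned above amount to a simple filter and a set intersection. The sketch below assumes that e-values and TM-scores have already been computed by external tools; the function names and the identifiers in the example are hypothetical.

```python
E_VALUE_MAX = 0.001  # sequence-alignment threshold stated in the text
TM_SCORE_MIN = 0.5   # structure-alignment threshold stated in the text


def is_potential_template(e_value, tm_score):
    """True if a candidate passes both the sequence and the structure threshold."""
    return e_value < E_VALUE_MAX and tm_score > TM_SCORE_MIN


def overlap(targets, annotated):
    """Proteins that are both prediction targets and have annotated PWMs."""
    return set(targets) & set(annotated)


# Hypothetical candidates: (name, e-value, TM-score).
candidates = [("tmplA", 1e-20, 0.82), ("tmplB", 1e-5, 0.41), ("tmplC", 0.05, 0.77)]
kept = [name for name, e, tm in candidates if is_potential_template(e, tm)]
shared = overlap({"P1", "P2", "P3"}, {"P2", "P4"})
```

Only the first made-up candidate passes both thresholds; the second fails on the TM-score and the third on the e-value.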

Figure 11 Demonstration of base substitution.

This case demonstrates the ability of the employed all-atom potential function to replace the base types when the native DNA sequence in the selected template differs from the target DNA sequence to which the query protein binds. A position is marked as ‘●’ if its most favorable base type was correctly predicted, and as ‘−’ otherwise. In addition, the symbol ‘↑’ stands for a successful substitution. The sequence shown is the DNA sequence in the selected template, where red nucleotides indicate positions at which the bases differ from the most favorable base types in the annotated PWMs.

Chapter 5. Web server
