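Introduction - Using Machine Learning Algorithms to Select Appropriate Template Structures for Improving the Accuracy of Predicted Transcription Factor Binding Sequence Features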

1.1 Motivation

DNA-binding proteins are important to many biological processes in organisms. For

example, transcription factors (TFs) activate or repress gene expression by using their

DNA-binding domains (DBDs) to recognize specific nucleotide sequences in the

genome. DNA sequences that can be recognized by the same DBD are usually

characterized by a probabilistic model, called a position weight matrix (PWM), to

accommodate the variability of the TF-binding sequences. Specifically, with the profile

representation of TF binding sites (TFBSs), researchers can discover novel target genes

regulated by known TFs. Therefore, accurate prediction of such target DNA sequences

for DNA-binding proteins is an important step toward understanding many biological

processes [1-3].
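
As a minimal illustration of the PWM representation described above, the sketch below builds a column-wise probability matrix from a set of aligned binding sites and scores a candidate sequence against it. The binding sites, the pseudocount of 1, and the uniform background of 0.25 are assumptions chosen for illustration; they are not taken from this study.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Return per-position base probabilities for equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        # The pseudocount keeps unseen bases from getting probability zero.
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

def log_odds_score(pwm, seq, background=0.25):
    """Score a sequence as the sum of log2(p / background) over positions."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(seq))

# Hypothetical aligned binding sites, made up for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
consensus_score = log_odds_score(pwm, "TGACTCA")
random_score = log_odds_score(pwm, "AAAAAAA")
```

A sequence close to the consensus receives a higher log-odds score than an unrelated one, which is how a PWM can be used to scan promoters for putative target sites.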

The most widely used technique for PWM inference of a TF is to collect a set of

promoter sequences of genes known to be regulated by a particular TF

and then detect frequently observed (over-represented) subsequences from the

collection [4-8]. Such a method requires a sufficient number of sequences for pattern

discovery and is currently available only for a small number of DNA-binding proteins.

Similarly, ChIP-seq [9], currently the most promising technique for discovering TF

binding sites, is limited by its requirement for an antibody against the TF. An

alternative approach to predicting PWMs is based on analysis of protein-DNA complex

structures, which has been shown to perform well in determining which positions in a

PWM should be more conserved and not allowed to degenerate [10-12]. In this study,

we focus on structure-based approaches to complement predictions from

sequence-based approaches. The latter approaches provide relatively limited

information about how a DNA-binding protein binds to DNA. For example, when the

interaction involves multiple proteins, sequence-based approaches cannot tell how

many DBDs are required to interact with a binding site.

Though the knowledge-based potential functions [10, 12] (see Chapter 2) perform well

on native complexes in predicting target DNA sequences, this success has not been

extended to DNA-binding proteins lacking co-crystallized structures. In the 30 July

2011 release of the Protein Data Bank (PDB) [13], only 403 out of about 1300

DNA-binding proteins have protein-DNA co-crystallized structures. This reveals an

immediate need to develop PWM predictors for unbound protein structures. Such a

predictor requires constructing a putative protein-DNA complex for the given unbound

protein structure before PWM prediction. For this purpose, protein-DNA docking is

one of the feasible ways to generate protein-DNA complexes but suffers from high

computational cost [14, 15]. To overcome this disadvantage, Gao and Skolnick

recently employed an efficient way of generating protein-DNA complexes by structure

alignment [16]. This structure alignment-based technique is adopted in this study to

generate protein-DNA complexes which are then used to predict PWMs. Another

technique that can be considered for generating putative protein-DNA complexes is

homology modeling, which requires only the sequence of the query protein [11].

Inferring target DNA sequences directly from protein sequence is much more

challenging and beyond the scope of this work.
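
The structure alignment-based construction mentioned above can be pictured as applying a rigid-body transform to one structure and pairing it with the DNA of the other. The sketch below assumes that a structure alignment program has already produced a rotation matrix and a translation vector, and that the query protein is moved into the template frame; the function names, the transform, and the coordinates are hypothetical and are not the method of [16].

```python
def apply_rigid_transform(coords, rotation, translation):
    """Map each (x, y, z) point p to rotation @ p + translation."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            rotation[r][0] * x + rotation[r][1] * y + rotation[r][2] * z
            + translation[r]
            for r in range(3)
        ))
    return moved

def build_synthetic_complex(query_protein, template_dna, rotation, translation):
    """Place the query protein in the template frame and keep the template DNA."""
    return {
        "protein": apply_rigid_transform(query_protein, rotation, translation),
        "dna": list(template_dna),
    }

# Hypothetical transform: 90-degree rotation about z plus a shift along x.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [10.0, 0.0, 0.0]

# Hypothetical atom coordinates in angstroms.
query_atoms = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
template_dna_atoms = [(10.0, 1.0, 3.0)]

synthetic = build_synthetic_complex(query_atoms, template_dna_atoms, R, t)
```

Because only a matrix-vector operation per atom is needed once the alignment is known, this step is far cheaper than sampling docking poses.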

1.2 Framework of the study

This work proposes a framework of PWM prediction based on unbound protein

structures and investigates its feasibility and challenges. Given a query protein

structure and a template complex, the proposed method conducts structure alignment to

generate superimposed protein-DNA complexes. Based on the protein-DNA complex,

an atomic-level knowledge-based potential function is employed to predict PWMs to

which the query protein can bind. The work compiled a benchmark of seven

DNA-binding proteins which have annotated PWMs and structures of both

DNA-bound and unbound forms. Both forms are included so that the performance of

the potential function on native and synthetic complexes can be compared. The

experimental results show that although the synthetic complexes generated by the

proposed framework perform worse than native complexes, they perform better than

those based on homologous complexes. Potential reasons

behind the performance difference between our synthetic complexes and the native

ones were further investigated by progressively adjusting the quality of synthetic

complexes toward conditions that mimic native complexes.
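
To make the step from a potential function to a PWM concrete, the sketch below converts per-position base energies into probabilities with Boltzmann weighting. This conversion is an assumption made for illustration: the text above does not state how energies are mapped to PWM entries, and the energy values and the `kT` scale are hypothetical.

```python
import math

BASES = "ACGT"

def energies_to_pwm(energies, kT=1.0):
    """Convert per-position base energies (lower is more favorable) to probabilities."""
    pwm = []
    for column in energies:
        e_min = min(column[b] for b in BASES)  # shift energies for numerical stability
        weights = {b: math.exp(-(column[b] - e_min) / kT) for b in BASES}
        z = sum(weights.values())
        pwm.append({b: weights[b] / z for b in BASES})
    return pwm

# Hypothetical energies in arbitrary units: position 0 favors A, position 1 is neutral.
energies = [
    {"A": -2.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
pwm = energies_to_pwm(energies)
```

Under this scheme, a position where one base is energetically preferred becomes a conserved column, while a position with equal energies becomes fully degenerate.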

It was observed that for some instances, the best PWM prediction was generated by a

template protein structure with a low TM-score [17]. For example, the TM-score

between two alpha-helices is noticeably higher than that between other protein

structural elements such as beta-sheets and coils. However, the superimposed

protein-DNA complex generated by this template did not perform well. For this reason,

other features were incorporated to build a model for selecting the most appropriate

structure template. The model is based on using a support vector machine (SVM) [18]

adapted to perform regression based on the following features: (i) similarity between

the query and template proteins; (ii) the proportion of different structural protein

elements of query and template proteins as calculated by DSSP [19]; and (iii) the

number of protein and DNA residues that lie within a specified distance of each other

in the superimposed complex. The top four superimposed complexes suggested by the

SVM were selected as candidate superimposed complexes.
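
Feature (iii) and the final top-four selection can be sketched as follows. This is one possible reading of the contact feature (counting protein residues with any atom near the DNA); the 4.5 angstrom cutoff, the residue names, the template identifiers, and the scores are hypothetical, and the scores stand in for the output of a trained regressor rather than reproducing the SVM model itself.

```python
import math

def count_contact_residues(protein_residues, dna_atoms, cutoff=4.5):
    """Count residues having at least one atom within `cutoff` of any DNA atom."""
    count = 0
    for atoms in protein_residues.values():
        if any(math.dist(a, d) <= cutoff for a in atoms for d in dna_atoms):
            count += 1
    return count

def top_k_templates(predicted_scores, k=4):
    """Return the k template ids with the highest predicted score."""
    return sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:k]

# Hypothetical superimposed complex: residue id -> list of atom coordinates.
residues = {
    "ARG10": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "GLY11": [(20.0, 0.0, 0.0)],
    "LYS12": [(0.0, 3.0, 0.0)],
}
dna = [(0.0, 0.0, 4.0)]
n_contacts = count_contact_residues(residues, dna)

# Hypothetical regressor outputs, where a higher score means a better expected PWM.
scores = {"tplA": 0.2, "tplB": 0.9, "tplC": 0.5, "tplD": 0.7, "tplE": 0.1}
candidates = top_k_templates(scores)
```

In this toy input only ARG10 lies within the cutoff, and the four highest-scoring templates are kept as candidates in descending order of score.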

In this study, the synthetic complexes selected using the SVM approach are compared

with those based on protein-DNA docking. The results show that the proposed

framework performs comparably to the docking-based approach and is much more efficient.

1.3 Web server - DBD2BS

DBD2BS is a web server that provides an automatic and integrated platform for these

procedures, helping researchers analyze protein-DNA interactions. A list of 1,066 DBD-DNA

complexes (including 1,813 protein chains) is compiled for use as the template

database. For a given DBD-DNA complex, the DBD2BS employs an atomic-level

knowledge-based potential function to infer PWMs. For protein structures without

existing co-crystallized complexes, the DBD2BS conducts structure alignment to

synthesize the bound state of the query structure and then performs PWM prediction

based on the synthetic DBD-DNA complexes. The DBD2BS also provides users with

an easy-to-use interface for visualizing the PWMs predicted based on different

templates and the spatial relationships of the query protein, the DBDs and the DNA.

The kernel of the proposed method, which makes predictions based on a given pair of

an unbound structure (query) and a user-specified complex (template), is released

along with this study as a Linux executable

(http://mbi.ee.ncku.edu.tw/res/Chen_2011/).
