1.1 Motivation
DNA-binding proteins are important to many biological processes in organisms. For
example, transcription factors (TFs) activate or repress gene expression by using their
DNA-binding domains (DBDs) to recognize specific nucleotide sequences in the
genome. DNA sequences that can be recognized by the same DBD are usually
characterized by a probabilistic model, called a position weight matrix (PWM), to
accommodate the variability of the TF-binding sequences. Specifically, with the profile
representation of TF binding sites (TFBSs), researchers can discover novel target genes
regulated by known TFs. Therefore, accurate prediction of such target DNA sequences
for DNA-binding proteins is an important step to understanding many biological
processes [1-3].
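To make the PWM representation concrete, the following minimal Python sketch builds a position probability matrix from a set of aligned binding sites and computes the information content of a column. The example sites, the pseudocount of 1 and the uniform background are assumptions chosen for illustration; they are not data or parameters from this study.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Return a list of per-position dicts mapping base -> probability."""
    if not sites:
        raise ValueError("at least one binding site is required")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all binding sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no probability is zero.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def information_content(column):
    """Information content in bits of one column, assuming a uniform background."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)

# Hypothetical aligned binding sites, made up for illustration only.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
example_pwm = build_pwm(example_sites)
```

Columns with higher information content correspond to positions that are more conserved across the binding sites, while degenerate positions score lower.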
The most widely used technique for inferring the PWM of a TF is to collect a set of
promoter sequences from genes known to be regulated by that TF
and then detect frequently observed (over-represented) subsequences from the
collection [4-8]. Such a method requires a sufficient number of sequences for pattern
discovery and is currently available only for a small number of DNA-binding proteins.
Similarly, the most promising current technique for discovering TF binding sites,
ChIP-seq [9], is potentially limited by its need for an antibody specific to the TF. An
alternative approach to predicting PWMs is based on analysis of protein-DNA complex
structures, which has been shown to perform well in determining which positions in a
PWM should be more conserved and not allowed to degenerate [10-12]. In this study
we focus on the structure-based approaches to complement predictions from
sequence-based approaches. The latter approaches provide relatively limited
information about how a DNA-binding protein binds to DNA. For example, when the
interaction involves multiple proteins, sequence-based approaches cannot tell how
many DBDs are required to interact with a binding site.
Though the knowledge-based potential functions [10, 12] (see Chapter 2) perform well
on native complexes in predicting target DNA sequences, this success has not been
extended to DNA-binding proteins lacking co-crystallized structures. In the 30 July
2011 release of the Protein Data Bank (PDB) [13], only 403 of about 1,300
DNA-binding proteins have protein-DNA co-crystallized structures. This reveals an
immediate need to develop PWM predictors for unbound protein structures. Such a
predictor requires constructing a putative protein-DNA complex for the given unbound
protein structure before PWM prediction. For this purpose, protein-DNA docking is
one of the feasible ways to generate protein-DNA complexes but suffers from high
computational cost [14, 15]. To overcome this disadvantage, Gao and Skolnick
recently employed an efficient way of generating protein-DNA complexes by structure
alignment [16]. This structure alignment-based technique is adopted in this study to
generate protein-DNA complexes which are then used to predict PWMs. Another
technique that can be considered for generating putative protein-DNA complexes is
homology modeling, which requires only the sequence of the query protein [11].
Inferring target DNA sequences directly from protein sequence is much more
challenging and beyond the scope of this work.
1.2 Framework of the study
This work proposes a framework of PWM prediction based on unbound protein
structures and investigates its feasibility and challenges. Given a query protein
structure and a template complex, the proposed method conducts structure alignment to
generate superimposed protein-DNA complexes. Based on the protein-DNA complex,
an atomic-level knowledge-based potential function is employed to predict PWMs to
which the query protein can bind. The work compiled a benchmark of seven
DNA-binding proteins which have annotated PWMs and structures of both
DNA-bound and unbound forms. Both forms are included for the purpose of
comparing the performance of the potential function applied to the native and
synthetic complexes. The experimental results show that although the performance of
the synthetic complexes generated by the proposed framework is worse than that of native
complexes, it is better than that of predictions based on homologous complexes. Potential reasons
behind the performance difference between our synthetic complexes and the native
ones were further investigated by progressively adjusting the quality of synthetic
complexes toward conditions that mimic native complexes.
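The superimposition step of the framework can be sketched as follows, assuming the structure alignment program reports a rotation matrix and a translation vector that map the query protein onto the template protein. The function names and the coordinate representation are hypothetical; this is an illustration of the idea under those assumptions, not the implementation used in this work.

```python
def apply_rigid_transform(coords, rotation, translation):
    """Map each (x, y, z) point p to rotation * p + translation."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + t
            for row, t in zip(rotation, translation)
        ))
    return moved

def build_synthetic_complex(query_atoms, template_dna_atoms, rotation, translation):
    """Place the query protein in the template frame and pair it with the template DNA.

    The template DNA coordinates are kept unchanged; only the query protein moves.
    """
    return {
        "protein": apply_rigid_transform(query_atoms, rotation, translation),
        "dna": list(template_dna_atoms),
    }

# Hypothetical transform: 90 degree rotation about the z axis plus a shift.
example_rotation = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
example_translation = (1, 2, 3)
example_complex = build_synthetic_complex(
    [(1, 0, 0)], [(5, 5, 5)], example_rotation, example_translation
)
```

The resulting synthetic complex combines the transformed query protein with the DNA of the template and can then be passed to a scoring function in the same way as a native complex.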
It was observed that for some instances, the best PWM prediction was generated by a
template protein structure with a low TM-score [17]. For example, the TM-score
between two alpha-helices is noticeably higher than that between other protein
structural elements such as beta-sheets and coils. However, the superimposed
protein-DNA complex generated by this template did not perform well. For this reason,
other features were incorporated to build a model for selecting the most appropriate
structure template. The model uses a support vector machine (SVM) [18]
adapted to perform regression on the following features: (i) similarity between
the query and template proteins; (ii) the proportions of different secondary
structure elements in the query and template proteins as calculated by DSSP [19]; and (iii) the
numbers of protein and DNA residues in the superimposed complex that lie within a
specified distance. The top four superimposed complexes suggested by the SVM were
selected as candidates.
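Feature (iii) can be illustrated with a short Python sketch that counts the protein residues and the nucleotides lying within a distance cutoff of each other in a superimposed complex. The 4.5 angstrom cutoff, the dictionary layout and the function name are assumptions made for this example; the cutoff used in the study is not stated here.

```python
import math

def count_interface_residues(protein_residues, dna_residues, cutoff=4.5):
    """Return (n_protein, n_dna): residues with any atom within `cutoff` of the partner.

    Both arguments map a residue identifier to a list of (x, y, z) atom coordinates.
    The default cutoff of 4.5 angstroms is an assumed value for illustration.
    """
    def near(atoms, others):
        return any(math.dist(a, b) <= cutoff for a in atoms for b in others)

    dna_atoms = [a for atoms in dna_residues.values() for a in atoms]
    protein_atoms = [a for atoms in protein_residues.values() for a in atoms]
    n_protein = sum(1 for atoms in protein_residues.values() if near(atoms, dna_atoms))
    n_dna = sum(1 for atoms in dna_residues.values() if near(atoms, protein_atoms))
    return n_protein, n_dna

# Hypothetical coordinates, made up for illustration only.
example_protein = {"A10": [(0.0, 0.0, 0.0)], "A11": [(10.0, 0.0, 0.0)]}
example_dna = {"C1": [(3.0, 0.0, 0.0)], "C2": [(30.0, 0.0, 0.0)]}
```

This brute-force comparison is quadratic in the number of atoms, which is acceptable for a single complex; a spatial grid or k-d tree would be a natural replacement when scoring many templates.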
In this study, the synthetic complexes selected using the SVM approach are compared
with those based on protein-DNA docking. The results show that the proposed
framework performs comparably to docking and is much more efficient.
1.3 Web server - DBD2BS
By providing an automatic and integrated platform for these procedures, this web
server helps researchers analyze protein-DNA interactions. A list of 1,066 DBD-DNA
complexes (including 1,813 protein chains) is compiled for use as the template
database. For a given DBD-DNA complex, the DBD2BS employs an atom-level
knowledge-based potential function to infer PWMs. For protein structures without
existing co-crystallized complexes, the DBD2BS conducts structure alignment to
synthesize the bound state of the query structure and then performs PWM prediction
based on the synthetic DBD-DNA complexes. The DBD2BS also provides users with
an easy-to-use interface for visualizing the PWMs predicted based on different
templates and the spatial relationships of the query protein, the DBDs and the DNA.
The kernel of the proposed method, which makes predictions based on a given pair of
an unbound structure (query) and a user-specified complex (template), is released
along with this study as a Linux executable
(http://mbi.ee.ncku.edu.tw/res/Chen_2011/).