Evolutionary conservation and interacting preference for identifying Protein-DNA interactions


$$S_M = \sum_{i \in CR} \mathrm{BLOSUM62}(d_i, m_i) \qquad \text{(Eq. 2.4.1.1)}$$

where CR is the set of contact residues between D and M; d_i and m_i denote the corresponding ith contact residues of D and M, respectively. Here, the score of a misaligned residue is -4, which is the smallest value in the BLOSUM62 matrix.

2.4.2 Evolutionary conservation and interacting preference for identifying Protein-DNA interactions

Figure 2.4.2.1 shows the scheme of our proposed scoring function for identifying DNA-binding domains and predicting protein-DNA interactions. First, we compiled 1204 protein-DNA complex structures from the Protein Data Bank (PDB) [91]. Second, these complex structures were used as templates to identify potential DNA-binding proteins/domains. Third, the DNA-contact residues of these complexes were identified from the geometry of the structures. For a given template structure T and a query protein sequence/structure P, we obtained the alignment of T and P using sequence/structure alignment tools. We then proposed a scoring function to quantitatively evaluate the functional similarity between T and P, based on the conservation score of the DNA-contact residues and the interaction scores between the contacting residues (protein side) and bases (DNA side) of the template T. The following subsections describe each step in detail.

Figure 2.4.2.1. The scoring schema using the evolutionary conservation and interacting preference.

Template preparation

Protein-DNA complex structures solved by X-ray crystallography (resolution > 3.0 Å) and NMR were obtained from the December 2007 release of the PDB. Following Luscombe et al. [122], we selected 1043 complexes by excluding single-strand binding complexes and complexes with fewer than four DNA bases. For each protein-DNA complex in this selected set, we identified the contact residues of the DNA-binding protein, defined as residues with any heavy atom within 4.5 Å of any heavy atom of the bound DNA. These DNA-contacting residues are considered the core part of a DNA-binding protein. To obtain a reasonably extensive protein-DNA interface, a DNA-binding protein is required to have more than five contact residues and more than 50 residues in total. The residue-base interacting pairs were also obtained from each protein-DNA complex: a residue R and a DNA base B are defined as an interacting pair if any heavy atom of R is within 4.5 Å of any heavy atom of B.
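The distance criterion above can be illustrated with a minimal sketch. The input format (lists of heavy-atom records with an identifier and Cartesian coordinates) and the function names are assumptions made for illustration; they are not the pipeline used in this work, and parsing of PDB files is left out.

```python
import math

CUTOFF = 4.5  # contact distance in Å, as given in the text


def contact_residues(protein_atoms, dna_atoms, cutoff=CUTOFF):
    """Return sorted residue IDs with any heavy atom within `cutoff` of any DNA heavy atom.

    protein_atoms: list of (residue_id, (x, y, z)) heavy-atom records (assumed format)
    dna_atoms:     list of (base_id, (x, y, z)) heavy-atom records (assumed format)
    """
    contacts = set()
    for res_id, p_xyz in protein_atoms:
        if res_id in contacts:
            continue  # residue already known to be in contact
        for _base_id, d_xyz in dna_atoms:
            if math.dist(p_xyz, d_xyz) <= cutoff:
                contacts.add(res_id)
                break
    return sorted(contacts)


def interacting_pairs(protein_atoms, dna_atoms, cutoff=CUTOFF):
    """Return sorted (residue_id, base_id) pairs whose heavy atoms lie within `cutoff`."""
    pairs = set()
    for res_id, p_xyz in protein_atoms:
        for base_id, d_xyz in dna_atoms:
            if math.dist(p_xyz, d_xyz) <= cutoff:
                pairs.add((res_id, base_id))
    return sorted(pairs)


# Toy coordinates, invented for illustration only.
protein = [("A10", (0.0, 0.0, 0.0)), ("A10", (1.0, 0.0, 0.0)), ("A11", (10.0, 0.0, 0.0))]
dna = [("B1", (4.0, 0.0, 0.0)), ("B2", (20.0, 0.0, 0.0))]
print(contact_residues(protein, dna))   # ['A10']
print(interacting_pairs(protein, dna))  # [('A10', 'B1')]
```

The all-against-all loop is quadratic in the number of atoms; for a library of about a thousand complexes a spatial index would be a natural replacement, but the definition of a contact stays the same.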

Alignment Tools

For a given protein template T and a query protein sequence/structure P, we obtained the alignment as follows. If P is a protein structure, we used the structure alignment tool CE [132] to align T and P. CE returns a Z score for the alignment that represents the structural similarity of the two structures. P is considered a homologous protein of T if the Z score exceeds 3.7, based on CE's statistical model. If P is a protein sequence, we applied the sequence alignment tool FASTA [126] to align the two proteins (i.e. T and P). P is then considered a homologous protein of T if the sequence identity exceeds 25%, according to the observations of previous studies [66, 128, 130].
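The two homology criteria can be summarized as a small decision function. This is only a sketch: the function name and argument layout are hypothetical, and the CE Z score and FASTA sequence identity are assumed to have been computed beforehand by the respective tools.

```python
CE_Z_CUTOFF = 3.7            # CE Z-score threshold from the text
SEQ_IDENTITY_CUTOFF = 25.0   # FASTA sequence identity threshold (percent) from the text


def is_homolog(query_kind, ce_z=None, seq_identity=None):
    """Decide whether query P is treated as a homolog of template T.

    query_kind:   "structure" (use the CE Z score) or "sequence" (use FASTA identity)
    ce_z:         Z score reported by CE for the T-P structure alignment
    seq_identity: percent identity reported for the T-P sequence alignment
    """
    if query_kind == "structure":
        if ce_z is None:
            raise ValueError("a CE Z score is required for a structure query")
        return ce_z > CE_Z_CUTOFF
    if query_kind == "sequence":
        if seq_identity is None:
            raise ValueError("a sequence identity is required for a sequence query")
        return seq_identity > SEQ_IDENTITY_CUTOFF
    raise ValueError("query_kind must be 'structure' or 'sequence'")
```

Both comparisons are strict, following the wording "exceeds" in the text.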

Scoring function

For a homologous protein P of a template T (i.e. the alignment of P and T satisfies the criteria above), we used three scoring methods to calculate the score of P based on the aligned contact residues of T. These methods, namely the consensus score, the interaction score, and the combination score, are described in the following subsections.

Consensus Score

We calculate the consensus scores of P based on aligned contact residues of T. The BLOSUM62 matrix [92] is applied here to evaluate the change of contact residues. The consensus scoring function is defined as

$$S_{cons} = \sum_{i \in CR} \mathrm{BLOSUM62}(t_i, p_i) \qquad \text{(Eq. 2.4.2.1)}$$

where CR is the set of contact residues between T and P; t_i and p_i denote the ith contact residue of T and its corresponding aligned residue of P, respectively. Here, the score of a misaligned residue is -4, which is the smallest value in the BLOSUM62 matrix.
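As an illustration of the consensus score, the following sketch sums substitution scores over the aligned contact residues and applies the -4 penalty to gaps. The function name, the gap symbol, and the dictionary-based matrix are assumptions; the small excerpt of matrix entries is included only so that the example runs, and real use should load the complete BLOSUM62 matrix from a trusted source.

```python
GAP = "-"              # assumed symbol for a contact residue aligned to a gap
MISALIGNED_SCORE = -4  # score of a misaligned residue, as stated in the text


def consensus_score(template_contacts, aligned_query, subst):
    """Sum substitution scores over aligned contact residues.

    template_contacts: contact residues t_i of T (one-letter codes)
    aligned_query:     residue p_i of P aligned to each t_i, or GAP
    subst:             dict mapping (a, b) to a substitution score
    """
    if len(template_contacts) != len(aligned_query):
        raise ValueError("both inputs must cover the same contact positions")
    total = 0
    for t, p in zip(template_contacts, aligned_query):
        if p == GAP:
            total += MISALIGNED_SCORE
        else:
            # The matrix is symmetric, so accept either key order.
            total += subst[(t, p)] if (t, p) in subst else subst[(p, t)]
    return total


# A few entries believed to match BLOSUM62; not the full matrix.
BLOSUM62_EXCERPT = {("R", "R"): 5, ("K", "K"): 5, ("R", "K"): 2, ("N", "N"): 6}

# R->K scores 2, K->K scores 5, N->gap scores -4.
print(consensus_score("RKN", "KK-", BLOSUM62_EXCERPT))  # 3
```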

Interaction Score

The interaction score is obtained by the following steps. For every contact residue-base pair between the protein and the DNA in template T, we first replace the residue of the pair with its aligned residue in P. We then use the knowledge-based scoring matrix M, proposed by Margalit and co-workers [116] to measure the preference between residues and DNA bases, to score the binding affinity between the target protein P and the DNA based on template T (Figure 2.4.2.1). Finally, the interaction score is given as

$$S_{int} = \sum_{i \in \text{contact pairs}} M(R_i) \qquad \text{(Eq. 2.4.2.2)}$$

where R_i is the ith contact pair in P and M(R_i) is its preference score in matrix M. When a contact residue is aligned to a gap, we use the smallest score in M (-3.93) as its score.
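The interaction score can be sketched in the same way. The data layout (template contact pairs keyed by residue position, and an alignment dictionary from template position to the residue type in P) is an assumption, and the preference values in the example are invented placeholders rather than entries of the published matrix; only the gap score of -3.93 comes from the text.

```python
GAP = "-"          # assumed symbol for a template residue aligned to a gap
GAP_SCORE = -3.93  # smallest score in M, used for gapped contact residues (from the text)


def interaction_score(template_pairs, alignment, pref):
    """Sum residue-base preferences after substituting P's residues into T's contact pairs.

    template_pairs: list of (template_residue_position, base_type) contact pairs in T
    alignment:      dict mapping template position to the aligned residue type in P, or GAP
    pref:           dict mapping (residue_type, base_type) to a preference score
    """
    total = 0.0
    for pos, base in template_pairs:
        res = alignment.get(pos, GAP)  # an unaligned position is treated as a gap
        total += GAP_SCORE if res == GAP else pref[(res, base)]
    return total


# Placeholder preferences, NOT values from the published matrix.
toy_pref = {("R", "G"): 1.0, ("K", "G"): 0.5, ("N", "A"): 0.8}
pairs = [(10, "G"), (12, "G"), (15, "A")]
aligned = {10: "R", 12: "K", 15: "-"}
score = interaction_score(pairs, aligned, toy_pref)  # 1.0 + 0.5 - 3.93
```

A residue that contacts several bases contributes one term per pair, so heavily contacting positions weigh more in the sum.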

Combination Score

The combination score, which is the linear combination of the consensus score and the interaction score, is given as

$$S_{combination} = \omega_1 \cdot S_{cons} + \omega_2 \cdot S_{int} \qquad \text{(Eq. 2.4.2.3)}$$

where ω₁ and ω₂ are the weights of the consensus and interaction scores, respectively. Here, we set both ω₁ and ω₂ to 1.

2.5 Result

2.5.1 Evolutionary conservation of DNA-contact residues in DNA-binding domains

Given a query domain, our method identifies similar DNA-binding structures or homologous protein sequences from the template library. To evaluate the performance of our method, we generated a positive set and a negative set for each DNA-contact domain (D) in the template library. The positive set contains the domains similar to D according to SCOP, whereas the negative set contains domains that are not. Applying our method to these two sets, we found that the scores of the domains in the positive set are significantly higher than those of the domains in the negative set. We further determined a threshold that achieves high precision and recall. Using this threshold, we applied our method to 66 known SCOP families of DNA-binding domains and 250 non-DNA-binding proteins to examine its performance.

Positive and negative set for each contact domain

We collected DNA-binding contact domains from the SCOP database; the details are described in the Methods section. To remove redundant contact domains, domains with highly similar sequences (identity > 90%) were grouped using the NCBI software BLASTCLUST. In each group, the domain with the maximal number of contact residues was chosen as the representative domain. For a representative domain R, the protein domains in the same SCOP family are considered members of R according to SCOP95 (members whose similarity is greater than 95% are excluded). Each member of R was aligned to R using CE. We define a residue of R as misaligned if it is aligned to a gap. A family member is discarded if more than 20% of the contact residues of R are misaligned between R and this member. Family members that satisfy the above criteria form the positive set. If there are fewer than five members in the positive set of R, the entire family of R is discarded. This procedure yielded 66 representative domains with corresponding positive sets. For each R, we artificially generated 1000 domains as the negative set: each artificial domain copies the residues of R, and the residue type of each contact residue of R is then randomly mutated.
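The construction of the negative set can be sketched as follows. The function names are hypothetical, contact residues are given as zero-based positions, and one detail is an assumption: each contact residue is mutated to a residue type different from the original, whereas the text does not state whether a draw of the same type is allowed.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def make_decoy(sequence, contact_positions, rng):
    """Copy `sequence` and replace every contact residue with a different residue type."""
    residues = list(sequence)
    for pos in contact_positions:
        # Assumption: the mutated type must differ from the original one.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


def make_negative_set(sequence, contact_positions, n=1000, seed=0):
    """Generate `n` artificial domains for one representative domain R."""
    rng = random.Random(seed)  # seeded so that the set is reproducible
    return [make_decoy(sequence, contact_positions, rng) for _ in range(n)]


# Toy representative domain with three contact positions (invented for illustration).
decoys = make_negative_set("ACDEFGHIKL", [1, 4, 7], n=1000, seed=1)
```

Non-contact positions are left untouched, so any score difference between R and its decoys comes from the contact residues alone.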

Determining the threshold of similar DNA-binding function of a contact domain

For each representative domain R, each member of the positive and negative sets was scored by the method we developed. Ideally, the scores of the domains in the positive set should be, on average, significantly higher than those of the negative set. We used the Kolmogorov-Smirnov (KS) test to examine this criterion. The KS test is a nonparametric test that determines whether two distributions differ significantly. According to our results, the scores of the positive set and the negative set differ significantly for most domains (97% of the 66 sets have a p-value less than 0.05).
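For readers unfamiliar with the KS test, the sketch below computes the two-sample KS statistic D, the largest gap between the empirical cumulative distributions of two score sets. It is written in plain Python for illustration and does not compute a p-value; in practice a statistics library routine such as `scipy.stats.ks_2samp` would normally be used for that. The function name and the toy scores are assumptions.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Return the two-sample KS statistic D = max |F_a(x) - F_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect.bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy scores, invented for illustration: fully separated sets give D = 1.
positive_scores = [40, 42, 55]
negative_scores = [-10, -3, 5, 8]
print(ks_statistic(positive_scores, negative_scores))  # 1.0
```

D ranges from 0 (identical empirical distributions) to 1 (no overlap); the p-value then depends on D and on the two sample sizes.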

Further, given a contact domain, we would like to determine a score threshold above which a domain is considered to have a similar DNA-binding function. For the two sets (positive and negative) of a representative domain, we separately transform all members' scores to z-scores by

$$Z = \frac{s - \mu}{\delta}, \qquad \text{(Eq. 2.5.1.1)}$$

where s is the score of a member, μ is the mean score of these two sets, and δ is the standard deviation. Figures 2.5.1.1(A) and (B) show the precision (the ratio of retrieved true positives to all retrieved data) and the recall (the ratio of retrieved true positives to all true positives) at various z-score thresholds, respectively.

As shown in Figure 2.5.1.1(A), when the threshold is set greater than two, the precisions obtained with different thresholds are very similar (>90%). If the z-score threshold is set to one, only 60% of the families reach high precision. These results imply that larger thresholds yield higher precision, but the benefit is limited once the threshold exceeds two. Conversely, as shown in Figure 2.5.1.1(B), larger thresholds reduce the recall. Based on these results, we set the z-score threshold to 2.0, and domains with a z-score higher than this threshold are considered putative DNA-binding domains.
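The threshold analysis can be made concrete with a short sketch that converts raw scores to z-scores and evaluates precision and recall at a given threshold. Two points are assumptions: μ and δ are taken over the pooled positive and negative scores, following the wording "mean score of these two sets", and the population standard deviation is used because the text does not specify the estimator. The function names and toy scores are invented.

```python
from statistics import mean, pstdev


def to_z_scores(scores):
    """Transform raw scores to z-scores using their own mean and standard deviation."""
    mu = mean(scores)
    sd = pstdev(scores)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sd for s in scores]


def precision_recall(pos_scores, neg_scores, z_threshold=2.0):
    """Return (precision, recall) when domains with z > z_threshold are retrieved."""
    z = to_z_scores(list(pos_scores) + list(neg_scores))  # pooled normalization
    z_pos, z_neg = z[:len(pos_scores)], z[len(pos_scores):]
    tp = sum(v > z_threshold for v in z_pos)  # retrieved true positives
    fp = sum(v > z_threshold for v in z_neg)  # retrieved negatives
    retrieved = tp + fp
    precision = tp / retrieved if retrieved else None  # undefined if nothing is retrieved
    recall = tp / len(z_pos) if z_pos else None
    return precision, recall


# Toy example: two high-scoring family members against twenty decoys.
print(precision_recall([10, 10], [0] * 20, z_threshold=2.0))  # (1.0, 1.0)
```

Raising the threshold can only remove retrieved domains, which is why recall never increases with the threshold while precision usually does.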

Figure 2.5.1.1. Precision and recall at different z-score thresholds. Results of our method at different z-score thresholds for the 66 representative domains: the distributions of the numbers of families by (A) precision and (B) recall.

Non-DNA-binding proteins

We further applied our method to 250 non-nucleic-acid-binding (non-DNA-binding) proteins, which were initially studied by Hobohm and Sander [133] and further specified by Stawiski et al. [105]. We aligned all non-redundant contact domains to these non-DNA-binding proteins using CE. Alignments with a CE z-score greater than 3.7 and a misalignment rate of contact residues below 20% were chosen as non-DNA-binding domains. Among the 250 proteins, 177 non-DNA-binding domains passed these constraints. We applied our method to these non-DNA-binding domains and transformed their scores to z-scores. Figure 2.5.1.2 shows the distribution of the z-scores of the non-DNA-binding domains. The scores approximately follow a normal distribution, and the density peaks at Z between -1 and 0. Given a z-score threshold, the false positive rate is the ratio of the number of domains whose z-scores exceed the threshold to the total number of non-DNA-binding domains. With the threshold of 2.0 chosen in the previous analysis, the false positive rate is less than 0.05. This shows that our method can recognize non-DNA-binding domains as non-binding with high accuracy.

Figure 2.5.1.2. Distribution of z-score values of 177 non-DNA-binding domains.

2.5.2 Evolutionary conservation and interacting preference for identifying Protein-DNA interactions

Identifying DNA-binding domains

Proteins operate in biological processes through their functional domains, and domains of the same family usually have similar functions. We applied our scoring functions to identify family members of a DNA-binding protein/domain. For each crystal structure of a protein-DNA complex, we identified DNA-binding domains based on the domain definitions of the Structural Classification of Proteins (SCOP, version 1.71) [121]. To create a non-redundant and reasonable DNA-binding set for evaluation, we first selected the domains that have at least 50 residues and more than five contact residues. To remove redundant DNA-binding domains, we applied the NCBI software BLASTCLUST to cluster highly similar sequences (sequence identity > 90%) into one group. In each group, the DNA-binding domain with the maximal number of contact residues was selected as the representative domain. This yielded 69 representative DNA-binding domains.

The family members (according to the classification of the SCOP database) are aligned to their representative domain. Based on our observations, two protein-DNA interfaces are often different if 20% of their contact residues are misaligned. We therefore discarded members with more than 20% misaligned contact residues. Each aligned member is scored by our scoring methods.

To show the statistical significance of the scores, we created 10,000 random domains for each representative domain by randomly mutating all contact residues of the representative domain. We then translated the scores of the family members to Z-scores by

$$Z = \frac{s - \mu}{\delta}, \qquad \text{(Eq. 2.5.2.1)}$$

where s is the score of a member, and μ and δ are the mean and the standard deviation, respectively, of the scores of the 10,000 random domains. Figure 2.5.2.1 shows the distribution of Z-scores for our scoring method. More than 80% of the members have statistically significant scores (Z > 2) against the random sets. This result indicates that the combination of the consensus and interaction scoring functions is statistically meaningful.
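In contrast to a pooled normalization, the Z-score here is computed against the random background only. A minimal sketch, assuming the scores of the random domains are already available as a list and using the population standard deviation (the estimator is not specified in the text); the function names and toy numbers are invented:

```python
from statistics import mean, pstdev


def z_against_background(score, background_scores):
    """Z-score of one member's score relative to the scores of the random domains."""
    mu = mean(background_scores)
    sd = pstdev(background_scores)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sd


def fraction_significant(member_scores, background_scores, z_threshold=2.0):
    """Fraction of family members whose Z-score exceeds `z_threshold`."""
    if not member_scores:
        raise ValueError("no member scores given")
    hits = sum(z_against_background(s, background_scores) > z_threshold
               for s in member_scores)
    return hits / len(member_scores)


# Toy background with mean 1 and standard deviation 1.
background = [0, 2, 0, 2]
print(z_against_background(4, background))  # 3.0
```

Because the family members do not enter μ and δ, adding or removing members leaves the Z-scores of the remaining ones unchanged.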

Figure 2.5.2.1. The distribution of Z-score of our scoring method. (Histogram: x-axis, Z-score in bins from Z < 0 up to Z > 10; y-axis, family members (%).)

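From the document: Computational Studies of Biological Systems from Sequence to Structure and Function, Subproject 3: Inferring Protein Structure Using RNA Structure Prediction and RNA-Protein Interaction Analysis (III) (pages 48-56)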