Chapter 2. Materials and Methods

2.3 Methods

2.3.6 Profile score calculation

We use profiles to search protein sequences for matching segments. The search window has the same length as the profile and shifts by one residue at a time. A protein sequence of length N searched with a pattern of length n therefore yields N-(n-1) profile scores, and we take the segment with the highest profile score as the pattern candidate for that sequence.

The scoring function is as follows:

$$S = \frac{1}{n}\sum_{p=1}^{n}\sum_{i=1}^{20} PF_{pi} \qquad (6)$$

where S is the profile score, n is the length of the pattern, and PF_{pi} is the profile value of amino acid type i at position p. The score is 1 when a segment matches the profile perfectly.
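The window search and Eq. (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MuLiSA implementation: the profile is assumed to be stored as one dictionary per pattern position, mapping amino acid letters to profile values, and only the residue actually observed at position p is assumed to contribute to the inner sum. The function names and the toy profile are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed representation: one {amino acid: profile value} dict per pattern position.
Profile = List[Dict[str, float]]


def segment_score(segment: str, profile: Profile) -> float:
    """Average profile value of the residues in one window (Eq. 6, assuming only
    the observed residue at each position contributes)."""
    n = len(profile)
    if len(segment) != n:
        raise ValueError("segment length must equal profile length")
    return sum(profile[p].get(res, 0.0) for p, res in enumerate(segment)) / n


def window_scores(sequence: str, profile: Profile) -> List[float]:
    """Score every window; a sequence of length N yields N - (n - 1) scores."""
    n = len(profile)
    return [segment_score(sequence[s:s + n], profile)
            for s in range(len(sequence) - n + 1)]


def best_segment(sequence: str, profile: Profile) -> Tuple[int, str, float]:
    """Return (start index, segment, score) of the highest-scoring window."""
    scores = window_scores(sequence, profile)
    if not scores:
        raise ValueError("sequence is shorter than the profile")
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, sequence[start:start + len(profile)], scores[start]


# Hypothetical 3-position profile, for illustration only.
toy_profile: Profile = [{"G": 1.0}, {"K": 0.5, "R": 0.5}, {"S": 1.0}]
```

Calling `best_segment("AGKSA", toy_profile)` scores the three windows AGK, GKS and KSA and returns the window GKS, which starts at index 1.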

Chapter 3

Protein Function Prediction and Conservation Residues Identification of ATP-, ADP-, and HEM-Binding Proteins

To exploit the wealth of information present in protein structures, we analyzed conservation residues and patterns in multiple ligand-bound structure alignments.

To infer a major functional role from residue conservation, function-based clustering is necessary before conservation residues are identified. Statistically, conservation can be biased when the data are too few or unconvincing. This is why we remove structures that are too similar (the redundant domains), select an alignment center domain C, and generate alignments only for clusters that have more than four protein domains.

Most sequence and structure alignment techniques are protein-based; in other words, they analyze residue conservation only by comparing protein structure or protein sequence similarity. However, local alignment errors sometimes occur in sequence alignment when the sequence identity is below 25%, or in structure alignment when protein structures are highly similar in regions far from the functionally important region.

So far, we have applied MuLiSA to ATP-, ADP-, and HEM-binding proteins and identified several conservation residues and pattern candidates. We generated sequence profiles from the multiple alignments and used them to find protein sequences that may match these profiles. We also show that MuLiSA performs better than other tools in several cases and can uncover functional information when compared with the SCOP [19] and PROSITE [2] databases. Our main aim was to extract structural information from ligand-binding proteins and apply it to protein function prediction. Table 1 shows some statistics about the dataset used in this study. We applied MuLiSA to three kinds of ligand-binding proteins: ATP-binding, ADP-binding, and HEM-binding proteins. After obtaining ligand-binding protein lists, selecting ligand-binding domains, clustering domains, removing redundant domains, and selecting the alignment center C, we used MuLiSA and the z-score of the entropy calculation to identify conservation residues and pattern candidates for each cluster. These conservation residues may be functionally important; our literature survey confirms that some of them are critical to ligand binding or correlate with conformational stability. After identifying pattern candidates, we generate profiles of them and use these profiles to predict protein functions.

3.1 ATP-binding proteins

3.1.1 Overview

ATP (adenosine triphosphate) is the major energy currency of the cell. It transfers energy from chemical bonds to the endergonic reactions of the cell. ATP powers most energy-consuming cellular activities, such as muscle contraction, polysaccharide synthesis, active transport of ions, and nerve impulses. Because ATP is such an important compound and because a large amount of experimental data is available, including ATP-binding protein structures and literature, we chose ATP-binding proteins as our first research target. We generated a structure similarity matrix of non-redundant ATP-binding domains for function-based domain clustering, and we identified conservation residues and pattern candidates.

Finally, we used profiles of the pattern candidates for protein function prediction.

3.1.2 Structure similarity matrix and alignment center C selection

Figure 4 shows the structure similarity matrix and SCOP classifications of 25 non-redundant ATP-binding domains. Compared with the classifications of the SCOP database [19], protein domains with higher structure similarity are usually clustered together and always belong to the same SCOP families. Because SCOP [19] is a widely accepted structural and functional domain classification database, this agreement suggests that the multiple ligand-bound alignment and structure similarity calculation are reasonable and reflect structural and functional information.

In Figure 4(A), domains that belong to the same SCOP family are shown in the same color. Bold values mark structure similarities larger than the average value of the row; in other words, the domain in that row is more similar to these domains than to the others. In this matrix, most domains of the same SCOP family have higher structure similarity with each other (see the red-framed regions), which again suggests that the multiple ligand-bound structure alignment and structure similarity calculation are reasonable and reflect structural and functional information. Figure 4(B) shows the SCOP classification of the protein domains in Figure 4(A).
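The bold-value rule used in Figure 4(A) can be illustrated with a short Python sketch. It assumes the row average is taken over every entry of the row, including the diagonal; the text does not say whether the self-similarity entry is excluded. The function name and the toy values are hypothetical and are not taken from Figure 4.

```python
from typing import List


def above_row_average(matrix: List[List[float]]) -> List[List[bool]]:
    """Flag entries that are larger than the average of their own row.

    Assumption: the average includes all entries of the row (diagonal too).
    """
    flags = []
    for row in matrix:
        mean = sum(row) / len(row)
        flags.append([value > mean for value in row])
    return flags


# Hypothetical similarity values for three domains (not from Figure 4).
toy_matrix = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
```

In this toy matrix the first two domains flag each other, while the third domain is flagged only against itself, which is the kind of block structure the red frames highlight.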

3.1.3 Conservation residues identified from ATP-binding domains

After selecting the alignment center C of each cluster, we use the multiple ligand-bound structure alignment tool MuLiSA to generate multiple alignments.

We identified several conservation residues (z-score of position entropy > 2.5) in protein domains of the “Protein kinases catalytic subunit family” and the “Class I aminoacyl-tRNA synthetases (RS), catalytic domain family”. Table 2 lists the identified conservation residues; residues in bold are those that previous studies have shown to be important for ATP binding or conformational stability [23-32]. For example, in the “Human Cyclin-Dependent Kinase 2 protein domain (SCOP code: d1hck__)”, we identified residues A31, K129, N132 and D145, which interact with ATP by forming hydrogen bonds. In addition to these four residues, we identified six other conservation residues, which we believe are very likely to play an important role in ATP binding.
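The thresholding step can be sketched in Python as follows. The exact entropy and z-score definitions are given elsewhere in this thesis, so this sketch rests on assumptions: it uses Shannon entropy over the residues of each aligned column (a gap character would count as one more symbol) and signs the z-score so that low-entropy, highly conserved columns score high. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues in one aligned column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def conservation_z_scores(alignment: List[str]) -> List[float]:
    """z-score per column; assumed sign convention: low entropy gives a positive score."""
    entropies = [column_entropy("".join(col)) for col in zip(*alignment)]
    mu, sigma = mean(entropies), pstdev(entropies)
    if sigma == 0:
        return [0.0] * len(entropies)
    return [(mu - h) / sigma for h in entropies]


def conserved_positions(alignment: List[str], threshold: float = 2.5) -> List[int]:
    """0-based column indices whose z-score exceeds the threshold."""
    return [i for i, z in enumerate(conservation_z_scores(alignment)) if z > threshold]
```

A cutoff of 2.5 only becomes reachable when the alignment has many columns; a two-column toy alignment can never produce a z-score above 1, so small examples need a lower threshold.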

Figure 5 shows the multiple ligand-bound structure alignment results and the identified conservation residues in “Protein kinases, catalytic subunit family” of ATP-binding domains.

In Figure 5(A), the identified conservation residues (aligned positions with a z-score of entropy > 2.5) are close to ATP in three-dimensional space, which implies that they may play an important role in ATP binding. In Figure 5(B), the labeled residue numbers belong to protein domain d1phk__, the selected alignment center C of this cluster, and the red-framed regions mark the PROSITE patterns. Most identified conservation residues fall within these PROSITE pattern regions, which suggests that identifying pattern candidates by extending from conservation residues may be a reasonable approach.

3.1.4 Pattern candidates identified from ATP-binding domains

We identified pattern candidates of the “Protein kinases catalytic subunit family” and the “Class I aminoacyl-tRNA synthetases (RS), catalytic domain family” of ATP-binding domains. Table 3 lists these pattern candidates. We keep only pattern candidates that are at least 5 residues long and that extend from identified conservation residues with a z-score of entropy > 2.5. Table 4 compares the PROSITE patterns of ATP-binding domains with our pattern candidates that overlap them.

These pattern candidates partially overlap with PROSITE patterns. However, the pattern candidates in Table 3 that do not overlap with PROSITE patterns may be new clues for finding ATP-binding proteins. For example, although pattern candidate 1 in the “Protein kinases catalytic subunit family” overlaps with the PROSITE pattern “Serine/Threonine protein kinases active-site signature”, pattern candidates 2 and 3 in the same family do not overlap with any PROSITE pattern. We found that these pattern candidates are near ATP in 3-D space; we therefore believe that these two pattern candidates may be new clues for finding ATP-binding proteins (Figure 6).

All of these pattern candidates and PROSITE patterns were used to generate profiles, which we then use for protein function prediction.

3.1.5 Profile verification and protein function prediction of ATP-binding proteins

To use profiles generated from our alignments for protein function prediction, we first need to verify that these profiles are reasonable and convincing. We therefore use protein sequences that contain the PROSITE patterns PS00178, PS00107, PS00108, PS00411, PS00190, PS00435, PS00436, PS00086 and PS00191. These PROSITE patterns belong to the 8 clusters listed in Table 1. Because pattern candidates identified from a cluster should be meaningful for the sequences of that cluster, sequences of the cluster should receive higher profile scores when we search with profiles generated from its pattern candidates. In other words, a good pattern candidate separates the protein sequences of the cluster that contain it from protein sequences of other clusters that do not.

To compare the performance of pattern candidates and PROSITE patterns, we also generated profiles of PROSITE patterns from our multiple alignments. If the pattern candidates of a cluster perform better than the PROSITE patterns, we may have found a novel pattern that is more meaningful than the PROSITE patterns for this cluster. In Figure 7, our pattern candidate performs worse than the PROSITE pattern; however, because the profile of the PROSITE pattern was also generated from our alignment and performs well, this result supports the view that profiles generated from our alignments are reasonable and convincing.

To verify the effectiveness of profiles generated from our alignments in protein function prediction, we compare profile search performance between dataset 1, which contains protein sequences with a PROSITE pattern, and dataset 2, which contains protein sequences that have both the PROSITE pattern and an “ATP-binding” annotation in the SWISS-PROT database. In Figure 8, dataset 1 contains protein sequences with the PROSITE pattern “aminoacyl-transfer RNA synthetases class-I signature”, and dataset 2 contains protein sequences that have both this PROSITE pattern and an “ATP-binding” annotation in SWISS-PROT. The area under the curve for dataset 2 is larger than that for dataset 1.

Because the profiles of the pattern candidates were generated from alignments of ATP-binding domains, and not all protein sequences in dataset 1 have an “ATP-binding” annotation in the “KW” field of the SWISS-PROT database, we suppose that the profile of a pattern candidate is more convincing for ATP-binding proteins than for proteins that merely contain PROSITE patterns.

Table 5 summarizes the average hit rates at true positive rates of 50%, 60%, 70%, 80%, 90% and 100% for dataset 1 (sequences with a PROSITE pattern) and dataset 2 (sequences with a PROSITE pattern and SWISS-PROT annotations) for profile verification. In both datasets, the hit rates of the PROSITE patterns are all higher than those of the pattern candidates. Thus, the PROSITE patterns are indeed meaningful for the protein sequences that contain them.

However, we also observed that the hit rates in dataset 2 are generally higher than those in dataset 1. Because dataset 1 contains only sequences with PROSITE patterns, whereas dataset 2 contains sequences with both a PROSITE pattern and SWISS-PROT annotations, this suggests that the profiles we generated from multiple alignments of ATP-binding proteins may be more meaningful for protein sequences with “ATP-binding” annotations in the SWISS-PROT database.

Second, we used profiles of the pattern candidates and PROSITE patterns of ATP-binding proteins to search SWISS-PROT for protein sequences that might contain these patterns, on the supposition that protein sequences containing these pattern candidates may be ATP-binding proteins. We search all protein sequences in the SWISS-PROT database with all profiles of the identified pattern candidates and assign each sequence a profile score, namely the highest score over all profile searches. This gives a ranked list of profile scores for ATP-binding protein prediction (Figure 9). Sequences with higher profile scores are more likely to be ATP-binding proteins. When a protein sequence has a high profile score but no “ATP-binding” annotation in SWISS-PROT, we consider that it might be an ATP-binding protein because it contains the pattern candidate.
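The ranking procedure described above can be sketched in Python as follows. This is an illustration under assumptions rather than the pipeline used in this study: profiles are stored as one dictionary of amino acid values per position, a window score is the average profile value of the observed residues, and sequences shorter than a profile score 0 for it. The sequence identifiers and profiles are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed representation: one {amino acid: profile value} dict per pattern position.
Profile = List[Dict[str, float]]


def best_window_score(sequence: str, profile: Profile) -> float:
    """Highest window score of one profile on one sequence (0.0 if too short)."""
    n = len(profile)
    scores = [
        sum(profile[p].get(res, 0.0) for p, res in enumerate(sequence[s:s + n])) / n
        for s in range(len(sequence) - n + 1)
    ]
    return max(scores, default=0.0)


def rank_sequences(sequences: Dict[str, str],
                   profiles: List[Profile]) -> List[Tuple[str, float]]:
    """Give each sequence its highest score over all profiles, then sort descending."""
    scored = {
        seq_id: max((best_window_score(seq, prof) for prof in profiles), default=0.0)
        for seq_id, seq in sequences.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sequences and profiles, for illustration only.
toy_sequences = {"seqA": "AGKSA", "seqB": "AAAAA", "seqC": "GGGGG"}
toy_profiles: List[Profile] = [
    [{"G": 1.0}, {"K": 0.5, "R": 0.5}, {"S": 1.0}],
    [{"A": 1.0}, {"A": 1.0}],
]
```

With these toy inputs, `rank_sequences(toy_sequences, toy_profiles)` places `seqB` first because it matches the second profile perfectly, followed by `seqA` and then `seqC`.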

Figure 9 shows the profile score list for protein function prediction in ATP-binding proteins. Two points should be noted. First, all framed sequences except P27604 and P25169 have “ATP-binding” annotations; because these sequences all match a novel pattern candidate, pattern candidate 2 in the “Protein kinases catalytic subunit family”, we regard this pattern candidate as a new pattern of ATP-binding proteins. Second, the non-labeled sequences, P27604 and P25169, match the profiles but have no “ATP-binding” annotation in the SWISS-PROT database; these two proteins might therefore be ATP-binding proteins that have not yet been identified.

Table 6 summarizes the true positive rates, profile scores, and z-scores of the profile scores for the top 100, 500, 1000, 1500, 2000, 2500 and 3000 ranked sequences in the profile score ranking list. We also compare the hit rates of pattern candidates and PROSITE patterns. At a profile score of 0.600, the true positive rate is 82.27% and the z-score is 2.87. Thus, protein sequences with a profile score higher than 0.600 may be ATP-binding proteins with 82.27% confidence.
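A true positive rate at a given rank cutoff can be computed as in the Python sketch below. It assumes that "true positive rate" here means the fraction of the top-k ranked sequences that carry the annotation of interest, which is how the 82.27% confidence statement reads. The function name, identifiers and scores are hypothetical.

```python
from typing import List, Set, Tuple


def top_k_true_positive_rate(ranking: List[Tuple[str, float]],
                             annotated: Set[str], k: int) -> float:
    """Fraction of the top-k ranked sequences that carry the annotation.

    Assumption: `ranking` is sorted by descending profile score and `annotated`
    holds the identifiers of sequences with the annotation of interest.
    """
    top = ranking[:k]
    if not top:
        return 0.0
    hits = sum(1 for seq_id, _ in top if seq_id in annotated)
    return hits / len(top)


# Hypothetical ranked list and annotation set, for illustration only.
toy_ranking = [("s1", 0.9), ("s2", 0.8), ("s3", 0.7), ("s4", 0.6)]
toy_annotated = {"s1", "s2", "s4"}
```

Evaluating the function at several cutoffs, for example `[top_k_true_positive_rate(toy_ranking, toy_annotated, k) for k in (2, 3, 4)]`, gives one value per cutoff row, in the manner of Table 6.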

Comparing the hit rates of our pattern candidates and the PROSITE patterns, we observed that almost all of the top 3000 ranked protein sequences with “ATP-binding” annotations were found by pattern candidates. Although some pattern candidates partially overlap with PROSITE patterns, this suggests that the pattern candidates are useful for protein function prediction in ATP-binding proteins.

3.2 ADP-binding proteins

3.2.1 Overview

ADP (adenosine diphosphate) is a universal energy intermediate of the cell and the hydrolysis product of ATP. It can also transfer energy from chemical bonds to the endergonic reactions of the cell. The main difference between ATP and ADP is that ATP contains two high-energy bonds, whereas ADP has only one. Because ADP is likewise an important compound, we chose ADP-binding proteins as our second research target.

We also generated a structure similarity matrix of non-redundant ADP-binding domains for function-based domain clustering and identified conservation residues and pattern candidates. Finally, we used profiles of the pattern candidates for protein function prediction.

3.2.2 Structure similarity matrix of ADP-binding domains

Figure 10 shows the structure similarity matrix and SCOP classifications of 30 non-redundant ADP-binding domains. Compared with the SCOP classifications, protein domains with higher structure similarity are usually clustered together and always belong to the same SCOP families. This again suggests that the multiple ligand-bound structure alignments and structure similarity calculation for ADP-binding proteins are reasonable and reflect structural and functional information.

In Figure 10(A), most domains of the same SCOP family again have higher structure similarity with each other (see the red-framed regions). Figure 10(B) shows the SCOP classification of the protein domains in Figure 10(A). We also chose the alignment center C of each cluster of ADP-binding domains.

3.2.3 Conservation residues identified from ADP-binding domains

We also identified several conservation residues in protein domains of the “motor proteins family”. Table 7 lists the identified conservation residues; residues in bold are those reported in the literature to be important for ADP binding or conformational stability [33-40].

Figure 11 shows the multiple ligand-bound structure alignment result and the identified conservation residues in the “motor proteins family” of ADP-binding domains. In Figure 11(A), the identified conservation residues are close to ADP in three-dimensional space, which implies that they may play an important role in ADP binding. In Figure 11(B), the labeled residue numbers belong to protein domain d1goja_, the selected alignment center C of this cluster, and the red-framed regions mark the PROSITE patterns.

Most identified conservation residues fall within these regions, which suggests that identifying pattern candidates by extending from conservation residues may be a reasonable approach.

3.2.4 Pattern candidates identified from ADP-binding domains

We identified pattern candidates of the “motor proteins family” of ADP-binding domains. Table 8 lists these pattern candidates. Table 9 compares the PROSITE patterns of ADP-binding domains with our pattern candidates that overlap them. These pattern candidates partially overlap with PROSITE patterns. However, the pattern candidates in Table 8 that do not overlap with PROSITE patterns may be new clues for finding ADP-binding proteins. We also found that the identified pattern candidates are near ADP in 3-D space; we therefore believe that these three pattern candidates may be new clues for finding ADP-binding proteins (Figure 12). All of these pattern candidates were also used to generate profiles, which we then use for protein function prediction.

3.2.5 Profile verification and protein function prediction of ADP-binding proteins

To compare the performance of pattern candidates and PROSITE patterns, we also generated profiles of PROSITE patterns from our multiple alignments. In Figure 13, the pattern candidate performs worse than the PROSITE pattern; however, because the profile of the PROSITE pattern was also generated from our alignments and performs well, this result again supports the view that profiles generated from our alignments are reasonable and convincing.

To verify the effectiveness of profiles generated from our alignments in protein function prediction, we also compared profile search performance between different datasets. However, because annotations of ADP-binding proteins are ambiguous and we chose only one domain cluster of ADP-binding proteins, the “motor proteins family”, we selected protein sequences that have both the PROSITE pattern “Kinesin motor domain signature” and a “motor protein” annotation in the SWISS-PROT database.

In Figure 14, dataset 1 contains protein sequences with the PROSITE pattern “Kinesin motor domain signature”; dataset 2 contains protein sequences that have both this PROSITE pattern and a “motor protein” annotation in the SWISS-PROT database. The area under the curve for dataset 2 is again larger than that for dataset 1. Because the profiles of the pattern candidates were generated from alignments of motor protein domains, and not all protein sequences in dataset 1 have a “motor protein” annotation in SWISS-PROT, we suppose that the profiles of the pattern candidates are more convincing for motor proteins than for proteins that merely contain PROSITE patterns.

Table 10 summarizes the average hit rates at true positive rates of 50%, 60%, 70%, 80%, 90% and 100% for dataset 1 (sequences with a PROSITE pattern) and dataset 2 (sequences with a PROSITE pattern and SWISS-PROT annotations) for profile verification. In both datasets, the hit rates of the PROSITE patterns are all higher than those of the pattern candidates. Thus, the PROSITE patterns are indeed meaningful for the protein sequences that contain them.

However, we also observed that the hit rates in dataset 2 are generally higher than those in dataset 1. Because dataset 1 contains only sequences with PROSITE patterns, whereas dataset 2 contains sequences with both a PROSITE pattern and SWISS-PROT annotations, it tells