Profile verification and protein function prediction of ATP-binding proteins

Chapter 3. Protein Function Prediction and Conservation Residues Identification

3.1 ATP-binding proteins

3.1.5 Profile verification and protein function prediction of ATP-binding proteins

Before using profiles generated from our alignments to predict protein function, we first need to verify that these profiles are reasonable and convincing. For this purpose, we used protein sequences that contain the PROSITE patterns PS00178, PS00107, PS00108, PS00411, PS00190, PS00435, PS00436, PS00086 and PS00191. These PROSITE patterns belong to the 8 clusters listed in Table 1. Because pattern candidates identified from a cluster should be meaningful for the sequences of that cluster, a profile search using profiles generated from these pattern candidates should give the sequences of that cluster higher profile scores. In other words, a good pattern candidate separates the protein sequences of the cluster that contain it from the protein sequences of other clusters that do not.
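The separation criterion described above can be illustrated with a minimal sketch. The profile construction and scoring function used in this work are not reproduced here; the sketch assumes a simple profile of column-wise residue frequencies and scores a sequence by its best ungapped window. The function names `build_profile` and `profile_score` and the toy segments are hypothetical.

```python
from collections import Counter

def build_profile(aligned_segments):
    """Column-wise residue frequencies from equal-length aligned segments.

    Assumption: a plain frequency profile, not the thesis's actual scheme.
    """
    length = len(aligned_segments[0])
    profile = []
    for col in range(length):
        counts = Counter(seg[col] for seg in aligned_segments)
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()})
    return profile

def profile_score(profile, sequence):
    """Best mean column frequency over all ungapped windows of the sequence."""
    width = len(profile)
    best = 0.0
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, 0.0) for col, res in zip(profile, window)) / width
        best = max(best, score)
    return best

# Hypothetical toy data, not taken from the thesis.
segments = ["GKST", "GKTT", "GKSS"]
profile = build_profile(segments)
in_cluster = profile_score(profile, "MAAGKSTLLA")   # contains the pattern
out_cluster = profile_score(profile, "MPLQWERVNA")  # does not contain it
```

Under these assumptions, a sequence that contains the pattern scores higher than one that does not, which is the behaviour a good pattern candidate is expected to show across clusters.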

To compare the performance of the pattern candidates with that of the PROSITE patterns, we also generated profiles of the PROSITE patterns from our multiple alignments. If the pattern candidates of a cluster perform better than the PROSITE patterns, we may have found a novel pattern that is more meaningful than the PROSITE patterns for that cluster. In Figure 7, we observed that our pattern candidate performs worse than the PROSITE pattern; however, because the profile of the PROSITE pattern was also generated from our alignment and performs well, this result indicates that the profiles generated from our alignments are reasonable and convincing.

To verify the effectiveness of the profiles generated from our alignments in protein function prediction, we compared profile search performance between dataset 1, which contains protein sequences with a PROSITE pattern, and dataset 2, which contains protein sequences that have both a PROSITE pattern and an “ATP-binding” annotation in the SWISS-PROT database. In Figure 8, dataset 1 contains the protein sequences that match the PROSITE pattern “aminoacyl-transfer RNA synthetases class-I signature”, and dataset 2 contains the protein sequences that match this pattern and also have an “ATP-binding” annotation in the SWISS-PROT database. We observed that the area under the curve for dataset 2 is larger than that for dataset 1.

Because the profiles of the pattern candidates were generated from alignments of ATP-binding domains, and not all protein sequences in dataset 1 have an “ATP-binding” annotation in the “KW” (keyword) line of the SWISS-PROT database, we suppose that the profile of a pattern candidate is more convincing for ATP-binding proteins than for proteins that merely contain a PROSITE pattern.
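The area-under-curve comparison between the two datasets can be sketched as follows. The exact curve plotted in Figure 8 is not specified in this section, so the sketch assumes a ROC-style curve, whose area equals the probability that a randomly chosen positive sequence outscores a randomly chosen negative one. The function `roc_auc` and all scores are hypothetical illustrations, not results from this work.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under a ROC curve via pairwise comparison (ties count half).

    Assumption: the curves being compared are ROC-style curves.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical profile scores, not taken from the thesis.
negatives = [0.5, 0.2, 0.1]           # sequences outside the cluster
dataset1 = [0.9, 0.7, 0.4, 0.3]       # sequences with the PROSITE pattern
dataset2 = [0.9, 0.7]                 # subset also annotated "ATP-binding"

auc1 = roc_auc(dataset1, negatives)
auc2 = roc_auc(dataset2, negatives)
```

In this toy case the annotated subset is ranked more cleanly above the negatives, so `auc2` exceeds `auc1`, mirroring the kind of difference reported for datasets 1 and 2.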

In Table 5, we summarized the average hit rates at true positive rates of 50%, 60%, 70%, 80%, 90% and 100% for dataset 1 (sequences with a PROSITE pattern) and dataset 2 (sequences with a PROSITE pattern and SWISS-PROT annotations) for profile verification. We observed that in both dataset 1 and dataset 2, the hit rates of the PROSITE patterns are all higher than those of the pattern candidates. Thus, the PROSITE patterns are indeed meaningful for the protein sequences that contain them.

However, we also observed that the hit rates in dataset 2 are generally higher than those in dataset 1. Because dataset 1 contains sequences with PROSITE patterns only, whereas dataset 2 contains sequences with both a PROSITE pattern and SWISS-PROT annotations, this suggests that the profiles we generated from multiple alignments of ATP-binding proteins may be more meaningful for protein sequences with “ATP-binding” annotations in the SWISS-PROT database.

Second, we used the profiles of the pattern candidates and PROSITE patterns of ATP-binding proteins to search for SWISS-PROT protein sequences that might contain these patterns, on the assumption that protein sequences containing these pattern candidates may be ATP-binding proteins. We searched all protein sequences in the SWISS-PROT database with all profiles of the identified pattern candidates and assigned each sequence a profile score, defined as the highest score obtained over all profile searches. In this way, we obtained a ranking list of profile scores for ATP-binding protein prediction (Figure 9). Sequences with higher profile scores are more likely to be ATP-binding proteins. When a protein sequence has a high profile score but no “ATP-binding” annotation in the SWISS-PROT database, we consider that the protein might be an ATP-binding protein because it contains the pattern candidate.
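The ranking procedure described above (take the highest score over all profiles for each sequence, sort, then flag high-scoring sequences that lack the annotation) can be sketched as follows. This is a minimal illustration under stated assumptions: the per-profile scores are taken as given, and the function names, sequence identifiers, profile names, and threshold are all hypothetical.

```python
def rank_by_best_profile(scores_by_profile):
    """Rank sequences by their highest score over all profiles.

    scores_by_profile: {profile_name: {sequence_id: score}}
    Returns [(sequence_id, best_score, best_profile)], best score first.
    """
    best = {}
    for profile_name, scores in scores_by_profile.items():
        for seq_id, score in scores.items():
            if seq_id not in best or score > best[seq_id][0]:
                best[seq_id] = (score, profile_name)
    ranked = [(seq_id, score, name) for seq_id, (score, name) in best.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

def unannotated_hits(ranked, annotated_ids, threshold):
    """High-scoring sequences that lack the "ATP-binding" annotation."""
    return [seq_id for seq_id, score, _ in ranked
            if score >= threshold and seq_id not in annotated_ids]

# Hypothetical scores and identifiers, not taken from the thesis.
scores = {
    "candidate_1": {"SEQ_A": 0.82, "SEQ_B": 0.40, "SEQ_C": 0.10},
    "candidate_2": {"SEQ_A": 0.55, "SEQ_B": 0.71, "SEQ_C": 0.15},
}
ranked = rank_by_best_profile(scores)
candidates = unannotated_hits(ranked, annotated_ids={"SEQ_A"}, threshold=0.6)
```

Here `SEQ_B` ranks second on the strength of `candidate_2` alone and carries no annotation, so it is reported as a possible, not yet annotated, ATP-binding protein.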

Figure 9 shows the profile score ranking list for function prediction of ATP-binding proteins. Two points must be mentioned. First, the framed sequences all have “ATP-binding” annotations (except for P27604 and P25169); because these sequences all match a novel pattern candidate, pattern candidate 2 in the “Protein kinases catalytic subunit family”, we regard this pattern candidate as a new pattern of ATP-binding proteins. Second, the non-labeled sequences, P27604 and P25169, match the profiles but do not have “ATP-binding” annotations in the SWISS-PROT database; hence these two proteins might be ATP-binding proteins that have not yet been identified.

In Table 6, we summarized the true positive rates, profile scores, and z-scores of the profile scores for the top 100, 500, 1000, 1500, 2000, 2500 and 3000 ranked sequences in the profile score ranking list. We also compared the hit rates of the pattern candidates and the PROSITE patterns. We observed that at a profile score of 0.600, the true positive rate is 82.27% and the z-score is 2.87. Thus, for protein sequences with a profile score higher than 0.600, we can say that these sequences may be ATP-binding proteins with 82.27% confidence.
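The quantities summarized per cutoff can be sketched as follows. Two assumptions are made that this section does not state explicitly: the true positive rate is read as the fraction of the top-N ranked sequences that carry the “ATP-binding” annotation, and the z-score is computed for the cutoff score against the mean and standard deviation of all profile scores. The function `top_n_summary` and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def top_n_summary(ranked, annotated_ids, n):
    """Summarize the top-n entries of a ranking list.

    ranked: [(sequence_id, score)] sorted by score, highest first.
    Assumption: z-score is taken relative to all scores in the list.
    """
    top = ranked[:n]
    hits = sum(1 for seq_id, _ in top if seq_id in annotated_ids)
    all_scores = [score for _, score in ranked]
    cutoff = top[-1][1]  # lowest score still inside the top n
    z = (cutoff - mean(all_scores)) / pstdev(all_scores)
    return {
        "true_positive_rate": hits / len(top),
        "cutoff_score": cutoff,
        "z_score": z,
    }

# Hypothetical ranking list, not taken from the thesis.
ranked = [("S1", 0.9), ("S2", 0.8), ("S3", 0.6),
          ("S4", 0.3), ("S5", 0.2), ("S6", 0.2)]
summary = top_n_summary(ranked, annotated_ids={"S1", "S3", "S5"}, n=3)
```

Repeating the call for each cutoff (100, 500, and so on) would yield one row per cutoff in the style of Table 6.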

When comparing the hit rates of our pattern candidates and the PROSITE patterns, we observed that almost all of the top 3000 ranked protein sequences with “ATP-binding” annotations were found by the pattern candidates. Although some of the pattern candidates partially overlap with PROSITE patterns, this indicates that the pattern candidates are useful for function prediction of ATP-binding proteins.