Chapter 4. Conclusions
4.3 Future work
Several aspects of MuLiSA can still be improved.
First, the alignment algorithm must be improved. In the present results, we observed that gaps are sometimes inserted within well-aligned segments, and that the conservation patterns always form secondary structure segments. Incorporating secondary structure information into the alignment algorithm and preventing gaps in well-aligned segments may be a practical way to improve the alignment results.
Second, if we can find proteins that bind similar compounds at different steps of a biochemical reaction, such as ATP, ADP and AMP bound by the same protein, then multiple ligand-bound structure alignments would let us observe how residues vary over the course of the reaction. This may help clarify the importance and the role of function-dependent residues in a continuous reaction.
Third, multiple ligand-bound structure alignment can be modified into an “active site-based multiple structure alignment”. When the functionally important regions of proteins are known, we can superimpose these regions and identify more functionally important residues for further research.
Table 1. Statistics of proteins, domains and pattern candidates.
Cytochrome b5 (5) d1cyo__ 4 1
Monodomain cytochrome c (23)
d1i54a_ 3 1
Heme 1145 860 131
Cytochrome c' (4) d1i54a_ d 3 1
a Number of ligand-binding proteins in PDBsum database.
b Number of ligand-binding domains selected by our program.
c Domain clusters defined according to structure similarity and SCOP database classification; the domain names follow SCOP database nomenclature. We chose only domain clusters with more than 3 domains because their alignments are more statistically meaningful, and only domain clusters with PROSITE patterns because we need benchmarks to verify our results. The numbers in parentheses are the numbers of non-redundant domains in each cluster.
d We chose the same alignment center C for the domain clusters monodomain cytochrome c and cytochrome c' because the same pattern candidates were identified in these clusters.
Table 2. Conservation residues identified from ATP-binding domains.
Family a Domain Conservation residues b
d1phk__ G26 A46 L111 D149 K151 P152 N154 L156 D167 T186
d1atpe_ G50 A70 L128 D166 c K168 P169 N171 L173 D184 T201
d1qmza_ G11 A31 L87 D127 K129 P130 N132 L134 D145 T165
d1csn__ G19 A39 L92 D131 K133 P134 N136 L138 D154 T181
d1hck__ G11 A31 L87 D127 K129 P130 N132 L134 D145 T165
d1gol__ G30 A50 L110 D147 K149 P150 N152 L154 D165 T188
d1h1wa_ G89 A109 L167 D205 K207 P208 N210 L212 D223 T245
Protein kinases catalytic subunit
Z-score e 2.980 2.980 2.980 2.980 2.980 2.980 2.980 2.980 2.980 2.980
d1maua_ P10 G17 L23 D41 S81 Y125 D132 L135 P172 V179 K192 M193 S194 K195 L206 L272
d1n77a2 d G17 L23 T186 D194 L197 P228 G274 K243 H15 L253
d1gtra2+ P35 G42 I47 D67 S100 Y211 L221 P253 G314 M268 S269 K270 L39 L327
d1h3ea1 P46 G54 L59 D78 S129 Y175 D182 V184 G18 K232 M233 S234 K235 L243 L292
Class I aminoacyl-tRNA synthetases (RS), catalytic domain
Z-score 2.879 5.218 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 5.218 2.879
a The SCOP database families.
b Conservation residues identified by MuLiSA at positions whose entropy z-score is > 2.5.
c Residues in bold have been reported in the literature as important for ligand binding or conformational stability.
d Blank cells are gaps in the alignments.
e Z-score of the entropy calculation at each position.
+ No reference was found for this protein domain; hence no residues were labeled.
Table 3. Pattern candidates identified from ATP-binding domains
Family Domain Pattern candidates a
1 b 2 3
d1phk__
d1atpe_
d1qmza_
d1csn__
d1hck__
d1gol__
Protein kinases catalytic subunit
d1h1wa_
1 d1maua_
d1n77a2
d1gtra2+
Class I aminoacyl-tRNA synthetases (RS), catalytic domain
d1h3ea1
Table 4. Comparison of PROSITE patterns and pattern candidates of ATP-binding domains
Family Domain PROSITE patterns a Pattern candidates b
PS00107 Protein kinases ATP-binding region signature.
[LIV]-G-{P}-G-{P}-[FYWMGST
PS00178 Aminoacyl-transfer RNA synthetases class-I signature
Table 5. Hit rate comparison of dataset difference in profile verification of ATP-binding proteins
Dataset 1 c Dataset 2 d
Family PROSITE patterns and
pattern candidates a No. of
sequence Hit rate e No. of
sequence Hit rate Protein kinases
Pattern candidate 1 b 84.79% 86.76%
Pattern candidate 2 64.19% 68.35%
Protein kinases catalytic subunit
Pattern candidate 3
859
catalytic domain Pattern candidate 1
1129
20.18%
1056
37.43%
a PROSITE patterns and pattern candidates that we identified.
b Pattern candidate 1 of the “Protein kinases catalytic subunit family” (see Table 3).
c Dataset 1: sequences with PROSITE patterns only.
d Dataset 2: sequences with PROSITE patterns and SWISS-PROT annotations.
e Average of the hit rates at true-positive rates of 50%, 60%, 70%, 80%, 90% and 100%.
Table 6. Hit rate comparison of pattern candidates and PROSITE patterns in protein function prediction of ATP-binding proteins
No. of top ranked sequence a
True-positive rate
Profile scoring score
Z-score of profile scoring score b
Hit rate of all pattern candidates
Hit rate of PROSITE pattern
100 100.00% (100) 0.840 6.52 100.00% (100) 0.00% (0)
500 98.40% (492) 0.720 4.70 100.00% (492) 0.00% (0)
1000 95.70% (957) 0.650 3.63 99.79% (955) 0.21% (2)
1500 82.27% (1234) 0.600 2.87 97.65% (1205) 2.35% (29)
2000 76.65% (1533) 0.583 2.61 80.43% (1503) 19.57% (30)
2500 70.28% (1757) 0.567 2.37 94.25% (1656) 5.75% (101)
3000 61.53% (1846) 0.556 2.20 94.53% (1745) 5.47% (101)
a Number of top-ranked sequences. For example, 100 in this column means the 100 sequences with the highest profile scoring scores in the profile scoring ranking list of the ATP-binding protein prediction.
b Z-score of profile scoring scores. The average of all SWISS-PROT sequence scores is 0.411515; the standard deviation of all SWISS-PROT sequence scores is 0.065701.
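The z-scores in Table 6 can be checked from footnote b. The sketch below assumes the standard definition, z = (score - mean) / standard deviation; the function name profile_z_score is hypothetical and is not part of MuLiSA.

```python
# Mean and standard deviation of all SWISS-PROT profile scoring scores,
# as stated in footnote b of Table 6.
MEAN_SCORE = 0.411515
STD_SCORE = 0.065701


def profile_z_score(score, mean=MEAN_SCORE, std=STD_SCORE):
    """Standard z-score (assumed definition): distance from the mean in units of std."""
    return (score - mean) / std


# Profile scoring scores from the first four rows of Table 6.
scores = [0.840, 0.720, 0.650, 0.600]
z_scores = [round(profile_z_score(s), 2) for s in scores]
print(z_scores)
```

Under this assumption the computed values agree, to two decimals, with the z-scores listed for the top 100, 500, 1000 and 1500 sequences.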
Table 7. Conservation residues identified from ADP-binding domains
Family Domain Conservation residues
d1goja_ P16 G88 G93 K94 S206 D235 G238 E240
d1bg2__ P17 G85 G90 K91 S202 D231 G234 E236
d1br2a2 P126 G177 G182 K183 S179 D465 G468 E178
d2ncda_ P357 G434 G439 G440 S551 D580 G583 E585
d1f9ta_ P395 G474 G479 K480 S597 D626 G629 E631
d1i5sa_ P14 G97 G102 K103 S215 D248 G251 E253
d1lkxa_ P50 G101 G106 K107 S103 D386 G389 E102
d2kin.1 P17 G86 G91 K92 S203 D232 G235 E237
motor proteins
Z-score 3.029 3.029 3.029 3.029 3.029 3.029 3.029 3.029
Table 8. Pattern candidates identified from ADP-binding domains
Family Domain Pattern candidates
1 2 3 4
d1goja_
d1bg2__
d1br2a2
d2ncda_
d1f9ta_
d1i5sa_
d1lkxa_
motor proteins
d2kin.1
Table 9. Comparison of PROSITE patterns and pattern candidates of ADP-binding domains
Family Domain PROSITE patterns Pattern candidates
PS00411 Kinesin motor domain signature.
[GSA]-[KRHPSTQVM]-[LIVMF]-x-[LIVMF]-[IVC]-D-L-[AH]-G-[SAN]-E. 4
d1goja_
d1bg2__
d1br2a2
d2ncda_
d1f9ta_
d1i5sa_
d1lkxa_
motor proteins
d2kin.1
Table 10. Hit rate comparison of dataset difference in profile verification of motor proteins
Dataset 1 b Dataset 2 c
Family PROSITE patterns and
pattern candidates a No. of
sequence d Hit rate e No. of
sequence Hit rate Kinesin motor domain
signature 99.10% 99.50%
Pattern candidate 1 69.83% 72.47%
Pattern candidate 2 83.50% 85.24%
Pattern candidate 3 97.19% 97.56%
motor proteins
Pattern candidate 4
95
98.35%
89
98.75%
a PROSITE patterns and pattern candidates that we identified.
b Dataset 1: sequences with PROSITE pattern
c Dataset 2: sequences with PROSITE pattern and SWISS-PROT “motor protein” annotations.
d Number of recorded sequences that have PROSITE patterns in this cluster.
e Average of the hit rates at true-positive rates of 50%, 60%, 70%, 80%, 90% and 100%.
Table 11. Hit rate comparison of pattern candidates and PROSITE patterns in protein function prediction of motor proteins
Top number of sequence a
True-positive rate
Profile scoring score
Z-score of profile scoring score b
Hit rate of all pattern candidates
Hit rate of PROSITE pattern
10 100% (10) 1.000 7.44 100.00% (10) 0.00% (0)
50 100% (50) 1.000 7.44 100.00% (50) 0.00% (0)
100 91.00% (91) 0.875 5.76 100.00% (91) 0.00% (0)
150 66.00% (99) 0.750 4.08 100.00% (99) 0.00% (0)
200 50.50% (101) 0.750 4.08 100.00% (101) 0.00% (0)
250 40.80% (102) 0.750 4.08 100.00% (102) 0.00% (0)
300 34.00% (102) 0.667 2.97 100.00% (102) 0.00% (0)
a The top ranked sequence number.
b Z-score of profile scoring scores. The average of all SWISS-PROT sequence scores is 0.445968; the standard deviation of all SWISS-PROT sequence scores is 0.074513.
Table 12. Conservation residues identified from HEM-binding domains
Family Domain Conservation residues
d1llp__ R43 H47 G66 D107 G131 R132 P145 V170 H176 F200 D238
d1cca__ R48 H52 G65 D106 G129 R130 P145 V169 H175 F198 D235
d1oafa_ R38 H42 G55 D95 G118 R119 P132 V157 H163 F186 D208
d1hsr__ R52 H56 G75 D116 G140 R141 P154 V178 H184 F208 D246
d1h5ma_ R38 H42 G48 D99 G122 R123 P139 V164 H170 F229 D247
d1b80a_ R43 H47 G66 D107 G131 R132 P145 V170 H176 F200 D238
d1bgp__ R45 H49 G55 D108 G131 R132 P149 V173 H179 F232 D250
d1mnp__ R42 H46 G62 D104 G128 R129 P142 V167 H173 F197 D242
d1fhfa_ R38 H42 G48 D99 G122 R123 P139 V163 H169 F228 D246
d1pa2a_ R38 H42 G48 D99 G122 R123 P139 V163 H169 F228 D246
d1qgja_ R38 H42 G48 D96 G119 R120 P135 V159 H165 F224 D242
d1qpaa_ R43 H47 G66 D107 G131 R132 P145 V170 H176 F200 D238
d1scha_ R38 H42 G48 D99 G122 R123 P139 V163 H169 F221 D239
CCP-like
Z-score 2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662
Table 12. Continued.
Family Domain Conservation residues
d1eupa_ A241 G242 E280 R283 D322 R336 F344 G345 G347 H349 C351 G353
d1dz4a_ V247 G248 E287 R290 D328 R342 F350 G351 G353 H355 C357 G359
d1bu7a_ A264 G265 E320 R323 D363 R378 F393 G394 G396 R398 C400 G402
d1cpt__ A267 G268 E306 R309 D348 R362 F370 G371 G373 H375 C377 G379
d1n6ba_ A294 G295 E351 R354 D394 H408 F425 S426 G428 R430 C432 G434
d1e9xa_ A256 G257 313 R316 I355 R369 F387 G388 G390 H392 C394 G396
d1ehea_ A239 G240 E278 R281 D321 R335 F345 G346 G348 H350 C352 A354
d1gwia_ A242 G243 E281 R284 D324 R339 F348 G349 G351 H353 C355 G357
d1io7a_ A209 G210 E246 R249 D288 R302 F310 G311 G313 H315 C317 G319
d1izoa_ E282 R285 D324 R338 Q352 G353 G355 H341 C363 G365
d1lfka_ A236 G237 E275 R278 D318 R332 F340 G341 G343 H345 C347 G349
d1n4ga_+ A233 G234 E272 R275 D315 R329 F338 G339 G341 H343 C345 G347
d1n97a_ A221 G222 E260 R263 R314 F329 G330 G332 R334 C336 G338
Cytochrome P450
Z-score 2.705 3.208 3.723 3.723 2.705 3.208 3.208 3.208 3.723 2.697 3.723 3.208
d1cyo__ E11 T33 H39 H63
d1b5m__ E11 T33 H39 H63
d1cxya_ E13 T58 H70 H42
d1icca_ E11 T55 H63 H39
d1mj4a_ E10 T34 H40 H65
Cytochrome b5
Z-score 3.253 3.253 3.253 3.253
Table 12. Continued.
Family Domain Conservation residues
d1i54a_ C14 C17 H18
d1cot__ C15 C18 H19
d1c75a_ C32 C35 C36
d1c2ra_ C13 C16 H17
d1c52__ C11 C14 H15
d1c6ra_ C15 C18 H19
d1cc5__+ C19 C22 H23
d1ccr__ C22 C25 H26
d1ycc__ C14 C17 H18
d1co6a_ C13 C16 H17
d1ctj__ C15 C18 H19
d1cxc__ C15 C18 H19
d1cyi__ C14 C17 H18
d1dw0a_ C43 C46 H47
d1f1fa_ C14 C17 H18
d1fj0b_ C13 C16 H17
d1gdva_ C14 C17 H18
d1hroa_ C19 C22 H23
d1jdla_ C15 C18 H19
d1ls9a_ C17 C20 H21
d2mtac_+ C57 C60 H61
d3c2c__+ C14 C17 H18
monodomain cytochrome c
d351c__ C12 C15 H16
d1a7va_ C113 C116 H117
d1bbha_ C121 C124 H125
d1cgo__ C116 C119 H120
d1cpq__ C118 C121 H122
Cytochrome c'
Z-score 3.505 3.505 3.505
Table 13. Pattern candidates identified from HEM-binding domains
Family Domain Pattern candidates
1 2 3
Table 13. Continued.
Family Domain Pattern candidates
1 d1i54a_
d1cot__
d1c75a_
d1c2ra_
d1c52__
d1c6ra_
d1cc5__
d1ccr__
d1ycc__
d1co6a_
d1ctj__
d1cxc__
d1cyi__
d1dw0a_
d1f1fa_
d1fj0b_
d1gdva_
d1hroa_
d1jdla_
d1ls9a_
d2mtac_
d3c2c__
monodomain cytochrome c
d351c__
d1a7va_
d1bbha_
d1cgo__
Cytochrome c'
d1cpq__
Table 14. Comparison of PROSITE patterns and pattern candidates of HEM-binding domains
Family Domain PROSITE patterns Pattern candidates
PS00436
PS00086 Cytochrome P450 cysteine heme-iron ligand signature.
PS00191 Cytochrome b5 family, heme-binding domain signature.
[FY]-[LIVMK]-x(2)-H-P-[GA]-G 1
d1cyo__
Table 14. Continued.
Family Domain PROSITE patterns Pattern candidates
PS00190 Cytochrome c family heme-binding site signature.
C-{CPWHF}-{CPWR}-C-H-{CFYW}. 1
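PROSITE patterns such as those in Table 14 can be matched against sequences by translating them into regular expressions. The following is a minimal sketch that covers only the basic PROSITE syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repetition, and the < and > anchors); the function name prosite_to_regex is hypothetical, and this is not the profile scoring method used in this work.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # drop the terminating period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repetition suffix such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            token = "."  # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"  # excluded residues
        else:
            token = core  # a single residue or a [..] class
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + token + ("$" if anchor_end else ""))
    return "".join(parts)


# PS00190, cytochrome c family heme-binding site signature (Table 14).
cyt_c = prosite_to_regex("C-{CPWHF}-{CPWR}-C-H-{CFYW}.")
print(cyt_c)
print(bool(re.search(cyt_c, "AKCAACHTV")))  # toy sequence, not a real protein
```

A regular expression gives only a match or no match, whereas the profile scoring used in this work ranks sequences by score, so the sketch illustrates the pattern syntax rather than the evaluation procedure.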
Table 15. Hit rate comparison of dataset difference in profile verification of HEM-binding proteins
Dataset 1 b Dataset 2 c
Family PROSITE patterns and
pattern candidates a No. of
sequence Hit rate d No. of
sequence Hit rate
Peroxidases active site
signature. 8.86% 100.00%
Peroxidases proximal
heme-ligand signature. 8.39% 55.72%
Pattern candidate 1 4.86% 46.67%
Pattern candidate 2 6.94% 49.81%
CCP-like
Pattern candidate 3
205
4.95%
151
9.54%
Cytochrome P450 cysteine
heme-iron ligand signature. 86.05% 86.45%
Pattern candidate 1 65.49% 68.34%
Pattern candidate 2 38.09% 40.55%
Cytochrome P450
Pattern candidate 3
687
86.13%
675
86.26%
Cytochrome b5 family, heme-binding domain
signature.
79.43% 79.30%
Cytochrome b5
Pattern candidate 1
88
77.13%
78
82.53%
Cytochrome c family heme-binding site
signature.
87.12% 84.08%
monodomain cytochrome c and
Cytochrome c' Pattern candidate 1
1130
86.93%
897
84.02%
a PROSITE patterns and pattern candidates that we identified.
b Dataset 1: sequences with PROSITE pattern
c Dataset 2: sequences with PROSITE pattern and SWISS-PROT annotations
d Average of the hit rates at true-positive rates of 50%, 60%, 70%, 80%, 90% and 100%.
Table 16. Hit rate comparison of pattern candidates and PROSITE pattern in protein function prediction of HEM-binding proteins
Top number of sequence a
True-positive rate
Profile scoring score
Z-score of profile scoring score b
Hit rate of all pattern candidates
Hit rate of PROSITE pattern
100 92.00% (92) 0.798 4.72 100.00% (92) 0.00% (0)
200 80.50% (161) 0.744 4.00 96.27% (155) 3.73% (6)
300 69.00% (207) 0.708 3.52 97.10% (201) 2.90% (6)
400 69.75% (279) 0.692 3.30 87.81% (245) 12.19% (34)
500 70.40% (352) 0.685 3.21 90.34% (318) 9.66% (34)
600 60.33% (362) 0.685 3.21 90.61% (328) 9.39% (34)
700 57.86% (405) 0.669 2.99 91.60% (371) 8.40% (34)
a The top ranked sequence number.
b Z-score of profile scoring scores. The average of all SWISS-PROT sequence scores is 0.436928; the standard deviation of all SWISS-PROT sequence scores is 0.071717.
Table 17. Prediction accuracy and coverage rates in protein function prediction
a Number of annotated SWISS-PROT protein sequences. Annotations: “ATP-binding”, “motor protein”, “Heme” and “kinesin” in the “KW” field of the SWISS-PROT database. There are 151047 protein sequences in the SWISS-PROT database.
b Number of true-hit annotated protein sequences among the top 100% ranked protein sequences. For example, 3462 of the top 13484 sequences in the profile scoring ranking list of the ATP-binding prediction are true hits.
c We chose only motor proteins as our protein function prediction target.
Table 18. 10 predicted protein sequences with high scores in profile scoring ranking lists of protein function prediction in ATP-binding proteins, motor proteins and HEM-binding proteins.
Predicted ATP-binding proteins Predicted motor proteins Predicted HEM-binding proteins
1 (295 a) P27604 b
(Adenosylhomocysteinase c)
(85) P44531
(Ferric cations import ATP-binding protein fbpC 1)
(62) Q60613
(Putative diacylglycerol kinase K06A1.6)
(Hypothetical 65.5 kDa protein y4IJ)
6 (882) Q96RU7
(Neuronal cell death inducible putative kinase)
(B-box type zinc-finger protein ncl-1)
(126) P35100
(ATP-dependent Clp protease ATP-binding subunit clpA
homolog, chloroplast [Precursor])
(108) Q81MN9
(Polyphosphate kinase)
8 (938) Q9WTQ6
(Neuronal cell death inducible putative kinase)
(127) Q8CFL8
(Zinc finger SWIM domain containing protein 3)
(112) P55019
(Solute carrier family 12 member 3)
9 (947) Q8K4K2
(Neuronal cell death inducible putative kinase)
(T-cell surface glycoprotein CD1e [Precursor])
a Ranking serial number in profile scoring ranking lists.
b SWISS-PROT accession numbers.
c Protein name in SWISS-PROT database.
[Figure 1 flowchart. Steps: Prepare dataset; Dataset clustering; MuLiSA (select alignment center C of each cluster; C-centered multiple ligand-bound structure alignments; identify conservation residues by z-score of position entropy; identify pattern candidates from conservation residue extension); Tool evaluation; Protein function prediction.]
Figure 1. The workflow of analysis and identification of conservation patterns and residues in proteins by MuLiSA. This flow starts from dataset preparation and clustering, followed by multiple ligand-bound structure alignments (MuLiSA), tool evaluation and protein function prediction.
Structure similarity matrix:
        d1n77a2 d1maua2 d1h3ea1 d1gtra2
d1n77a2 1       0.36    0.36    0.34
d1maua2 0.36    1       0.44    0.39
d1h3ea1 0.36    0.44    1       0.3
d1gtra2 0.34    0.39    0.3     1
Sum of similarities to the other domains:
d1n77a2 : 1.06
d1maua2 : 1.19
d1h3ea1 : 1.10
d1gtra2 : 0.3 + 0.39 + 0.34 = 1.03
Figure 2. Selection of the alignment center C. Within a cluster, the domain with the highest total structure similarity to the other domains is selected as the alignment center C. In this case, we select domain d1maua2 as the alignment center C of this cluster.
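The selection rule in Figure 2 can be sketched as follows, using the four pairwise similarity values shown in the figure. The function name select_alignment_center is hypothetical; the rule itself (pick the domain with the largest sum of similarities to the other domains) is taken from the figure.

```python
# Pairwise structure similarities from Figure 2 (rows and columns in the same order).
domains = ["d1n77a2", "d1maua2", "d1h3ea1", "d1gtra2"]
similarity = [
    [1.00, 0.36, 0.36, 0.34],
    [0.36, 1.00, 0.44, 0.39],
    [0.36, 0.44, 1.00, 0.30],
    [0.34, 0.39, 0.30, 1.00],
]


def select_alignment_center(names, matrix):
    """Return the domain with the highest summed similarity to all other domains."""
    totals = {}
    for i, name in enumerate(names):
        # Skip the diagonal entry (self-similarity).
        totals[name] = sum(v for j, v in enumerate(matrix[i]) if j != i)
    center = max(totals, key=totals.get)
    return center, totals


center, totals = select_alignment_center(domains, similarity)
for name in domains:
    print(name, round(totals[name], 2))
print("alignment center C:", center)
```

With these values the sums are 1.06, 1.19, 1.10 and 1.03, so d1maua2 is selected, in agreement with the figure.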
Figure 3. Identification of conservation residues at positions with z-score > 2.5. (A) The multiple alignments of four protein sequences. (B) The entropy and z-score values of each position. Figure 3(A) shows the alignment results of 13 protein sequences belonging to the “Cytochrome P450 family”. The numbers at the top of Figure 3(A) are the residue numbers of d1eupa_. The “+” symbols denote the positions with z-score > 2.5, and Figure 3(B) shows that these positions have z-scores of 3.208 and 3.723. The framed region is the possible pattern candidate.
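The thresholding step in Figure 3 can be illustrated with a small sketch. The exact entropy and z-score formulas are not restated here, so the code below makes two assumptions: the position entropy is the Shannon entropy of the residues in an alignment column (with a gap treated as an ordinary symbol), and the z-score is (mean entropy - column entropy) / standard deviation, so that strongly conserved (low-entropy) columns receive large positive z-scores. The toy alignment and all function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_z_scores(alignment):
    """Z-score of each column's entropy; low entropy gives a high z-score (assumed sign)."""
    columns = list(zip(*alignment))
    entropies = [column_entropy(col) for col in columns]
    mu, sigma = mean(entropies), pstdev(entropies)
    if sigma == 0:
        return [0.0] * len(entropies)
    return [(mu - h) / sigma for h in entropies]


# Toy alignment: only column 4 (0-based) is fully conserved.
toy_alignment = [
    "ACDEGFHIKL",
    "CDEFGHIKLM",
    "DEFHGIKLMN",
    "EFHIGKLMNP",
]
z = conservation_z_scores(toy_alignment)
conserved = [i for i, value in enumerate(z) if value > 2.5]
print(conserved)
```

In this toy case only the fully conserved column exceeds the 2.5 threshold. Note that with very few columns no z-score can exceed the threshold, because the largest possible z-score grows with the number of aligned positions.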
Figure 4. (A) Structure similarity matrix of 25 non-redundant ATP-binding domains; (B) SCOP classification of 25 non-redundant ATP-binding domains. In Figure 4(A), domains belonging to the same SCOP family share the same color. Bold values indicate a structure similarity larger than the average value of the row; in other words, the domain in that row is more similar to these domains than to the others. In this matrix, most domains of the same SCOP family have higher structure similarity with each other (see the red-framed regions), which indicates that the multiple ligand-bound structure alignment and the structure similarity calculation are reasonable and reflect structural and functional information. In Figure 4(B), protein domains are classified according to the SCOP classification hierarchy: class, fold, superfamily, and family. The protein domains are named following SCOP database nomenclature.
Figure 5. MuLiSA result and identified conservation residues in the “Protein kinases, catalytic subunit family” of ATP-binding domains. (A) Three-dimensional distribution of the identified conservation residues and the ligand superimposition. Yellow: d1phk__; blue: d1atpe_; green: d1qmza_; red: d1csn__; grey: d1hck__; pink: d1gol__; light blue: d1h1wa_. (B) Multiple ligand-bound structure alignment result of the “Protein kinases, catalytic subunit family” domains. In Figure 5(A), the identified conservation residues, that is, aligned positions with an entropy z-score > 2.5, are close to ATP in three-dimensional space. This implies that these conservation residues may play an important role in ATP binding. In Figure 5(B), the labeled residue numbers belong to protein domain d1phk__, which is the selected alignment center C of this cluster, and the red-framed regions mark the PROSITE patterns. Most identified conservation residues lie in these PROSITE pattern regions, which suggests that identifying pattern candidates by extending conservation residues is a reasonable approach.
[Figure 6 labels: Pattern candidate 1, Pattern candidate 2, Pattern candidate 3, ATP]
Figure 6. Three pattern candidates of the “Class I aminoacyl-tRNA synthetases (RS), catalytic domain family” in three-dimensional space. Pattern candidate 1 overlaps with PROSITE pattern PS00178; pattern candidates 2 and 3 are novel patterns that we identified. All three pattern candidates are close to ATP; hence they may be important in ATP binding.
[ROC plot. x-axis: False-Positive Rate (0 to 1); y-axis: True-Positive Rate (0 to 1); curves: PROSITE motif, Pattern candidate]
Figure 7. Comparison of pattern candidate 1 and the PROSITE pattern “Serine/Threonine protein kinases active-site signature” in the “Protein kinases catalytic subunit family” for profile verification of ATP-binding proteins. In a ROC curve, the area under the curve represents the goodness of the test. Our pattern candidate performs worse than the PROSITE pattern; however, because the profile of the PROSITE pattern is also generated from our alignments, this result indicates that the profiles generated from our alignments are reasonable.
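The area under a ROC curve can be estimated from sampled points with the trapezoidal rule. The sketch below is illustrative only: the points are hypothetical and are not read from Figure 7, and the function name roc_auc is an assumption.

```python
def roc_auc(points):
    """Trapezoidal area under a ROC curve given (false-positive rate, true-positive rate) pairs."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Area of the trapezoid between two neighbouring points.
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical ROC points, not taken from the figures in this work.
curve_a = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.9), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]  # diagonal, i.e. a random classifier

print(round(roc_auc(curve_a), 3))
print(round(roc_auc(curve_b), 3))
```

A curve that rises steeply toward the upper-left corner yields an area close to 1, while the diagonal yields 0.5, which is why a larger area is read as a better test.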
[ROC plot. x-axis: False-Positive Rate (0 to 1); y-axis: True-Positive Rate (0 to 1); curves: Dataset 1, Dataset 2]
Figure 8. Comparison of the datasets used in the profile search with pattern candidate 1 of the “Class I aminoacyl-tRNA synthetases (RS), catalytic domain family” of ATP-binding domains. Dataset 1: protein sequences that contain the PROSITE pattern “aminoacyl-transfer RNA synthetases class-I signature”. Dataset 2: protein sequences that contain the same PROSITE pattern and also have “ATP-binding” annotations in the SWISS-PROT database. The area under the curve of dataset 2 is larger than that of dataset 1. Because the profiles of the pattern candidates were generated from alignments of ATP-binding domains, and not all protein sequences in dataset 1 have “ATP-binding” annotations, we think that the profile of the pattern candidate is more meaningful for ATP-binding proteins than for proteins that only contain the PROSITE pattern.
Figure 9. Profile scoring list of protein function prediction for ATP-binding proteins. Protein sequences with SWISS-PROT “ATP-binding” annotations are labeled with a “+” symbol in the ATP column. The SWISS-PROT accession numbers are listed in the Seq. column. Values in the “Score” column are the profile scoring scores. The “Pattern” column shows the matched protein sequence segment, together with the residue numbers of its first and last residues. Two points must be mentioned. First, the framed sequences all have “ATP-binding” annotations (except for P27604 and P25169); because these sequences all match our newly found pattern candidate, we regard this pattern candidate as a new pattern of ATP-binding proteins. Second, the non-labeled sequences, P27604 and P25169, match the profiles but do not have “ATP-binding” annotations in the SWISS-PROT database; hence these two proteins might be ATP-binding proteins that have not yet been identified.
Figure 10. (A) Structure similarity matrix of 30 non-redundant ADP-binding domains; (B) SCOP classification of 30 non-redundant ADP-binding domains. In Figure 10(A), domains belonging to the same SCOP family share the same color. Bold values indicate a structure similarity larger than the average value of the row. In this matrix, most domains of the same SCOP family have higher structure similarity with each other (see the red-framed regions). In Figure 10(B), protein domains are classified according to the SCOP classification hierarchy.
Figure 11. MuLiSA result and identified conservation residues in the “motor proteins family” of ADP-binding domains. (A) Three-dimensional distribution of the identified conservation residues and the ligand superimposition. Yellow: d1goja_; blue: d1bg2__; green: d1br2a2; red: d2ncda_; grey: d1f9ta_; orange: d1i5sa_; brown: d2kin.1; light blue: d1lkxa_. (B) Multiple ligand-bound structure alignment result of the “motor proteins family” domains. In Figure 11(A), the identified conservation residues are close to ADP in three-dimensional space. This implies that these conservation residues may play an important role in ADP binding. In Figure 11(B), the labeled residue numbers belong to protein domain d1goja_, which is the selected alignment center C of this cluster, and the red-framed regions mark the PROSITE patterns.
[Figure 12 labels: Pattern candidate 1, Pattern candidate 2, Pattern candidate 3, Pattern candidate 4, ADP]
Figure 12. Four pattern candidates of the “motor proteins family” in three-dimensional space. Pattern candidate 4 overlaps with PROSITE pattern PS00411; pattern candidates 1, 2 and 3 are novel patterns that we identified. All four pattern candidates are close to ADP; hence they may be important in ADP binding.
[ROC plot. x-axis: False-Positive Rate (0 to 1); y-axis: True-Positive Rate (0 to 1); curves: PROSITE motif, Pattern candidate]
Figure 13. Comparison of pattern candidate 4 and the PROSITE pattern “Kinesin motor domain signature” in the “motor proteins family” for profile verification of ADP-binding domains. Our pattern candidate performs worse than the PROSITE pattern; however, because the profile of the PROSITE pattern is also generated from our alignments, this result indicates that the profiles generated from our alignments are reasonable.
[ROC plot. x-axis: False-Positive Rate (0 to 1); y-axis: True-Positive Rate (0 to 1); curves: Dataset 1, Dataset 2]
Figure 14. Comparison of the datasets used in the profile search with pattern candidate 1 of the “motor proteins family” of ADP-binding domains. Dataset 1: protein sequences that contain the PROSITE pattern “Kinesin motor domain signature”. Dataset 2: protein sequences that contain the same PROSITE pattern and also have “motor protein” annotations in the SWISS-PROT database. The area under the curve of dataset 2 is larger than that of dataset 1. Because the profiles of the pattern candidates were generated from alignments of motor protein domains, and not all protein sequences in dataset 1 have “motor protein” annotations, we think that the profile of the pattern candidate is more meaningful for motor proteins than for proteins that only contain the PROSITE pattern.
Figure 15. Profile scoring list of protein function prediction for motor proteins. Protein sequences with SWISS-PROT “motor protein” annotations are labeled with a “+” symbol in the motor column. The SWISS-PROT accession numbers are listed in the Seq. column. Values in the “Score” column are the profile scoring scores. The “Pattern” column shows the matched protein sequence segment, together with the residue numbers of its first and last residues. Two points must be mentioned. First, the framed sequences all have “motor” annotations; because these sequences all match our newly found pattern candidate, we regard this pattern candidate as a new pattern of motor proteins. Second, the non-labeled sequences match the profiles but do not have “motor protein” annotations in the SWISS-PROT database; hence these proteins might be motor proteins that have not yet been identified.
Figure 16. (A) Structure similarity matrix of 40 non-redundant HEM-binding domains; (B) SCOP classification of 40 non-redundant HEM-binding domains. In Figure 16(A), protein domains belonging to the same SCOP family share the same color. In this matrix, most domains of the same SCOP family have higher structure similarity with each other (see the red-framed regions), which indicates that the multiple ligand-bound structure alignment and the structure similarity calculation are reasonable and reflect structural and functional information. In Figure 16(B), protein domains are classified according to the SCOP classification hierarchy: class, fold, superfamily, and family.
Figure 17. MuLiSA result and identified conservation residues in the “Cytochrome b5 family” of HEM-binding domains. (A) Three-dimensional distribution of the identified conservation residues and the ligand superimposition. Yellow: d1cyo__; blue: d1b5m__; green: d1cxya_; red: d1icca_; grey: d1mj4a_. (B) Multiple ligand-bound structure alignment result of the “Cytochrome b5 family” domains. In Figure 17(A), the identified conservation residues are close to heme in