Chapter 4. Conclusions

4.3 Future work

Several tasks remain to improve MuLiSA.

First, the alignment algorithm must be improved. In the present results, gaps were sometimes inserted within well-aligned segments, even though the conserved patterns always form secondary structure segments. Incorporating secondary structure information into the alignment algorithm and preventing gaps within well-aligned segments may therefore be a practical way to improve the alignment results.

Second, multiple ligand-bound structure alignments could be applied to proteins that bind similar compounds at different steps of a biochemical reaction, for example the same protein binding ATP, ADP and AMP. Such alignments would show how residues vary over the course of a reaction and may help clarify the importance and role of function-dependent residues in a continuous reaction.

Third, multiple ligand-bound structure alignments can be adapted into “active site-based multiple structure alignments”. When the functionally important regions of proteins are known, these regions can be superimposed to identify more functionally important residues for further research.

Table 1. Statistics of proteins, domains and pattern candidates.

Cytochrome b5 (5) d1cyo__ 4 1

Monodomain cytochrome c (23)

d1i54a_ 3 1

Heme 1145 860 131

Cytochrome c' (4) d1i54a_ ^d 3 1

a Number of ligand-binding proteins in the PDBsum database.

b Number of ligand-binding domains selected by our program.

c Domain clusters formed according to structure similarity and SCOP database classification; the domain names follow SCOP database nomenclature. We only chose domain clusters with more than 3 domains, because their alignments are more statistically meaningful, and only clusters with PROSITE patterns, because we need benchmarks to verify our results. The numbers in parentheses are the numbers of non-redundant domains in each cluster.

d We chose the same alignment center C for the domain clusters monodomain cytochrome c and cytochrome c', because the same pattern candidates were identified in both clusters.

Table 2. Conservation residues identified from ATP-binding domains.

Family^a   Domain   Conservation residues^b

Family: Protein kinases catalytic subunit
d1phk__     G26 A46 L111 D149 K151 P152 N154 L156 D167 T186
d1atpe_     ^{G50 A70} ^L128 ^D166^c ^K168 ^P169 N171 L173 D184 T201
d1qmza_     G11 A31 L87 D127 K129 P130 N132 L134 D145 T165
d1csn__     G19 A39 L92 D131 K133 P134 N136 L138 D154 T181
d1hck__     G11 A31 L87 D127 K129 P130 N132 L134 D145 T165
d1gol__     ^{G30 A50} ^L110 D147 K149 P150 N152 L154 D165 T188
d1h1wa_     ^{G89 A109} ^L167 ^D205 ^K207 P208 N210 L212 D223 T245
Z-score^e   2.980 2.980 2.980 2.980 2.980 2.980 2.980 2.980 2.980 2.980

Family: Class I aminoacyl-tRNA synthetases (RS), catalytic domain
d1maua_     ^P10 G17 L23 D41 S81 Y125 D132 L135 P172 V179 K192 M193 S194 K195 L206 L272
d1n77a2^d   G17 L23 T186 D194 L197 P228 G274 K243 H15 L253
d1gtra2⁺    P35 G42 I47 D67 S100 Y211 L221 P253 G314 M268 S269 K270 L39 L327
d1h3ea1     P46 G54 L59 D78 S129 Y175 D182 V184 G18 K232 M233 S234 K235 L243 L292
Z-score     2.879 5.218 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 2.879 5.218 2.879

a SCOP database families.

b Conservation residues identified by MuLiSA at positions with a z-score of entropy calculation > 2.5.

c Residues in bold have been reported in the literature as important for ligand binding or conformational stability.

d Blank cells are gaps in the alignments.

e Z-score of the entropy calculation at each position.

+ No literature reference was found for this protein domain; hence no residues are marked.

Table 3. Pattern candidates identified from ATP-binding domains

Family   Domain   Pattern candidates^a (1^b, 2, 3)

Protein kinases catalytic subunit: d1phk__, d1atpe_, d1qmza_, d1csn__, d1hck__, d1gol__, d1h1wa_

Class I aminoacyl-tRNA synthetases (RS), catalytic domain (pattern candidate 1): d1maua_, d1n77a2, d1gtra2⁺, d1h3ea1

Table 4. Comparison of PROSITE patterns and pattern candidates of ATP-binding domains

Family   Domain   PROSITE patterns^a   Pattern candidates^b

PS00107 Protein kinases ATP-binding region signature.

[LIV]-G-{P}-G-{P}-[FYWMGST

PS00178 Aminoacyl-transfer RNA synthetases class-I signature

Table 5. Hit rate comparison of dataset difference in profile verification of ATP-binding proteins

Columns: Family; PROSITE patterns and pattern candidates^a; Dataset 1^c (No. of sequences, Hit rate^e); Dataset 2^d (No. of sequences, Hit rate)

Protein kinases

Pattern candidate 1^b 84.79% 86.76%

Pattern candidate 2 64.19% 68.35%

Protein kinases catalytic subunit

Pattern candidate 3

859

catalytic domain Pattern candidate 1

1129

20.18%

1056

37.43%

a PROSITE patterns and the pattern candidates that we identified.

b Pattern candidate 1 of the “Protein kinases catalytic subunit family” (see Table 3).

c Dataset 1: sequences with PROSITE patterns only.

d Dataset 2: sequences with PROSITE patterns and SWISS-PROT annotations.

e Average hit rate at true-positive rates of 50%, 60%, 70%, 80%, 90% and 100%.

Table 6. Hit rate comparison of pattern candidates and PROSITE patterns in protein function prediction of ATP-binding proteins

Columns: No. of top-ranked sequences^a; True-positive rate; Profile scoring score; Z-score of profile scoring score^b; Hit rate of all pattern candidates; Hit rate of PROSITE pattern

100     100.00% (100)    0.840   6.52   100.00% (100)    0.00% (0)
500     98.40% (492)     0.720   4.70   100.00% (492)    0.00% (0)
1000    95.70% (957)     0.650   3.63   99.79% (955)     0.21% (2)
1500    82.27% (1234)    0.600   2.87   97.65% (1205)    2.35% (29)
2000    76.65% (1533)    0.583   2.61   80.43% (1503)    19.57% (30)
2500    70.28% (1757)    0.567   2.37   94.25% (1656)    5.75% (101)
3000    61.53% (1846)    0.556   2.20   94.53% (1745)    5.47% (101)

a Number of top-ranked sequences. For example, 100 in this column means the 100 sequences with the highest profile scoring scores in the ranking list for ATP-binding protein prediction.

b Z-score of the profile scoring score. The mean of all SWISS-PROT sequence scores is 0.411515 and the standard deviation is 0.065701.
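The z-scores in Table 6 can be checked against the mean and standard deviation quoted in footnote b. The short sketch below assumes the usual definition, z = (score - mean) / standard deviation; the function and constant names are illustrative and not part of MuLiSA.

```python
def profile_z_score(score, mean, std):
    """Z-score of a profile scoring score relative to the background scores."""
    return (score - mean) / std


# Background statistics quoted in footnote b of Table 6.
SWISSPROT_MEAN = 0.411515
SWISSPROT_STD = 0.065701

# Profile scoring scores taken from the first three rows of Table 6.
for score in (0.840, 0.720, 0.650):
    z = profile_z_score(score, SWISSPROT_MEAN, SWISSPROT_STD)
    print(f"score {score:.3f} -> z-score {z:.2f}")
```

Under this assumption the three rows give 6.52, 4.70 and 3.63, which agrees with the tabulated values.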

Table 7. Conservation residues identified from ADP-binding domains

Family   Domain   Conservation residues

Family: motor proteins
d1goja_   P16 G88 G93 K94 S206 D235 G238 E240
d1bg2__   P17 G85 G90 K91 S202 D231 G234 E236
d1br2a2   P126 G177 G182 K183 S179 D465 G468 E178
d2ncda_   P357 G434 G439 G440 S551 D580 G583 E585
d1f9ta_   P395 G474 G479 K480 S597 D626 G629 E631
d1i5sa_   P14 G97 G102 K103 S215 D248 G251 E253
d1lkxa_   P50 G101 G106 K107 S103 D386 G389 E102
d2kin.1   P17 G86 G91 K92 S203 D232 G235 E237
Z-score   3.029 3.029 3.029 3.029 3.029 3.029 3.029 3.029

Table 8. Pattern candidates identified from ADP-binding domains

Family   Domain   Pattern candidates (1, 2, 3, 4)

motor proteins: d1goja_, d1bg2__, d1br2a2, d2ncda_, d1f9ta_, d1i5sa_, d1lkxa_, d2kin.1

Table 9. Comparison of PROSITE patterns and pattern candidates of ADP-binding domains

Family   Domain   PROSITE patterns   Pattern candidates

PS00411 Kinesin motor domain signature: [GSA]-[KRHPSTQVM]-[LIVMF]-x-[LIVMF]-[IVC]-D-L-[AH]-G-[SAN]-E. (pattern candidate 4)

motor proteins: d1goja_, d1bg2__, d1br2a2, d2ncda_, d1f9ta_, d1i5sa_, d1lkxa_, d2kin.1

Table 10. Hit rate comparison of dataset difference in profile verification of motor proteins

Columns: Family; PROSITE patterns and pattern candidates^a; Dataset 1^b (No. of sequences^d, Hit rate^e); Dataset 2^c (No. of sequences, Hit rate)

Family: motor proteins
Pattern                           Hit rate (Dataset 1)   Hit rate (Dataset 2)
Kinesin motor domain signature    99.10%                 99.50%
Pattern candidate 1               69.83%                 72.47%
Pattern candidate 2               83.50%                 85.24%
Pattern candidate 3               97.19%                 97.56%
Pattern candidate 4               98.35%                 98.75%

a PROSITE patterns and pattern candidates that we identified.

b Dataset 1: sequences with PROSITE pattern

c Dataset 2: sequences with PROSITE pattern and SWISS-PROT “motor protein” annotations.

d Number of recorded sequences in this cluster that contain PROSITE patterns.

e Average hit rate at true-positive rates of 50%, 60%, 70%, 80%, 90% and 100%.

Table 11. Hit rate comparison of pattern candidates and PROSITE patterns in protein function prediction of motor proteins

Columns: No. of top-ranked sequences^a; True-positive rate; Profile scoring score; Z-score of profile scoring score^b; Hit rate of all pattern candidates; Hit rate of PROSITE pattern

10     100% (10)       1.000   7.44   100.00% (10)     0.00% (0)
50     100% (50)       1.000   7.44   100.00% (50)     0.00% (0)
100    91.00% (91)     0.875   5.76   100.00% (91)     0.00% (0)
150    66.00% (99)     0.750   4.08   100.00% (99)     0.00% (0)
200    50.50% (101)    0.750   4.08   100.00% (101)    0.00% (0)
250    40.80% (102)    0.750   4.08   100.00% (102)    0.00% (0)
300    34.00% (102)    0.667   2.97   100.00% (102)    0.00% (0)

a The top ranked sequence number.

b Z-score of profile scoring scores. The average of all SWISS-PROT sequence scores is 0.445968; the standard deviation of all SWISS-PROT sequence scores is 0.074513.

Table 12. Conservation residues identified from HEM-binding domains

Family Domain Conservation residues

Family: CCP-like
d1llp__   R43 H47 G66 D107 G131 R132 P145 V170 H176 F200 D238
d1cca__   R48 H52 G65 D106 G129 R130 P145 V169 H175 F198 D235
d1oafa_   R38 H42 G55 D95 G118 R119 P132 V157 H163 F186 D208
d1hsr__   R52 H56 G75 D116 G140 R141 P154 V178 H184 F208 D246
d1h5ma_   R38 H42 G48 D99 G122 R123 P139 V164 H170 F229 D247
d1b80a_   R43 H47 G66 D107 G131 R132 P145 V170 H176 F200 D238
d1bgp__   R45 H49 G55 D108 G131 R132 P149 V173 H179 F232 D250
d1mnp__   R42 H46 G62 D104 G128 R129 P142 V167 H173 F197 D242
d1fhfa_   R38 H42 G48 D99 G122 R123 P139 V163 H169 F228 D246
d1pa2a_   R38 H42 G48 D99 G122 R123 P139 V163 H169 F228 D246
d1qgja_   R38 H42 G48 D96 G119 R120 P135 V159 H165 F224 D242
d1qpaa_   R43 H47 G66 D107 G131 R132 P145 V170 H176 F200 D238
d1scha_   R38 H42 G48 D99 G122 R123 P139 V163 H169 F221 D239
Z-score   2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662 2.662

Table 12. Continued.

Family Domain Conservation residues

Family: Cytochrome P450
d1eupa_    A241 G242 E280 R283 D322 R336 F344 G345 G347 H349 C351 G353
d1dz4a_    V247 G248 E287 R290 D328 R342 F350 G351 G353 H355 C357 G359
d1bu7a_    A264 G265 E320 R323 D363 R378 F393 G394 G396 R398 C400 G402
d1cpt__    A267 G268 E306 R309 D348 R362 F370 G371 G373 H375 C377 G379
d1n6ba_    A294 G295 E351 R354 D394 H408 F425 S426 G428 R430 C432 G434
d1e9xa_    A256 G257 313 R316 I355 R369 F387 G388 G390 H392 C394 G396
d1ehea_    A239 G240 E278 R281 D321 R335 F345 G346 G348 H350 C352 A354
d1gwia_    A242 G243 E281 R284 D324 R339 F348 G349 G351 H353 C355 G357
d1io7a_    A209 G210 E246 R249 D288 R302 F310 G311 G313 H315 C317 G319
d1izoa_    E282 R285 D324 R338 Q352 G353 G355 H341 C363 G365
d1lfka_    A236 G237 E275 R278 D318 R332 F340 G341 G343 H345 C347 G349
d1n4ga_⁺   A233 G234 E272 R275 D315 R329 F338 G339 G341 H343 C345 G347
d1n97a_    A221 G222 E260 R263 R314 F329 G330 G332 R334 C336 G338
Z-score    2.705 3.208 3.723 3.723 2.705 3.208 3.208 3.208 3.723 2.697 3.723 3.208

Family: Cytochrome
d1cyo__   E11 T33 H39 H63
d1b5m__   E11 T33 H39 H63
d1cxya_   E13 T58 H70 H42
d1icca_   E11 T55 H63 H39
d1mj4a_   E10 T34 H40 H65
Z-score   3.253 3.253 3.253 3.253

Table 12. Continued.

Family Domain Conservation residues

Family: monodomain cytochrome c
d1i54a_    C14 C17 H18
d1cot__    C15 C18 H19
d1c75a_    C32 C35 C36
d1c2ra_    C13 C16 H17
d1c52__    C11 C14 H15
d1c6ra_    C15 C18 H19
d1cc5__⁺   C19 C22 H23
d1ccr__    C22 C25 H26
d1ycc__    C14 C17 H18
d1co6a_    C13 C16 H17
d1ctj__    C15 C18 H19
d1cxc__    C15 C18 H19
d1cyi__    C14 C17 H18
d1dw0a_    C43 C46 H47
d1f1fa_    C14 C17 H18
d1fj0b_    C13 C16 H17
d1gdva_    C14 C17 H18
d1hroa_    C19 C22 H23
d1jdla_    C15 C18 H19
d1ls9a_    C17 C20 H21
d2mtac_⁺   C57 C60 H61
d3c2c__⁺   C14 C17 H18
d351c__    C12 C15 H16

Family: Cytochrome c'
d1a7va_    C113 C116 H117
d1bbha_    C121 C124 H125
d1cgo__    C116 C119 H120
d1cpq__    C118 C121 H122

Z-score    3.505 3.505 3.505

Table 13. Pattern candidates identified from HEM-binding domains

Family Domain Pattern candidates

1 2 3

Table 13. Continued.

Family Domain Pattern candidates

Pattern candidate 1

monodomain cytochrome c: d1i54a_, d1cot__, d1c75a_, d1c2ra_, d1c52__, d1c6ra_, d1cc5__, d1ccr__, d1ycc__, d1co6a_, d1ctj__, d1cxc__, d1cyi__, d1dw0a_, d1f1fa_, d1fj0b_, d1gdva_, d1hroa_, d1jdla_, d1ls9a_, d2mtac_, d3c2c__, d351c__

Cytochrome c': d1a7va_, d1bbha_, d1cgo__, d1cpq__

Table 14. Comparison of PROSITE patterns and pattern candidates of HEM-binding domains

Family Domain PROSITE patterns Pattern candidates

PS00436

PS00086 Cytochrome P450 cysteine heme-iron ligand signature.

PS00191 Cytochrome b5 family, heme-binding domain signature.

[FY]-[LIVMK]-x(2)-H-P-[GA]-G 1

d1cyo__

Table 14. Continued.

Family Domain PROSITE patterns Pattern candidates

PS00190 Cytochrome c family heme-binding site signature.

C-{CPWHF}-{CPWR}-C-H-{CFYW}. 1
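The signatures quoted in these tables are written in PROSITE pattern syntax: `-` separates positions, `x` matches any residue, `[...]` lists allowed residues, `{...}` lists excluded residues, and `(n)` or `(n,m)` gives a repetition count. As a reading aid, the sketch below converts such a pattern into a Python regular expression. It is an illustration only, not the scanning tool used in this work; the function name `prosite_to_regex` is hypothetical, and the `<` and `>` terminus anchors of the full syntax are deliberately not handled.

```python
import re

# One pattern element, optionally followed by a repeat such as (2) or (2,4).
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (subset of the syntax) into a regex string."""
    body = pattern.strip().rstrip(".")  # a trailing period ends the pattern
    parts = []
    for element in body.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":
            regex = "."                       # any residue
        elif core.startswith("[") and core.endswith("]"):
            regex = core                      # allowed residues
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"   # excluded residues
        elif len(core) == 1 and core.isalpha():
            regex = core                      # a single fixed residue
        else:
            raise ValueError(f"unsupported element {element!r}")
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


# PS00190, as quoted in Table 14, applied to a made-up test sequence.
heme_site = prosite_to_regex("C-{CPWHF}-{CPWR}-C-H-{CFYW}.")
print(heme_site)
print(bool(re.search(heme_site, "MKCAACHAL")))
```

A malformed or truncated element raises `ValueError` instead of being guessed at, so incomplete patterns are rejected rather than silently completed.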

Table 15. Hit rate comparison of dataset difference in profile verification of HEM-binding proteins

Columns: Family; PROSITE patterns and pattern candidates^a; Dataset 1^b (No. of sequences, Hit rate^d); Dataset 2^c (No. of sequences, Hit rate)

Family: CCP-like (No. of sequences: 205 in Dataset 1, 151 in Dataset 2)
Pattern                                           Hit rate (Dataset 1)   Hit rate (Dataset 2)
Peroxidases active site signature.                8.86%                  100.00%
Peroxidases proximal heme-ligand signature.       8.39%                  55.72%
Pattern candidate 1                               4.86%                  46.67%
Pattern candidate 2                               6.94%                  49.81%
Pattern candidate 3                               4.95%                  9.54%

Family: Cytochrome P450 (No. of sequences: 687 in Dataset 1, 675 in Dataset 2)
Cytochrome P450 cysteine heme-iron ligand signature.   86.05%   86.45%
Pattern candidate 1                                    65.49%   68.34%
Pattern candidate 2                                    38.09%   40.55%
Pattern candidate 3                                    86.13%   86.26%

Family: Cytochrome b5
Cytochrome b5 family, heme-binding domain signature.   79.43%   79.30%
Pattern candidate 1                                    77.13%   82.53%

Family: monodomain cytochrome c and Cytochrome c' (No. of sequences: 1130 in Dataset 1, 897 in Dataset 2)
Cytochrome c family heme-binding site signature.   87.12%   84.08%
Pattern candidate 1                                86.93%   84.02%

a PROSITE patterns and pattern candidates that we identified.

b Dataset 1: sequences with PROSITE pattern

c Dataset 2: sequences with PROSITE pattern and SWISS-PROT annotations

d Average hit rate at true-positive rates of 50%, 60%, 70%, 80%, 90% and 100%.

Table 16. Hit rate comparison of pattern candidates and PROSITE pattern in protein function prediction of HEM-binding proteins

Columns: No. of top-ranked sequences^a; True-positive rate; Profile scoring score; Z-score of profile scoring score^b; Hit rate of all pattern candidates; Hit rate of PROSITE pattern

100    92.00% (92)     0.798   4.72   100.00% (92)    0.00% (0)
200    80.50% (161)    0.744   4.00   96.27% (155)    3.73% (6)
300    69.00% (207)    0.708   3.52   97.10% (201)    2.90% (6)
400    69.75% (279)    0.692   3.30   87.81% (245)    12.19% (34)
500    70.40% (352)    0.685   3.21   90.34% (318)    9.66% (34)
600    60.33% (362)    0.685   3.21   90.61% (328)    9.39% (34)
700    57.86% (405)    0.669   2.99   91.60% (371)    8.40% (34)

a The top ranked sequence number.

b Z-score of profile scoring scores. The average of all SWISS-PROT sequence scores is 0.436928; the standard deviation of all SWISS-PROT sequence scores is 0.071717.

Table 17. Prediction accuracy and coverage rates in protein function prediction

a Number of annotated SWISS-PROT protein sequences. Annotations: “ATP-binding”, “motor protein”, “Heme” and “kinesin” in the “KW” field of the SWISS-PROT database. There are 151047 protein sequences in the SWISS-PROT database.

b Number of true-hit annotated protein sequences among the top 100% ranked protein sequences. For example, there are 3462 true-hit protein sequences among the top 13484 sequences in the profile scoring ranking list for ATP-binding prediction.

c We only chose motor proteins as our protein function prediction target.

Table 18. 10 predicted protein sequences with high scores in profile scoring ranking lists of protein function prediction in ATP-binding proteins, motor proteins and HEM-binding proteins.

Predicted ATP-binding proteins Predicted motor proteins Predicted HEM-binding proteins

1 (295^a) P27604^b

(Adenosylhomocysteinase^c)

(85) P44531

(Ferric cations import ATP-binding protein fbpC 1)

(62) Q60613

(Putative diacylglycerol kinase K06A1.6)

(Hypothetical 65.5 kDa protein y4IJ)

6 (882) Q96RU7

(Neuronal cell death inducible putative kinase)

(B-box type zinc-finger protein ncl-1)

(126) P35100

(ATP-dependent Clp protease ATP-binding subunit clpA

homolog, chloroplast [Precursor])

(108) Q81MN9

(Polyphosphate kinase)

8 (938) Q9WTQ6

(Neuronal cell death inducible putative kinase)

(127) Q8CFL8

(Zinc finger SWIM domain containing protein 3)

(112) P55019

(Solute carrier family 12 member 3)

9 (947) Q8K4K2

(Neuronal cell death inducible putative kinase)

(T-cell surface glycoprotein CD1e [Precursor])

a Ranking serial number in profile scoring ranking lists.

b SWISS-PROT accession numbers.

c Protein name in SWISS-PROT database.

Workflow steps shown in Figure 1: (1) prepare dataset; (2) dataset clustering; (3) MuLiSA: select the alignment center C of each cluster, perform C-centered multiple ligand-bound structure alignments, identify conservation residues by z-score of position entropy, and identify pattern candidates from conservation residue extension; (4) tool evaluation; (5) protein function prediction.

Figure 1. The workflow of analysis and identification of conservation patterns and residues in proteins by MuLiSA. This flow starts from dataset preparation and clustering, followed by multiple ligand-bound structure alignments (MuLiSA), tool evaluation and protein function prediction.

Structure similarity matrix of the cluster:

           d1n77a2   d1maua2   d1h3ea1   d1gtra2
d1n77a2    1         0.36      0.36      0.34
d1maua2    0.36      1         0.44      0.39
d1h3ea1    0.36      0.44      1         0.3
d1gtra2    0.34      0.39      0.3       1

Sum of similarities to the other domains:

d1n77a2 : 1.06
d1maua2 : 1.19
d1h3ea1 : 1.10
d1gtra2 : 0.3 + 0.39 + 0.34 = 1.03

Figure 2. Selection of the alignment center C. The alignment center C of a cluster is the domain with the highest total structure similarity to the other protein domains in the cluster. In this case, we select domain d1maua2 (total 1.19) as the alignment center C of this cluster.
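The selection in Figure 2 can be reproduced with a few lines of code. The sketch below assumes that the center is simply the domain whose off-diagonal similarities sum to the largest value, as the worked sums in the figure suggest; the function name `select_alignment_center` is illustrative and not part of MuLiSA.

```python
def select_alignment_center(names, similarity):
    """Return the domain with the largest summed similarity to all other domains."""
    totals = {}
    for i, name in enumerate(names):
        # Skip the diagonal entry (similarity of a domain with itself).
        totals[name] = sum(similarity[i][j] for j in range(len(names)) if j != i)
    center = max(totals, key=totals.get)
    return center, totals


# Pairwise structure similarities from Figure 2.
names = ["d1n77a2", "d1maua2", "d1h3ea1", "d1gtra2"]
similarity = [
    [1.00, 0.36, 0.36, 0.34],
    [0.36, 1.00, 0.44, 0.39],
    [0.36, 0.44, 1.00, 0.30],
    [0.34, 0.39, 0.30, 1.00],
]

center, totals = select_alignment_center(names, similarity)
for name in names:
    print(f"{name}: {totals[name]:.2f}")
print("alignment center C:", center)
```

With the Figure 2 values this yields totals of 1.06, 1.19, 1.10 and 1.03, so d1maua2 is chosen.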

Figure 3. Identification of conservation residues at positions with z-score > 2.5. (A) The multiple alignments of four protein sequences. (B) The entropy and z-score values of each position. Figure 3(A) shows the alignment results of 13 protein sequences belonging to the “Cytochrome P450 family”. The numbers at the top of Figure 3(A) are the residue numbers of d1eupa_. The “+” symbols denote the positions with z-score > 2.5; Figure 3(B) shows that these positions have z-scores of 3.208 and 3.723. The framed region is the possible pattern candidate.
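To make the thresholding step concrete, the sketch below scores each alignment column and flags positions with z-score > 2.5. Several points are assumptions, since the exact formula is not restated here: it uses plain Shannon entropy over the symbols in a column, treats a gap character like any other symbol, and defines the z-score as (mean entropy - column entropy) / standard deviation so that low-entropy (conserved) columns score high. The toy alignment is invented for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conserved_positions(alignment, threshold=2.5):
    """Return indices of columns whose conservation z-score exceeds the threshold."""
    columns = list(zip(*alignment))
    entropies = [column_entropy(col) for col in columns]
    mu = mean(entropies)
    sigma = pstdev(entropies)
    if sigma == 0:
        return []  # all columns equally variable: nothing stands out
    # Assumed sign convention: lower entropy gives a higher z-score.
    z_scores = [(mu - h) / sigma for h in entropies]
    return [i for i, z in enumerate(z_scores) if z > threshold]


# Toy alignment (hypothetical): only column 3 is fully conserved.
toy_alignment = [
    "AAAGAAAAAA",
    "CCCGCCCCCC",
    "DDDGDDDDDD",
    "EEEGEEEEEE",
]
print(conserved_positions(toy_alignment))
```

In this toy case the conserved column has entropy 0 against a background of 2 bits, which gives it a z-score of about 3 and places it above the 2.5 cutoff.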

Figure 4. (A) Structure similarity matrix of 25 non-redundant ATP-binding domains; (B) SCOP classification of 25 non-redundant ATP-binding domains. In Figure 4(A), domains belonging to the same SCOP family share the same color. Bold values indicate a structure similarity larger than the average value of the row; in other words, the domain in that row is more similar to these domains than to the others. In this matrix, domains of the same SCOP family usually have higher structure similarity with each other (see the regions framed in red), which suggests that the multiple ligand-bound structure alignment and the structure similarity calculation are reasonable and reflect structural and functional information. In Figure 4(B), protein domains are classified according to the SCOP classification hierarchy: class, fold, superfamily, and family. The protein domains are named following SCOP database nomenclature.

Figure 5. MuLiSA result and identified conservation residues in the “Protein kinases, catalytic subunit family” of ATP-binding domains. (A) Three-dimensional distribution of the identified conservation residues and the ligand superimposition. Yellow: d1phk__; blue: d1atpe_; green: d1qmza_; red: d1csn__; grey: d1hck__; pink: d1gol__; light blue: d1h1wa_. (B) Multiple ligand-bound structure alignment result of the “Protein kinases, catalytic subunit family” domains. In Figure 5(A), the identified conservation residues, that is, aligned positions with a z-score of entropy calculation > 2.5, are close to ATP in three-dimensional space, which implies that these residues may play an important role in ATP binding. In Figure 5(B), the labeled residue numbers belong to protein domain d1phk__, the selected alignment center C of this cluster, and the regions framed in red are the PROSITE patterns. Most identified conservation residues lie in these PROSITE pattern regions, which suggests that identifying pattern candidates by extending conservation residues is a reasonable approach.

Figure 6. Three pattern candidates of the “Class I aminoacyl-tRNA synthetases (RS), catalytic domain family” in three-dimensional space, labeled pattern candidate 1, 2 and 3 around the ATP ligand. Pattern candidate 1 overlaps with PROSITE pattern PS00108; pattern candidates 2 and 3 are novel patterns that we identified. All three pattern candidates are close to ATP; hence they may be important in ATP binding.

Figure 7. Comparison of pattern candidate 1 and the PROSITE pattern “Serine/Threonine protein kinases active-site signature” in the “Protein kinases catalytic subunit family” for profile verification of ATP-binding proteins. ROC curves (x-axis: false-positive rate; y-axis: true-positive rate) are shown for the PROSITE motif and the pattern candidate. In a ROC curve, the area under the curve represents the goodness of the test. Our pattern candidate performs worse than the PROSITE pattern; however, because the profile of the PROSITE pattern was also generated from our alignments, this suggests that profiles generated from our alignments are reasonable.
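For readers unfamiliar with ROC analysis, the sketch below shows one common way to build a ROC curve from ranked scores and to compute the area under it with the trapezoidal rule. It is a generic illustration rather than the evaluation code used in this work: the scores and labels are invented, tied scores are not treated specially, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) points for a ranked list."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    # Walk down the ranking from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, is_positive in ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical profile scores; label 1 marks a true member of the family.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
example_labels = [1, 1, 0, 1, 0, 0]
example_auc = area_under_curve(roc_points(example_scores, example_labels))
print(f"AUC = {example_auc:.3f}")
```

An area of 1.0 corresponds to a ranking that places every true member above every non-member, while 0.5 corresponds to a random ranking.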

Figure 8. Comparison of the datasets used in profile search with pattern candidate 1 of the “Class I aminoacyl-tRNA synthetases (RS), catalytic domain family” of ATP-binding domains. ROC curves (x-axis: false-positive rate; y-axis: true-positive rate) are shown for Dataset 1 and Dataset 2. Dataset 1: protein sequences that contain the PROSITE pattern “aminoacyl-transfer RNA synthetases class-I signature”. Dataset 2: protein sequences that contain this PROSITE pattern and also have “ATP-binding” annotations in the SWISS-PROT database. The area under the curve of dataset 2 is larger than that of dataset 1. Because the profiles of the pattern candidates were generated from alignments of ATP-binding domains, and not all protein sequences in dataset 1 have “ATP-binding” annotations, we think that the profile of the pattern candidate is more meaningful for ATP-binding proteins than for proteins that only carry the PROSITE pattern.

Figure 9. Profile scoring list of protein function prediction for ATP-binding proteins. Protein sequences with SWISS-PROT “ATP-binding” annotations are labeled with a “+” symbol in the ATP column. The SWISS-PROT accession numbers are listed in the Seq. column. Values in the “Score” column are the profile scoring scores. The “Pattern” column shows the matched protein sequence segment together with the residue numbers of its first and last residues. Two points must be mentioned. First, the framed sequences all have “ATP-binding” annotations (except for P27604 and P25169); because these sequences all match our newly found pattern candidate, we regard this pattern candidate as a new pattern of ATP-binding proteins. Second, the non-labeled sequences, P27604 and P25169, match the profiles but do not have “ATP-binding” annotations in the SWISS-PROT database; hence these two proteins might be ATP-binding proteins that have not been identified yet.

Figure 10. (A) Structure similarity matrix of 30 non-redundant ADP-binding domains; (B) SCOP classification of 30 non-redundant ADP-binding domains. In Figure 10(A), domains belonging to the same SCOP family share the same color. Bold values indicate a structure similarity larger than the average value of the row. In this matrix, domains of the same SCOP family usually have higher structure similarity with each other (see the regions framed in red). In Figure 10(B), protein domains are classified according to the SCOP classification hierarchy.

Figure 11. MuLiSA result and identified conservation residues in the “motor proteins family” of ADP-binding domains. (A) Three-dimensional distribution of the identified conservation residues and the ligand superimposition. Yellow: d1goja_; blue: d1bg2__; green: d1br2a2; red: d2ncda_; grey: d1f9ta_; orange: d1i5sa_; brown: d2kin.1; light blue: d1lkxa_. (B) Multiple ligand-bound structure alignment result of the “motor proteins family” domains. In Figure 11(A), the identified conservation residues are close to ADP in three-dimensional space, which implies that these residues may play an important role in ADP binding. In Figure 11(B), the labeled residue numbers belong to protein domain d1goja_, the selected alignment center C of this cluster, and the regions framed in red are the PROSITE patterns.

Figure 12. Four pattern candidates of the “motor proteins family” in three-dimensional space, labeled pattern candidate 1, 2, 3 and 4 around the ADP ligand. Pattern candidate 4 overlaps with PROSITE pattern PS00411; pattern candidates 1, 2 and 3 are novel patterns that we identified. All four pattern candidates are close to ADP; hence they may be important in ADP binding.

Figure 13. Comparison of pattern candidate 4 and the PROSITE pattern “Kinesin motor domain signature” in the “motor proteins family” for profile verification of ADP-binding domains. ROC curves (x-axis: false-positive rate; y-axis: true-positive rate) are shown for the PROSITE motif and the pattern candidate. Our pattern candidate performs worse than the PROSITE pattern; however, because the profile of the PROSITE pattern was also generated from our alignment, this suggests that profiles generated from our alignments are reasonable.

Figure 14. Comparison of the datasets used in profile search with pattern candidate 1 of the “motor proteins family” of ADP-binding domains. ROC curves (x-axis: false-positive rate; y-axis: true-positive rate) are shown for Dataset 1 and Dataset 2. Dataset 1: protein sequences that contain the PROSITE pattern “Kinesin motor domain signature”. Dataset 2: protein sequences that contain this PROSITE pattern and also have “motor protein” annotations in the SWISS-PROT database. The area under the curve of dataset 2 is larger than that of dataset 1. Because the profiles of the pattern candidates were generated from alignments of motor protein domains, and not all protein sequences in dataset 1 have “motor protein” annotations, we think that the profile of the pattern candidate is more meaningful for motor proteins than for proteins that only carry the PROSITE pattern.

Figure 15. Profile scoring list of protein function prediction for motor proteins. Protein sequences with SWISS-PROT “motor protein” annotations are labeled with a “+” symbol in the motor column. The SWISS-PROT accession numbers are listed in the Seq.

annotations; because these sequences all match our newly found pattern candidate, we regard this pattern candidate as a new pattern of motor proteins. Second, the non-labeled sequences match the profiles but do not have “motor protein” annotations in the SWISS-PROT database; hence these proteins might be motor proteins that have not been identified yet.

Figure 16. (A) Structure similarity matrix of 40 non-redundant HEM-binding domains; (B) SCOP classification of 40 non-redundant HEM-binding domains. In Figure 16(A), protein domains belonging to the same SCOP family share the same color. In this matrix, domains of the same SCOP family usually have higher structure similarity with each other (see the regions framed in red), which suggests that the multiple ligand-bound structure alignment and the structure similarity calculation are reasonable and reflect structural and functional information. In Figure 16(B), protein domains are classified according to the SCOP classification hierarchy: class, fold, superfamily, and family.

Figure 17. MuLiSA result and identified conservation residues in the “Cytochrome b5 family” of HEM-binding domains. (A) Three-dimensional distribution of the identified conservation residues and the ligand superimposition. Yellow: d1cyo__; blue: d1b5m__; green: d1cxya_; red: d1icca_; grey: d1mj4a_. (B) Multiple ligand-bound structure alignment result of the “Cytochrome b5 family” domains. In Figure 17(A), the identified conservation residues are close to heme in

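In document: MuLiSA: Prediction and analysis of protein functional fragments and important amino acids based on multiple ligand-bound structure alignment (pages 44-118)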
