• 沒有找到結果。

Chapter 2. Materials and Methods

2.1 Overview

Identification of conservation patterns and residues in proteins by multiple ligand-bound structure alignments encompasses a variety of sequential computational phases, including dataset preparation, dataset clustering, multiple ligand-bound structure alignments, post-alignment analysis and entropy calculation, tool verify and protein function prediction (Figure 1).

In dataset preparation, we first select one kind of ligand-binding protein that we are interested and get ligand-binding protein list from PDBsum [18] database. Because we need precise protein structures to identified conservation residues and motifs, we only select protein structures resolved by X-ray diffraction. Then we select ligand-binding domains using programs from SCOP database [19].

In data clustering, we generate all-against-all multiple ligand-bound structure alignments of these selected ligand-binding domains and generate one structure similarity matrix and one sequence identity matrix for each kind of ligand-binding proteins. Once we have these two matrixes, we select non-redundant protein domains, and undergo protein domain clustering.

In the main step of MuLiSA, first we choose the alignment center domains C of each domain cluster based on structure similarity. Second, we undergo C centered multiple ligand-bound structure alignment. After we generate the alignments, z-score calculation of position entropy can help us to identify conservation residues of each domain cluster. For we believed that the functional important motifs mostly composed of functional important

residues, we identified pattern candidates by conservation residues extension.

Finally, we used SCOP [19] and PROSITE [2] databases to verify our results; and then we generate profiles of pattern candidates and use them to search for protein sequence with these patterns in SWISS-PROT database [20].

Appendix A shows the computational steps in the overall workflow and the programs we used in this study.

2.2 Preparation of training datasets and verifying datasets

2.2.1 Preparation of ligand-binding protein list

We have applied MuLiSA to three kinds of ligand-binding proteins, which are ATP-binding proteins, ADP-binding proteins, and HEM-binding proteins. The ligand-binding protein lists were taken from PDBsum database [18].

2.2.2 Preparation of ligand-binding domains

In order to get ligand-binding domains, first we need to get ligand-binding protein structures. Protein structure three-dimensional information was downloaded from Protein Data Bank (PDB) database[21] according to ligand-binding protein lists getting from PDBsum database [18]. The ligand-binding domains were chosen by our program GetDomain.c (see appendix B) and were downloaded from Structure Classification of Proteins (SCOP) database [19].

Ligand-binding domains were chosen with four criteria, they are as follows:

1. When one of distances between atoms of residues of the domain and atoms of ligands is near than 5Å, we think that this domain is a ligand-binding domain.

2. Because multiple ligand-bound structure alignment first superimposed the ligands of aligned proteins, we only choose protein domains which only bind with one ligand.

3. We only choose ligand-binding domains which the ligand they bind is only bind by one protein domain.

4. We only choose one protein domain in one protein structure.

Because the SCOP domain files do not contain ligand information, after choosing these domains we must add back ligand information from Protein Data Bank (PDB) database [21]

into these protein domain files. It must be mentioned that we only choose protein domains solved by x-ray crystallography because we think that these structures are more convincing.

2.2.3 Datasets for verification

To verify whether our alignment results is reasonable and can reflect protein function information, we use the classification of Structural Classification of Proteins (SCOP) database [19] as the benchmark of our structure similarity matrix for non-redundant domain clustering.

PROSITE patterns from PROSITE database [2] were also used to quality assessment and refinement of multiple ligand-bound structure alignments. The protein sequences and annotations were downloaded from SWISS-PROT database [20] and were used for profile verification and protein function prediction.

2.3 Methods

2.3.1 Multiple ligand-bound structure alignment

The main idea of this tool is that we try to align together conservation residues of proteins at ligand-binding sites by ligand superimposition; and then identify conservation

residues and patterns by z-score of entropy calculation. Because we have to change the three-dimensional coordinates of proteins along with superimposed ligands, we developed a structure superimpose tool to deal with this problem. We developed this program MuLiSA from ICP algorithm[22] (see appendix C), this program can make proteins and ligands rotation and displacement on three-dimensional space. After we get the superimposed protein structures, we regard two residues are aligned together based on three order rules:

„ Rule 1: Cβ or Cα (Gly) atom of amino acid residues in 1Å

„ Rule 2: Cβ or Cα (Gly) atom of same amino acid residues in 4Å

„ Rule 3: Cβ or Cα (Gly) atom of same group amino acid residues in 4Å or Cβ or Cα (Gly) atom of different group amino acid residues in 2Å.

The amino acid groups are defined as follows:

9 Basic amino acids: lysine, arginine, and histidine.

9 Acidic amino acids: aspratate, glutamate, asparagine, and glutamine.

9 Aromatic amino acids: phenylalanine, and tryptophan.

9 Aliphatic amino acids: glycine, alanine, valine, leucine, isoleucine, and methionine.

9 Hydroxyl containing amino acids: serine, threonine, and tyrosine;

9 Disulfide-bond forming amino acid: cysteine.

9 Cyclic amino acid: proline.

In ATP-binding proteins and ADP-binding proteins, because of the high divergence of phosphate groups, we aligned whole ligand and only “ribose plus base region” first and then choose the better one as the alignment result.

2.3.2 Sequence identity matrix and structure similarity matrix

If two protein domains have similar function and have highly similar structures in ligand-binding sites, these two protein domain structures should fit well in three-dimensional space. We introduced structure similarities in accordance with multiple ligand-bound structure alignments to present this information.

STab is the structure similarity of protein domain a and protein domain b. La is the length (residue numbers) of protein domain a, Lb is the length (residue numbers) of protein domain b, and L is the aligned residue number of protein domains a and b. STab is given as

{ L L } S

b a T

ab

L ,

= min (1)

We also generate un-gapped sequence identity matrix between protein domains for non-redundant protein domain selection based on only aligned residues of protein domains a and bSEab is the un-gapped sequence identity of protein domain a and protein domain b. mt is the number of identical aligned residues of protein domain a and protein domain b; mmt is the number of non-identical aligned residues of protein domain a and protein domain b.

mmt mt

S

Eab= +mt (2)

2.3.3 Non-redundant protein domains and alignment center C selection

Redundant protein domains must be removed because the profiles generated from alignments may be incredible. We regarded two protein domains are redundant protein domains when their structure similarity and sequence identity are both above 0.8; therefore, we first cluster these protein domains and only choose one with no mutation residues and with the smallest X-ray diffraction resolution.

In order to generate a convincing multiple alignments, we must choose an alignment

center domains C before we generate this alignments. In structure similarity matrixes, the non-redundant protein domain of one cluster which has the highest structure similarity with other protein domains than others was selected as the alignment center C of this cluster. This protein domain was used to be the alignment center of multiple ligand-bound structure alignment. Figure 2 shows one example.

2.3.4 Identification of conservation residues and pattern candidates

To identify these conservation residues, we used entropy (Sp), defined as

∑ ( )

where i and fpi denote the ith amino acid type, the probability of finding the amino acid type i at position p. The entropy is 0 when this position is totally conserved.

In order to estimate the statistical significance of the position entropy, z-score was applied to identify relative conservation positions.

σ µ

=

X

Z

p p (4)

where Zp is the z-score value of position p, σis the standard deviation of all positions entropy, μis the average value of all positions entropy and Xp is the entropy of position p. We identified a conservation position p when Zp > 2.5.

To identify pattern candidates, we extend protein segment from conservation residues.

First we extend from conservation residues to residues with z-score larger than 1.0 next to these conservation residues. When there is one “gap” (gap: residues with z-score less than 1.0) between two residues, “Gap tolerance” let us to exted the segment. For example, if one is larger than 1.0 and the other is larger than 2.5 (the sum of z-score of these two residues is larger than 3.5), we linked this “gap” and we extend this protein segment. If n “gaps” occurs, the sum of z-score of these two gap gapped residues must larger than n+2 and we can

continue to extend the protein segment. We only choose protein segments as pattern candidates extending from conservation residues and the length is equal or longer than 5 residues (Figure 3).

2.3.5 Profile generation

We generate alignment profiles of pattern candidates (discovered by our MuLiSA) and PROSITE patterns from multiple ligand-bound structure alignments.

PF

pi=

{ } f

ip where1i20 (5)

where PFp is the profile of position p; fpi is the probability of the ith amino acid type at position p.

2.3.6 Profile score calculation

We use profiles to search for matched protein segments in protein sequences. The search window size is the length of profiles and shifts one residue each time. Each protein sequence should have N-(n-1) (N is the length of this sequence and n is the length of this pattern) profile scores, and we suppose the segment with the highest profile search score of this protein sequence should be the pattern candidate that we are looking for.

The scoring function is as follows:

S n

n

p i

PF

pi

∑∑

= =

= 1

20

1 (6)

Where S is the profile score, n is the length of a pattern, PFpi is the profile value of amino acid type i at position p. The score is 1 when a segment perfectly matches this profile.

Chapter 3

Protein Function Prediction and Conservation Residues Identification of ATP-, ADP-, and HEM-Binding Proteins

In order to identify the wealth of information present in protein structures, we analyzed conservation residues and patterns in multiple ligand-bound structure alignments.

To infer a major functional role from residue conservation, a function-based clustering is necessary before identifying conservation residues. Statistically, the bias of conservation may be from not having enough and convincing data, this is why we remove structures too much similar, the redundant domains, select alignment center domain C and generate alignments with clusters have more than four protein domains.

Most sequence and structure alignment techniques are protein-based alignment; in other words, these techniques analyze residue conservation only by comparing protein structure or protein sequence similarity. However, local alignment error sometimes happens when the sequence identity is less than 25% in sequence alignment or protein structures are much similar at regions far away from protein functional important region in structure alignment.

At the present, we have applied MuLiSA to ATP-, ADP-, and HEM-binding proteins and identified several conservation residues and pattern candidates. We have generated sequence profiles from multiple alignments and used them to discover protein sequences which may have these profiles. We also proved that MuLiSA is better than other tools in several cases and can discover functional information when comparing with SCOP [19] and PROSITE database[2]. Our major intention was to extract protein structure information from ligand-binding proteins and apply this information to protein function prediction. Table 1 shows some statistics about the dataset we used in this study. We applied MuLiSA to three

kinds of ligand-binding proteins; they are ATP-binding proteins, ADP-binding proteins and HEM-binding proteins. Through getting ligand-binding protein lists, selecting ligand-binding domains, domain clustering, non-redundant domains and alignment center C selection, we use MuLiSA and z-score of entropy calculation to identified conservation residues and pattern candidates of each cluster. These identified conservation residues may be functional important and we survey the literature and it proves that some of these identified conservation residues are critical to ligand-binding or correlate with conformation stability. After pattern candidate identification, we generate profiles of these pattern candidates and use these profiles predict protein functions.

3.1 ATP-binding proteins

3.1.1 Overview

ATP, adenosine triphosphate, is the major energy currency of the cell. It transfers energy from chemical bonds to endergonic reactions of the cell. ATP powers most of the energy-consuming activities of cells, such as muscle contraction, synthesis of polysaccharides, active transport of ions and nerve impulse. Because of ATP is a so important compound and because of the large number of experimental data, like ATP-binding protein structures and literatures, we choose ATP-binding proteins as our first research target. We have generated structure similarity matrix of non-redundant ATP-binding domains for functional-based domain clustering, and we also identified conservation residues and pattern candidates.

Finally, we used profiles of pattern candidates to undergo protein function prediction.

3.1.2 Structure similarity matrix and alignment center C selection

Figure 4 shows the structure similarity matrixes and SCOP classifications of 25 non-redundant ATP-binding domains. When comparing with classifications of SCOP database [19], protein domains with higher structure similarities are usually clustered together and they are always belong to same SCOP families. As we all agree that SCOP database [19]

is a convincing domain structural and functional classification database, it tells us that the multiple ligand-bound alignment and structure similarity calculation is reasonable and can reflect structural and functional information.

In Figure 4(A), domains belong to the same SCOP families are with same colors. The bold values means the structure similarity is larger than the average value of the row; in other words, the domain in this row is much similar with these compared domains than others. In this matrix, we find that most domains of same SCOP family usually have higher structure similarity with each other (see the regions with red frame), it tell us that the multiple ligand-bound structure alignment and structure similarity calculation is reasonable and can reflect structural and functional information. Figure 4(B) shows the SCOP classification of protein domains in Figure 4(A).

3.1.3 Conservation residues identified from ATP-binding domains

After selecting alignment center C of each cluster, we use multiple ligand-bound structure alignment tool, MuLiSA, to generate multiple alignments.

We have identified several conservation residues (with z-score of position entropy > 2.5) of protein domains in “Protein kinases catalytic subunit family” and “Class I aminoacyl-tRNA synthetases (RS), catalytic domain family”. In Table 2, conservation residues were identified and listed; the bold residues are these residues, verified by previous studies, that are

important in ATP-binding or conformation stability [23-32]. For example, in “Human Cyclin-Dependent Kinase 2 protein domain (SCOP code: d1hck__)”, we have identified residues A31, K129, N132 and D145 which interact with ATP through forming hydrogen bonds. Except for these four residues, we also identified six conservation residues and we believe that these residues are very likely in playing important role in ATP-binding.

Figure 5 shows the multiple ligand-bound structure alignment results and the identified conservation residues in “Protein kinases, catalytic subunit family” of ATP-binding domains.

In Figure 5(A), the identified conservation residues, aligned positions with z-score of entropy calculation > 2.5, are close to ATP in three-dimensional space. It implies that these conservation residues may play important role in ATP-binding. In Figure 5(B), the labeled residue numbers belong to protein domain d1phk__, which is the selected alignment center C of this cluster; and the red framed region means the PROSITE patterns. We observed that most identified conservation residues were on these PROSITE pattern region, it tell us that identifying pattern candidates from conservation residues extension may be a reasonable approach.

3.1.4 Pattern candidates identified from ATP-binding domains

We have identified pattern candidates of “Protein kinases catalytic subunit family” and

“Class I aminoacyl-tRNA synthetases (RS), catalytic domain family” of ATP-binding domains. Table 3 lists these pattern candidates. We only choose the pattern candidates which are equal or longer than 5 residues and extending from identified conservation residues with z-score of entropy calculation > 2.5. Table 4 shows the comparison of PROSITE patterns and our defined pattern candidates that overlap with PROSITE patterns of ATP-binding domains.

These pattern candidates are partially overlapping with PROSITE patterns. However, the new pattern candidates which do not overlap with PROSITE patterns in Table 3, may be new clues

to search for ATP-binding proteins. For example, although the pattern candidate 1 in “Protein kinases catalytic subunit family” is overlapping with PROSITE pattern: Serine/ Threonine protein kinases active-site signature, there are also pattern candidate 2 and 3 in “Protein kinases catalytic subunit family” that do not overlap with PROSITE pattern.We found that identified pattern canididates are near ATP in 3-D space, therefore we believe that these two pattern candidates may be new clues to search for ATP-binding proteins (Figure 6).

All of these pattern candidates and PROSITE patterns were used to generate profiles and we will use these profiles for protein function prediction.

3.1.5 Profile verification and protein function prediction of ATP-binding proteins

In order to use profiles generated from our alignments to predict protein function, first we need to verify that the profiles we generated from our alignments is reasonable and convincing. Therefore, we use protein sequences which have PROSITE patterns: PS00178, PS00107, PS00108, PS00411, PS00190, PS00435, PS00436, PS00086 and PS00191. These PROSTIE patterns belong to 8 clusters listed in Table 1. Because pattern candidates identified from one cluster should be meaningful for sequences of this cluster, when we use profiles generated from these pattern candidates to search for protein sequences of this cluster, the sequences of this cluster should have higher profile scoring scores. In other words, a good pattern candidate can separate protein sequences of the cluster that have this pattern candidate from protein sequences of other clusters that don’t have this pattern candidate.

In order to compare with the performance of pattern candidates and PROSITE patterns, we also generated profiles of PROSITE patterns from our multiple alignments. If the performance of pattern candidates in one cluster is better than PROSITE patterns, we may find a novel pattern that is more meaningful than PROSITE patterns in this cluster. In Figure 7, we observed that our defined pattern candidate is worse than PROSITE pattern; however,

because the profile of PROSITE pattern is generated from our alignment, and the performance is good, it proved that the profile generated from our alignments is reasonable and convincing.

In order to verify the effectiveness of profiles generated from our alignments in protein function prediction, we compare the performance in profile search between dataset 1, which contains protein sequences with PROSITE pattern; and dataset 2, which contains protein sequences not only with PROSITE pattern but also have “ATP-binding” annotations in SWISS-PROT database. In Figure 8, dataset 1 contains protein sequences contain PROSITE pattern: aminoacyl-transfer RNA synthetases class-I signature and dataset 2 contains protein sequences contain not only PROSITE pattern: aminoacyl-transfer RNA synthetases class-I signature but also have “ATP-binding” annotations in SWISS-PROT database. We observed that the area under curves of dataset 2 is larger than the area under curves of dataset 1.

Because the profile of pattern candidates were generated from alignments of ATP-binding domains and the protein sequences in dataset 1 are not all have “ATP-binding” annotations in

“KW” of SWISS-PROT database, we suppose that the profile of pattern candidate is more

“KW” of SWISS-PROT database, we suppose that the profile of pattern candidate is more