Thesis Overview - PiSA-BLAST:快速蛋白質結構比對與資料庫搜尋工具

Chapter 1. Introduction

1.3 Thesis Overview

We develop a novel sequence-based structure alignment: PiSA-BLAST for fast database searching. In chapter 2, we have prepared training set from ASTRAL SCOP database 1.65 40% set. We divide domain proteins of training set into many segments that have various kappa and alpha angle. Then, we find representative segments of each kappa and alpha angle cell and use cluster algorithm to group these representative segments. After that, we assign a new code for each representative group. Next, we need to develop a substitution matrix for new codes and use it to replace default matrix for sequence alignment tool. Finally, we can run sequence alignment tool to do fast protein structure searching in database.

In chapter 3, we demonstrated the conformation of representative segments that are belonging to the same coding region and the new substitution matrix for representative segments. In addition, we evaluated the database searching time and screening performance of

PiSA-BLAST with several testing sets by precision, recall, false positive rate and ROC curve.

Besides, we discussed the relationship between precision, sequence identity, structure similarity and theoretically expected number and given some examples to explain PiSA-BLAST how to works on practical applications and what weakness it has in this chapter.

Chapter 4 presented some conclusions and future perspectives. Our major contribution is to develop a novel fast structure alignment tool for protein database searching. The coded sequence has biological meanings. From 3D to 1D level, PiSA-BLAST can decrease execute time by translating 3D-structure to 1D-sequence and using sequence level to align structure.

From 1D to 3D, PiSA-BLAST can enhance the accuracy of sequence alignment for structure searching by adding segment information into 1D-sequence. Because of fast structure database searching, we can apply PiSA-BLAST in biological issues like fold assignment and homology searching. Furthermore, PiSA-BLAST can be used on several practical applications, for example, multiple structure alignment, finding structure motifs, protein function prediction, and protein-protein interaction in the future.

Chapter 2 Materials and Methods

Step-by-step illustration of the PiSA-BLAST methodology is showed in Figure 1. Given one known 3D structure for query protein in a structure database. Every 3D structure in database can be divided into 5-mer structure segments by its kappa and alpha angle. After determining segments, we translate these segments into encoded sequence according kappa and alpha clustering map. The following step is to run structure alignment with encoded sequence using sequence alignment tool: BLAST. As the result, we can gain alignment score, structure similarity and even superposition sites of two aligned protein.

The flowchart of research step is shown in Figure 2. First, we prepare training set from ASTRAL SCOP database 1.65 40% set [15, 16]. Second, we divide domain proteins of training set into many segments that are have various kappa and alpha angle. Then, we find representative segments of each kappa and alpha angle and use cluster algorithm [2] to group these representative segments. After that, we assign a new code for each representative group.

Next, we need to develop a substitution matrix for new codes and use it to replace default matrix for sequence alignment tool. We can use sequence alignment tool to do fast protein structure searching in database and evaluate the performance. Finally, we apply the PiSA-BLAST on practical application.

2.1 Preparing Training Set from Protein Structure Database

We prepare 792 pairs domain proteins in ASTRAL SCOP database 1.65 40% set [15, 16]

for developing of 3D-1D coding and establishing new substitution matrix. The principle of training set collecting is as follows.

First, we select families with at least two domain proteins and totally choice 882 families.

In these families, select one pair domain per ten domain proteins in random. Each pair domain belongs to the same family and sequence identity of each pair domain is less than 40%.

Second, after structure alignment of CE, the RMSD in pair domain proteins is less than 5Å.

Third, the residues in all selected domain proteins are exclude “X”.

We expect that our training set can reflect the real condition in composition of amino acids. Figure 3 shows that Comparison the amino acids compositions of our train set, including 1584 proteins for encoding the structured codes and the substitute matrix, with three well-known structure databases (DSSP database [17], SCOP 95 and SCOP 40 database [15]).

The distributions of amino acids compositions of these four databases are similar. So, our training set can provide right and meaningful information.

2.2 Dividing Protein Structures into Segments by Kappa-Alpha Angle Map

The kappa angle is described as virtual bond angle (bend angle) defined by the three C-alpha atoms of residues I-2, I, I+2. The range of kappa angle is 0° to 180°. The alpha angle is described as virtual torsion angle (dihedral angle) defined by the four C-alpha atoms of residues I-1, I, I+1, I+2. The range of alpha angle is –180 ° to 180 ° (described at http://www.cmbi.kun.nl/gv/dssp/de-scrip.html#SECSTRUC). According to the definition of kappa and alpha angle, we define the local structure with 5 residues long as a segment.

792 domain protein pairs have been divided into total 263696 segments. These segments

are separated by various kappa and alpha angle. Figure 4 shows the distribution of 263696 segments in various kappa and alpha angle. The color bar on the right side shows the distribution scale. These segments are encoded into 23 codes based on the distributions of kappa and alpha angle. The helix-like segments (e.g., A, B, C and D) have more than 9000 segments whose alpha angle ranging from 40° to 60° and kappa angle ranging from 100° to 120°. The strand-like segments (e.g., E and F) have over 3000 segments with alpha angle ranging from -180° to -140° and kappa angle ranging from 0° to 20°.

Because of the large number of segments, we need to cluster these segments for representative segments deciding and meaningful codes assigning.

2.3 Finding Representative Segments and Using Nearest Neighbor Clustering Algorithm for New Codes Assigning

There are total 648 cells on kappa and alpha angle map K. Each cell includes many segments shown in Figure 4. We use the simple way as follows to decide one representative segment for each cell.

We building inter-segment distance matrix for one cell. Let dij be the structure distance (measured by superimpose program [18]) between segment i and segment j. The number of i and j is equal to the number of segments for this cell. Then, we summarize each column of the distance matrix and get the minimum of sum of column. Hence we select the representative segment for one cell depend on its lowest total structure distance among other segments.

After finding representative segment for every cell, we use nearest neighbor clustering

algorithm [2] to group these representative segments with similar conformation. The algorithm is based on calculating a matrix, D, where N is the number of representative segments to be clustered. The matrix D is stored with the values of Rmsd for inter-representative segments. Dij is a measure of structure similarity (computed by superimpose program [18]) between representative segments i and j. Clusters are formed recursively by adding other representative segments according to the nearest neighbor criterion. The method of nearest neighbor clustering is as follows:

Input:

(1) The matrix D is stored with the values of RMSD for all inter-representative segments.

Dij is a measure of structure similarity between representative segments i and j (0≦i, j≦648).

(2) The matrix K is collected with the numbers of segments with various kappa and alpha angle. K_ab is a number, which means how many segments in alpha angle a° and kappa angle b° (0≦a≦36, 0≦b≦18).

Output:

The encoding rule map E point out that each cell with various alpha-kappa angle could be assign one letters of the alphabet. The size of encoding rule map is 36*18 according the range of kappa and alpha angle. The range of alpha angle is observed into 10° interval ranging from -180° to 180°. The range of kappa angle is observed into 10° interval ranging from 0° to 180°.

Step:

(1) Select one cell of E with particular kappa-alpha angle which the Kab is the most and this cell Eab did not assign any code yet to be the center of a cluster.

(2) Assume that the representative segment of this center is representative segment i.Sort the value from Di,0 toDi,648.

(3) According the result of sorting, from top to bottom, group every cell Ea’b’ repeatedly into the cluster with center Eab if the Ea’b’ fit in with following conditions.

(3.1) Given a threshold, t, on the nearest neighbor distance. Assume that the representative segment of Ea’b’ is representative segment j. The Dij is less than t.

(3.2) Given a threshold, u, for the maximum fragments number. If group into the cluster, the summation of the number of fragments in this cluster is still less than u.

(4) Check if this cell Ea’b’ has already grouped to other cluster.

(4.1) If not, group the cell E_a’b’ into the cluster with center E_ab and record the D_ijfor the optimized clustering.

(4.2) Otherwise, compare the value of the previous and present record of Dij.

(4.2.1) If the present record of Dij is less than the previous one, Ea’b’ would be re-assign into the present cluster. However, the sum of the number of fragments in this cluster must be less than u.

(4.2.2) If the previous record of Dij is less than the present one, do nothing and keep previous cluster.

(5) Repeat step 1 to 4 until every cells of E is clustered to 21 groups.

(6) First group has only one cell of E. This cell Eab is assigned to code “A”. The code

“A” with alpha angle more than 46° and kappa angle less than 114° will be assigned to another code ”Y”.

(7) Every cells of E in second group are assigned to code “B”, ones in third group are assigned to code “C”, and etc. Ones in last group are assigned to code “X”. There are exclude J, O and U in code assignment.

(8) If the Kab is less than 40, this cell would be assigned to code “Z”.

(9) Every Eab is assigned to one code and output result of encode rule map E.

Here, the threshold, t and u, is given depending on how many groups we want. Here the threshold t is 0.72, u is 18450, and 21 groups is made. The threshold u is given by the 7% of the number of total segments.

Each group in various cells is assigned to a new code. There are 21 codes named as letter

“A” to “X” (exclude “J”, “O” and “U”). If the number of segments in one cell is less than 50, this cell will be assigned to Code “Z”. In addition, when the structure is coding to sequence, the new code “A” with alpha angle more than 46° and kappa angle less than 114° will be assigned to another code ”Y”.

2.4 Generating a Substitution Matrix for 23 New Codes

The method of generating a substitution matrix refer to BLOSUM62 [19]. The elements of the substitution matrix are calculated as follows. For each residue position in the training set of pair database of aligned structural pairs, the statistics is counted at each aligned position.

Each protein chain is considered to be a coded sequence aligned to a structure. The substitution score for coded sequence i and j with homologous structure is given by the information value [20].

Let the total number of amino acid i, j pairs (1≦j≦i≦20) for each entry of the frequency table be fij. Then the observed probability of occurrence for each i, j pair is

20 1 1

/

ⁱ

ij ij ij

i j

q f f

= =

= ∑∑

⁽¹⁾

Next we estimate the expected probability of occurrence for each i, j pair. It is assumed that the observed pair frequencies are those of the population. In general, the probability of occurrence of the ith amino acid in an i, j pair is

i ii ij

/ 2

j i

p q q

≠

= + ∑

(2)

The expected probability of occurrence eij for each i, j pair is then pipj for i = j and pipj + pjpi = 2 pipj for i≠j.

Then, the substitution matrix scores are then defined as

log

2 ^ij

where λ is an arbitrary positive rational number. Here, λ is given 1.89 for the best performance and efficiency.

The following describes the overall procedure for generating the λ value and optimized gap penalties. In the first step, we tested the λ value observed into 0.5 interval ranging from 1.0 to 10.0. The result revealed that the λ value between 1.5 and 2.5 is better.

The second step is verifying the detail λ value observed into 0.1 interval ranging from 1.5 to 2.5. Furthermore, we test the six sets of open and extend gap penalty and λ value to find out the optimized parameter for the performance of PiSA-BLAST. As the Figure 9 showing,

the best combination of parameters is 8 for open gap penalty, 2 for extend gap penalty, and from 1.8 to 1.9 for the λ value. Finally, we experimented the best λ value from 1.82 to 1.93 according to the observation of results in second step. Figure 10 demonstrates that we acquire the best performance of database searching when λ value is 1.89.

2.5 Structure Searching by Sequence Alignment tool: BLAST and PSI-BLAST

We download standalone BLAST 2.2.10 [3, 4] from:

ftp://ftp.ncbi.nlm.nih.gov/blast/executables/snapshot/2004-12-05/

The default matrix: “BLOSUM 62” is replaced by the substitution matrix for 23 new codes. Use program: “formatdb” to create our own database made of 3D-1D coded and FASTA formatted protein sequences for BLAST searching. We execute BLAST by program

“Blastall”. Blastall may be used to perform all five flavors of blast comparison. A typical use of blastall would be to perform a “blastp” search (protein vs. protein) of a query file called INPUT would be:

blastall -p blastp –d DATABASE –i INPUT–o OUTPUT –M BLOSUM62 -G 8 -E 2 -F F

The output is placed into the result file OUTPUT and the search is performed against the 'DATABASE' database. Other blastall options showed above are “-M BLOSUM62” which is default scoring matrix, “-G 8 –E 2” which means that open gap penalty is 8 and extend one is 2, and “-F F” that is to tell blastall do not filter query sequence.

Furthermore, we also combine Position-Specific Iterated BLAST, or PSI-BLAST, with our method for detail database searching [3]. The PSI-BLAST program can do an iterative

search in which sequences found in one round of searching are used to build a score model for the next round of searching. When the PSI-BLAST is producing, the position-specific matrix for round i+1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value in round i.

There is another command to perform PSI-BLAST.

blastpgp -d DATABASE -i INPUT -o OUTPUT -F F -G 8 -E 2 -j 3 -t F -h 1e-15

Program “blastpgp” takes a protein query and perform PSI-BLAST search to create a position specific matrix using a protein database. Some of arguments used in PSI-BLAST are the same as BLAST. There are different options between BLAST and PSI-BLAST, such as “-j 3” which is the maximum number of rounds, “-t F” which means that program do not use composition based statistics, and “-h 1e-15” that is the e-value threshold for including sequences in the score matrix model. The e-value threshold is 0.001 in default. However, in order to obtain correct result and best performance, we change the value from 0.001 to 1e-15 for PiSA-BLAST.

The top part of the output of PSI-BLAST for each round distinguishes the sequences into:

sequences found previously and used in the score model, and sequences not used in the score model. The output currently includes lots of diagnostics requested by users at NCBI. To skip quickly from the output of one round to the next, search for the string “producing”, which is part of the header for each round and likely does not appear elsewhere in the output.

PSI-BLAST “converges” and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round.

2.6 Evaluating the Performance

We compare both the results from BLAST, PSI-BLAST, CE and PiSA-BLAST against SCOP classifications [15, 16] which is regarded as the golden standard by the biologists.

SCOP classification hierarchy is made of 4 levels: class, fold, superfamily and family among which family is the most detailed classification. In our test, if a protein in the result set belong to the same SCOP family as the query protein, it is counted as a true hit.

We used 3 testing sets for evaluating the performance. First two experiments are refer to Aung et al. [14]. One involved a small database and a limited number of queries, and the other involved a large database and a greater number of queries. Third experiment involved the same number of queries as second testing set and SCOP 1.65 95% database.

In first experiment set, there are 10 proteins from Globins family (a.1.1.2 in SCOP) and 10 proteins from Serine/Threonin kinases family (d.144.1.1 in SCOP) from the representative ASTRAL dataset with less than 40% sequence homology. These 20 proteins are designated as the query proteins. Table 1 shows that the small test set selected from previous work [14].

There are 200 members in the database and 20 queries in two SCOP families are listed. In addition, other 180 proteins were selected from four major classes (All-α, All-β, α/β and α+β) of the same representative dataset. These 180 proteins were combine with the above-mentioned 20 query proteins to form the small target database of 200 proteins.

In second and third testing sets, we conduct another experiment using a large database containing 33311 proteins which is refer to Aung et al. [14] and containing 9354 domain proteins in SCOP 1.65 95% database. From them, Zeyar select 108 query proteins which belongs to 108 medium-size families (with ≥40 and ≤180 members) from four major classes,

and which have less than 40% sequence homology to each other. The lists of 108 query proteins are given respectively in Table 2.

Four common metrics were used to evaluate the quality of database searching, including precision, recall, false positive rate and ROC curve. The precision is defined as Ah/T_h. The recall and false positive rate can be given as Ah/A and (Th-Ah)/(T-A), respectively. Here, Ah is the number of true hits in the hit list, Th is the total number of domain proteins in the hit list, A is total number of true hits in the databases, and T is 33311 or 9354, the total number of domain proteins in these two large databases. The ROC curve plots the sensitivity against the

“1-specificity”. The sensitivity is equal to recall, and the “1-specidicity” is equal to false positive rate.

True hit, also called a relevant retrieval, is defined as an event of retrieving a protein from the database that belongs to the same “family” as the query. BLAST, PSI-BLAST and PiSA-BLAST can retrieve subject proteins by e-value and alignment score. However, CE did not provide sorted retrieval list when we use CE to perform one-against-all searching. For this reason, we need to sort the searching results by ourselves in order to obtain the retrieval list.

In Figure 12, Recall-precision curves of CE using z-score and rmsd to order searching results on 108 queries searching the SCOP 95 database. The results of CE searching which are sorted by z-score are much more accurate than by rmsd. Hence, the results of CE searching are sorted by its Z-score.

We test the database searching time for CE, BLAST, PSI-BLAST and PiSA-BLAST on the same machine (LINUX platform with Pentium IV processors 2.8GHz and 2GB memory).

We use the default parameters for CE, BLAST, PSI-BLAST and 7 target databases: small

database including 200 proteins, large database including 33311 proteins, PDB, nr-PDB, SCOP 1.65 database, SCOP 1.65 95% and SCOP 1.65 40% as the database searching for both methods.

2.7 Practical Applications

The advancements in the protein crystallography to determine the structures of the protein molecules, the sizes of the structure databases such as PDB are growing at a very fast

在文檔中 PiSA-BLAST:快速蛋白質結構比對與資料庫搜尋工具 (頁 15-0)