EMPSC: A New Method Based on Ellipsoidal Model for Protein Structure Comparison


Yhi Shiau^1,2, Jia-Nan Wang¹, Yu-Feng Huang¹, Chien-Kang Huang³

yshiau@cht.com.tw, jnwang@mars.csie.ntu.edu.tw, yfhuang@csie.ntu.edu.tw, ckhuang@ntu.edu.tw

¹Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan

²Chunghwa Telecom Laboratories, Tauyuan, Taiwan

³Department of Engineering Science and Ocean Engineering, National Taiwan University, Taipei, Taiwan

Abstract

This paper proposes EMPSC, a new method for the well-known protein structure comparison (PSC) problem. EMPSC is a protein structural alignment algorithm based on an ellipsoidal model abstraction. We segment the protein 3D structure into two kinds of segments: Secondary Structure Elements (SSEs) recognized by DSSP¹, and the remaining coil/loop structures. The SSEs serve as the initial alignment centers from which the transformation coordinate systems are obtained. Several heuristic filters and a global alignment estimation based on geometric hashing are used to find good initial alignments quickly. In the refinement stage, a standard refinement algorithm fine-tunes the alignments produced by the first stage. Our experimental results show that EMPSC generally achieves comparable accuracy and better performance in comparison with existing PSC algorithms. Moreover, we analyze the factors that affect the performance of EMPSC and of SSE-based PSC algorithms. Multiple protein structure comparison and local structure comparison are left for further investigation.

Keywords: Eigenvector, Dynamic Programming, Geometric Hashing, Secondary Structure Elements.

Availability: The EMPSC tool is accessible at our lab website http://ballerina.csie.ntu.edu.tw

Introduction

Since Beccari discovered proteins in 1747, the important role that proteins play in biochemical reactions has drawn researchers to the study of protein function. Within this research, Protein Structure Comparison (PSC) is one of the most basic and important tasks for detecting evolutionary and functional relationships between proteins. The function of a protein is related to its 3D structure², so proteins with similar substructures may have similar functions. Improving the methodology and tools of PSC has therefore been an important issue in molecular biology and bioinformatics for many years^3-7.

Today biochemists need faster and more accurate PSC tools, because protein databases are growing quickly with the help of the computing power available to recent biochemistry research. For example, the Protein Data Bank⁸ had grown to 35,813 proteins by 28-Mar-2006, Swiss-Prot to 163,000 entries, and TrEMBL to 1,450,000 entries. Continuing to improve PSC tools so that they can handle this fast-growing mass of protein data is a great challenge.

To detect functional or evolutionary relationships between proteins, PSC algorithms compare the 1D sequence information or the 3D structure information of the amino acid chains. The purpose of PSC is to identify a maximal set of equivalent Cα atoms upon which the 3D structures of the compared proteins can be optimally aligned. Previously proposed PSC algorithms exploit many different computing approaches, including Monte Carlo (Dali⁹), dynamic programming^10-12 (VAST¹³), 3D clustering¹⁴, graph theory¹⁵, spline approximation¹⁶ and geometric hashing¹⁷. To speed up PSC further, quick-n-dirty approaches are applied in today's PSC tools, such as CE¹⁸, SAP¹⁹, ProSup^20,21, and FLASH²². Because PSC is an NP-hard problem, most approaches propose different heuristics to approximate the optimal solution, and quick-n-dirty approaches have become the mainstream of today's PSC algorithms. The following paragraphs review these approaches.

The CE method starts from the concept of aligned fragment pairs (AFPs); that is, the initial alignment is grown from AFPs. CE uses a combinatorial extension of an alignment path defined by AFPs to obtain the extension paths that pass the similarity threshold defined by the inter-residue distance constraint, which approximates the superimposition of the two proteins. In refinement, the top-20 alignment paths are evaluated based on r.m.s.d. and the best one is selected. The best alignment path is further refined with dynamic programming to obtain the final global alignment. The drawback of CE is that calculating the inter-residue distance constraint is time-consuming. In addition, in our experiments we observed that CE tends to maximize the number of corresponding residues rather than minimize the r.m.s.d.

The SAP method uses iterated double dynamic programming in both the initial alignment finding and the refinement processes. The initial alignment finding begins from every residue pair formed by one residue from each of the compared proteins. SAP finds the potentially equivalent residue pairs based on local structure and environment. Then, for each potentially equivalent residue pair, SAP calculates the alignment score with dynamic programming. The parameters of the scoring function include the direction component, the orientation component, the sequence term, and the spatial term.

The initial alignments are found by sorting all the elements of the bias matrix and taking the top-20 highest-scoring elements. These top-20 results are then further refined with dynamic programming. To avoid an exhaustive search of all possible residue pairs, SAP offers the option of taking a randomized selection. The main issue of SAP is its high time complexity, because double dynamic programming is very time-consuming for large matrices.

The ProSup method approximates the solution from seed pairs, where the seed fragments are similar fragments of the compared proteins. ProSup superimposes all possible seeds and evaluates the whole protein by superimposing equivalences instead of using dynamic programming. In refinement, a standard procedure that combines dynamic programming and a least-squares constraint^23,24 further refines the initial alignments. ProSup can output multiple solutions. Its accuracy and efficiency are affected by the length of the seed fragments.

The FLASH method greatly reduces the time complexity through an aggressive abstraction of protein structures. FLASH identifies the Secondary Structure Elements (SSEs) of the compared proteins with the DSSP tool and represents each SSE as a vector, which serves as a data reduction of the protein's 3D structure. Experiments revealed that the SSEs of a protein provide better initial alignment performance. FLASH considers only the information from α-helices and β-sheets and neglects the coil information. It establishes the angle-distance map for all SSE pairs. After calculating the SSE matching probabilities, FLASH uses a greedy procedure to select viable alignment solutions (at least 3). Statistical significance is applied to filter out inappropriate initial alignments. In refinement, FLASH applies the same standard refinement algorithm as ProSup. With SSE-based data reduction, FLASH greatly reduces the time spent on initial alignment finding, and it can output multiple solutions.

However, if two proteins have similar local structures but their global similarity does not meet the criterion of statistical significance, FLASH stops further comparison and gives no result. In addition, if a protein has no SSEs, FLASH does not apply. The single-vector representation also raises problems. Representing an α-helix by a single vector is appropriate, because an α-helix is hard to bend, but the vector representing a β-sheet may lose some structural information: β-sheets are usually bent or curved, so the length of the identified β-sheet seriously affects how well a single vector describes it.

Based on these previous studies, we aim to develop a new method that is more efficient than conventional PSC approaches without the limitations of FLASH. In this paper, we propose a new protein structural alignment method based on an ellipsoidal model, named EMPSC, which applies a generic heuristic mapping function in the initial alignment stage. EMPSC has four steps: preprocessing, initial alignment, refinement and final evaluation. EMPSC identifies the SSEs of the target protein with the DSSP tool. The remaining parts of the protein, mostly coil/loop structures, are also considered. Rather than using a single-vector representation as FLASH does, we use an ellipsoid model to represent each sequence segment. The detailed algorithm is described in the Proposed Method section.

EMPSC combines several strengths of previous approaches. The initial alignments are selected from segment pairs (mainly SSEs) instead of residue pairs, and EMPSC can output multiple solutions. Like the authors of FLASH, we believe that SSEs are clearly important for a protein's structural conformation, and that abstracting the structure information with segments (α-helix, β-sheet and coil) could work better than CE's AFPs and ProSup's seeds. In addition, as in CE and ProSup, the initial alignment finding in EMPSC starts from the viewpoint of local alignment.

Proposed Method

The workflow of EMPSC is depicted in Fig 1, and the algorithm is described in detail below. The algorithm has four basic steps.

(1) Preprocessing: Segment the proteins into SSEs with the DSSP tool, and generate the ellipsoidal representation for each segment (mainly α-helix and β-sheet).

(2) Initial alignment: Generate the potential aligned segment pairs and look for good initial alignments via heuristic filtering.

(3) Refinement: Iteratively apply a dynamic programming algorithm to refine the initial alignment, a well-known procedure used by most PSC algorithms.

(4) Final evaluation: Evaluate the refined alignments and report the number of corresponding residues and the r.m.s.d. of each alignment solution.

For each protein:

(1a) Identify the SSEs from protein

(1b) Cluster remaining residues from protein

(1c) Generate the ellipsoidal representation for each segment

For each pair of compared proteins:

(2a) Generate the pairs of candidate aligned SSE segments from compared proteins

(2b) Filter the pairs with heuristic filtering function

(2c) Superimpose candidate pairs by center-eigenvector transformation

(2d) Rank the candidate pairs with a fast global alignment approach, and select Top-N best pairs

(3) Iteratively tune the global alignment center with superimposing compared proteins (dynamic programming approach)

(4) Final evaluation

Fig 1. The workflow diagram of the EMPSC algorithm.

Preprocessing: Generate Ellipsoidal Representation

As in most modern PSC algorithms, in the initial alignment step we do not estimate the alignment quality by comparing the structures of two proteins atom by atom, but by comparing their abstract models. In EMPSC, ellipsoid models, rather than all residues of the SSEs, are used to represent the overall 3D structure of a protein. Step (1) of the EMPSC algorithm generates the ellipsoidal representation for each segment of the target protein. First, in step (1a), we use the DSSP tool to identify the SSEs (α-helix, β-sheet) of the protein. In step (1b), the remaining residues of the protein (the coil/loop information) are clustered into a set of new segments (coil sub-segments) according to residue adjacency, so that each run of contiguous remaining residues forms one segment. After that, in step (1c), each segment is represented by a 3D ellipsoidal model. Fig 2 depicts the relationship between a residue chain and its 3D ellipsoidal model. PCA (Principal Component Analysis)²³ is applied to find the 3 orthogonal eigenvectors and the 3 corresponding eigenvalues of each segment. After step (1c), the protein is decomposed into a set of residue segments and their ellipsoidal representations, as shown in Fig 3.

Fig 2. A residue chain, its 3D ellipsoid, and the 3 eigenvectors of the ellipsoid.

Fig 3. The generating process of the ellipsoidal representation. (a) The example protein. (b) The protein after processing by DSSP, with the SSEs identified. (c) The further processing of the remaining segments.
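The ellipsoidal representation of step (1c) can be sketched with a few lines of NumPy. This is a minimal illustration, assuming that each segment is given as a list of Cα coordinates and that the ellipsoid is summarized by the segment's geometric center plus the eigenvectors and eigenvalues of its coordinate covariance matrix; the function name `ellipsoid_of_segment` and the example coordinates are hypothetical and not taken from the EMPSC implementation.

```python
import numpy as np

def ellipsoid_of_segment(ca_coords):
    """Summarize one residue segment by center, principal axes and eigenvalues.

    Hypothetical helper: PCA on the C-alpha coordinates of the segment.
    """
    pts = np.asarray(ca_coords, dtype=float)      # shape (n_residues, 3)
    center = pts.mean(axis=0)                     # geometric center
    centered = pts - center
    cov = centered.T @ centered / len(pts)        # 3x3 covariance matrix
    # eigh is meant for symmetric matrices; eigenvalues are returned ascending
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]             # largest variance first
    axes = eigvecs[:, order].T                    # one eigenvector per row
    return center, axes, eigvals[order]

# Made-up segment: eight points on a straight line along the x axis
segment = [(1.5 * i, 0.0, 0.0) for i in range(8)]
center, axes, eigvals = ellipsoid_of_segment(segment)
```

For this straight segment the first axis points along x and the other two eigenvalues are zero; a bent β-strand would instead spread its variance over two or three axes, which is the extra information a single vector cannot carry.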

Initial Alignment: Find a Good Superimposed Transformation for the Initial Alignments

Finding good superimposed transformations is the main task in obtaining the initial alignments of EMPSC. The whole process is step (2) of the EMPSC algorithm. In general, the initial alignment search is a filtering process that eliminates impossible or dissimilar segment mappings and finds the top-N best segment pairs, that is, the top-N superimposed transformations between the two compared proteins.

First, in step (2a), we generate all possible initial alignments from the compared proteins. In this step, every pair of SSE segments (α-helix and β-sheet only) from the compared proteins can serve as the center of the new coordinates, and the remaining coil sub-segments are used in the biochemical filtering of step (2b).

Before estimating the quality of every initial alignment in step (2d), EMPSC first checks the local structure of each initial alignment; that is, it calculates the similarity between the two mapped SSEs (α-helix or β-sheet) and their surrounding coil sub-segments (at most four). In step (2b), EMPSC filters the initial alignments with a heuristic filtering function. Conceptually, we could define one integrated filtering function combining all factors that are effective for judging a good initial alignment. Instead, we currently implement several successive filters that discard unmatched or dissimilar pairs: a type filter, a mass filter, and a biochemical filter. The type filter makes sure that the secondary structure types of the mapped segments are the same (α-α or β-β). The mass filter makes sure that the numbers of residues in the two mapped segments differ by less than four. The biochemical filter checks the similarity of biochemical properties between two segments. We found that the biochemical properties of the coil segments before or after the SSEs (α-helix and β-sheet), rather than those of the SSEs themselves, are beneficial for finding initial alignments. Therefore, in this filtering process, EMPSC makes sure that the biochemical features of the surrounding coil sub-segments are similar. The biochemical feature is currently defined as the ratios of hydrophobic residues, polar uncharged residues and polar charged residues in the compared segments. The biochemical filter is described in detail as follows.

The biochemical filter:

Given two compared residue sequences A and B, the biochemical_diff score is defined as follows:

$$\mathrm{biochemical\_diff}(A, B) = \sum_{i=1}^{3} \left| a_i - b_i \right|$$

where a1 and b1 are the ratios of hydrophobic residues, a2 and b2 are the ratios of polar uncharged residues, and a3 and b3 are the ratios of polar charged residues in sequences A and B, respectively.

If biochemical_diff is larger than a threshold (the empirical value is 0.7), the candidate pair is filtered out.
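The biochemical filter can be illustrated with a short Python sketch. It assumes that the score is the sum of the absolute differences of the three class ratios (a plain signed sum would always be zero, because the three ratios of each sequence add up to one). The grouping of one-letter amino acid codes into the three classes is a common textbook grouping chosen for illustration only; the paper does not list the grouping it uses, and the function names are hypothetical.

```python
# Assumed residue classes (illustrative textbook grouping, not from the paper)
HYDROPHOBIC = set("AVLIMFWPG")
POLAR_UNCHARGED = set("STCYNQ")
POLAR_CHARGED = set("DEKRH")
CLASSES = (HYDROPHOBIC, POLAR_UNCHARGED, POLAR_CHARGED)

def class_ratios(seq):
    """Return the fraction of residues of seq that fall in each of the three classes."""
    if not seq:
        return (0.0, 0.0, 0.0)
    return tuple(sum(res in cls for res in seq) / len(seq) for cls in CLASSES)

def biochemical_diff(seq_a, seq_b):
    """Sum of absolute differences between the class ratios of two segments."""
    return sum(abs(a - b) for a, b in zip(class_ratios(seq_a), class_ratios(seq_b)))

def passes_biochemical_filter(seq_a, seq_b, threshold=0.7):
    """Keep a candidate pair only if the score does not exceed the threshold."""
    return biochemical_diff(seq_a, seq_b) <= threshold

# Two coil sub-segments with the same class composition pass the filter
same_composition = passes_biochemical_filter("AVSD", "ALTE")
# An all-hydrophobic segment against a mixed one is rejected (score 1.0)
different_composition = passes_biochemical_filter("AAAA", "AASD")
```

With this definition the score ranges from 0 (identical composition) to 2 (disjoint composition), so the empirical threshold of 0.7 sits in the lower half of the range.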

The remaining candidate SSE segment pairs are those that pass all filtering criteria. In step (2c), EMPSC aligns the geometric centers and the 3 principal eigenvectors of the candidate mapped segments, which generates new coordinates for the two compared proteins, as shown in Fig 4.
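One way to picture the center-eigenvector transformation of step (2c) is to express each protein in the local frame of its own candidate ellipsoid; once both proteins are written in these local frames, the two ellipsoids coincide. The sketch below assumes that each ellipsoid is given by a center and three orthonormal axes ordered by decreasing eigenvalue. The function name `to_ellipsoid_frame` is hypothetical, and replacing the third axis by the cross product of the first two is an assumption made here to obtain a proper rotation; the paper does not say how the sign ambiguity of eigenvectors is resolved.

```python
import numpy as np

def to_ellipsoid_frame(coords, center, axes):
    """Express coordinates in the frame defined by an ellipsoid.

    coords: (n, 3) array of atom positions
    center: (3,) geometric center of the candidate segment
    axes:   (3, 3) array, one unit eigenvector per row, largest eigenvalue first
    """
    frame = np.array(axes, dtype=float)
    # Assumption: force a right-handed frame so the mapping is a pure rotation
    frame[2] = np.cross(frame[0], frame[1])
    return (np.asarray(coords, dtype=float) - np.asarray(center, dtype=float)) @ frame.T

# Protein A: candidate ellipsoid at the origin with axes along x, y, z
point_a = to_ellipsoid_frame([[1.0, 2.0, 3.0]], [0.0, 0.0, 0.0], np.eye(3))

# Protein B: the same structure rotated by 90 degrees about z and shifted to (10, 0, 0)
axes_b = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
point_b = to_ellipsoid_frame([[8.0, 1.0, 3.0]], [10.0, 0.0, 0.0], axes_b)
```

Both calls return the same local coordinates (1, 2, 3), which is what makes a direct residue-by-residue comparison possible in the next step.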

In step (2d), a fast global alignment estimation based on geometric hashing is used to estimate the quality of the initial alignments. Geometric hashing is a fast way to compare 3D structures. Taking the geometric center of the mapped SSEs as the origin, the position of each Cα atom of one target protein is transformed into the polar coordinate system. In the preprocessing of geometric hashing, every Cα atom is put into the hash table according to the distance between the Cα atom and the origin. We also transform the other protein into the new coordinates and calculate the estimated alignment score. The estimated scoring function of the global alignment is described in Fig 5. Finally, only the top-N candidates are selected for further refinement; that is, we keep the top-N superimposed transformations as the good initial alignments.

Fig 4. Align the 3 eigenvectors of two ellipsoids according to their magnitude.

Global Alignment Estimation based on Geometric Hashing

Assume OA and OB are the origins of the new coordinates for proteins A and B, respectively.

RA and RB are residues in proteins A and B, respectively.

The hashing function is

$$\mathrm{hash}(R_i) = \mathrm{dist}(R_i, O_i) \bmod \mathrm{bin\_size} = \mathrm{bin\_index\_of\_}R_i,$$

where i can be A or B.

To avoid collisions, the bin size of the hash table is chosen to be larger than the diameter of most proteins.

The scoring function of global alignment estimation is defined as

$$\mathrm{Score}(P_A, P_B) = \sum_{R_A \in P_A} \max_{R_B \in P_B} \left\{ d_c^2 - \mathrm{dist}^2(R_A, R_B) \;\middle|\; \mathrm{hash}(R_A) = \mathrm{hash}(R_B) \wedge \mathrm{dist}(R_A, R_B) < d_c \right\},$$

where dc is the distance cutoff for alignment construction.

The higher score implies the lower r.m.s.d. and the higher number of corresponding residues.

Fig 5. The global alignment estimation based on geometric hashing
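The estimation in Fig 5 can be sketched in plain Python. The sketch reads the scoring function as a sum over the residues of protein A of the best value of dc² minus the squared distance, taken over the residues of protein B that fall in the same radial bin and lie closer than dc. It assumes both proteins are already expressed in the common coordinate system with the origin at the aligned center. The bin width of 1 Å, the table size and the function names are assumptions for illustration, not values from the paper.

```python
import math
from collections import defaultdict

def radial_bin(point, bin_width=1.0, n_bins=1024):
    """Hash a residue by its distance from the origin (assumed bin width and table size)."""
    r = math.sqrt(sum(c * c for c in point))
    return int(r // bin_width) % n_bins

def estimate_score(coords_a, coords_b, dc=6.0, bin_width=1.0, n_bins=1024):
    """Estimate alignment quality of two proteins given in a common frame."""
    # Preprocessing: put every residue of B into the hash table
    table = defaultdict(list)
    for rb in coords_b:
        table[radial_bin(rb, bin_width, n_bins)].append(rb)

    score = 0.0
    for ra in coords_a:
        best = None
        # Only residues of B in the same radial bin are compared with ra
        for rb in table.get(radial_bin(ra, bin_width, n_bins), ()):
            d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
            if d2 < dc * dc:
                gain = dc * dc - d2
                if best is None or gain > best:
                    best = gain
        if best is not None:
            score += best
    return score

# Two identical two-residue "proteins": each residue contributes dc^2 = 36
identical = estimate_score([(1.5, 0.0, 0.0), (0.0, 4.2, 0.0)],
                           [(1.5, 0.0, 0.0), (0.0, 4.2, 0.0)])
# Same radius but far apart in space: same bin, yet outside the cutoff
far_apart = estimate_score([(20.5, 0.0, 0.0)], [(0.0, 0.0, 20.5)])
```

Because every residue of A is compared only with the residues of B in one bin, the estimate avoids the all-against-all distance computation, at the price of missing pairs that straddle a bin boundary.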

Refinement and Final Evaluation

Step (3) applies the same refinement method as most quick-n-dirty PSC algorithms. The least-squares method is applied in the refinement process, and step (3) repeatedly refines each initial alignment until the number of corresponding residues converges. In this step, the algorithm iteratively tunes the global alignment center by superimposing the two proteins using dynamic programming. Finally, in step (4), EMPSC outputs the tuned alignments of the top-N candidates from step (2) as N alternative solutions.
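The least-squares part of one refinement iteration can be illustrated as follows. The sketch uses the SVD-based Kabsch procedure, a standard way to compute the rotation and translation that minimize the r.m.s.d. between two sets of paired points; whether EMPSC uses exactly this variant is not stated in the paper, and the function name `superimpose` and the example coordinates are hypothetical. The dynamic programming step that re-selects the corresponding residues between iterations is not shown.

```python
import numpy as np

def superimpose(mobile, target):
    """Least-squares fit of paired points: returns rotation, translation, r.m.s.d."""
    P = np.asarray(mobile, dtype=float)
    Q = np.asarray(target, dtype=float)
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - pc).T @ (Q - qc)                      # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = qc - R @ pc
    fitted = P @ R.T + t
    rmsd = float(np.sqrt(((fitted - Q) ** 2).sum(axis=1).mean()))
    return R, t, rmsd

# Made-up corresponding residues: the target is the mobile set rotated by
# 90 degrees about z and then shifted
mobile = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
target = mobile @ rot_z.T + np.array([5.0, -2.0, 1.0])
R, t, rmsd = superimpose(mobile, target)
```

In an iterative refinement of the kind described above, the returned transformation would be applied to the whole protein, the set of corresponding residues within dc would be recomputed, and the fit repeated until that set stops changing.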

Complexity Analysis

In data preprocessing, EMPSC clusters the protein into a set of ellipsoids with the DSSP tool and the proposed ellipsoid clustering method. The ellipsoid clustering is very fast (less than 1 second), and its time complexity is O(r), where r is the number of residues in the segments. In the initial alignment finding stage, the time complexity of EMPSC is O(e log e + pn) with the scoring function based on the fast O(n) hash function, where n is the number of residues in the protein, e is the number of segments in the protein, and p is the number of candidate SSE segment pairs; p is much smaller than e. The complexity of the refinement stage is O(Cn²), where C is the number of iterations before the refinement process converges for each initial alignment. In our observations, the number of iterations in EMPSC is usually less than 10. The discussion section examines how the refinement process affects the execution time of EMPSC.

Experiments and Results

Three experiments were carried out to test EMPSC under different conditions of the protein structure comparison problem. In each experiment, EMPSC provides at most 10 alternative solutions; that is, the top-10 initial alignments are selected. The results show the efficiency and effectiveness of EMPSC in comparison with Dali, CE, VAST, ProSup, and FLASH. The results of Dali and ProSup are taken from the original papers. The results of CE and FLASH were gathered in our experiment environment, and they are consistent with the original papers. The computing environment consists of a dual Pentium-4 Xeon 3.06GHz CPU and 2GB of DRAM. None of the tested programs is parallelized.

One-against-all search for structural neighbors

As in previous research, we choose cAMP-dependent protein kinase for the one-against-all search for structural neighbors. To compare with the existing results of Dali, CE, VAST, ProSup, and FLASH, the parameter dc (the distance cutoff for alignment construction) is set to 6Å. For every method, we list the maximal number of corresponding residues and the minimal r.m.s.d. Since the CE and FLASH programs are available, we also list the execution times of CE and FLASH in our computing environment. Judging by the r.m.s.d. and the number of corresponding residues, EMPSC performs as well as the previous methods.

Table 1. An experimental set of structural neighbors of cAMP-dependent protein kinase (1atp:E, 336 residues) identified by different PSC methods. Dali, VAST and ProSup entries are rmsd/#res; CE, FLASH and EMPSC entries are rmsd/#res/sec.

Protein (residues)  Dali     CE              VAST     ProSup   FLASH          EMPSC
2cpk:E(336)         0.4/336  0.37/336/4.9    0.4/334  0.4/336  0.37/336/0.38  0.37/336/6.1
1apm:E(341)         0.3/336  0.33/336/4.94   0.3/334  0.3/336  0.33/336/0.51  0.32/336/6.19
1cdk:A(343)         0.4/336  0.38/336/4.95   0.4/334  0.4/336  0.38/336/0.39  0.38/336/6.36
1ydt:E(334)         0.5/334  0.45/336/4.5    0.5/334  0.5/334  0.45/336/0.31  0.45/334/6.03
1bkx:A(337)         0.8/334  0.76/336/4.61   0.7/314  0.8/336  0.76/336/0.09  0.75/334/6.16
1bx6:_(337)         1.0/334  1.01/336/4.67   1.0/314  1.0/336  1.01/336/0.43  1.01/334/6.09
1stc:E(334)         1.1/334  1.1/336/4.98    1.1/333  1.1/334  1.1/336/0.07   1.09/334/5.25
1cmk:E(350)         2.0/335  2/336/5.82      2.0/331  1.5/316  1.72/330/0.77  1.71/330/6.72
1daw:A(327)         3.1/267  2.77/266/10.99  2.8/259  2.0/239  1.87/250/0.53  1.92/252/5.91
1qmz:C(296)         2.5/259  2.07/252/6.94   2.3/233  1.9/239  1.9/251/0.67   1.96/253/5.38
1day:A(327)         2.7/263  2.61/262/9.96   2.9/262  2.0/239  1.96/252/0.49  1.98/253/5.99
1koa:_(447)         2.8/261  2.7/258/10.81   2.4/225  2.1/233  2.16/249/0.42  2.14/249/8.26
1jnk:_(346)         2.8/253  2.49/194/11.97  3.0/240  2.2/220  2.19/242/0.29  2.23/244/6.21
1gag:A(300)         2.8/265  2.87/267/7.78   2.7/247  2.3/232  2.36/251/0.67  2.46/254/5.37
1bl7:A(351)         3.5/254  3.14/246/8.79   3.1/223  2.3/220  2.39/235/0.54  2.4/236/6.33
1cja:B(327)         4.7/165  4.19/165/10.9   -        2.7/115  2.85/143/0.48  3.01/149/6.36
1e7v:A(850)         4.0/159  4.43/165/43.51  -        2.8/116  3/142/0.75     3.1/155/15.59
1bo1:B(318)         3.9/138  3.9/145/12.25   -        3.0/103  2.98/136/0.16  2.9/135/5.71
1b40:A(517)         3.4/45   5.68/83/21.4    -        2.9/57   3/107/0.67     3.36/105/9.44
1lar:B(533)         2.6/34   5.77/123/23.11  -        3.0/66   3.07/88/0.79   3.21/86/10.14

10 difficult cases

In this experiment, we use a well-known data set, the 10 difficult cases²⁵ reported by Fisher in 1996. Table 2 displays all the structure alignment results for the 10 difficult cases. EMPSC performs worse than the other methods in the case 1ten:_(89) vs. 3hhr:B(195).

Table 2. Comparison of different structure alignment results for the 10 difficult cases (Fisher 1996). Residue counts are given in parentheses. Dali, VAST and ProSup entries are rmsd/#res; CE, FLASH and EMPSC entries are rmsd/#res/sec.

Protein 1    Protein 2    Dali     CE              VAST     ProSup   FLASH          EMPSC
1bge:B(159)  2gmf:A(121)  3.3/94   4.02/102/2.59   2.3/71   2.4/87   -/-/-^a        2.56/95/0.44
1cew:I(108)  1mol:A(94)   2.3/81   2.34/81/2.07    2.0/71   1.9/76   1.92/79/0.07   2.11/81/0.49
1cid:_(177)  2rhe:_(114)  3.1/96   2.97/98/2.4     2.0/78   2.3/84   2.24/94/0.24   2.23/94/1.19
1crl:_(534)  1ede:_(310)  3.6/212  3.91/220/16.29  3.7/186  2.6/161  2.49/191/0.79  2.7/199/9.3
1fxi:A(96)   1ubq:_(76)   2.5/52   2.79/64/1.79    2.1/48   2.6/54   2.47/62/0.03   2.56/63/0.47
1ten:_(89)   3hhr:B(195)  1.9/86   1.9/87/2.14     1.5/76   1.7/85   1.73/86/0.21   2.2/76/1.01
1tie:_(166)  4fgf:_(124)  3.1/114  2.86/115/2.23   1.6/76   2.4/104  2.28/108/0.29  2.44/113/1.15
2sim:_(381)  1nsb:A(390)  3.2/289  2.99/276/9.24   4.2/299  2.6/248  2.61/276/7.8   2.71/282/8.96
2aza:A(129)  1paz:_(120)  3.0/82   2.9/85/1.94     2.1/70   2.6/82   2.34/81/0.1    2.22/82/0.88
3hla:B(99)   2rhe:_(114)  3.0/74   3.46/85/2.49    2.3/58   2.7/71   2.94/79/0.09   2.75/77/0.65

a This result is available in the original FLASH paper, but we could not get any result while running the FLASH program provided by the authors.

Special cases in global alignment – dissimilar protein comparisons

In this experiment, we test dissimilar but comparable proteins. When two proteins are dissimilar, or quite different members of the same family, EMPSC can still obtain better results than the ProSup and FLASH methods. These proteins are selected from the first experiment, because they belong to the same family. Table 3 displays the selected proteins and the results. According to these results, EMPSC performs comparably in both the number of corresponding residues and the r.m.s.d.

The "-" notation in the FLASH column of Table 3 indicates that FLASH did not find any statistically significant solution. Because FLASH uses a statistical method to speed up its choice of candidates, it can obtain very good solutions for similar proteins. For dissimilar proteins, however, the statistical significance evaluation rejects further processing. Generally speaking, only globally similar proteins pass the statistical significance test; even if there is significant local alignment information, FLASH does not process it. Building the global structure comparison from local structure alignment is the advantage of the EMPSC algorithm. EMPSC performs worst in the following three cases: 1e7v:A(850) vs. 1jnk:_(346), 1e7v:A(850) vs. 1day:A(327), and 1bo1:B(318) vs. 1day:A(327).

Table 3. Comparison of dissimilar cases of proteins with ProSup, FLASH, and EMPSC. Residue counts are given in parentheses. ProSup entries are rmsd/#res; CE, FLASH and EMPSC entries are rmsd/#res/sec.

Protein 1    Protein 2    CE              ProSup   FLASH          EMPSC
1cja:B(327)  1daw:A(327)  3.73/157/8.15   3.0/131  -/-/-^a        3.02/153/5.83
1cja:B(327)  1qmz:C(296)  3.58/153/7.28   3.0/133  2.81/150/0.39  2.83/151/5.26
1cja:B(327)  1day:A(327)  3.79/157/8.14   3.1/131  -/-/-          3.04/152/5.71
1cja:B(327)  1koa:_(447)  4.6/173/15.81   3.0/130  2.92/150/0.8   3.14/160/7.9
1cja:B(327)  1jnk:_(346)  4.24/162/12.93  2.9/133  3.09/140/0.83  3.07/157/6.22
1e7v:A(850)  1daw:A(327)  3.59/135/28.94  3.1/110  3.19/147/1.93  2.99/144/15.02
1e7v:A(850)  1qmz:C(296)  4.34/156/45.17  2.8/119  2.78/139/1.5   3.15/147/13.61
1e7v:A(850)  1day:A(327)  4.25/148/42.39  3.0/110  -/-/-          3.48/97/14.92
1e7v:A(850)  1koa:_(447)  4.31/163/77.09  3.2/115  2.99/146/1.1   3.17/144/20.84
1e7v:A(850)  1jnk:_(346)  3.91/143/58.76  3.0/120  2.89/152/1.31  3.23/88/15.93
1bo1:B(318)  1daw:A(327)  3.75/144/9.01   3.1/121  3.13/129/0.39  2.86/136/5.58
1bo1:B(318)  1qmz:C(296)  3.6/145/8.55    2.8/123  2.72/133/0.28  2.71/134/4.99
1bo1:B(318)  1day:A(327)  4.04/151/10.05  2.9/123  2.85/139/0.39  3.4/100/5.52
1bo1:B(318)  1koa:_(447)  4.03/146/16.97  2.8/118  -/-/-          2.81/130/7.77
1bo1:B(318)  1jnk:_(346)  3.96/151/16.28  2.8/114  -/-/-          3.21/139/5.9

a "-" indicates that FLASH does not provide any solution.

Discussion

Efficiency and Number of Alternative Solutions

Although the previous section lists the execution time of every comparison in Table 1, Table 2 and Table 3, it is hard to see the performance relationship between CE, FLASH and EMPSC from the tables alone. Therefore, we summed the residue numbers of the two compared proteins and plotted execution time against total residues, as shown in Fig 6. The figure shows that EMPSC is clearly faster than CE, especially for large protein structure comparisons. However, EMPSC is slower than FLASH.

[Plot: execution time (seconds, 0 to 90) versus total residues of compared proteins (0 to 1600) for CE, EMPSC and FLASH, with a trend line for each method.]

Fig 6. The execution time of CE, FLASH and EMPSC, given different total residues of compared proteins. The trend line for each method is a polynomial regression of order 2 provided by the MS Excel Trend function.

To find out whether EMPSC can be sped up further, we ran more experiments with different numbers of alternative solutions. We repeated the experiments of the previous section with Top-3 and Top-5 alternative solutions and compared them with the Top-10 results and with FLASH. The detailed results are listed in Table 4 in the appendix, and Fig 7 plots the execution time for the different numbers of alternative solutions. The plot shows that the execution time of EMPSC grows in proportion to the number of alternative solutions. After profiling our EMPSC program, we found that EMPSC spends most of its execution time in the alignment refining process. We also found that when FLASH provides alternative solutions, it spends about the same execution time as EMPSC; two data points of FLASH in Fig 7 show such cases. This conclusion applies to any PSC algorithm that claims to be fast but provides only one solution (like FAST¹⁵), except for hash-based alignment refining algorithms.

According to our observations, EMPSC is a good choice for solving protein structure comparison problems. In addition, we conclude that further enhancement of PSC algorithms should focus on the alignment refining process.

[Plot: execution time (seconds, 0 to 35) versus total residues of compared proteins (0 to 1600) for FLASH and for EMPSC with Top-3, Top-5 and Top-10 alternative solutions, with a trend line for each series.]

Fig 7. The execution time of FLASH and EMPSC with Top-3, Top-5 and Top-10 alternative solutions. The trend line for each series is a polynomial regression of order 2 provided by the MS Excel Trend function.

Characteristic of EMPSC Algorithm

The proposed EMPSC algorithm has three major features, which we believe make EMPSC a good choice among PSC algorithms. First, the ellipsoidal representation provides a good summary of the 3D information of a residue segment. In particular, two ellipsoidal models can easily be mapped to each other via translations and rotations of their coordinate systems. As noted in the introduction, representing an α-helix by a single vector is appropriate, because the α-helix is hard to bend, whereas a vector representing a β-sheet drops some structural information: β-sheets are usually bent or curved, so the length of the identified β-sheet seriously affects how well a single vector describes it. With the ellipsoidal model, EMPSC can effectively abstract a curved β-sheet, because the 3 orthogonal eigenvectors of the ellipsoid keep more information about the spatial distribution of the residues. This is an advantage of EMPSC over the vector representation in the FLASH algorithm. In addition, the ellipsoidal model supports not only the abstraction of α-helix and β-sheet structures, but can also represent loop or coil structures. Since the experiments above show that EMPSC is a good PSC solution in comparison with previous algorithms, we are convinced that the ellipsoidal representation abstracts 3D structure information at least as well as the others (SAP's residue pair, CE's AFP, ProSup's seed, and even FLASH's SSE).

Second, EMPSC provide a platform that can plug in different filters for different purposes. Via the different combinations of filters, EMPSC can filter the candidate mapping segment pairs according to profession people’s requirement. In our current experiments results, the combination of type filter, mass filter and biochemical filter can get a good accuracy and efficiency in most cases. In addition, we also found that biochemical filter is especially effective for comparing similar proteins of the same family. Besides of the three filters, we also tried the eigenvector filter which makes sure the eigenvector^26,27 of the mapping segments are similar. Unfortunately the eigenvector filter doesn’t show any further improvement, so we didn’t use it in current EMPSC algorithm.

Third, like traditional PSC algorithms, EMPSC is good at global structure comparison of similar proteins, and it also provides useful information in local structure comparison of dissimilar proteins in the same family. That is, local alignment is viable with the EMPSC algorithm. Fig 8 and Fig 9 show that even when the global alignment indicates dissimilarity, two proteins may still share common local structures of biological significance. Detecting similar local structures in proteins is as important as searching for similar global structures in traditional PSC problems.

To view the results of the EMPSC algorithm, we also developed a tool that outputs the comparison results in molscript²⁸ format, and we now provide it as a web service. Fig 8 and Fig 9, mentioned above, are sample outputs. Instead of superimposing the structures of the two proteins, we draw the results in vertically tiled windows. In these pictures, the yellow part indicates the SSE chosen as the aligned center, and the red part indicates the corresponding residues in each compared protein.


(a) 1daw:A (b) 1cja:B

Fig 8. The structure comparison results of protein 1daw:A and protein 1cja:B with the EMPSC algorithm (the distance between aligned residues ≤ 6 Å). The yellow part means that the SSE is chosen as an aligned center. The red part means the corresponding residues in each protein.

(a) 1qmz:C (b) 1bo1:B

Fig 9. The structure comparison results of protein 1qmz:C and protein 1bo1:B with the EMPSC algorithm (the distance between aligned residues ≤ 6 Å). The yellow part means that the SSE is chosen as an aligned center. The red part means the corresponding residues in each protein.

Further Development

In the future, we will further revise the EMPSC algorithm in the following aspects. First, we will investigate whether the amino acid types of the residues are helpful for EMPSC. Second, SSE-based PSC algorithms, like FLASH and EMPSC, have to rely on SSE identification tools¹⁵. When the compared proteins share similar global structures but have dramatically different SSE identifications, as in Fig 10, it is hard for SSE-based algorithms to find a good alignment center. A possible solution is to segment the longer SSEs or connect some shorter SSEs. This can be done by modifying the SSE identification tools, like DSSP, without modifying the original SSE-based PSC algorithms. Third, EMPSC has potential for local structure comparison, but the optimization of local structure alignment results should be further investigated. Building on the above, we will further enhance the global and local alignment ability of EMPSC in order to develop multiple protein structure alignment in the near future.
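The proposed remedy of connecting shorter SSEs and segmenting longer ones can be sketched as a post-processing step on a per-residue secondary structure string. This is a rough illustration, not a modification of DSSP itself: it assumes a simplified three-state string ('H', 'E', 'C'), and the function names and the `max_gap` and `max_len` thresholds are hypothetical values chosen for the example.

```python
from itertools import groupby

def runs(ss):
    """Run-length encode a string into (label, start, end) with end exclusive."""
    out, pos = [], 0
    for label, group in groupby(ss):
        n = len(list(group))
        out.append((label, pos, pos + n))
        pos += n
    return out

def connect_short_gaps(segs, max_gap=2, coil="C"):
    """Merge two SSEs of the same type separated by a coil of <= max_gap residues."""
    merged, i = [], 0
    while i < len(segs):
        label, start, end = segs[i]
        while (label != coil and i + 2 < len(segs)
               and segs[i + 1][0] == coil
               and segs[i + 1][2] - segs[i + 1][1] <= max_gap
               and segs[i + 2][0] == label):
            end = segs[i + 2][2]   # absorb the short gap and the next SSE
            i += 2
        merged.append((label, start, end))
        i += 1
    return merged

def split_long(segs, max_len=10, coil="C"):
    """Cut any SSE longer than max_len into near-equal pieces."""
    out = []
    for label, start, end in segs:
        length = end - start
        if label == coil or length <= max_len:
            out.append((label, start, end))
            continue
        pieces = -(-length // max_len)   # ceiling division
        size = -(-length // pieces)
        for s in range(start, end, size):
            out.append((label, s, min(s + size, end)))
    return out

# Toy assignment: two strands split by a one-residue gap, then a 20-residue helix.
ss = "CC" + "EEEE" + "C" + "EEEE" + "CC" + "H" * 20 + "CC"
normalized = split_long(connect_short_gaps(runs(ss)))
for label, start, end in normalized:
    print(label, start, end)
```

Applying the same normalization to both proteins before alignment is one way two differently annotated structures, such as those in Fig 10, might end up with more comparable segment lists.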

Fig 10. The proteins with similar global structures but dramatically different SSE identifications – protein 1apt (left) and protein 1bxo (right).

References

1. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983;22(12):2577-2637.

2. Brändén C-I, Tooze J. Introduction to protein structure. New York: Garland Pub.; 1999. xiv, 410 p.

3. Brown NP, Orengo CA, Taylor WR. A protein structure comparison methodology. Comput Chem 1996;20(3):359-380.

4. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin Struct Biol 1996;6(3):377-385.

5. Holm L, Sander C. Mapping the protein universe. Science 1996;273(5275):595-603.

6. Koehl P. Protein structure similarities. Curr Opin Struct Biol 2001;11(3):348-353.

7. Orengo C. Classification of protein folds. Curr Opin Struct Biol 1994;4(3):429-440.

8. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res 2000;28(1):235-242.

9. Holm L, Sander C. Dali: a network tool for protein structure comparison. Trends Biochem Sci 1995;20(11):478-480.

10. Gerstein M, Levitt M. Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. Proc Int Conf Intell Syst Mol Biol 1996;4:59-67.

11. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48(3):443-453.

12. Subbiah S, Laurents DV, Levitt M. Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 1993;3(3):141-148.

13. Madej T, Gibrat JF, Bryant SH. Threading a database of protein cores. Proteins 1995;23(3):356-369.

14. Vriend G, Sander C. Detection of common three-dimensional substructures in proteins. Proteins 1991;11(1):52-58.

15. Zhu J, Weng Z. FAST: a novel protein structure alignment algorithm. Proteins 2005;58(3):618-627.

16. Can T, Wang YF. CTSS: A Robust and Efficient Method for Protein Structure Alignment Based on Local Geometrical and Biological Features. Proc IEEE Comput Soc Bioinform Conf 2003;2:169-179.

17. Chang P-K, Chen C-C, Ouhyoung M. A Tool for Structure Alignment of Molecules. IEEE Sixth International Symposium on Multimedia Software Engineering - Special Session on Bioinformatics 2004:354-361.

18. Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998;11(9):739-747.

19. Taylor WR. Protein structure comparison using iterated double dynamic programming. Protein Sci 1999;8(3):654-665.

20. Lackner P, Koppensteiner WA, Domingues FS, Sippl MJ. Automated large scale evaluation of protein structure predictions. Proteins 1999;Suppl 3:7-14.

21. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: a refined tool for protein structure alignment. Protein Eng 2000;13(11):745-752.

22. Shih ES, Hwang MJ. Protein structure comparison by probability-based matching of secondary structure elements. Bioinformatics 2003;19(6):735-741.

23. Lesk AM. Protein architecture: a practical approach. Oxford, England; New York: IRL Press; 1991. xiv, 287 p.

24. Zhang Z. Iterative point matching for registration of free-form curves and surfaces. Int J Comput Vision 1994;13(2):119-152.

25. Fischer D, Elofsson A, Rice D, Eisenberg D. Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac Symp Biocomput 1996:300-318.

26. Bezdek JC, Pal NR, Keller J, Krishnapuram R. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing: Kluwer Academic Publishers; 1999. 792 p.

27. Frigui H, Krishnapuram R. A Robust Competitive Clustering Algorithm With Applications in Computer Vision. IEEE Trans Pattern Anal Mach Intell 1999;21(5):450-465.

Appendix

Table 4. The results of repeated experiments using EMPSC with Top-3, Top-5, Top-10 alternative solutions respectively.

Protein 1    Protein 2    EMPSC Top-3    EMPSC Top-5    EMPSC Top-10
(residues)   (residues)   rmsd/#res/sec  rmsd/#res/sec  rmsd/#res/sec

Structural neighbors of cAMP-dependent protein kinase

1atp:E(336)  2cpk:E(336)  0.37/336/2.12  0.37/336/3.2   0.37/336/6.1
1atp:E(336)  1apm:E(341)  0.32/336/2.12  0.32/336/3.2   0.32/336/6.19
1atp:E(336)  1cdk:A(343)  0.38/336/2.15  0.38/336/3.3   0.38/336/6.36
1atp:E(336)  1ydt:E(334)  0.45/334/2.22  0.45/334/3.2   0.45/334/6.03
1atp:E(336)  1bkx:A(337)  0.75/334/2.31  0.75/334/3.4   0.75/334/6.16
1atp:E(336)  1bx6:_(337)  1.01/334/2.11  1.01/334/3.2   1.01/334/6.09
1atp:E(336)  1stc:E(334)  1.09/334/2.04  1.09/334/3.2   1.09/334/5.25
1atp:E(336)  1cmk:E(350)  1.71/330/2.13  1.71/330/3.5   1.71/330/6.72
1atp:E(336)  1daw:A(327)  1.92/252/2.08  1.92/252/3.1   1.92/252/5.91
1atp:E(336)  1qmz:C(296)  1.96/253/1.85  1.96/253/2.9   1.96/253/5.38
1atp:E(336)  1day:A(327)  1.98/253/2.06  1.98/253/3.2   1.98/253/5.99
1atp:E(336)  1koa:_(447)  2.14/249/2.92  2.14/249/4.3   2.14/249/8.26
1atp:E(336)  1jnk:_(346)  2.23/244/2.13  2.23/244/3.3   2.23/244/6.21
1atp:E(336)  1gag:A(300)  2.46/254/1.82  2.46/254/3     2.46/254/5.37
1atp:E(336)  1bl7:A(351)  2.4/236/2.16   2.4/236/3.4    2.4/236/6.33
1atp:E(336)  1cja:B(327)  3.01/149/2.01  3.01/149/3.1   3.01/149/6.36
1atp:E(336)  1e7v:A(850)  3.2/70/5.39    3.1/155/8.2    3.1/155/15.59
1atp:E(336)  1bo1:B(318)  3.4/69/2.05    3.6/92/3       2.9/135/5.71
1atp:E(336)  1b40:A(517)  3.36/105/3.29  3.36/105/5.3   3.36/105/9.44
1atp:E(336)  1lar:B(533)  3.21/86/2.91   3.21/86/5      3.21/86/10.14

10 difficult cases (Fischer 1996)

1bge:B(159)  2gmf:A(121)  2.56/95/0.33   2.56/95/0.3    2.56/95/0.44
1cew:I(108)  1mol:A(94)   2.11/81/0.19   2.11/81/0.3    2.11/81/0.49
1cid:_(177)  2rhe:_(114)  2.16/93/0.4    2.23/94/0.6    2.23/94/1.19
1crl:_(534)  1ede:_(310)  2.7/199/3.42   2.7/199/5      2.7/199/9.3
1fxi:A(96)   1ubq:_(76)   2.56/63/0.15   2.56/63/0.2    2.56/63/0.47
1ten:_(89)   3hhr:B(195)  2.2/76/0.35    2.2/76/0.5     2.2/76/1.01
1tie:_(166)  4fgf:_(124)  3.28/61/0.41   3.28/61/0.6    2.44/113/1.15
2sim:_(381)  1nsb:A(390)  2.67/280/3.68  2.67/280/5.2   2.71/282/8.96
2aza:A(129)  1paz:_(120)  2.25/82/0.3    2.25/82/0.5    2.22/82/0.88
3hla:B(99)   2rhe:_(114)  2.87/43/0.23   2.8/79/0.4     2.75/77/0.65

Protein family comparisons

1cja:B(327)  1daw:A(327)  3.2/72/1.92    2.9/152/3      3.02/153/5.83
1cja:B(327)  1qmz:C(296)  2.83/151/1.94  2.83/151/2.8   2.83/151/5.26
1cja:B(327)  1day:A(327)  2.92/145/1.97  3.04/152/3     3.04/152/5.71
1cja:B(327)  1koa:_(447)  3.14/160/2.7   3.14/160/4.2   3.14/160/7.9
1cja:B(327)  1jnk:_(346)  3.07/157/2.23  3.07/157/3.2   3.07/157/6.22
1e7v:A(850)  1daw:A(327)  3.4/77/5.27    3.52/83/8      2.99/144/15.02
1e7v:A(850)  1qmz:C(296)  3.2/93/4.73    3.2/93/7.2     3.15/147/13.61
1e7v:A(850)  1day:A(327)  3.5/79/5.27    3.5/79/8       3.48/97/14.92
1e7v:A(850)  1koa:_(447)  3.67/79/7.48   3.67/79/11.3   3.17/144/20.84
1e7v:A(850)  1jnk:_(346)  3.23/88/5.63   3.23/88/8.5    3.23/88/15.93
1bo1:B(318)  1daw:A(327)  3.6/97/1.99    3.6/97/3.4     2.86/136/5.58
1bo1:B(318)  1qmz:C(296)  2.71/134/1.85  2.71/134/2.6   2.71/134/4.99
1bo1:B(318)  1day:A(327)  3.4/100/1.99   3.4/100/2.9    3.4/100/5.52
1bo1:B(318)  1koa:_(447)  3.42/66/2.6    3.42/66/4.4    2.81/130/7.77
1bo1:B(318)  1jnk:_(346)  3.2/138/2.08   3.2/138/3.1    3.21/139/5.9
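
Each entry in Table 4 reports rmsd, the number of aligned residues, and the running time in seconds. As a minimal sketch of how the first two numbers could be computed from an already superimposed pair of structures, the code below counts the residue pairs within a distance cutoff and takes the root-mean-square of their distances. The 6 Å default follows the captions of Fig 8 and Fig 9; whether the table uses the same cutoff is an assumption, and the function name and input layout are hypothetical.

```python
import math

def aligned_stats(coords_a, coords_b, pairs, cutoff=6.0):
    """Return (rmsd, count) over residue pairs no farther apart than cutoff.

    coords_a, coords_b: lists of (x, y, z) coordinates after superposition.
    pairs: list of (i, j) index pairs proposed by an alignment.
    """
    kept = []
    for i, j in pairs:
        d = math.dist(coords_a[i], coords_b[j])
        if d <= cutoff:
            kept.append(d)
    if not kept:
        return 0.0, 0
    rmsd = math.sqrt(sum(d * d for d in kept) / len(kept))
    return rmsd, len(kept)

# Toy coordinates (assumed): the third pair is 10 Å apart and is rejected.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0), (2.0, 0.0, 10.0)]
rmsd, n_res = aligned_stats(a, b, [(0, 0), (1, 1), (2, 2)])
print(f"{rmsd:.2f}/{n_res}")
```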