
(submitted to APPLICATIONS NOTE in Bioinformatics)

Title:

A Web-based Three-dimensional Protein Retrieval System by Matching Visual Similarity

Authors:

Jeng-Sheng Yeh1*, Ding-Yun Chen1, Bing-Yu Chen2 and Ming Ouhyoung1,3

Email: {jsyeh, dynamic, robin, ming}@cmlab.csie.ntu.edu.tw

1Department of Computer Science and Information Engineering

2Department of Information Management

3Graduate Institute of Network and Multimedia

National Taiwan University

Taipei 106 Taiwan

Running Head:

(less than 50 characters) 3D Protein Retrieval: a Visual-based Approach

Word Counts:

(less than 1000 words plus one figure) 990

Summary: A web-based three-dimensional (3D) protein retrieval system is available for protein structure data, covering the entire PDB and the FSSP dataset.

In this system, we use a visual-based matching method to compare protein structures from multiple viewpoints. Each query takes less than three seconds, with an average accuracy of 90%.

Availability: The web-based query interface and downloadable files can be accessed via

http://3d.csie.ntu.edu.tw/ProteinRetrieval/

Contact: [email protected]

Supplementary information: Further details of the proposed method are available at

http://graphics.csie.ntu.edu.tw/~jsyeh/3Dprotein/

INTRODUCTION

The Protein Data Bank (PDB) (Berman et al., 2000) now holds more than 25,000 protein structure files, with about one hundred more added per week. The need for protein structure retrieval is therefore increasing.

We therefore propose a visual-based method that finds similar protein structures automatically and can also provide clues for protein classification.

Several algorithms and servers have been proposed to analyse the protein structures in PDB and to help predict protein function, because the shape of a protein may determine its function. The following tools are mainly based on alignment of primary structure (1D sequence data), secondary structure (helix/sheet) and/or 3D atom coordinates. For instance, EMBL SSM (Krissinel & Henrick, 2003) uses a graph-matching algorithm to map secondary structure elements as a starting point for iteratively aligning atoms. For comparing 3D protein structures, the Dali/FSSP database (Holm & Sander, 1998) was developed from an exhaustive 3D comparison of the protein structures currently in PDB. Several image processing-based methods have also been proposed for protein structure comparison (Sandak et al., 1995; Chi et al., 2004). The shape histogram (Ankerst et al., 1999) is used to compare the 3D structure of protein surfaces. Here, however, we provide an alternative tool based on views rather than on atom positions alone.

In this paper, we present a visual-based protein retrieval system, which is available on the Internet through a web interface. Because the visual-based matching method follows human perception, the retrieval results can be used and manipulated more intuitively and quickly. Biologists receive ranked results for a given query. The design of the user interface is described as

displayed in terms of visual similarity ranking. Users can also pick one of the results for a further query by clicking on it. To query with an unpublished protein structure, users can upload the structure file in PDB format. The server then calculates the necessary 3D features and runs the query.

For output, users can choose how the results are displayed. One configuration displays images of all retrieved proteins in order of similarity. Another displays the results with metadata from the PDB files, including protein name, EC number and SwissProt ID. The output also links to other online databases, such as OCA (Prilusky, 2004) and PDBsum (Laskowski, 2001).

METHODS

The proposed method uses LightField descriptors (Chen et al., 2003) to match 3D protein structures by visual similarity. The core idea of this multi-view projection method is to compare 3D objects through multiple projection views. The retrieval process is divided into off-line feature extraction and on-line protein retrieval.

In off-line feature extraction, the projection views are first pre-rendered from the solvent-accessible surface of the protein, which is computed by Connolly’s msp package (Connolly, 1983). Then 2D shape descriptors, namely Zernike moment descriptors and Fourier descriptors, are extracted as features for each projection view. In our system, 100 projection views are rendered around the centre of the 3D structure for visual-based matching.
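As a rough illustration of one kind of per-view shape feature, the sketch below computes a Fourier descriptor from a closed 2D silhouette contour. It is a minimal sketch under stated assumptions, not the system's implementation: the function name `fourier_descriptor`, the use of a centroid-distance signature, and the default number of coefficients are all hypothetical, and Zernike moments are not covered.

```python
import cmath
import math


def fourier_descriptor(contour, num_coeffs=10):
    """Return scale-normalised DFT magnitudes of the centroid-distance signature.

    contour: list of (x, y) points sampled along a closed silhouette boundary.
    num_coeffs: hypothetical number of low-frequency coefficients to keep.
    """
    n = len(contour)
    cx = sum(x for x, _ in contour) / n
    cy = sum(y for _, y in contour) / n
    # Centroid-distance signature: translation invariant by construction.
    signature = [math.hypot(x - cx, y - cy) for x, y in contour]
    # Plain O(n * num_coeffs) DFT; taking magnitudes removes the dependence
    # on which contour point is listed first.
    mags = []
    for k in range(num_coeffs + 1):
        coeff = sum(r * cmath.exp(-2j * math.pi * k * t / n)
                    for t, r in enumerate(signature))
        mags.append(abs(coeff))
    dc = mags[0]
    # Dividing by the DC term makes the descriptor scale invariant.
    return [m / dc for m in mags[1:]]
```

In this sketch the coefficients are ordered from low to high frequency, so a coarse comparison can use only the first few values and a finer comparison can use more.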

In on-line protein retrieval, the dissimilarity value of two proteins is calculated as the sum of the distances between the descriptors of corresponding views. In addition, to accelerate matching in such a large database, we use an iterative algorithm with early rejection of non-relevant models. The lower-frequency parts of the Zernike moment descriptors and Fourier descriptors are matched in the initial stage, and higher-frequency parts are added in each subsequent stage to refine the top-ranked results. By rejecting models stage by stage, a query over the whole database (more than 25,000 proteins) finishes in less than 3 seconds on a Pentium 4 2.4 GHz PC. Figure (1a) shows a typical example of protein retrieval in the proposed web-based system.
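The staged matching described above can be sketched roughly as follows. This is an illustrative sketch only: the L1 distance, the stage schedule `stages`, the `keep_fraction` parameter, and all function names are assumptions rather than details taken from the system, and the views of the two proteins are assumed to be already in correspondence.

```python
def view_distance(a, b, num_coeffs):
    """L1 distance over the first num_coeffs coefficients of two view descriptors."""
    return sum(abs(x - y) for x, y in zip(a[:num_coeffs], b[:num_coeffs]))


def dissimilarity(query, model, num_coeffs):
    """Sum of per-view distances; views are assumed to be in correspondence."""
    return sum(view_distance(q, m, num_coeffs) for q, m in zip(query, model))


def staged_retrieval(query, database, stages=(2, 4, 8), keep_fraction=0.5):
    """Rank database entries, discarding the worst candidates after each stage.

    database: dict mapping a protein ID to a list of per-view descriptors,
              each ordered from low-frequency to high-frequency coefficients.
    stages: hypothetical schedule of how many coefficients to compare per stage.
    keep_fraction: hypothetical share of candidates kept after each early stage.
    """
    candidates = list(database)
    for i, num_coeffs in enumerate(stages):
        scored = sorted(
            (dissimilarity(query, database[pid], num_coeffs), pid)
            for pid in candidates
        )
        # Early rejection: only the best-ranked part survives to the next stage.
        if i < len(stages) - 1:
            keep = max(1, int(len(scored) * keep_fraction))
            scored = scored[:keep]
        candidates = [pid for _, pid in scored]
    return candidates
```

Because early stages compare only a few low-frequency coefficients, most of the database is discarded cheaply, and the full descriptors are compared only for the surviving candidates.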

DISCUSSION

(Holm & Sander, 1998), are analysed and classified. Classes that contain only one molecule are skipped. Figure (1b) is the similarity matrix, which shows that proteins with the same FSSP class name cluster together. The box in the upper-right corner is an enlargement of the small box in the centre. The similarity value is the inverse of the dissimilarity value, which is the sum of the distances over all corresponding views. Figure (1c), calculated by psbPlot (Shilane et al., 2004), is the precision-recall plot of these 4,997 proteins.

We issue a query for each of the 4,997 proteins and, treating the other proteins with the same FSSP class name as the relevant results, plot precision against recall. The plot suggests that our visual-based matching method may provide useful clues to help biochemists retrieve and analyse 3D protein structures.
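To make the evaluation concrete, the following sketch computes precision-recall points for a single query from its ranked result list. It only illustrates the standard definition and is not the psbPlot tool; the function name and the input layout are assumptions.

```python
def precision_recall(ranked_labels, query_label):
    """Return (recall, precision) pairs at each rank where a relevant item appears.

    ranked_labels: class labels of the retrieved items, best match first,
                   with the query itself excluded.
    query_label: class label of the query protein.
    """
    total_relevant = sum(1 for label in ranked_labels if label == query_label)
    if total_relevant == 0:
        return []
    points = []
    hits = 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            # Recall: share of relevant items found so far.
            # Precision: share of retrieved items so far that are relevant.
            points.append((hits / total_relevant, hits / rank))
    return points
```

Averaging such curves over all queries gives one overall precision-recall plot for the dataset.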

The nearest-neighbour classification accuracy of our method is 92.8% (4,997 proteins in 362 classes; all 25,591 proteins were also tested, with similar results). This is comparable to the 91.6% reported for the shape histogram method (Ankerst et al., 1999) on a previous version of the FSSP dataset (3,422 proteins in 281 classes).
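The nearest-neighbour accuracy figure can be understood through the sketch below, which assumes a leave-one-out protocol over a precomputed square dissimilarity matrix. Both the protocol and the function name are assumptions made for illustration; the text does not spell out the exact evaluation procedure.

```python
def nearest_neighbour_accuracy(dissim, labels):
    """Leave-one-out nearest-neighbour accuracy from a square dissimilarity matrix.

    dissim: dissim[i][j] is the dissimilarity between items i and j.
    labels: labels[i] is the class name of item i.
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        # Nearest other item; the query itself is excluded.
        nearest = min((k for k in range(n) if k != i), key=lambda k: dissim[i][k])
        if labels[nearest] == labels[i]:
            correct += 1
    return correct / n
```

Under this protocol a class with a single member can never be classified correctly, which is consistent with skipping such classes in the evaluation.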

A full set of 25,120 proteins is available at http://3d.csie.ntu.edu.tw/ProteinRetrieval/. Note that DNA files in PDB are not included. As for web server statistics, there were 1,177 accesses between the first prototype (2,051 proteins) of June 16, 2003 and the current release (June 16, 2004). The system has now been extended to 25,120 proteins and is synchronized with RCSB PDB weekly.

ACKNOWLEDGEMENTS

This project is supported in part by CIET-NTU (MOE), NSC93-2622-E-002-033, NSC93-2752-E-002-007-PAE, NSC93-2213-E-002-083, and NSC93-2213-E-002-084.

REFERENCES

Ankerst, M., Kastenmuller, G., Kriegel, H.-P. & Seidl, T. (1999) Nearest Neighbor Classification in 3D Protein Databases. Proc. 7th Int. Conf. Intelligent Systems for Molecular Biology (ISMB’99), 34-43.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000) The Protein Data Bank. Nucleic Acids Research, 28, 235-42. http://www.pdb.org/.

Chen, D.-Y., Tian, X.-P., Shen, Y.-T. & Ouhyoung, M. (2003) On visual similarity based 3D model retrieval. Computer Graphics Forum, 22 (3), 223-33. (Proc. Eurographics 2003).

Chi, P.-H., Scott, G. & Shyu, C.-R. (2004) A fast protein structure retrieval system using image-based distance matrices and multidimensional index. BIBE’04.

Connolly, M. L. (1983) Solvent-accessible surfaces of proteins and nucleic acids. Science, 221 (4612), 709-13.

Holm, L. & Sander, C. (1998) Touring protein fold space with Dali/FSSP. Nucleic Acids Research, 26 (1), 316-9.

Krissinel, E. & Henrick, K. (2003) Protein structure comparison in 3D based on secondary structure matching (SSM) followed by Cα alignment, scored by a new structural similarity function. Proc. 5th Int. Conf. Molecular Structure Biology, 88.

Laskowski, R. A. (2001) PDBsum: summaries and analyses of PDB structures. Nucleic Acids Research, 29 (1), 221-2.

Prilusky, J. (2004) OCA, a browser-database for structure/function. http://bip.weizmann.ac.il/oca-bin/ocamain/.

Sandak, B., Nussinov, R. & Wolfson, H. J. (1995) An automated computer vision and robotics-based technique for 3-D flexible biomolecular docking and matching. CABIOS, 11 (1), 87-99.

Shilane, P., Min, P. & Funkhouser, T. (2004) The Princeton Shape Benchmark. Proc. Shape Modeling International (SMI’04).

Figure 1. (a) The query result of our server after submitting the shape of a query protein (4dfr: dihydrofolate reductase). (b) The resulting similarity matrix (4,997 x 4,997), in which the intensity of pixel (x, y) shows the dissimilarity value between protein x and protein y, i.e., a darker pixel (x, y) means that protein x and protein y are more similar. The box in the upper-right part is an enlargement of the small box in the centre. (c) The precision-recall plot: for each recall rate (x-axis, 0 to 100%), the precision value (y-axis, 0 to 100%) of correct classification is plotted.

For comparison purposes, we chose 4,997 proteins and retrieved similar shapes to see whether proteins with the same FSSP class name are retrieved. Please visit the supplementary web page for further details.
