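3D Model Retrieval and Its Application to Molecular 3D Structure Alignment and Functional Classification (1/2)

Project type: Individual project
Project number: NSC93-2213-E-002-083
Execution period: 1 August 2004 to 31 July 2005
Institution: Department of Computer Science and Information Engineering, National Taiwan University
Principal investigator: Ming Ouhyoung
Report type: Condensed report
Availability: This project report is open to public query

30 May 2005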

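National Science Council, Executive Yuan: Interim Progress Report of a Funded Research Project

Project number: NSC93-2213-E-002-083
Full project period: 1 August 2004 to 31 July 2006
Current-year period: 1 August 2004 to 31 July 2005
Principal investigator: Ming Ouhyoung, Professor, Department of Computer Science and Information Engineering, National Taiwan University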

I. Progress

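The work items planned for the first year were:

a. Complete a matching system for 3D data models that can take 3D molecular data as input and compare it directly, and attempt partial matching for general 3D models.
b. Complete a visual alignment and analysis system, and demonstrate partial results on a subset of protein fragment data.

So far we have implemented everything on schedule. The system has been placed on the web so that others can use it for molecular 3D structure alignment and functional classification. Individual introductions and tool downloads are available at http://www.cmlab.csie.ntu.edu.tw/~jsyeh/3Dprotein/ and http://www.cmlab.csie.ntu.edu.tw/~zick/bio/index.html. The online matching system is provided as a web tool (http://3d.csie.ntu.edu.tw/ProteinRetrieval) for users worldwide, in the same way as our earlier 3D search engine (http://3d.csie.ntu.edu.tw). The novel parts of the work have been published as international journal and conference papers.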

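Part of this work has been accepted by the journal Bioinformatics. Another part was published at the international conference IEEE Sixth International Symposium on Multimedia Software Engineering (IEEE-MSE2004), Special Session on Bioinformatics, FL, USA. The two papers are listed below, and their contents are appended to this report. It is worth noting that ours is the first implementation that can compare the 3D structures of DNA/RNA and proteins together.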

1. Jeng-Sheng Yeh, Ding-Yun Chen, Bing-Yu Chen and Ming Ouhyoung, A Web-based Three-dimensional Protein Retrieval System by Matching Visual Similarity, accepted to appear in Bioinformatics (Application Note)

2. Pei-Ken Chang, Chien-Cheng Chen and Ming Ouhyoung, A Tool for Structure Alignment of Molecules. Proc. of IEEE Sixth International Symposium on Multimedia Software Engineering (IEEE-MSE2004) Special Session on Bioinformatics, pp. 354-361, Miami, USA, Dec. 2004. (ISBN 0-7695-2217-3)

The work of Professor M. J. Hwang of Academia Sinica provides not only the best alignment but also all suboptimal alignments, and it is very fast. See [Shih and Hwang 2003, Shih and Hwang 2004].

II. 3D Model Construction Method

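Our approach has two main threads: a matching system for 3D data models, and the prediction of protein function by comparing 3D structural data.

Matching of 3D data models. Besides whole-model matching, the partial-matching method that has attracted the most attention is Spin-Images [Johnson97], followed by graph matching [Siddiqi99]. Geometric Hashing [Lamdan88] can also be applied to 3D molecular alignment.

The following papers combine the methods above with one-dimensional molecular sequence alignment [Kim03, Shindyalov98, Blankenbecler93, Milik03, Honig03, Zemla03]. However, these methods are limited to using the 1D amino acid sequence to assist the 3D comparison of protein structures. Problems arise when the structures to be compared are not only proteins but also the 3D structures of tRNA and DNA, or complexes of proteins and nucleic acids. In this project we propose a method that uses geometric information, with which the possible functions and active sites of a molecule can be predicted.

Our proposed solution to partial matching (for general molecules) consists of three steps. The first step uses Geometric Hashing to find a global alignment. The second step takes the matched pairs from that global alignment as input and optimises them with Iterative Closest Point (ICP). This local optimization adds more matched atoms (those within a distance threshold). The second step is iterated until no further matched atoms can be added.

The third step addresses the so-called buried surface problem: the matched atoms may well lie beneath the molecular surface, where they cannot bind to other ligands or proteins. The third step therefore computes the so-called minimal binding surface, which lies on the Solvent Accessible Surface. If this area is too small, the molecule cannot interact or bind with other molecules even if it has the 3D structure of the binding-site residues. Of the three steps, the first has been published by others; the second and third are novel.

Proteins play an important role in many reactions inside living organisms, and they are built from amino acids linked one after another. There are twenty kinds of amino acids in nature. Their properties differ, but their basic construction is the same. As shown in Figure 1, the centre is a carbon atom, to which a carboxyl group (-COOH), an amino group (-NH2) and a side chain are attached.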

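Figure 1.

When two amino acids join, one molecule of water is released and a peptide bond is formed. A protein is a long chain of amino acids linked in this way, and the chain folds in space into a very complex 3D structure. The N, Cα and C atoms of each amino acid thus form the protein backbone, with the side chains attached to it. When a protein interacts with a substrate or ligand, its structure may change slightly. In most cases the backbone does not change noticeably; only the side chains do. Our method therefore uses the plane formed by N, Cα and C as a new reference plane, as shown in Figure 2.

Figure 2.

The N, Cα and C of each amino acid form a triangle, from which the 3D coordinates of every atom in the protein can be redefined. Because the N-Cα and Cα-C distances are fixed and the N-Cα-C angle is constant, the backbones of similar parts of two proteins are also roughly similar. Mapping one N, Cα, C triplet of a protein onto an N, Cα, C triplet of another protein is equivalent to a rotation and a translation in space, and the atoms' 3D coordinates are transformed accordingly. The triangle formed by N, Cα and C also determines a unique new coordinate system. We therefore take the N, Cα and C of every amino acid in the protein to define a new coordinate system.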

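The new coordinate system is determined from N, Cα and C as shown in Figure 3, and it is computed as follows:

1. $\vec{e}_1$ is the unit vector parallel to $\overrightarrow{C_\alpha N}$. ...(1)
2. $\vec{e}_2$ is obtained as $\vec{e}_2 = \dfrac{\vec{e}_1 \times \overrightarrow{C_\alpha C}}{\lVert \vec{e}_1 \times \overrightarrow{C_\alpha C} \rVert}$ ...(2)
3. $\vec{e}_3$ is obtained as $\vec{e}_3 = \vec{e}_2 \times \vec{e}_1$ ...(3)

Figure 3.

For every amino acid in the protein, a new coordinate system is obtained with the method above. Each atom then has new 3D coordinates relative to each such coordinate system. These coordinates are needed in the next step, the Geometric Hashing algorithm. The 3D conformation of a protein is usually closely related to its function, so proteins with the same function usually share locally similar structures. Such shared local structures among proteins of the same function can therefore be used to represent active sites.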
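The frame construction of equations (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and placing the origin of the local frame at Cα is an assumption, since the report does not state the origin explicitly.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)

def residue_frame(n_atom, ca_atom, c_atom):
    """Return (e1, e2, e3) for one residue, following equations (1)-(3)."""
    e1 = unit(sub(n_atom, ca_atom))              # (1) unit vector along C_alpha -> N
    e2 = unit(cross(e1, sub(c_atom, ca_atom)))   # (2) normal to the N-C_alpha-C plane
    e3 = cross(e2, e1)                           # (3) completes the orthonormal triple
    return e1, e2, e3

def to_local(point, ca_atom, frame):
    """Express a point in the residue frame (origin at C_alpha: an assumption)."""
    d = sub(point, ca_atom)
    return tuple(dot(d, e) for e in frame)
```

For example, with N at (1, 0, 0), Cα at the origin and C at (0, 1, 0), the frame is e1 = (1, 0, 0), e2 = (0, 0, 1), e3 = (0, 1, 0), so the point (2, 3, 4) has local coordinates (2, 4, 3).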

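We therefore use the Geometric Hashing algorithm [Haim97] to search for similar structures between proteins. Geometric Hashing was originally a computer vision method for matching geometric features. In object recognition it runs efficiently and can be accelerated by parallel processing. It can also be used to match local regions.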

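[...] in a hash table, which saves a great deal of computation time when the actual comparison is made. In the recognition stage, the main aim is to determine whether two proteins have locally similar structures. At this stage, the distribution of atoms in space can be obtained quickly by accessing the purpose-built geometric hash table.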

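In short, the Geometric Hashing algorithm consists of two stages: preprocessing and recognition. In the preprocessing stage, many proteins of known 3D structure are rotated and translated into the many new coordinate systems described above, and the resulting atomic coordinates are stored in a database. This information does not change with the query protein, so it can be computed in advance, which makes the comparison of local 3D structures faster at query time. In the recognition stage, the query protein is compared with the many proteins in the database, each of which is stored as 3D data under many coordinate systems. When the recognition stage is complete, we know the most similar local structure between the query protein and the proteins in the database.

The method for finding similar local 3D structures of proteins is described below.

Preprocessing stage. As described above, every amino acid of a protein determines a new coordinate system. From the N, Cα and C of an amino acid, three mutually perpendicular unit vectors are obtained, and these serve as the axes of the new coordinate system. New coordinates are then computed for every atom of the protein in this system and added to the hash table.

Recognition stage. The query protein is likewise transformed many times, once for each new coordinate system. The transformed coordinates are compared with the data in the hash table built in the previous stage. If an atom of the query protein lies close to an atom in the hash table, it contributes some score to the similarity between the two proteins; how the score is computed is described later. The sum of the scores contributed by all atoms is the similarity of the comparison. In this stage, the 3D structure of the query protein under each of its coordinate systems is compared with the 3D structure of each database protein under each of its coordinate systems. Figure 4 illustrates how the recognition stage is carried out.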

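Figure 4. Geometric Hashing algorithm.

The hash function uses an atom's distance to the origin as the index and the atom's 3D coordinates as the stored value. For example, if atom A1 has coordinates (x, y, z) in some coordinate system, its hash index is computed as $\sqrt{x^2 + y^2 + z^2}$. Note that when two proteins share a similar local structure, the positions of corresponding atoms in that structure are not exactly identical. Two atoms are treated as the same point when they lie within a given distance of each other (for example, 1 Å). Two proteins may also have atoms at geometrically similar positions whose chemical properties are completely different. We therefore use Dayhoff PAM250, the similarity matrix commonly used in amino acid sequence alignment, to improve the accuracy of our method. If two atoms are close in space, that is, within 1 Å of each other, the contributed score is taken from PAM250. If there is no nearby atom, the contributed score is the minimum value in PAM250, namely -8. Table 1 shows the PAM250 similarity matrix.

Table 1. PAM250 similarity matrix.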
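The indexing and scoring rules above can be illustrated with a short Python sketch. Several details are assumptions made only for illustration: the bin width used to quantise the distance, the choice of the nearest candidate when several atoms fall within 1 Å, and the three-entry substitution table, which stands in for the full PAM250 matrix that a real implementation would load.

```python
import math

BIN_WIDTH = 1.0       # assumed quantisation step, in angstroms
MATCH_RADIUS = 1.0    # from the text: atoms within 1 angstrom count as the same point
NO_MATCH_SCORE = -8   # minimum PAM250 value, as stated in the text

# Tiny illustrative excerpt; a real run would use the complete PAM250 table.
SUBSTITUTION = {("A", "A"): 2, ("W", "W"): 17, ("C", "W"): -8}

def hash_index(coord, bin_width=BIN_WIDTH):
    """Quantised distance of an atom to the origin of the local frame."""
    return int(math.sqrt(sum(c * c for c in coord)) // bin_width)

def lookup(res_a, res_b, table=SUBSTITUTION):
    """Symmetric lookup of a residue pair in the substitution table."""
    return table[tuple(sorted((res_a, res_b)))]

def atom_score(query_atom, candidates):
    """Score one query atom, given as (residue_letter, coord), against candidates."""
    residue, coord = query_atom
    nearest = None
    for cand_residue, cand_coord in candidates:
        d = math.dist(coord, cand_coord)
        if d <= MATCH_RADIUS and (nearest is None or d < nearest[0]):
            nearest = (d, cand_residue)
    if nearest is None:
        return NO_MATCH_SCORE          # no atom nearby
    return lookup(residue, nearest[1])
```

An atom at (3, 4, 0) falls in bin 5. An alanine atom with another alanine atom 0.5 Å away scores 2 under this excerpt, while an atom with no neighbour within 1 Å scores -8.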

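Preprocessing stage

For each protein in the database, perform the following steps:

1. Read the 3D coordinates of every atom in the protein; suppose there are n atoms.
2. For each new coordinate system defined by an amino acid:
   a. Compute the new coordinates of all atoms.
   b. Compute the hash index from the new coordinates and quantise it, then add the quantised index and the atom coordinates to the hash table.

Recognition stage

When a query protein is to be searched for local structures similar to those in the database, perform the following steps on it:

1. Read the 3D coordinates of every atom in the query protein; suppose there are s atoms.
2. Choose a new coordinate system defined by one amino acid of the query protein and compute the new coordinates of all atoms. Compare them with a database protein under one of its coordinate systems.
3. Compute the hash index of each atom's coordinates and use it to search the hash table for atoms at nearby positions. If an atom of the query protein is close in space to an atom of the database protein, accumulate its contributed score.
4. The sum of the scores contributed by all atoms is the similarity between the query protein under one coordinate system and the database protein under one coordinate system. Record which amino acids define the pair of coordinate systems with the highest similarity so far, and compute the corresponding rotation and translation.
5. Return to step 2 and choose a new pair of coordinate systems for the query protein and the database protein.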
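The two stages above can be sketched as follows. This is a simplified illustration, not the tool's implementation: it assumes that the atom coordinates have already been expressed in each residue frame, it counts matched atoms instead of summing PAM250 scores, and the names `build_table` and `best_frame_pair` are hypothetical. Neighbouring bins are also probed, which is sufficient as long as the match radius does not exceed the bin width.

```python
import math
from collections import defaultdict

def build_table(db_frames, bin_width=1.0):
    """Preprocessing: index every atom of every database frame by quantised radius."""
    table = defaultdict(list)
    for frame_id, coords in db_frames.items():
        for c in coords:
            table[int(math.hypot(*c) // bin_width)].append((frame_id, c))
    return table

def best_frame_pair(query_frames, table, bin_width=1.0, radius=1.0):
    """Recognition: return (query_frame, db_frame, matched_atoms) with most matches."""
    best = (None, None, 0)
    for query_id, coords in query_frames.items():
        votes = defaultdict(int)
        for c in coords:
            key = int(math.hypot(*c) // bin_width)
            matched = set()
            for k in (key - 1, key, key + 1):      # probe neighbouring bins too
                for frame_id, d in table.get(k, ()):
                    if math.dist(c, d) <= radius:
                        matched.add(frame_id)
            for frame_id in matched:               # one vote per atom per frame
                votes[frame_id] += 1
        for frame_id, count in votes.items():
            if count > best[2]:
                best = (query_id, frame_id, count)
    return best
```

With a database frame `m0` holding atoms at (1, 0, 0), (0, 2, 0) and (0, 0, 3), a query frame whose atoms lie at (1.1, 0, 0), (0, 2.2, 0) and (9, 9, 9) matches `m0` with two atoms.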

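The IUBMB (The International Union of Biochemistry and Molecular Biology) divides enzymes into six classes:

EC1 Oxidoreductase
EC2 Transferase
EC3 Hydrolase
EC4 Lyase
EC5 Isomerase
EC6 Ligase

This classification is based on the distinct chemical reaction characteristics of each class. Below we outline how and why we chose our test data, and we also list the accuracy and the detailed experimental results obtained with our method.

All proteins under EC5 are classified by function, and the 3D structures of their active sites are recorded. The geometric structure of an active site can determine the function of a protein, so proteins with the same function have very similar shapes around their active sites. We therefore collect the proteins in the PDB that carry active-site information, extract the 3D structure of each active site, and assemble an active-site database. Each active site in the database corresponds to a specific protein function and enzyme class number (E.C. No.), and the enzyme class number characterises the active site.

Next we explain how query proteins are chosen and how they are compared with the active sites in the database. All proteins under EC5 with a known enzyme class number but an unknown active-site region are collected as query proteins. If proteins with both a known enzyme class number and a known active-site region were also used as query proteins, they would always match some active site in the database exactly, which would bias the results. For each query protein, we find the local 3D structures similar to each active site in the database, compute the similarity for each pair, and identify the most similar one. We then compare the E.C. No. of the query protein with the E.C. No. of the most similar active site [...]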

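The ICP algorithm is a registration method from computer vision. Given two 3D objects, it estimates the relative rotation and translation between them so that the two objects can be superimposed.

In the process, the point-to-point distance (root mean square distance) between the two objects is minimised. This method optimises the result obtained from Geometric Hashing so as to obtain the largest number of matched atoms.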
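As a hedged illustration, the sketch below shows only the correspondence step and the error measure that each ICP iteration relies on: pairing atoms within a distance threshold and computing the RMSD over those pairs. Re-estimating the rigid transform from the pairs (for example by a least-squares fit) is omitted, and the greedy one-to-one pairing and the function names are assumptions rather than details taken from the report.

```python
import math

def match_pairs(atoms_a, atoms_b, threshold):
    """Greedily pair each atom of A with its nearest unused atom of B within threshold."""
    pairs, used = [], set()
    for i, p in enumerate(atoms_a):
        best_j, best_d = None, threshold
        for j, q in enumerate(atoms_b):
            if j in used:
                continue
            d = math.dist(p, q)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs

def rmsd(atoms_a, atoms_b, pairs):
    """Root mean square distance over the matched pairs."""
    if not pairs:
        return float("inf")
    total = sum(math.dist(atoms_a[i], atoms_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))
```

In an ICP-style loop, these two functions would be called after every transform update, and the iteration would stop once `len(match_pairs(...))` no longer increases.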

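The third step addresses the so-called buried surface problem: the matched atoms may well lie beneath the molecular surface, where they cannot bind to other ligands or proteins. The third step therefore computes the so-called minimal binding surface, which lies on the Solvent Accessible Surface. If this area is too small, the molecule cannot interact or bind with other molecules even if it has the 3D structure of the binding-site residues.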

III. Implementation Results and Performance

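For molecular alignment, our tool can be used to compare two molecules of different kinds. Mr. Han Liang, a graduate student in the group led by Professor Laura Landweber at the Department of Ecology and Evolutionary Biology, Princeton University, provided the molecular mimicry data and some of the questions.

The main difficulty arises when aligning molecules of different kinds, especially DNA/RNA against protein structures. In addition to the reference-frame definition for protein structures in Geometric Hashing described earlier, the reference axes for DNA/RNA are modified as follows, so that a more general range of molecules can be compared. In the figure below, (a) shows the basis computation for protein structures and (b) shows the basis for an arbitrary molecule.

In the next figure, (a) is the basis constructed from the amino acid structure and (b) is the basis constructed from the nucleic acid structure.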

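One data set contains EFG (Elongation Factor-G) and EF-Tu (the complex of Elongation Factor-Tu and tRNA), which share the same orientation. The other data set contains RRF (Ribosomal Recycling Factor) and tRNA, which do not originally have the same orientation. The alignment results for the two data sets are shown in the two figures that follow. For EFG versus EF-Tu/tRNA, our tool computes a rotation matrix [...] and a translation vector (-2.09762, 0.935684, -6.83966). For RRF versus tRNA, it computes a rotation matrix [...] and a translation vector (-36.6722, 65.1501, 33.2718).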
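To make the role of the reported rotation and translation concrete, the sketch below applies a rigid transform to a list of atom coordinates. The convention x' = R x + t is an assumption, since the report does not state which convention the tool uses, and the rotation matrices themselves are not reproduced in this text, so the usage example pairs the reported EFG translation with an identity rotation purely as a placeholder.

```python
def apply_rigid(points, rotation, translation):
    """Apply x' = R x + t to each 3D point (convention assumed, not from the report)."""
    moved = []
    for p in points:
        moved.append(tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        ))
    return moved

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # placeholder rotation
EFG_TRANSLATION = (-2.09762, 0.935684, -6.83966)                 # value reported above

moved = apply_rigid([(1.0, 2.0, 3.0)], IDENTITY, EFG_TRANSLATION)
```

Once the full rotation matrix is available, it would replace `IDENTITY` and the same call would superimpose one molecule on the other.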

Alignment of two molecules using our tool for EFG vs. the EF-Tu/tRNA complex, where the number of atoms is over 4000 and the computation time is about 36 hours on a Pentium 4 3 GHz PC. (Nissen et al. 1995 show that the binding to the ribosome is at the same place and orientation.) This picture is from Professor Laura Landweber's group at the Department of Ecology and Evolutionary Biology, Princeton University, and the orientation is manually selected.

Alignment of two molecules using our tool for RRF vs. tRNA, where the number of atoms is over 1000 and the computation time is about 24 minutes on a Pentium 4 3 GHz PC. (Selmer et al. 1999 show that the binding to the ribosome is at a different place and orientation.) Again, the orientation is manually selected.

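Comparison with other alignment tools

To compare with other tools, we use the same set of proteins as the paper of Blankenbecler et al. Note that the other protein alignment methods all assume a known 1D sequence alignment, and they are efficient only for matching backbone Cα atoms. Our tool needs no such prerequisites and can be applied to arbitrary molecules, including tRNA. For the comparison we use the same six proteins. Figure 8 shows that our tool performs better than the other methods. Panel (a) below was presented in Blankenbecler's paper, in which the Yale, Dali, CE and Lund methods are also compared. The figure below compares them with our tool.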

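Our method is better for the following reasons:

1. For a fixed RMSD over a set of matched atom pairs, our method matches the largest number of backbone Cα atoms.
2. For a fixed number of matched Cα atoms, our method has the lowest RMSD.

As for computation time, most of the time is spent in the first step, geometric hashing. For proteins, reference frames are built only on the Cα atoms of the amino acids, so only a small amount of computation is needed. For the six protein pairs, all alignment computations finish in 6 to 47 seconds. Table 2 records the computation times on a Pentium 4 3 GHz PC.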

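In another experiment, we used 3D model retrieval techniques to search for proteins. The experiment uses the first 362 classes (class 12asA to class 1bsvA) of the FSSP database as of October 2001, whose 4,997 proteins have already been analysed and classified. Classes that contain only a single molecule are skipped. The middle panel of the figure below is the similarity matrix, which shows that proteins belonging to the same FSSP class cluster together; the box in the upper-right corner is an enlargement of the small box in the centre. The similarity value is the inverse of the dissimilarity value, which is the sum of the distances over all corresponding views. The right panel is the precision-recall plot of these 4,997 proteins, computed with psbPlot (Shilane, et al., 2004). We issue a query for each of the 4,997 proteins and compute the recall rate with respect to the other proteins of the same FSSP class. These results indicate that our visual-based matching is helpful to biochemists who retrieve and analyse protein 3D structures.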
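For readers unfamiliar with precision-recall evaluation, the following sketch computes the (recall, precision) points for a single query from its ranked result list. It is a generic illustration and not the psbPlot implementation; the function name and the use of class labels as plain strings are assumptions.

```python
def precision_recall(ranked_labels, query_label):
    """Return (recall, precision) at every rank where a same-class result appears."""
    relevant_total = sum(1 for label in ranked_labels if label == query_label)
    points, hits = [], 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            points.append((hits / relevant_total, hits / rank))
    return points

# Ranked FSSP class names of the results for one query whose own class is "12asA".
points = precision_recall(["12asA", "1bsvA", "12asA", "1bsvA"], "12asA")
```

Here the first relevant result appears at rank 1 (recall 0.5, precision 1.0) and the second at rank 3 (recall 1.0, precision 2/3). Averaging such curves over all queries gives a plot of the kind described above.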

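Compared with the shape histogram method (Ankerst, et al., 1999), similarity-based classification with our method reaches an accuracy of 92.8% (4,997 proteins in 362 classes; we also ran the experiment on 25,591 proteins with similar results). This is very close to the 91.6% accuracy that Ankerst's method achieved on an earlier version of the FSSP data (3,422 proteins in 281 classes). A total of 25,120 proteins are available at http://3d.csie.ntu.edu.tw/ProteinRetrieval/. Note that the DNA files in the PDB are not included. According to our web server statistics, 1,177 users have used the system between June 2003 and now (June 2004). The system now includes models of 25,120 proteins and is synchronised with RCSB PDB every week.

V. Research Results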

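This work has been published as a regular paper at the international conference IEEE Sixth International Symposium on Multimedia Software Engineering (IEEE-MSE2004), Special Session on Bioinformatics, FL, USA. It has also been accepted by the journal Bioinformatics. The two papers are as follows: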

3. Pei-Ken Chang, Chien-Cheng Chen and Ming Ouhyoung, A Tool for Structure Alignment of Molecules. IEEE Sixth International Symposium on Multimedia Software Engineering (IEEE-MSE2004) Special Session on Bioinformatics, FL, USA

4. Jeng-Sheng Yeh, Ding-Yun Chen, Bing-Yu Chen and Ming Ouhyoung, A Web-based Three-dimensional Protein Retrieval System by Matching Visual Similarity, to appear in Bioinformatics (Application Note)

References

[Bamborough96] P. Bamborough, F.E. Cohen: Modeling protein-ligand complexes. Structural Biology (1996) 6:236-241.

[Besl92] Besl, P. J, N. D. McKay, “A method for registration of 3-D shapes,” IEEE Trans. Pattern Anal. Machine Intell., vol. 14, pp. 239–256, 1992.

[Bernstein77] F.C. Bernstein, T.F. Koetzle, et al: The Protein Data Bank. A computer-based archival file for macromolecular structures. Eur. J. Biochem. (1977) 80:319-324.

[Blankenbecler93] R. Blankenbecler, M. Ohlsson, C. Peterson, and M. Ringner, "Matching protein structures with fuzzy alignments", PNAS, October 14, 2003; 100(21): 11936 - 11940.

[Brady00] G. Patrick Brady, Jr., Pieter F.W. Stouten: Fast Prediction and visualization of protein binding pockets with PASS. J Comput Aided Mol. (2000) 14:383-401.

[Brooks90] Frederick. P. Brooks, Jr., Ming Ouhyoung, James J. Batter, P. Jerome Kilpatric, "Project Grope-- Haptic Displays for Scientific Visualization", ACM Computer Graphics, Vol. 24, No.4, pp.177-185. ( ACM SIGGRAPH90).

[Cappello02] V. Cappello, A. Tramontano, U. Koch: Classification of proteins based on the properties of the ligand-binding site: the case of adenine-binding proteins. Proteins. (2002) 47:106-115.

[Chen02] S.-C. Chen, T. Chen: Protein retrieval by matching 3D surfaces. GENSIPS 2002, Raleigh, North Carolina, USA, (October 2002) CP2-09.

[Chen98] Chien-cheng Chen, Leuo-hong Wang, Cheng-yan Kao, Wen-Chin Chen and Ming Ouhyoung, "Molecular Binding in Structure-based Drug Design: a Case Study of the Population-based Annealing Genetic Algorithms", Tenth International Conference on Tools with Artificial Intelligence 98 (TAI98), Taipei, Taiwan, ROC, November, 1998, pp.328-335.

[Chen_CGW02] Ding-Yun Chen and Ming Ouhyoung, "A 3D Object Retrieval System Based on Multi-Resolution Reeb Graph", Proc. of Computer Graphics Workshop, pp.16, Tainan, Taiwan, June 2002.

[Chen_EG03] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen and Ming Ouhyoung, "On Visual Similarity Based 3D Model Retrieval", Proc. EUROGRAPHICS, Granada, Spain, Sept. 2003.


[Chen_ICS02] Ding-Yun Chen, Ming Ouhyoung, "A 3D Model Alignment and Retrieval System", Proc. of ICS2002 (International Computer Symposium), Hualien, R.O.C, December 2002.

[Chen_Thesis03] Ding-Yun Chen, Three-Dimensional Model Shape Description and Retrieval Based on LightField Descriptors, PhD Thesis, National Taiwan University, 2003.

[Haim97] H. J. Wolfson and I. Rigoutsos: Geometric hashing: an overview. IEEE Comput. Sc. and Eng. (1997) 4: 10-21.

[Hendlich03-1] M. Hendlich, A. Bergner, J. Gunther, G. Klebe: Relibase: design and development of a database for comprehensive analysis of protein-ligand interactions. J. Mol. Biol. (2003) 326:607-620.

[Hendlich03-2] M. Hendlich, A. Bergner, J. Gunther, G. Klebe: Utilising structural knowledge in drug design strategies: applications using Relibase. J. Mol. Biol. (2003) 326:621-636.

[Honig03] S. Goldsmith-Fischman and B. Honig, "Structural genomics: Computational methods for structure analysis", Protein Sci., September 1, 2003; 12(9): 1813 - 1821.

[Johnson97] Andrew Johnson, Spin-Images: A Representation for 3-D Surface Matching, Ph.D. thesis, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, Aug. 1997.

[Kim03] D. Kim, D. Xu, J.-t. Guo, K. Ellrott, and Y. Xu, "PROSPECT II: protein structure prediction program for genome-scale applications", Protein Eng., September 1, 2003; 16(9): 641 - 650.

[Lamdan88] Y. Lamdan and H. Wolfson, "Geometric Hashing: A General and Efficient Model-Based Recognition Scheme", Proceedings of the IEEE Second International Conference on Computer Vision, Dec. 1988, pp. 238-249.

[Liu_PG03] Pin-Chou Liu, Fu-Che Wu, Wan-Chun Ma, Rung-Huei Liang, Ming Ouhyoung, "Automatic Animation Skeleton Construction Using Repulsive Force Field", Pacific Graphics 2003, Oct 2003, Canmore, Alberta, Canada.


[Milik03] M. Milik, S. Szalma, and K. A. Olszewski,"Common Structural Cliques: a tool for protein structure and function analysis", Protein Eng., August 1, 2003; 16(8): 543 - 552.

[Najmonovich00] R. Najmanovich, J. Kuttner, V. Sobolev, M. Edelman: Side-chain flexibility in proteins upon ligand binding. Proteins. (2000) 39:261-268.

[Rockey02] W.M. Rockey, A. H. Elcock: Progress toward virtual screening for drug side effects. Proteins. (2002) 48:664-71.

[Shen_EG03] Yu-Te Shen, Ding-Yun Chen, Xiao-Pei Tian, and Ming Ouhyoung, "3D Model Search Engine Based on Lightfield Descriptors", EUROGRAPHICS, Granada, Spain, Sept. 2003.

[Shih and Hwang 2003] E. S. C. Shih and M. J. Hwang. Protein structure comparison by probability-based matching of secondary structure elements. Bioinformatics. 19. 735-741, 2003.

[Shih and Hwang 2004] E. S. C. Shih and M. J. Hwang. Alternative Alignments from Comparison of Protein Structures. Proteins: Structure, Function and Bioinformatics. 56. 519-527, 2004.

[Shindyalov98] Shindyalov IN, Bourne PE, " Protein structure alignment by incremental combinatorial extension (CE) of the optimal path". Protein Engineering 11(9) 739-747, 1998.

[Siddiqi99] K. Siddiqi, A. Shokoufandeh, S. Dickinson, and S. Zucker, "Shock Graphs and Shape Matching", International Journal of Computer Vision, Volume 30, 1999, pp 1--24.

[Smith02] G.R. Smith, M.J.E. Sternberg: Prediction of protein-protein interactions by docking methods. Structure Biol. (2002) 12:28-35.

[Tian_Thesis03] Xiao-Pei Tian, A 3D Model Retrieval System Based on MPEG-7 3DSSD, Master Thesis, National Taiwan University, June 2003.

[Totrov97] M. Totrov, R. Abagyan: Flexible protein-ligand docking by global energy optimization in internal coordinates. Proteins. (1997) S1:215-220.

[Wang95] Leuo-hong Wang, Cheng-yan Kao, Ming Ouhyoung, Wen-chin Chen, "Molecular Binding: A Case Study of the Population-Based Annealing Genetic Algorithm", Proc. of ICEC'95 (1995 IEEE International Conference on Evolutionary Computing), pp. 41-50, Australia, Nov. 1995.

[Zemla03] A. Zemla, "LGA: a method for finding 3D similarities in protein structures", Nucleic Acids Res., July 1, 2003; 31(13): 3370 - 3374.


Title Page

(submitted to APPLICATIONS NOTE in Bioinformatics)

Title:

A Web-based Three-dimensional Protein Retrieval System by Matching Visual Similarity

Authors:

Jeng-Sheng Yeh¹*, Ding-Yun Chen¹, Bing-Yu Chen² and Ming Ouhyoung¹,³

Email: {jsyeh, dynamic, robin, ming}@cmlab.csie.ntu.edu.tw

¹ Department of Computer Science and Information Engineering
² Department of Information Management
³ Graduate Institute of Network and Multimedia

National Taiwan University, Taipei 106, Taiwan

Running Head:

(less than 50 characters) 3D Protein Retrieval: a Visual-based Approach

Word Counts:

(less than 1000 words plus one figure) 990

Summary: A web-based three-dimensional (3D) protein retrieval system is available for protein structure data, including the entire PDB and the FSSP dataset. In this system, we use a visual-based matching method to compare protein structures from multiple viewpoints. Each query takes less than three seconds, with 90% accuracy on average.

Availability: The web-based query interface and downloadable files can be accessed via http://3d.csie.ntu.edu.tw/ProteinRetrieval/

Contact: [email protected]

Supplementary information: Further details of the proposed method are available at http://graphics.csie.ntu.edu.tw/~jsyeh/3Dprotein/

INTRODUCTION

There are now more than 25,000 protein structure files in the Protein Data Bank (PDB) (Berman et al., 2000), with about one hundred more added per week. Hence the need for protein structure retrieval is increasing. We therefore propose a visual-based method that finds similar protein structures automatically and can also provide some clues for protein classification.

Several algorithms and servers have been proposed to analyse the protein structures in the PDB and help predict protein functions, because the shape of a protein may determine its function. The following tools are mainly based on alignment of primary structure (1D sequence data), secondary structure (helix/sheet) and/or 3D atom coordinates. For instance, EMBL SSM (Krissinel & Henrick, 2003) uses a graph-matching algorithm to map secondary structure elements as a starting point for iteratively aligning atoms. To compare 3D protein structures, the Dali/FSSP (Holm & Sander, 1998) database was developed on the basis of exhaustive 3D structure comparison of the protein structures currently in the PDB. Several image processing-based methods have also been proposed for protein structure comparison (Sandak, et al., 1995) (Chi, et al., 2004). The shape histogram (Ankerst, et al., 1999) is used to compare the 3D structure of protein surfaces. Here, however, we would like to provide an alternative tool based on views instead of atom positions only.

In this paper, we present a visual-based protein retrieval system, which is available on Internet with web-enabled interface. The concept of the visual-based matching method is through human perception, therefore, the result of retrieval can be used and manipulated more intuitively and quickly. Biologists can receive the ranked results of a given query. The design of user interface is described as

displayed in terms of visual similarity ranking. The users can also pick one of the results for further query by clicking again. If users want to query by an unpublished protein structure, they can upload the protein structure file in PDB format. The server will calculate the necessary 3D features and make a query.

For output display, users can choose their preferred configuration. One configuration displays images of all retrieved proteins in order of similarity. Another displays the results with metadata from the PDB files, including protein name, EC number and SwissProt ID. The system output can link to other online databases, such as OCA (Prilusky, 2004) and PDBsum (Laskowski, 2001).

METHODS

The proposed method is based on LightField descriptors (Chen, et al., 2003) and matches 3D protein structures by visual similarity. The core idea of the multi-view projection method is to compare 3D objects from multiple projection views. The retrieval process is divided into off-line feature extraction and on-line protein retrieval. In off-line feature extraction, the projection views are first pre-rendered from the solvent-accessible surface of the protein, which is computed by Connolly's msp package (Connolly, 1983). Then 2D shape descriptors, namely Zernike moment descriptors and Fourier descriptors, are extracted as features for each projection view. In our system, 100 projection views are rendered around the centre of the 3D structure for the visual-based matching.

In on-line protein retrieval, the dissimilarity value of two proteins is calculated by summing the distances between the descriptors of each pair of corresponding views. In addition, in order to accelerate matching in such a large database, we use an iterative algorithm with early rejection of non-relevant models to match features efficiently. To reject non-relevant protein structures iteratively, the lower-frequency parts of the Zernike moment descriptors and Fourier descriptors are matched in the initial stage, and higher-frequency parts of those descriptors are applied in each later stage to refine the top-ranked results. After models are rejected stage by stage, a query over the whole database (more than 25,000 proteins) can be finished in less than 3 seconds on a Pentium 4 2.4 GHz PC. Figure (1a) shows a typical example of protein retrieval in the proposed web-based system.
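The coarse-to-fine matching described above can be sketched as follows. This is a simplified illustration under several assumptions: each view is represented by a flat list of coefficients ordered from low to high frequency, views of the two proteins are taken to be already in correspondence, the per-view distance is an L1 distance, and the stage sizes and the fraction kept per stage are arbitrary example values. The names `dissimilarity` and `retrieve` are hypothetical.

```python
def view_distance(d1, d2, n_coeffs):
    """L1 distance over the first n_coeffs (lowest-frequency) coefficients."""
    return sum(abs(x - y) for x, y in zip(d1[:n_coeffs], d2[:n_coeffs]))

def dissimilarity(views_p, views_q, n_coeffs):
    """Sum of per-view distances over corresponding views."""
    return sum(view_distance(v1, v2, n_coeffs) for v1, v2 in zip(views_p, views_q))

def retrieve(query, database, stages=(2, 4, 8), keep=0.5):
    """Rank database entries, discarding the worst candidates after each coarse stage."""
    candidates = list(database)
    for n_coeffs in stages:
        candidates.sort(key=lambda name: dissimilarity(query, database[name], n_coeffs))
        if n_coeffs != stages[-1]:
            # Early rejection: keep only the best fraction for the finer stage.
            candidates = candidates[:max(1, int(len(candidates) * keep))]
    return candidates
```

Because rejected candidates are never re-examined, early rejection trades a small risk of discarding a good match for a large reduction in the number of full-length descriptor comparisons.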


(Holm & Sander, 1998), are analysed and classified. For every class, we’ll skip it if there is only one molecule inside. Figure (1b) is the similarity matrix, which shows that the proteins with the same FSSP class name will be clustered together. The box in the upper-right corner is the enlarged part of the small box in the centre. The similarity value is the inverse of dissimilarity value, which is the sum of the distances in all the corresponding views. Figure (1c), calculated by psbPlot (Shilane, et al., 2004), is the precision-recall plot of these 4,997 proteins. We create a query for each protein from 4,997 proteins and plot the recall rate of other proteins while those proteins have the same FSSP class name in the 4,997 proteins. It shows that our visual-based matching method may provide some useful clues to help biochemists retrieve and analyse protein 3D structure.

Compared to the shape histogram method (Ankerst, et al., 1999), the accuracy of nearest neighbour classification using our method is 92.8% (4,997 proteins in 362 classes; 25,591 proteins were also tested, with similar results), which is very similar to the 91.6% of Ankerst's method on the previous version of the FSSP dataset (3,422 proteins in 281 classes).
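Nearest neighbour classification accuracy of this kind can be computed from a dissimilarity matrix as in the sketch below. It is a generic leave-one-out illustration with a hypothetical function name, not the evaluation code used for the reported figures.

```python
def nn_accuracy(labels, dissim):
    """Fraction of items whose nearest other item carries the same class label."""
    correct = 0
    for i in range(len(labels)):
        nearest = min((k for k in range(len(labels)) if k != i),
                      key=lambda k: dissim[i][k])
        correct += labels[nearest] == labels[i]
    return correct / len(labels)

labels = ["x", "x", "y", "y"]
dissim = [[0, 1, 5, 6],
          [1, 0, 5, 6],
          [5, 5, 0, 7],
          [6, 0.5, 7, 0]]
accuracy = nn_accuracy(labels, dissim)
```

In this toy matrix the two "x" items are each other's nearest neighbours, while both "y" items are closest to an "x" item, so the accuracy is 0.5.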

A full set of 25,120 proteins is available in http://3d.csie.ntu.edu.tw/ProteinRetrieval/. Note that DNA files in PDB are not included. As for statistics of our web server, there are 1177 accesses from the first prototype (2,051 proteins inside) of June 16, 2003 to current release (June 16, 2004). Now the system is extended to 25,120 proteins and synchronized to RCSB PDB weekly.

This project is supported in parts by CIET-NTU (MOE), NSC93-2622-E-002-033, NSC93-2752-E- 002-007-PAE, NSC93-2213-E-002-083, and NSC93-2213-E-002-084.

Ankerst, M., Kastenmuller, G., Kriegel, H.-P. & Seidl, T. (1999) Nearest Neighbor Classification in 3D Protein Databases. Proc. 7th Int. Conf. Intelligent System for Molecular Biology (ISMB'99). 34-43.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000) The protein data bank. Nucleic Acids Research, 28, 235-42. http://www.pdb.org/.

Chen, D.-Y., Tian, X.-P., Shen, Y.-T. & Ouhyoung, M. (2003) On visual similarity based 3d model retrieval. Computer Graphics Forum, 22 (3), 223-33. (Proc. Eurographics 2003).

Chi, P.-H, Scott, G. & Shyu, C.-R. (2004) A fast protein structure retrieval system using image-based distance matrices and multidimensional index. BIBE'04.

Connolly, M. L. (1983) Solvent-accessible surfaces of proteins and nucleic acid. Science, 221 (4612), 709-13.

Holm, L. & Sander, C. (1998) Touring protein fold space with Dali/FSSP. Nucleic Acids Research, 26 (1), 316-9.

Krissinel, E. & Henrick, K. (2003) Protein structure comparison in 3d based on secondary structure matching (ssm) followed by ca alignment, scored by a new structural similarity function. Proc. 5th Int. Conf. Molecular Structure Biology. 88.

Laskowski, R. A. (2001) PDBsum: summaries and analyses of pdb structures. Nucleic Acids Research, 29 (1), 221-2.

Prilusky, J. (2004). OCA, a browser-database for structure/function. http://bip.weizmann.ac.il/oca-bin/ocamain/.

Sandak, B., Nussinov, R. & Wolfson, HJ. (1995) An automated computer vision and robotics-based technique for 3-D flexible biomolecular docking and matching, CABIOS 11(1), 87-99.

Shilane, P., Min, P. & Funkhouser, T. (2004) The Princeton Shape Benchmark, Proc. Shape Modeling International (SMI'04), 2004.

Figure 1. (a) The query result of our server after submitting the shape of a query protein (4dfr: dihydrofolate reductase). (b) The resulting similarity matrix (4,997 x 4,997), in which the intensity of pixel (x, y) shows the dissimilarity value between protein x and protein y; a darker pixel (x, y) means that protein x and protein y are more similar. The box in the upper-right part is the enlarged sample of the small box in the centre. (c) The precision-recall plot: "given different recall rates (x-axis, 0 to 100%), plot the precision values (y-axis, 0 to 100%) of the correct classification." For comparison purposes, we use 4,997 proteins as queries to see whether proteins with the same FSSP class name are retrieved. Please visit the supplementary web page for further details.


Pei-Ken Chang, Chien-Cheng Chen and Ming Ouhyoung

Department of Computer Science and Information Engineering, National Taiwan University {zick, ccchen}@cmlab.csie.ntu.edu.tw, [email protected]

Abstract

In this paper, a novel tool is proposed to align two molecules (not just proteins) based on their 3D structural data, and the user can observe the result of alignment visually via the tool. Most existing tools are designed only for alignment of proteins. Here, a new tool is developed to address shared structural features between protein structures and tRNA structures, that is, molecular mimicry, although they are two very different types of molecules.

In order to align two molecules A and B, Geometric Hashing is applied to globally find initial matching of approximately overlapped atoms, thus parts of molecule A can be matched to parts of molecule B. Next, a fine tuning process is introduced, which is based on local optimization of overlapped parts, and the Iterative Closest Point (ICP) is used until the number of overlapped atoms within a given distance threshold can not be increased any more. The results show that our method is useful to structurally align two molecules, not restricted to align two proteins only. Besides, our tool outperforms in terms of RMSD and number of matched atom pairs in comparison to other tools.

1. Introduction

Search engines for 3D models have been developed in recent years [1] [2]. Can similar techniques be used for molecules? If so, the benefit could be great. Because large numbers of protein structures can now be determined by high-throughput machines, classifying proteins into families and assigning functions to novel proteins have become major tasks in recent years. The Protein Data Bank (PDB) currently contains more than 25,000 structures, and it is estimated that the number of structures in the PDB may exceed 35,000 by 2005. Though proteins have been grouped together on the basis of structural similarities in the FSSP [3], CATH [4], and SCOP databases [5], much effort is still being put into finding the similarities among proteins. Moreover, the rapid growth in the amount of structural data of proteins far exceeds the ability of experimental techniques to identify the locations and key amino acids of active sites. Although the structural genomics initiative (SGI) proposes to solve 10,000 protein 3D structures in this decade, many biological functions still remain unknown.

With the help of alignment tools, the structural similarity between proteins is revealed, as well as their functional and evolutionary relationships. Holm and Sander [6] mentioned that structural similarities among distantly related proteins are often preserved in the process of evolution, while very little similarity remains at the sequence level.

One interesting problem under study is molecular mimicry. In the molecular mimicry problem [7], a protein and a nucleic acid share a similar substructure, and sometimes the similarity even extends to their interactions. Nissen et al. [8] indicated that the structure of Elongation Factor-G is similar to that of the complex of Elongation Factor-Tu and tRNA. Selmer et al. [9] mentioned that Ribosomal Recycling Factor looks like tRNA. In addition, exploitation of 3D structural data is a key factor in enhancing structure-based drug design (SBDD), and the prediction of protein functions and possible active sites in proteins has become quite popular in SBDD, especially as a front-end to molecular docking [10] [11] or when alternative active sites are sought.

This paper is organized as follows. Some related works are discussed in section 2. The geometric hashing algorithm and ICP algorithm we use are detailed in section 3. The experimental results are provided in section 4 while conclusion is given in section 5.

2. Previous Work

In general, structure alignment based on 3D structure has been shown to be NP complete by


alignment are needed. Fisher et al. [13] used geometric hashing for a Cα-only representation of protein structure, and a follow-up is described in Tsai et al. [14]. Their method is based on preprocessing and recognition algorithms of complexity $O(n^3)$, where n is the number of residues of interest. Later, Pennec and Ayache [15] [16] introduced a 3D reference frame attached to each residue, which reduces the complexity of recognition to $O(n^2)$. Shindyalov and Bourne [17] proposed a method that involves a combinatorial extension (CE) of an alignment path defined by aligned fragment pairs (AFPs) rather than the more conventional techniques which use dynamic programming and Monte Carlo optimization. Combinations of AFPs that represent possible continuous alignment paths are selectively extended or discarded thereby leading to a single optimal alignment.

Zemla [18] proposed LGA (local-global alignment) algorithm, where longest continuous sequence is first found, and then a second step called GDT (global distance test) is applied. Both longest segment of residues under selected RMSD (root mean square distance) and largest set of equivalent residues that deviate less than a given distance threshold are obtained. Blankenbecler et al. [19] proposed to use fuzzy alignment variables and iterative minimization of a cost function. Milik et al. [20] used graph matching and represented atoms as nodes and bond distance as edge labels. The search method is based on comparison of local structure features of proteins that share a common biochemical function, and so does not depend on overall similarity of structures and sequences of compared proteins.

From the above survey, it is clear that all the above papers are concerned with proteins, and complexity reduction in alignment according to features of proteins or segments of aligned one dimensional sequence. Therefore, they can not solve the general molecule alignment problem unless the tools are modified.

3. Algorithms

In this paper, we propose a tool to align two molecules based on their 3D structural data. The alignment problem between two molecules A and B is solved in two steps: Geometric Hashing and a fine-tuning process. First, Geometric Hashing globally finds an initial matching of approximately overlapped atoms, so that parts of molecule A can be matched to parts of molecule B. Second, the fine-tuning process is based on local optimization with the Iterative Closest Point (ICP) algorithm, applied repeatedly until the number of overlapped atoms within a given distance threshold cannot be increased any further.

3.1. Geometric Hashing: Step One

The geometric hashing algorithm is introduced to structurally align two molecules. It is a technique originally developed in computer vision for object recognition and can easily be made parallel [21] [22]. In short, the geometric hashing algorithm is composed of two stages: preprocessing and recognition. The basic idea is to store in a database, at preprocessing time, a redundant representation of the models under rigid transformation. By doing so, the representation of the query object processed at recognition time will present some similarities with that of some database models. Matching is possible even when the recognizable database objects have undergone transformations or when only partial information is present.

Often the two molecules of interest are both proteins, so we illustrate the solution for that situation first. In some cases, e.g. molecular mimicry, the two molecules belong to different types; the calculation then differs slightly, as described later.

The three atoms N, Cα and C in each amino acid form a triangle which uniquely defines the position and orientation of the amino acid in the three-dimensional structure of a protein, since the lengths of the N−Cα and Cα−C bonds are fixed and the N−Cα−C bond angle is also constant. For alignment, the correspondence between two triplets of points in three-dimensional space is sufficient to uniquely determine a rigid transformation. With this mechanism, we can choose a single residue as a basis. A basis is calculated by the following steps and illustrated in Figure 1(a).

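1. Normalize $\overrightarrow{NC_\alpha}$ to $\vec{e}_1$.
2. $\vec{e}_2 = \dfrac{\vec{e}_1 \times \overrightarrow{C_\alpha C}}{\lVert \vec{e}_1 \times \overrightarrow{C_\alpha C} \rVert}$
3. $\vec{e}_3 = \vec{e}_2 \times \vec{e}_1$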
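The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and placing the frame origin at Cα is an assumption made here (the text states the origin only for the general-molecule case).

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0.0:
        raise ValueError("zero-length vector")
    return (v[0] / length, v[1] / length, v[2] / length)

def residue_basis(n_atom, ca_atom, c_atom):
    """Build (origin, e1, e2, e3) from the N, CA and C coordinates of one residue."""
    e1 = normalize(sub(ca_atom, n_atom))              # step 1: direction N -> CA
    e2 = normalize(cross(e1, sub(c_atom, ca_atom)))   # step 2: normal of the N-CA-C plane
    e3 = cross(e2, e1)                                # step 3: third axis, as in the text
    return ca_atom, e1, e2, e3                        # origin at CA is an assumption

def to_local(point, basis):
    """Express a point in the coordinate frame of the given basis."""
    origin, e1, e2, e3 = basis
    d = sub(point, origin)
    return (dot(d, e1), dot(d, e2), dot(d, e3))
```

Because the local coordinates depend only on distances and angles relative to the residue, they should be unchanged when the whole protein is rotated or translated, which is the property the hash table relies on.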

There are two phases, preprocessing and recognition, in the geometric hashing algorithm. To solve the problem of representation under different reference coordinates, coordinate information based on each reference frame of a model is encoded in the preprocessing phase and stored in a large memory, in this case a hash table. The contents of the hash table are computed offline to reduce the time needed for recognition. Access to the memory is based on geometric information that is invariant to the object’s pose and is computed directly from the scene. During the recognition phase, the method accesses the previously constructed hash table using the indices of the encoded coordinate information of the input object and finds their common spatial features.

In the preprocessing phase, we calculate one basis for each residue and use it to generate coordinates for each atom in a protein. In the recognition phase, we choose a reference frame of protein B. For each reference frame of protein A in the hash table, we accumulate the number of matched atoms by checking whether two atoms are close enough. We set a threshold distance MatchThres (a value of 1 to 2 Å is appropriate), beyond which atoms are not considered a match. If no atom can be matched within MatchThres, the score is 0; if there is an atom within MatchThres, the score is 1. The process is repeated with each reference frame of protein B until all pairs of reference frames of the two proteins have been tested.
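The voting scheme described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the tool's actual code: `build_table` and `vote` are hypothetical names, the inputs are atom coordinates already expressed in each reference frame, and the uniform grid with cell size MatchThres is one possible hashing choice.

```python
import math
from collections import defaultdict
from itertools import product

def cell_of(point, size):
    """Quantize a 3D point to an integer grid cell used as the hash key."""
    return tuple(int(math.floor(c / size)) for c in point)

def build_table(frames_a, match_thres):
    """Preprocessing: frames_a[i] lists the atoms of A expressed in A's i-th frame."""
    table = defaultdict(list)
    for frame_id, coords in enumerate(frames_a):
        for p in coords:
            table[cell_of(p, match_thres)].append((frame_id, p))
    return table

def vote(table, coords_b, match_thres):
    """Recognition for one frame of B: count matched atoms for every frame of A."""
    votes = defaultdict(int)
    for q in coords_b:
        cx, cy, cz = cell_of(q, match_thres)
        hit_frames = set()
        # Atoms within match_thres can only lie in the 27 surrounding cells.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for frame_id, p in table.get((cx + dx, cy + dy, cz + dz), ()):
                if math.dist(p, q) <= match_thres:
                    hit_frames.add(frame_id)
        for frame_id in hit_frames:  # score 1 per atom of B, 0 if nothing is close
            votes[frame_id] += 1
    return dict(votes)
```

Calling `vote` once per reference frame of B and keeping the pair of frames with the highest count gives the initial alignment candidate; the table itself depends only on A and can be built once.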

In the case of aligning two different kinds of molecules, the algorithm is slightly modified when creating the bases. For each atom with coordinate P, select two atoms connected to it, and let their coordinates be Q1 and Q2 respectively. The rule for constructing a basis is

1. Normalize $\overrightarrow{PQ_1}$ to $\vec{e}_1$.
2. $\vec{e}_2 = \vec{e}_1 \times \overrightarrow{PQ_2}$
3. $\vec{e}_3 = \vec{e}_2 \times \vec{e}_1$

This rule is illustrated in Figure 1(b). The origin of the new coordinate frame is P. If an atom is connected to n atoms, $n(n-1)$ coordinate frames would be made for this atom. The number of coordinate frames constructed in this way is too large for efficient execution. To decrease the execution time, the criteria for selecting atoms to create bases are listed in Table 1. With these criteria we calculate two bases for each residue and four bases for each nucleotide. In proteins, the “CA” atom is on the backbone and carries the side chain, and the “CB” atom is the attached side-chain atom. In nucleic acids, the “C4*” atom and the “C3*” atom both occupy a position similar to that of the “CA” atom in proteins, and the “O4*” and “C2*” atoms occupy a position similar to that of the “CB” atom. This is illustrated in Figure 2.
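The frame count and the effect of the selection rule can be checked with a small sketch. The function names and the `SELECTION_RULE` encoding are hypothetical, and the neighbor lists in the usage note are assumed heavy-atom bonds (hydrogens omitted), not data taken from the paper.

```python
from itertools import permutations

def all_frames(center, neighbors):
    """Every ordered pair (Q1, Q2) of distinct neighbors gives one frame at `center`."""
    return [(center, q1, q2) for q1, q2 in permutations(neighbors, 2)]

# Hypothetical encoding of Table 1: the allowed (P, Q1) atom-name pairs.
SELECTION_RULE = {("CA", "CB"), ("C4*", "O4*"), ("C3*", "C2*")}

def selected_frames(center, neighbors):
    """Keep only frames whose (P, Q1) pair is allowed; Q2 ranges over the other neighbors."""
    return [f for f in all_frames(center, neighbors)
            if (f[0], f[1]) in SELECTION_RULE]
```

For an amino acid with `selected_frames("CA", ["N", "C", "CB"])`, the six unrestricted frames shrink to two. Under the assumed bonds, `selected_frames("C4*", ["C3*", "C5*", "O4*"])` and `selected_frames("C3*", ["C2*", "C4*", "O3*"])` give two frames each, which agrees with the four bases per nucleotide stated in the text.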

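Table 1. The rule for selecting atoms to construct coordinate frames.

| Type of the molecule | Name of the atom at P | Name of the atom at Q1 |
|----------------------|-----------------------|------------------------|
| Proteins             | “CA”                  | “CB”                   |
| Nucleic Acids        | “C4*”                 | “O4*”                  |
| Nucleic Acids        | “C3*”                 | “C2*”                  |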

Figure 1. Calculation of a basis. (a) The protein structure. (b) The general molecule structure.

Figure 2. A sketch of molecules to explain the rule for coordinate frame construction. (a) Amino acid. (b) Nucleotide.

3.2. Fine Tuning Process: Step Two

Once geometric hashing has produced an approximate alignment by global optimization, the next step is a fine-tuning process based on local optimization of the overlapped parts. This step is necessary because the 3D structural data in PDB always involve sampling error from determining atom positions by X-ray crystallography. Furthermore, geometric hashing provides only an initial alignment. The alignment therefore needs fine tuning, for which the Iterative Closest Point (ICP) algorithm [23] [24] is chosen. As illustrated in Figure 3, the ICP algorithm is applied repeatedly until the number of overlapped atoms within a given distance threshold can be increased no more.

The ICP algorithm addresses the following key registration problem: given two three-dimensional shapes, estimate the optimal translation and rotation that register the two shapes by minimizing the mean square distance between them. The algorithm is guaranteed to find a local minimum of a mean square objective function [23]. In our implementation, we select the 100 rigid transformations that lead to the largest numbers of overlapped pairs. The results show that ICP indeed increases the number of atoms matched.
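For reference, the mean square objective mentioned above can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: $\mathbf{a}_i$ are the $N$ atoms of molecule A being moved, $\mathbf{b}_j$ are the atoms of molecule B, $R$ is a rotation, $\mathbf{t}$ a translation, and $c(i)$ the index of the atom of B currently closest to the moved atom $i$.

```latex
E(R, \mathbf{t}) = \frac{1}{N} \sum_{i=1}^{N}
    \left\lVert R\,\mathbf{a}_i + \mathbf{t} - \mathbf{b}_{c(i)} \right\rVert^{2},
\qquad
c(i) = \arg\min_{j} \left\lVert R\,\mathbf{a}_i + \mathbf{t} - \mathbf{b}_j \right\rVert
```

Each ICP iteration alternates two steps: with $(R, \mathbf{t})$ fixed, recompute the correspondences $c(i)$; with $c(i)$ fixed, solve for the $(R, \mathbf{t})$ that minimizes $E$. Neither step can increase $E$, which is why the procedure settles in a local minimum and why a good initial alignment from geometric hashing matters.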

Figure 3. The flow chart for fine tuning process.

4. Experimental Results

4.1. The Molecular Alignment Problem

Our tool can be used in solving the comparison of two molecules that belong to different types. The data and the problem of molecular mimicry (Figure 4, Figure 5) are provided by a graduate student Mr. Han Liang from Professor Laura Landweber’s group in Dept. of Ecology and Evolutionary Biology, Princeton University [25].

One data set [8] consists of EFG (Elongation Factor-G) and EF-tu (the complex of Elongation Factor-Tu and tRNA), and the orientations of the original data are almost the same. The other data set [9] consists of RRF (Ribosomal Recycling Factor) and tRNA, but they are not in the same orientation originally. The aligning results of these two data sets are shown in Figure 6 and Figure 7.

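After calculation by our tool, the rotation matrix between EFG and EF-tu/tRNA has the entry magnitudes shown below; the original prints three of the entries with a minus sign, but their positions are not recoverable from this copy:

$$\begin{pmatrix} 0.993473 & 0.0945041 & 0.0638711 \\ 0.0832201 & 0.983483 & 0.160734 \\ 0.0780061 & 0.15437 & 0.984929 \end{pmatrix}$$

and the translation vector is

$(-2.09762,\ 0.935684,\ -6.83966)$.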

For the case of RRF and tRNA, the rotation matrix has the entry magnitudes shown below; again, the original prints three of the entries with a minus sign, but their positions are not recoverable from this copy:

$$\begin{pmatrix} 0.42475 & 0.902835 & 0.0669089 \\ 0.545962 & 0.314407 & 0.776578 \\ 0.722159 & 0.293322 & 0.626458 \end{pmatrix}$$

and the translation vector is

$(-36.6722,\ 65.1501,\ 33.2718)$.

Figure 4. EFG vs. EF-tu/tRNA complex (Nissen et al. 1995 show that the binding to the ribosome is at the same place and orientation). This picture is from Professor Laura Landweber’s group in the Dept. of Ecology and Evolutionary Biology, Princeton University, and the orientation is manually selected.

Figure 5. RRF vs. tRNA (Selmer et al. 1999 show that the binding to the ribosome is at a different place and orientation). Again, the orientation is manually selected.

4.2. Comparison with Other Alignment Tools

In order to compare with other tools, we use the same set of proteins as in the paper of Blankenbecler et al. [19]. Note that other protein alignment methods usually use the knowledge of matched 1D sequence alignment for proteins, and they are optimized for proteins only, focusing on matching backbone Cα atoms. Our tool does not make this assumption, and works for arbitrary molecules, including tRNA. Still, for comparison purposes, we use the same set of six proteins. Figure 8 shows that our tool compares favorably with the other methods: Figure 8(a) is reproduced from Blankenbecler et al. [19], in which the Yale [26], Dali [27] [28], CE [17] and Lund [19] results are given, and Figure 8(b) is from our tool.

The reasons why our method is better are:

1. Given a fixed RMSD for pairs of matched atoms, our method matches the largest number of backbone Cα atoms;

2. Given a fixed number of matched Cα atoms, our method has the lowest RMSD.
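The two quantities compared above, the number of matched atoms and their RMSD, can be computed as in the following sketch. The greedy nearest-neighbor pairing is an assumption made for illustration and is not necessarily how the tool pairs atoms; the function names are hypothetical.

```python
import math

def matched_pairs(atoms_a, atoms_b, threshold):
    """Greedily pair each atom of A with its nearest unused atom of B within `threshold`."""
    used = set()
    pairs = []
    for i, p in enumerate(atoms_a):
        best_j, best_d = None, threshold
        for j, q in enumerate(atoms_b):
            d = math.dist(p, q)
            if j not in used and d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs

def rmsd(atoms_a, atoms_b, pairs):
    """Root mean square distance over the matched pairs only."""
    if not pairs:
        raise ValueError("no matched pairs")
    total = sum(math.dist(atoms_a[i], atoms_b[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))
```

Sweeping `threshold` over a range of values and recording `len(pairs)` against `rmsd(...)` yields curves of the kind plotted in Figure 8.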

In terms of computation cost, the major cost is in the first step, the geometric hashing. In the case of proteins, the coordinate frames are generated from the amino acid Cα atoms only, and thus the computation cost is low. For the six pairs of target proteins, every alignment calculation finishes within 47 seconds. Table 2 shows the computation time on a Pentium-4 3GHz PC.

In the case of molecules such as RNA and DNA, the nucleic acid has a carbon ring in its base, and therefore the number of possible coordinate frames tends to be much larger than that of proteins, and the computation time is correspondingly longer. In the case of RRF vs. tRNA, where there are over 1000 atoms in tRNA, the computation time is around 24 minutes, while in the case of EFG vs. the EF-tu/tRNA complex (over 4000 atoms), the computation time can be as long as 36 hours on the same 3 GHz PC. Even so, our tool can still solve this problem, an important one known as "molecular mimicry". As far as we know, our method is the first to solve this kind of problem, because our algorithm is sequence independent and does not use knowledge of 1D sequence similarity between the molecules in a pair.

5. Conclusion

A novel tool has been developed to align two molecules based on 3D structural data. Compared with other algorithms, our tool takes more computation time to align two molecules; however, other tools may be restricted to aligning two proteins. The experiments are based on data from the PDB and demonstrate that the proposed tool is useful and versatile.

The first experiment is the molecular alignment problem. Given two molecules, our tool generates the rotation matrix and translation vector that optimally align them. In our experiments, the resulting alignment is the same no matter how the molecules are initially placed, i.e. at random locations with random orientations.
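How a rotation matrix and a translation vector act on atom coordinates can be sketched as below. The convention of multiplying a column vector by a row-major 3x3 matrix and then adding the translation is an assumption (the paper does not state its convention), and `rotation_z` is a hypothetical helper used only to produce a test rotation.

```python
import math

def apply_rigid(rotation, translation, point):
    """Return R * point + t for a row-major 3x3 rotation and a 3-vector translation."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def rotation_z(angle):
    """Rotation by `angle` radians about the z axis, used here only as an example."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))
```

A rigid transformation preserves all interatomic distances, so moving one molecule to a random pose changes only the rotation and translation that the tool must report, not the quality of the best achievable alignment.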


Figure 6: Alignment of two molecules using our tool for EFG vs. EF-tu/tRNA complex, where the atom number is over 4000 and the computation time is about 36 hours on a Pentium-4 3GHz PC.

Figure 7: Alignment of two molecules using our tool for RRF vs. tRNA, where the atom number is over 1000 and the computation time is about 24 minutes on a Pentium-4 3GHz PC.


Figure 8: Alignment results for a set of protein pairs in terms of RMSD of matched atom pairs and number of aligned atoms (N). (a) is from the fuzzy alignment method of Blankenbecler et al.; the results from Yale (red squares), Dali (green triangles), CE (blue circles), and the Lund method (solid lines) are also given in their paper. (b) is from our tool, for comparison. It shows that our results are better than those of the other methods.

Table 2: Computation time of alignment of six pairs of proteins, where MatchThres means the threshold used in initial geometric hashing, while the other columns are in seconds.

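| MatchThres (Å) | 8DFR-4DFRa | 1MBD-1MBA | 1TIE-4FGF | 1CID-2RHE | 7FABl2-1REIa | 1FXIa-1UBQ |
|----------------|------------|-----------|-----------|-----------|--------------|------------|
| 1.0            | 7          | 5         | 4         | 3         | 2            | 1          |
| 1.5            | 9          | 6         | 5         | 5         | 2            | 1          |
| 2.0            | 11         | 7         | 6         | 5         | 3            | 1          |
| 2.5            | 13         | 10        | 8         | 7         | 3            | 2          |
| 3.0            | 18         | 12        | 9         | 9         | 4            | 3          |
| 3.5            | 22         | 16        | 13        | 11        | 5            | 3          |
| 4.0            | 30         | 20        | 15        | 13        | 7            | 3          |
| 4.5            | 37         | 26        | 20        | 18        | 8            | 6          |
| 5.0            | 47         | 34        | 24        | 22        | 10           | 6          |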

In the second experiment, several protein pairs are used to compare the results with four popular alignment tools, namely Yale [26], Dali [27] [28], CE [17] and Lund [19] methods. Our tool performs the best in terms of RMSD and number of matched atom pairs.

6. References

[1] D.Y. Chen, X.P. Tian, Y.T. Shen, and M. Ouhyoung, “On visual similarity based 3D model retrieval”, Comput. Graph. Forum, 22(3), 2003, pp. 223-232.

[2] T. Funkhouser, P. Min, M. Kazhdan, J. Chen, A. Halderman, D. Dobkin, and D. Jacobs, “A search engine for 3d models”, ACM T. Graphics, 22(1), Jan. 2003, pp. 83-105.

[3] L. Holm and C. Sander, “Touring protein fold space with Dali/FSSP”, Nucl. Acids Res., 26, 1998, pp. 316-319.

[4] C.A. Orengo, A.D. Michie, S. Jones, D.T. Jones, M.B. Swindells, and J.M. Thornton, “CATH - a hierarchic classification of protein domain structures”, Structure, 5(8), Aug. 1997, pp. 1093-1108.

[5] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins for the investigation of sequences and structures”, J. Mol. Biol., 247, 1995, pp. 536-540.

[6] L. Holm and C. Sander, “Mapping the protein universe”, Science, 273, Aug. 1996, pp. 595-602.

[7] P. Nissen, M. Kjeldgaard, and J. Nyborg, “Macromolecular mimicry”, EMBO J., 19, 2000, pp. 489-495.

[8] P. Nissen, M. Kjeldgaard, S. Thirup, G. Polekhina, L. Reshetnikova, B.F.C. Clark, and J. Nyborg, “Crystal structure of the ternary complex of Phe-tRNA^Phe, EF-Tu, and a GTP analog”, Science, 270, Dec. 1995, pp. 1464-1472.

[9] M. Selmer, S. Al-Karadaghi, G. Hirokawa, A. Kaji, and A. Liljas, “Crystal structure of Thermotoga maritima ribosome recycling factor: a tRNA mimic”, Science, 286, Dec. 1999, pp. 2349-2352.

[10] V. Cappello, A. Tramontano, and U. Koch, “Classification of proteins based on the properties of the


[11] G.R. Smith and M.J. Sternberg, “Prediction of protein-protein interactions by docking methods”, Curr. Opin. Struct. Biol., 12(1), Feb. 2002, pp. 28-35.

[12] R.H. Lathrop, “The protein threading problem with sequence amino acid interaction preferences is NP-complete”, Protein Eng., 7, 1994, pp. 1059-1068.

[13] D. Fischer, O. Bachar, R. Nussinov, and H. Wolfson, “An efficient automated computer vision based technique for detection of three dimensional structural motifs in proteins”, J. Biomol. Struct. Dyn., 9(4), Feb. 1992, pp. 769-789.

[14] C.J. Tsai, S.L. Lin, H. Wolfson, and R. Nussinov, “Techniques for searching for structural similarities between protein cores, protein surfaces and between protein-protein interfaces”, Techniques in Protein Chemistry, VII, 1996, pp. 419-429.

[15] X. Pennec and N. Ayache, “An O(n^2) algorithm for 3D substructure matching of proteins”, Shape and Pattern Matching in Computational Biology - Proc. First Int. Workshop, 1994, pp. 25-40.

[16] X. Pennec and N. Ayache, “A geometric algorithm to find small but highly similar 3D substructures in proteins”, Bioinformatics, 14(6), 1998, pp. 516-522.

[17] I.N. Shindyalov and P.E. Bourne, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path”, Protein Eng., 11(9), Sep. 1998, pp. 739-747.

[18] A. Zemla, “LGA: A method for finding 3D similarities in protein structures”, Nucleic Acids Res., 31(13), Jul. 2003, pp. 3370-3374.

[19] R. Blankenbecler, M. Ohlsson, C. Peterson, and M. Ringner, “Matching protein structures with fuzzy alignments”, Proc. Natl. Acad. Sci. USA, 100(21), Oct. 2003, pp. 11936-11940.

[20] M. Milik, S. Szalma, and K.A. Olszewski, “Common structural cliques: a tool for protein structure and function analysis”, Protein Eng., 16(8), Aug. 2003, pp. 543-552.

[21] Y. Lamdan and H.J. Wolfson, “Geometric hashing: a general and efficient model-based recognition scheme”, Proceedings of the Second ICCV, 1988, pp. 238-249.

[22] H.J. Wolfson and I. Rigoutsos, “Geometric hashing: an overview”, IEEE Comput. Sci. Eng., 4, 1997, pp. 10-21.

[23] P.J. Besl and N.D. McKay, “A method for registration of 3-D shapes”, IEEE T. Pattern Anal., 14, 1992, pp. 239-256.

[24] Z. Zhang, “Iterative point matching for registration of free-form curves and surfaces”, Int. J. Comput. Vision, 13(2), 1994, pp. 119-152.

[25] H. Liang and L.F. Landweber, “Computational tests of molecular mimicry between tRNA and protein translation factors”, submitted, 2004.

[26] M. Gerstein and M. Levitt, “Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures”, Proc. Int. Conf. Intell. Syst. Mol. Biol., 4, 1996, pp. 59-67.

[27] L. Holm and C. Sander, “Protein structure comparison by alignment of distance matrices”, J. Mol. Biol., 233, 1993, pp. 123-138.

[28] L. Holm and J. Park, “DaliLite workbench for protein structure comparison”, Bioinformatics, 16, 2000, pp. 566-567.
