Protein Structure Research: Prediction of Solvent Accessibility, Protein Secondary Structure, and Protein Fold Assignment (III)


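National Science Council, Executive Yuan: Research Project Report

Protein Structure Research: Prediction of Solvent Accessibility, Protein Secondary Structure, and Protein Fold Assignment (3/3)

Research Report (Full Version)

Project type: Individual

Project number: NSC 95-2221-E-011-150-

Project period: August 1, 2006 to July 31, 2007 (ROC years 95 to 96)

Institution: Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Principal investigator: Hahn-Ming Lee (李漢銘)

Project participants: Doctoral students (part-time assistants): Jung-Ying Wang (王榮英), 毛敬豪

Master's students (part-time assistants): 葉哲甫, 林恆生, 劉麗貞, 陳裕傑

Availability: This project report is open to public access

October 31, 2007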


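National Science Council, Executive Yuan: Report on a Subsidized Research Project

Protein Structure Research: Prediction of Solvent Accessibility, Protein Secondary Structure, and Protein Fold Assignment (Final Report)

Project type: ■ Individual project □ Integrated project

Project number: NSC 95-2221-E-011-150

Project period: August 1, 2006 to July 31, 2007

Principal investigator: Professor Hahn-Ming Lee

Project participants: Jung-Ying Wang, 毛敬豪, 葉哲甫, 林恆生, 劉麗貞, 陳裕傑

Report type (submitted according to the approved funding list): □ Condensed report ■ Full report

This report includes the following required attachments:

□ One report on an overseas business trip or study visit

□ One report on a business trip or study visit to mainland China

□ One report on attendance at an international academic conference, with one copy of each paper presented

□ One overseas research report for an international cooperative research project

Availability: Open to public access immediately, except for industry-academia cooperative projects, projects for upgrading industrial technology and cultivating talent, controlled projects, and the following cases:

□ Involves patents or other intellectual property rights; open to public access after □ one year □ two years

Institution: Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

August 15, 2007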


National Science Council, Executive Yuan: Research Project Report

Protein Structure: Prediction of Protein Solvent Accessibility, Protein Secondary Structure and Fine-Grained Protein Fold Assignment

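Project number: NSC 95-2221-E-011-150

Project period: August 1, 2006 to July 31, 2007

Principal investigator: Professor Hahn-Ming Lee, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Project participants: Jung-Ying Wang, 毛敬豪, 葉哲甫, 林恆生, 劉麗貞, 陳裕傑, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology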

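I. Abstract (Chinese version)

Life exists because proteins sustain it. Proteins not only catalyze most of the reactions in a cell but also control virtually all cellular processes, which is why proteins have been studied so extensively. The amino acid sequences of almost all naturally occurring proteins are determined by the coding sequences of genes.

The coding sequence of a gene determines the amino acid sequence of a protein, and the amino acid sequence contains the information needed to determine how the protein folds into its three-dimensional structure [1]. A detailed understanding of a protein's three-dimensional structure therefore leads to a better understanding of the function it performs in an organism.

In this project we address three important topics in protein structure research: the prediction of protein solvent accessibility, of protein secondary structure, and of protein folds. For solvent accessibility, we successfully applied multiple linear regression [3] and the SVM-Cabins method (publication 1 in the list below) and obtained very good analysis and prediction results.

Secondary structure forms when hydrogen bonds between residues twist the linear arrangement of the primary structure into regular local conformations. It is traditionally divided into three classes: α-helices, β-sheets, and coil. For secondary structure prediction we completed a systematic comparison of encoding schemes (feature extraction methods), covering their effect on prediction accuracy and their strengths and weaknesses.

Fold prediction is a difficult multi-class classification problem because many of the 27 classes to be predicted are similar to one another. In this project we developed a method, based on support vector machines, for predicting fine-grained folds in the multi-class SCOP (Structural Classification of Proteins) database.

In addition, we carried out a series of related protein studies, such as functional annotation of ESTs (expressed sequence tags) by protein classification, and a protein sequence annotation system that is based on sequence alignment and uses classifiers to extract and integrate information from bioinformatics sources. These studies give us a firmer basis for planning future research.

The results of this project have been written up as papers. There are five research papers in total, listed below: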

1. “SVM-Cabins: A Novel Method for Numerical Value Prediction of Solvent Accessibility Using Accumulation Cutoff Set and Support Vector Machine,” Jung-Ying Wang, Hahn-Ming Lee, and Shandar Ahmad, Proteins: Structure, Function, Genetics, volume 68, issue 1, pp82-91, (2007).

(SCI: 4.684)

2. “Prediction and Evolutionary Information Analysis of Proteins Solvent Accessibility Using Multiple Linear Regression” Jung-Ying Wang, Hahn-Ming Lee, and Shandar Ahmad, Proteins: Structure, Function, Genetics, volume 61, issue 3, pp481-491, (2005).

(SCI: 4.684)

3. “MAPS: An Integrated System for Protein Sequence Annotation Using Support Vector Machine” Jung-Ying Wang, Cheng-Kang Liu and Hahn-Ming Lee, Journal of the Chinese Institute of Engineers, revised. (SCI)

4. “ESTFastAnnotator-High Throughput Protein Domain Prediction on EST by Utilizing InterPro,” Chuang NY, Lee HM and Ho JM, The 9th Conference on Artificial Intelligence and Applications, 5-6 Nov. (2004).

5. “Improving SIM-based Annotation Method of Protein Sequence Using Support Vector Machine”, Jung-Ying Wang, Cheng-Kang Liu and Hahn-Ming Lee; 2006 SCIS & ISIS (Joint 3rd International Conference on Soft Computing and Intelligent Systems and 7th International Symposium on advanced Intelligent Systems), Tokyo, Japan, 20-24 September , (2006).

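Keywords: solvent accessibility, multiple linear regression, protein structure prediction, protein secondary structure, support vector machine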

ABSTRACT

Life exists because proteins sustain it. Proteins not only catalyze most of the reactions in living cells but also control virtually all cellular processes, and protein problems have therefore been widely studied. The coding sequences of genes determine the amino acid sequences of almost all naturally occurring proteins. In addition, the amino acid sequence of a protein contains the information needed to determine how the protein folds into a three-dimensional structure [1] and how stable the resulting structure is. Protein folding and stability have been a critically important area of research for years. Understanding the structure of a protein in turn allows one to understand the function it performs within an organism.

In this project we focus on predicting protein solvent accessibility, predicting protein secondary structure, and assigning proteins to fine-grained folds. We have successfully used multiple linear regression [3] and SVM-Cabins (publication 1 in the list above) to predict and analyze the real value of solvent accessibility.

Secondary structure consists of local folding regularities maintained by hydrogen bonds and is traditionally subdivided into three classes: α-helices, β-sheets, and coil. In this project we completed an analysis and comparison of different coding schemes and feature extraction methods.

Assigning a protein to one of 27 fold classes is very difficult because different classes resemble one another. In this project we present a computational method based on SVM for assigning a protein sequence to a fold class in the SCOP (Structural Classification of Proteins) database.

Furthermore, we studied related protein topics, including improving a SIM-based annotation method for protein sequences using support vector machines, and function annotation by protein cluster selection.

Key words: solvent accessibility; multiple linear regression; protein structure prediction; secondary structure prediction; support vector machine

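II. Background and Objectives

Identifying which proteins exist in the body, how they act, and how they respond to environmental stimuli is a central aim of pioneering research in the life sciences. Proteins not only catalyze most reactions in the cell but also control cellular metabolism. The main techniques for determining protein three-dimensional structure today are X-ray crystallography and nuclear magnetic resonance (NMR), but both are slow and time-consuming. The number of three-dimensional structures determined experimentally is close to 40,000 (PDB, 2006), far fewer than the number of known complete protein amino acid sequences [4][5]. Predicting a protein's three-dimensional structure from its amino acid sequence, using machine learning and other methods, is therefore a very important research problem [6][7].

Methods for predicting protein structure fall broadly into two categories [8]. The first is the ab initio approach, based on physicochemical principles [9-15]. Ab initio methods do not use structural data from known proteins. Instead they use empirical potential energy functions to compute the interactions among the constituent atoms, simulate the behavior of the protein molecule in aqueous solution at the atomic level, and obtain a plausible three-dimensional structure by energy minimization. The second category is the empirical approach [16], which uses data on known protein structures, such as the Brookhaven Protein Data Bank (PDB) [17], or protein sequence data to predict which fold an unknown protein sequence belongs to. To date, predicting the three-dimensional structure of a protein directly from its amino acid sequence has not met with clear success [18-20]. Considerable progress has been made, however, in correctly assigning a protein sequence to a known fold class [21-28].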

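One of the main driving forces of protein folding is the hydrophobicity of the core residues in the sequence (hydrophobic residues are generally buried inside the protein, whereas hydrophilic residues lie on the surface), so the solvent-accessible surface area of each residue is a key factor in the study of protein folding [29]. Prediction of solvent-accessible surface area is also useful for many other studies, for example protein design [30][31], identification and verification of sequence motifs [32], prediction of transmembrane regions [33-35], the study of antigenic determinants (because the hydrophilic regions of a protein are closely related to its epitopes) [36][37], the study of side-chain conformations [38], the study of hairpin structural stability [39], and protein sequence alignment [40][41]. Developing a systematic method for predicting the solvent-accessible surface area of each residue in a protein sequence is therefore an important issue in bioinformatics, and in recent years solvent accessibility has been a very active topic in research related to protein three-dimensional structure.

Another step toward successful prediction of protein three-dimensional structure is to predict the local conformation of the polypeptide chain, that is, to predict protein secondary structure. Secondary structure consists of regular local folding of the polypeptide chain, held together mainly by hydrogen bonds. It is generally divided into three classes: α-helices, β-sheets, and coil. Because the information in a protein sequence is closely related to its secondary structure, this is a classic problem in computational molecular biology and one that is well suited to machine learning.

This project therefore investigates these protein-related topics systematically and step by step, chiefly solvent accessibility prediction, secondary structure prediction, and fold recognition for three-dimensional structure, starting from the underlying fundamentals and building up a solid body of research.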

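III. Results and Discussion

In this project we mainly predict protein solvent accessibility, protein secondary structure, and protein folds. The results for each topic are discussed in turn below.

Prediction of protein solvent-accessible surface area:

(1) Multiple linear regression

In recent years, machine learning methods such as neural networks and support vector machines have been widely used for prediction in many kinds of bioinformatics problems. Although they achieve relatively high accuracy, their biggest drawback is that they are not transparent: they do not reveal why a particular result was obtained, and they do not support further analysis, for example of how cooperation and competition between the target residue and its neighboring residues affect the prediction.

The first method proposed in this project therefore extracts features from the protein sequence and its evolutionary information and then uses multiple linear regression to predict the real value of the solvent-accessible surface area. With this method we obtain regression coefficients and correlation coefficients between the target residue at a given position and its neighboring residues.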

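For predicting the real (experimental) value of solvent accessibility with multiple linear regression, the regression equation can be written as

$$\mathrm{ASA}(i) = \alpha + \sum_{i=1}^{13} \sum_{j=1}^{21} \beta_{i,j} X_{i,j} \qquad (1)$$

The unknown optimal regression parameters α and β_{i,j} (1 + 13×21 = 274 coefficients in total) are obtained by the method of least squares. We analyze and compare the results on three data sets: Barton502 [42], Meller860 [43], and Yuan1277 [44]. The results are shown in Table 1.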

Table 1. Results of multiple linear regression. Each data set was evaluated by five-fold cross-validation. Entries give the mean absolute error (%MAE) with the correlation coefficient in parentheses.

Data set                                Barton502     Yuan1277      Meller860
# Proteins                              502           1277          860
# Residues                              83830         256656        212316
Orthogonal coding                       18.9 (0.52)   19.1 (0.55)   18.7 (0.53)
PSSM coding                             16.6 (0.63)   17.2 (0.63)   16.8 (0.62)
PSSM + composition + sequence length    16.2 (0.64)   16.4 (0.66)   16.2 (0.65)

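With this method the mean absolute error ranges from about 18.9% with orthogonal coding down to about 16.2% with the richest feature set (Table 1), which is comparable to the best current prediction methods, such as neural networks [43].

We also used protein evolutionary information to build a correlation matrix between the target residue and several neighboring positions on either side, in order to analyze how neighboring positions affect solvent-accessible surface area. We observed that when the residue at the target position is hydrophobic, it shows a strong negative correlation with solvent-accessible surface area.

Conversely, when the residue at the target position is polar or charged, it shows a positive correlation with solvent-accessible surface area. The influence of a neighboring residue on the target residue's solvent-accessible surface area falls off sharply with its distance from the target. We also observed that longer protein sequences yield better prediction results.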
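The least-squares fit behind Equation (1) can be sketched as follows. This is a minimal illustration on synthetic data, not the project's code: reading the 13 × 21 coefficients as 13 window positions with 21 feature values each is an assumption, and the helper names (`fit_mlr`, `predict_mlr`, `mean_absolute_error`, `correlation`) are hypothetical. It uses NumPy's `numpy.linalg.lstsq` and reports the same two measures as Table 1.

```python
import numpy as np

WINDOW = 13       # assumed: window positions around the target residue
N_FEATURES = 21   # assumed: feature values per window position

def fit_mlr(X, y):
    """Least-squares fit of y = alpha + sum_ij beta[i, j] * X[i, j].

    X has shape (n_samples, WINDOW * N_FEATURES); y holds ASA values.
    """
    design = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    alpha = float(coef[0])
    beta = coef[1:].reshape(WINDOW, N_FEATURES)
    return alpha, beta

def predict_mlr(alpha, beta, X):
    """Apply the fitted regression to new feature rows."""
    return alpha + X @ beta.ravel()

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def correlation(y_true, y_pred):
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Synthetic data only; no real protein features are used here.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, WINDOW * N_FEATURES))
true_beta = rng.normal(scale=0.5, size=WINDOW * N_FEATURES)
y = 25.0 + X @ true_beta + rng.normal(scale=1.0, size=2000)

alpha, beta = fit_mlr(X, y)
pred = predict_mlr(alpha, beta, X)
print(beta.shape, 1 + beta.size)  # coefficient grid and total parameter count
print(round(mean_absolute_error(y, pred), 2), round(correlation(y, pred), 3))
```

Because the fitted `beta` keeps its (position, feature) layout, each entry can be inspected directly, which is the transparency argument made for linear regression above.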

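(2) The SVM-Cabins method

The results above, together with the literature, show that even more complex regression methods, such as neural network regression and support vector regression, achieve at best a mean absolute error (MAE) of about 15.3% to 16.0% for solvent-accessible surface area. It appears difficult to improve prediction accuracy further with regression methods of this kind.

We then surveyed the literature we had collected and found that recent work using support vector machines [45][46] for the traditional two-class prediction of solvent accessibility (exposed versus buried) reports good accuracy, in the range of about 70% to 90% depending on the chosen cut-off threshold. When machine learning methods are applied to three classes (exposed, intermediate, and buried) or more, however, accuracy drops sharply, for example to between 55% and 70% for three classes and further to between 55% and 60% for four classes. From this analysis we concluded that support vector machines perform very well on the traditional two-class solvent accessibility problem, as they do on many other bioinformatics problems. This suggested a research idea, and we turned our attention to how a good two-class SVM classifier could be used to predict the real value of the solvent-accessible surface area (ASA).

The second method proposed in this project for solvent accessibility prediction therefore uses accumulation cutoff sets and support vector machines to build a solvent accessibility prediction system based on a new concept. We call this system SVM-Cabins. The system first takes as features the position-specific scoring matrix (PSSM) derived from protein evolutionary information. It then partitions the data by accumulation cutoff sets, using one threshold per set to make a traditional two-class split, and uses a support vector machine on each cutoff set to make a two-class prediction of solvent accessibility. Finally, an algorithm maps the two-class predictions from all cutoff sets to a real value of solvent-accessible surface area. With 13 different cutoff sets, the system achieves a mean absolute error of 15.1% and a correlation coefficient of 0.66 on the Barton502 data set, as shown in Table 2, which is the best result reported to date. Because the system first makes traditional two-class predictions and then derives real values from them, it can be optimized for both kinds of prediction (two-class and real-value) at once. The underlying idea applies to any problem in which the predicted value lies within a bounded numerical range.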
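The cutoff-set idea can be illustrated with a short sketch. The report does not list the 13 thresholds or describe the algorithm that maps binary predictions back to a real value, so both are assumptions here: `CUTOFFS` is a hypothetical set of thresholds in percent relative ASA, and `decode` uses a simple assumed rule (count the "exposed" votes and return the midpoint of the corresponding bin). The binary votes would come from one SVM per cutoff trained on PSSM features, and here they are simulated by `binarize`.

```python
# Hypothetical thresholds (percent relative ASA); the report does not list them.
CUTOFFS = [5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 95]

def binarize(asa, cutoffs=CUTOFFS):
    """Two-class training labels, one per cutoff: 1 = exposed (ASA above cutoff)."""
    return [1 if asa > c else 0 for c in cutoffs]

def decode(votes, cutoffs=CUTOFFS):
    """Map the binary predictions for one residue back to a real value.

    Assumed rule: with k 'exposed' votes, the value lies in the k-th bin
    between consecutive cutoffs, and the bin midpoint is returned. Counting
    votes keeps the rule defined even when classifiers disagree.
    """
    k = sum(votes)
    edges = [0.0] + [float(c) for c in cutoffs] + [100.0]
    return (edges[k] + edges[k + 1]) / 2.0

# Stand-in for SVM outputs: perfect binary classifiers on three ASA values.
for true_asa in (0.0, 22.0, 100.0):
    votes = binarize(true_asa)
    print(true_asa, decode(votes))
```

Under this assumed rule the decoding error is bounded by half the width of the bin, so denser cutoffs give finer real-value predictions at the cost of more binary classifiers.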

Table 2. Five-fold cross-validation results on the Barton502 data set, measured by mean absolute error (MAE) and correlation coefficient.

We compared SVM-Cabins with the main methods in current use: the look-up table method [2], multiple linear regression [3], support vector regression [44], and the neural network methods RVP-Net [47] and SNNS [48]. Table 3 shows the comparison. SVM-Cabins gives the best result to date.

Table 3. Comparison of SVM-Cabins with other prediction methods

                  RS-126                    Barton-502
Method            MAE (%)   Correlation     MAE (%)   Correlation
Look-up table     *         *               19.7      0.47
RVP-Net           19.4      0.48            18.8      0.48
SVR               *         *               18.5      0.52
MLR               *         *               16.2      0.64
SNNS              *         *               15.9      0.65
SVM-Cabins        15.5      0.66            15.1      0.66

* Value not available

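Prediction of protein secondary structure:

DSSP assigns secondary structure to eight classes. In this study we convert the eight DSSP classes to three classes with a commonly used reduction rule [49] before prediction.

The rule is as follows: H (α-helix), I (π-helix), and G (3₁₀-helix) are mapped to the α-helix class; E (extended strand) is mapped to the β-sheet class; all others are mapped to the coil class. Note that using a different reduction rule changes the measured prediction accuracy.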
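The reduction rule stated above is small enough to write down directly. The sketch below is illustrative: the names `DSSP_TO_3`, `reduce_dssp`, and `q3_accuracy` are hypothetical, and the three-state labels H, E, and C are a conventional choice rather than something the report specifies.

```python
# Eight-to-three reduction as stated in the text: H, I, G -> helix (H);
# E -> sheet (E); every other DSSP symbol -> coil (C).
DSSP_TO_3 = {"H": "H", "I": "H", "G": "H", "E": "E"}

def reduce_dssp(states):
    """Convert a string of DSSP symbols to three-state labels."""
    return "".join(DSSP_TO_3.get(symbol, "C") for symbol in states)

def q3_accuracy(predicted, observed):
    """Fraction of residues whose three-state label is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

observed = reduce_dssp("HGIEBTS -")
print(observed)
print(q3_accuracy("HHHECCCCH", observed))
```

Other reduction rules in the literature also map B (isolated bridge) to the sheet class, which is one reason accuracies reported under different rules are not directly comparable.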

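For machine learning we mainly used support vector machines. The encodings tested were orthogonal coding, codon coding, coding based on various physicochemical properties of amino acids, and PSSM coding, which carries evolutionary information. Experiments showed that PSSM coding remains the most effective. With seven-fold cross-validation it gives an average accuracy of 75.72% on the RS126 data set and 75.80% on the CB513 data set. The other encodings, probably because they lack evolutionary information and therefore carry less relevant biological information, could not surpass PSSM coding. Having established that PSSM is the most effective encoding, we then tried combining PSSM with features extracted from the physicochemical properties of amino acids, but the results did not improve much. How to add biological information to existing encodings more effectively, and how to make better use of the relationship between amino acid properties and secondary structure, will therefore be the key to pushing secondary structure prediction accuracy beyond its current level. The project will build on this experience to continue research on secondary structure prediction and to find the best combination of prediction models to improve the results.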
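As an illustration of the simplest encoding mentioned above, the following sketch builds an orthogonal (one-hot) feature vector over a sliding window. The details are assumptions, since the report does not give them: a 21-dimensional vector per position (20 amino acids plus one slot for padding or unknown residues) and a half-width of 6, giving 13 positions. The function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """21-dimensional orthogonal code; slot 20 marks padding or unknown."""
    vector = [0] * 21
    index = AMINO_ACIDS.find(residue)
    vector[index if index >= 0 else 20] = 1
    return vector

def encode_window(sequence, center, half_width=6):
    """Concatenate the codes of the residues in a window around `center`.

    Positions that fall off either end of the sequence are padded.
    """
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else "-"
        features.extend(one_hot(residue))
    return features

features = encode_window("ACDEFGHIK", center=0)
print(len(features), sum(features))
```

PSSM coding keeps the same window layout but replaces each one-hot vector with the profile scores for that position, which is how evolutionary information enters the features.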

Table 2 (values by test set):

Test set        A      B      C      D      E      Total
# Proteins      101    101    100    100    100    502
MAE (%)         14.8   14.9   15.1   15.0   15.5   15.0
Correlation     0.67   0.67   0.66   0.65   0.64   0.66


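Prediction of protein folds:

For the 27 fold types in the SCOP protein structure database, this project studies multi-class SCOP classification with support vector machines as the core machine learning method. To allow comparison with existing results, we use the same data set as Ding and Dubchak [50] and perform cross-validation.

For encoding, in addition to the most commonly used feature, amino acid composition, we extract features from the physicochemical properties of amino acids: hydrophobicity, normalized van der Waals volume, polarity, polarizability, and predicted secondary structure. We also form various combinations of these features in the hope of finding useful ones. For example, the combination HVP means that hydrophobicity, van der Waals volume, and polarity together form one feature set for prediction. We do this because we found that with any single encoding, whether an individual feature or a combination, prediction accuracy over the 27 classes is low. Interestingly, however, different features and combinations are accurate on different classes and so complement one another. This encoding strategy gives us many SVM sub-classifiers, each built on one feature or one feature combination.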
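Amino acid composition, the baseline feature named above, can be computed in a few lines. This is a generic sketch rather than the project's feature extractor; the function name `composition` is hypothetical, and residues outside the 20 standard letters are simply ignored, which is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(sequence):
    """Return the 20-dimensional vector of amino acid frequencies."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

vector = composition("AACD")
print(vector[:3])
```

Each protein, whatever its length, becomes a fixed-length vector, which is what allows a single SVM to be trained on sequences of different lengths.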

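We also observed that, with as many as 27 classes, building the 27-class classifier from two-class SVMs in a one-against-one scheme introduces a large amount of noise and lowers accuracy. Experiments showed that a one-against-all scheme greatly reduces this noise and improves the results. In addition, we assign predictions conservatively: if a one-against-all classifier for some encoding predicts a test sample as the positive class, we assign the sample to that class; if it predicts the negative class, the sample receives no class from that classifier.

Figure 1. Architecture of the SVM classifiers for protein fold prediction. X, Y, and Z denote SVM classifiers built on particular feature encodings.

After this step, if a test sample has been assigned to classes by more than one SVM sub-classifier, the final prediction is chosen by voting. If a test sample has not been assigned to any class by any SVM sub-classifier (meaning that it is hard to classify with the features we use), we classify it with a one-against-one SVM classifier, which always assigns some class.

The overall architecture of the prediction method is shown in Figure 1.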
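The decision logic described above (positive one-against-all decisions are collected as votes, and a one-against-one classifier is used only when nobody votes) can be sketched as follows. The SVMs themselves are left out, and the data layout, the function names, and the tie-breaking rule (first fold to reach the top count wins) are assumptions, since the report does not specify them.

```python
from collections import Counter

def collect_votes(positive_by_encoder):
    """Flatten the positive one-against-all decisions into a list of votes.

    `positive_by_encoder` maps an encoder name (for example 'C' or 'HVP')
    to the folds whose one-against-all SVM labeled the sample positive.
    """
    return [fold for folds in positive_by_encoder.values() for fold in folds]

def assign_fold(positive_by_encoder, one_against_one):
    """Vote among positive decisions, falling back to one-against-one.

    `one_against_one` is a callable that always returns some fold label.
    """
    votes = collect_votes(positive_by_encoder)
    if votes:
        # Assumed tie-break: the fold that reached the top count first wins.
        return Counter(votes).most_common(1)[0][0]
    return one_against_one()

# Stand-in decisions for two test samples (no SVMs are trained here).
sample_a = {"C": ["fold_3"], "HVP": ["fold_7", "fold_3"], "S": []}
sample_b = {"C": [], "HVP": [], "S": []}
print(assign_fold(sample_a, lambda: "fold_1"))
print(assign_fold(sample_b, lambda: "fold_1"))
```

Keeping the fallback behind a callable means the one-against-one classifier is evaluated only for the samples that no one-against-all classifier claims.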

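IV. Self-Evaluation of Results

In line with the project plan and its year-by-year execution, the following concrete results have been achieved:

(1) We used multiple linear regression to predict the real value of protein solvent-accessible surface area. The method yields regression coefficients and correlation coefficients between the target residue at a given position and its neighbors. We further built a correlation matrix between the target residue and several neighboring positions on either side to analyze how neighboring positions affect solvent-accessible surface area.

(2) We proposed SVM-Cabins, a system based on a new concept whose underlying idea applies to any problem in which the predicted value lies within a bounded numerical range, and used it to develop a solvent accessibility prediction method. Because the system first makes traditional two-class predictions and then derives real values from them, it can be optimized for both kinds of solvent accessibility prediction (two-class and real-value) at once.

(3) For secondary structure, we completed a comparison of encoding schemes (feature extraction methods), covering their effect on prediction accuracy and their strengths and weaknesses. These results will guide the choice of encodings in the hybrid predictor architecture we plan to apply next, with the aim of improving secondary structure prediction accuracy.

(4) We studied the prediction of fine-grained folds in SCOP (Structural Classification of Proteins), currently the most widely used protein classification database in biology. We found that, among the features commonly used in bioinformatics, only amino acid composition gives relatively good results, and even then the overall multi-class accuracy is unsatisfactory. The other features perform worse, but some features or combinations work well on a few of the classes. We exploited this property to develop a fold prediction method.

(5) The project has produced five papers on protein research, two of them in Proteins (SCI: 4.684), an important journal in the field, which shows that the project's results are substantial and solid.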

References

1. Moult J, Pedersen J, Judson R, Fidelis K.

A large-scale experiment to assess protein structure prediction methods. Proteins 1995; 23: ii--v.

2. Wang J -Y, Ahmad S, Gromiha MM and Sarai A. Look-up tables for protein solvent accessibility prediction and nearest neighbor effect analysis.

Biopolymers 2004; 75: 209-216.

3. Wang J-Y, Lee H-M, Ahmad S. Prediction and evolutionary information analysis of proteins solvent accessibility using multiple linear regression. Proteins 2005; 61: 481-491.

4. Moult J, Fidelis K, Zemla A, Hubbard T.

Critical assessment of methods of protein structure prediction (CASP)-Round V.

Proteins 2003; 53: 334-339.

5. John B, Sali A. Detection of homologous proteins by an intermediate sequence search. Protein Sci 2004; 13: 54–62.

6. Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991; 9(1): 56-68.

7. Ginalski K, Rychlewski L. Protein structure prediction of CASP5 comparative modeling and fold recognition targets using consensus alignment approach and 3D assessment.

Proteins 2003; 53 (Suppl 6): 410-417.

8. D. Baker, A. Sali. Protein structure prediction and structural genomics.

Science 2001; 294(5540):93-96.

9. Kihara D, Zhang Y, Lu H, Kolinski A, Skolnick J. Ab initio protein structure prediction on a genomic scale:

Application to the Mycoplasma genitalium genome. Proc Natl Acad SciUSA 2002;99: 5993–5998.

10. Xia Y, Levitt M, Huang ES, Samudrala R.

Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol 2000;300:171–185.

11. Huang ES, Samudrala R, Ponder JW. Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J Mol Biol 1999;290:267–281.

12. Huang ES, Samudrala R, Ponder JW. Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J Mol Biol 1999;290:267–281.

13. Zhang CT, Hou J, Kim SH. Fold prediction of helical proteins using torsion angle dynamics and predicted restraints. Proc Natl Acad SciUSA 2002;99: 3581– 3585.

14. Srinivasan R, Rose GD. Ab initio prediction of protein structure using LINUS. Proteins 2002;47: 489–495.

15. Simons KT, Strauss C, Baker D.

Prospects for ab initio protein structural genomics. J Mol Biol 2001;306:1191–1199.

16. Baker D, Sali A. Protein structure

prediction and structural genomics.

Science 2001;294:93–96.

17. Berman H, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P. The protein data bank. Nucleic Acids Res. 2000; 28: 235.

18. Rost B. Rising accuracy of protein secondary structure prediction in: 'Protein structure determination, analysis, and modeling for drug discovery' (eds. D Chasman), 2003.

19. Hvidsten TR, Kryshtafovych A, Komorowski J, Fidelis K. A novel approach to fold recognition using sequence- derived properties from sets of structurally similar local fragments of proteins. Bioinformatics 2003;2, ii81-ii89.

20. Rost B, Sander C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 1994; 19, 55.

21. Blundell TL, Sibanda BL, Sternberg MJ, Thornton JM. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 1987;326: 347 –352.

22. Russell A, Torda AE. Protein sequence threading: Averaging over structures.

Proteins 2002;47: 496 –505.

23. Kolinski A, Betancourt MR, Kihara D, Rotkiewicz P, Skolnick J. Generalized comparative modeling (GENECOMP): A combination of sequence comparison, threading, and lattice modeling for protein structure prediction and refinement. Proteins 2001;20: 133–149.

24. Dubchak I, Muchnik I, Holbrook SR, Kim S-H. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci USA 1995;92:8700–8704.

25. Dubchak I, Muchnik I, Mayor C, Dralyuk I, Kim S-H. Recognition of a protein fold

in the context of the structural classification of proteins (SCOP).

Proteins 1999;35:401– 407.

26. Chou KC, Liu WM, Maggiora GM, Zhang CT. Prediction and classification of domain structural classes. Proteins 1998;31:97–103.

27. Chou KC, Zhang CT. Prediction of protein structural classes. Crit Rev Biochem Mol Biol 1995; 30:275–349.

28. Dubchak I, Holbrook SR, Kim S-H.

Prediction of protein folding class from amino acid composition. Proteins 1993;16:79 –91

29. Chan HS, Dill KA. Origins of structures in globular proteins. Proc Natl Acad Sci USA 1990; 87: 6388-6392.

30. Baumann G, Frömmel C, Sander C.

Polarity as a criterion in protein design.

Protein Eng. 1989; 2: 329-334.

31. Sippl MJ. Recognition of errors in three-dimensional structures of proteins.

Proteins 1993; 17: 355-362.

32. Cornette JL, Cease KB, Margalit H, Spouge JL, Berzofsky JA, DeLisi C.

Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol 1987;

195: 659-685.

33. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982; 157:

105-132.

34. Eisenberg D, Schwarz E, Komaromy M, Wall R. Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J Mol Biol 1984; 179: 125-142.

35. Engelman DM, Steitz TA, Goldman A.

Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Chem 1986;

15: 321-353.

36. Both GW, Sleigh MJ. Complete nucleotide sequence of the haemagglutinin gene from a human influenza virus of the Hong Kong subtype. Nucleic Acids Res 1980; 8:

2561-2575.

37. Hopp TP, Woods KR. Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci USA 1981; 78: 3824-3828.

38. Eyal E, Najmanovich R, McConkey BJ, Edelman M, Sobolev V. Importance of solvent accessibility and contact surfaces in modeling side-chain conformations in proteins. J Comput Chem 2004; 25(5):

712-724.

39. Russell SJ, Blandl T, Skelton NJ, Cochran AG. Stability of cyclic beta -hairpins: asymmetric contributions from side chains of a hydrogen-bonded cross-strand residue pair. J Am Chem Soc 2003; 125: 388-395.

40. Gaboriaud C, Bissery V, Benchetrit T, Mornon JP. Hydrophobic cluster analysis: an efficient new way to compare and analyse amino acid sequences. FEBS Lett 1987; 224: 149-55.

41. Lemesle-Varloot L, Henrissat B, Gaboriaud C, Bissery V, Morgat A, Mornon JP. Hydrophobic cluster analysis:

procedures to derive structural and functional information from 2-D-representation of protein sequences.

Biochimie 1990; 72: 555-74.

42. Cuff JA, Barton GJ. Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 2000; 40: 502-511.

43. Adamczak R, Porollo A, Meller J.

Accurate prediction of solvent accessibility using neural networks-based regression. Proteins 2004; 56(4): 753-67.

44. Yuan Z, Huang B. Prediction of protein accessible surface areas by support vector regression. Proteins 2004; 57: 558-564.

45. Kim H, and Park H. Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins 2004; 54(3):557–562.

46. Yuan Z, Burrage K, Mattick J. Prediction of protein solvent accessibility using support vector machines. Proteins 2002;

48:566-570.

47. Ahmad S, Gromiha MM, Sarai A:

Real-value prediction of solvent accessibility from amino acid sequence.

Proteins 2003; 50: 629-635.

48. Garg A, Kaur H, Raghava GP. Real value prediction of solvent accessibility in proteins using multiple sequence alignment and secondary structure.

Proteins 2005; 61: 318-324.

49. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993; 232: 584–599.

50. Ding CH, Dubchak I. Multi-class protein fold recognition using support vector machines and neural networks.

Bioinformatics 2001;17:349 –358.
