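National Science Council, Executive Yuan: Report on Subsidized Attendance of Domestic Experts and Scholars at an International Academic Conference - Computational Studies of Biological Systems from Sequence to Structure and Function, Subproject 3: Inferring Protein Structure through RNA Structure Prediction and RNA-Protein Interaction Analysis (III)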

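Report date: September 15, 2008 (ROC year 97)

Reporter: Yuh-Jyh Hu (胡毓志)

Affiliation and title: Department of Computer Science, National Chiao Tung University; Associate Professor

Conference dates and venue: 07/14/2008-07/17/2008, Las Vegas, U.S.A.

NSC grant approval number: NSC 96-2221-E-009-042-

Conference name: Biocomp 2008 (Biocomp International Conference on Bioinformatics and Computational Biology)

Paper presented:

1. Using Protein Structural Alphabet to Characterize Local Structure Features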

Attachment 1

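1. Conference Attendance

I registered on 07/13 and attended the opening address the following day. From 07/14 to 07/17 I attended paper presentations by participating scholars and discussed related research topics with a number of scholars from abroad. The conference included quite a few papers by scholars from mainland China, which should provide a healthy stimulus to the development of bioinformatics in our country and offer considerable help and new directions.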

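2. Reflections on the Conference

According to some U.S. researchers on the program, rising economic pressure has led the U.S. NIH to shift its main research focus to translational research, in the hope that laboratory results can be applied in practice to human medicine. The conference drew many participants from many countries, and the research areas covered applications in computer science, medicine, biology, and other fields. Through the discussions and paper presentations I gained valuable experience that suggests new directions for future research. I also became acquainted with colleagues from other countries, and our discussions gave me an understanding of how the field has developed elsewhere. From what was learned at this conference, we can identify the priorities of research abroad and use them as a basis for our country's development in biotechnology.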

3. Study Visits and Tours (omit if none): None

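4. Suggestions

Biotechnology is currently an important emerging industry for research and development in our country. I respectfully ask the NSC and related agencies to give more support and encouragement to domestic scholars to take part in international conferences of this kind, which both raises our country's visibility in the relevant international fields and provides opportunities for mutual learning. In addition, I suggest that the NSC take the lead in bringing together universities and private enterprises across the country to support an international conference on bioinformatics and related technologies, with scholars from home and abroad invited to participate. This would be the most direct and effective way to raise our country's standing in biotechnology development.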

5. Materials Brought Back (Titles and Contents): The Proceedings of Biocomp2008

Using Protein Structural Alphabet to Characterize Local Structure Features

Shih-Yen Ku¹² and Yuh-Jyh Hu¹³

¹Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan

²MIB Program at Institute of Statistical Science, Academia Sinica, Taipei, Taiwan

³Institute of Biomedical Engineering, National Chiao Tung University, Hsinchu, Taiwan

Abstract - As the number of available 3D protein structures increases rapidly, a wider variety of studies can be conducted more efficiently, among which is the design of protein structural alphabets. With a structural alphabet, not only can we describe the global folding structure of a protein as a 1D sequence, but we can also characterize local structures in proteins. Previously, we applied a combinatorial approach to protein structural alphabet design, and we verified the usefulness of our structural alphabet by demonstrating competitive accuracy in protein alignment compared with other alphabets. Here we take a further step by applying motif-finding tools to our alphabet with the aim of characterizing local features of protein structure. Two structure domains, TIM and EGF, were used to evaluate the performance of our structural alphabet. Our method successfully recovered their sub-domains as common motifs in our structural alphabet.

Keywords: protein structure, structural alphabet, motifs

Introduction

Nearly all proteins have some degree of structural similarity to other proteins, and structurally similar proteins probably share a common ancestor in evolution.

Based on evolutionary relationships and the principles governing the 3D structures, a protein structure hierarchy, SCOP, was constructed mainly by visual inspection with the assistance of various automatic tools to compare protein structures. The original aim of SCOP was to serve as a tool for understanding protein evolution through the relationships between sequences and structures [1].

The conservation of local active sites may reflect biological meaning, and their structural patterns can be used to predict protein functions [2], e.g., the binding sites of metal-binding proteins [3]. Conserved local structural features can be identified in various ways and described in different representations. For example, some have investigated the relationships between local sequences and structures by identifying common structural motifs first and then characterizing amino acid preferences [4-6]. Others have instead adopted the inverse approach, examining the structural correlates of recurring sequence patterns to obtain sequence-structure motifs [7,8].

Unlike the works above on correlations between protein local structures and sequence patterns, we first convert protein 3D structures into 1D structural alphabet letters, and then identify and represent conserved local features as 1D structural alphabet sequence motifs. Moreover, our goal is to mine protein families for conserved local characteristics rather than to predict the 3D structures of novel proteins, as in the studies mentioned above. A 1D structural alphabet has several advantages over a 3D coordinate representation. First, a 1D representation of protein structures is more efficient to compare and more economical to store. Second, many previously designed and widely used 1D sequence alignment tools can be applied directly to protein structures as well as sequences. Third, conserved local structural features can be described as 1D sequence motifs and identified by various well-developed sequence motif-finding tools. Fourth, this type of 1D-based approach can serve as a pre-processor to filter out remotely related or irrelevant proteins before we apply other, more accurate but more computationally intensive, structure analysis tools.
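As a rough illustration of the 3D-to-1D conversion described above, the following sketch assigns each residue to the nearest of a few prototype backbone conformations and emits one letter per residue. The prototype table, the choice of (phi, psi) dihedral angles as the descriptor, and the names `PROTOTYPES`, `angular_diff` and `encode` are all assumptions made for illustration; they are not the alphabet or the encoding procedure of this paper.

```python
import math

# Hypothetical prototypes: letter -> representative (phi, psi) in degrees.
# Illustrative values only; this is NOT the alphabet designed in the paper.
PROTOTYPES = {
    "H": (-60.0, -45.0),   # helix-like
    "E": (-120.0, 130.0),  # strand-like
    "L": (60.0, 45.0),     # turn-like
}

def angular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def encode(dihedrals):
    """Map each (phi, psi) pair to the letter of its nearest prototype."""
    letters = []
    for phi, psi in dihedrals:
        best = min(
            PROTOTYPES,
            key=lambda k: math.hypot(
                angular_diff(phi, PROTOTYPES[k][0]),
                angular_diff(psi, PROTOTYPES[k][1]),
            ),
        )
        letters.append(best)
    return "".join(letters)

# Made-up backbone angles for four residues.
backbone = [(-62.0, -41.0), (-58.0, -47.0), (-119.0, 125.0), (55.0, 40.0)]
print(encode(backbone))  # HHEL
```

Once a structure is reduced to a string such as `HHEL`, it can be stored, compared and searched with ordinary sequence tools, which is the practical benefit the four points above rely on.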

Previous analysis of protein structures has shown the importance of repetitive secondary structures, in particular the α-helix and the β-sheet. Together with variable coils, they constitute a basic standard 3-letter structural alphabet. In spite of the increase in predictive accuracy, the approximation of 3D structures with only a 3-letter alphabet is apparently too crude for more refined 3D reconstruction [9-13]. Various more complex structural alphabets have been developed by taking into account the heterogeneity of protein backbone structures through sets of small protein fragments frequently observed in different protein structure databases [14-21]. Unlike most other works, we developed a multi-strategy method for structural alphabet design, which combined self-organizing maps, the minimum spanning tree algorithm and the k-means algorithm [22]. The performance of our alphabet was demonstrated by its competitive accuracy in all-alpha protein search within SCOP using the standard 1D sequence alignment tool FASTA [23].
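To make the clustering idea concrete, here is a minimal sketch of plain k-means, one of the three components named above, applied to made-up 2D fragment descriptors. It does not reproduce the combined pipeline of [22]: the self-organizing map and minimum spanning tree stages are omitted, and the descriptors, the initial centroids and the function name `kmeans` are assumptions for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Made-up 2D descriptors for six backbone fragments (hypothetical data).
fragments = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(fragments, [(0.0, 0.0), (10.0, 10.0)])

# Each cluster index becomes one alphabet letter.
letters = "".join("AB"[label] for label in labels)
print(letters)  # AAABBB
```

In this toy setting each cluster center plays the role of one letter's prototype fragment, so the number of clusters fixes the alphabet size.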

In this paper, we introduce an improved version of our alphabet design pipeline, to which we have added a substitution matrix self-trainer. The substitution matrix used in aligning proteins represented by structural alphabets affects the accuracy of the alignment. In our earlier work, we applied the identity matrix in the alignment [22]. Although the preliminary results demonstrated the feasibility of our alphabet, a more appropriate matrix should further improve its applicability. The substitution matrix is a crucial factor in the successful application of 1D sequence alignment tools to the search for similar 3D structures.
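The effect of the substitution matrix can be seen in a small sketch that scores the same pair of structural-alphabet strings under two scoring schemes. This uses a textbook Needleman-Wunsch global alignment rather than FASTA, and the `trained` scores, the set `SIMILAR` and the gap penalty are hypothetical values chosen for illustration, not output of the matrix trainer described in this paper.

```python
def global_alignment_score(a, b, score, gap=-2):
    """Needleman-Wunsch optimal global alignment score, linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two letters
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            ))
        prev = curr
    return prev[-1]

def identity(x, y):
    """Identity matrix: every mismatch is penalized equally."""
    return 1 if x == y else -1

# Hypothetical trained matrix: letters A and B stand for similar local shapes,
# so substituting one for the other is rewarded rather than penalized.
SIMILAR = {frozenset("AB")}

def trained(x, y):
    if x == y:
        return 2
    return 1 if frozenset((x, y)) in SIMILAR else -2

s1, s2 = "AABC", "ABBC"
print(global_alignment_score(s1, s2, identity))  # 2
print(global_alignment_score(s1, s2, trained))   # 7
```

Under the identity matrix the A/B substitution counts against the pair, while a matrix that knows A and B are structurally close treats it as weak evidence of similarity, which is the kind of information a trained matrix is meant to supply.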

We thus developed an automatic matrix training framework that can generate appropriate substitution matrices for new alphabets for use in standard 1D sequence alignment methods, e.g., FASTA. Based on the alphabet we constructed, we can transform proteins into 1D structural alphabet representations. To identify local features of protein structure, we applied the motif-finding tool MEME [24] to detect common motifs. We tested two protein families in SCOP, TIM and EGF. The results showed that our method successfully recovered their structure domains.
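The idea of a common motif in structural-alphabet strings can be sketched with a much simpler stand-in for MEME: listing the exact k-letter substrings shared by every member of a family. MEME itself fits a probabilistic motif model and tolerates variation between occurrences, which this sketch does not attempt; the function name `shared_kmers` and the example strings are hypothetical.

```python
def shared_kmers(sequences, k):
    """Return the k-letter substrings that occur in every sequence, sorted."""
    if not sequences:
        return []
    common = None
    for seq in sequences:
        # All substrings of length k in this sequence (empty if seq is shorter).
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        common = kmers if common is None else common & kmers
    return sorted(common)

# Made-up structural-alphabet strings for three family members.
family = ["ABBCDAAB", "CCBCDAEE", "EBCDAABB"]
print(shared_kmers(family, 4))  # ['BCDA']
print(shared_kmers(family, 3))  # ['BCD', 'CDA']
```

An exact-match search like this misses motifs whose occurrences differ by even one letter, which is one reason a statistical tool is preferable on real families.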
