INTRODUCTION - Data Fusion and Mountain Clustering Applied to Improving Protein Structure Prediction and Analysis

1.1 Statements of the Problems

The prediction and classification of protein structures play a very important role in bioinformatics, since the three-dimensional (3-D) structure of a protein is essential for understanding and studying its function. Researchers have devoted considerable effort to finding the relations between protein sequences and their 3-D structures, but this remains a difficult and unsolved problem. Several well-known classification databases, such as the Structural Classification of Proteins (SCOP) [1], Class, Architecture, Topology, and Homologous superfamily (CATH) [2], the DIAL-derived domain database (DDBASE), and Pfam [3], provide context and analysis for known structures. However, the number of known 3-D protein structures is far smaller than the number of determined protein sequences. Thus, effective methods are still needed to infer the structure of a protein from its primary sequence.

Determining the 3-D structure of a protein by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy is both time consuming and expensive, so alternative computational approaches are being pursued. Computational structure prediction methods can provide valuable information for the large fraction of sequences whose structures will not be determined experimentally. The first class of protein structure prediction methods, including threading or fold recognition and comparative modeling, relies on detectable similarity between most of the modeled sequence and at least one known structure. The second class, de novo or ab initio methods, predicts the structure from sequence alone, without relying on similarity at the fold level between the modeled sequence and any known structure [4].

Among the former methods, fold recognition methods are widely used and effective because it is believed that there are a strictly limited number of different protein folds in nature,

chemistry of polypeptide chains. There is, therefore, a good chance that a protein with a fold similar to that of the target protein has already been studied by X-ray crystallography or NMR spectroscopy and can be found in the Protein Data Bank (PDB). The basic idea is that the target sequence (the protein sequence whose structure is being predicted) is threaded through the backbone structures of a collection of template proteins. Fold recognition methods can use profile information derived from properties of the amino acid sequences and of the structures in the fold library, and can even take into account local secondary structure (e.g., whether an amino acid is part of an alpha helix) or evolutionary information (how conserved an amino acid is). Previous research [5], [6] has shown that an accuracy of 70-80% can be achieved when classifying most proteins into four classes according to their amino acid sequence information: all-alpha (α), all-beta (β), alpha/beta (α/β), and alpha+beta (α+β) [1]. Together, these four classes contain 82.5% of the folding patterns, 84.7% of the superfamilies, and 88.1% of the families in the SCOP database (release 1.65 [7]).

In [8], Ding and Dubchak proposed a taxonometric approach to protein fold classification that goes beyond the four simple classes to 27 folding patterns, with a highest overall prediction accuracy of 56.5%. Huang et al. [9] defined extra features and improved the prediction accuracy by 9 percentage points, to 65.5%. In this dissertation, we use a combinatorial fusion technique to facilitate feature selection and combination, and obtain improved prediction accuracies of 87% for the four classes and 69.6% for the 27 folding categories.
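The idea of combining several feature sets can be illustrated with a minimal sketch. It assumes that each feature set (or classifier) outputs one score per fold class, and it fuses them either by averaging min-max normalized scores or by averaging ranks. The function names, the normalization, and the equal weighting are illustrative assumptions, not the procedure used in this dissertation, which is described in Chapter 3.

```python
from typing import Dict, List

Scores = Dict[str, float]  # class label -> score from one feature set (hypothetical format)


def normalize(scores: Scores) -> Scores:
    """Min-max scale scores to [0, 1]; constant scores all map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {label: 0.0 for label in scores}
    return {label: (value - lo) / (hi - lo) for label, value in scores.items()}


def ranks(scores: Scores) -> Dict[str, int]:
    """Rank 1 is the highest score; ties are broken by label name."""
    ordered = sorted(scores, key=lambda label: (-scores[label], label))
    return {label: position + 1 for position, label in enumerate(ordered)}


def score_combination(systems: List[Scores]) -> str:
    """Return the label with the highest mean normalized score."""
    normed = [normalize(s) for s in systems]
    mean = {c: sum(n[c] for n in normed) / len(normed) for c in systems[0]}
    return max(sorted(mean), key=mean.get)


def rank_combination(systems: List[Scores]) -> str:
    """Return the label with the lowest (best) mean rank."""
    ranked = [ranks(s) for s in systems]
    mean = {c: sum(r[c] for r in ranked) / len(ranked) for c in systems[0]}
    return min(sorted(mean), key=mean.get)
```

Both functions take a list of per-class score dictionaries, one per feature set, and return a single predicted class. Score and rank combination can disagree, which is one reason to compare them when selecting feature sets to combine.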

For ab initio structure prediction, predicting the three-dimensional structure of a protein from its amino acid sequence alone is a long-standing challenge in computational molecular biology. Because the search space is enormous even for proteins of moderate sequence length, modeling a protein structure de novo without templates is quite difficult. To allow rapid and efficient searching of conformational space, often only a subset of the atoms in the protein chain is represented explicitly. Recently, methods based on the assembly of short fragments have shown great promise [10]. Among these, 3-D building block approaches use a set of proteins with known 3-D structures to first construct libraries of building blocks, or short structural motifs: structures that appear frequently and exhibit some sequence-to-structure relation. These building blocks are then used to construct or analyze the structures of new proteins. The short structural fragments that recur across different protein families can often be viewed as stand-alone units that fold independently, and can hence help in assigning building blocks to unknown proteins for the reconstruction of their 3-D structures [11].

Extraction of good representative building blocks is the key to the success of such approaches.

Unger et al. [12] proposed a two-stage clustering algorithm that chooses hexamers (fragments of length 6) with a large number of neighbors as cluster centers, and hence as building blocks. Micheletti et al. used a similar approach, selecting cluster centers by the largest number of nearby points within a similarity cutoff (the “proximity score”) [13].
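Selecting centers by counting neighbors can be sketched as a simple greedy loop. This is only an illustration of the general idea, not the exact two-stage procedure of [12] or the method of [13]. The function name, the generic `dist` callable, and the removal of a center's neighbors after each pick are assumptions made here. For protein fragments, `dist` would typically be a structural dissimilarity between two fragments.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def neighbor_count_centers(
    items: Sequence[T],
    dist: Callable[[T, T], float],
    cutoff: float,
) -> List[int]:
    """Greedily pick indices of items with the most neighbors within `cutoff`.

    After each pick, the chosen item and its neighbors are removed, and the
    neighbor counts are recomputed on what remains.
    """
    remaining = list(range(len(items)))
    centers: List[int] = []
    while remaining:
        # Count, for every remaining item, how many other remaining items are close.
        counts = {
            i: sum(
                1
                for j in remaining
                if j != i and dist(items[i], items[j]) <= cutoff
            )
            for i in remaining
        }
        # Highest count wins; ties go to the smallest index.
        best = max(remaining, key=lambda i: (counts[i], -i))
        centers.append(best)
        # Drop the new center and everything within the cutoff of it.
        remaining = [
            j
            for j in remaining
            if j != best and dist(items[best], items[j]) > cutoff
        ]
    return centers
```

Every item ends up either as a center or within the cutoff of one, so isolated items become singleton centers. A practical variant would usually stop early or discard centers with too few neighbors.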

Kolodny et al., on the other hand, used a simulated annealing k-means method to extract clusters with the minimal total variance score [14]. In this dissertation, a modified form of the mountain clustering / subtractive clustering method [15]-[16] is proposed for finding building blocks. Results of a preliminary investigation are reported in [17]. The modified mountain clustering method is computationally expensive when the training data set is large. To reduce this burden, we also propose an incremental version of the structural mountain clustering method. Our experiments on benchmark datasets show that the proposed algorithms find more representative building blocks than the method in [12], which selects cluster centers by counting neighbors. We also propose two alternative ways of displaying the quality of the building blocks.
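For orientation, the textbook form of subtractive clustering on points in Euclidean space can be sketched as follows. This is not the modified structural variant proposed in this dissertation. The parameter names (`r_a`, `r_b`, `stop_ratio`), their default values, and the Euclidean distance are assumptions for illustration. Each point receives a potential that sums Gaussian contributions from all points, the highest-potential point becomes a center, and its influence is subtracted before the next center is chosen.

```python
import math
from typing import List, Optional, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def subtractive_clustering(
    points: Sequence[Point],
    r_a: float = 1.0,
    r_b: Optional[float] = None,
    stop_ratio: float = 0.15,
) -> List[int]:
    """Return indices of cluster centers found by subtractive clustering."""
    n = len(points)
    if n == 0:
        return []
    if r_b is None:
        r_b = 1.5 * r_a  # common heuristic: wider radius for the subtraction step
    alpha = 4.0 / r_a ** 2
    beta = 4.0 / r_b ** 2

    # Pairwise squared distances; this O(n^2) table is the main cost.
    d2 = [[sq_dist(p, q) for q in points] for p in points]

    # Initial potential: density estimate at every data point.
    potential = [
        sum(math.exp(-alpha * d2[i][j]) for j in range(n)) for i in range(n)
    ]

    centers: List[int] = []
    first_peak = None
    while True:
        k = max(range(n), key=potential.__getitem__)
        peak = potential[k]
        if first_peak is None:
            first_peak = peak
        elif peak < stop_ratio * first_peak:
            break  # remaining peaks are too low to count as clusters
        centers.append(k)
        # Remove the influence of the new center from all potentials.
        for i in range(n):
            potential[i] -= peak * math.exp(-beta * d2[i][k])
    return centers
```

The pairwise distance table makes the quadratic cost explicit, which is consistent with the remark above that the method becomes expensive for large training sets.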

1.2 Organization of the Dissertation

Chapter 2 reviews the background, including the Structural Classification of Proteins (SCOP), the 3-D building block approach, the features used for classification, and the datasets used in the dissertation. Chapter 3 presents the data fusion approach to protein fold prediction and its results. Chapter 4 investigates how to find useful building blocks for the construction of protein structures using a structural variant of the mountain clustering method. Chapter 5 develops the incremental version of the structural mountain clustering method. Finally, Chapter 6 presents the conclusions.
