SUMMARY AND DISCUSSION - PROTEIN FOLD PREDICTION BY DATA FUSION APPROACH

3. PROTEIN FOLD PREDICTION BY DATA FUSION APPROACH

3.5. SUMMARY AND DISCUSSION

Methods of combining multiple classification systems or multiple scoring systems have been used in a variety of application domains, including information retrieval, pattern recognition, microarray gene expression analysis, and molecular similarity searching [24], [28], [32]-[39].

More recently, criteria for selecting the classification systems or scoring systems to combine, and for deciding how to combine them, have been discussed and studied [25]-[28], [39]. It has been demonstrated in Combinatorial Fusion Analysis (see [27] and its references) that (a) the combination of multiple systems (or features) improves performance only if (1) each of the individual systems (features or functions) has relatively high performance, and (2) the individual systems are distinctive (or different), and (b) combination by rank outperforms combination by score under certain conditions.

In this study, we use criterion (a)(1) to select features and then apply criterion (a)(2) by computing the diversity rank/score graph in order to select the pair of features with the highest diversity. Criterion (b) is then used to combine these two features using ranks. We have applied the concept of Combinatorial Fusion to improve accuracy in protein structure prediction. In particular, we have improved the overall predictive accuracy rate to 87% for the secondary structure (the four classes) and to 69.6% for the folding patterns (the 27 folding categories).
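To make criterion (b) concrete, the following is a minimal sketch of score combination versus rank combination for two scoring systems A and B that each score the same list of candidates (for example, candidate folds for one protein). The function names, the min-max normalization, and the simple averaging are assumptions made for illustration; they are not taken from the cited CFA papers.

```python
def ranks_from_scores(scores):
    """Rank of each item (1 = highest score); ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks


def normalize(scores):
    """Min-max normalize scores to [0, 1] (an assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def score_combination(a, b):
    """Average of the normalized scores; a higher value is better."""
    return [(x + y) / 2 for x, y in zip(normalize(a), normalize(b))]


def rank_combination(a, b):
    """Average of the ranks; a lower value is better."""
    ra, rb = ranks_from_scores(a), ranks_from_scores(b)
    return [(x + y) / 2 for x, y in zip(ra, rb)]


# Hypothetical scores of four candidates under two features.
scores_a = [0.9, 0.8, 0.1, 0.2]
scores_b = [0.2, 0.7, 0.6, 0.1]
by_rank = rank_combination(scores_a, scores_b)
by_score = score_combination(scores_a, scores_b)
best_by_rank = min(range(len(by_rank)), key=lambda i: by_rank[i])
best_by_score = max(range(len(by_score)), key=lambda i: by_score[i])
```

In this toy input both combinations select the same candidate; the two methods can disagree when the features have very different score distributions, which is the situation where the choice between rank and score combination matters.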

We improve on previous results by Huang et al. [9] (65.5% for folding structure) and Ding and Dubchak [8] (56.5% for folding structure) by incorporating the method of combinatorial fusion into their approach, which uses a neural network (NN) with a radial basis function network (RBFN) in a hierarchical learning architecture (HLA).

One of the novelties of our current work is the notion of a diversity rank/score function d_i(A,B) between a pair of features A and B (see, e.g., Figures 3 and 4). This function characterizes the diversity of ranking (or scoring) behavior between features A and B across the whole spectrum of protein sequences under consideration. This parameter is then used to select appropriate and diverse features for combination. The current work is the first of a series of ongoing projects on the protein structure prediction problem using HLA, NN-RBFN, and Combinatorial Fusion Analysis. From the current work, we have observed the following:

(A) The method of combinatorial fusion we used in this study is computationally efficient, able to adapt to different situations and approaches, and scalable to a large number of classes (or folding patterns) and a large number of proteins.

(B) In this study, we considered only the combination of a pair of features in order to improve the performance. It may be possible to achieve better results with combinations of three or more features, but the combined features should be different. As such, the diversity between three or more features should be defined. This will be studied in a later work.

(C) Although it has been shown (e.g. [41]) that combining multiple predictors or servers improves fold recognition, we note here that combining all the features or multiple scoring systems together may not guarantee optimal performance (see also [26] and [27]).

(D) We used rank combination because of criterion (b), which was demonstrated analytically and by simulation in Hsu and Taksa [25] to be better under certain conditions. We observed that score combination does have its merit when the two features combined are similar and homogeneous with respect to their score functions, rank functions, or rank/score functions. We decided to use rank combination because the pair of features to be combined satisfies criteria (a)(1) and (a)(2), and these two items are precisely the conditions stipulated in [25], [26], [28], [37] and [38].

(E) In our feature selection process, we selected the top four performers out of the original eight features. The ideal case is to select features that perform much better than the others, that is, there should be a large difference in performance between the features selected and those not selected.
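The diversity-based selection of a feature pair described above can be sketched as follows. The exact definition of d_i(A,B) is not reproduced in this section, so the code assumes one plausible form: the root-mean-square difference between the two rank/score functions, where the rank/score function of a feature maps rank k to the normalized score held at that rank. All names and the normalization are hypothetical.

```python
import math


def rank_score_function(scores):
    """Map rank k (1 = best) to the min-max normalized score found at that rank."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero for constant scores
    return sorted(((s - lo) / span for s in scores), reverse=True)


def diversity(scores_a, scores_b):
    """Assumed diversity: RMS difference between the two rank/score functions."""
    fa = rank_score_function(scores_a)
    fb = rank_score_function(scores_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa))


def most_diverse_pair(features):
    """features: dict mapping a feature name to its list of scores.

    Returns the pair of names with the largest diversity value.
    """
    names = sorted(features)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return max(pairs, key=lambda p: diversity(features[p[0]], features[p[1]]))


# Hypothetical scores of four candidates under three features.
features = {
    "A": [4, 3, 2, 1],
    "B": [8, 6, 4, 2],   # same scoring behavior as A after normalization
    "C": [10, 1, 1, 0],  # one dominant score, so a different behavior
}
pair = most_diverse_pair(features)
```

Under this assumed definition, A and B have zero diversity because their normalized rank/score functions coincide, so the selected pair always includes C.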

Our current work represents the first of a series of investigations on the protein structure prediction problem using HLA and Combinatorial Fusion. It has generated several issues and topics worthy of further study. We summarize some of them here:

(1) Our diversity rank/score function d_i(A,B) for the feature pair (A,B) with respect to protein p_i is defined using the variation of the rank/score functions between A and B. As indicated in [25], [27] and [28], variation of the rank functions or the score functions between A and B can also be used to define the diversity score function. We will explore these two other options in a later work.

(2) The effectiveness of our fusion of multiple features is limited by the set of eight originally chosen features. It might be worthwhile to study the content of the original set of features. For example, we would like to explore the diversity among the original features, such as local vs. global, physical vs. chemical, and bi-gram vs. tri-gram schemes.

(3) Related to observation (D) above, one might ask whether it is better to expand the scope and the number of features. In this study, we started with eight features, of which four were selected using the CFA criteria. In a separate paper [42], eleven features were collected and three features were selected according to criteria (a)(1) and (a)(2) in Remark 1. In Lin et al. [42], we obtained a slightly better overall accuracy rate of 87.8% for the four classes and 70.9% for the 27 folding categories.

(4) Our results improve on previous results by Huang et al. [9] and Ding and Dubchak [8], which used a neural network with radial basis functions in a hierarchical learning architecture. Other work has improved on those results using other machine learning techniques such as kernel methods, SVM, and genetic algorithms. For example, Yu et al. [43] obtained a good accuracy rate using SVM with n-peptide coding schemes and jury voting. Future work can be performed to improve these results using our combinatorial fusion approach.

4. Finding 3-D building blocks of protein structures by Mountain Clustering Approach

4.1 Introduction

Discovering the relations between protein sequences and their 3-D structures is an important research topic and has received a lot of attention, because knowing the 3-D structure of a protein helps biologists to study the functions of proteins, perform rational drug design, and design novel proteins. Finding the 3-D structure of a protein using X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy is time consuming and expensive, and hence alternative approaches are being tried. Several approaches such as comparative modeling, fold recognition [9], [44], ab-initio prediction [45]-[46], and the 3-D building blocks approach [12]-[13] have been proposed. As pointed out by Bujnicki [10], modeling a protein structure de novo without using templates is quite difficult because the search space is enormous even for proteins with moderate sequence lengths. Methods based on the assembly of short fragments have shown great promise [10]-[14]. Among these methods, 3-D building blocks approaches have been successfully applied to construct libraries of well-chosen short structural motifs extracted from known structures [13]-[14], [47]-[54]. These building blocks are then used to construct or analyze structures of new proteins.

The short structural fragments that recur across different protein families can often be viewed as stand-alone units which fold independently, and hence can help in the assignment of building blocks to unknown proteins for reconstruction of the 3-D structure [11]. The clustering method used in [11] is a two-stage process, where building blocks are classified according to their SCOP protein family and clustered within the family in the first stage, and then merged in the second stage. The building block cutting algorithm uses a stability score function that involves properties like compactness, hydrophobicity, and isolatedness. The critical building block finding algorithm uses a score function based on the contacts the building block has with other building blocks. This is an involved and interesting approach. Our proposed approach is comparatively very simple and does not use those physical, chemical, or structural properties of the residues.

In [52], Anishetty et al. suggested that rigid tri-peptides have no correlation with a protein's secondary structure and that tri-peptide data may be used to predict plausible structures for oligopeptides. The hybrid protein model of de Brevern et al. learns 3-D protein fragments encoded into a structural alphabet consisting of 16 protein blocks (PBs) [54]-[55]. Benros et al. [53] continued this study by considering 11-residue fragments encoded as a series of seven protein blocks. They built a library of 120 overlapping prototypes with good local approximation of 3-D structures. Every protein block in [54] is only five residues long and is described by eight dihedral angles. Each of them serves as a building block approximately representing a known structural motif such as central α-helices, central β-strands, β-strand N-caps, and so on. Consequently, a protein's 3-D structure can be represented by a string of letters from this alphabet. Unlike in our approach, the similarity between fragments is defined by the RMS deviation on angular values. The clustering algorithm used is a self-organizing map type neural network.

The effectiveness of such a method depends heavily on the extraction of good representative 3-D building blocks. Unger et al. [12] proposed a two-stage clustering algorithm that chooses hexamers (fragments of length 6) having a large number of neighbors as cluster centers. These center hexamers are called the 3-D building blocks [12]. Micheletti et al. also used the largest number of nearby points within a similarity cutoff, called the "proximity score" [13], to select cluster centers, while Kolodny et al. proposed a simulated annealing k-means method to perform the clustering task with the minimal total variance score [14], [50]-[51]. In this study, a modified form of the mountain clustering / subtractive clustering method [15]-[16] is proposed to find building blocks. Our experiments with some benchmark datasets show that it can find good building blocks, and we also use two alternative ways of depicting the quality of the building blocks.
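For orientation, here is a minimal sketch of the textbook subtractive clustering procedure on which such a method builds. It is not the modified form proposed in this study, whose details are not given in this section. Each point receives a potential that grows with the number of close neighbors; the point with the highest potential becomes a center, and the potential near that center is then subtracted. The radii `ra` and `rb`, the single `stop_ratio` stopping rule, and the pluggable `dist` function are simplifying assumptions; for hexamers, `dist` would be the best-fit RMS deviation.

```python
import math


def subtractive_clustering(points, dist, ra, rb=None, stop_ratio=0.15):
    """Return indices of cluster centers (simplified subtractive clustering)."""
    rb = 1.5 * ra if rb is None else rb
    alpha, beta = 4.0 / ra ** 2, 4.0 / rb ** 2
    n = len(points)
    # Squared pairwise distances, computed once.
    d2 = [[dist(points[i], points[j]) ** 2 for j in range(n)] for i in range(n)]
    # Potential of a point: high when many other points lie nearby.
    potential = [sum(math.exp(-alpha * d2[i][j]) for j in range(n))
                 for i in range(n)]
    centers, first_peak = [], None
    while True:
        c = max(range(n), key=lambda i: potential[i])
        peak = potential[c]
        if first_peak is None:
            first_peak = peak
        elif peak < stop_ratio * first_peak:
            break  # remaining peaks are too low to be new centers
        centers.append(c)
        # Subtract the influence of the new center from every potential.
        potential = [p - peak * math.exp(-beta * d2[i][c])
                     for i, p in enumerate(potential)]
    return centers


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two tight groups of 2-D points stand in for groups of similar fragments.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
centers = subtractive_clustering(points, euclidean, ra=1.0)
```

The selection rule is close in spirit to the neighbor-count criteria of Unger et al. and Micheletti et al., since a high potential corresponds to a fragment with many similar fragments around it.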

4.2 3-D Building Block Approach

The 3-D building block approach involves several steps. First, we need to decide on the fragment length. Then, given a set of proteins (training data) we need to compile the whole set into all possible fragments of the selected length. Next, a clustering method is used to divide these fragments into clusters and pick up the center of each cluster to be a building block. If these building blocks are good enough, then they can be used successfully to represent all original fragments within a tolerable limit and therefore can be used to reconstruct the 3-D structure of a whole protein within some tolerance.
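The fragment compilation step can be sketched as below, assuming each protein chain is given as a list of C-alpha coordinates and that fragments are all overlapping windows of the chosen length. The function names and the input layout are hypothetical.

```python
def overlapping_fragments(coords, length=6):
    """All overlapping windows of `length` consecutive residues in one chain."""
    return [coords[i:i + length] for i in range(len(coords) - length + 1)]


def compile_fragments(chains, length=6):
    """Pool the fragments of every chain in the training set."""
    fragments = []
    for coords in chains:
        fragments.extend(overlapping_fragments(coords, length))
    return fragments


# Toy chains: only the number of residues matters for this illustration.
chain_8 = [(float(i), 0.0, 0.0) for i in range(8)]
chain_5 = [(float(i), 1.0, 0.0) for i in range(5)]  # too short for a hexamer
chain_6 = [(float(i), 2.0, 0.0) for i in range(6)]
pool = compile_fragments([chain_8, chain_5, chain_6])
```

A chain of n residues contributes n - 5 hexamers, and chains shorter than the fragment length contribute none.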

4.2.1 Distance Measure between 3-D structures

A well-accepted definition of dissimilarity between two fragments is the Root-Mean-Square (RMS) deviation between two structures computed after alignment of the two fragments to the greatest possible extent using the BMF (best molecular fit) algorithm [56], [57]. Given two structures s and t, the RMS can be calculated as follows:

\[ \mathrm{RMS} = \left[ \frac{\sum_{i=1}^{K} \left( r_i^s - r_i^t \right)^2}{K-2} \right]^{1/2} \qquad (4) \]

where r_i^s is the 3-D coordinate of the i-th C_α atom in the molecule s and K denotes the number of atoms in the structure. Typically, for the computation of RMS, one should divide by K, but for ease of comparison with published results, we divide by (K-2) as done in [12].
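Equation (4) can be sketched numerically as follows. The BMF algorithm of [56], [57] is not reproduced here; the sketch substitutes the standard Kabsch least-squares superposition (via SVD) as an assumed stand-in for the best-fit step, and then applies the (K-2) divisor. NumPy is assumed to be available, and the function names are hypothetical.

```python
import numpy as np


def kabsch_fit(mobile, target):
    """Return `mobile` rigidly superposed onto `target` (both K x 3 arrays)."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    mc, tc = mobile.mean(axis=0), target.mean(axis=0)
    h = (mobile - mc).T @ (target - tc)            # 3 x 3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Force a proper rotation (determinant +1), never a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return (mobile - mc) @ rot.T + tc


def rms(s, t):
    """RMS deviation of Eq. (4): best fit first, then divide by (K - 2)."""
    t = np.asarray(t, dtype=float)
    fitted = kabsch_fit(s, t)
    k = len(t)
    return float(np.sqrt(((fitted - t) ** 2).sum() / (k - 2)))


# Two toy hexamers: a straight one and a bent one.
straight = [[3.8 * i, 0.0, 0.0] for i in range(6)]
bent = [[0, 0, 0], [3.8, 0, 0], [3.8, 3.8, 0],
        [7.6, 3.8, 1.0], [7.6, 7.6, 1.0], [11.4, 7.6, 2.0]]
value = rms(straight, bent)
```

Because the fit removes any rigid rotation and translation, a rotated and shifted copy of a fragment has an RMS of zero (up to rounding) with respect to the original.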

4.2.2 Method of Reconstruction

Following [12], we use this procedure: First, we replace each original hexamer of a protein by its closest building block. Then, since the building blocks overlap, we align every two consecutive building blocks using the BMF algorithm. The chain grows as follows. Onto the suffix (the last five residues) of the first building block we fit the prefix (the first five residues) of the next building block. The 3-D position of the sixth (last) residue of the latter hexamer is thus determined and is added to the growing chain. This process is repeated until the whole protein is reconstructed.
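The chain-growing step can be sketched as follows. Two assumptions are made: the Kabsch superposition stands in for the BMF algorithm, and each new prefix is fitted onto the last five residues of the growing chain. The input `blocks` is a hypothetical list holding, for each consecutive hexamer position, the 6 x 3 coordinates of its assigned building block.

```python
import numpy as np


def fit_transform(mobile, target):
    """Kabsch fit of `mobile` onto `target`; returns (rotation, centroids)."""
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    mc, tc = mobile.mean(axis=0), target.mean(axis=0)
    u, _, vt = np.linalg.svd((mobile - mc).T @ (target - tc))
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0  # no reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rot, mc, tc


def reconstruct(blocks):
    """Grow a chain by one residue per building block after the first."""
    chain = np.array(blocks[0], dtype=float)       # first block is kept whole
    for block in blocks[1:]:
        block = np.asarray(block, dtype=float)
        # Fit the prefix (first five residues) onto the chain's suffix.
        rot, mc, tc = fit_transform(block[:5], chain[-5:])
        # Carry the sixth residue along with the same rigid transform.
        new_residue = (block[5] - mc) @ rot.T + tc
        chain = np.vstack([chain, new_residue])
    return chain


# Toy helix-like chain of nine residues, cut into four overlapping hexamers.
t = np.arange(9)
native = np.column_stack([2.3 * np.cos(1.7 * t), 2.3 * np.sin(1.7 * t), 1.5 * t])
blocks = [native[i:i + 6] for i in range(4)]
rebuilt = reconstruct(blocks)
```

With exact hexamers as building blocks the chain is recovered exactly. With library building blocks, the small fitting errors accumulate along the chain, which is what the global-fit RMS measures.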

4.2.3 Performance Measure

To evaluate the performance of the proposed method, we use the same two criteria as in [14]: (1) Local-fit RMS, which measures how well the fragments of the target proteins can be represented by the library of building blocks at hand. It is the average of the coordinate RMS deviations between every fragment and its associated building block. (2) Global-fit RMS, which measures the RMS deviation of the reconstructed 3-D structure from the entire native structure of the target. In addition, we use two alternative ways, explained later, of assessing the quality of the building blocks.

From the document: Data Fusion and Mountain Clustering Applied to Improving Protein Structure Prediction and Analysis (pages 46-53)