Protein Secondary Structure Prediction

Chapter 1 Introduction

1.1 Protein Secondary Structure Prediction

Proteins can perform various functions when they fold into proper three-dimensional structures. Because determining the structure of a protein through wet-lab experiments can be time-consuming and labor-intensive, computational approaches are preferable. To characterize the structural topology of proteins, Linderstrøm-Lang proposed the concept of a protein structure hierarchy with four levels: primary, secondary, tertiary, and quaternary. The primary structure of a protein refers to its amino acid sequence. The secondary structure consists of the local coiling or bending of the amino acid chain. The tertiary structure is the folding of the molecule upon itself, held together by disulfide bridges and hydrogen bonds. The quaternary structure refers to the complex formed by the interaction of two or more polypeptide chains. Within this hierarchy, protein secondary structure (PSS) plays an important role in analyzing and modeling protein structures because it represents the local conformation of amino acids into regular structures.

There are three basic secondary structure elements (SSEs): α-helices (H), β-strands (E), and coils (C). Many researchers employ PSS as a feature to predict the tertiary structure [1-4], function [5-8], or subcellular localization [9] of proteins. It is noteworthy that, among the various features used to predict protein function, such as amino acid

[10]. Moreover, it has been suggested that secondary structure alone may be sufficient for accurate prediction of a protein’s tertiary structure [11].

Current PSS prediction methods can be classified into two categories: template-based methods and sequence profile-based methods [12]. Template-based methods use protein sequences of known secondary structures as templates, and predict PSS by finding alignments between a query sequence and sequences in the template pool. The nearest-neighbor method belongs to this category. It uses a database of proteins with known structures to predict the structure of a query protein by finding nearest neighbors in the database. By contrast, sequence profile-based methods (or machine learning methods) generate learning models to classify sequence profiles into different patterns. In this category, Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) are the most widely used machine learning algorithms [13-19].
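To make the nearest-neighbor idea concrete, the following is a minimal Python sketch under simplifying assumptions. It is not the algorithm of any published method: the window size, the identity-count similarity, the majority vote over `k` neighbors, and the default of coil at the chain termini are all illustrative choices, and the function names are hypothetical.

```python
from collections import Counter


def windows(sequence, structure, size):
    """Yield (fragment, state of the central residue) for every full window."""
    half = size // 2
    for i in range(half, len(sequence) - half):
        yield sequence[i - half:i + half + 1], structure[i]


def identity(a, b):
    """Number of positions at which two equal-length fragments agree."""
    return sum(x == y for x, y in zip(a, b))


def predict_nearest_neighbor(query, templates, size=5, k=3):
    """Predict H/E/C per residue by majority vote among the k most similar
    template windows. `templates` is a list of (sequence, structure) pairs;
    `size` is assumed to be odd."""
    half = size // 2
    database = [w for seq, ss in templates for w in windows(seq, ss, size)]
    if not database:
        raise ValueError("templates are too short for the chosen window size")
    prediction = []
    for i in range(len(query)):
        if i < half or i >= len(query) - half:
            prediction.append("C")  # assumption: termini default to coil
            continue
        fragment = query[i - half:i + half + 1]
        neighbors = sorted(database,
                           key=lambda w: identity(fragment, w[0]),
                           reverse=True)[:k]
        votes = Counter(state for _, state in neighbors)
        prediction.append(votes.most_common(1)[0][0])
    return "".join(prediction)


# Toy template pool: one all-helix and one all-strand chain.
templates = [("AAAAAAA", "HHHHHHH"), ("VVVVVVV", "EEEEEEE")]
print(predict_nearest_neighbor("AAAAAAA", templates))
```

Published nearest-neighbor predictors typically replace the identity count with a substitution or environment-based scoring scheme; the sketch only shows the overall flow from template pool to per-residue vote.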

Template-based methods are highly accurate if there is a sequence similarity above a predefined threshold between the query and some of the templates; otherwise, sequence profile-based methods are more reliable. However, the latter may under-utilize the structural information in the training set when the query protein has some sequence similarity to a template in the training set [12]. An approach that combines the strengths of both types of methods is required for generating reliable predictions irrespective of whether the query sequence is similar or dissimilar to the templates in the training set.

To measure the accuracy of secondary structure prediction methods, researchers often use the average three-state prediction accuracy (Q3) or the segment overlap (SOV) measure [20-21]. The estimated theoretical limit of secondary structure assignment from experimentally determined 3D structures is a Q3 accuracy of 88% [5, 22], which is regarded as the upper bound for secondary structure prediction. Although PSS prediction has been studied for decades, it has reached a bottleneck: the Q3 accuracy remains at approximately 80%, and further improvement is very difficult, as demonstrated by the CASP competitions. Currently, the most effective PSS prediction methods, such as PSIPRED [15], SVMpsi [17], PHDpsi [23], Porter [24], and SPINE [25], are based on machine learning algorithms and employ ANN or SVM learning models.
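As an illustration of the Q3 measure, the sketch below computes the percentage of residues whose predicted state (H, E, or C) matches the observed state for a single protein. The function name and the example strings are hypothetical; an average Q3 over a dataset would be obtained by averaging this value over its proteins. SOV is segment-based and is not covered here.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 17-residue example with two mispredicted positions.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
print(f"Q3 = {q3_accuracy(predicted, observed):.1f}%")
```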

The two most successful template-based methods are NNSSP [26-27] and PREDATOR [28]. They use the structural information obtained from local alignments between query proteins and template proteins, and their Q3 accuracy is approximately 70%. Thus, the difference in accuracy between the two categories is approximately 10 percentage points.

In a previous work on PSS prediction [29], we proposed a method called PROSP, which utilizes a sequence-structure knowledge base to predict a query protein’s secondary structure. The knowledge base consists of sequence fragments, each of which is associated with a corresponding structure profile. The profile is a position-specific scoring matrix that indicates the frequency of each SSE at each position. The average Q3 accuracy of PROSP is approximately 75%.
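A structure profile of this kind can be pictured as a small frequency table. The sketch below is an illustration only, not PROSP's actual implementation: it assumes that each sequence fragment has been observed with several equal-length secondary-structure strings, and it records the relative frequency of H, E, and C at each position. The function names and the example data are hypothetical.

```python
from collections import Counter

STATES = "HEC"


def build_structure_profile(structures):
    """Return one dict per position giving the relative frequency of H, E, C."""
    if not structures:
        raise ValueError("need at least one structure string")
    length = len(structures[0])
    if any(len(s) != length for s in structures):
        raise ValueError("all structure strings must have the same length")
    profile = []
    for column in zip(*structures):  # iterate position by position
        counts = Counter(column)
        profile.append({s: counts[s] / len(structures) for s in STATES})
    return profile


def consensus(profile):
    """Most frequent state at each position (ties resolved in H, E, C order)."""
    return "".join(max(STATES, key=lambda s: column[s]) for column in profile)


# Hypothetical structures observed for one four-residue fragment.
profile = build_structure_profile(["HHHC", "HHCC", "HECC"])
for position, column in enumerate(profile):
    print(position, column)
print(consensus(profile))
```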

Dictionary-based approaches are widely used in the field of natural language processing (NLP) [30-32]. In SymPred, the dictionary-based method presented here, we generate synonymous words from a protein sequence and its similar sequences; the definition of a synonymous word is given in Chapter 2. The major differences between SymPred and PROSP are as follows. First, the composition of the dictionary (SymPred) differs from that of the knowledge base (PROSP). Second, the two methods use different scoring systems. Third, unlike PROSP, SymPred allows inexact matching. Our experimental results show that SymPred achieves a Q3 accuracy of 81.0% on a non-redundant dataset, which represents a 5.9% performance improvement over PROSP.

There are significant differences between SymPred and other methods in the two categories described earlier. First, in contrast to template-based methods, SymPred does not generate a sequence alignment between the query protein and the template proteins. Instead, it finds templates by using local sequence similarities and their possible variations. Second, SymPred is not a machine learning-based approach. Moreover, it does not use a sequence profile, so it cannot be classified into the second category.

However, like machine learning-based approaches, SymPred can capture local sequence similarities and generate reliable predictions. SymPred can therefore combine the strengths of template-based and sequence profile-based methods. The experimental results on the two latest independent test sets (EVA_Set1 and EVA_Set2) show that, in terms of Q3 accuracy, SymPred outperforms other existing methods by 1.4% to 5.4%.
