1.1 Motivation and the Background of this Research
The solvent accessibility of amino acid residues plays an important role in tertiary structure prediction, especially when the query protein has no significant sequence similarity to proteins of known structure. Despite recent improvements, the prediction of solvent accessibility remains less accurate than secondary structure prediction.
Predicting the three-dimensional (3D) structure of a protein from its sequence is an important problem because the gap between the enormous number of protein sequences and the number of experimentally determined structures has widened [1], [2]. However, predicting the complete 3D structure of a protein remains a major challenge, especially when the query protein has no significant sequence similarity to proteins of known structure [3]–[6]. The prediction of solvent accessibility and secondary structure has been studied as an intermediate step toward predicting the tertiary structure of proteins, and the development of knowledge-based approaches has helped to solve these problems [7]–[11].
The secondary structures and solvent accessibilities of amino acid residues give useful insight into the structure and function of a protein [11]–[14]. In particular, knowledge of solvent accessibility has assisted threading alignments in regions of remote sequence identity [1], [15]. However, in contrast to the secondary
determined solvent accessibility into a finite number of discrete states such as buried, intermediate, and exposed states. Also, the prediction accuracy for solvent accessibility is lower than that for secondary structure, since solvent accessibility is less conserved than secondary structure [1], although there has been some progress recently.
The prediction of solvent accessibility, like that of secondary structure, is a typical pattern classification problem. The first step in solving such a problem is feature extraction, in which the important features of the data are extracted and expressed as sets of numbers called feature vectors. The performance of the pattern classifier depends crucially on a judicious choice of feature vectors. In solvent accessibility prediction, the use of evolutionary information such as multiple sequence alignments and position-specific scoring matrices has generally given good prediction results [16], [17]. Once an appropriate feature vector has been chosen, a classification algorithm is used to partition the feature space into disjoint regions separated by decision boundaries. The decision boundaries are determined using the feature vectors of reference samples with known classes, collectively called the reference dataset or training set. The class of a query is then assigned according to the region to which it belongs.
Various classification algorithms have been developed. Bayesian statistics is a parametric method where the functional form of the probability density is assumed for each class, and its parameters are estimated from the reference data.
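As a minimal sketch of the parametric idea described above, the following example assumes a one-dimensional feature and a Gaussian density for each class, then estimates the mean, standard deviation, and prior of each class from a toy reference set. The Gaussian form, the function names, and the numbers are illustrative assumptions, not the method of any cited work.

```python
import math
from statistics import mean, pstdev


def fit_gaussians(samples_by_class):
    """Estimate (mean, std, prior) for each class from reference samples."""
    total = sum(len(values) for values in samples_by_class.values())
    return {
        label: (mean(values), pstdev(values), len(values) / total)
        for label, values in samples_by_class.items()
    }


def log_posterior(x, mu, sigma, prior):
    """Log of prior times Gaussian density, up to a class-independent constant.

    Assumes sigma > 0, i.e. each class has at least two distinct samples.
    """
    return math.log(prior) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def classify(x, params):
    """Return the class with the largest (unnormalized) posterior for x."""
    return max(params, key=lambda label: log_posterior(x, *params[label]))


# Toy one-dimensional reference data (made-up numbers for illustration only).
reference = {
    "buried": [0.1, 0.2, 0.3],
    "exposed": [0.7, 0.8, 0.9],
}
params = fit_gaussians(reference)
low_query_class = classify(0.25, params)
high_query_class = classify(0.75, params)
```

The decision boundary here follows entirely from the assumed functional form and the estimated parameters, which is what distinguishes a parametric method from the nonparametric ones.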
In nonparametric methods, no specific functional form for the probability density is assumed; examples include neural networks, support vector machines, and nearest neighbor methods. In neural network methods, the decision boundaries are set up before the prediction using a training set. Support vector machines are similar to neural networks in that the decision boundaries are determined before the prediction, but whereas neural network methods minimize the overall error between the predicted and observed classes of the training set, support vector machines maximize the margin of the decision boundary.
In k-nearest neighbor methods, the decision boundaries are determined implicitly during the prediction: the query is assigned the class most represented among its k nearest reference data.
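The rule above can be sketched in a few lines. This is only an illustration under stated assumptions: Euclidean distance, two-dimensional toy feature vectors, and a hypothetical function name `knn_predict`; none of these is specified in the text.

```python
import math
from collections import Counter


def knn_predict(query, reference, k=3):
    """Assign `query` the class most represented among its k nearest reference points.

    `reference` is a list of (feature_vector, class_label) pairs.
    """
    # Sort the reference points by Euclidean distance to the query.
    by_distance = sorted(reference, key=lambda item: math.dist(query, item[0]))
    # Each of the k closest points casts one equally weighted vote.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy reference set (made-up numbers, not real profile data).
reference = [
    ((0.0, 0.0), "buried"),
    ((0.1, 0.2), "buried"),
    ((0.2, 0.1), "buried"),
    ((1.0, 1.0), "exposed"),
    ((0.9, 1.1), "exposed"),
]
prediction = knn_predict((0.15, 0.15), reference, k=3)
```

No decision boundary is computed in advance; the reference set is simply searched each time a query arrives.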
The standard k-nearest neighbor rule places equal weights on the k nearest reference data when determining the class of the query, but a more general rule uses weights proportional to a certain power of the distance. Also, by assigning fuzzy memberships to the query instead of a definite class, one can estimate the confidence level of the prediction. The method employing these more general rules is called the fuzzy k-nearest neighbor method [18].
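A minimal sketch of such a fuzzy rule follows. It assumes a commonly used form in which each neighbor is weighted by the distance raised to the power -2/(m - 1), with crisp class labels on the reference data. The parameter `m`, the small constant `eps`, and the function name `fuzzy_knn` are assumptions made for illustration; the modified method used in this thesis may differ.

```python
import math


def fuzzy_knn(query, reference, k=3, m=2.0, eps=1e-12):
    """Return fuzzy class memberships of `query` from its k nearest reference points.

    Each neighbor contributes a weight of distance ** (-2 / (m - 1)), with m > 1.
    """
    nearest = sorted(
        ((math.dist(query, x), label) for x, label in reference),
        key=lambda pair: pair[0],
    )[:k]
    exponent = 2.0 / (m - 1.0)
    memberships = {}
    total = 0.0
    for distance, label in nearest:
        # eps avoids division by zero when the query coincides with a reference point.
        weight = 1.0 / (distance ** exponent + eps)
        memberships[label] = memberships.get(label, 0.0) + weight
        total += weight
    # Normalize so that the memberships sum to one.
    return {label: weight / total for label, weight in memberships.items()}


# Toy reference set (made-up numbers, not real profile data).
reference = [
    ((0.0, 0.0), "buried"),
    ((0.1, 0.2), "buried"),
    ((0.2, 0.1), "buried"),
    ((1.0, 1.0), "exposed"),
    ((0.9, 1.1), "exposed"),
]
memberships = fuzzy_knn((0.6, 0.6), reference, k=3)
predicted = max(memberships, key=memberships.get)
```

The predicted class is the one with the largest membership, and the size of that membership can be read as a rough confidence level for the prediction.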
Neural network methods are very popular and have been widely used for solvent accessibility prediction [1], [7], [19]–[22], and support vector machines, a more recently developed method, show results comparable to those of neural network methods [23]–[25].
Bayesian statistics has also been used by Thompson and Goldstein (1996).
The k-nearest neighbor method has been frequently used for the classification of biological and medical data, and despite its simplicity, its performance is competitive with that of many other methods. However, the k-nearest neighbor
used to predict protein secondary structure [26]–[28].
In this thesis, we apply a modified fuzzy k-nearest neighbor method to the prediction of solvent accessibility, using PSI-BLAST [29] profiles as the feature vectors. We obtain relatively high accuracy on various benchmark tests.
1.2 Thesis Outline
This thesis is organized as follows. Chapter 1 introduces the motivation and background of this thesis. Chapter 2 first introduces the data set and the definition of protein solvent accessibility, and then describes the k-nearest neighbor algorithm and the quick radial basis function network. In Chapter 3, we propose five different methods to predict protein relative solvent accessibility. In Chapter 4, computer simulation experiments are conducted and the results are compared with those of other methods. Finally, the conclusions and discussion of this thesis are presented in Chapter 5.