• 沒有找到結果。

快速輻射半徑基底函數網路演算法於蛋白質相對溶劑可接觸性預測的應用

N/A
N/A
Protected

Academic year: 2021

Share "快速輻射半徑基底函數網路演算法於蛋白質相對溶劑可接觸性預測的應用"

Copied!
58
0
0

加載中.... (立即查看全文)

全文

(1)

國 立 交 通 大 學

電 機 與 控 制 工 程 研 究 所

碩 士 論 文

快速輻射半徑基底函數網路演算法

於蛋白質相對溶劑可接觸性預測的應用

Applying Quick Radial Basis Function Network

to Protein Relative Solvent Accessibility Prediction

學 生: 游 涵 任

指 導 教 授: 張 志 永

(2)

快速輻射半徑基底函數網路演算法

於蛋白質相對溶劑可接觸性預測的應用

Applying Quick Radial Basis Function Network

to Protein Relative Solvent Accessibility Prediction

學 生 : 游涵任 Student : Han-Jen Yu

指導教授 : 張志永 Advisor : Jyh-Yeong Chang

國立交通大學

電機與控制工程學系

碩士論文

A Thesis

Submitted to Department of Electrical and Control Engineering

College of Electrical Engineering and Computer Science

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of Master in

Electrical and Control Engineering

July 2006

Hsinchu, Taiwan, Republic of China

(3)

快速輻射半徑基底函數網路演算法

於蛋白質相對溶劑可接觸性預測的應用

學生:游涵任 指導教授:張志永博士 國立交通大學電機與控制工程研究所

摘要

蛋白質在生物體中一直扮演著很重要的角色,蛋白質被發現的數量及其結構 逐年增加。隨著蛋白質的應用越來越廣泛,待解決的課題也就越來越多。例如: 蛋白質二級結構預測問題、蛋白質相對溶劑可接觸性預測問題等。目前在蛋白質 結構問題的解決上,科學家都是利用X光繞射以及核磁共振 (NMR) 來取得實驗結 果。這些方法雖然正確率高,但是相對地所要花費的時間及成本是相當高的。因 此利用電腦科學中的機器學習 (Machine learning) 演算法來預測這些問題,相 信能夠有效降低實驗與時間成本。 本篇論文,我們利用修改的快速輻射半徑基底函數網路演算法,混合從 PSI-BLAST 產生的位置加權矩陣,針對蛋白質相對溶劑可接觸性預測問題進行研 究。最近歐等人 [10],發展出快速輻射半徑基底函數網路演算法,是一種較快 速且精確設計之網路,應用於蛋白質二級結構預測有顯著的效果。我們的修改的

(4)

快速輻射半徑基底函數網路演算法,應用於蛋白質相對溶劑可接觸性預測。我們 使用五種不同的快速輻射半徑基底函數網路演算法,應用在三態相對溶劑可接觸 性預測和二態相對溶劑可接觸性預測。此五種方法包括:(1) 快速輻射半徑基底 函數網路演算法、(2) 二階快速輻射半徑基底函數網路演算法、(3) 一般混合快 速輻射半徑基底函數網路演算法、(4) 地域趨勢混合快速輻射半徑基底函數網路 演算法、以及(5) 全域趨勢混合快速輻射半徑基底函數網路演算法。我們選擇有 最佳表現的一般混合快速輻射半徑基底函數網路演算法,做為建議的演算法。我 們也將修改的快速輻射半徑基底函數網路演算法的實驗結果,與近幾年的其他方 法比較,並且提出我們的方法改進方向的建議。

(5)

Applying Quick Radial Basis Function Network to

Protein Relative Solvent Accessibility Prediction

STUDENT: HAN-JEN YU ADVISOR: JYH-YEONG CHANG

Institute of Electrical and Control Engineering

National Chiao-Tung University

Abstract

Proteins have been played an important role in a creature and the numbers of

proteins and their structures have been increased with years. Since protein

applications are more widely used, there will be a lot of problems to be solved. For

example, there are protein secondary structure prediction problem, protein relative

solvent accessibility problem and so on. Nowadays, scientists use X-ray diffraction or

nuclear magnetic resonance (NMR) to solve the protein structure problem. Although

they can achieve high accuracy, it is expensive and long to solve this protein problem.

To reduce the time and the costs, it is imperative to use machine learning algorithms

to solve this protein problem.

(6)

In this thesis, we study protein relative solvent accessibility problem using a

modified QuickRBF method combined with a position-specific scoring matrix (PSSM)

generated from PSI-BLAST. The QuickRBF method, recently developed by Ou et al.

[10], has been applied to protein secondary structure prediction with excellent results.

Our modified QuickRBF method is applied on relative solvent accessibility prediction.

Five different kind of QuickRBF approaches are applied on three-state, E, I, and B,

and two-state, E, and B, relative solvent accessibility predictions. These five

approaches include: (1) QuickRBF, (2) Two-Stage QuickRBF, (3) Common Fusion

QuickRBF, (4) Local Tendency Fusion QuickRBF, and (5) Global Tendency Fusion

QuickRBF. We recommend the Common Fusion QuickRBF approach which has the

best performance as our modified QuickRBF method. We also compare the results of

the modified QuickRBF method with other methods in the recent years, and suggest

(7)

Acknowledgement

I would like to express my sincere appreciation to my advisor, Dr. Jyh-Yeong

Chang. Without his patient guidance and inspiration during the two years, it is

impossible for me to overcome the obstacles and complete the thesis. In addition, I am

thankful to all my lab members for their discussion and suggestion.

Finally, I would like to express my deepest gratitude to my family. Without their

(8)

Contents

Abstract (Chinese)…………..………..3 Abstract (English).………...……….5 Acknowledgement………..………..7 List of Figures………..………...…10 List of Tables………..………...11 Chapter 1. Introduction………..………..12

1.1 Motivation and The Background of This Research………12

1.2 Thesis Outline……….…...13

Chapter 2. Materials and Methods……….….…14

2.1 Training and Data Set………..14

2.2 Definition of Solvent Accessibility……….………...…....16

2.2.1 Static Residue Solvent Accessibility………..16

2.2.2 Residue Relative Solvent Accessibility………..18

2.3 PSI-BLAST Profiles………22

2.4 Quick Radial Basis Function Network…..……...……….24

2.5 Coding Scheme………..26

2.6 Several Prediction System Structures...………...27

2.6.1 QuickRBF Approach………...………....27

2.6.2 Two-Stage QuickRBF Approach……….29

2.6.3 Common Fusion QuickRBF Approach……….31

2.6.4 Local Tendency Fusion QuickRBF Approach………...………..32

(9)

Chapter 3. Experiment and Simulation Results………38

3.1 Experiment Procedure of Five Kind QuickRBF Approaches……….38

3.2 Classification Accuracy of Five Kind QuickRBF Approaches…………...39

3.3 Matthew’s Correlation Coefficients of Five QuickRBF Approaches...45

3.4 Comparison With Other Approaches...………...50

Chapter 4. Conclusion and Discussion………..………54

(10)

List of Figures

Fig. 2.1. Measure Accessibility……….………….………...17

Fig. 2.2. Binary Model……….……...………….….…………...21

Fig. 2.3. Raw Profile From PSI-Blast Log File………...……...23

Fig. 2.4. BLOSUM 62 Matrix……….….………...23

Fig. 2.5. General Architecture of Radial Basis Function Network.…………...25

Fig. 2.6. Architecture of QuickRBF Method….…………...28

Fig. 2.7. Architecture of Two-Stage QuickRBF Method….……..……...30

Fig. 2.8. Architecture of Common Fusion QuickRBF Method…………...31

Fig. 2.9. Architecture of Local Tendency Fusion QuickRBF Method……...…...33

Fig. 2.10. Architecture of Global Tendency Fusion QuickRBF Method...…...37

Fig. 3.1. Accuracy Plot of Five Kind QuickRBF Methods...44

(11)

List of Tables

Table 2.1. Database of Non-Homologous Proteins Used For Seven-Fold Cross Validation...15

Table 2.2. Definition of Solvent Accessibility States...19

Table 2.3. Occurrence Numbers Used For Local and Global Tendency Fusion QuickRBF Methods..34

Table 3.1. Classification Accuracy of Five Kind QuickRBF Methods on The RS126 Data Set...41

Table 3.2. The accuracy tables A of Common Fusion QuickRBF on each fold and RS126...46

Table 3.3. Matthew’s Correlation Coefficients of Five Kind QuickRBF Methods on RS126...48

(12)

Chapter 1. Introduction

1.1 Motivation and The Background of This Research

The knowledge of protein structures is valuable for understanding mechanisms

of diseases of living organisms and for facilitating discovery of new drugs. Protein

structure can be experimentally determined by NMR spectroscopy and X-ray

crystallography techniques or by molecular dynamics simulations. However, the

experimental approaches are marred by long experimental time, prone to difficulties,

and expensive. There are more than 130,000 protein sequences in Swissport (release

41.20), but less than about 37,000 three-dimensional (3D) protein structure are in the

Protein Data Bank (PDB). Only a small fraction, 17%, of the enormous number of

sequenced proteins has their structure determined. In order to reduce the gap between

sequence and structure, developing reliable and applicable structure prediction

method has become a more important task in computational biology. An intermediate

but useful step is to predict protein secondary structure (PSS) or relative solvent

accessibility (RSA), which provides some knowledge and simplifies the complicated

3D structure prediction problem [1], [2]. The usual goal of RSA prediction is to

classify a pattern of residues in amino acid sequences to a pattern of solvent

accessibility elements: buried (B), intermediate (I) and exposed (E) residues. Though

the prediction of solvent accessibility is less accurate than that of secondary structure

from the homology approach, since it is less conserved than secondary structure [3],

there has been much effort to improve prediction accuracy to obtain important

(13)

tertiary structure from sequences. Many different techniques have been proposed for

RSA prediction, which broadly fall into the following categories: (1) Bayesian, (2)

neural networks, and (3) information theoretical approaches. The Bayesian methods

provide a framework to take into account local interactions among amino acid

residues, by extracting the information from single sequences or multiple sequence

alignments to obtain posterior probabilities for RSA prediction [4]. Neural networks

use residues in a local neighborhood, as inputs, to predict the RSA of a residue at a

particular location by finding an arbitrary, nonlinear mapping [5]–[8]. The

information theoretical approaches use mutual information between the sequences of

amino acids and solvent accessibility values derived from a single amino acid residue,

or pairs of residues, in a neighborhood for RSA prediction [9]. In this study, we

propose a modified QuickRBF method for RSA prediction combined with a

position-specific scoring matrix (PSSM) generated from PSI-BLAST. The QuickRBF

method, recently developed by Ou et al. [10], is applied on protein secondary

structure prediction. Our modified QuickRBF method is applied on relative solvent

accessibility prediction. We also compare the results of the modified QuickRBF

method with other methods.

1.2 Thesis Outline

The organization of this thesis is structured as follows. Chapter 1 introduces the

role of machine learning, the motivation and the background of this thesis. In Chapter

2, we will first introduce the data set and the definition of protein solvent accessibility.

Then we will propose five different kind of QuickRBF methods to predict protein

relative solvent accessibility. In Chapter 3, the experiment of computer simulation and

(14)

this thesis is presented in Chapter 4.

Chapter 2. Materials and Methods

2.1 Training and Data Set

The set of 126 nonhomologous globular protein chains used in the experiment of

Rost and Sander [3] and referred to as the RS126 set was used to evaluate the

accuracy of the prediction. The proteins in the RS126 data set have less than 25%

pairwise sequence identity. This set was used to evaluate different methods of relative

solvent accessibility prediction, for example, PHDacc [3] and other methods

[21]–[23]. In this paper, we performed a sevenfold cross-validation test on this set as

defined in Table 2.1. In order to avoid the selection of extremely biased partitions, the

RS126 set was divided into subsets of approximately same composition of each type

of RSA state. One subset was chose as the testing set while the rest was merged into

the training set. This procedure was repeated seven times to cover whole RS126 data

(15)

Table 2.1. The database of non-homologous proteins used for seven-fold cross validation. All proteins have less than 25% pairwise similarity for lengths greater than 80 residues.

256b_A 2aat 8abp 6acn 1acx 8adh 3ait

2ak3_A 2alp 9api_A 9api_B 1azu 1cyo 1bbp_A Fold_A

1bds 1bmv_1 1bmv_2 3blm 4bp2

2cab 7cat_A 1cbh 1cc5 2ccy_A 1cdh 1cdt_A

3cla 3cln 4cms 4cpa_I 6cpa 6cpp 4cpv Fold_B

1crn 1cse_I 6cts 2cyp 5cyt_R

1eca 6dfr 3ebx 5er2_E 1etu 1fc2_C 1fdl_H

1dur 1fkf 1fnd 2fxb 1fxi_A 2fox 1g6n_A Fold_C

2gbp 1a45 1gd1_O 2gls_A 2gn5

1gpl 4gr1 1hip 6hir 3hmg_A 3hmg_B 2hmz_A

5hvp_A 2i1b 3icb 7icd 1il8_A 9ins_B 1l58 Fold_D

1lap 5ldh 1gdj 2lhb 1lmb_3

2ltn_A 2ltn_B 5lyz 1mcp_L 2mev_4 2or1_L 1ovo_A

1paz 9pap 2pcy 4pfk 3pgm 2phh 1pyp Fold_E

1r09_2 2pab_A 2mhu 1mrt 1ppt

1rbp 1rhd 4rhv_1 4rhv_3 4rhv_4 3rnt 7rsa

2rsp_A 4rxn 1s01 3sdh_A 4sgb_I 1sh1 2sns Fold_F

2sod_B 2stv 2tgp_I 1tgs_I 3tim_A

1bks_A 1bks_B 1tnf_A 1ubq 2tmv_P 2tsc_A 2utg_A Fold_G

(16)

2.2 The Definition of Protein Solvent Accessibility

2.2.1 Static Residue Solvent Accessibility

The native structure of globular proteins exists only in the presence of water [11],

and therefore the analysis of their interactions with water is central to the theory of

protein structure [12]. The term “accessible surface area” was introduced by Lee and

Richards [13] to quantitatively describe the extent to which atoms on the protein

surface can form contacts with water. For a particular protein atom it is defined as the

area over which the centre of a water molecule can be placed while retaining van der

Waals contacts with that atom and not penetrating any other atom. The principal goal

is to predict the extent to which a residue embedded in a protein structure is accessible

to solvent. Solvent accessibility can be described in several ways [13]–[15]. The most

detailed fast method compiles solvent accessibility by estimating the volume of a

residue embedded in a structure that is exposed to solvent as shown in Fig. 2.1; note:

this method was developed by Lee and Richards [13] and later implemented in DSSP

[16]. Different residues have a different possible accessible area.

Studies of solvent accessibility in proteins have led to many new insights into

protein structure [13]–[18]. Knowledge of solvent accessibility has proved useful for

identifying protein function, sequence motifs, and domains, and for formulating

hypotheses about antigenic determinants, site-directed mutagenesis, humanization of

antibodies, and on the correctness of designed or experimentally determined protein

structures. Furthermore, knowledge of solvent accessibility has assisted alignments in

(17)

Fig. 2.1. Measure accessibility. Residue solvent accessibility is usually measured by rolling a spherical water molecule over a protein surface and summing the area that can be accessed by this molecule on each residue (typical values range from 0-300 Å2

). To allow comparisons between the accessibility of long extended and spherical amino acids, typically relative values are compiled (actual area as percentage of maximally accessible area). A more simplified description distinguishes two states: exposed (here residues numbered 1-3 and 10-12) and buried (here residues 4-9) residues. Since the packing density of native proteins resembles the crystals, values for solvent accessibility provide upper and lower limits to the number of possible inter-residue contacts.

(18)

2.2.2 Residue Relative Solvent Accessibility

How can the solvent accessibility of a residue embedded in a 3D structure be cast

into a simple number? One simple way is to count the number of water molecules in

direct contact with the residue, as estimated by the program DSSP for the first

hydration shell. For comparison between amino acids of different sizes, the relative

solvent accessibility is a useful quantity as defined in Table 2.2.

Amino acid relative solvent accessibility is the degree to which a residue in a

protein is accessible to a solvent module. The relative solvent accessibility can be

calculated by the formula as follows:

RelAcc (%) = 100 ×Acc / MaxAcc (%) ,

where Acc is the solvent accessible surface area of the residue observed in the 3D

structure, given in Angstrom units, calculated from coordinates by the dictionary of

protein secondary structure (DSSP) program [16]. The number of water molecules

around a residue can be approximated by Acc/10, and MaxAcc is the maximum value

of solvent accessible surface area of each kind of residue for a Gly-X-Gly extended

(19)

Table 2.2. Definition of solvent accessibility states.

z Solvent accessibility:

Acc = solvent accessibility of a residue (given in Å2) calculated from coordinates using DSSP [16]. W≈ Acc/10, approximates the number of water molecules around the residue.

z Relative solvent accessibility:

RelAcc = Acc/MaxAcc, with maximal accessibility (measured in Å2) for the amino acids given by the table following (amino acids in one-letter code; B stands for D or N; Z for E or Q, and X for an undetermined amino acid) [18][19].

AA A B C D E F G H I K L M

MaxAcc 106 160 135 163 194 197 84 184 169 205 164 188

AA N P Q R S T V W X Y Z

MaxAcc 157 136 198 248 130 142 142 227 180 222 196

z Two-state (binary) model for accessibility (B/E) :

Buried (B) Exposed (E)

RelAcc ≤ 0% RelAcc > 0% RelAcc < 5% RelAcc ≥ 5% RelAcc < 9% RelAcc ≥ 9% RelAcc < 16% RelAcc ≥ 16% Thresholds to distinguish two states

RelAcc < 25% RelAcc ≥ 25%

z Three-state (ternary) model for accessibility (B/I/E) :

Buried (B) Intermediate (I) Exposed (E) Thresholds to distinguish three states RelAcc < 9% 9% ≤ RelAcc < 36% RelAcc ≥ 36%

z Measure for evaluation of conservation and accuracy of prediction:

Q2 percentage of conserved, or correctly predicted, residues in two states defined

by thresholds given above.

Q3 percentage of conserved, or correctly predicted, residues in three states

(20)

RelAcc can hence adopt values between 0% and 100%, with 0% corresponding

to a fully buried and 100% to a fully accessible residue, respectively. Different

arbitrary threshold values of relative solvent accessibility are chose to define

categories: buried and exposed as shown in Fig. 2.2, or ternary categories: buried,

intermediate, or exposed. The precise choice of the threshold is not well defined [3].

We used two kind of class definitions: (1) buried (B) and exposed (E); and (2)

buried (B), intermediate (I), and exposed (E). For the two-state, B and E definition,

we chose various thresholds of the relative solvent accessibility such as 25%, 16%,

9%, 5%, and 0%. For the three-state, B, I, and E, description of relative solvent

accessibility, one set of thresholds that we selected is the same as those in Rost and

Sander [3]:

Buried (B): RelAcc < 9%

Intermediate (I): 9% ≤ RelAcc < 36% Exposed (E): RelAcc ≥ 36%

(21)

Fig 2.2. Binary model: thick and dark line is buried residues; thin and light line is exposed residues [20].

(22)

2.3 PSI-BLAST Profiles

It is well known that evolutionary information in the form of multiple alignments

and profiles significantly improves the accuracy of, for instance, secondary structure

prediction methods [4], [24]–[27]. This is so because the secondary structure of a

family is more conserved than the primary amino acid sequence. Similar effects have

been reported for the prediction of contact number and relative solvent accessibility.

For relative solvent accessibility, a corresponding increase of 5% has been described

both with neural networks [25] and Bayesian methods.

PSI-BLAST [28] generates the profile of a protein in the form of an N×20

position-specific scoring matrix as shown in Fig. 2.3, where N is the length of the

sequence. PSI-BLAST is run with default options, -j 3, -h 0.001, and -e 10.0, and the

non-redundant protein sequence database (ftp://ncbi.nlm.nih.gov/blast/db) filtered by

PFILT [29] to mask out regions of low complexity sequence, the coiled coil regions

and transmembrane spans. The BLOSUM62 [30] substitution matrix as shown in Fig.

2.4, is used for PSI-BLAST. These profiles were scaled to the required 0–1 range

using the standard logistic function:

) exp( 1 1 ) ( x x f − + = ,

(23)

Fig. 2.3. Raw profile from PSI-Blast log file

Fig. 2.4. BLOSUM 62 substitution matrix (Lower) and difference matrix (Upper) obtained by subtracting the PAM 160 matrix position by position. These matrices have identical relative entropies (0.70); the expected value of BLOSUM 62 is -0.52; that for PAM 160 is -0.57.

(24)

2.4 Quick Radial Basis Function Network

The radial basis function network (RBFN) is a special type of neural networks

with several distinctive features [31]. Since its first proposal, the RBFN has attracted

a high degree of interest in research communities. A RBFN consists of three layers,

namely the input layer, the hidden layer, and the output layer. The input layer

broadcasts the coordinates of the input vector to each of the nodes in the hidden layer.

Each node in the hidden layer then produces an activation based on the associated

radial basis function. Finally, each node in the output layer computes a linear

combination of the activations of the hidden nodes. How a RBFN reacts to a given

input stimulus is completely determined by the activation functions associated with

the hidden nodes and the weights associated with the links between the hidden layer

and the output layer. The general mathematical form of the output nodes in a RBFN is

as follows: cj(x) =

= k i 1 wjiΦ( ||x-μi ||;σi ) ,

where cj(x) is the function corresponding to the j-th output unit (class-j) and is a linear

combination of k radial basis functions Φ(.) with centerμi and bandwidthσi . Also, wj

is the weight vector of class-j and wji is the weight corresponding to the j-th class and

i-th center. The general architecture of RBFN is shown in Fig. 2.5. It can be seen that

constructing a RBFN involves determining the number of centers, k, the center

locations, μi , the bandwidth of each center, σi , and the weights, wji. That is, training a

RBFN involves determining the values of three sets of parameters: the centers (μi ),

the bandwidths (σi ), and the weights (wji), in order to minimize a suitable cost

(25)

Fig. 2.5. General architecture of Radial Basis Function Network

In QuickRBF package [10], it is focused on the calculation of the weights, and

conducting a simple method to determine the centers and bandwidths. Therefore, it

selects the centers randomly in the package. Also, it utilizes a fixed bandwidth of each

kernel function, which is set to five for each kernel function. After the centers and

bandwidths of the kernel functions in the hidden layer have been determined, the

transformation between the inputs and the corresponding outputs of the hidden units is

now fixed. The network can thus be viewed as an equivalent single-layer network

with linear output units. Then, the Least Mean Square Error method is used to

determine the weights associated with the links between the hidden layer and the

output layer.

Ou used a single-layer Quick Radial Basis Function Network [10] to analyze

protein secondary structure with excellent prediction results on the RS126 data set.

There are more details about QuickRBF can be found in QuickRBF package

(http://csie.org/~yien/quickrbf/index.php). Here, we propose a modified QuickRBF

(26)

2.5 Coding Scheme

As with Hua and Sun’s work [32], the present analysis used the classical local

coding scheme of the protein sequences with a sliding window. PSI-BLAST matrix

with n rows and 20 columns can be defined for single sequence with n residues. For

the first layer in the prediction, each residue is represented using 20 components in a

vector, based on the PSSM. In order to allow a window to extend over the N-terminus

and the C-terminus, an additional 21st unit (spacer) was attached to each residue.

Then, each input vector has 21×w components, where w is a sliding window size. For

the second layer, the vector corresponding to a residue has four elements in the

three-state prediction and three elements in the two-state prediction, where the first

three elements represent the three relative solvent accessibility states, E, I, and B, in

the three-state prediction and the first two elements represent the two relative solvent

accessibility states, E and B, in the two-state prediction. Both the last units were

added in order to allow a window to extend over the N-terminus and the C-terminus.

If the window length is v, the dimension of the feature vector is 4×v for the second

(27)

2.6 Several Prediction System Structures

Five different kind of QuickRBF approaches are applied on three-state, E, I, and

B, and two-state, E and B, relative solvent accessibility predictions. These five

approaches include: (1) QuickRBF, (2) Two-Stage QuickRBF, (3) Common Fusion

QuickRBF, (4) Local Tendency Fusion QuickRBF, and (5) Global Tendency Fusion

QuickRBF.

2.6.1 QuickRBF Approach

A QuickRBF structure was used in the prediction system as shown in Fig. 2.6.

The QuickRBF classifier classifies each residue of each sequence into the three

relative solvent accessibility states, E, I, or B, by using the values of matrices of

PSI-BLAST profile as the inputs. The outputs represent the tendency that the residue

belongs to that state. The one-against-rest strategy was used for the multiclass

classification, so each residue was classified into the state with the largest output

(28)

Fig. 2.6. Architecture of QuickRBF method. The system includes two parts: the PSI-BLAST profile, and the classifier. The profile is transformed into a number of 21*17 dimension vectors using the slide-window method. These vectors are input into the QuickRBF classifier. The outputs of the QuickRBF classifier are a number of 3D vectors representing the tendency that the residue belongs to that state. The one-against-rest strategy was used to classify each residue into the state with the largest value.

PSI-BLAST Profile

Coding : transform the 17*20 matrix into a 17*21 dimension vector

QuickRBF Classifier Data Normalization

Classifier

(29)

2.6.2 Two-Stage QuickRBF Approach

A Two-Stage QuickRBF structure was used in the prediction system as shown in

Fig. 2.7. The first stage is a QuickRBF classifier that classifies each residue of each

sequence into the three relative solvent accessibility states, E, I, or B, by using the

values of matrices of PSI-BLAST profile as the inputs. The outputs of the first stage

represent the tendency that the residue belongs to that state. The second stage

QuickRBF classifier also classifies each residue of each sequence into the three

relative solvent accessibility states, E, I, or B, by using the RSA three-state tendency

matrices from the outputs of the first stage as the inputs. The outputs of the second

stage also represent the tendency that the residue belongs to that state. As with an

One-Stage QuickRBF approach, the second stage also uses the one-against-rest

strategy, with each residue classified into the state with the largest output value for a

(30)

Second Stage

Fig. 2.7. Architecture of Two-Stage QuickRBF method. The system includes three parts: the PSI-BLAST profile, the first stage, and the second stage. The profile is transformed into a number of 21*17 dimension vectors using the slide-window method. These vectors are input into the first-stage QuickRBF. The outputs of the first-stage QuickRBF are a number of 3D vectors representing the tendency that the residue belongs to that state. Using the slide-window method, the outputs of the first-stage QuickRBF are transformed into a number of 4*15 dimensional vector, which are used as the inputs of the second-stage QuickRBF. The final decisions are based on the outputs of the second-stage QuickRBF.

PSI-BLAST Profile

Coding : transform the 17*20 matrix into a 17*21 dimension vector

First-Stage QuickRBF Data Normalization

Coding : transform the 3*15 matrix into a 4*15 dimension vector

Second-Stage QuickRBF

First Stage

(31)

2.6.3 Common Fusion QuickRBF Approach

Three kind of fusion QuickRBF approaches were used to combine the outputs of

a QuickRBF approach and the outputs of a Two-Stage QuickRBF approach. One is

the Common Fusion QuickRBF approach, and the others are the Local Tendency

Fusion QuickRBF approach and the Global Tendency Fusion QuickRBF approach.

The architectures of these three approaches were illustrated in Figs. 2.8, 2.9, and 2.10.

The common fusion strategy adds up the tendency outputs from a QuickRBF

approach and the tendency outputs from a Two-Stage QuickRBF approach. Then we

also use the one-against-rest strategy to classify each residue into the state with the

largest value.

Fig. 2.8. Architecture of Common Fusion QuickRBF method

Common Fusion Rule:

E ( o n e - s t a g e ) + E ( t w o - s t a g e ) I ( o n e - s t a g e ) + I ( t w o - s t a g e ) B ( o n e - s t a g e ) + B ( t w o - s t a g e )

outputs from QuickRBF

outputs from Two-Stage QuickRBF

(32)

2.6.4 Local Tendency Fusion QuickRBF Approach

The local tendency fusion strategy also adds up the tendency outputs from a

QuickRBF approach and the tendency outputs from a Two-Stage QuickRBF approach.

An occurrence number table is then applied in the summation as shown in Table 2.3.

There are three occurrence numbers which are Oe, Oi and Ob, where Oe means the

numbers of exposed components in the test data, and Oi means the numbers of

intermediate components in the test data, and Ob means the numbers of buried

components in the test data. These three occurrence numbers represent the potential

factors for the affection ability of the three relative solvent accessibility states. In

other words, if an occurrence number is larger, the tendency of a residue which

belongs to that state is larger. Then we also use the one-against-rest strategy to

(33)

Fig. 2.9. Architecture of Local Tendency Fusion QuickRBF method. These three occurrence numbers are based on each test fold.

outputs from QuickRBF

Local Tendency Fusion Rule: { E (one-stage) + E (two-stage) } * Oe { I (one-stage) + I (two-stage) } * Oi { B (one-stage) + B (two-stage) } * Ob

outputs from Two-Stage QuickRBF

(34)

Table 2.3. Occurrence numbers used for local and Global Tendency Fusion QuickRBF method. From Fold_A to Fold_G, these occurrence numbers of each fold are used for Local Tendency Fusion QuickRBF method. And the occurrence numbers of RS126 dataset are used for Global Tendency Fusion QuickRBF method.

Threshold: 9% ; 36% Threshold: 25% component dataset Oe Ob Oe + Ob Fold_A 1983 2293 4276 Fold_B 1844 1861 3705 Fold_C 1629 1835 3464 Fold_D 1842 2012 3854 Fold_E 1400 1471 2871 Fold_F 1366 1405 2771 Fold_G 1709 2142 3851 RS126 11773 13019 24792 component dataset Oe Oi Ob Oe + Oi + Ob Fold_A 1524 1220 1532 4276 Fold_B 1441 1090 1174 3705 Fold_C 1269 1026 1169 3464 Fold_D 1436 1196 1222 3854 Fold_E 1081 829 961 2871 Fold_F 1036 835 900 2771 Fold_G 1271 1204 1376 3851 RS126 9058 7400 8334 24792

(35)

Table 2.3.(continued) Threshold: 16% component dataset Oe Ob Oe + Ob Fold_A 2373 1903 4276 Fold_B 2231 1474 3705 Fold_C 1977 1487 3464 Fold_D 2261 1593 3854 Fold_E 1679 1192 2871 Fold_F 1630 1141 2771 Fold_G 2083 1768 3851 RS126 14234 10558 24792 Threshold: 9% component dataset Oe Ob Oe + Ob Fold_A 2744 1532 4276 Fold_B 2531 1174 3705 Fold_C 2295 1169 3464 Fold_D 2632 1222 3854 Fold_E 1910 961 2871 Fold_F 1871 900 2771 Fold_G 2475 1376 3851 RS126 16458 8334 24792

(36)

Table 2.3.(continued) Threshold: 5% component dataset Oe Ob Oe + Ob Fold_A 3028 1248 4276 Fold_B 2769 936 3705 Fold_C 2502 962 3464 Fold_D 2866 988 3854 Fold_E 2098 773 2871 Fold_F 2048 723 2771 Fold_G 2773 1078 3851 RS126 18084 6708 24792 Threshold: 0% component dataset Oe Ob Oe + Ob Fold_A 3652 624 4276 Fold_B 3297 408 3705 Fold_C 3010 454 3464 Fold_D 3378 476 3854 Fold_E 2536 335 2871 Fold_F 2451 320 2771 Fold_G 3360 491 3851 RS126 21684 3108 24792

(37)

2.6.5 Global Tendency Fusion QuickRBF Approach

The global tendency fusion strategy is mostly the same with the local tendency

fusion strategy. The difference between these two tendency fusion strategies is that

these three occurrence numbers used for the global tendency fusion strategy are the

occurrence numbers of three kind of components in the RS126 data set.

Fig. 2.10. Architecture of Global Tendency Fusion QuickRBF method. These three occurrence numbers are based on the RS126 data set.

outputs from QuickRBF

Global Tendency Fusion Rule: { E (one-stage) + E (two-stage) } * Oe { I (one-stage) + I (two-stage) } * Oi { B (one-stage) + B (two-stage) } * Ob

outputs from Two-Stage QuickRBF

(38)

Chapter 3. Experiment and Simulation Results

3.1 Experiment Procedure of Five QuickRBF Approaches

Five different kind of QuickRBF approaches are applied on three-state, E, I, and

B, and two-state, E and B, relative solvent accessibility predictions. These five

methods are QuickRBF, Two-Stage QuickRBF, Common Fusion QuickRBF, Local

Tendency Fusion QuickRBF, and Global Tendency Fusion QuickRBF.

For QuickRBF, One-Stage QuickRBF approach, each residue is coded as a

21-dimensional vector, where the first 20 elements of the vector are the corresponding

elements in PSI-BLAST matrix and the last unit was added in order to allow a

window to extend over the N- and the C-terminus. The window length is 17 and the

dimension of the feature vector is 21×17. The number of the centers randomly

selected from the training data set is 12000 and the bandwidth is five for each kernel

function. The architecture of QuickRBF in the three-state prediction is shown

previously in Fig. 2.6.

For Two-Stage QuickRBF, the window lengths are 17 for the first layer and 15

for the second layer. The dimension of the feature vector for the first layer is 21×17.

The dimensions of the feature vectors for the second layer are 4×15 in the three-state

prediction and 3×15 in the two-state prediction. The numbers of the centers randomly

selected from the training data set are 12000 for the first layer and 500 for the second

layer. The bandwidths are both five for the first and second layer. The architecture of

(39)

These three kind of fusion strategies, Common Fusion, Local Tendency Fusion,

and Global Tendency Fusion, are the combinations of One-Stage QuickRBF and

Two-Stage QuickRBF. Different rules are used for each fusion strategy. The

architectures of these three fusion strategies in the three-state prediction are shown

previously in Figs. 2.8, 2.9, and 2.10.

3.2 Classification Accuracy of Five QuickRBF Approaches

The results and the comparison of the five different kind of QuickRBF

approaches on RS126 data set are listed in Table 3.1. On the RS126 data set,

QuickRBF gave the overall prediction accuracy 59.67% for the three-state prediction

with respect to thresholds: 9%; 36% and 87.99%, 80.06%, 78.46%, 76.98%, 75.60%,

respectively for the two-state prediction with respect to thresholds of 0%, 5%, 9%,

16%, 25%.

Two-Stage QuickRBF gave the overall prediction accuracy 59.11% for the

three-state prediction with respect to thresholds: 9%; 36% and 87.55%, 79.25%,

78.21%, 76.15%, 73.76%, respectively for the two-state prediction with respect to

thresholds of 0%, 5%, 9%, 16%, 25%.

Common Fusion QuickRBF gave the overall prediction accuracy 60.07% for the

three-state prediction with respect to thresholds: 9%; 36% and 87.82%, 80.15%,

78.58%, 77.13%, 75.66%, respectively for the two-state prediction with respect to

(40)

Local Tendency Fusion QuickRBF gave the overall prediction accuracy 59.91%

for the three-state prediction with respect to thresholds: 9%; 36% and 87.60%,

77.20%, 75.84%, 76.54%, 75.55%, respectively for the two-state prediction with

respect to thresholds of 0%, 5%, 9%, 16%, 25%.

Global Tendency Fusion QuickRBF gave the overall prediction accuracy 59.91%

for the three-state prediction with respect to thresholds: 9%; 36% and 87.60%,

77.25%, 75.82%, 76.43%, 75.48%, respectively for the two-state prediction with

respect to thresholds of 0%, 5%, 9%, 16%, 25%.

The accuracy plot of the above the five kind QuickRBF approaches is shown in

Fig. 3.1 and Common Fusion QuickRBF is numbered three. Among these five

QuickRBF approaches, Common Fusion QuickRBF has better performance for either

the three-state prediction or the two-state prediction. Common Fusion QuickRBF is

then decided as our modified QuickRBF approach because of the better performance

(41)

Table 3.1. RSA classification accuracy of five kind QuickRBF methods on the RS126 data set with PSI-BLAST pssm profiles.

QuickRBF accuracy: % threshold dataset 3-state (9% ; 36%) 2-state (0%) 2-state (5%) 2-state (9%) 2-state (16%) 2-state (25%) Fold_A 61.25 86.72 80.52 78.70 78.16 76.92 Fold_B 60.11 88.77 81.30 79.65 77.84 76.06 Fold_C 62.99 87.90 81.76 81.18 79.19 77.77 Fold_D 59.37 88.53 80.95 79.09 77.09 75.61 Fold_E 60.85 88.65 81.92 80.22 78.65 76.63 Fold_F 59.11 89.03 81.60 79.18 76.18 74.27 Fold_G 53.99 86.32 72.34 71.18 71.75 71.90 Average 59.67 87.99 80.06 78.46 76.98 75.60 Two-Stage QuickRBF accuracy: % threshold dataset 3-state (9% ; 36%) 2-state (0%) 2-state (5%) 2-state (9%) 2-state (16%) 2-state (25%) Fold_A 60.71 86.62 80.43 78.30 76.52 74.58 Fold_B 60.59 88.18 78.11 78.79 76.73 75.17 Fold_C 62.73 88.14 81.41 81.21 77.34 77.74 Fold_D 57.16 88.30 79.94 78.15 75.71 72.00 Fold_E 59.80 88.92 81.85 80.49 77.81 72.34 Fold_F 58.72 88.92 80.04 77.55 75.78 72.21 Fold_G 54.06 83.80 72.97 72.97 73.18 72.29 Average 59.11 87.55 79.25 78.21 76.15 73.76

(42)

Table 3.1. (continued)

Common Fusion QuickRBF

accuracy: % threshold dataset 3-state (9% ; 36%) 2-state (0%) 2-state (5%) 2-state (9%) 2-state (16%) 2-state (25%) Fold_A 61.20 86.67 80.40 78.53 77.95 76.19 Fold_B 61.27 88.88 80.59 79.54 78.19 76.90 Fold_C 63.40 88.08 81.67 81.67 78.96 78.09 Fold_D 59.06 88.43 80.83 78.54 76.93 74.73 Fold_E 61.20 88.75 82.41 80.74 78.34 76.04 Fold_F 59.33 88.70 81.81 78.85 76.69 74.92 Fold_G 55.05 85.23 73.36 72.16 72.84 72.76 Average 60.07 87.82 80.15 78.58 77.13 75.66

Local Tendency Fusion QuickRBF

of accuracy: % threshold dataset 3-state (9% ; 36%) 2-state (0%) 2-state (5%) 2-state (9%) 2-state (16%) 2-state (25%) Fold_A 60.69 85.45 76.38 75.89 75.96 76.31 Fold_B 60.89 88.99 79.00 77.11 77.30 76.87 Fold_C 62.96 87.01 78.41 78.03 79.27 78.38 Fold_D 59.70 87.68 78.20 76.99 77.19 74.18 Fold_E 60.92 88.30 78.44 77.01 77.85 75.62 Fold_F 59.18 88.49 78.20 77.05 76.15 74.77 Fold_G 55.03 87.25 71.80 68.79 72.06 72.73 Average 59.91 87.60 77.20 75.84 76.54 75.55

(43)

Table 3.1. (continued)

Global Tendency Fusion QuickRBF

accuracy: % threshold dataset 3-state (9% ; 36%) 2-state (0%) 2-state (5%) 2-state (9%) 2-state (16%) 2-state (25%) Fold_A 60.29 85.48 76.08 75.07 75.16 76.61 Fold_B 61.16 89.02 79.16 77.38 77.92 76.65 Fold_C 63.05 87.01 78.32 78.06 79.30 78.32 Fold_D 59.68 87.68 78.41 77.43 77.30 74.16 Fold_E 60.89 88.26 78.44 77.05 78.23 75.44 Fold_F 59.26 88.49 78.49 77.27 76.18 74.67 Fold_G 55.05 87.25 71.88 68.50 70.89 72.50 Average 59.91 87.60 77.25 75.82 76.43 75.48

Comparison of five kind QuickRBF methods

accuracy: % threshold method 3-state (9% ; 36%) 2-state (0%) 2-state (5%) 2-state (9%) 2-state (16%) 2-state (25%) QuickRBF 59.67 87.99 80.06 78.46 76.98 75.60 Two-Stage QuickRBF 59.11 87.55 79.25 78.21 76.15 73.76

Common Fusion QuickRBF 60.07 87.82 80.15 78.58 77.13 75.66

Local Tendency Fusion QuickRBF 59.91 87.60 77.20 75.84 76.54 75.55

(44)

50 55 60 65 70 75 80 85 90 9& 36 0 5 9 16 25 T hreshold (%) Ac cu ra cy ( % )

1: QuickRBF

2: Two-Stage QuickRBF

3: Common Fusion QuickRBF

4: Local Tendency Fusion QuickRBF

5: Global Tendency Fusion QuickRBF

(45)

3.3 Matthew’s Correlation Coefficients of Five QuickRBF Approaches

Another measure used to evaluate the performance of prediction methods is the

Matthew’s Correlation Coefficient (MCC). It can be calculated from an accuracy table

A by the following equations:

ij

A = number of residues predicted to be in type j and observed to be in type i,

) )( )( )( ( i i i i i i i i i i i i i o n u n o p u p o u n p MCC + + + + − = , i p = A , ii i n =

∑∑

≠ ≠ 3 3 i j k i jk A , i o =

≠ 3 i j ji A , i u =

≠ 3 i j ij A , for i = E, I, B.

Also, pi , ni , oi and ui are the number of true positives, true negatives, false positives

and false negatives for class i, respectively. The MCCs have the same value for the

two classes in the case of the two-state prediction, i.e. MCCE = MCCB.

First, the accuracy tables A of Common Fusion QuickRBF on each fold and the

RS126 data set is shown in Table 3.2. Then, the MCCs of five kind QuickRBF

methods on the RS126 data set is shown in Table 3.3. In a similar trend as Table 3.1,

MCC’s of Common Fusion QuickRBF usually perform well, although not always the

(46)

Table 3.2. The accuracy tables A of Common Fusion QuickRBF on each fold and RS126. 3-state (9%; 36%) AEE AII ABB AEI AEB AIE AIB ABE ABI Fold_A 1276 294 1042 141 107 587 339 277 213 Fold_B 1134 285 837 187 120 455 350 144 193 Fold_C 1033 257 905 130 106 433 336 138 126 Fold_D 1067 276 940 180 189 433 487 145 137 Fold_E 843 196 724 132 106 347 286 120 117 Fold_F 816 170 645 113 107 387 278 155 100 Fold_G 835 279 1011 207 229 434 491 174 191 RS126 7004 1757 6104 1090 964 3076 2567 1153 1077 2-state (25%) AEE ABB AEB ABE Fold_A 1695 1563 288 730 Fold_B 1417 1432 427 429 Fold_C 1247 1458 382 377 Fold_D 1244 1636 598 376 Fold_E 952 1231 448 240 Fold_F 996 1080 370 325 Fold_G 1109 1693 600 449 RS126 8660 10093 3113 2926 2-state (16%) AEE ABB AEB ABE Fold_A 2134 1199 239 704 Fold_B 1842 1055 389 419 Fold_C 1613 1122 364 365 Fold_D 1803 1162 458 431 Fold_E 1402 847 277 345 Fold_F 1368 757 262 384 Fold_G 1596 1209 487 559 RS126 11758 7351 2476 3207

(47)

Table 3.2. (continued) 2-state (9%) AEE ABB AEB ABE Fold_A 2489 869 255 663 Fold_B 2228 719 303 455 Fold_C 2051 778 244 391 Fold_D 2300 727 332 495 Fold_E 1742 576 168 385 Fold_F 1657 528 214 372 Fold_G 2003 776 472 600 RS126 14470 4973 1988 3361 2-state (5%) AEE ABB AEB ABE Fold_A 2818 620 210 628 Fold_B 2449 537 320 399 Fold_C 2259 570 243 392 Fold_D 2660 455 206 533 Fold_E 1939 427 159 346 Fold_F 1885 382 163 341 Fold_G 2245 580 528 498 RS126 16255 3571 1829 3137 2-state (0%) AEE ABB AEB ABE Fold_A 3612 94 40 530 Fold_B 3206 87 91 321 Fold_C 2956 95 54 359 Fold_D 3330 78 48 398 Fold_E 2473 75 63 260 Fold_F 2416 42 35 278 Fold_G 3202 80 158 411 RS126 21195 551 489 2557

(48)

Table 3.3. Matthew’s Correlation Coefficients of Five Kind QuickRBF Methods on RS126. 3-state (9%; 36%) MCC method MCCE MCCI MCCB QuickRBF 0.477 0.146 0.500 Two-Stage QuickRBF 0.485 0.130 0.491

Common Fusion QuickRBF 0.488 0.142 0.502

Local Tendency Fusion QuickRBF 0.487 0.123 0.508 Global Tendency Fusion QuickRBF 0.489 0.123 0.505

2-state (25%) MCC

method MCCE = MCCB

QuickRBF 0.517

Two-Stage QuickRBF 0.476

Common Fusion QuickRBF 0.511

Local Tendency Fusion QuickRBF 0.439

Global Tendency Fusion QuickRBF 0.440

2-state (16%) MCC

method MCCE = MCCB

QuickRBF 0.524

Two-Stage QuickRBF 0.512

Common Fusion QuickRBF 0.528

Local Tendency Fusion QuickRBF 0.514

(49)

Table 3.3. (continued) 2-state (9%) MCC method MCCE = MCCB QuickRBF 0.493 Two-Stage QuickRBF 0.499

Common Fusion QuickRBF 0.500

Local Tendency Fusion QuickRBF 0.513

Global Tendency Fusion QuickRBF 0.513

2-state (5%) MCC

method MCCE = MCCB

QuickRBF 0.449

Two-Stage QuickRBF 0.457

Common Fusion QuickRBF 0.464

Local Tendency Fusion QuickRBF 0.482

Global Tendency Fusion QuickRBF 0.483

2-state (0%) MCC

method MCCE = MCCB

QuickRBF 0.249

Two-Stage QuickRBF 0.262

Common Fusion QuickRBF 0.256

Local Tendency Fusion QuickRBF 0.281

(50)

3.4 Comparison with other Approaches

Comparison of performance of modified QuickRBF approach with other

methods in RSA prediction on the RS126 data set is shown in Table 3.4. Accuracy

plot of modified QuickRBF approach and other methods is shown in Fig. 3.2.

Modified QuickRBF, Common Fusion, is our method, and reported 60.1% for

the three-state prediction with respect to 9%; 36% thresholds, and 87.8%, 80.2%,

78.6%, 77.1%, 75.7%, respectively for the two-state predictions with respect to

thresholds of 0%, 5%, 9%, 16%, 25%.

Fuzzy k-NN (Sim, Kim and Lee, 2005) used fuzzy k-nearest neighbor method

[23] using PSI-BLAST profiles as feature vectors, and shows slightly better prediction

accuracies than other methods on the RS126 data set ,and reported 63.8% for the

three-state prediction with respect to 9%; 36% thresholds, and 87.2%, 82.2%, 79.0%,

78.3%, respectively for the two-state predictions with respect to thresholds of 0%, 5%,

16%, 25%.

PHDacc (Rost and Sander, 1994) used a neural network method [3] using

evolutionary profiles of amino acid substitutions derived from multiple sequence

alignments, and reported 57.5% for the three-state prediction with respect to 9%; 36%

thresholds, and 86.0%, 74.6%, 75.0%, respectively for the two-state predictions with

(51)

SVMpsi (Kim and Park, 2004) was based on a support vector machine [21] using

the position-specific scoring matrix generated from PSI-BLAST, and reported 59.6%

accuracy for the three-state prediction with respect to 9%; 36% thresholds and 86.2%,

79.8%, 77.8%, 76.8%, respectively accuracies for the two-state predictions with

respect to thresholds of 0%, 5%, 16%, 25%.

Two-Stage SVMpsi (Nguyen and Rajapakse, 2005) used a Two-Stage SVMpsi

approach [22] using the position-specific scoring matrix generated from PSI-BLAST,

and reported 90.2%, 83.5%, 81.3%, 79.4%, respectively accuracies for the two-state

predictions with respect to thresholds of 0%, 5%, 9%, 16%. These prediction

accuracies are obtained from their published results. The three state accuracy of

Modified QuickRBF is 60.1%, which is 3.7% lower than Fuzzy k-NN (63.8%) and

(52)

Table 3.4. Comparison of performance of modified QuickRBF approach with other methods in RSA prediction on the RS126 data set with PSSMs generated by PSI-BLAST.

PHDacc (Rost and Sander, 1994) used neural networks [3].

SVMpsi (Kim and Park, 2004) was based on support vector machine [21].

Two-Stage SVMpsi (Nguyen and Rajapakse, 2005) used a two-stage SVM approach [22]. Fuzzy k-NN (Sim, Kim and Lee, 2005) used fuzzy k-nearest neighbor method [23].

accuracy: % threshold method 3-state (9%; 36%) 2-state (0%) 2-state (5%) 2-state (9%) 2-state (16%) 2-state (25%)

Modified QuickRBF (Common Fusion) 60.1 87.8 80.2 78.6 77.1 75.7

PHDacc 57.5 86.0 ─ 74.6 75.0 ─

SVMpsi 59.6 86.2 79.8 ─ 77.8 76.8

Two-Stage SVMpsi ─ 90.2 83.5 81.3 79.4 ─

(53)

50 55 60 65 70 75 80 85 90 95 9& 36 0 5 9 16 25 Threshold (%) A ccu racy ( % )

1: Modified QuickRBF (Common Fusion)

2: PHDacc

3: SVMpsi

4: Two-stage SVMpsi

5: Fuzzy k-NN

(54)

Chapter 4.

Conclusion and Discussion

In this study, we have applied the five QuickRBF approaches, which are

QuickRBF, Two-Stage QuickRBF, Common Fusion QuickRBF, Local Tendency

Fusion QuickRBF, and Global Tendency Fusion QuickRBF, to predict relative solvent

accessibility, using PSI-BLAST profiles as feature vectors. Our best method,

Common Fusion QuickRBF, achieved the similar performance as the researches did in

the recent years. Because the goal of this thesis was to provide a new approach for

relative solvent accessibility, the results suggest that the modified QuickRBF

approach is a successful one.

In the future strategy, we can apply our method on a larger data set. Data set

growth can give an indirect advantage to our method. And our modified QuickRBF

(55)

References

[1] J. Chandonia and M. Karplus, “New methods for accurate prediction of protein

secondary structure,” Protein Engineering, vol. 35, pp. 293–306, 1999.

[2] D.W. Mount, “Bioinformatics: sequence and genome analysis,” Cold Spring

Harbor, NY: Cold Spring Harbor Laboratory Press, 2001.

[3] B. Rost and C. Sander, “Conservation and prediction of solvent accessibility in

protein families,” Proteins, vol. 20, pp. 216–226, 1994.

[4] M.J. Thompson and R.A. Goldstein, “Predicting solvent accessibility: higher

accuracy using Bayesian statistics and optimized residue substitution classes,”

Proteins, vol. 47, pp. 142–153, 1996.

[5] S. Pascarella, R.D. Persio, F. Bossa, and P. Argos, “Easy method to predict

solvent accessibility from multiple protein sequence alignments,” Proteins, vol.

32, pp. 190–199, 1999.

[6] X. Li and X.M. Pan, “New method for accurate prediction of solvent

accessibility from protein sequence,” Proteins, vol. 42, pp. 1–5, 2001.

[7] G. Pollastri, P. Baldi, P. Fariselli, and R. Casadio, “Prediction of coordination

number and relative solvent accessibility in proteins,” Proteins, vol. 47, pp.

142–153, 2002.

[8] S. Ahmad and M.M. Gromiha, “NETASA: neural network based prediction of

solvent accessibility,” Bioinformatics, vol. 18, pp. 819–824, 2002.

[9] H. Naderi-Manesh, M. Sadeghi, S. Araf, and A.A.M. Movahedi, “Predicting of

protein surface accessibility with information theory,” Proteins, vol. 42, pp.

(56)

[10] Y.Y. Ou, “QuickRBF is an Efficient Construction of Radial Basis Function

Networks with the Cholesky Decomposition,” http://csie.org/~yien/quickrbf/ .

[11] J.D. Bernal, I. Fankuchen, and M. Perutz, “An X-Ray study of chymotrypsin and

haemoglobin,” Nature, vol. 141, pp. 523–524, 1938.

[12] W.A. Kauzmann, “Some factors in the interpretation of protein denaturation,”

Protein Chem., vol. 14, pp. l–63, 1959.

[13] B.K. Lee and F.M. Richards, “The interpretation of protein structures: estimation

of static accessibility,” J. Mol. Biol., vol. 55, pp. 379–400, 1971.

[14] C. Chothia, “The nature of the accessible and buried surfaces in proteins,” J. Mol.

Biol., vol. 105, pp. 1–12, 1976.

[15] M.L. Connolly, “Solvent-accessible surfaces of proteins and nucleic acids,”

Science, vol. 221, pp. 709–713, 1983.

[16] W. Kabsch and C. Sander, “Dictionary of protein secondary structure: pattern

recognition of hydrogen-bonded and geometrical features,” Biopolymers, vol. 22,

pp. 2577–2637, 1983.

[17] J. Janin, “Surface and inside volumes in globular proteins,” Nature, vol. 277, pp.

491–492, 1979.

[18] G.D. Rose, A.R. Geselowitz, G.J. Lesser, R.H. Lee, and M.H. Zehfus,

“Hydrophobicity of amino acid residues in globular proteins,” Science, vol. 229,

pp. 834–838, 1985.

[19] C. Sander, M. Scharf, and R. Schneider, “Design of protein structures in: Protein

Engineering: A practical Approach,” Oxford University press, pp. 89–115, 1992

[20] H. Hirakawa and S. Kuhara, “Prediction of Hydrophobic Cores of Proteins Using

Wavelet Analysis,” Genome Inform Ser Workshop Genome Information, vol. 8,

(57)

[21] H. Kim and H. Park, “Prediction of protein relative solvent accessibility with

support vector machines,” Proteins, vol. 54, pp. 557–562, 2004.

[22] M.N. Nguyen and J.C. Rajapakse, “Prediction of protein relative solvent

accessibility with a two-stage SVM approach,” Proteins, vol. 59, pp. 30–37,

2005.

[23] J. Sim, S.Y. Kim, and J. Lee, “Prediction of protein solvent accessibility using

fuzzy k-nearest neighbor method,” Bioinformatics, vol. 21, pp. 2844–2849, 2005.

[24] D.T. Jones, “Protein secondary structure prediction based on position specific

scoring matrices,” J. Mol. Biol., vol. 292, pp. 195–202, 1999.

[25] B. Rost and C. Sander, “Combining evolutionary information and neural

networks to predict protein secondary structure,” Proteins, vol. 19, pp. 55–72,

1994.

[26] G.J. Barton, “Protein secondary structure prediction,” Curr. Opin. Struct. Biol.,

vol. 5, pp. 372–376, 1995.

[27] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, and G. Soda, “Exploiting the past

and the future in protein secondary structure prediction,” Bioinformatics, vol. 15,

pp. 937–946, 1999.

[28] S.F. Altschul et al., “Gapped BLAST and PSI-BLAST: a new generation of

protein database search programs,” Nucleic Acids Res., vol. 25, pp. 3389–3402,

1997.

[29] D.T. Jones, “Protein secondary structure prediction based on position-specific

scoring matrices,” J. Mol. Biol., vol. 292, pp. 195–202, 1999.

[30] S. Henikoff and J.G. Henikoff, “Amino acid substitution matrices from protein

blocks,” Proc. Natl. Acad. Sci., vol. 89, pp. 10915–10919, 1992.

(58)

[32] S. Hua and Z. Sun, “A novel method of protein secondary structure prediction

with high segment overlap measure: support vector machine approach,” J. Mol.

數據

Fig. 2.1. Measure accessibility. Residue solvent accessibility is usually measured by  rolling a spherical water molecule over a protein surface and summing the area that  can be accessed by this molecule on each residue (typical values range from 0-300  Å
Table 2.2. Definition of solvent accessibility states.  z  Solvent accessibility:
Fig 2.2. Binary model: thick and dark line is buried residues; thin and light line is  exposed residues [20]
Fig. 2.3. Raw profile from PSI-Blast log file
+7

參考文獻

相關文件

Strands (or learning dimensions) are categories of mathematical knowledge and concepts for organizing the curriculum. Their main function is to organize mathematical

volume suppressed mass: (TeV) 2 /M P ∼ 10 −4 eV → mm range can be experimentally tested for any number of extra dimensions - Light U(1) gauge bosons: no derivative couplings. =&gt;

For pedagogical purposes, let us start consideration from a simple one-dimensional (1D) system, where electrons are confined to a chain parallel to the x axis. As it is well known

The observed small neutrino masses strongly suggest the presence of super heavy Majorana neutrinos N. Out-of-thermal equilibrium processes may be easily realized around the

incapable to extract any quantities from QCD, nor to tackle the most interesting physics, namely, the spontaneously chiral symmetry breaking and the color confinement.. 

(1) Determine a hypersurface on which matching condition is given.. (2) Determine a

• Formation of massive primordial stars as origin of objects in the early universe. • Supernova explosions might be visible to the most

The difference resulted from the co- existence of two kinds of words in Buddhist scriptures a foreign words in which di- syllabic words are dominant, and most of them are the