
Chapter 2. Materials and Methods

2.1 Training and Data Set

The set of 126 nonhomologous globular protein chains used in the experiment of Rost and Sander [3], referred to as the RS126 set, was used to evaluate the accuracy of the prediction. The proteins in the RS126 data set have less than 25% pairwise sequence identity. This set has been used to evaluate different methods of relative solvent accessibility (RSA) prediction, for example, PHDacc [3] and other methods [21]–[23]. In this paper, we performed a sevenfold cross-validation test on this set, partitioned as defined in Table 2.1. To avoid selecting extremely biased partitions, the RS126 set was divided into seven subsets with approximately the same composition of each type of RSA state. One subset was chosen as the testing set while the rest were merged into the training set. This procedure was repeated seven times to cover the whole RS126 data set.
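The hold-one-fold-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the fold names follow Table 2.1, the chain lists are a truncated illustrative subset, and `train_and_evaluate` is a hypothetical callback that stands in for the actual training and scoring code.

```python
from typing import Callable, Dict, List


def cross_validate(
    folds: Dict[str, List[str]],
    train_and_evaluate: Callable[[List[str], List[str]], float],
) -> Dict[str, float]:
    """Hold out each fold once as the test set and train on all other folds."""
    scores = {}
    for test_name, test_chains in folds.items():
        # Merge every other fold into the training set.
        train_chains = [
            chain
            for name, chains in folds.items()
            if name != test_name
            for chain in chains
        ]
        scores[test_name] = train_and_evaluate(train_chains, test_chains)
    return scores


# Illustrative subset only: two chains per fold, taken from Table 2.1.
example_folds = {
    "Fold_A": ["256b_A", "2aat"],
    "Fold_B": ["2cab", "7cat_A"],
    "Fold_C": ["1eca", "6dfr"],
}

# Hypothetical scorer that simply reports the training-set size.
sizes = cross_validate(example_folds, lambda train, test: float(len(train)))
```

With all seven folds of Table 2.1 supplied, the loop runs seven times, so every chain is used for testing exactly once.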

Table 2.1. The database of non-homologous proteins used for seven-fold cross validation. All proteins have less than 25% pairwise similarity for lengths greater than 80 residues.

Fold_A   256b_A  2aat    8abp    6acn    1acx    8adh    3ait
         2ak3_A  2alp    9api_A  9api_B  1azu    1cyo    1bbp_A
         1bds    1bmv_1  1bmv_2  3blm    4bp2

Fold_B   2cab    7cat_A  1cbh    1cc5    2ccy_A  1cdh    1cdt_A
         3cla    3cln    4cms    4cpa_I  6cpa    6cpp    4cpv
         1crn    1cse_I  6cts    2cyp    5cyt_R

Fold_C   1eca    6dfr    3ebx    5er2_E  1etu    1fc2_C  1fdl_H
         1dur    1fkf    1fnd    2fxb    1fxi_A  2fox    1g6n_A
         2gbp    1a45    1gd1_O  2gls_A  2gn5

Fold_D   1gpl    4gr1    1hip    6hir    3hmg_A  3hmg_B  2hmz_A
         5hvp_A  2i1b    3icb    7icd    1il8_A  9ins_B  1l58
         1lap    5ldh    1gdj    2lhb    1lmb_3

Fold_E   2ltn_A  2ltn_B  5lyz    1mcp_L  2mev_4  2or1_L  1ovo_A
         1paz    9pap    2pcy    4pfk    3pgm    2phh    1pyp
         1r09_2  2pab_A  2mhu    1mrt    1ppt

Fold_F   1rbp    1rhd    4rhv_1  4rhv_3  4rhv_4  3rnt    7rsa
         2rsp_A  4rxn    1s01    3sdh_A  4sgb_I  1sh1    2sns
         2sod_B  2stv    2tgp_I  1tgs_I  3tim_A

Fold_G   1bks_A  1bks_B  1tnf_A  1ubq    2tmv_P  2tsc_A  2utg_A
         2wrp_R  4ts1_A  4xia_A  6tmn_E  9wga_A

2.2 The Definition of Protein Solvent Accessibility

2.2.1 Static Residue Solvent Accessibility

The native structure of globular proteins exists only in the presence of water [11], and therefore the analysis of their interactions with water is central to the theory of protein structure [12]. The term “accessible surface area” was introduced by Lee and Richards [13] to quantitatively describe the extent to which atoms on the protein surface can form contacts with water. For a particular protein atom, it is defined as the area over which the centre of a water molecule can be placed while retaining van der Waals contact with that atom and not penetrating any other atom. The principal goal here is to predict the extent to which a residue embedded in a protein structure is accessible to solvent. Solvent accessibility can be described in several ways [13]–[15]. The most detailed fast method compiles solvent accessibility by estimating the volume of a residue embedded in a structure that is exposed to solvent, as shown in Fig. 2.1; this method was developed by Lee and Richards [13] and later implemented in DSSP [16]. Different residue types have different maximum accessible areas.

Studies of solvent accessibility in proteins have led to many new insights into protein structure [13]–[18]. Knowledge of solvent accessibility has proved useful for identifying protein function, sequence motifs, and domains; for formulating hypotheses about antigenic determinants, site-directed mutagenesis, and humanization of antibodies; and for assessing the correctness of designed or experimentally determined protein structures. Furthermore, knowledge of solvent accessibility has assisted alignments in regions of remote sequence identity.

Fig. 2.1. Measuring accessibility. Residue solvent accessibility is usually measured by rolling a spherical water molecule over a protein surface and summing the area that can be accessed by this molecule on each residue (typical values range from 0 to 300 Å²). To allow comparisons between the accessibility of long extended and spherical amino acids, relative values are typically compiled (actual area as a percentage of the maximally accessible area). A more simplified description distinguishes two states: exposed (here residues numbered 1–3 and 10–12) and buried (here residues 4–9). Since the packing density of native proteins resembles that of crystals, values for solvent accessibility provide upper and lower limits to the number of possible inter-residue contacts.

2.2.2 Residue Relative Solvent Accessibility

How can the solvent accessibility of a residue embedded in a 3D structure be cast into a simple number? One simple way is to count the number of water molecules in direct contact with the residue, as estimated by the program DSSP for the first hydration shell. For comparison between amino acids of different sizes, the relative solvent accessibility is a useful quantity as defined in Table 2.2.

Amino acid relative solvent accessibility is the degree to which a residue in a protein is accessible to a solvent molecule. The relative solvent accessibility can be calculated by the following formula:

RelAcc (%) = 100 × Acc / MaxAcc,

where Acc is the solvent-accessible surface area of the residue observed in the 3D structure, given in Å² and calculated from coordinates by the Dictionary of Protein Secondary Structure (DSSP) program [16], and MaxAcc is the maximum solvent-accessible surface area of that kind of residue in an extended Gly-X-Gly tripeptide conformation. The number of water molecules around a residue can be approximated by Acc/10.

Table 2.2. Definition of solvent accessibility states.

• Solvent accessibility:

Acc = solvent accessibility of a residue (given in Å²) calculated from coordinates using DSSP [16]. W ≈ Acc/10 approximates the number of water molecules around the residue.

• Relative solvent accessibility:

RelAcc = Acc/MaxAcc, with maximal accessibility (measured in Å²) for the amino acids given by the following table (amino acids in one-letter code; B stands for D or N, Z for E or Q, and X for an undetermined amino acid) [18][19].

AA       A    B    C    D    E    F    G    H    I    K    L    M
MaxAcc   106  160  135  163  194  197  84   184  169  205  164  188

AA       N    P    Q    R    S    T    V    W    X    Y    Z
MaxAcc   157  136  198  248  130  142  142  227  180  222  196

• Two-state (binary) model for accessibility (B/E):

Thresholds to distinguish two states:

Buried (B)       Exposed (E)
RelAcc ≤ 0%      RelAcc > 0%
RelAcc < 5%      RelAcc ≥ 5%
RelAcc < 9%      RelAcc ≥ 9%
RelAcc < 16%     RelAcc ≥ 16%
RelAcc < 25%     RelAcc ≥ 25%

• Three-state (ternary) model for accessibility (B/I/E):

Thresholds to distinguish three states:

Buried (B)       Intermediate (I)       Exposed (E)
RelAcc < 9%      9% ≤ RelAcc < 36%      RelAcc ≥ 36%

• Measure for evaluation of conservation and accuracy of prediction:

Q₂: percentage of conserved, or correctly predicted, residues in two states defined by the thresholds given above.

Q₃: percentage of conserved, or correctly predicted, residues in three states defined by the thresholds given above.
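The Q₂ and Q₃ measures are both the percentage of residues whose predicted state matches the observed state; they differ only in the number of states. A minimal sketch, assuming that observed and predicted states are given as equal-length sequences of one-letter labels (the function name `q_accuracy` is illustrative):

```python
from typing import Sequence


def q_accuracy(observed: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if len(observed) == 0:
        raise ValueError("at least one residue is required")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


# Two-state example: three of four residues agree.
q2 = q_accuracy("BBEE", "BEEE")
# Three-state example: all residues agree.
q3 = q_accuracy("BIE", "BIE")
```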

RelAcc can hence adopt values between 0% and 100%, with 0% corresponding to a fully buried residue and 100% to a fully accessible residue. Different arbitrary threshold values of relative solvent accessibility are chosen to define either binary categories, buried and exposed, as shown in Fig. 2.2, or ternary categories, buried, intermediate, and exposed. The precise choice of the threshold is not well defined [3].

We used two kinds of class definitions: (1) buried (B) and exposed (E); and (2) buried (B), intermediate (I), and exposed (E). For the two-state (B and E) definition, we chose various thresholds of the relative solvent accessibility: 25%, 16%, 9%, 5%, and 0%. For the three-state (B, I, and E) description of relative solvent accessibility, the set of thresholds that we selected is the same as that in Rost and Sander [3]:

Buried (B): RelAcc < 9%

Intermediate (I): 9% ≤ RelAcc < 36%

Exposed (E): RelAcc ≥ 36%
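The definitions above can be made concrete with a short sketch. The MaxAcc values are copied from Table 2.2; the function names are illustrative, and the Acc values in the example are invented for demonstration rather than taken from DSSP output.

```python
# Maximal accessibility (in Å²) per residue type, copied from Table 2.2.
MAX_ACC = {
    "A": 106, "B": 160, "C": 135, "D": 163, "E": 194, "F": 197,
    "G": 84, "H": 184, "I": 169, "K": 205, "L": 164, "M": 188,
    "N": 157, "P": 136, "Q": 198, "R": 248, "S": 130, "T": 142,
    "V": 142, "W": 227, "X": 180, "Y": 222, "Z": 196,
}


def rel_acc(residue: str, acc: float) -> float:
    """Relative solvent accessibility in percent: 100 * Acc / MaxAcc."""
    return 100.0 * acc / MAX_ACC[residue]


def two_state(rel: float, threshold: float) -> str:
    """Binary model: buried (B) below the threshold, exposed (E) otherwise.

    For the 0% threshold, Table 2.2 defines buried as RelAcc <= 0%.
    """
    if threshold == 0:
        return "B" if rel <= 0 else "E"
    return "B" if rel < threshold else "E"


def three_state(rel: float, low: float = 9.0, high: float = 36.0) -> str:
    """Ternary model with the thresholds of Rost and Sander (9% and 36%)."""
    if rel < low:
        return "B"
    if rel < high:
        return "I"
    return "E"


# Invented example: a lysine with Acc = 41 Å² has RelAcc = 20%.
lysine_rel = rel_acc("K", 41.0)
lysine_state = three_state(lysine_rel)
```

The same residue can change class with the threshold: at RelAcc = 20% the lysine above is exposed under the 16% binary threshold but buried under the 25% threshold.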

Fig. 2.2. Binary model: the thick, dark line marks buried residues; the thin, light line marks exposed residues [20].

2.3 PSI-BLAST Profiles

It is well known that evolutionary information in the form of multiple alignments and profiles significantly improves the accuracy of, for instance, secondary structure prediction methods [4], [24]–[27]. This is so because the secondary structure of a family is more conserved than the primary amino acid sequence. Similar effects have been reported for the prediction of contact number and relative solvent accessibility.

For relative solvent accessibility, a corresponding increase of 5% has been described both with neural networks [25] and Bayesian methods.

PSI-BLAST [28] generates the profile of a protein in the form of an N×20 position-specific scoring matrix (PSSM), as shown in Fig. 2.3, where N is the length of the sequence. PSI-BLAST was run with default options together with -j 3, -h 0.001, and -e 10.0, against the non-redundant protein sequence database (ftp://ncbi.nlm.nih.gov/blast/db) filtered by PFILT [29] to mask out regions of low-complexity sequence, coiled-coil regions, and transmembrane spans. The BLOSUM62 [30] substitution matrix, shown in Fig. 2.4, was used for PSI-BLAST. These profiles were scaled to the required 0–1 range using the standard logistic function:

$$f(x) = \frac{1}{1 + \exp(-x)},$$

where x is the raw profile matrix value.
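A minimal sketch of this scaling step, assuming the raw profile has already been parsed into a list of rows with 20 values each (the parsing itself is not shown, and the function names are illustrative):

```python
import math
from typing import List


def logistic(x: float) -> float:
    """Standard logistic function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_profile(pssm: List[List[float]]) -> List[List[float]]:
    """Apply the logistic function to every raw PSSM value."""
    return [[logistic(value) for value in row] for row in pssm]


# Invented two-row excerpt of a raw profile (three columns shown for brevity).
raw = [[-3.0, 0.0, 5.0], [2.0, -1.0, 0.0]]
scaled = scale_profile(raw)
```

A raw score of 0 maps to 0.5, strongly negative scores approach 0, and strongly positive scores approach 1, so every scaled value lies strictly between 0 and 1 for the score magnitudes that occur in a PSSM.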

Fig. 2.3. Raw profile from the PSI-BLAST log file.

Fig. 2.4. BLOSUM 62 substitution matrix (lower) and difference matrix (upper) obtained by subtracting the PAM 160 matrix position by position. These matrices have identical relative entropies (0.70); the expected value of BLOSUM 62 is -0.52; that for PAM 160 is -0.57.

2.4 Quick Radial Basis Function Network

The radial basis function network (RBFN) is a special type of neural network with several distinctive features [31]. Since it was first proposed, the RBFN has attracted a high degree of interest in research communities. An RBFN consists of three layers: the input layer, the hidden layer, and the output layer. The input layer broadcasts the coordinates of the input vector to each of the nodes in the hidden layer. Each node in the hidden layer then produces an activation based on the associated radial basis function. Finally, each node in the output layer computes a linear combination of the activations of the hidden nodes. How an RBFN reacts to a given input stimulus is completely determined by the activation functions associated with the hidden nodes and the weights associated with the links between the hidden layer and the output layer. The general mathematical form of the output nodes in an RBFN is as follows:

$$c_j(x) = \sum_{i=1}^{k} w_{ji}\, \Phi\left(\lVert x - \mu_i \rVert; \sigma_i\right),$$

where $c_j(x)$ is the function corresponding to the j-th output unit (class j) and is a linear combination of k radial basis functions $\Phi(\cdot)$ with centers $\mu_i$ and bandwidths $\sigma_i$. Also, $w_j$ is the weight vector of class j, and $w_{ji}$ is the weight corresponding to the j-th class and the i-th center. The general architecture of the RBFN is shown in Fig. 2.5. Constructing an RBFN involves determining the number of centers, k, the center locations, $\mu_i$, the bandwidth of each center, $\sigma_i$, and the weights, $w_{ji}$. That is, training an RBFN involves determining the values of three sets of parameters, the centers ($\mu_i$), the bandwidths ($\sigma_i$), and the weights ($w_{ji}$), in order to minimize a suitable cost function.

Fig. 2.5. General architecture of Radial Basis Function Network

The QuickRBF package [10] focuses on the calculation of the weights and uses a simple method to determine the centers and bandwidths: the centers are selected randomly, and a fixed bandwidth, set to five, is used for every kernel function. After the centers and bandwidths of the kernel functions in the hidden layer have been determined, the transformation between the inputs and the corresponding outputs of the hidden units is fixed. The network can thus be viewed as an equivalent single-layer network with linear output units. The least mean square error method is then used to determine the weights associated with the links between the hidden layer and the output layer.

Ou used a single-layer Quick Radial Basis Function Network [10] to analyze protein secondary structure with excellent prediction results on the RS126 data set. More details about QuickRBF can be found in the QuickRBF package (http://csie.org/~yien/quickrbf/index.php). Here, we propose a modified QuickRBF system to predict protein relative solvent accessibility.
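The training scheme described above (random centers, one fixed bandwidth, weights from a least-squares fit) can be sketched in a few lines. This is not the QuickRBF implementation: the Gaussian kernel form, the tiny ridge term, and the Gauss-Jordan solver for the normal equations are assumptions chosen to keep the example self-contained, and all function names are illustrative.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def gaussian_kernel(x: Vector, center: Vector, bandwidth: float) -> float:
    """Assumed kernel form: exp(-||x - mu||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def solve_linear_system(a: List[List[float]], b: List[List[float]]) -> List[List[float]]:
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def train_rbfn(
    inputs: List[List[float]],
    targets: List[List[float]],
    num_centers: int,
    bandwidth: float = 5.0,
    ridge: float = 1e-8,
    seed: int = 0,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Pick random centers, then fit the output weights by least squares."""
    rng = random.Random(seed)
    centers = rng.sample(inputs, num_centers)
    # Hidden-layer activations: one row per sample, one column per center.
    phi = [[gaussian_kernel(x, c, bandwidth) for c in centers] for x in inputs]
    num_classes = len(targets[0])
    # Normal equations (Phi^T Phi + ridge * I) W = Phi^T T.
    gram = [
        [sum(row[i] * row[j] for row in phi) + (ridge if i == j else 0.0)
         for j in range(num_centers)]
        for i in range(num_centers)
    ]
    rhs = [
        [sum(row[i] * t[j] for row, t in zip(phi, targets)) for j in range(num_classes)]
        for i in range(num_centers)
    ]
    weights = solve_linear_system(gram, rhs)  # weights[i][j] plays the role of w_ji
    return centers, weights


def predict_rbfn(x: Vector, centers: List[List[float]], weights: List[List[float]],
                 bandwidth: float = 5.0) -> List[float]:
    """Evaluate c_j(x) = sum_i w_ji * Phi(||x - mu_i||; sigma) for every class j."""
    activations = [gaussian_kernel(x, c, bandwidth) for c in centers]
    return [sum(a * w[j] for a, w in zip(activations, weights))
            for j in range(len(weights[0]))]


# Toy one-dimensional data with one-hot targets for two classes.
toy_inputs = [[0.0], [1.0], [10.0], [11.0]]
toy_targets = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_centers, toy_weights = train_rbfn(toy_inputs, toy_targets, num_centers=4)
```

Because the centers and bandwidth are fixed before the weights are fitted, the only optimization is a linear least-squares problem, which is what makes this family of networks fast to train.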

2.5 Coding Scheme

As in Hua and Sun’s work [32], the present analysis used the classical local coding scheme of protein sequences with a sliding window. A PSI-BLAST matrix with n rows and 20 columns can be defined for a single sequence with n residues. For the first layer in the prediction, each residue is represented by a vector of 20 components based on the PSSM. To allow a window to extend over the N-terminus and the C-terminus, an additional 21st unit (spacer) was attached to each residue. Each input vector therefore has 21×w components, where w is the sliding window size. For the second layer, the vector corresponding to a residue has four elements in the three-state prediction and three elements in the two-state prediction: the first three elements represent the three relative solvent accessibility states, E, I, and B, in the three-state prediction, and the first two elements represent the two states, E and B, in the two-state prediction. In both cases, the last unit was added to allow a window to extend over the N-terminus and the C-terminus. If the window length is v, the dimension of the feature vector for the second layer is 4×v in the three-state prediction and 3×v in the two-state prediction.
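The coding scheme can be sketched as a single function that serves both layers. The text does not spell out how the spacer unit is filled, so this sketch assumes the common convention: the spacer is 1 for window positions that fall beyond either terminus (with the profile values set to 0 there) and 0 otherwise. The function name is illustrative.

```python
from typing import List


def encode_windows(profile: List[List[float]], window: int) -> List[List[float]]:
    """Build one input vector per residue from an n x d profile.

    Each window position contributes its d profile values plus one spacer unit,
    so every vector has (d + 1) * window components.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so that it is centered on a residue")
    n = len(profile)
    d = len(profile[0])
    half = window // 2
    vectors = []
    for center in range(n):
        vec: List[float] = []
        for pos in range(center - half, center + half + 1):
            if 0 <= pos < n:
                vec.extend(profile[pos])
                vec.append(0.0)  # spacer off: position lies inside the sequence
            else:
                vec.extend([0.0] * d)
                vec.append(1.0)  # spacer on: position lies beyond a terminus
        vectors.append(vec)
    return vectors


# First layer: 20 profile columns and a window of 17 give 21 * 17 = 357 components.
first_layer = encode_windows([[0.5] * 20 for _ in range(5)], window=17)
# Second layer (three-state): 3 tendencies and a window of 15 give 4 * 15 = 60.
second_layer = encode_windows([[0.2, 0.3, 0.5] for _ in range(4)], window=15)
```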

2.6 Several Prediction System Structures

Five different kinds of QuickRBF approaches are applied to three-state (E, I, and B) and two-state (E and B) relative solvent accessibility predictions. These five approaches are: (1) QuickRBF, (2) Two-Stage QuickRBF, (3) Common Fusion QuickRBF, (4) Local Tendency Fusion QuickRBF, and (5) Global Tendency Fusion QuickRBF.

2.6.1 QuickRBF Approach

A QuickRBF structure was used in the prediction system, as shown in Fig. 2.6. The QuickRBF classifier assigns each residue of each sequence to one of the three relative solvent accessibility states, E, I, or B, using the values of the PSI-BLAST profile matrices as inputs. Each output represents the tendency of the residue to belong to the corresponding state. The one-against-rest strategy was used for the multiclass classification, so each residue was assigned to the state with the largest output value.

Fig. 2.6. Architecture of QuickRBF method. The system includes two parts: the PSI-BLAST profile and the classifier. The profile is transformed into a number of 21×17-dimensional vectors using the sliding-window method. These vectors are input into the QuickRBF classifier. The outputs of the QuickRBF classifier are a number of 3D vectors representing the tendency of the residue to belong to each state. The one-against-rest strategy was used to classify each residue into the state with the largest value.

Diagram stages: PSI-BLAST profile; coding (transform the 17×20 matrix into a 17×21-dimensional vector); QuickRBF classifier (data normalization, classifier); outputs of the classifier.

2.6.2 Two-Stage QuickRBF Approach

A Two-Stage QuickRBF structure was used in the prediction system, as shown in Fig. 2.7. The first stage is a QuickRBF classifier that assigns each residue of each sequence to one of the three relative solvent accessibility states, E, I, or B, using the values of the PSI-BLAST profile matrices as inputs. Each output of the first stage represents the tendency of the residue to belong to the corresponding state. The second-stage QuickRBF classifier also assigns each residue to one of the three states, E, I, or B, but takes as inputs the three-state RSA tendency matrices output by the first stage. The outputs of the second stage likewise represent the tendency of the residue to belong to each state. As in the one-stage QuickRBF approach, the second stage uses the one-against-rest strategy, with each residue classified into the state with the largest output value.

Fig. 2.7. Architecture of Two-Stage QuickRBF method. The system includes three parts: the PSI-BLAST profile, the first stage, and the second stage. The profile is transformed into a number of 21×17-dimensional vectors using the sliding-window method. These vectors are input into the first-stage QuickRBF. The outputs of the first-stage QuickRBF are a number of 3D vectors representing the tendency of the residue to belong to each state. Using the sliding-window method, the outputs of the first-stage QuickRBF are transformed into a number of 4×15-dimensional vectors, which are used as the inputs of the second-stage QuickRBF. The final decisions are based on the outputs of the second-stage QuickRBF.

Diagram stages: PSI-BLAST profile; first stage (coding: transform the 17×20 matrix into a 17×21-dimensional vector; first-stage QuickRBF with data normalization); second stage (coding: transform the 3×15 matrix into a 4×15-dimensional vector; second-stage QuickRBF); outputs of the second stage.

2.6.3 Common Fusion QuickRBF Approach

Three kinds of fusion QuickRBF approaches were used to combine the outputs of a QuickRBF approach and the outputs of a Two-Stage QuickRBF approach. One is the Common Fusion QuickRBF approach, and the others are the Local Tendency Fusion QuickRBF approach and the Global Tendency Fusion QuickRBF approach. The architectures of these three approaches are illustrated in Figs. 2.8, 2.9, and 2.10.

The common fusion strategy adds up the tendency outputs from a QuickRBF approach and the tendency outputs from a Two-Stage QuickRBF approach. The one-against-rest strategy is then used to classify each residue into the state with the largest summed value.
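A minimal sketch of the common fusion rule for one residue, assuming the tendency outputs of each approach are available as a mapping from state label to value (the numbers below are invented for illustration, and the function names are hypothetical):

```python
from typing import Dict


def common_fusion(one_stage: Dict[str, float], two_stage: Dict[str, float]) -> Dict[str, float]:
    """Sum the one-stage and two-stage tendency outputs state by state."""
    return {state: one_stage[state] + two_stage[state] for state in one_stage}


def classify(scores: Dict[str, float]) -> str:
    """Pick the state with the largest value."""
    return max(scores, key=lambda state: scores[state])


# Invented tendencies for a single residue.
one_stage_out = {"E": 0.6, "I": 0.3, "B": 0.1}
two_stage_out = {"E": 0.2, "I": 0.6, "B": 0.2}
fused_state = classify(common_fusion(one_stage_out, two_stage_out))
```

In this example the one-stage classifier alone would choose E and the two-stage classifier alone would choose I; the summed tendencies favor I.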

Fig. 2.8. Architecture of Common Fusion QuickRBF method. The outputs from QuickRBF and the outputs from Two-Stage QuickRBF are combined by the common fusion rule to give the outputs of Common Fusion QuickRBF.

Common Fusion Rule:

E(one-stage) + E(two-stage)
I(one-stage) + I(two-stage)
B(one-stage) + B(two-stage)

2.6.4 Local Tendency Fusion QuickRBF Approach

The local tendency fusion strategy also adds up the tendency outputs from a QuickRBF approach and the tendency outputs from a Two-Stage QuickRBF approach. An occurrence number table, shown in Table 2.3, is then applied to the summation. There are three occurrence numbers, Oe, Oi, and Ob, where Oe is the number of exposed components in the test data, Oi is the number of intermediate components in the test data, and Ob is the number of buried components in the test data. Each summed tendency is multiplied by the occurrence number of its state, so the occurrence numbers act as weights that reflect how strongly each of the three relative solvent accessibility states is represented. In other words, the larger an occurrence number, the larger the tendency of a residue to be assigned to that state. The one-against-rest strategy is then used to classify each residue into the state with the largest value.
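The tendency fusion rule can be sketched in the same way as a weighted sum. The occurrence numbers below are the Fold_A values for the 9%/36% thresholds from Table 2.3; the tendency values are invented for illustration, and the function names are hypothetical.

```python
from typing import Dict


def tendency_fusion(
    one_stage: Dict[str, float],
    two_stage: Dict[str, float],
    occurrence: Dict[str, float],
) -> Dict[str, float]:
    """Weight each summed tendency by the occurrence number of its state."""
    return {
        state: (one_stage[state] + two_stage[state]) * occurrence[state]
        for state in one_stage
    }


def classify(scores: Dict[str, float]) -> str:
    """Pick the state with the largest value."""
    return max(scores, key=lambda state: scores[state])


# Occurrence numbers of Fold_A at the 9%/36% thresholds (Table 2.3).
fold_a = {"E": 1524, "I": 1220, "B": 1532}

# Invented tendencies for a single residue.
one_stage_out = {"E": 0.40, "I": 0.45, "B": 0.15}
two_stage_out = {"E": 0.35, "I": 0.40, "B": 0.25}

unweighted = classify(tendency_fusion(one_stage_out, two_stage_out, {"E": 1, "I": 1, "B": 1}))
weighted = classify(tendency_fusion(one_stage_out, two_stage_out, fold_a))
```

With equal weights the summed tendencies favor I (0.85 against 0.75 for E), whereas the Fold_A occurrence numbers tip the decision to E (0.75 × 1524 = 1143 against 0.85 × 1220 = 1037). Passing the RS126 totals from Table 2.3 instead of per-fold numbers gives the global variant of the rule.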

Fig. 2.9. Architecture of Local Tendency Fusion QuickRBF method. The three occurrence numbers are based on each test fold. The outputs from QuickRBF and the outputs from Two-Stage QuickRBF are combined by the local tendency fusion rule to give the outputs of Local Tendency Fusion QuickRBF.

Local Tendency Fusion Rule:

{ E(one-stage) + E(two-stage) } * Oe
{ I(one-stage) + I(two-stage) } * Oi
{ B(one-stage) + B(two-stage) } * Ob

Table 2.3. Occurrence numbers used for the Local and Global Tendency Fusion QuickRBF methods. The occurrence numbers of each fold, Fold_A to Fold_G, are used for the Local Tendency Fusion QuickRBF method, and the occurrence numbers of the whole RS126 data set are used for the Global Tendency Fusion QuickRBF method.

Threshold: 9% ; 36%

RS126 11773 13019 24792

Dataset   Oe     Oi     Ob     Oe + Oi + Ob
Fold_A    1524   1220   1532   4276
Fold_B    1441   1090   1174   3705
Fold_C    1269   1026   1169   3464
Fold_D    1436   1196   1222   3854
Fold_E    1081   829    961    2871
Fold_F    1036   835    900    2771
Fold_G    1271   1204   1376   3851
RS126     9058   7400   8334   24792

Threshold: 16%

Dataset   Oe      Ob      Oe + Ob
Fold_A    2373    1903    4276
Fold_B    2231    1474    3705
Fold_C    1977    1487    3464
Fold_D    2261    1593    3854
Fold_E    1679    1192    2871
Fold_F    1630    1141    2771
Fold_G    2083    1768    3851
RS126     14234   10558   24792

Threshold: 9%

Dataset   Oe      Ob      Oe + Ob
Fold_A    2744    1532    4276
Fold_B    2531    1174    3705
Fold_C    2295    1169    3464
Fold_D    2632    1222    3854
Fold_E    1910    961     2871
Fold_F    1871    900     2771
Fold_G    2475    1376    3851
RS126     16458   8334    24792

Threshold: 5%

Dataset   Oe      Ob      Oe + Ob
Fold_A    3028    1248    4276
Fold_B    2769    936     3705
Fold_C    2502    962     3464
Fold_D    2866    988     3854
Fold_E    2098    773     2871
Fold_F    2048    723     2771
Fold_G    2773    1078    3851
RS126     18084   6708    24792

Threshold: 0%

Dataset   Oe      Ob      Oe + Ob
Fold_A    3652    624     4276
Fold_B    3297    408     3705
Fold_C    3010    454     3464
Fold_D    3378    476     3854
Fold_E    2536    335     2871
Fold_F    2451    320     2771
Fold_G    3360    491     3851
RS126     21684   3108    24792

2.6.5 Global Tendency Fusion QuickRBF Approach

The global tendency fusion strategy is nearly identical to the local tendency fusion strategy. The only difference is that the three occurrence numbers used for the global tendency fusion strategy are the occurrence numbers of the three kinds of components in the whole RS126 data set rather than in a single test fold.

Fig. 2.10. Architecture of Global Tendency Fusion QuickRBF method. The three occurrence numbers are based on the RS126 data set. The outputs from QuickRBF and the outputs from Two-Stage QuickRBF are combined by the global tendency fusion rule to give the outputs of Global Tendency Fusion QuickRBF.

Global Tendency Fusion Rule:

{ E(one-stage) + E(two-stage) } * Oe
{ I(one-stage) + I(two-stage) } * Oi
{ B(one-stage) + B(two-stage) } * Ob

Chapter 3. Experiment and Simulation Results

3.1 Experiment Procedure of Five QuickRBF Approaches

Five different kinds of QuickRBF approaches are applied to three-state (E, I, and B) and two-state (E and B) relative solvent accessibility predictions. These five methods are QuickRBF, Two-Stage QuickRBF, Common Fusion QuickRBF, Local Tendency Fusion QuickRBF, and Global Tendency Fusion QuickRBF.

For the one-stage QuickRBF approach, each residue is coded as a 21-dimensional vector, where the first 20 elements are the corresponding elements of the PSI-BLAST matrix and the last unit was added to allow a window to extend over the N- and C-termini. The window length is 17, so the dimension of the feature vector is 21×17. The number of centers randomly selected from the training data set is 12000, and the bandwidth is five for each kernel function. The architecture of QuickRBF in the three-state prediction is shown in Fig. 2.6.

For Two-Stage QuickRBF, the window lengths are 17 for the first layer and 15 for the second layer. The dimension of the feature vector for the first layer is 21×17.

The dimensions of the feature vectors for the second layer are 4×15 in the three-state
