2. Material and methods

2.4 The input feature vectors

Several different input feature vectors for the support vector machine (SVM) were considered. We used the classical local coding scheme for protein sequences with a sliding window. In this study, the window size was set to 9. A “null” residue was added to allow the window to extend beyond the N- and C-termini.
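The windowing step can be illustrated with a minimal Python sketch. The placeholder symbol `-` for the “null” residue, the function name `sliding_windows`, and the example sequence are assumptions made for illustration; they are not taken from the original implementation.

```python
NULL = "-"  # hypothetical placeholder symbol for the "null" residue


def sliding_windows(sequence, window_size=9):
    """Return one window per residue, padded with NULL at both termini."""
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the window has a centre")
    half = window_size // 2
    padded = NULL * half + sequence + NULL * half
    # Window i is centred on residue i of the original sequence.
    return [padded[i:i + window_size] for i in range(len(sequence))]


windows = sliding_windows("MKTAYIAKQR")  # hypothetical 10-residue sequence
print(windows[0])   # window centred on the first residue
print(windows[-1])  # window centred on the last residue
```

With a window of 9, four null residues are needed on each side so that the first and last residues can sit at the centre of a full-length window.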

2.4.1 Sequence input vector

The amino acid type of each residue is encoded as a 20-dimensional vector consisting of nineteen “0”s and a single “1”; e.g., glycine is represented as (1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), alanine is represented as (0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), etc. The “null” residue is represented by a 20-dimensional all-zero vector.
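A minimal sketch of this encoding is given below. Only the positions of glycine and alanine are stated in the text; the order of the remaining 18 residues, the `-` null symbol, and the function names are assumptions for illustration.

```python
# Only the first two positions (G, A) follow the text; the order of the
# remaining 18 residues is an assumption made for this sketch.
AMINO_ACIDS = "GAVLIPFYWSTCMNQDEKRH"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(residue):
    """20-dimensional vector; any symbol outside the alphabet (e.g. null) is all-zero."""
    vector = [0] * len(AMINO_ACIDS)
    position = INDEX.get(residue)
    if position is not None:
        vector[position] = 1
    return vector


def encode_window(window):
    """Concatenate the per-residue vectors of one sliding window."""
    encoded = []
    for residue in window:
        encoded.extend(one_hot(residue))
    return encoded


features = encode_window("----MKTAY")  # hypothetical window with 4 null residues
print(len(features))  # 9 residues x 20 dimensions
```

Under this scheme a window of 9 residues yields a 180-dimensional sequence input, with the null positions contributing only zeros.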

2.4.2 Multiple sequence alignment (PSSM) input vector

Using multiple sequence alignments, we ran PSI-BLAST to detect distant homologues of a query sequence and to generate a position-specific scoring matrix (PSSM). The matrix has 21 × M elements, where M is the length of the target sequence, and each element represents the likelihood of that particular residue substitution at that position [27]. These profiles were scaled to the 0-1 range using the standard logistic function:

$$f(x) = \frac{1}{1 + \exp(-x)} \tag{2}$$

The “null” residue is represented by a 20-dimensional all-zero vector.
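The scaling in equation (2) can be sketched as follows. The raw scores are hypothetical and the rows are truncated to three columns for brevity; the function names are illustrative and do not come from the original implementation.

```python
import math


def logistic(x):
    """Standard logistic function of equation (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_profile(rows):
    """Apply the logistic function to every score of a PSSM, row by row."""
    return [[logistic(score) for score in row] for row in rows]


raw_rows = [[-3, 0, 5], [2, -1, 7]]  # hypothetical, truncated PSSM rows
scaled_rows = scale_profile(raw_rows)
for row in scaled_rows:
    print([round(value, 3) for value in row])
```

A raw score of 0 maps to 0.5, and no finite score maps to exactly 0, so the all-zero vector used for the null residue remains distinguishable from any scaled profile row.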

The procedure for processing the PSSM input vector is shown in Figure 2.

2.4.3 Secondary structure (SS) input vector

The secondary structure of each protein in the dataset was predicted by PSIPRED [28]. PSIPRED outputs the probabilities of the three secondary structure states (helix, strand, coil). Because these values lie in the range 0-1, the three probabilities of each residue could be used directly as a 3-dimensional vector. The “null” residue is represented by a 3-dimensional all-zero vector.
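As an illustration, the sketch below reads per-residue probabilities from PSIPRED-style text lines. It assumes the vertical layout in which each data line holds the residue index, the residue, the predicted state, and then the coil, helix, and strand probabilities in that order; this column order is an assumption that should be checked against the PSIPRED version in use. The example lines are hypothetical.

```python
def parse_ss_probabilities(lines):
    """Return one (helix, strand, coil) tuple per residue.

    Assumed column layout: index, residue, state, P(coil), P(helix), P(strand).
    """
    vectors = []
    for line in lines:
        fields = line.split()
        # Skip comment lines, blank lines, and anything not matching the layout.
        if line.startswith("#") or len(fields) != 6:
            continue
        coil, helix, strand = (float(value) for value in fields[3:6])
        vectors.append((helix, strand, coil))
    return vectors


example_lines = [  # hypothetical prediction output
    "# header line",
    "",
    "   1 M C   0.998  0.001  0.001",
    "   2 K H   0.100  0.850  0.050",
]
ss_vectors = parse_ss_probabilities(example_lines)
print(ss_vectors)
```

The tuple is reordered to (helix, strand, coil) only to match the order in which the states are listed in the text.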

The procedure for processing the secondary structure input vector is shown in Figure 3.

2.4.4 Relative solvent accessibility (RSA) input vector

Amino acid solvent accessibility is the degree to which a residue in a protein is accessible to a solvent molecule. Relative solvent accessibility is calculated by dividing the DSSP-defined solvent accessibility by the accessibility for a Gly-X-Gly tripeptide given by the method of Rose and Dworkin [29]. In this study, we used Jnet [30] to predict the relative solvent accessibility of each residue based on a two-state model (exposed/buried) at three thresholds: 25%, 5%, and 0% accessible. Each residue was represented as a 3-dimensional vector (RSA-3) composed of zeros and ones, where “one” means the residue was predicted as exposed at the given accessibility threshold and “zero” means buried. The “null” residue is represented by a 3-dimensional all-zero vector.
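The meaning of the three RSA-3 components can be illustrated with a short sketch. In the study the states are predicted by Jnet, not computed this way; the sketch only shows how a known relative accessibility would map onto the three binary flags. The strict “greater than” rule for “exposed”, the function names, and the numeric inputs are assumptions.

```python
THRESHOLDS = (0.25, 0.05, 0.0)  # the 25%, 5%, and 0% accessibility cut-offs


def relative_accessibility(accessibility, tripeptide_accessibility):
    """Observed accessibility divided by the Gly-X-Gly reference value."""
    return accessibility / tripeptide_accessibility


def rsa3_vector(rsa):
    """1 = exposed, 0 = buried, evaluated at each threshold (assumed rule: rsa > cut-off)."""
    return [1 if rsa > threshold else 0 for threshold in THRESHOLDS]


rsa = relative_accessibility(30.0, 150.0)  # hypothetical values
print(rsa3_vector(rsa))
```

In this hypothetical case the residue counts as buried at the 25% cut-off but exposed at the 5% and 0% cut-offs.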

Furthermore, we used another relative solvent accessibility prediction method developed by our lab members [36], which predicts two-state RSA in ten categories: 0-10%, 10-20%, …, 90-100%. With this prediction scheme, the predicted probabilities of the ten classes were available and were used directly as a 10-dimensional vector (RSA-10). The “null” residue is represented by a 10-dimensional all-zero vector.

The procedure for processing the RSA-10 input vector is shown in Figure 4.

2.4.5 Chou-Fasman conformational parameter input vector

The Chou-Fasman algorithm for the prediction of protein secondary structure [31] is one of the most widely used predictive schemes. The method depends on assigning a set of prediction values to each residue and then applying a simple algorithm to the conformational parameters and positional frequencies (Table 1).

The Chou-Fasman algorithm is simple in principle. The conformational parameters for each amino acid were calculated by considering the relative frequency of a given amino acid within a protein, its occurrence in a given type of secondary structure, and the fraction of residues occurring in that type of structure. These parameters are measures of a given amino acid's preference to be found in helix, sheet or coil. Using these conformational parameters, one finds nucleation sites within the sequence and extends them until a stretch of amino acids is encountered that is not disposed to occur in that type of structure or until a stretch is encountered that has a greater disposition for another type of structure. At that point, the structure is terminated. This process is repeated throughout the sequence until the entire sequence is predicted.
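The calculation of a conformational parameter described above can be sketched as a ratio of two fractions. The counts used here are hypothetical, and the function and argument names are illustrative only.

```python
def conformational_parameter(aa_in_state, aa_total, residues_in_state, residues_total):
    """Preference of one amino acid for one secondary structure type.

    Ratio of the fraction of this amino acid found in the structure type
    to the fraction of all residues found in that structure type.
    """
    fraction_of_amino_acid = aa_in_state / aa_total
    fraction_of_all_residues = residues_in_state / residues_total
    return fraction_of_amino_acid / fraction_of_all_residues


# Hypothetical counts: 75 of 100 occurrences of an amino acid lie in helices,
# while 5000 of 10000 residues overall lie in helices.
p_helix = conformational_parameter(75, 100, 5000, 10000)
print(p_helix)
```

A value above 1 indicates that the amino acid occurs in the structure type more often than an average residue does, and a value below 1 indicates the opposite.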

To scale the conformational parameters to the 0-1 range, each parameter value was simply divided by 200. Each amino acid was represented as a 3-dimensional vector of its scaled conformational parameters. The “null” residue is represented by a 3-dimensional all-zero vector.

2.4.6 Amino acid solvent exposed area (SEA) input vector

Table 2 lists solvent accessibility information derived from Bordo and Argos [32]. The data in this table were calculated from 55 proteins in the Brookhaven database, drawn from 9 molecular families: globins, immunoglobulins, cytochromes c, serine proteases, subtilisins, calcium-binding proteins, acid proteases, toxins and virus capsid proteins. Red entries are found on the surface of a protein in > 70% of occurrences, and blue entries are found inside a protein in < 20% of occurrences. The only clear trend in this table is that some residues, such as R and K, locate themselves so that they have access to the solvent. The hydrophobic residues, such as L and F, show no clear trend: they are found near the solvent as often as they are found buried. Each amino acid was represented as a 3-dimensional vector, and the “null” residue is represented by a 3-dimensional all-zero vector.

2.4.7 Amino acid volume input vector

We used the amino acid volumes calculated by Zamyatnin [33] as one kind of input vector. As shown in Table 3, each amino acid was represented by its volume as a one-dimensional vector, and the “null” residue is represented by a one-dimensional vector of zero.

2.4.8 Amino acid hydropathy input vector

The hydropathy index is a scale combining the hydrophobicity and hydrophilicity of R groups; it can be used to measure the tendency of an amino acid to seek an aqueous environment (negative values) or a hydrophobic environment (positive values) [34]. The values, shown in Table 4, were scaled to the 0-1 range by equation (2). Each amino acid was represented by its hydropathy as a one-dimensional vector, and the “null” residue is represented by a one-dimensional vector of zero.
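A minimal sketch of this lookup-and-scale step follows. It assumes the Kyte-Doolittle hydropathy scale for illustration; the values in Table 4 may come from a different source and should take precedence. The function name and the `-` null symbol are also assumptions.

```python
import math

# Kyte-Doolittle hydropathy values, used here as an assumed stand-in for Table 4.
HYDROPATHY = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def hydropathy_vector(residue):
    """One-dimensional input: logistic-scaled hydropathy, or 0 for the null residue."""
    if residue not in HYDROPATHY:
        return [0.0]
    return [1.0 / (1.0 + math.exp(-HYDROPATHY[residue]))]


for residue in "IGR-":
    print(residue, round(hydropathy_vector(residue)[0], 3))
```

Because the logistic function never returns exactly 0 for a finite input, the zero assigned to the null residue cannot be confused with the scaled value of any real amino acid.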

From the document: Protein β-turn prediction based on the support vector machine method (pp. 12-16)