Method - 應用機器學習方法預測核糖核酸與蛋白質結合位置

3-1 Problem Definition

We aim to provide a useful RBP binding site predictor that can assist biologists to have clues on site-directed mutagenesis in wet-lab experiments. With protein sequence information only, we predict the binding residues and output binary label.

The definition of protein-RNA interaction residues is based on molecular distance which is a good indication for existence of intermolecular forces. An amino acid residue was designated as a binding site if the side chain or backbone atoms of the residue fell within a cutoff distance from any atoms of the RNA partner molecule in the complex.

All the other residues were regarded as non-binding sites.

3-2 Data Set

We adopt two training sets and one testing set to perform the experiment.

i. RNA Binding Protein Chain 86 (RBPC86)

As mentioned in the related work, RBPC86 is the most common dataset in the field of RNA-Protein interaction sites prediction. The RBPC86 data set consists of 86 protein chains extracted from RNA-protein complexes with X-ray crystallography resolution better than 3.0 Å in PDB.

This dataset first defined by Jeong and his colleagues [19, 20] as a distance cutoff 6.0 Å to include a wide range of protein-RNA interactions, and the homology is 70%

sequence identity over 90% overlap on both sequences and BLASTClust [30]. RBPC86 then used by Kumar et al. [27], and adapted by Cheng et al. [5] as well as Spriggs et al.

[28]. We utilized the Cheng et al. [5] version which has removed non-RBP chains. The resultant data set contains 4,568 RNA interacting residues and 15,503 non-interacting residues, in total of 20,071 residues.

ii. RBPC147

Another training dataset of protein–RNA interactions is RBPC147 extracted from structures of known protein–RNA complexes in the PDB solved by X-ray crystallography resolution better than 3.5 Å. Proteins with larger than 30% sequence identity were removed using PISCES [31].

Terribilini et al. [32] introduced RBPC147 in the RNABindR web-based server. In addition, Tong et al. [25] used RBPC147 as a benchmark dataset. Based on the cut-off distance of 5.0 Å, a total of 32,324 amino acids are included in RBPC 147, which contains 6,157 RNA-binding residues and 26,167 non-binding residues.

iii. RBPC33

An independent testing dataset of protein–RNA interactions is RBPC33 extracted from structures of known protein–RNA complexes that were added after January 2006.

RBPC33 contains chains longer than 40 residues. We performed a redundancy reduction on BLASTClust [30] to ensure that none of the chains showed a sequence

similarity of more than 30% within the dataset and to the previous RBPC86 and RBPC147 dataset. A distance cutoff of 5.0 Å was used to annotate interface residues.

RBPC33 is a testing set modified from 36 binding protein chains which were used by Maetschke et al. [29] in 2009.

3-3 Performance Measure

To benchmark our performance and compare with the other studies, we calculate the following measurements:

Recall

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives and FN is the number of false negatives. An MCC of +1 reaches its best correlation between the observed and the predicted classes of the samples, and a MCC of −1 is perfect anti-correlation; whereas a MCC of zero denotes

no correlation at all. F-score (also called F-measure) is a harmonic mean of precision and recall, where 1 denotes perfect results and 0 denotes the worst [33].

In this study, we use two cross-validation ways to assess the performance of the SVM models. One is leave-one-out cross-validation, using a single chain from the RBPC86 as the validation data and the rest of chains as the training data. The cross-validation process is repeated 86 times. The other way is by using 5-fold cross-validation on RBPC147 due to the data over-fitting and the time-consuming problem; we use 5-fold cross-validation on both RBPC86 and RBPC147. RBPC86 and RBPC147 are randomly split into 5 non-overlapping subsets on protein-chain level to avoid homological issue. One subset is the validation data, and the remaining subsets are the training set. Then repeat 5 times to generate the performance of our predictor.

3-4 Feature Selection

To obtain the best performance of prediction, we explore distinct features and PSSM schema to apply to our experiments.

i. PSSM

PSSM encoded from PSI BLAST is composed of log-likelihoods for 20 amino acids for individual query residues.

ii. PSSM in 7 groups

PSSM encoded according to amino acid properties of 7 groups shown in Table 2-1 in Chapter 2. After the PSSM turn into a 7-column matrix, we encoded to a sequence patch using sliding window technique.

iii. PSSM added secondary structure information (SS)

The PSIPRED outputs consist of three probability values represented for helix, sheet and coil respectively, for instance (H, E, C) = (0.75, 0.25, 0.25). We add three features to a normalized PSSM features then do the sliding window, that is to say, the added secondary structure information is not normalized.

iv. PSSM added interface propensities

The interface propensities calculate the proportion of the interface to surface of a given residue in RBPs.

where N is the number of interface residues of certain type of amino acid k, ^I_k



k I

N the total number of interface residues, k N is the number of surface residues ^S_k of type k, and



N are the number of surface. We adopt a new interface propensity k

calculated by Laura Pe´rez-Cano et al. [34] The interface propensities are normalized in linear model due to its range from 0 to1.

v. PSSM added Electrostatics propensities

PSSM is added one column of electrostatics propensities based on amino acid attributes, ascertained by Fauchere, J.L. et al. [35] In our schema, 0 means negative charge, 0.5 represents neural and 1 means positive charge.

3-5 Normalization

Normalization is a crucial topic in the process of handling data. The most important purpose of normalization is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. On the other hand, it is utilized to avoid numerical difficulties or even computation crashes during the calculation when dimension grows large. On a common basis, researchers normalize each attribute of the data instance to the range [0,1] or [-1,1].

Figure 3-1 Linear model (in dashed), Logistic model (in blue)

We adopt two major model of normalization: linear model and logistic model. By instinct, the linear model is scaling to proportion of maximum and minimum. For scaling to [-1,1], the linear model is

min 1

where min stands for minimum and max for maximum.

For linearly separable data, over-fitting might occur, since the extreme value of maximums or minimums affect the scaling curve. We can use local maximum and local minimum of each protein chain or normalize input value by each amino acid in column to resolve such situations.

In addition, another solution is adapting logistic model, the so called sigmoid model [36]. To scale data attributes to [0,1], the logistic model uses the following equation:

Since we try to avoid data bias, we propose a modified version of the logistic model that shifts the curve according to their mean and variance.

e

^t

Table 3-1 is the detailed normalization functions categorized in linear or logistic models discussed in this thesis.

Table 3-1 List of normalization functions

Category Function Name Detail

Linear

Linear Column linear model

Linear by column-wise extreme value

Linear Global linear model

Linear by whole dataset extreme value Logistic

e

^t

1  1

Logistic model t=x

Logistic Chain logistic model

 x t x

where

x

and σ are chain-wise

Logistic Column logistic model

 x t x

where

x

and σ are column-wise

The results of distinct normalization ways are reported in 4-1 . We find out that Logistic model outperform the others methods; hence, Logistic model are adopted in the following experiment.

3-6 Single Predictor Model

The performance of the SVM classifier depends on the combination of several parameters. In general, our experiment involves two groups of parameters: parameters relative to input featured PSSM and SVM classifier adjustment.

The first one is the sliding window size of featured PSSM. PSSM generates evolutional information of individual residue and added amino acid properties. Since a residue cannot act as a lonely wolf in biochemical process, we cluster neighboring residues to a central residue and construct sequential patches. By using sliding windows, the sequence properties were integrated into a feature vector covering the whole subsequence and all the information is used to describe the center residue.

Figure 3-2 Sliding window framework

For the SVM classifier, we take two parameters into account. The first one is cost value C, and the other is γ gamma value in the radial basis function. The cost value C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error, while the gamma value γ regulates the amplitude of the kernel function to dominate the generalization ability of SVM.

We test a wide range of window sizes according to the featured type of PSSM.

Since 7-group PSSM extracts features vector out of 7 columns, we obtain a larger window size of 31 to gather enough information. The others window sizes of featured PSSM are about 23, which is around the domain size in RBPs.

Table 3-2 List of optimal parameters of single predictors

Data set PSSM features

Window size

Cost(log

₂

n ) gamma(log

₂

n )

RBPC86

PSSM 23 1 -5

RBPC86

7 groups PSSM 31 1 -4

RBPC86

PSSM + SS 23 1 -5

RBPC86

PSSM + Interface

Propensities 23 1 -5

RBPC86

PSSM +Electrostatics 23 1 -5

RBPC147

PSSM 23 1 -5

RBPC147

7 groups PSSM 31 1 -4

RBPC147

PSSM + SS 23 1 -5

RBPC147

PSSM + Interface

Propensities 23 1 -5

RBPC147

PSSM +Electrostatics 23 1 -5

The corresponding results are listed and discussed in Section 4-2 . Since the improvements of the single predictors are limited, we propose a hybrid model.

3-7 Hybrid Model

Besides diverse PSSM schemas and features as single predictor, we devote to seek models that predict more positive values which mean RBP binding sites. Since the protein functional signatures are strongly related to the conservation domains, we consider RNA-protein interaction as a kind of protein function and utilize WildSpan to find conservation domains. We combine SVM-based single predictors which combined PSSM and secondary structure information together with WildSpan to construct a new model.

We applied the default parameter setting to obtain patterns by WildSpan. As the authors recommend, we input our query to search against Swiss-Prot database [3] with PSI-BLAST (blastpgp –j 6) and obtain maximum 150 unique target sequences. These target sequences share 30% ~ 90% sequence identity with the query sequence, since we would like to find remote homologous domains and to remove the similar protein sequence. Then we utilize WildSpan to obtain the top-one conservation pattern as the binding residues. Since WildSpan cannot generate patterns under certain conditions, we have several chains without WildSpan patterns. There are 14 chains out of RBPC86, 21 chains out of RBPC147, and 11 chains out of RBPC33. The detailed list of protein chains with no WildSpan patterns are enumerated in Table 3-3.

Table 3-3 List of protein chains with no WildSpan patterns

Data set PDB ID chain list

RBPC86 1B23_P, 1C0A_A, 1C9S_L, 1E6T_C, 1EFW_B, 1F8V_A, 1FJG_L, 1IVS_B, 1JBR_A, 1N35_A, 1NB7_B, 1QF6_A, 1QU2_A, 2BBV_C

RBPC147 1A34_A, 1B23_P, 1C0A_A, 1FFY_A, 1FJG_L, 1FJG_V, 1GTF_Q, 1H2C_A, 1JBR_A, 1N35_A, 1NB7_A, 1Q2S_A, 1QF6_A, 1RPU_A, 1U0B_B, 1UVJ_A, 1YVP_A, 2AZ0_A, 2BGG_A, 2BTE_A , 2BU1_A

RBPC33 2DER_B, 2F8S_A, 2G8K_A, 2GJE_A, 2GJE_D, 2GTT_G, 2HVR_A, 2HYI_D, 2PJP_A, 2Q66_A

We test and integrate three single predictors includes PSSM on SVM classifier, PSSM added secondary structure information on SVM classifier as well as pattern information by WildSpan. The new model incorporates all the positive sites that identified by single predictors. We name this Protein-RNA sites prediction method ProteRNA.

3-8 System Architecture

Our experimental method has two main parts as Figure 3-3. Firstly, sequence queries are prepared in FASTA format. Secondly, we input sequence queries to runs on SVM to get a prediction model. Finally, WildSpan provides the conservation information and outputs the second model. After we combine the entire prediction model, the result is done.

Figure 3-3 Overall framework flowchart

The detailed experiment steps are shown in Figure 3-4 and Figure 3-5. In Figure 3-4, we depict SVM part. Firstly, sequence queries are encoded to PSSM by PSI BLAST and normalized by logistic function. Secondly, we prepare the PSSM by adding secondary structure information provided by PSIPRED. The PSSM with secondary structure information combined and do the sliding window to be the training data. The training data runs on SVM to get a prediction model.

The WildSpan part is shown in Figure 3-5. We input our query to search against Swiss-Prot database [3] and obtain maximum 150 unique target sequences that share 30% ~ 90% sequence identity with the query sequence. Then we input these sequences to WildSpan and obtain the top-one conservation pattern as the binding residues. Since

WildSpan cannot generate patterns under certain conditions, we have several chains without WildSpan patterns as enumerated in Table 3-3.

Figure 3-4 Secondary structure information prediction flowchart

Figure 3-5 WildSpan prediction flowchart

在文檔中應用機器學習方法預測核糖核酸與蛋白質結合位置 (頁 31-46)