3-1 Problem Definition
We aim to provide a useful RBP binding site predictor that can assist biologists to have clues on site-directed mutagenesis in wet-lab experiments. With protein sequence information only, we predict the binding residues and output binary label.
The definition of protein-RNA interaction residues is based on molecular distance which is a good indication for existence of intermolecular forces. An amino acid residue was designated as a binding site if the side chain or backbone atoms of the residue fell within a cutoff distance from any atoms of the RNA partner molecule in the complex.
All the other residues were regarded as non-binding sites.
3-2 Data Set
We adopt two training sets and one testing set to perform the experiment.
i. RNA Binding Protein Chain 86 (RBPC86)
As mentioned in the related work, RBPC86 is the most common dataset in the field of RNA-Protein interaction sites prediction. The RBPC86 data set consists of 86 protein chains extracted from RNA-protein complexes with X-ray crystallography resolution better than 3.0 Å in PDB.
This dataset first defined by Jeong and his colleagues [19, 20] as a distance cutoff 6.0 Å to include a wide range of protein-RNA interactions, and the homology is 70%
24
sequence identity over 90% overlap on both sequences and BLASTClust [30]. RBPC86 then used by Kumar et al. [27], and adapted by Cheng et al. [5] as well as Spriggs et al.
[28]. We utilized the Cheng et al. [5] version which has removed non-RBP chains. The resultant data set contains 4,568 RNA interacting residues and 15,503 non-interacting residues, in total of 20,071 residues.
ii. RBPC147
Another training dataset of protein–RNA interactions is RBPC147 extracted from structures of known protein–RNA complexes in the PDB solved by X-ray crystallography resolution better than 3.5 Å. Proteins with larger than 30% sequence identity were removed using PISCES [31].
Terribilini et al. [32] introduced RBPC147 in the RNABindR web-based server. In addition, Tong et al. [25] used RBPC147 as a benchmark dataset. Based on the cut-off distance of 5.0 Å, a total of 32,324 amino acids are included in RBPC 147, which contains 6,157 RNA-binding residues and 26,167 non-binding residues.
iii. RBPC33
An independent testing dataset of protein–RNA interactions is RBPC33 extracted from structures of known protein–RNA complexes that were added after January 2006.
RBPC33 contains chains longer than 40 residues. We performed a redundancy reduction on BLASTClust [30] to ensure that none of the chains showed a sequence
25
similarity of more than 30% within the dataset and to the previous RBPC86 and RBPC147 dataset. A distance cutoff of 5.0 Å was used to annotate interface residues.
RBPC33 is a testing set modified from 36 binding protein chains which were used by Maetschke et al. [29] in 2009.
3-3 Performance Measure
To benchmark our performance and compare with the other studies, we calculate the following measurements:
Recall
where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives and FN is the number of false negatives. An MCC of +1 reaches its best correlation between the observed and the predicted classes of the samples, and a MCC of −1 is perfect anti-correlation; whereas a MCC of zero denotes
26
no correlation at all. F-score (also called F-measure) is a harmonic mean of precision and recall, where 1 denotes perfect results and 0 denotes the worst [33].
In this study, we use two cross-validation ways to assess the performance of the SVM models. One is leave-one-out cross-validation, using a single chain from the RBPC86 as the validation data and the rest of chains as the training data. The cross-validation process is repeated 86 times. The other way is by using 5-fold cross-validation on RBPC147 due to the data over-fitting and the time-consuming problem; we use 5-fold cross-validation on both RBPC86 and RBPC147. RBPC86 and RBPC147 are randomly split into 5 non-overlapping subsets on protein-chain level to avoid homological issue. One subset is the validation data, and the remaining subsets are the training set. Then repeat 5 times to generate the performance of our predictor.
3-4 Feature Selection
To obtain the best performance of prediction, we explore distinct features and PSSM schema to apply to our experiments.
i. PSSM
PSSM encoded from PSI BLAST is composed of log-likelihoods for 20 amino acids for individual query residues.
27
ii. PSSM in 7 groups
PSSM encoded according to amino acid properties of 7 groups shown in Table 2-1 in Chapter 2. After the PSSM turn into a 7-column matrix, we encoded to a sequence patch using sliding window technique.
iii. PSSM added secondary structure information (SS)
The PSIPRED outputs consist of three probability values represented for helix, sheet and coil respectively, for instance (H, E, C) = (0.75, 0.25, 0.25). We add three features to a normalized PSSM features then do the sliding window, that is to say, the added secondary structure information is not normalized.
iv. PSSM added interface propensities
The interface propensities calculate the proportion of the interface to surface of a given residue in RBPs.
where N is the number of interface residues of certain type of amino acid k, Ik
k IN the total number of interface residues, k N is the number of surface residues Sk of type k, and
kS
N are the number of surface. We adopt a new interface propensity k
calculated by Laura Pe´rez-Cano et al. [34] The interface propensities are normalized in linear model due to its range from 0 to1.
28
v. PSSM added Electrostatics propensities
PSSM is added one column of electrostatics propensities based on amino acid attributes, ascertained by Fauchere, J.L. et al. [35] In our schema, 0 means negative charge, 0.5 represents neural and 1 means positive charge.
3-5 Normalization
Normalization is a crucial topic in the process of handling data. The most important purpose of normalization is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. On the other hand, it is utilized to avoid numerical difficulties or even computation crashes during the calculation when dimension grows large. On a common basis, researchers normalize each attribute of the data instance to the range [0,1] or [-1,1].
Figure 3-1 Linear model (in dashed), Logistic model (in blue)
29
We adopt two major model of normalization: linear model and logistic model. By instinct, the linear model is scaling to proportion of maximum and minimum. For scaling to [-1,1], the linear model is
min 1
where min stands for minimum and max for maximum.
For linearly separable data, over-fitting might occur, since the extreme value of maximums or minimums affect the scaling curve. We can use local maximum and local minimum of each protein chain or normalize input value by each amino acid in column to resolve such situations.
In addition, another solution is adapting logistic model, the so called sigmoid model [36]. To scale data attributes to [0,1], the logistic model uses the following equation:
Since we try to avoid data bias, we propose a modified version of the logistic model that shifts the curve according to their mean and variance.
e
tTable 3-1 is the detailed normalization functions categorized in linear or logistic models discussed in this thesis.
30
Table 3-1 List of normalization functions
Category Function Name Detail
Linear
Linear Column linear model
Linear by column-wise extreme value
Linear Global linear model
Linear by whole dataset extreme value Logistic
e
t1 1
Logistic model t=x
Logistic Chain logistic model
x t x
where
x
and σ are chain-wiseLogistic Column logistic model
x t x
where
x
and σ are column-wiseThe results of distinct normalization ways are reported in 4-1 . We find out that Logistic model outperform the others methods; hence, Logistic model are adopted in the following experiment.
3-6 Single Predictor Model
The performance of the SVM classifier depends on the combination of several parameters. In general, our experiment involves two groups of parameters: parameters relative to input featured PSSM and SVM classifier adjustment.
31
The first one is the sliding window size of featured PSSM. PSSM generates evolutional information of individual residue and added amino acid properties. Since a residue cannot act as a lonely wolf in biochemical process, we cluster neighboring residues to a central residue and construct sequential patches. By using sliding windows, the sequence properties were integrated into a feature vector covering the whole subsequence and all the information is used to describe the center residue.
Figure 3-2 Sliding window framework
For the SVM classifier, we take two parameters into account. The first one is cost value C, and the other is γ gamma value in the radial basis function. The cost value C is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error, while the gamma value γ regulates the amplitude of the kernel function to dominate the generalization ability of SVM.
32
We test a wide range of window sizes according to the featured type of PSSM.
Since 7-group PSSM extracts features vector out of 7 columns, we obtain a larger window size of 31 to gather enough information. The others window sizes of featured PSSM are about 23, which is around the domain size in RBPs.
Table 3-2 List of optimal parameters of single predictors
Data set PSSM features
Window sizeCost(log
2n ) gamma(log
2n )
RBPC86
PSSM 23 1 -5RBPC86
7 groups PSSM 31 1 -4RBPC86
PSSM + SS 23 1 -5RBPC86
PSSM + Interface
Propensities 23 1 -5
RBPC86
PSSM +Electrostatics 23 1 -5RBPC147
PSSM 23 1 -5RBPC147
7 groups PSSM 31 1 -4RBPC147
PSSM + SS 23 1 -5RBPC147
PSSM + Interface
Propensities 23 1 -5
RBPC147
PSSM +Electrostatics 23 1 -5The corresponding results are listed and discussed in Section 4-2 . Since the improvements of the single predictors are limited, we propose a hybrid model.
33
3-7 Hybrid Model
Besides diverse PSSM schemas and features as single predictor, we devote to seek models that predict more positive values which mean RBP binding sites. Since the protein functional signatures are strongly related to the conservation domains, we consider RNA-protein interaction as a kind of protein function and utilize WildSpan to find conservation domains. We combine SVM-based single predictors which combined PSSM and secondary structure information together with WildSpan to construct a new model.
We applied the default parameter setting to obtain patterns by WildSpan. As the authors recommend, we input our query to search against Swiss-Prot database [3] with PSI-BLAST (blastpgp –j 6) and obtain maximum 150 unique target sequences. These target sequences share 30% ~ 90% sequence identity with the query sequence, since we would like to find remote homologous domains and to remove the similar protein sequence. Then we utilize WildSpan to obtain the top-one conservation pattern as the binding residues. Since WildSpan cannot generate patterns under certain conditions, we have several chains without WildSpan patterns. There are 14 chains out of RBPC86, 21 chains out of RBPC147, and 11 chains out of RBPC33. The detailed list of protein chains with no WildSpan patterns are enumerated in Table 3-3.
34
Table 3-3 List of protein chains with no WildSpan patterns
Data set PDB ID chain list
RBPC86 1B23_P, 1C0A_A, 1C9S_L, 1E6T_C, 1EFW_B, 1F8V_A, 1FJG_L, 1IVS_B, 1JBR_A, 1N35_A, 1NB7_B, 1QF6_A, 1QU2_A, 2BBV_C
RBPC147 1A34_A, 1B23_P, 1C0A_A, 1FFY_A, 1FJG_L, 1FJG_V, 1GTF_Q, 1H2C_A, 1JBR_A, 1N35_A, 1NB7_A, 1Q2S_A, 1QF6_A, 1RPU_A, 1U0B_B, 1UVJ_A, 1YVP_A, 2AZ0_A, 2BGG_A, 2BTE_A , 2BU1_A
RBPC33 2DER_B, 2F8S_A, 2G8K_A, 2GJE_A, 2GJE_D, 2GTT_G, 2HVR_A, 2HYI_D, 2PJP_A, 2Q66_A
We test and integrate three single predictors includes PSSM on SVM classifier, PSSM added secondary structure information on SVM classifier as well as pattern information by WildSpan. The new model incorporates all the positive sites that identified by single predictors. We name this Protein-RNA sites prediction method ProteRNA.
3-8 System Architecture
Our experimental method has two main parts as Figure 3-3. Firstly, sequence queries are prepared in FASTA format. Secondly, we input sequence queries to runs on SVM to get a prediction model. Finally, WildSpan provides the conservation information and outputs the second model. After we combine the entire prediction model, the result is done.
35
Figure 3-3 Overall framework flowchart
The detailed experiment steps are shown in Figure 3-4 and Figure 3-5. In Figure 3-4, we depict SVM part. Firstly, sequence queries are encoded to PSSM by PSI BLAST and normalized by logistic function. Secondly, we prepare the PSSM by adding secondary structure information provided by PSIPRED. The PSSM with secondary structure information combined and do the sliding window to be the training data. The training data runs on SVM to get a prediction model.
The WildSpan part is shown in Figure 3-5. We input our query to search against Swiss-Prot database [3] and obtain maximum 150 unique target sequences that share 30% ~ 90% sequence identity with the query sequence. Then we input these sequences to WildSpan and obtain the top-one conservation pattern as the binding residues. Since
36
WildSpan cannot generate patterns under certain conditions, we have several chains without WildSpan patterns as enumerated in Table 3-3.
Figure 3-4 Secondary structure information prediction flowchart
37
Figure 3-5 WildSpan prediction flowchart
38