2-1 Central Dogma
The central dogma is a biological principle for understanding the residue-by-residue transfer of sequential information [10]. Three major classes of biopolymers are involved in the dogma: DNA, RNA, and protein.
First, deoxyribonucleic acid (DNA) is a nucleic acid composed of nucleotides carrying four kinds of bases, viz. adenine (A), thymine (T), guanine (G), and cytosine (C). Each type of base on one strand bonds with only one type of base on the opposite strand: A pairs with T, and G pairs with C.
Because of this complementary base pairing, two long strands entwine in the shape of a double helix, and each strand can serve as a template for duplicating the other. This specific interaction between complementary base pairs is critical for all the functions of DNA in living organisms.
Secondly, ribonucleic acid (RNA) is also a nucleic acid, consisting of adenine (A), cytosine (C), guanine (G), and uracil (U). RNAs exhibit not only base pairing but also numerous modified bases and sugars. Unlike DNA, RNA is single-stranded in most of its biological roles and has a much shorter chain of nucleotides.
Hence, RNAs can adopt diverse shapes to play specific roles in biological processes. There are many types of RNA in cells, including messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), small nuclear RNA (snRNA), small RNA (sRNA), and viral RNA (vRNA). Depending on the type of target RNA, RBPs have different structures to satisfy specific needs, as shown in Figure 2-1.
Figure 2-1 RBPs with different target RNA. Panels: mRNA (PDB ID: 2PJP); tRNA (PDB ID: 2DER); RNA as ligand (PDB ID: 2G8K); rRNA (PDB ID: 1JJ2)
Last but not least, a protein is an organic compound built from 20 kinds of amino acids arranged in a linear chain and folded into a globular form. Like nucleic acids, the biological macromolecules described above, proteins are essential parts of organisms and participate in virtually every process within cells.
The general transfers describe the normal flow of biological information, as shown in Figure 2-2. DNA can be copied to DNA, which is DNA replication. DNA information can be copied into mRNA, which is called transcription. Proteins can then be synthesized using the information in mRNA as a template, which is translation. In addition, some RNA viruses are able to replicate RNA or reverse-transcribe RNA into DNA.
Figure 2-2 Flow chart of central dogma [10]
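The three general transfers can be illustrated with a small Python sketch. It is only a toy model of the information flow: the function names are hypothetical, the input is assumed to be the DNA coding strand, and translation simply reads codons from the first base with the standard genetic code.

```python
# Toy illustration of DNA -> DNA, DNA -> mRNA, and mRNA -> protein.
BASES = "UCAG"
# Standard genetic code listed in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
DNA_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def replicate(strand):
    """Return the complementary DNA strand, read 5'->3' (reverse complement)."""
    return "".join(DNA_PAIR[base] for base in reversed(strand))


def transcribe(coding_strand):
    """Copy a DNA coding strand into mRNA (T is replaced by U)."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)
```

For example, the coding strand ATGGCCTAA is transcribed to AUGGCCUAA, which translates to the dipeptide Met-Ala before the stop codon.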
2-2 The Attributes of Amino Acid
Amino acids are the basic molecules of proteins, serving both as building blocks of proteins and as intermediates in metabolism. There are 20 kinds of amino acids found within proteins. Each amino acid type has its own side chain and properties, and amino acids can be linked together in various sequences to form a vast variety of protein structures. Nevertheless, several classifications have been proposed, since some of the amino acids share common properties. The concept map in Figure 2-3 portrays the common amino acid properties and the relationships between them. For instance, the positive set is a subset of the charged set, and the charged set is a subset of the polar set.
Figure 2-3 Amino acid properties [11]
The amino acid properties give information about individual residues that may help us identify the RNA-binding residues. The interaction interfaces of RBPs often have a positive electrostatic surface that complements the negative electrostatic charge of the RNA [6, 12]. As a result, we add electrostatics as a feature to distinguish the binding sites from the non-binding ones.
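One simple way to turn electrostatics into a per-residue feature is a charge lookup, sketched below. The encoding is an assumption made for illustration and not necessarily the one used in this thesis: Lys and Arg count as positive, Asp and Glu as negative, and every other residue, including His, as neutral.

```python
# Assumed encoding: +1 for Lys/Arg, -1 for Asp/Glu, 0 for everything else.
# Treating His as neutral is a simplification chosen for this sketch.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def charge_features(sequence):
    """Map a one-letter amino acid sequence to a list of residue charges."""
    return [CHARGE.get(residue, 0) for residue in sequence]
```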
The 20 amino acids can be clustered into seven groups based on the dipoles and volumes of the side chains [13]. Amino acids within the same group are likely to involve synonymous mutations because of their similar characteristics. Table 2-1 enumerates the amino acids in each group.
Table 2-1 List of amino acids in 7 groups
No.      Amino acid
Group 1  Ala, Gly, Val
Group 2  Ile, Leu, Phe, Pro
Group 3  Tyr, Met, Thr, Ser
Group 4  His, Asn, Gln, Trp
Group 5  Arg, Lys
Group 6  Asp, Glu
Group 7  Cys
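The grouping in Table 2-1 can be applied to a sequence with a plain lookup table. The sketch below uses one-letter codes; the names `GROUPS`, `GROUP_OF`, and `encode_groups` are hypothetical and chosen only for illustration.

```python
# Seven groups from Table 2-1, written with one-letter amino acid codes.
GROUPS = {
    1: "AGV",
    2: "ILFP",
    3: "YMTS",
    4: "HNQW",
    5: "RK",
    6: "DE",
    7: "C",
}
# Invert the table: amino acid -> group number.
GROUP_OF = {aa: group for group, members in GROUPS.items() for aa in members}


def encode_groups(sequence):
    """Replace every residue of a sequence by its group number."""
    return [GROUP_OF[aa] for aa in sequence]
```

A residue outside the 20 standard amino acids raises a `KeyError` here; a real pipeline would have to decide how to handle such symbols.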
2-3 Position-Specific Scoring Matrix
A Position-Specific Scoring Matrix (PSSM) can be generated by PSI-BLAST [14] by searching a protein sequence in FASTA format against the National Center for Biotechnology Information (NCBI) non-redundant (nr) database. Position-specific scores are calculated independently for each residue position in the alignment. The score in the PSSM is the sum of log-likelihoods under a product-multinomial distribution. Highly conserved residues receive high scores and weakly conserved residues receive low scores. Figure 2-4 depicts the content of a PSSM: the residues of the query sequence are shown in rows, and the log-likelihood scores for the 20 amino acid types are shown in columns.
Figure 2-4 Part of PDB ID: 1JJ2_1 PSSM
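The idea that conserved positions receive high scores can be seen in a toy log-odds matrix built from a small gapless alignment. This is not the PSI-BLAST procedure, which uses its own sequence weighting and pseudocount scheme; the uniform background frequency and the add-one pseudocount below are assumptions of the sketch.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(aligned, pseudocount=1.0, background=1.0 / 20):
    """Return one dict of log2-odds scores per alignment column."""
    matrix = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[col] in counts:  # skip gaps and unknown symbols
                counts[seq[col]] += 1
        total = sum(counts.values())
        matrix.append(
            {aa: math.log2(counts[aa] / total / background) for aa in AMINO_ACIDS}
        )
    return matrix
```

With the alignment `["ACD", "ACE", "ACD", "AGD"]`, the fully conserved A in the first column scores higher than the partly conserved C in the second, and residues never observed in a column score below zero.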
2-4 Secondary Structure Information
Protein secondary structure is the general three-dimensional form of local sequence segments. The most common secondary structures are helices and sheets. Each of these two secondary structure elements has a regular geometry with stable hydrogen-bonding patterns. The coil is not a bona fide secondary structure, but is the class of conformations that indicates an absence of regular secondary structure.
We obtain protein secondary structure (SS) information from the PSIPRED Protein Structure Prediction Server developed by Bryson et al. [15]. The server predicts secondary structures based on amino acid evolutionary information, which is the PSSM in this thesis.
2-5 Classifier - Support Vector Machines
The support vector machine (SVM) is a powerful machine-learning algorithm developed from statistical learning theory and based on the structural risk minimization principle proposed by Vladimir Vapnik [8]. Nowadays, SVM is one of the most popular solutions for classification, regression, and novelty detection. Briefly speaking, an SVM constructs a hyper-plane in multi-dimensional space that optimally separates input data into two categories. In the following, we illustrate the framework of SVM.
To begin with, the given data in the multi-dimensional space consist of predictor variables, which are called attributes. A transformed attribute that is used to define the hyper-plane is called a feature. A set of n data points has the form:
$\{(x_i, \mathrm{label}_i) \mid x_i \in \mathbb{R}^{d},\ \mathrm{label}_i \in \{-1, +1\}\},\quad i = 1, \dots, n$ (2-1)
A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal decision boundary (called hyper-plane) that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyper-plane are the support vectors that construct the hyper-plane.
We discuss SVMs using a linearly separable case. The linear model can be presented in the form:
$y(x) = W^{T}x + b$ (2-2)
As illustrated in Figure 2-5, the margin is defined as the perpendicular distance between the hyper-plane and the closest data points. Maximizing the margin leads to a particular choice of hyper-plane, which has the form:
$y(x) = W^{T}x + b = 0$ (2-3)
The two dashed lines in the figure are the support hyper-planes, which respectively satisfy:
$W^{T}x + b = 1$ (2-4)
$W^{T}x + b = -1$ (2-5)
If a data point in the space satisfies inequality (2-6), it is classified as a square-shaped point; if it satisfies inequality (2-7), it is classified as a circular point.
$W^{T}x + b \geq 1$ (2-6)
$W^{T}x + b \leq -1$ (2-7)
Figure 2-5 Hyper-plane of SVM
The two inequalities above can be rewritten as:
$\mathrm{label}_i\,(W^{T}x_i + b) \geq 1$ for all $1 \leq i \leq n$ (2-8)
Under this constraint, the hyper-plane is independent of the data points that are not support vectors. The intuition behind this result is that the decision boundary is increasingly dominated by nearby data points relative to the distant ones.
So far, we have discussed the two-dimensional case. In the following, we apply these formulas to multi-dimensional problems. The distance of a point x to the hyper-plane is:
$\mathrm{Distance} = \frac{|W^{T}x + b|}{\|W\|}$ (2-9)
For a point on a support hyper-plane, $|W^{T}x + b| = 1$, so the distance between a support hyper-plane and the hyper-plane is:
$\mathrm{Distance} = \frac{1}{\|W\|}$ (2-10)
Thus, the margin is the sum of the distances from the two support hyper-planes to the hyper-plane, $\frac{2}{\|W\|}$, and the maximum margin solution is found by solving:
Find W and b that maximize $\frac{2}{\|W\|}$, or equivalently minimize $\frac{W^{T}W}{2}$ (2-11)
It seems that the bias parameter b has disappeared from the optimization. However, it is determined implicitly via the constraints, since these require that changes to W be compensated by changes to b.
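The geometric quantities of the hard-margin case can be checked numerically with a few lines of Python. The sketch below assumes that W and b are already given; it does not solve the optimization in (2-11), and all function names are hypothetical.

```python
import math


def decision(w, b, x):
    """Linear model y(x) = W^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance(w, b, x):
    """Distance of x to the hyper-plane: |W^T x + b| / ||W||, as in (2-9)."""
    return abs(decision(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


def margin(w):
    """Width between the two support hyper-planes: 2 / ||W||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_hard_margin(w, b, points, labels):
    """Check constraint (2-8): label_i * (W^T x_i + b) >= 1 for every point."""
    return all(label * decision(w, b, x) >= 1 for x, label in zip(points, labels))
```

For W = (1, 0) and b = 0, the points (1, 0) and (-1, 0) lie exactly on the two support hyper-planes and the margin is 2. Scaling W up shrinks the margin, which is why minimizing $W^{T}W$ maximizes it.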
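Since the input data might have various distributions in feature space, the linear model might not be suitable for real input data. The kernel technique was developed to map nonlinear input spaces to linear ones. We can express the vector W with Lagrange multipliers α and rewrite formula (2-8) as:
$\mathrm{label}_i\left(\sum_{j}\alpha_j x_j^{T}x_i + b\right) \geq 1$ (2-12)
The kernel function is given by the relation:
$K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ (2-13)
where $\phi$ is the mapping from the input space to the feature space.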
The concept of the kernel formula allows us to build extensions of many well-known algorithms. The common kernel functions are listed below.
Radial basis function: $K(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|^{2})$
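The radial basis function kernel is straightforward to compute directly. In the sketch below, the parameter name `gamma` for the kernel width and the helper `gram_matrix` are assumptions made for illustration.

```python
import math


def rbf_kernel(x_i, x_j, gamma=1.0):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points, gamma=1.0):
    """Kernel values for every pair of points."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]
```

The kernel equals 1 for identical points and decays towards 0 as the points move apart; a larger `gamma` makes the decay faster, so each training point influences a smaller neighbourhood.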
In the general case, we have to consider another problem: overlapping data. We might prefer a solution that better separates the bulk of the data while ignoring a few noisy points. In 1995, Corinna Cortes and Vladimir Vapnik proposed the soft margin method, which allows for mislabeled examples [16]. The previous discussion is based on the hard margin concept, in which no data exist between the two support hyper-planes. In contrast, the soft margin method introduces a slack variable, ξ, which measures the degree of misclassification of the data x. Moreover, the cost value, C, is a regularization parameter that controls the trade-off between maximizing the margin and minimizing the training error. A small cost value tends to emphasize the margin while ignoring the outliers in the training data, whereas a large cost value may over-fit the training data. If the penalty function is linear, the optimization problem can be written as:
Minimize: $\frac{1}{2}\|W\|^{2} + C\sum_{i}\xi_i$
Subject to: $\mathrm{label}_i\left(\sum_{j}\alpha_j x_j^{T}x_i + b\right) \geq 1 - \xi_i$ (2-14)
for all $1 \leq i \leq n$, $\xi_i \geq 0$
This thesis utilized LIBSVM, developed by Chang et al. [17]. The LIBSVM package provides classification model construction, regression, multi-class SVM, etc. With a user-friendly interface and adjustable parameter settings, LIBSVM has been used by many researchers in recent years. We choose the radial basis function kernel to implement our predictor.
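To make the role of the cost value C and the slack variables concrete, the following sketch trains a linear soft-margin classifier by sub-gradient descent on the objective of (2-14), with each slack written as max(0, 1 - label_i (W^T x_i + b)). It is not how LIBSVM works internally (LIBSVM solves the dual problem and supports kernels); the learning rate, epoch count, and function names are illustrative assumptions.

```python
def train_linear_soft_margin(points, labels, cost=1.0, learning_rate=0.01, epochs=1000):
    """Minimize 0.5 * ||W||^2 + cost * sum(slack_i) by sub-gradient descent."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||W||^2 term
        grad_b = 0.0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # positive slack: the point violates the margin
                for d in range(dim):
                    grad_w[d] -= cost * y * x[d]
                grad_b -= cost * y
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def slack(w, b, x, y):
    """Slack of one example: 0 when it lies on the correct side of the margin."""
    return max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
```

Only the points with positive slack contribute to the update, so a larger `cost` pushes the boundary harder towards fitting every training point, while a smaller `cost` lets the margin term dominate.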
2-6 WildSpan
As mentioned, we attempt to extract information only from amino acid sequences. By mining subsequences that frequently occur among a set of training sequences, we may obtain information on function annotation, functional sites, and RNA-protein interaction sites.
WildSpan (http://biominer.bime.ntu.edu.tw/wildspan/) [18] has been embedded in many applications to discover functional signatures and diagnostic patterns of proteins directly from a set of unaligned protein sequences. Therefore, we apply WildSpan to discover conserved residues as RNA-binding residues in a protein sequence to improve prediction performance. For protein-based mining, the authors suggest collecting at most 150 unique homologous proteins, with sequence identity ranging from 30% to 90%, by searching the Swiss-Prot sequence database with PSI-BLAST (blastpgp -j 6). WildSpan cannot generate any patterns when too few homologous proteins are selected from the Swiss-Prot database or when the homologous proteins are too similar.
2-7 Related Works
Due to the importance of RNA-protein interaction, many related studies have been published in the last decade. One of the earliest attempts at predicting RNA-binding sites was made in 2004 by Jeong et al. [19], who used an artificial neural network (ANN) based on amino acid sequence and secondary structure information in sliding windows. They achieved a maximum Matthews correlation coefficient (MCC) of 0.294 with five-fold cross-validation by residues. Jeong and Miyano [20] then endeavored to improve the prediction of RNA-interacting residues based on evolutionary information from the PSSM and achieved an MCC, overall accuracy, specificity, and sensitivity of 0.39, 80.20%, 91.04%, and 43.40%, respectively. They established a dataset containing 86 protein chains that has been the most frequently used in subsequent studies. Furthermore, amino acid evolutionary information from the PSSM plays a crucial role and is widely used.
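The performance measures quoted throughout this section can be computed from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; individual studies cited here may define some measures differently (specificity in particular), so the function is meant only as a reference point.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion counts.

    Assumes at least one positive and one negative example, so the
    sensitivity and specificity denominators are non-zero.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # fraction of binding residues found
        "specificity": tn / (tn + fp),  # fraction of non-binding residues kept
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

Because binding residues are far rarer than non-binding ones, accuracy alone can look high for a predictor that rarely predicts binding; MCC uses all four counts and is therefore the measure most often compared across the studies above.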
Scientists have been seeking other critical features to improve the performance of their predictors. In 2006, Wang and Brown [21] put forward another method utilizing SVM with the side chain pKa, hydrophobicity index, and molecular mass of amino acids on 107 protein chains with at most 25% sequence identity, and achieved a maximum accuracy of 69.32% with 66.28% sensitivity. Additionally, they provided a web server called BindN [22] that predicts both DNA- and RNA-binding sites of proteins. Kim et al. [23] studied the propensities of individual amino acids and amino acid pairs in RNA-protein interfaces on the previous dataset of 86 protein chains by Jeong et al. [19]. They reported 50% sensitivity and 57% specificity for a method that combined doublet propensities and evolutionary information.
As time goes by, the number of known RBPs has risen considerably. In 2007, Terribilini et al. [24] developed a Naive Bayes classifier using PSSM on a larger dataset and achieved a maximum MCC of 0.35. Tong et al. [25] applied SVM to the same dataset and features as Terribilini et al. and obtained a higher MCC of 0.365. Wang et al. [26] reported an MCC of 0.457 and an accuracy of 87.4% by using PSSM, observed secondary structure information, and solvent accessibility information with SVM. In 2008, Kumar et al. [27] utilized an SVM with a second-order polynomial kernel and PSSM as input features on 86 protein chains, achieving an MCC of 0.45 (specificity: 89.6%, sensitivity: 53.0%). Cheng et al. [5] encoded the PSSM into a new smooth PSSM for an SVM classifier and achieved an MCC of up to 0.68 with five-fold residue-level cross-validation on 86 protein chains. Spriggs et al. [28] reported an MCC of 0.50 with five-fold residue-level cross-validation, using SVM to analyze input features such as sequence profiles, interface propensities, accessibility, and hydrophobicity on only 81 protein chains. Maetschke et al. [29] examined a range of structural and topological information with both SVM and a Naive Bayes classifier, including graph-theoretical and geometrical sliding windows, on 144 protein chains, and reported an MCC of 0.39 (specificity: 82.0%, sensitivity: 66.8%). All the related works are summarized in Table 2-2.
Table 2-2 List of previous RNA-binding prediction works
Authors                  Methods                    Features                                                          Performance
Jeong et al. [19]        Artificial Neural Network  AA sequence and SS                                                MCC 0.29
Wang and Brown [21]      SVM                        side chain pKa, hydrophobicity index and molecular mass of AA     69% accuracy and 66% sensitivity
Kim et al. [23]          Scoring Function           doublet propensities and evolutionary information                 50% sensitivity and 57% specificity
Terribilini et al. [24]  Naive Bayes Classifier     PSSM                                                              MCC 0.35
Tong et al. [25]         SVM                        PSSM                                                              MCC 0.37
Wang et al. [26]         SVM                        PSSM, SS and solvent accessibility information                    MCC 0.46
Kumar et al. [27]        SVM                        PSSM and interface propensities                                   MCC 0.45
Cheng et al. [5]         SVM                        smooth-PSSM                                                       MCC 0.68
Some of the previous studies reported acceptable results on macromolecular sequence data using k-fold cross-validation with window-based data splitting, that is, residue-level cross-validation. However, Caragea et al. [7] pointed out the problems of assessing the performance of classifiers on imbalanced data such as macromolecular sequence datasets. Compared with sequence-based k-fold cross-validation, window-based k-fold cross-validation can yield overly optimistic estimates of classifier performance. Window-based division also raises a biological issue: windows drawn from the same or homologous sequences may fall into different subsets, so the training and test subsets overlap.
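The difference between the two splitting schemes can be sketched in a few lines. The code below is a simplified illustration: the helper names are hypothetical, folds are assigned round-robin instead of at random, the window half-width is fixed at 2, and grouping of homologous (as opposed to identical) chains is not handled.

```python
def sliding_windows(sequence, labels, half_width):
    """One (window, label) example per residue; 'X' pads the sequence ends."""
    padded = "X" * half_width + sequence + "X" * half_width
    width = 2 * half_width + 1
    return [(padded[i:i + width], labels[i]) for i in range(len(sequence))]


def sequence_based_folds(proteins, k):
    """Assign whole proteins to folds, so all windows of a chain stay together."""
    folds = [[] for _ in range(k)]
    for index, (name, sequence, labels) in enumerate(proteins):
        windows = sliding_windows(sequence, labels, 2)
        folds[index % k].extend((name, w, y) for w, y in windows)
    return folds


def window_based_folds(proteins, k):
    """Pool all windows first, then deal them out; one chain spans many folds."""
    pooled = [
        (name, w, y)
        for name, sequence, labels in proteins
        for w, y in sliding_windows(sequence, labels, 2)
    ]
    folds = [[] for _ in range(k)]
    for index, example in enumerate(pooled):
        folds[index % k].append(example)
    return folds
```

With window-based folds, neighbouring windows of the same chain, which share most of their residues, can land in both the training and the test subset; with sequence-based folds they cannot.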
As Table 2-2 shows, SVM has been widely adopted as a core classifier due to its low bias, high customizability, and good performance. Therefore, we choose SVM as one of the core classifiers in this thesis. Furthermore, since single SVM-based predictors have shown limited improvement [28], we propose a hybrid method named "ProteRNA".