Chapter 1 Introduction

1.3 Remote Homology Detection

The analysis of novel biological sequences usually starts with a search for homologous sequences in annotated databases. Homologous sequences share a common ancestor, and thus often have similar functions and structures. Based on pairwise identities and specific thresholds, sequence search tools retrieve similar annotated sequences for homology inference, which is crucial in downstream analyses such as protein structure modeling, function prediction, protein-protein interaction network analysis, and other property annotations. While structural information helps to improve the understanding of some target proteins, in many situations one has to analyze a protein based on its sequence information only. The advent of whole-genome sequencing has generated large numbers of protein sequences with undetermined structures and functions.

Many of these newly sequenced proteins, including those related to diseases, have few closely related homologs in annotated databases. In addition, as the number of sequenced genomes and proteins grows, many relationships between distantly related proteins are observed and need to be studied further to better understand the complex structure of the protein universe. Sensitive strategies for analyzing proteins based solely on sequence information are therefore still in demand and of great importance in the genomic era.

Sequence similarity is a simple and frequently used metric for homology detection and annotation transfer. However, the sequence itself provides only incomplete and noisy information about the protein. The most similar result may not be the most relevant sequence [57], while other homologous sequences might be missing from the search results. For example, two sequences are usually identified as homologs if their pairwise similarity is higher than 40%, but the problem becomes rather challenging for sequences sharing similarity between 20% and 35%, i.e., sequences in the twilight zone. Studies showed that even among protein pairs with sequence identity below 25%, slightly less than 10% are still homologous [58]. Thus pairwise sequence similarity has its limits in detecting distant sequence relationships. Using a threshold on pairwise sequence identity to determine homology is debatable, since once the identity of a sequence pair falls below the specified threshold we can hardly tell whether the pair is homologous or not.
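To make the threshold discussion concrete, the following is a minimal sketch of how pairwise identity might be computed from an existing alignment and compared with the ranges quoted above. The function names, the gap symbol, and the choice to count only columns where both sequences carry a residue are assumptions made for illustration; other definitions of percent identity use different denominators.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where both sequences have a residue.

    Assumption: columns containing a gap in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_zone(identity: float) -> str:
    """Label an identity value using the ranges quoted in the text (40%, 20-35%)."""
    if identity > 40.0:
        return "usually identified as homologs"
    if 20.0 <= identity <= 35.0:
        return "twilight zone"
    return "not covered by the ranges quoted in the text"


if __name__ == "__main__":
    # Hypothetical aligned pair, for illustration only.
    pid = percent_identity("MKT-AYIAKQR", "MKSLAYLA-QR")
    print(round(pid, 2), identity_zone(pid))
```

The label returned for a low-identity pair says nothing about whether the pair is actually homologous, which is exactly the limitation discussed above.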

Therefore, many improvements to homology searching and sequence comparison have been developed to overcome the limitations of sequence similarity [59-60].

To improve sequence-based analysis strategies, we have to determine how to represent proteins and which similarity metrics to use for such representations. Based on these two issues, homology detection methods can be roughly divided into two categories: generative models and discriminative models. Generative models describe a set of known proteins with a probabilistic model and, given a query protein sequence, provide a probabilistic measure of the fit between the query and the model. Discriminative models, on the other hand, focus on the differences between two sets of proteins.

Generative models devise probabilistic models to represent protein sequences, such as PSSMs [61], profiles [62], and profile hidden Markov models [63-65]. Well-known packages include HMMER and HMMERHEAD [66], COMPASS [67-69], COACH [70], HHSearch [71], and profile comparison tools such as PRC [72]. While there are concerns about the statistical measures of accuracy for these model-comparison tools [73-74], they provide the best available results among generative methods. These tools, however, are time-consuming. Therefore, profile-sequence (sequence-profile) search tools that strike a balance between speed and accuracy are the de facto standard for large-scale database searching. PSI-BLAST [75] is the most widely used of these in the bioinformatics community, while CS-BLAST/CSI-BLAST [76] provides more sensitive results based on similar ideas. A more detailed comparison can be found in [77].

Discriminative models mainly focus on designing kernel functions based on sequence patterns to distinguish sequences from two different sets. Most of these methods are based on support vector machines and extract frequent patterns from sequences as features for a string kernel. The first such kernel may have been the Fisher kernel [78]. Popular string kernels include, but are not limited to, the Pairwise kernel [79], the Spectrum and Mismatch kernels [80-81], the Local Alignment method [82], and Word Correlation Matrices [83]. Some methods integrate structural and motif information into the feature set, such as I-Sites [84], eMOTIF database search [85], Profile-Based Mismatch methods [86], and Profile-based direct methods [87]. More comprehensive information about discriminative methods can be found in [88-89].
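As an illustration of how a string kernel turns sequence patterns into a similarity value, the sketch below computes a k-spectrum kernel in its textbook form: the inner product of the k-mer count vectors of two sequences. It is not taken from the cited implementations; the function names and the default k = 3 are assumptions, and the normalized variant is one common convention rather than a required step.

```python
import math
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of a and b."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Iterate over the smaller table; a missing key in a Counter reads as 0.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(n * counts_b[word] for word, n in counts_a.items())


def normalized_spectrum_kernel(a: str, b: str, k: int = 3) -> float:
    """Cosine-style normalization so that a sequence scores 1.0 against itself."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    print(spectrum_kernel("ACAC", "CACA", k=2))
    print(normalized_spectrum_kernel("ACAC", "CACA", k=2))
```

A kernel value of this kind would typically be supplied to a support vector machine as the similarity between two training sequences.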

While discriminative models, especially string kernel methods, achieve better performance than generative models in some comparative studies [79, 81], their results often lack interpretable supporting evidence, such as the high-scoring segment pairs (HSPs) reported by general alignment tools. In addition, they may overfit because of parameter settings and feature selection.

Therefore, many strategies attempt to improve homology detection by building on the results of generative models, especially those of PSI-BLAST. RankProt [90] considers pairwise distances between all the query sequences to construct a relation network, and improves homology detection results by analyzing the network information. Ku and Yona [91] propose a framework based on similar ideas. Since current databases already contain many annotated sequences, a natural idea is to integrate information from external sequences to boost homology detection.

A simple way to integrate external sequence information into homology detection is intermediate sequence search (ISS) [92-93]. In short, if protein sequences A and B are both homologous to a third sequence C, A and B may be detected as homologs even though they share low identity. Improved frameworks based on similar ideas include SCOOP [94] and SIMPRO [95]. Moreover, some strategies apply information from probabilistic models instead of shared sequences only. Consensus-sequence-based methods are representative of these strategies. PHOG-BLAST [96] discretizes sequence profiles and generates a consensus for a query sequence by substituting each residue in the original sequence with the most important amino acid. Recently,

against NCBInr to obtain its PSSM. Then the original sequence is transformed into a consensus sequence based on this PSSM. They claim that using the informative consensus sequence as the object of comparison yields better homology search results than traditional PSI-BLAST searches.
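The transformation of a sequence into a consensus sequence can be sketched as follows. This is only an illustration under stated assumptions: the PSSM is represented as one dictionary of amino acid scores per position, and each residue is replaced by the highest-scoring amino acid in its column. The exact substitution rule used by the cited methods is not given here, and the function name is hypothetical.

```python
from typing import Dict, List


def consensus_from_pssm(sequence: str, pssm: List[Dict[str, float]]) -> str:
    """Replace each residue with the highest-scoring amino acid of its PSSM column.

    Assumptions: pssm[i] maps amino acids to scores for position i; an empty
    column leaves the original residue unchanged; ties keep the first entry
    in dictionary order.
    """
    if len(sequence) != len(pssm):
        raise ValueError("PSSM must have one column per residue")
    consensus = []
    for residue, column in zip(sequence, pssm):
        best = max(column, key=column.get) if column else residue
        consensus.append(best)
    return "".join(consensus)


if __name__ == "__main__":
    # Hypothetical three-column PSSM, for illustration only.
    toy_pssm = [
        {"M": 5.0, "L": 1.0},
        {"K": 0.0, "R": 3.0},
        {"T": 4.0, "S": 2.0},
    ]
    print(consensus_from_pssm("MKT", toy_pssm))
```

In practice the PSSM would come from an iterative profile search rather than being written by hand, and the resulting consensus sequence would then be used as the query or database entry in place of the original sequence.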

Based on the above observations, we aim to design a computational framework for detecting distant relationships between protein sequences in the twilight zone (sequence identities between 25% and 40%) or the midnight zone (sequence identities below 25%) with several properties. First, it should handle sequence relationships among proteins with low sequence identity. Second, the results of the framework should be explainable; that is, the result should provide evidence, and even high-quality alignments, to support its identification, rather than only profiles or a set of dozens of features. Third, the framework should be computationally incremental, so that sequences can easily be added to or deleted from the training set. In addition, the framework should make the best use of current homology search tools so that it is simple to implement. We therefore use fixed-length protein words as possible homology indicators in this framework. For each word in each sequence, we use PSI-BLAST to generate its variations. These variations are then integrated to estimate relations between novel sequences and annotated sequences. We demonstrate that this framework achieves high sensitivity in discovering protein homologs even when they share low sequence similarity with annotated sequences.
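The first step of such a word-based framework, cutting a sequence into fixed-length words, might look like the sketch below. The word length, the overlapping sliding window with a step of one residue, and the function name are assumptions made for illustration; the actual parameters of the framework are not specified in this section, and the PSI-BLAST step that generates word variations is not shown.

```python
from typing import List, Tuple


def protein_words(sequence: str, word_length: int, step: int = 1) -> List[Tuple[int, str]]:
    """Return (start position, word) pairs for every fixed-length word.

    Assumption: words are taken with a sliding window; a sequence shorter
    than word_length yields no words.
    """
    if word_length <= 0 or step <= 0:
        raise ValueError("word_length and step must be positive")
    last_start = len(sequence) - word_length
    return [
        (start, sequence[start:start + word_length])
        for start in range(0, last_start + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical sequence and word length, for illustration only.
    for start, word in protein_words("MKTAYIAK", word_length=5):
        print(start, word)
```

Keeping the start position alongside each word makes it possible to map a matching word back to its location in the source sequence, which supports the requirement that results be explainable.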