Prediction of Human miRNAs, Targets and Homologs

Doctoral dissertation
Department of Computer Science and Information Engineering, National Taiwan Normal University

Prediction of Human miRNAs, Targets and Homologs

Author: Yao-Ming Chang
Advisors: Yao-Ming Yeh and Arthur Chun-Chieh Shih, Ph.D.

January 2010

Abstract

MicroRNAs (miRNAs) are small endogenous RNA molecules of ~22 nt that target specific mRNAs to reduce their expression or translation. A large proportion of human protein-coding genes are probably regulated by miRNAs, suggesting that miRNAs play a critical role in a wide variety of biological functions. In this dissertation, we focus on two issues in miRNA research: novel miRNA prediction and miRNA homolog prediction. We study these two issues through a thorough understanding of miRNA biogenesis and evolutionary characteristics, and then propose two new approaches to address them.

In the first work, we developed a method to predict novel human miRNAs and target genes without requiring cross-species conservation. We first identified lowly or moderately expressed tissue-selective genes using EST data and then identified overrepresented motifs of 7 nucleotides in the 3' UTRs of these genes. Using these motifs as potential miRNA target sites, we recovered more than two thirds of the known human miRNAs. We then used the motifs that did not match the seed region of any known human miRNA to infer novel miRNAs. Under a stringent criterion, we predicted 36 new human miRNA genes with 44 mature forms and 4 novel alternative mature forms of 2 known miRNA genes; under a less stringent criterion, we predicted many more novel miRNAs. Some of our predictions have been experimentally verified with a high success rate (8 out of 11), which can substantially reduce experimental cost and time.

In the second work, we proposed a new search method to discover as many human miRNA homologs as possible in distant species such as worm, fruit fly, lancelet, and zebrafish. We first searched the genomes for homologous candidates of a given known mature miRNA. The similar mature candidates were then extended to precursor candidates and checked by filters based on both sequence and structural criteria. The precursor candidates that passed all filters were considered possible miRNA homologs. In our results, many human miRNA homologs were found in all four genomes. We therefore infer that most human miRNAs may share common ancestors with those of worm and fruit fly.

Keywords: microRNA, target gene, miRNA prediction, miRNA target prediction, tissue-selective genes, frequent pattern, miRNA homolog prediction, homologous miRNA.

Contents

Chapter 1 Introduction
  1.1 What are microRNAs
    1.1.1 History of miRNAs
    1.1.2 MiRNA biogenesis
    1.1.3 Principles of miRNA Target Recognition
    1.1.4 Database of Known miRNA
  1.2 Motivation
  1.3 Objectives
  1.4 Organization of the dissertation

Chapter 2 Background
  2.1 miRNA Prediction
    2.1.1 Computational Methods
    2.1.2 Biological Methods
  2.2 miRNA Target Prediction
  2.3 Limitations of Previous miRNA and Target Prediction Methods
  2.4 Gene Expression Profile
    2.4.1 Expressed Sequence Tag
    2.4.2 Microarray
  2.5 Clustering of Known miRNAs
  2.6 Homologous miRNA Search

Chapter 3 miRNA and Target Gene Prediction Using Tissue-Selective Motif
  3.1 Introduction
  3.2 Methods
    3.2.1 Collection of Human Gene Expression Data
    3.2.2 Identification of Low-Key Tissue-Selective Genes
    3.2.3 Identification of Tissue-Selective Motifs in 3' UTR
    3.2.4 Motif Filtering
    3.2.5 Matching Frequent Motifs to the Seed Region of Known Mature miRNAs
    3.2.6 Secondary Structure of Potential Novel miRNAs
  3.3 Results
    3.3.1 Low-Key Tissue-Selective Genes in Tissues
    3.3.2 Frequent Tissue-Selective Motifs
    3.3.3 Predicted Targets of Known miRNAs
    3.3.4 Predicted Novel miRNAs
  3.4 Experimental validation of predicted novel miRNAs and their targets
    3.4.1 Method Designed of Experimental Validation
    3.4.2 Expression Tests of Predicted Novel miRNAs
    3.4.3 Functional Validation of Novel miRNAs by Luciferase Assay and Immunoblotting
  3.5 Summary

Chapter 4 Prediction of Human miRNA Homologs in Distant Species
  4.1 Introduction
    4.1.1 Background
    4.1.2 Difference of Homology Search Between Protein-Coding Genes and miRNAs
  4.2 Materials and Methods
    4.2.1 miRNA Reference sets, Genomic Sequences, and Annotations
    4.2.2 Methods
  4.3 Results
    4.3.1 Homologous Candidates of Human miRNAs in Four Species
    4.3.2 Comparison of Sequence Similarity and Structural Base-Pairing by BBQ Grid Representation
    4.3.3 New Predicted Human miRNA Homologs in the Four Genomes
    4.3.4 Biological Supporting Evidences
  4.4 Discussions and Conclusions
    4.4.1 Comparison of the Bi-swing Match Method and BLAST
    4.4.2 Pseudo miRNA and miRNA Evolution
    4.4.3 MiR-548 Family in Non-Primate Species
    4.4.4 Conclusions

Chapter 5 Conclusions and future works
  5.1 Contribution to Biology
  5.2 Contribution to Computer Science
  5.3 Future works
    5.3.1 Atypical miRNA Target Site Prediction
    5.3.2 Plant miRNA Site Prediction

Bibliography
List of Publications

List of Figures

Figure 1. miRNA biogenesis.
Figure 2. Three categories of miRNA-target binding.
Figure 3. Flowchart of the proposed method.
Figure 4. Distribution of 18,021 human Ensembl genes in 40 tissues from the BodyMap-Xs database.
Figure 5. Distribution of low-key tissue-selective genes in 40 tissues.
Figure 6. Distribution of frequent tissue-selective motifs in 40 tissues.
Figure 7. Three possible seed match cases of miR-203 with different motifs.
Figure 8. Distribution of the known miRNAs whose seed regions match our predicted tissue-selective motifs in 40 tissues.
Figure 9. Two examples of the frequent tissue-selective motifs that match known human miRNAs.
Figure 10. Three cases of frequent motifs that match predicted secondary structures of mature miRNA candidates.
Figure 11. A schematic presentation of the experimental validation of the predicted novel miRNAs and target genes, designed by Prof. Juan's lab.
Figure 12. PCR amplification of eight novel mature miRNAs and their endogenous expression levels.
Figure 13. MCF-7 cells were transfected with 100-300 pmol mimic of inhibitor.
Figure 14. Distribution of the number of tissues in which a target gene predicted by TargetScan is expressed.
Figure 15. Flowchart of the proposed method.
Figure 16. Comparison of miR-124 fruit fly and worm homologs by the BBQ grid representation.
Figure 17. Flowchart of the BG score calculation.
Figure 18. Distributions of the sequence similarity and base-pairing ratios in the BBQ representation.
Figure 19. The sequence and structural alignments and the BBQ grid representations of four human homologous candidates with the highest BG scores.
Figure 20. Sequence similarity and structural base-pairing between hsa-miR-449 and the predicted homologs in four species.
Figure 21. The complete prediction information of a predicted homolog pre-dme-mir-574.

List of Tables

Table 1. Comparison of homology search methods.
Table 2. Predicted novel miRNAs with no G:U pairing in the seed match.
Table 3. Predicted novel miRNAs with 1 G:U pairing in seed match.
Table 4. Expression tests and sequencing validation of 11 predicted miRNAs in breast and lung.
Table 5. Functional categories of the tissue-selective genes with frequent motifs in 40 tissues.
Table 6. Comparison of protein coding gene homology and miRNA homology.
Table 7. The numbers of homologous candidates in four species.
Table 8. A list of our predicted human homologous candidates with BG scores ≥ 2.
Table 9. The numbers of human homologous miRNAs in miRBase and our predicted homologs with conserved seeds.
Table 10. Comparison of our predicted candidates and known human miRNAs by spliced EST data, genome annotation, and closed species conservation.

Dedicated to my dearest parents.

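Acknowledgments

First, I thank my dearest family. Your warmth and care have made me who I am today, and this honor belongs to you.

I thank my advisor at the Department of Computer Science and Information Engineering of National Taiwan Normal University, Prof. Yao-Ming Yeh (葉耀明), for his attentive teaching. Throughout more than ten years of research, from the master's program to the doctoral program, he guided my studies with great care, and his conduct toward others and his approach to scholarship became the goal and model I strive for. He gave me a valuable, fulfilling, and enjoyable research experience.

For the completion of this dissertation, I am most grateful to my advisor at the Institute of Information Science, Academia Sinica, Dr. Arthur Chun-Chieh Shih (施純傑). His attentive teaching let me glimpse the depth and breadth of bioinformatics, and our frequent discussions pointed me in the right direction. I have benefited greatly over these years, and his rigor toward scholarship is a model for my generation.

I thank Professors 黃文吉, 阮雪芬, 黃宣誠, and 楊順聰, and 莊樹諄, for taking time from their busy schedules to review this dissertation and for offering many valuable suggestions, which made it richer and more complete.

During my research, I especially thank my senior labmate 明正 and my junior labmates 德清, 士翔, and 金麟 for their help and encouragement. They added much color and joy to my research life and helped this dissertation reach completion. I sincerely wish all of you success in your research.

Finally, I thank my beloved 禎苑. Besides quietly understanding, tolerating, and supporting me, she has always stayed with me to bear and share every frustration and joy. In particular, she took on most of the sweet burden of our two little ones, 祖睿 and 芝齊, so that I could complete my studies without worries.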

Chapter 1 Introduction

1.1 What are microRNAs

MicroRNAs (miRNAs) are single-stranded RNAs of ~22 nucleotides (nt) in length. Instead of being translated into protein, miRNAs down-regulate specific genes at the level of protein synthesis or mRNA expression [1-3]. A large proportion of human protein-coding genes are estimated to be regulated by miRNAs, indicating that miRNAs play a critical role in biological functions.

1.1.1 History of miRNAs

The "central dogma" describes the genetic principle that DNA is transcribed into messenger RNA (mRNA) and mRNA in turn is translated into protein (the left part of Fig. 1). However, a remarkable discovery was made in 1993: Lee et al. (1993) and Wightman et al. (1993), studying the roundworm Caenorhabditis elegans, an organism frequently used in genetic analysis, found that some RNAs did not code for proteins [4, 5]. Rather, this class of RNA interfered with mRNA to prevent translation, a role that the central dogma did not anticipate. Researchers had stumbled upon a novel mechanism for controlling gene expression. These unusual RNAs are only a small fraction of the length of a typical mRNA molecule. Appropriately named microRNAs (miRNAs), these small molecules are transcribed from DNA like all RNA. Unlike messenger RNA, miRNAs are responsible not for protein synthesis but for the regulation of translation. For example, lin-4, the first miRNA discovered in C. elegans, suppresses the expression of a specific developmental gene. Many other miRNAs have since been discovered in plants [6], fruit flies [7], mice, and humans [8], and all of them play a similar regulatory role.

1.1.2 MiRNA biogenesis

MiRNAs are reverse-complementary to part of the mRNA transcripts of specific protein-coding genes and inhibit their translation or reduce their expression. MiRNAs were first observed in C. elegans, in the form of the lin-4 [4, 5] and let-7 [1] genes. These noncoding RNA molecules are 18–23 nt long and are either fully or partially complementary to the 3' untranslated regions (UTRs) of specifically expressed genes. As a result, these miRNAs regulate the development of worms. Subsequently, miRNAs were found in organisms ranging from worms and flies to mice and humans [9]. Growing evidence indicates that miRNAs belonging to the same gene family may have evolved from a common ancestral small RNA gene.

MiRNAs are transcribed from DNA, but their transcripts differ from the general structure of genes in genomes. As shown in the right part of Fig. 1, a primary miRNA transcript, called a pri-miRNA, is transcribed from the genome and then forms a hairpin-like structure. The pri-miRNA is processed into a miRNA precursor (pre-miRNA), 60-90 nt in length, by the ribonuclease Drosha in the nucleus [10] and is then exported from the nucleus by the Exportin-5 protein [11]. A miRNA precursor can form a stem-loop structure. The loop of each pre-miRNA is excised by the cytoplasmic RNase III enzyme Dicer, and the two remaining stem parts form a double-stranded duplex ~22 nt long. After the duplex unwinds, one or both of the single-stranded RNAs, called mature miRNAs, are loaded into an RNA-induced silencing complex (RISC). In animals, a mature miRNA in RISC can suppress mRNA expression or directly inhibit protein synthesis of specific genes by recognizing fully or partially complementary counterparts, called target sites, in the 3' UTRs of these genes. In some cases, the formation of dsRNA through miRNA binding triggers an mRNA degradation process similar to RNA interference (RNAi); in other cases, the miRNA complex blocks the protein translation machinery without causing the bound mRNA to be degraded.

Figure 1. miRNA biogenesis. The left diagram illustrates the central dogma with miRNA: miRNA is also transcribed from DNA but can regulate specific mRNAs. The right diagram illustrates the biogenesis of miRNA.

Because most target site binding is based on partial complementarity, one miRNA may target more than one mRNA, and different miRNAs may target the same mRNA, coordinately regulating gene expression in various tissues and cells. It is therefore believed that miRNAs fine-tune rather than shut off the expression of protein-coding genes. Indeed, the discovery of miRNAs has revolutionized our understanding of gene regulation in the post-genome era [12, 13].

1.1.3 Principles of miRNA Target Recognition

In animals, miRNAs interact with their targets through imperfect base pairing. From experimental observations, miRNA-target binding sites can be grouped into three major categories: 5' dominant, 3' compensatory [14, 15], and middle [16] (Fig. 2). Traditionally, the region from the 2nd to the 8th nucleotide of a mature miRNA is called the seed region. In the first category, the 5' dominant canonical sites in the 3' UTR of the target mRNA have 7 to 8 consecutive nucleotides that match the seed region perfectly. In some cases, an additional region pairs with the 3' end of the miRNA. In the second category, the 3' compensatory sites have a few mismatches or wobble base pairs in the seed region but pair extensively, over 7 to 8 consecutive nucleotides, with the 3' end of the miRNA to compensate for the weak binding at the seed region. The middle sites, recently found and verified by Luo et al. [16], pair perfectly with the middle region of the miRNA but have some mismatches or wobbles at its 5' and 3' ends.
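The seed-pairing rule described in Section 1.1.3 can be illustrated with a short sketch. It only covers the simplest case, a perfect 7-nt match to the seed region (nucleotides 2 to 8), and ignores wobble pairs, 3' compensatory pairing, and middle sites. The function names and the toy 3' UTR string are hypothetical and chosen for illustration; the miRNA string is used purely as sample input. This is not the method developed in this dissertation.

```python
from typing import List

# Watson-Crick complements for RNA bases.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_region(mirna: str) -> str:
    """Return nucleotides 2-8 (1-based) of a mature miRNA written 5'->3'."""
    return mirna.upper().replace("T", "U")[1:8]


def reverse_complement(rna: str) -> str:
    """Reverse-complement an RNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_seed_sites(mirna: str, utr: str) -> List[int]:
    """Return 0-based start positions in a 3' UTR (written 5'->3') whose
    7 nucleotides pair perfectly with the miRNA seed region."""
    site = reverse_complement(seed_region(mirna))
    utr = utr.upper().replace("T", "U")
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)  # allow overlapping sites
    return positions


# Sample input only: a let-7-like mature sequence and a made-up 3' UTR.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
toy_utr = "AAACUACCUCAGGGCUACCUCUU"
print(seed_region(mirna))             # GAGGUAG
print(find_seed_sites(mirna, toy_utr))  # [3, 14]
```

Because the target site is the reverse complement of the seed, the search runs on the mRNA strand directly; extending the sketch to tolerate G:U wobble pairs would require a position-by-position comparison instead of an exact string search.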

Figure 2. Three categories of miRNA-target binding. (A) 5' dominant sites, which include 5' canonical and 5' seed sites. (B) 3' compensatory site. (C) Middle site.

1.1.4 Database of Known miRNA

The best-known and most authoritative miRNA database is miRBase (http://microrna.sanger.ac.uk/), which keeps track of all known miRNA genes, including their stem-loop and mature sequences, genomic positions, and related literature [17]. Newly discovered miRNAs from different species are submitted to this database. As of September 2009, 721, 579, 360, and 157 miRNA genes had been found in the human, mouse, zebrafish, and fruit fly genomes, respectively.

1.2 Motivation

Many miRNAs have been discovered and found to play a critical role in development and disease in animals as well as in plants [18]. In animals, many mature miRNA sequences, such as those of the let-7 family, have been evolutionarily conserved from worm to human and are thought to regulate common and fundamental biological functions. Moreover, some miRNAs have been found to be expressed only in specific tissues or at particular developmental stages, indicating that these miRNAs may be tissue-specific or species-specific. Discovering the target genes of miRNAs in different tissues or developmental stages will help researchers explore and uncover the complex regulatory networks. MiRNAs are involved in many biological processes, including differentiation [19], proliferation [20], and apoptosis [21], and in diseases such as cancers, cardiovascular diseases, and autism. In addition, some tissue-specific miRNAs have critical functions; for example, the pancreatic islet-specific miR-375 regulates insulin secretion [22]. Uncovering the relationships between miRNAs and these biological processes can therefore enhance our understanding of gene regulation in disease.

Previous studies [23] suggested that the human genome should contain thousands of miRNAs, but only hundreds have been discovered so far. Moreover, many potential target sites are missing from the results of current state-of-the-art target prediction methods. Furthermore, experimental approaches to comprehensively identifying novel miRNAs or target genes under different conditions are expensive and time consuming. More effort is still needed to explore the frontiers of miRNA research.

In addition to novel miRNA and target prediction, the origin of miRNAs and their evolutionary mechanisms are still unclear and controversial. As in the study of protein-coding gene evolution, the comparison of homologous sequences in different species is the major approach to studying miRNA evolution. Several approaches have been proposed to search for miRNA homologs in genomes [24-26], but most of them adapted the ideas used for searching protein-coding gene homologs and identified only a few human miRNA homologs in distant species. It is therefore necessary to propose a new search approach based on miRNA characteristics to overcome this problem.

1.3 Objectives

miRNA research covers many topics [19-21]. From the computational perspective, we focus on two research issues: novel miRNA prediction and miRNA homolog prediction. The first issue is to identify novel miRNAs that have not yet been discovered in the human genome. In the second issue, we search for possible human miRNA homologs in distant species, which can help reveal the origin and evolution of miRNAs through the comparison of homologous sequences.

1.4 Organization of the dissertation

We introduced the fundamental biological concepts of miRNA regulation at the beginning of this chapter. In the next chapter, we survey previous studies of miRNAs and their limitations. In Chapter 3, we first explain the idea of our proposed reverse approach, which predicts novel miRNAs and target genes simultaneously using tissue-selective motifs in 3' UTRs. We then describe the detailed flow of the proposed method and present the prediction results and experimental validation to demonstrate our contribution to biology. In Chapter 4, we focus on the homology search of human miRNAs in four distant species. We not only introduce a new search method to predict human miRNA homologs but also present a new representation for comparing miRNA homologs. The prediction and comparison results provide a general and global perspective for research on miRNA evolution. Finally, Chapter 5 summarizes the main results of this dissertation and introduces the future work we plan to pursue.

Chapter 2 Background

Many methods of miRNA and target gene prediction in the literature [15, 27, 28] can be classified into computational and biological approaches. In this chapter, we introduce these methods and list their limitations. Section 2.1 introduces computational and biological methods of miRNA prediction, and Section 2.2 introduces miRNA target prediction. The limitations and problems of these prediction methods are given in Section 2.3. Section 2.4 introduces different types of gene expression data. Sections 2.5 and 2.6 introduce the clustering of known miRNAs and homologous miRNA search, respectively.

2.1 miRNA Prediction

Computational methods of miRNA prediction can be divided into two subclasses: rule-based and supervised learning methods [27]. In the rule-based methods, rules observed from experimentally validated data serve as criteria for searching whole genomic sequences. Subsequences that satisfy all the criteria are predicted as novel miRNA candidates. In the supervised learning methods, validated miRNAs are taken as training data, and various classifiers or statistical models are then applied to obtain predictions. In recent years, next-generation high-throughput sequencing technologies such as 454 [29] and Solexa [30] have been introduced. With these technologies, thousands or millions of small RNA sequences (including miRNAs) expressed in a cell can be identified directly at once. However, these technologies still have their limitations. In what follows, we introduce these two types of novel miRNA discovery, computational and biological, in detail.

2.1.1 Computational Methods

A precursor miRNA forms a distinctive hairpin fold-back structure, and the mature miRNA is located in the stem region of the precursor hairpin. Searching for hairpin structures in genomic sequences is therefore the most important criterion in computational miRNA prediction. Available methods generally rely on hairpin structures predicted by energy minimization-based algorithms such as mfold [31] or RNAfold [32]. These programs perform ab initio secondary structure prediction, and their output is the structure with the minimum free energy (mfe). However, most eukaryotic genomes contain abundant inverted repeats that can also form hairpins, and almost all of them are not real miRNAs. It has been estimated that about 11 million hairpin structures can be found in the human genome [33]. Since not all hairpins are miRNAs, the question arises as to how to determine which hairpins are real miRNAs.

Two general criteria are used to reduce the search space: (a) removing the hairpins located in repetitive elements and exons, and (b) using evolutionary conservation [34-39]. The first criterion assumes that most miRNAs are not located in exons or repetitive regions of genomes. Such a filter can probably increase the signal-to-noise ratio. However, the assumption that miRNAs are not repeat-derived is not always correct, because 11 mammalian miRNA precursors have been found in repetitive regions [40]. Nevertheless, this assumption is still a good criterion for improving prediction specificity. Evolutionary conservation is a more general criterion based on a fundamental concept of molecular evolution: functional regions are usually more conserved among species than non-functional regions because of selection pressure. Thus, if the predicted hairpin sequences are conserved across species, they are probably real miRNAs.
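To make the hairpin criterion concrete, the following is a minimal sketch of a fold-back check. It is a deliberately crude stand-in for the energy minimization performed by mfold or RNAfold: it aligns the 5' arm of a candidate against its 3' arm without gaps around a central loop and reports the fraction of positions that can pair. The function names, the `loop_len` parameter, and the `min_ratio` threshold are assumptions made for illustration and do not come from any published tool.

```python
# Base pairs allowed in RNA stems: Watson-Crick plus the G:U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairing_ratio(seq: str, loop_len: int) -> float:
    """Fraction of paired positions when the two arms flanking a central
    loop of about loop_len nucleotides are aligned without gaps."""
    seq = seq.upper().replace("T", "U")
    arm_len = (len(seq) - loop_len) // 2  # any odd remainder joins the loop
    if arm_len <= 0:
        return 0.0
    five_prime = seq[:arm_len]
    # Read the 3' arm backwards so that position i faces position i of the 5' arm.
    three_prime = seq[len(seq) - arm_len:][::-1]
    paired = sum((a, b) in PAIRS for a, b in zip(five_prime, three_prime))
    return paired / arm_len


def is_hairpin_like(seq: str, loop_len: int = 4, min_ratio: float = 0.8) -> bool:
    """Hypothetical rule: call a sequence hairpin-like if most arm positions pair."""
    return stem_pairing_ratio(seq, loop_len) >= min_ratio


# Toy sequences: a perfect 7-bp stem with a 4-nt loop, and a non-pairing control.
print(stem_pairing_ratio("GGGAUCCUUCGGGAUCCC", 4))  # 1.0
print(is_hairpin_like("AAAAAAAUUCGAAAAAAA"))        # False
```

A real pipeline would replace this check with a folding program, because bulges and internal loops in genuine precursors shift the arm alignment and an ungapped comparison would miss them.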
This conservation filter can be thought of as a 10.

(20) requirement that the candidate should be also found in other genomes in corresponding locations. However, it has been argued that the regions without conservation do not necessarily mean lack of biological functions. The drawback of using the conservation filter is that it is not easy to detect species-specific miRNAs, which have been found in [33]. Once a hairpin has been passed the “region” constraints, two different approaches can be used to judge whether the structure of a hairpin is “miRNA-like”: (a) rule-based, and (b) machine learning techniques. The rule-based hairpin classification examines the structures with minimum free energy (mfe) and extracts the features identified by researchers. For example, MiRScan scores two aligned segments from two different genomes [8]. This is done on the basis of seven features including: (a) the amount of base pairing of the proposed mature miRNA, (b) amount of base pairing in the stem excluding the mature miRNA, (c) conservation in the 5’ end of the two aligned sequences, (d) 3’ conservation (similar to 5’), (e) bulge symmetry, (f) distance form mature miRNA to the loop of the hairpin, and (g) the specific bases at the first five positions of the candidate mature miRNA. MiRscan uses the mixture of base pairing probabilities from the partition function and parse the mfe structure. For the value of each feature, a log-odds score is estimated based on the frequency of that value in a training set of known miRNAs and 36,000 “background hairpins.” Machine learning algorithms have been applied to many problems including miRNA and target gene prediction. RNAmicro is a tool based on the general RNAz method to predict miRNAs by multiple alignments [41]. In short, it decomposes candidates identified by RNAz into different kinds of features describing the structure, sequence composition and conservation, thermodynamic stability, and structural conservation. Then, the features are fed to a support vector machine (SVM) 11.

(21) that classifies the candidates into miRNAs and non-miRNAs. miSVM is another tool that also uses SVM as a classification kernel [42]. miSVM works on plants and uses a parameterization approach similar to that of miRScan, and is able to find 75% of the known Arabidopsis miRNAs.. 2.1.2 Biological Methods Sequencing-based applications for identifying and profiling small RNAs have been hindered by laborious cloning techniques and the expense of capillary DNA sequencing [43, 44]. Nevertheless, direct small RNA sequencing has several advantages over hybridization-based methodologies. Discovery of novel miRNAs need not rely on querying candidate regions of the genome but rather can be achieved by direct observation and validation of the folding potential of flanking genomic sequence [44, 45]. Direct sequencing also offers the potential to detect variation in mature miRNA length, as well as enzymatic modification of miRNAs such as RNA editing [46] and 3’ nucleotide additions [47, 48]. In contrast with capillary sequencing, recently available “next-generation” sequencing technologies offer inexpensive increases in throughput, thereby providing a more complete view of the miRNA transcriptome. With the added depth of sequencing now possible, we have an opportunity to identify low-abundance miRNAs or those exhibiting modest expression differences between samples, which may not be detected by hybridization-based methods. Next-generation miRNA profiling has already been realized in a few organisms [49-52] using the massively parallel signature sequencing (MPSS) methodology [53] and more recently the Roche/454 platform [29]. The recently released Illumina sequencing platform provides approximately two orders of magnitude greater depth than current competing technologies [30], yielding up to several million sequences from a single flow cell lane 12.

(http://www.solexa.com).

2.2 miRNA Target Prediction

Computational prediction of miRNA targets is more difficult in animals than in plants because miRNAs pair imperfectly with their targets in animals, whereas miRNA-target binding in plants is perfectly complementary [54]. The principles of miRNA target site prediction used by different approaches are relatively similar. These principles were derived from earlier experimental observations of the pairing of mRNAs and miRNAs, such as lin-4 and let-7 in C. elegans [4, 55, 56] and bantam in Drosophila [7]. The major common prediction criteria include:

(a) Seed pairing: the mature miRNA sequence is complementary to the 3’ UTR sequence of target mRNAs. In particular, strong binding of the 5’ seed region of the mature miRNA to the 3’ UTR sequence is very important for targeting. In miRNA-target binding, G:U wobble pairing is allowed, but it reduces the silencing efficiency [57]. In addition to the 3’ UTR regions, Ambros [58] suggested that the 5’ UTR regions of a potential target mRNA should also be checked.

(b) Site conservation: as in miRNA gene prediction, conservation is the most commonly used property for reducing the search space of target site prediction. TargetScanS [59], one of the state-of-the-art prediction methods, also considers the presence of conserved adenosines surrounding the seed sites.

(c) Hybridization stability: the thermodynamics of RNA-RNA duplexes can be determined by RNA folding programs, such as RNAfold, and has been considered an important factor in most prediction algorithms. However, Lewis et al. (2005) showed that this factor can be omitted when other conserved sequence information is incorporated [59].

(d) Site number: similar to the transcription factor control of genes [57, 60, 61], multiple binding sites for a miRNA in the 3’ UTR of a target gene can increase the efficiency of RNA silencing.

(e) Site accessibility: Du and Zamore (2005) observed that the lack of a strong secondary structure at the miRNA-binding site on the target may be an important feature [62].

As in miRNA gene prediction, machine learning algorithms have also been applied to miRNA target prediction. TargetBoost [63] is a miRNA target prediction method based on a machine learning algorithm. It uses only sequence information to create weighted sequence motifs that capture the binding characteristics between miRNAs and their targets. The authors reported that TargetBoost is stable and can identify more of the already verified targets than other existing algorithms. Sung-Kyu et al. (2005) developed a machine learning algorithm using an SVM to predict miRNA targets [64]. Recently, Yan et al. (2007) used a machine learning approach that employs features extracted from both seed and out-seed segments [65]. Their best result was an accuracy of 82.95%, obtained using only 48 positive and 16 negative human examples, a relatively small training set for assessing the algorithm. Thadani & Tammi (2006) proposed MicroTar, a statistical computational tool for predicting miRNA targets from RNA duplexes, which does not use sequence homology for prediction [66]. MicroTar relies mainly on the authors' proposed approach for estimating the duplex energy. However, the reported sensitivity (60%) is lower than that of other published algorithms. At the same time, a miRNA pattern-discovery method, RNA22 [67], was proposed to scan the 3’ UTR sequences of possible target genes. Although it does not rely on cross-species conservation, RNA22 was able to recover most of the confirmed target sites. More recently, Yousef et al. (2007) described a

target-prediction method called NBmiRTar [68], whose kernel is a Naive Bayes classifier. Without requiring sequence conservation, NBmiRTar generates a model from sequences and miRNA-mRNA duplex information derived from validated target sequences and artificially generated negative examples. In this method, both the seed and the ‘out-seed’ segments of the miRNA-mRNA duplex are used for target identification. NBmiRTar produced fewer false-positive predictions and fewer target candidates than miRanda [69, 70]. In short, these studies indicate that the conservation criterion is the most effective way to increase sensitivity and specificity and to decrease the false-positive rate of miRNA target prediction.

2.3 Limitations of Previous miRNA and Target Prediction Methods

In what follows, we summarize the problems and limitations of the common criteria used in previous methods for miRNA and target gene prediction.

Evolutionary conservation is a good criterion for reducing the search space. However, the conserved sequence set depends on how many species are selected and how divergent they are. Moreover, some species-specific miRNAs or target sites may be missed when the conservation criterion is applied.

Not all target sites bind to the 5’ seed region of a miRNA. Although many experimental data have revealed that targets with a perfect match to the 5’ seed region are functional [14, 71], other classes of target sites have also been experimentally verified in the literature [14, 16]. Therefore, considering only 5’ seed region matching can efficiently filter out many noise candidates but may also miss genuine target sites.

New results from high-throughput sequencing technology [28] reveal that more than one type of mature miRNA can be produced from a pre-miRNA. These small RNAs, called isomiRs, differ from the originally

defined mature miRNA by a shift of a few nucleotides. Since the isomiRs contain different seed regions, their target sites are also different. Until now, no method has been able to predict the target genes of these isomiRs. Thus, many miRNA target sites have not yet been identified.

Based on experimentally validated data, prediction methods that use a supervised machine learning classifier can effectively eliminate unlikely candidates. However, the prediction results may be limited by the small number of training samples.

Most target gene prediction methods do not consider the expression profiles of target genes. These methods scan the 3’ UTR sequences of all genes in a genome, predict possible target sites for every miRNA, and generate a huge number of candidates. However, not every gene is expressed in every tissue at every time point. Without expression information, the false-positive rate will be high when candidates are selected for in vivo experimental validation.

2.4 Gene Expression Profile

DNA sequences are initially transcribed into mRNA sequences. These mRNA sequences are in turn translated into the amino acid sequences of proteins that perform various cellular functions. A crucial aspect of proper cell function is the regulation of gene expression, because different cell types express different subsets of genes. Measuring mRNA levels can provide a detailed molecular view of the subset of genes expressed in different cell types under different conditions. EST (Expressed Sequence Tag) and microarray data are two popular types of biological data for profiling the expression of genes in different tissues.

2.4.1 Expressed Sequence Tag

Expressed sequence tags (ESTs) [72] are partial sequences of cDNA resulting

from a single-pass sequencing of clones from cDNA libraries. One gene can have many ESTs. Sequence information from ESTs can be used for deciphering the function and organization of gene expression in genomes. From a functional viewpoint, we can use ESTs to determine the expression profiles of genes in tissues, in different conditions or states, and then identify regulated genes. Thus, to identify which genes are involved in particular processes or tissues, we can select expressed genes from EST data and validate them by PCR amplification. By combining sequence, functional, and localization information, ESTs provide integrated information for genome studies.

2.4.2 Microarray

The DNA microarray is a technology that allows biologists to simultaneously measure the mRNA levels of many genes on a small chip [73]. DNA microarrays are small biological arrays designed to monitor the expression of hundreds or thousands of genes. With fair sensitivity and accuracy at reasonable cost, DNA microarrays have been a powerful tool for monitoring gene expression in the past decade. However, individual hybridizations are usually noisy and vary between experiments, so the expression values of single data points may not be reliable, especially for genes with low expression levels. Furthermore, the most highly expressed genes, or those showing the largest differences in expression in a particular comparison, may not be the most biologically relevant. Often genes with known biological functions show a slight, though significant, change in transcript levels [74].

2.5 Clustering of Known miRNAs

Some known miRNAs have similar mature sequences or the same seed regions. Several studies have defined rules for clustering them

together. First, Griffiths-Jones et al. (2004) defined miRNAs with highly similar mature regions as belonging to the same family [75]. The underlying assumption is that the miRNAs in the same family are derived from a common ancestor. Their definition is simple, but how to determine the similarity between two mature sequences was not well defined in their study. Based on the theory that miRNA regulation is determined by the seed [18, 76], Grun et al. (2005) defined miRNAs having the same seed region as belonging to the same family [77]. Here, a seed region is defined as the region located at the 1st to 7th or the 2nd to 8th positions from the 5’ end of a mature sequence. Ibanez-Ventoso compared the mature miRNAs of human, C. elegans, and D. melanogaster and then proposed the following two criteria to determine whether two miRNAs belong to the same family [78]. First, the mature sequences have the same seed region. Unlike Grun et al.'s definition, this definition also allows the seed region to start at the 3rd or the 4th base and end at the 9th or the 10th position from the 5' end of the mature sequence. Moreover, one G:A or G:U match is allowed in the seed comparison. Second, the similarity of the two mature sequences should be > 70%. Based on this definition, 84 C. elegans miRNAs and 75 D. melanogaster miRNAs were found to be homologous to each other. Moreover, 73 C. elegans miRNAs have homologs in both the fly and human genomes.

2.6 Homologous miRNA Search

As in homology search for protein-coding genes, the most popular approach uses known miRNA precursors as queries to search for homologous candidates in genomes with BLAST. Wang et al. used the BLAST tool as the search kernel [25, 79]. They first set a high E-value threshold, 10, and then used the

structure alignment constraint to remove impossible candidates. Artzi et al. also used the BLAST tool as the search kernel [26], but they set a stringent criterion in the first step, E-value < 0.05, and then used a loose constraint in the structure comparison. However, the average length of a miRNA precursor is only 70 nt, much shorter than that of protein-coding genes, and the variations are not randomly distributed because of the stem-loop constraint. This limits the ability of BLAST-based methods to find more miRNA homologs.

Table 1. Comparison of homology search methods.

Method | Kernel | Query input | Sequence criterion | Structure criterion | Filter
Legendre et al. 2005 | ERPIN | Cluster of precursors | Ave. similarity = 77% in cluster | Consensus 2nd structure | RepeatMasker
Wang et al. 2005 (miRAlign) | BLAST | Precursor | 1. E-value ≤ 10; 2. word-length: 7 | 1. Flanking nucleotides; 2. MFE ≤ -20 kcal/mol; 3. RNA structure alignment (RNAforester, 2003) | 1. Mature similarity ≥ 70%; 2. Position of mature
Artzi et al. 2008 (miRminer) | BLAST | Precursor | E-value ≤ 0.05 | 1. Flanking nucleotides; 2. MFE ≤ -25 kcal/mol; 3. Hairpin shape; 4. Base-pairing ≥ 55% | 1. Mature similarity ≥ 80% (max. mismatch: 3 nt); 2. Seed conservation; 3. Position of mature
Our proposed method 2010 | Bi-swing match | Mature | Bi-swing score ≥ 0.6 | 1. Flanking nucleotides; 2. Hairpin shape; 3. Base-pairing ≥ 70% | 1. RepeatMasker; 2. Seed conservation; 3. Position of mature

To solve this problem, Legendre et al. (2005) proposed a different search method that does not use single precursors as the query [24]. First, they aligned every pair of different miRNA precursors in miRBase and calculated the pairwise sequence similarities. Second, the precursors with high similarity scores were clustered together. Third, they used multiple sequence alignment to find the common secondary structures of the precursors in each cluster. Finally, a set of query profiles was obtained by combining the sequence and structure features. They used these profiles as queries to search for miRNA homologs in different species and identified more homologous candidates. However, those miRNAs that can be clustered together and

formed into profiles are still the minority. Many miRNAs are therefore not covered by such an approach. Table 1 compares these three methods with our proposed method. Some other studies used the BLAST tool and known human miRNAs to check for novel miRNAs in newly sequenced genomes. Because of the limitation mentioned above, they only searched species close to human, such as mouse [80], macaque [81], and chimpanzee [82].

Chapter 3 miRNA and Target Gene Prediction Using Tissue-Selective Motif

3.1 Introduction

Although many human miRNA genes have been found, it is likely that a large number of human miRNA genes remain to be discovered [8, 33, 83]. Most current methods (e.g., [18, 59, 84-86]) predict novel miRNA genes directly from RNA and/or DNA sequences. In this chapter, we use a different approach: we first identify putative target binding sites (motifs) in 3' UTRs and then use these sites to infer novel miRNAs. A similar method for predicting motifs in 3' UTRs has been applied to mammalian genomic sequences [61]. Briefly, we (a) identify a set of tissue-selective genes, which are genes that are expressed in only one or a few tissues [87-92], (b) exclude highly expressed tissue-selective genes and call the remaining genes low-key tissue-selective genes, (c) find motifs of 7 nucleotides that appear frequently in the 3' UTRs of the genes in the set, (d) consider each of these motifs as a potential miRNA target sequence and its complementary sequence as a potential miRNA seed, (e) exclude those potential seeds that match any known miRNA, and (f) use the remaining potential seeds to identify potential novel miRNA genes from predicted secondary structures in the human genome. Our method is based on the reasoning that the 3' UTRs of the target genes of a miRNA should share the same or a highly similar sequence motif (i.e., target site). To define human tissue-selective genes, we use EST data. Moreover, we use microarray gene expression data to filter out highly expressed tissue-selective genes, because a gene that is regulated by miRNAs is more likely to be lowly or moderately expressed than highly expressed [36, 93-96].
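Steps (c) to (e) of this outline can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names (`count_kmers`, `reverse_complement`, `candidate_seeds`) are hypothetical, sequences are assumed to be given in the RNA alphabet (A, C, G, U), and the statistical test that decides which 7-mers count as frequent is omitted here.

```python
from collections import Counter

# Watson-Crick complement table for the RNA alphabet (assumed input alphabet).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string, read 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def count_kmers(utrs, k=7):
    """Count every overlapping k-mer across a collection of 3' UTR sequences."""
    counts = Counter()
    for utr in utrs:
        for i in range(len(utr) - k + 1):
            counts[utr[i:i + k]] += 1
    return counts


def candidate_seeds(motifs, known_seeds):
    """Map each target motif to its putative seed, dropping seeds already known."""
    result = {}
    for motif in motifs:
        seed = reverse_complement(motif)
        if seed not in known_seeds:
            result[motif] = seed
    return result


# Toy example: one motif pairs with a known seed, the other yields a new candidate.
novel = candidate_seeds(["CAUUUCA", "AAUCUUU"], known_seeds={"UGAAAUG"})
```

In this toy call, the motif CAUUUCA is dropped because its reverse complement UGAAAUG is in the set of known seeds, while AAUCUUU is retained as a potential target site of an unknown miRNA.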

3.2 Methods

We develop a seven-step computational method to predict human tissue-selective binding motifs in 3' UTRs and their potential regulatory miRNAs (see the flowchart in Figure 3). Below, we describe these steps in detail.

Figure 3. Flowchart of the proposed method.

3.2.1 Collection of Human Gene Expression Data

We downloaded the human UniGene (Homo sapiens: Build #198) data of 40 tissues from the BodyMap-Xs database (http://bodymap.jp/) [97]. Then, we mapped the UniGene data to Ensembl gene IDs and downloaded their transcript data from Ensembl (http://www.ensembl.org/index.html). These entries belong to 18,021

human Ensembl genes. Finally, we downloaded the 3,919 tissue-selective genes predicted from microarray expression data by Liang et al. [92].

3.2.2 Identification of Low-Key Tissue-Selective Genes

Because some genes, such as housekeeping genes, are widely expressed in many tissues while others are selectively expressed in a few specific tissues, we propose a simple statistical method to distinguish between them. From the histogram of the number of tissues in which a gene is expressed, we found that the whole distribution can be fitted by a linear combination of an exponential distribution and a normal distribution. That is, the whole distribution can be formulated as

$$h(n) = a \cdot \lambda e^{-\lambda n} + b \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(n-m)^2}{2\sigma^2}},$$

where n is the tissue number, 0 ≤ a, b ≤ 1, and λ, σ, m > 0. The exponential component represents the genes that are expressed in only one or a few tissues, while the normal component represents the genes that are expressed in many tissues. Then, we used an exhaustive search to find an optimal set (a_o, b_o, λ_o, m_o, σ_o) so that

$$\sum_{n=1}^{40} \left| h(n) - h_o(n) \right| = \sum_{n=1}^{40} \left| h(n) - a_o \cdot \lambda_o e^{-\lambda_o n} - b_o \cdot \frac{1}{\sqrt{2\pi}\,\sigma_o}\, e^{-\frac{(n-m_o)^2}{2\sigma_o^2}} \right|$$

is the minimum. We obtained (a_o, b_o, λ_o, m_o, σ_o) = (0.4, 0.6, 0.35, 21, 10). Thus, if a gene is found in ≤ 6 tissues, the p-value for the gene being widely expressed in many tissues is < 0.05. Because not all transcripts of genes in tissues are found in ESTs, we set a stringent condition to define a tissue-selective gene: if a gene is found in ≤ 5 tissues, it is called a tissue-selective gene. Among the 18,021 selected genes there are 6,496 tissue-selective genes, each of which showed EST expression in ≤ 5 tissues. To remove highly expressed genes we use Liang et al.’s data [92]. These authors identified 3,919 tissue-selective genes

from microarray expression data, each of which was highly expressed in some tissues. We use this dataset to remove highly expressed genes from the set of tissue-selective genes. We call the remaining genes low-key tissue-selective genes.

3.2.3 Identification of Tissue-Selective Motifs in 3’UTR

For each tissue-selective gene, we downloaded the 3' UTR sequence of its longest transcript from Ensembl. For each tissue we selected two mutually exclusive gene sets: a set of low-key tissue-selective genes (F) and a set of background genes (B). For a 7-mer motif m_k, let f_k and b_k be its frequencies in F and B, respectively. We used the one-sided two-sample proportion test [98, 99] to examine whether f_k is significantly higher than b_k. If so, m_k is over-represented in F. Thus, we test H_0: f_k = b_k against H_1: f_k > b_k. Let n_1 and n_2 be the numbers of occurrences of motif m_k in F and B, respectively, and let N_F and N_B be the total numbers of 7-mer motifs in F and B, respectively. Let

$$p = \frac{n_1 + n_2}{N_F + N_B}, \qquad \sigma = \sqrt{p\,(1-p)\left(\frac{1}{N_F} + \frac{1}{N_B}\right)}.$$

The null hypothesis is rejected if the following condition is satisfied:

$$Z = \frac{f_k - b_k}{\sigma} > z_{1-0.01},$$

where z_{1-0.01} (≈ 2.326) is the critical value of the standard normal distribution at the 0.01 significance level, using the asymptotic normal approximation [100]. In this case, motif m_k is significantly over-represented in the tissue with p-value ≤ 0.01.

3.2.4 Motif Filtering

We filter out simple repetitive motifs such as partial poly(A) sites. One reason for the existence of such simple motifs is that alternative polyadenylation can occur in a tissue-selective manner [101]. For each motif, we calculate its information entropy E by the following

equation:

$$E = -\sum_{i=1}^{4} q_i \cdot \log_2 q_i,$$

where q_i is the frequency of nucleotide i (A, T, G, or C) in the motif. Because in ~93% of the known human miRNAs the nucleotide frequency entropy of the seed region is ≥ 0.8, we use the threshold of 0.8 to filter out motifs of low compositional complexity. The remaining motifs are called the frequent tissue-selective motifs or simply the frequent motifs.

3.2.5 Matching Frequent Motifs to the Seed Region of Known Mature miRNAs

We use each frequent motif as a seed-match region to search miRBase [102] and remove those motifs that perfectly match any known human miRNA. We downloaded the 678 known human mature miRNA sequences from miRBase (Release 11.0) [102] and then examined whether a predicted frequent motif perfectly matches the seed region of any mature miRNA. In an expanded analysis, we allow one G:U pairing in the 7 nucleotides. In a further expanded analysis, we also allow 1-nt left- or right-shift seed matches, because a seed region can start at the first, second, or third position from the 5' end of a miRNA [103]. For convenience, a frequent motif that matches the seed region of a known miRNA(s) is called a target motif in this chapter.

3.2.6 Secondary Structure of Potential Novel miRNAs

We use the remaining motifs to search the set of secondary structures predicted in the human genome by Pedersen et al. [104] in order to select motifs that are good candidates for novel miRNAs. Among the 48,476 well-conserved secondary structures in the human genome, Pedersen et al. found 195 known miRNA genes and proposed 187 miRNA gene candidates, of which only 24 candidates have been experimentally

confirmed. We downloaded the 169 human sequences that were predicted to be miRNA genes from the viewpoint of secondary structure [104] but have not been experimentally validated. For each of these miRNA candidates, we predicted its secondary structure with the mFold software [31] (http://www.bioinfo.rpi.edu/applications/mfold/). Then, we checked whether the unmatched motifs are located in the stem regions of the secondary structures of some candidates under three criteria: (a) the match between the motif and its complement should be exact or include only one G:U pair; (b) the length of the predicted mature miRNA is at least 17 nucleotides; and (c) the predicted mature miRNA should not extend to the opposite arm of the stem. Finally, a candidate that passes the three criteria is taken as a putative miRNA.

3.3 Results

3.3.1 Low-Key Tissue-Selective Genes in Tissues

Of the 18,021 selected human Ensembl genes, 16,790, 14,910, 14,766, and 14,042 genes are found in the EST data of cerebrum, lung, testis, and eye, respectively, while only 948, 944, and 135 genes are found in salivary, esophagus, and brain stem, respectively (Figure 4).
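As an aside on the motif selection described in Sections 3.2.3 and 3.2.4, the over-representation test and the entropy filter can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used in this work: the standard error follows the textbook pooled two-proportion z-test with sample sizes N_F and N_B, motifs are assumed to be kept when their entropy is at least 0.8, and the function names are hypothetical.

```python
import math
from collections import Counter

Z_CRITICAL = 2.326  # one-sided critical value at the 0.01 level, as quoted in the text


def shannon_entropy(motif):
    """Nucleotide-composition entropy (in bits) of a motif."""
    n = len(motif)
    return -sum((c / n) * math.log2(c / n) for c in Counter(motif).values())


def overrepresentation_z(n1, N_F, n2, N_B):
    """Pooled two-proportion z statistic for f_k = n1/N_F versus b_k = n2/N_B."""
    f_k = n1 / N_F
    b_k = n2 / N_B
    p = (n1 + n2) / (N_F + N_B)
    se = math.sqrt(p * (1 - p) * (1 / N_F + 1 / N_B))
    if se == 0:
        return 0.0  # motif absent from (or saturating) both sets: no evidence either way
    return (f_k - b_k) / se


def is_frequent_motif(motif, n1, N_F, n2, N_B, min_entropy=0.8):
    """True if the motif is over-represented in F and not of low complexity."""
    significant = overrepresentation_z(n1, N_F, n2, N_B) > Z_CRITICAL
    # Assumption: motifs with entropy >= min_entropy are retained.
    return significant and shannon_entropy(motif) >= min_entropy
```

With made-up counts, `is_frequent_motif("CAUUUCA", 30, 1000, 10, 1000)` passes both checks, whereas the near-homopolymer `"AAAAAAU"` with the same counts is rejected by the entropy filter even though its z statistic is identical.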

[Figure 4: bar chart with vertical axis ticks from 0 to 18,000 and one bar per tissue, in the order: cerebrum, lung, testis, eye, skin, placenta, kidney, colon, prostate, breast, uterus, lymphnode, heart, pancreas, liver/hepato, bone, thymus, cerebellum, bone marrow, spleen, peripheral blood, peripheral nerve, stomach, ovary, intestine, muscle, adipos, artery/aorta, corpus callosum/glia, adrenal gland, spine, retina, vein, pituitary, pineal gland, bladder, thyroid/parathyroid, salivary, esophagus, brain stem.]

Figure 4. Distribution of 18,021 human Ensembl genes in 40 tissues from the BodyMap-Xs database.

Among the 18,021 selected genes there are 6,496 tissue-selective genes, each of which showed EST expression in ≤ 5 tissues. To remove highly expressed genes we use Liang et al.’s data [92]. These authors identified 3,919 tissue-selective genes from microarray expression data, each of which was highly expressed in some tissues. However, only 1,393 of them map to Ensembl genes and the tissues in BodyMap-Xs, and only 239 of these 1,393 genes overlap with our tissue-selective gene set. We exclude these 239 genes from our gene set and call the remaining 6,257 genes low-key tissue-selective genes. Among these genes, 3417, 1129, 720, 521, and 470 are expressed in one, two, three, four, and five tissues, respectively (Figure 5).
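The selection rule behind these counts can be written as a short Python sketch. It is only an illustration: the gene and tissue names below are made up, and `low_key_tissue_selective` and `breadth_histogram` are hypothetical helper names rather than part of the original pipeline.

```python
from collections import Counter


def low_key_tissue_selective(gene_tissues, highly_expressed, max_tissues=5):
    """Genes seen in 1..max_tissues tissues, minus the highly expressed ones."""
    selective = {
        gene
        for gene, tissues in gene_tissues.items()
        if 1 <= len(tissues) <= max_tissues
    }
    return selective - set(highly_expressed)


def breadth_histogram(gene_tissues, genes):
    """Number of genes expressed in exactly 1, 2, ... tissues."""
    return Counter(len(gene_tissues[g]) for g in genes)


# Toy data (hypothetical): gene -> set of tissues with EST evidence.
gene_tissues = {
    "geneA": {"lung"},
    "geneB": {"lung", "kidney"},
    "geneC": {f"tissue{i}" for i in range(12)},  # broadly expressed
    "geneD": {"testis"},                         # selective but highly expressed
}
highly_expressed = {"geneD"}

low_key = low_key_tissue_selective(gene_tissues, highly_expressed)
histogram = breadth_histogram(gene_tissues, low_key)
```

The reported figures are internally consistent with this rule: 6,496 - 239 = 6,257, and 3417 + 1129 + 720 + 521 + 470 = 6,257.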

Figure 5. Distribution of low-key tissue-selective genes in 40 tissues.

Figure 6. Distribution of frequent tissue-selective motifs in 40 tissues.

3.3.2 Frequent Tissue-Selective Motifs

We identify 2,819 frequent tissue-selective motifs. The number of motifs in a tissue is generally correlated with both the number of expressed genes and the

number of tissue-selective genes in that tissue (both correlation coefficients ≈ 0.6). There are 1703, 575, 284, 136, and 68 motifs identified as one-, two-, three-, four-, and five-tissue selective motifs, respectively. We find 53 frequent motifs (~2%) that appear in > 5 tissues, even though each of the selected genes is expressed in ≤ 5 of the 40 tissues under study (Figure 6). Six tissues (lymph node, placenta, kidney, lung, pancreas, and cerebrum) each contain > 400 tissue-selective motifs (833, 557, 489, 465, 440, and 406, respectively). Interestingly, for these six tissues the number of tissue-selective motifs is negatively correlated with the number of low-key tissue-selective genes (382, 684, 1413, 609, 350, and 2030, respectively; correlation coefficient ≈ -0.5). In vein, corpus callosum/glia, adrenal gland, adipose, thyroid/parathyroid, and stomach, the number of predicted motifs is < 10, and in brain stem, esophagus, bladder, and salivary, no tissue-selective motif is found (Figure 6). The ratio of the predicted tissue-selective motifs to the EST genes in each individual tissue is < 0.08; that is, very few tissue-selective motifs can be found in the genes expressed in a tissue. However, the ratios of the number of predicted motifs to the number of low-key tissue-selective genes in lymph node, artery/aorta, and pancreas are 2.18, 1.35, and 1.26, respectively, whereas those for the other 37 tissues are all < 1.0. That is, the average number of tissue-selective motifs per low-key tissue-selective gene is ~2 in lymph node and > 1 in artery/aorta and pancreas.

3.3.3 Predicted Targets of Known miRNAs

We compare the identified frequent motifs with all known human miRNAs in miRBase (Release 11.0). First, we find a total of 98 tissue-selective motifs that perfectly match the 133 mature sequences of known human miRNAs. (The latter number is larger than the former because the seed regions of different miRNAs can

be the same.) Second, we allow 1-nt left- or right-shift seed matches (see an example in Figure 7). The total number of matched miRNAs increases to 267. Finally, when one G:U pairing is also allowed in the seed match, the number of matching frequent motifs increases to 814 (≈29% of all identified motifs).

[Figure 7: three alignment panels. In each panel miR-203 is drawn 3' to 5' as GAUCACCAGGAUUUGUAAAGUG above the target motif. (A) Left-side shift: motif ACAUUUC in target gene Gi. (B) Zero shift: motif CAUUUCA in target gene Gj. (C) Right-side shift: motif AUUUCAC in target gene Gk.]

Figure 7. Three possible seed match cases of miR-203 with different motifs. (A) Left-side shift seed match. (B) Zero shift seed match. (C) Right-side shift seed match.

In total, there are 483 mature sequences of known human miRNAs that match our predicted frequent tissue-selective motifs (Figure 8), while only 194 known human miRNAs do not match any of our predicted frequent motifs. Thus, our method can indeed recover more than two thirds of the known human miRNAs.
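The three matching modes (perfect, shifted, and with one G:U pair) can be made concrete with a small Python sketch. It is an illustration rather than the code used in this work: the function name `seed_match_starts` is hypothetical, both sequences are assumed to be written 5' to 3', and the miR-203 sequence is the one drawn in Figure 7, reversed into 5' to 3' orientation.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

# miR-203 as drawn in Figure 7, rewritten 5' to 3'.
MIR203 = "GUGAAAUGUUUAGGACCACUAG"


def seed_match_starts(mirna, motif, starts=(2,), max_gu=0):
    """Return the 1-based seed start positions at which `motif` pairs with `mirna`.

    The seed window has the same length as the motif and pairs antiparallel
    with it; at most `max_gu` G:U wobble pairs are tolerated.
    """
    hits = []
    k = len(motif)
    for start in starts:
        seed = mirna[start - 1:start - 1 + k]
        if len(seed) < k:
            continue  # window runs past the 3' end of the miRNA
        gu_pairs = 0
        paired = True
        for mi_base, target_base in zip(seed, reversed(motif)):
            pair = (mi_base, target_base)
            if pair in WATSON_CRICK:
                continue
            if pair in WOBBLE:
                gu_pairs += 1
            else:
                paired = False
                break
        if paired and gu_pairs <= max_gu:
            hits.append(start)
    return hits
```

Called with `starts=(1, 2, 3)`, the three motifs of Figure 7 (AUUUCAC, CAUUUCA, and ACAUUUC) match miR-203 at seed start positions 1, 2, and 3, respectively. The motif CAUUUUA matches at position 2 only when `max_gu=1`, because the seed G then pairs with U instead of C.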

Figure 8. Distribution of the known miRNAs whose seed regions match our predicted tissue-selective motifs in 40 tissues.

When G:U pairing and left- or right-shift imperfect seed matches are allowed, the mappings between miRNAs and their target motifs may not be one-to-one but can be one-to-many, many-to-one, or many-to-many (see the examples in Figure 9). For example, the seed region of miR-203 is UGAAAUG. If we do not consider G:U pairing and left- or right-shift matches, the target motif for perfect matching is CAUUUCA (Figure 9A), which is found only in lymph node and lung tissues. However, when the 1-nucleotide shift and one G:U pairing are allowed, we find seven other motifs as its putative target motifs in 12 tissues including lymph node and lung (Figure 9A). As another example, miR-206a and miR-206b are two members of the miR-206 family that differ by only one base in the seed region, but their tissue-selective motifs are completely different and their target genes are expressed in different tissues (Figure 9B). Possibly, even a one-nucleotide change in the seed region of a miRNA can lead to completely different target genes that are expressed in different tissues.

Figure 9. Two examples of the frequent tissue-selective motifs that match known human miRNAs. (A) The seed region of miR-203 is UGAAAUG. If we do not allow G:U pairing or left- or right-shift matches, the target motif for perfect matching is ACUUUAC (written 3' to 5'), which is found only in lymph node and lung tissues. However, when the one-nucleotide shift and one G:U pairing are allowed, we find seven other motifs as its putative target motifs in 12 tissues including lymph node and lung. (B) miR-206a and miR-206b are two members of the miR-206 family that differ by only one base in the seed region, but their tissue-selective motifs are completely different and their target genes are expressed in different tissues. In each example, the seed matches include (1) perfect matches and (2) imperfect matches allowing one G:U pairing or a left- or right-shift.

3.3.4 Predicted Novel miRNAs

Over 70% of the predicted frequent motifs do not match the seed region of any currently known human miRNA. As these unmatched motifs may be potential target sites of unknown miRNAs, we check each of them to see whether it is located in any of the stem regions in the secondary structures of the miRNA candidates predicted by Pedersen et al. [104].

Figure 10. Three cases of frequent motifs that match predicted secondary structures of mature miRNA candidates. (A) AAUCUUU and UCCUUGU. (B) AACUUUU and CUGGGCA. (C) UAGGAUG.

Because the locations of the seed regions in the predicted secondary structures are unknown, we consider three possible binding situations. First, two different motifs match the two arms of the stem of a miRNA gene candidate (Figure 10A). Second, two motifs are mapped onto the same arm of the stem of a miRNA gene candidate but at two different locations (Figure 10B). The predicted mature sequences are two potential alternative forms of this candidate miRNA. Third, a motif matches a region of a known miRNA gene, but our predicted mature miRNA is different from the known mature sequence in miRBase. For example, in Figure 10C, the motif UAGGAUG matches a region in the mature sequence UUGCAUAUGUAGGAUGUCCCAU of hsa-miR-448, but our predicted mature sequence is ACAUCCUGCAUAGUGCUGCCAG; we call it an alternative mature

form of hsa-miR-448. Tables 2 and 3 show 4 and 11 additional examples, respectively.

When G:U pairing is not allowed, 60 frequent motifs that do not match any known miRNA seed region give rise to 48 mature miRNA candidates (Table 2). However, P-9-3p and -5p are two novel alternative mature forms of hsa-miR-652; that is, these two sequences overlap with that of hsa-miR-652 but have different seed regions. Similarly, P-27-3p and -5p are two novel alternative mature forms of hsa-miR-802 (Table 2). When one G:U pairing is allowed in the seed match, 116 frequent motifs give rise to 93 mature miRNA candidates (Table 3). P-62, P-63-1 and -2, P-64-1 and -2, P-65, P-66, P-67, P-68-1 and -2, and P-69 are novel alternative mature forms of hsa-miR-544, hsa-miR-1264, hsa-miR-1298, hsa-miR-873, hsa-miR-376b, hsa-miR-381, hsa-miR-365, and hsa-miR-448, respectively (Table 3).

(44) Table 2. Predicted novel miRNAs with no G:U pairing in the seed match. *P-9 (3p and 5p), †P-11 (3p and 5p), and ‡P-27 (3p and 5p) are alternative mature forms of hsa-miR-652, cfa-miR-1839, and hsa-miR-802, respectively; however, no human version of cfa-miR-1839 is found in miRBase. The coordinates and the located gene IDs of these predicted mature miRNAs are given in the last two columns.

| Frequent motif | Tissues of motif identified | Predicted novel miRNA | Predicted miRNA mature sequence | Genomic coordinates (NCBI 36.1) | Gene located |
|---|---|---|---|---|---|
| AUGCAAA | artery/aorta, uterus | P-1 | AUUUGCAUAAUGGAUGC | chr2:205539433-205539449 | ENSG00000116117 |
| GACAGAC | intestine, liver/hepato, pituitary, breast | P-2 | AGUCUGUCCCAUACAAUA | chr6:44077683-44077700 | ENSG00000181577 |
| GGACAGA | breast | P-2 | GUCUGUCCCAUACAAUAU | chr6:44077684-44077701 | |
| AGCUGAA | peripheral nerve | P-3 | UUUCAGCUCAUAAAA | chr5:82856771-82856785 | ENSG00000038427 |
| UAAUUGG | ovary | P-4 | UCCAAUUAAGUCUUUUAAAU | chr8:83086984-83087003 | |
| UUCAAAG | liver/hepato | P-5 | CCUUUGAAAAUAUAAAAUC | chr8:35236713-35236732 | ENSG00000156687 |
| UUUCAAA | cerebrum, intestine, liver/hepato, pancreas | P-5 | CUUUGAAAAUAUAAAAUC | chr8:35236714-35236732 | |
| UCAAUUU | kidney | P-6-3p | UAAAUUGAGGUGGAUCCUGU | chr15:57977629-57977648 | |
| AAAUUGA | lymphnode | P-6-5p | CUCAAUUUAUUCCUAGAAACA | chr15:57977582-57977602 | |
| UAAAUUG | placenta | P-6-5p | UCAAUUUAUUCCUAGAAACAG | chr15:57977583-57977603 | |
| AAUUGAG | kidney | P-6-5p | UCUCAAUUUAUUCCUAG | chr15:57977581-57977597 | |
| GCAAUUU | peripheral nerve | P-7-3p | AAAUUGCCAUAAAGUG | chr9:80472606-80472621 | |
| AUGUUCA | artery/aorta | P-8 | AUGAACAUCUGAUUAUU | chr11:132066843-132066859 | ENSG00000183715 |
| CCAUUCA | retina | P-9-3p | UUGAAUGGCGCCACUAGGGUU | chrX:109185270-109185290 | ENSG00000157600 (hsa-miR-652) |
| UUGUGCA | lymphnode | P-9-5p | CUGCACAACCCUAGGAGAGGG | chrX:109185228-109185248 | |
| ACAAAUC | cerebrum | P-10 | UGAUUUGUUCAAGAUGAUGA | chr7:27256727-27256746 | |
| GGUCUUG | lung | P-11-3p | UCAAGACCUACUUAUCUACC | chr15:81221851-81221870 | |
| AUCUACC | testis | P-11-5p | GGUAGAUAGAACAGGUCUUG | chr15:81221817-81221836 | |
| AUGCAUU | lymphnode | P-12-3p | AAAUGCAUGAAAUAGAU | chr1:158032773-158032789 | |
| AUAAUGA | cerebrum | P-13-5p-1 | UCAUUAUAAAAUGUGAUAAUGU | chr15:51019585-51019606 | |
| AAUGUCA | placenta | P-13-3p | GUGACAUUAUGACAUUACAUU | chr15:51019643-51019663 | |
| CUAAUUA | retina | P-14 | UUAAUUAGCAAAAAGGCU | chr1:208493640-208493657 | |
| GCUAAUU | thymus, testis, liver/hepato | P-14 | UAAUUAGCAAAAAGGCU | chr1:208493641-208493657 | |
| UGUUAAU | placenta | P-15-3p | AAUUAACAGAAUAUUAU | chr5:146055768-146055784 | ENSG00000156475 |
| AUUAACA | cerebrum, lymphnode, placenta, lung, pancreas | P-15-5p | UUGUUAAUCAAAAAACUAU | chr5:146055739-146055757 | |
| UGAAUAU | cerebrum, pineal gland, peripheral blood | P-16-5p | UAUAUUCACAUUUAUUGGAU | chr7:146609661-146609680 | ENSG00000174469 |
| UCAAAAC | placenta, lung | P-17 | UGUUUUGAUAACAGUAAUGU | chr8:75780484-75780503 | |
| UAAUUUC | kidney | P-18-1 | UGAAAUUAUAUUACCAACA | chr10:128235007-128235025 | |
| AAUUUCA | lymphnode, placenta, testis, kidney, pancreas | P-18-1 | UUGAAAUUAUAUUACCAACA | chr10:128235006-128235025 | |
| UUGUAGU | lymphnode | P-19-1 | GACUACAACUCCCAAGGUA | chr1:153246074-153246092 | ENSG00000160685 |
| GGAGUUG | vein | P-19-2 | ACAACUCCCAAGGUACAUACA | chr1:153246078-153246098 | |
| AUUAUCU | lymphnode | P-20-1 | UAGAUAAUUUGCACAUUAU | chr14:72223490-72223508 | ENSG00000205683 |
| AUUACAG | cerebrum, lymphnode, uterus, muscle, lung, kidney, pancreas | P-21 | CUGUAAUAUAAAUUUAAUUUAUU | chr4:126647864-126647886 | |
| UUUUAAC | cerebrum, lymphnode | P-22 | UGUUAAAAAAAGAAAAACAA | chr3:85884941-85884960 | ENSG00000175161 |
| UUUAACA | placenta, prostate, pancreas | P-22 | UUGUUAAAAAAAGAAAAACAA | chr3:85884940-85884960 | |
| CAAUAUA | cerebrum, prostate, lung | P-23 | GUAUAUUGUGACAUACAUGU | chr1:1463389-1463408 | |
| AAUUAAC | placenta | P-24-3p | GUGUUAAUUAAACCUCUAUUUAC | chr8:113724958-113724980 | ENSG00000164796 |
| AUUUACA | placenta | P-24-5p | AUGUAAAUACAGAUUUAAUUAAC | chr8:113724904-113724926 | |
| UAUUUAC | cerebrum, placenta, kidney | P-24-5p | UGUAAAUACAGAUUUAAUUAACA | chr8:113724905-113724927 | |
| UUUACAU | lymphnode, placenta, kidney, pancreas | P-24-5p | CAUGUAAAUACAGAUUUAAUUAA | chr8:113724902-113724924 | |
| AAAUUAC | lymphnode, placenta | P-25 | AGUAAUUUUCGAUAAAGCCCUU | chr10:129082297-129082318 | ENSG00000150760 |
| GAAAAUU | cerebrum, placenta, prostate, testis, lung, pancreas | P-25 | UAAUUUUCGAUAAAGCCCUU | chr10:129082299-129082318 | |
| UCAUGUU | placenta | P-26 | CCAACAUGAUGCUAAUAAAU | chr17:73293262-73293281 | |
| AAUCUUU | lymphnode, placenta, prostate, lung, pancreas | P-27-5p | CAAAGAUUCAUCCUUGUGU | chr21:36014906-36014929 | ENSG00000159216 (hsa-miR-802) |
| UCCUUGU | ovary, breast | P-27-3p | AACAAGGAGAAUCUUUGUCACU | chr21:36014934-36014955 | |
| CUAAUAA | lymphnode, placenta | P-28 | UUUAUUAGUGCCAUAUAAUA | chr2:219696897-219696916 | ENSG00000187736 |
| AUCAAUA | cerebrum, testis | P-29-3p | UUAUUGAUCAGCGUAGCAAACA | chr5:3508678-3508699 | |
| UCAAUAA | prostate, testis, adrenal gland | P-29-3p | CUUAUUGAUCAGCGUAGCAA | chr5:3508677-3508697 | |
| UUGAUCA | lymphnode | P-29-5p-1 | CUGAUCAAUAAUAAGAUUGAU | chr5:3508628-3508648 | |
| UUAUUGA | lymphnode, kidney | P-29-5p-2 | AUCAAUAAUAAGAUUGAUAC | chr5:3508631-3508650 | |
| UAAACUG | lymphnode | P-30 | CCAGUUUAUUUUGUAAAUAUA | chr1:60826128-60826148 | |
| AUAAACU | pancreas | P-30 | CAGUUUAUUUUGUAAAUAUA | chr1:60826129-60826148 | |
| AAAUAUG | cerebrum, lymphnode, pancreas | P-31 | UCAUAUUUUCUAUCUCUUUGCUU | chr4:39795271-39795293 | ENSG00000078177 |
| UUCCUUA | placenta | P-32-1 | UUAAGGAAAUUAUGCUGAAC | chr4:21498378-21498397 | ENSG00000185774 |
| AAUUUUC | lymphnode | P-33 | AGAAAAUUAGGUUGAUA | chr6:37530191-37530212 | ENSG00000137200 |
| UUCAAAA | cerebrum | P-34 | CUUUUUGAGUUUUGAGGAAG | chr9:99359944-99359963 | ENSG00000136842 |
| UUACAGC | placenta | P-35 | AGCUGUAAACAGCUCUCCA | chr17:69079331-69079349 | |
| GAAUAAU | cerebrum, eye, lymphnode, ovary, lung, kidney | P-36 | AUUAUUCUUUUUAUAAAA | chr2:144691036-144691053 | ENSG00000121964 |
| UAUAAAC | liver/hepato, pancreas | P-37 | UGUUUAUAGUAAUGGGAGAUA | chr9:127702054-127702074 | ENSG00000167081 |
| CAAAUAU | peripheral nerve, placenta, kidney | P-38 | AAUAUUUGGAAACAUCCA | chr9:13932445-13932467 | |

(49) Table 3. Predicted novel miRNAs with 1 G:U pairing in seed match.

| Frequent motif | Tissues of motif identification | Predicted novel miRNA | Predicted miRNA mature sequence | Genomic coordinates (NCBI 36.1) | Gene located |
|---|---|---|---|---|---|
| UCCAUCU | retina | P-6-3p | GAGGUGGAUCCUGUUCCAAUU | chr15:57977635-57977655 | |
| AAUUUGU | lymphnode | P-7-3p | UGCAAAUUGCCAUAAAGUG | chr9:80472603-80472621 | |
| UAAAGUG | prostate, testis | P-7-5p | CAUUUUAUGGCAAUUUGUU | chr9:80472560-80472583 | |
| UGAAAUA | placenta, testis, liver/hepato, kidney, pancreas | P-12-5p | CUAUUUUAUGCAUUCUA | chr1:158032753-158032774 | |
| UUACAUU | lymphnode, uterus, placenta, prostate, kidney, pancreas | P-13-5p-2 | AAAUGUGAUAAUGUCAUUGC | chr15:51019593-51019612 | |
| UUAACAG | placenta, pancreas | P-15-5p | CUUGUUAAUCAAAAAACUAU | chr5:146055738-146055757 | ENSG00000156475 |
| UGGAUAU | cerebrum | P-16-3p | UAUAUUCAUGAAUAUAU | chr7:146609695-146609711 | ENSG00000174469 |
| GAUAUAU | lymphnode, prostate, lung, pancreas | P-16-3p | AAUAUAUUCAUGAAUAUAU | chr7:146609693-146609711 | |
| UGAUAUA | kidney | P-18-2 | UUAUAUUACCAACAGAAAU | chr10:128235012-128235030 | |
| AGAUUAU | lymphnode | P-20-2 | GAUAAUUUGCACAUUAU | chr14:72223492-72223508 | ENSG00000205683 |
| AACAUUU | lymphnode, ovary, kidney, pancreas | P-24-3p | UAAGUGUUAAUUAAACCUCUA | chr8:113724955-113724975 | ENSG00000164796 |
| UAACAUU | cerebrum, skin, placenta | P-24-3p | AAGUGUUAAUUAAACCUCUAUU | chr8:113724956-113724977 | |
| UGAUUUC | bone, lung | P-32-2 | GGAAAUUAUGCUGAACUCAUUU | chr4:21498382-21498403 | ENSG00000185774 |
| UAAUGUU | artery/aorta | P-39 | UGACAUUAGUUCAUUU | chr2:56002041-56002056 | ENSG00000115380 |
| UUUUAAG | lymphnode, skin, placenta, lung, pancreas | P-40 | GCUUAGAAAAGUGACCUAGA | chr2:77350845-77350864 | ENSG00000176204 |
| GCUGUUU | lung | P-41 | AAAGCAGCGUGAAGAUGC | chr2:105075146-105075163 | ENSG00000135972 |
| UGCAUAU | peripheral nerve, placenta, pancreas | P-42 | UAUAUGUAGAUGUAGCUAUAU | chr2:192552553-192552573 | ENSG00000144339 |
| AGAUUAU | lymphnode | P-43 | AUAAUUUAUUAGAACAAUUAG | chr3:60801997-60802017 | ENSG00000189283 |
| AGUAAUU | lymphnode | P-44 | UAAUUAUUUUUCUCCAUC | chr3:62045270-62045287 | ENSG00000144724 |
| GUAAUUA | lymphnode, kidney, pancreas | P-44 | UUAAUUAUUUUUCUCCAUC | chr3:62045269-62045287 | |
| UUCAAUU | lymphnode | P-45 | UAAUUGGAAAUUUCAUUU | chr4:146917406-146917423 | ENSG00000151612 |
| GCACAUU | lung | P-46 | UAAUGUGUAAUGCUGUAGUUU | chr4:153150954-153150974 | |
| GUCUAUA | lymphnode, lung | P-47 | AUGUAGACAAAACAUCCAGAUAA | chr5:15476433-15476455 | |
| UGUCUAU | lung | P-47 | UGUAGACAAAACAUCCAGAU | chr5:15476434-15476453 | |
| GAAUAAU | cerebrum, eye, lymphnode, ovary, lung, kidney | P-48 | AUUAUUUUAGUAAUUCAACAG | chr5:36750623-36750643 | |
| AAUAACU | lymphnode, placenta, lung, pancreas | P-49 | AAGUUGUUUCUGCAUAAA | chr5:104185237-104185254 | |
| AGAGUUA | cerebrum, skin, placenta | P-50 | UAACUUUUAAUGUAAGCCUGG | chr6:50426012-50426032 | |
| UAAGAGU | placenta, pancreas | P-50 | AACUUUUAAUGUAAGCC | chr6:50426013-50426029 | |
| GCAAAAU | cerebrum, retina, placenta, prostate | P-51 | UGUUUUGCCAGCAUGUGGUUG | chr9:72221811-72221831 | |
| UAAGUGA | cerebrum | P-52 | UUCAUUUAAAAUUAGGC | chr10:13522803-13522819 | ENSG00000165626 |
| UUAAGUG | lymphnode, placenta, lung | P-52 | UCAUUUAAAAUUAGGC | chr10:13522804-13522819 | |
| UUUAAGU | lymphnode, placenta, kidney | P-52 | UCAUUUAAAAUUAGGC | chr10:13522803-13522819 | |
| AAGUGAA | cerebrum | P-52 | AUUCAUUUAAAAUUAGGC | chr10:13522801-13522819 | |
| UUAAGAU | placenta, lung, pancreas | P-53 | CAUCUUGAAAUAAGUCCUCA | chr11:122110545-122110564 | ENSG00000154127 |
| UUUAAGA | lymphnode, uterus, prostate, testis, lung, kidney, pancreas | P-53 | AUCUUGAAAUAAGUCCUCAU | chr11:122110546-122110565 | |
| UUUUAAG | lymphnode, skin, placenta, lung, pancreas | P-53 | UCUUGAAAUAAGUCCUCAUC | chr11:122110547-122110566 | |
| UUGAUUA | lymphnode, lung, pancreas | P-54 | AUAAUCAGAAACACUAAUCA | chr13:66389595-66389614 | ENSG00000184226 |
| UAAAUGU | lymphnode, placenta | P-55 | UAUAUUUAACAUACACUUG | chr13:104638844-104638862 | |
| AUUAGUU | lymphnode, kidney | P-56 | GAAUUAAUGGUAUUAA | chr14:56071430-56071448 | |
| UAGUUUA | placenta | P-57 | UAAAUUAAAAUCAAUAUUUU | chr16:7601200-7601219 | ENSG00000078328 |
| GAAUUUC | skin, placenta | P-58 | GGGAAUUCCCACUCUGCAG | chr17:9233266-9233289 | ENSG00000170310 |
| GGAAUUU | kidney | P-58 | GGAAUUCCCACUCUGCA | chr17:9233267-9233288 | |
| AUUCUGA | kidney | P-59 | UUUAGAAUUCUAAUUA | chr18:26697252-26697267 | |
| GCAAAUU | lymphnode | P-60 | UGAUUUGCAUUUUAGU | chr18:40000232-40000247 | |
| AUGAAGU | uterus, lung | P-61 | UAUUUCAUUUUAAUCUUGA | chr19:35550032-35550050 | |
| UAGCAAG | prostate | P-62 | CUUGUUAAAAAGCAGAUUCU | chr14:100584767-100584786 | hsa-miR-544 |
| UUUUAGC | lung | P-62 | UGUUAAAAAGCAGAUUCUGA | chr14:100584769-100584788 | |
| UUGUUGA | kidney | P-63-1 | CUCAAUAAGUAUUUGUUGA | chrX:113793391-113793409 | ENSG00000147246 |