• 沒有找到結果。

The system flow of profiling small RNA expression

4. Materials and Methods

4.1 The system flow of profiling small RNA expression

For monitoring the expression of small RNA, there are four steps. First step is trimming adaptor sequence from sequencing reads. Second step is aligning trimmed reads with the reference genomes or small RNA database. Third step is calculating the expression level of various kinds small RNA. The fourth is normalizing the expression of small RNA and finding differential expressed small RNAs.

Trimming adaptor sequence from sequencing reads

In the first step, all identical sequencing reads are merged into a unique read and counts each unique read. Then, each unique read is checked to determine whether it contains a full or a partial adaptor sequence at 3’ end. The length of reads is 36~40 nt (illumia and ABI SOLiD platforms) and the length of small RNAs is 18~25 nt. Therefore, the reads have full or partial adaptor sequence at 3’

end (The length of 3’ adaptor is 18~22 nt). In checking the full adaptor sequence, if the adaptor sequence is in the middle or at the beginning of the read, then the read is removed. If the adaptor sequence is at the end of the read, then the adaptor sequence is trimmed from the read. In checking the partial adaptor sequence, the last bases of the 5’ adaptor are used as a probe to match the first bases of the reads. The first bases of the 3’ adaptor are used as probes to match the last bases of the reads. If the sequence identities of the matched regions are greater than 70%, these regions are eliminated (Figure 4).

41

Figure 4. The flow of trimming 3’ adaptor sequence from the sequencing reads

42

Aligning trimmed reads with the reference genomes or small RNA database In the second step, each read is aligned with the reference genomes or small RNA database (no available reference genomes). If the reference genomes are available, the reads which are perfectly mapped to the genomic sequence are retained. The reads which have multiple chromosome location are removed if the number of location is more than the defined threshold. The threshold is set according to the maximum chromosome locations of mature miRNAs of each species (data are extracted from miRBase 16.0). For example, the maximum chromosome location of Human is 11. The threshold is set as 11. In Arabidopsis, it is set as 7. For the sequencing data which do not have available reference genome, the sequencing reads are aligned with small RNA database such as Rfam and miRBase. The perfectly mapped reads are retained for further analysis.

Calculating the expression level of various kinds small RNA

In the third step, the expression level of small RNA is calculated by combing the information of chromosome location of the reads and known small RNAs. Before monitoring the expression level of small RNA, the reads which are mapped to rRNAs, tRNAs, snoRNAs and snRNAs are removed first. The information of these non-coding RNAs are from Rfam and NCBI. In animals, the expression of the miRNA is computed by counting the reads which locate in the region of known miRNA. The region of miRNA is defined as the flanking three-nucleotide region at 5’ and 3’ end of the miRNA. For example, the mature miRNA of hsa-let-7a-1 is located at chr9: 96938244~96938265 [+]. The region of miRNA is 96938241~96938268. The reads which are located in this region are considered as the expressed sequence of hsa-let-7a-1 (Figure 5). The three-nucleotide is defined by pre-analyzing the global NGS data in different species. Table 3 lists the

43

number of miRNAs and reads and the GEO accession IDs for analyzing the shifting bases of miRNAs in various species. After pre-analyzing NGS data, three-nucleotide shifting for miRNAs can cover more 90% reads in different organisms (Figure 6).

Figure 5. How to monitor the expression level of the miRNA in sequencing data Table 3. The number of miRNA and reads and GEO accession IDs for analyzing the shifting bases of miRNA in various species

Species # of miRNA # of reads Experiments accession ID (GEO)

Human 774 9433 GSE13370,GSE16579, [17]

Mouse 625 12243 GSE9306,GSE16579,GSE12633,GSE12074,GSE12075,GSE125 21,GSE7414

Dog 219 1263 GSE10825

Chicken 489 2666 GSE10686,GSE15513

Drosphilia 161 3043 GSE15378,GSE11086,GSE10794,GSE12840,GSE11019,GSE93 89,GSE13679,GSE10515,GSE11624,GSE14849,GSE15898,GS E7448,GSE15899,GSE14488,GSE15137,GSE10790,GSE15897 C.elegan 179 3236 GSE11738,GSE15169,GSE17787,GSE17153,GSE11736,GSE59

90

Arabidopsis 215 2388 GSE13605,GSE10036,GSE12037,GSE10180,GSE14696,GSE11 094,GSE14695,GSE5343,GSE16971,GSE6478,GSE6682,GSE1 4694,GSE11007,GSE5228,GSE3008

Rice 416 1825 GSE11974,GSE16350,GSE11014,GSE10523,GSE7107,GSE162 48,GSE13152,GSE12317

44

Figure 6. The distribution of shifting bases of miRNAs in varius species

In plants, various kind of small RNAs such as tasiRNA, nat-siRNA and rasiRNA are needed to be profiled excluding the miRNA. For profiling miRNA expression, the processes are the same with animals. For tasiRNAs, the genomic locations of tasiRNAs are obtained from ASRP (Arabidopsis Small RNA Project) (http://asrp.cgrb.oregonstate.edu/). The definition of region of tasiRNA is the same with miRNAs (flanking three-nucleotides). The reads which locate in the region are summed to present the expression level of tasiRNA. Nat-siRNAs contain cis- nat-siRNAs and trans- nat-siRNAs. The cis-natural antisense transcripts are identified by analyzing the genomic loci of genes from TAIR 9.0 (http://www.arabidopsis.org/ ), thus the length of overlap regions are greater than 30nts were extracted. The trans-natural antisense transcripts are also identified by alignment analysis of all gene transcripts, thus the length of overlap regions, which are with perfect complementarity and greater than 100 nts are

45

extracted. The expression level of nat-siRNAs is calculated by counting the reads which locate in cis- or trans- natural antisense region. In profiling the expression of rasiRNA, the transposons and repetitive elements are obtained from Repbase (http://www.girinst.org/). Inverted and tandem repeats are extracted by using

“Inverted Repeats Finder” [166] and “Tandem Repeats Finder” [167] to scan the whole genome, respectively. The reads locate in the dsRNA regions which are formed by transposons, repetitive elements, inverted and tandem repeats are counted as the expression level of rasiRNAs.

Normalizing the expression of small RNA and finding differential expressed small RNAs

In the fourth step, in order to find differentially expressed small RNAs in different experiments, normalization is needed to be done first for each experiment. The quantities of reads which are multiplied one million and divided by adjusted read counts are called as rpm (reads per million).

"Adjusted read counts" indicates the total amount of mapped reads subtracted the amount of reads, which were annotated as rRNAs, tRNAs, snoRNAs and snRNAs. After normalization, differentially expressed small RNAs are found if the fold change is more than 1.5 (up-regulated) or less than 0.67 (down-regulated).

46

4.2 Novel miRNAs discovery

After profiling known miRNA expressions, there are still some reads which can’t be mapped to known miRNAs but can be mapped to genomes. To check these reads whether are the novel miRNAs or not. First, these reads are clustered according to their chromosome location. The maximum distance between two reads is 10 nt. Then, if the distance between two clusters is smaller than 100 nt, these two clusters are merged to one cluster (These two clusters could be miRNA and miRNA*). The flanking regions of these clusters are extracted. In animals, the flanking length is 100 nt at 5’ and 3’ end. In plants, the length is set as 200 nt.

Then, these flanking sequences are folded RNA secondary structures by using RNAfold which identifies stable RNA structures. These RNA secondary structures are filtered based on the characteristics of miRNA precursors. The structures are removed if they are not stem-loop structures (the reads must locate in the stem).

The structures are also removed if the ratio of base pairing in the stem region or the number of unpairing bases is not fitted with thresholds. These thresholds are different in each organism and constructed by pre-analyzing the RNA structures of the known miRNAs from miRBase. Table 4 lists the thresholds for the ratio of base pairing and the number of unpairing bases in animals and plants. For example, the thresholds for the ratio of base pairing and for the number of unpairing bases are 0.69 and 6 in human, respectively. There are 1355 human miRNAs and 90% miRNAs are fitted with these thresholds. Therefore, the structures are retained if the ratio of base pairing is greater than 0.69 and the number of unpairing bases is smaller than 6. Finally, the remaining structures are drawn as the figures and checked manually whether the secondary structures are unbranched fold-back (hairpin liked) (Figure 7).

47

To find more reliable novel miRNAs, the miRNA candidates are checked with cross-species conserved region. The concept of cross-species conserved region is widely used in identifying novel miRNAs. For example, if a known human miRNA locates in human, mouse and rat conserved region, this region may contain a miRNA in mouse and rat. The first appear is identifying novel C. elegans miRNAs [168]. Afterward, this concept is applied for discovery miRNAs in human [169]

and fly [170]. Recently, 529 novel chimp miRNAs were identified based on this concept [171]. The cross-species conserved sequences of miRNA candidates are extracted from UCSC Genome Browser [172]. Then, RNALogo which is the tool supporting RNA sequence and structure conservation information is applied to observe the conservation of the novel miRNAs. If they are highly conserved, they are more convinced as the novel miRNAs.

48

Table 4. The thresholds of miRNAs and corresponding coverage rates for each organism

Species # of miRNAs the ratio of base pairing # of unpairing bases coverage Animals

49

50

Figure 7. The flow of finding the novel miRNA. It contains clustering reads, flanking the cluster and folding RNA secondary structure

51