
De novo motif discovery on the promoters of co-expressed coding genes

CHAPTER 6 Regulation of lncRNA Expression

6.2 De novo motif discovery on the promoters of co-expressed coding genes


De novo motif discovery was conducted on the promoters of the coding genes in each cluster to identify potential regulatory elements. In this section, we first discuss how to define the promoter region of a gene. Second, the parameters used in de novo motif discovery are tuned and analyzed. Third, the quality of the discovered motifs is evaluated.

6.2.1 Promoter regions of genes in D. melanogaster

Promoter regions vary in length from hundreds to thousands of base pairs and may lie upstream or downstream of the transcription start site (TSS), depending on the species [85, 86]. In D. melanogaster, some studies have used (1,000 to 200 bp) as the promoter region [87, 88], and others have used (100 to 200 bp) [89]. To clarify which region should be considered the promoter of the genes in the identified co-expressed clusters for de novo motif discovery, we mapped the annotated TFBSs collected from the REDfly database [90] back onto the gene promoter regions and investigated the patterns of promoter structure. As shown in Figure 20(a), most annotated TFBSs are located in the regions adjacent to the TSS. For the annotated TFBSs located in the region of (-500 to +200 bp), we calculated the average conservation score (CS). The resulting value (0.482) is higher than the average CS of mRNA and lncRNA promoters (0.328 and 0.381, respectively; Table 11). In addition, the annotated TFBSs supported by evolutionary conservation (CS ≥ 0.482) were usually located in the regions adjacent to the TSS, as shown in Figure 20(b). In this thesis, the region of (-500 to +200 bp) was therefore used for the subsequent de novo motif discovery.

Figure 20. Distribution and conservation score (CS) analysis of the 2,059 annotated binding sites collected from the REDfly database [90] (170 TFs and 2,048 target genes included). (a) Position distribution. The average CS of TFBSs located within (-500 to +200 bp) is 0.482; (b) Frequency of TFBSs that have a CS value ≥ 0.482.
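The window-based averaging described above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the function names, the TSS-relative coordinate convention, and the toy sites are all assumptions, and a site is counted only when it lies entirely inside the window.

```python
def in_window(rel_start, rel_end, upstream=500, downstream=200):
    """Return True if a site, given in TSS-relative coordinates
    (negative = upstream), lies fully inside the promoter window."""
    return rel_start >= -upstream and rel_end <= downstream


def mean_cs_in_window(sites, upstream=500, downstream=200):
    """Average the conservation scores of the sites inside the window.

    sites: list of (rel_start, rel_end, cs) tuples (hypothetical layout).
    Returns None when no site falls inside the window.
    """
    scores = [cs for start, end, cs in sites
              if in_window(start, end, upstream, downstream)]
    return sum(scores) / len(scores) if scores else None


# Toy data, invented for illustration only.
toy_sites = [
    (-450, -440, 0.6),    # inside the window
    (-30, -20, 0.4),      # inside the window
    (-1500, -1490, 0.1),  # too far upstream, ignored
    (150, 160, 0.5),      # inside the window
]
toy_mean_cs = mean_cs_in_window(toy_sites)
```

Changing `upstream` and `downstream` allows the same comparison to be repeated for the other promoter definitions mentioned in the text.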

6.2.2 Parameter tuning for the weights of nucleosome occupancy and evolutionary conservation while conducting de novo motif discovery

To optimize the performance of de novo motif discovery using eTFBS [91], we adopted an analysis procedure to find the best parameter set for the weights of nucleosome occupancy and evolutionary conservation. Here, we fixed the pattern support used in the pattern mining step at 0.15 for the subsequent analysis. The pattern support is defined as the proportion of sequences among the coding gene promoters of a co-expressed cluster that contain an observed pattern. With this pattern support, the prediction of TFBSs can achieve the highest precision, as validated by the annotated TFBSs collected from the REDfly database [90] (Appendix Figure 1).
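The definition of pattern support can be illustrated with a short sketch. It assumes exact substring matching and treats 0.15 as a minimum threshold; both are simplifications made here for illustration, and eTFBS itself may match patterns differently. The function name and the toy sequences are hypothetical.

```python
def pattern_support(pattern, promoters):
    """Proportion of promoter sequences that contain the pattern at least once.

    Assumption: exact substring matching on one strand only.
    """
    if not promoters:
        return 0.0
    hits = sum(1 for seq in promoters if pattern in seq)
    return hits / len(promoters)


# Toy promoter sequences, invented for illustration only.
toy_promoters = ["ACGTGACC", "TTGACGTG", "GGGCCCAA", "ACGTTTTT"]
toy_support = pattern_support("ACGTG", toy_promoters)  # 2 of 4 sequences
keep_pattern = toy_support >= 0.15  # support value used in the text
```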

As described in the motif discovery procedure (section 6.4.3), a pattern ranking scheme (Eq. 1) is used for selecting reliable patterns. The equation has three tunable parameters (a, b, and c), which are the relative weights given to the position score, the nucleosome occupancy score, and the conservation score, respectively.

Nevertheless, the position score (with weight a) was designed for positive sequences that carry scores relevant to reliability, such as P values estimated from ChIP-seq experiments. In this thesis, the weight a was therefore set to 0, since the positive promoters were collected from each co-expression cluster and have no measured reliability scores. Consequently, only the weights (b, c) of nucleosome occupancy and evolutionary conservation were analyzed in this section. The values (0, 1, 2, 3) were used for b and (1, 2, 3) for c, so 12 parameter sets were tested in total.
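The parameter grid described above can be enumerated as in the sketch below. The variable names are illustrative only; the ranking equation itself (Eq. 1) is not reproduced here.

```python
from itertools import product

A_WEIGHT = 0               # position-score weight, fixed to 0 as in the text
B_WEIGHTS = (0, 1, 2, 3)   # candidate weights for nucleosome occupancy
C_WEIGHTS = (1, 2, 3)      # candidate weights for evolutionary conservation

# Every (a, b, c) combination evaluated during tuning: 4 x 3 = 12 sets.
parameter_sets = [(A_WEIGHT, b, c) for b, c in product(B_WEIGHTS, C_WEIGHTS)]
```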

To evaluate the performance of the predictions under different parameter sets, we collected the annotated TFBSs from the REDfly database [90] for validation. For each prediction run, a list of the top-10 putative motifs, along with their corresponding positions in the positive promoters (instances), was reported. We validated these predicted instances by comparing them with the collected annotated TFBSs. Precision was calculated as the ratio (True positives / Predicted instances), where a 'True positive' was counted when a predicted instance overlapped an annotated TFBS. Figure 21 suggests that evolutionary conservation information is useful for finding true TFBSs, since the highest c (solid line) obtained the best precision for each fixed b (each line color). Moreover, across the ranks of the predicted motifs, comparing the red lines with the blue lines shows that nucleosome occupancy information helped real TFBSs to be ranked higher.

Figure 21. Parameter tuning for the weights (b, c) given to nucleosome occupancy and evolutionary conservation. Different colors denote different weights for nucleosome occupancy: the colors (blue, red, green, orange) indicate b = (0, 1, 2, 3). Different line types represent different weights for evolutionary conservation: the line types (solid line, thick broken line, broken line) indicate c = (1, 2, 3).
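The precision measure can be sketched as follows. The text does not state a minimum overlap length, so this sketch assumes that sharing at least one position counts as an overlap; the function names, the dictionary layout, and the toy coordinates are likewise hypothetical.

```python
def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def precision(predicted, annotated):
    """True positives divided by predicted instances.

    predicted, annotated: dicts mapping a promoter id to a list of
    (start, end) intervals (hypothetical layout).
    """
    total = sum(len(instances) for instances in predicted.values())
    if total == 0:
        return 0.0
    true_positives = sum(
        1
        for gene, instances in predicted.items()
        for inst in instances
        if any(overlaps(inst, site) for site in annotated.get(gene, []))
    )
    return true_positives / total


# Toy data, invented for illustration only.
toy_predicted = {"geneA": [(-120, -111), (50, 59)], "geneB": [(-300, -291)]}
toy_annotated = {"geneA": [(-115, -100)], "geneB": [(-10, 5)]}
toy_precision = precision(toy_predicted, toy_annotated)  # 1 of 3 instances
```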

Taken together, the parameter set (b, c) was empirically set to (1, 3), which gave the best performance in TFBS prediction.

6.2.3 Evaluation of the discovered motifs

A list of the top-10 putative TFBSs was reported for each cluster, resulting in 270 putative TFBSs in total for the 27 clusters. About 79% of the predicted motifs (212 of 270) were similar to annotated TFBSs (Table 14). To confirm that these results were not random events caused by genome-wide motif mapping, we further mapped the discovered motifs onto the 3' untranslated regions (3' UTRs) and introns of coding genes. The frequency of motif hits in lncRNA gene promoters was significantly higher than in 3' UTRs and introns, while it was not different from that in coding gene promoters.

Table 14. Summary of de novo motif discovery results

Promoter region                                         -500/+200
Num. of clusters with annotated TFBS                    27
Num. of predicted motifs                                270
Num. of predicted motifs supported by annotated PFMs    212 (78.52%)
Num. of involved annotated PFMs                         73

Table 15 shows that the frequency distribution of motif hits for all the predicted TFBSs does not differ significantly between the mRNA and lncRNA promoters (P-value of paired t-test: 0.157). In contrast, the motif-hit frequency distribution of lncRNA promoters differed significantly (P-value lower than 0.05) from those of the 3' UTRs and introns of mRNAs (P-value: 0.031 and 0.027, respectively), which mirrors the behavior of the distribution calculated from mRNA promoters. In summary, these results support the identified co-expressed clusters by showing that the promoters of coding genes in co-expressed clusters share motifs that are similar to the annotated TF PFMs.

Table 15. Investigation of similarity between lncRNA and mRNA promoters

Discovered motifs matched onto different sequence sets    P-value of paired t-test
mRNA promoter vs. lncRNA promoter                         0.157
mRNA promoter vs. 3' UTR                                  0.012
lncRNA promoter vs. 3' UTR                                0.031
mRNA promoter vs. intron                                  0.035
lncRNA promoter vs. intron                                0.027
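The paired comparison behind Table 15 can be sketched with the standard library. The block below computes only the paired t statistic and its degrees of freedom; the P-value additionally requires the t distribution, for which a routine such as `scipy.stats.ttest_rel` could be used. The function name and the toy hit counts are assumptions, and how motif hits were counted per motif in the thesis is not specified here.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Paired t statistic and degrees of freedom for two matched samples.

    x, y: hit frequencies of the same motifs on two sequence sets,
    listed in the same motif order (hypothetical layout).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [a - b for a, b in zip(x, y)]
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Toy hit counts per motif, invented for illustration only.
toy_promoter_hits = [12, 9, 15, 7]
toy_utr_hits = [8, 7, 9, 6]
toy_t, toy_df = paired_t_statistic(toy_promoter_hits, toy_utr_hits)
```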

6.3 Co-occurrence of TF binding motifs in the promoter regions of
