Chapter 2 Results
2.1 The Mechanism of Boundary Determination in the Internal Sequence Elimination
2.1.3 IESs are conserved within different inbred lines
We showed that IES locations were highly conserved among B strains. Next, we
wanted to know whether this conservation is shared by different inbred lines. C3
inbred strains were derived from different inbreeding programs and exhibit genetic
polymorphisms relative to the B inbred strains (67). We sequenced two C3 strains, 6a and 12a,
and observed that they contained similar numbers of IESs to the B strains; 6586 IESs
(7125 forms) and 6646 IESs (7168 forms) in 6a and 12a, respectively (Fig. 4A). Most of
these IESs were shared between these two C3 strains (95.81% in 6a and 94.94% in 12a)
(Fig. 4B), and the majority was also shared with the three B strains (at least 71.44%) (Fig.
4C), indicating that most IESs are universal among different strains. However, the IES
boundaries differed between the inbred lines. For instance, 791 IESs had precise boundaries
in all three B strains, but only 345 of these were also precise in the two C3 strains, with 80.66% of
the remainder exhibiting small variance (within 20 bp) and 12.90% exhibiting large
variance (>100 bp) (Fig. 4D and 4E). These results suggest that genetic background can
influence IES boundary precision.
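The boundary classes used above (precise, within 20 bp, more than 100 bp) can be illustrated
with a minimal Python sketch. It assumes that each IES is described by one (left, right)
coordinate pair per strain and that variance is measured as the largest spread at either
end; the text does not state how the spread was computed, so the function names and the
"intermediate" class are illustrative assumptions only.

```python
def boundary_spread(coords):
    """Largest spread (bp) at either end; coords holds one (left, right) pair per strain."""
    lefts = [left for left, _ in coords]
    rights = [right for _, right in coords]
    return max(max(lefts) - min(lefts), max(rights) - min(rights))


def classify_ies(coords, small=20, large=100):
    """Assign an IES to a boundary-variation class (thresholds taken from the text)."""
    spread = boundary_spread(coords)
    if spread == 0:
        return "precise"
    if spread <= small:
        return "small"
    if spread > large:
        return "large"
    # Assumption: spreads between the two thresholds form their own class.
    return "intermediate"


# Hypothetical coordinates for one IES observed in three strains.
print(classify_ies([(1000, 1500), (1004, 1512), (1000, 1500)]))
```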
2.1.4 cis-acting elements near IES boundaries could be involved in IES regulation
Since a large proportion of the IES boundaries showed only small variations, we
tried to determine if there was any cis-acting element in the flanking regions of other IESs.
The nucleotide distribution near the IES junctions from the 500-bp inner sequences to the
500-bp upstream or downstream flanking regions of all IESs showed that the regions
contained a lower GC content than the average of the MIC genome (25% GC). However,
significant fluctuations could be observed within the 100-bp flanking regions, implying
that the boundary of IESs could be marked in cis (Fig. 5).
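A position-wise composition profile of this kind can be sketched as follows. This is a
minimal illustration, assuming that all windows have already been extracted with the same
length and aligned on the IES junction; the function name gc_profile is hypothetical and is
not the tool used in this study.

```python
from collections import Counter


def gc_profile(windows):
    """Per-position GC fraction across equal-length windows aligned on the junction."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = []
    for i in range(length):
        column = Counter(w[i].upper() for w in windows)
        total = sum(column[base] for base in "ACGT")  # ambiguous bases are ignored
        gc = column["G"] + column["C"]
        profile.append(gc / total if total else 0.0)
    return profile


# Two toy windows; real input would be the junction-centred windows of all IESs.
print(gc_profile(["ATGC", "AAGG"]))
```

Plotting the returned list against position relative to the junction would give a curve
comparable in spirit to the nucleotide distribution described above.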
To group cis-acting elements with a particular feature, we used a motif finding tool,
eTFBS, to scan the flanking sequences (http://bioinfo.bime.ntu.edu.tw/c4lab/). The
100-bp IES flanking regions were used as the query dataset to find 10 overrepresented
motifs (Top 10 motifs) and the IES flanking regions that are 100-200 bp from the
junctions were used as background datasets (Fig. 6A). Most of the motifs revealed high
AT patterns, which could be due to the higher AT content within these regions (Fig. 5).
However, we could not find any common pattern amongst motifs in the IES flanking
regions, except for two (motifs 2 and 7) that displayed a consistent distance to the IES
boundary when they served as IRs but not as direct repeats (DRs). Moreover, we found that
these two motifs have the same core sequence “TACCNT”. We found a total of 1881 motifs
when we used this core sequence to scan all 100-bp IES flanking regions, 199 of which
were paired on both sides of an IES as IRs and 57 of which were paired as DRs (Table 2).
The distances between these IRs to respective IES boundaries were about 62 bp, and the
difference of these distances between the two ends of the same IES was about 11 bp,
which differed from the distributions of DRs (Fig. 6B). Furthermore, 80.40% of IESs that
were flanked by these IRs belonged to the group with boundary variations within 20 bp among the B strains,
suggesting that this motif could be related to regulation of IES boundaries (Fig. 6C).
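The core-sequence scan can be illustrated with a short Python sketch. It assumes that both
flanks are read on the same strand, so that a pair found on the same strand is a DR and a
pair found on opposite strands is an IR; the helper names are hypothetical and the scan is
not the eTFBS procedure itself.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string; N stays N."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(flank, motif="TACCNT"):
    """Return sorted (start, strand) hits of the motif or its reverse complement."""
    sites = []
    for strand, pattern in (("+", motif), ("-", revcomp(motif))):
        # N matches any base; the lookahead also reports overlapping hits.
        regex = re.compile("(?=(%s))" % pattern.replace("N", "[ACGT]"))
        sites.extend((m.start(), strand) for m in regex.finditer(flank))
    return sorted(sites)


def pair_type(left_strand, right_strand):
    """Same strand on both flanks is a direct repeat; opposite strands, an inverted repeat."""
    return "DR" if left_strand == right_strand else "IR"


# Toy flanks: the left flank carries the motif, the right flank its reverse complement.
left_hits = find_sites("GGTACCATGG")
right_hits = find_sites("CCAAGGTACC")
print(left_hits, right_hits, pair_type(left_hits[0][1], right_hits[0][1]))
```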
To search further for more candidate cis-acting elements that could regulate IES
boundaries, we focused on the IRs located at a similar distance (less than 10 bp difference)
to the boundaries on both ends of the same IES in CU427. We arbitrarily defined the IR as
a pentamer sequence and allowed no mismatch between the two repeats. IRs that had an
identical sequence at the 5’-end were clustered and we identified the localization
distribution by calculating the interquartile range (IQR). The IQR served as an indicator
of location concentricity in the IES flanking regions. A larger IQR indicated that the
distribution of the IR was dispersed within IES flanking regions and vice versa. We
identified 472 5-mer sequences serving as IRs at both ends. The “TAAAA” sequence
presented the highest frequency; however, it was widely distributed and had a very high
IQR (IQR = 56) with no apparent pattern (Fig. 7A). In contrast, we found 138 pentamer
IRs with low IQRs (IQR ≤ 10), indicating that the distribution of these IRs is concentrated
within a small range in each group (Table 3).
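The pentamer search and the IQR summary can be sketched as follows. The sketch assumes
that the left flank ends at the IES boundary and the right flank starts at it, that
distances are counted in bp from the boundary to the nearest base of each copy, and that
quartiles are computed with the "inclusive" method of the Python standard library; none of
these details are specified in the text, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def pentamer_irs(left_flank, right_flank, k=5, max_diff=10):
    """Yield (kmer, left_distance, right_distance) for exact inverted-repeat pairs."""
    right_index = defaultdict(list)
    for j in range(len(right_flank) - k + 1):
        right_index[right_flank[j:j + k]].append(j)  # j is the distance to the boundary
    for i in range(len(left_flank) - k + 1):
        kmer = left_flank[i:i + k]
        d_left = len(left_flank) - (i + k)
        for d_right in right_index.get(revcomp(kmer), []):
            if abs(d_left - d_right) < max_diff:  # "less than 10 bp difference"
                yield kmer, d_left, d_right


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of distances."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_by_pentamer(hits):
    """Group IR hits by the left-copy sequence and report the IQR of their distances."""
    by_kmer = defaultdict(list)
    for kmer, d_left, _ in hits:
        by_kmer[kmer].append(d_left)
    return {kmer: iqr(d) for kmer, d in by_kmer.items() if len(d) >= 2}
```

A group whose copies cluster at one distance gives a small IQR, whereas a group scattered
across the flanking region gives a large one, which is the contrast drawn above between the
low-IQR groups and “TAAAA”.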
Interestingly, when we clustered low-IQR groups that shared the same core sequence,
we found that they still maintained their concentricity; 4119 IRs contained the core
sequence “TATA”, which was the most frequent group with a low IQR (Fig. 7B). However,
some pentamer groups containing “TATA” were widely dispersed (Fig. 7G-7J). The
sequence consensus could be further specified by aligning four pentamer IRs with low
IQRs in the “TATA” group (Fig. 7C-7F and 7K), and the distribution of this consensus
is shown in Fig. 7L. We identified additional groups with features similar to those of the
TATA group. For example, 702 pentamer IRs contained TGT (Fig. 8A)
and 696 IRs contained ACT (Fig. 8D). The refined consensus maintained
the concentrated patterns in these IR groups (Fig. 8B-8C and 8E-8F). Furthermore, we
observed low-IQR IRs that shared the same sequence as the “TACCNT” IR identified above
(Fig. 9A and 9B), indicating that this method could identify the consensus of
IRs that showed distinct patterns near the IES flanking regions.
Moreover, pentamer IRs composed of G or C were also highly represented, with
distribution patterns similar to IRs with the “GGG” sequence and other IRs with at least
three G residues (G-rich). The distances of “GGG” and “G-rich” IRs to the IES
boundaries were 46.59 bp and 46.76 bp on average, respectively (Fig. 9C and 9D).
However, the distribution patterns of IRs with the “CCC” sequence and the C-rich IRs
differed, implying that the C-rich IRs could be utilized by several regulatory elements
(Fig. 9E and 9F).
2.1.5 G-rich and C-rich inverted repeats are the cis-acting elements of Lia3-affected
IESs
A recent study showed that the function of Lia3 is to act as a cis-acting element
binding protein that regulates the boundary of IESs containing G-rich polypurine tracts,
including the M element (59). Low progeny viability and the loss of boundary precision
of the M element and four other M element-like IESs in Lia3-deficient progeny lines indicate
that Lia3 is linked to maintenance of IES boundaries. In our study, we observed that
G-rich IRs were highly represented in the IES flanking regions. Hence, we postulated that
this motif could be regulated by Lia3. To test our hypothesis, we sequenced three progeny
lines (3-1, 4-1 and 27-2), which were generated from Lia3 strains (Fig. 10). Note that
the Lia3 strains were generated from a mating of CU428 and BII; hence, we compared
the IES distribution between the three B strains and the Lia3 progeny strains. We
noticed that the proportion of IESs with more than 100-bp boundary variations among different
forms was 11.07%, which was higher than in the WT strains (Fig. 10). Since boundary
regulation was lost in the Lia3 strains, we suggest that the increased number of IESs
with large variations is caused by the Lia3 deficiency.
To identify the subset of IESs that is potentially regulated by Lia3, we first focused
on the 519 Lia3-affected IESs that showed more than 100-bp boundary variations when the
Lia3 progeny were compared with the WT strains (Table 4). Of these 519 IESs,
59.34% varied by no more than 20 bp in WT strains, indicating that the majority of
Lia3-affected IESs had only small variation. Hence, we extracted the 100-bp
Lia3-affected IES flanking regions that were within 20-bp variation in CU427 WT strains
and calculated the nucleotide distribution of these regions. We found a distinct
enrichment of G from 40-60 bp to the IES boundary (Fig. 11), which agrees with our
expectations based on the limited cases and further extended the potential size of this
group. We also observed a small peak of C from 25-40 bp to the IES boundary (Fig. 11),
suggesting that a C-rich motif could also play a role. By searching for significant motifs
within these regions using MEME, we identified the G-rich motif as “GAGGG” (Fig.
12A). In order to know whether this motif exists at both ends of an IES, we used this
motif and its reverse complement (CCCTC) to scan IES flanking regions (Fig. 12A and
12B). In this case, we did not limit the difference of motif distances to the IES boundary;
however, if there was more than one IR pair at an IES, the IR pair with the minimum
difference of distance was chosen. Surprisingly, if we set the threshold as 60% identity of
the position weight matrix (PWM) value, we observed that 94.48% of Lia3-affected
IESs contained G-rich or C-rich IRs. Both copies of the IRs were at similar distances from
the IES boundaries. In fact, there were fewer differences between each IR pair than
among such distances in all IESs of this group (Table 5 and Fig. 12C and 12D). If we
further restricted the comparison to the subset of IESs that showed boundary variations of
10-bp or less among WT strains and exhibited more than 100-bp boundary variations among the Lia3 strains, the proportion of the G- or C-rich IR containing IESs increased
to 96.09%, indicating that almost all Lia3-affected IESs that have very limited boundary
variations were flanked by the G- or C-rich IR.
However, the background ratio of IESs with these IRs in the entire IES flanking
region is also high (about 73%), indicating that the threshold of the PWM value should be
adjusted. To optimize the outcome, we tested different identities of PWM value and found
that in the G- or C-rich IRs with 75% identity, more than 60% of Lia3-affected IESs
contained one of these IRs, but only about 14% of background IESs had them (Table 5).
Hence, we set the PWM threshold as 75% identity for subsequent experiments.
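The effect of the identity threshold can be illustrated with a small Python sketch. It
assumes that "identity" means the score of a site divided by the maximum score attainable
with the PWM, which is one common convention but is not confirmed by the text. The matrix
below is a toy built around the “GAGGG” consensus with made-up weights; it is not the
MEME-derived PWM used in this study.

```python
def toy_pwm(consensus, match=0.85, other=0.05):
    """Hypothetical PWM: one column per position, favouring the consensus base."""
    return [{b: (match if b == c else other) for b in "ACGT"} for c in consensus]


def relative_score(pwm, site):
    """Site score as a fraction of the best possible score for this PWM."""
    best = sum(max(col.values()) for col in pwm)
    return sum(col.get(base, 0.0) for col, base in zip(pwm, site)) / best


def has_site(seq, pwm, threshold):
    """True if any window of the sequence reaches the relative-score threshold."""
    width = len(pwm)
    return any(relative_score(pwm, seq[i:i + width]) >= threshold
               for i in range(len(seq) - width + 1))


def fraction_with_site(seqs, pwm, threshold):
    """Fraction of sequences that contain at least one site above the threshold."""
    return sum(has_site(s, pwm, threshold) for s in seqs) / len(seqs)


pwm = toy_pwm("GAGGG")
flanks = ["TTGAGGGTT", "TTGACCGTT"]  # toy flanking sequences
for threshold in (0.60, 0.75):
    print(threshold, fraction_with_site(flanks, pwm, threshold))
```

Running the same function on the Lia3-affected set and on the background set at several
thresholds would give the kind of comparison summarized in Table 5: a looser threshold
admits degenerate sites in both sets, whereas a stricter one separates them.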
Next, we suspected that the G- or C-rich IRs also existed among IESs that exhibited
larger variations among strains. We extracted IESs that showed more than 100-bp boundary
differences between WT and Lia3 strains, and scanned for the two IRs in the
IES flanking regions in all three WT strains. Significantly, 167 out of 175 IESs that had
G- or C-rich IRs were shared among all WT strains (representing 58.95% of
Lia3-affected IESs), but only 9.81% of them were present in the background IESs among
all WT strains (Table 6). Moreover, the distances of the two IR copies to both ends of an
IES boundary and the difference between them in G/C-rich IRs containing Lia3-affected
IESs were quite consistent among strains (Fig. 12C and 12D), and they were significantly
different from IRs found in the background IESs (p-value < 10^-5 on average) (Fig. 12E and
12F).
We also considered the G-rich or C-rich motifs as DRs. We identified 41 G-rich DRs
and 45 C-rich DRs in the group of Lia3-affected IESs. However, the distances to the IES
boundaries varied (53.32 ± 23.24 bp in G-rich DRs and 50.12 ± 22.54 bp in C-rich DRs). In
addition, the difference between the distances of the two copies to the IES boundaries
was higher for DRs (about 25 bp in G-rich DRs and about 17 bp in C-rich DRs)
than for IRs, indicating that DRs were less likely to regulate Lia3-affected IESs. In
conclusion, our results show that Lia3-affected IESs are regulated by G-rich and C-rich
cis-acting elements.
2.1.6 piggyBac transposon families in Tetrahymena
Domesticated piggyBac transposases play an important role in IES excision (41,
68). Six putative domesticated piggyBac transposase genes can be identified in the
Tetrahymena Genome Database (http://ciliate.org/index.php/home/welcome), TPB1,
TPB2, TPB3, TPB6, TPB7 and LIA5 (Fig. 13). Inhibition of TPB2 expression prevents
IES deletion globally and affects the developmental process (68). Tpb1p and Tpb6p
probably function as a heterodimer and both are required to regulate a small subset of
IESs that contain a specific inverted terminal repeat (ITR) highly similar to the ITR of the
piggyBac transposon in Bombyx mori (69). Moreover, unlike most TPB2-dependent IESs
that show deletion boundary heterogeneity, TPB1/6-dependent IESs are precisely deleted,
leaving one copy of the duplicated TTAA at the termini in the MAC genome (41). These
features suggest that TPB1 and TPB6 act much more like a piggyBac transposase than
TPB2 does.
Since there are several domesticated piggyBac transposases in the Tetrahymena
genome, we wanted to know whether there is any active piggyBac transposon. The
Tetrahymena Genome Database (http://ciliate.org/index.php/home/welcome) reports a
piggyBac-like gene in the MIC-specific region, which we have named TPB3. We
analyzed the DNA structure of this gene, compiling the complete coding region to
investigate whether it has features of an active transposon. We searched upstream and
downstream of this sequence for ITR-like sequences and found that TPB3 indeed
possesses typical DNA structures of an active piggyBac transposon, with the TTAA
tetranucleotide DRs adjacent to a pair of ITRs, as well as a coding region (without introns)
for the putative piggyBac transposase with the conserved DDD (or DDE) domain (Fig. 14
and 15). We then compared the full-length transposon structure with the first identified
piggyBac transposon, IFP2, in Trichoplusia ni and found that their compositions are quite
similar (Fig. 16) (70). These observations support the idea that TPB3 is likely an active piggyBac
transposon.
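The two structural checks described above (TTAA direct repeats flanking the element and a
pair of terminal inverted repeats) can be sketched in Python. The sequences below are toy
strings invented for illustration, not the TPB3 sequence, and the sketch only detects
perfect terminal repeats, whereas real ITRs may contain mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(COMPLEMENT)[::-1]


def terminal_ir_length(element, max_len=30):
    """Length of the perfect inverted repeat shared by the two ends of the element."""
    n = 0
    for a, b in zip(element[:max_len], revcomp(element)[:max_len]):
        if a != b:
            break
        n += 1
    return n


def has_ttaa_flanks(genome, start, end):
    """True if TTAA lies immediately on both sides of genome[start:end]."""
    if start < 4:
        return False
    return genome[start - 4:start] == "TTAA" and genome[end:end + 4] == "TTAA"


# Toy element: a made-up 9-bp ITR, a short internal region, and the inverted ITR copy.
itr = "CCGATTACG"
element = itr + "ATGCATGAC" + revcomp(itr)
genome = "GG" + "TTAA" + element + "TTAA" + "CC"
print(terminal_ir_length(element), has_ttaa_flanks(genome, 6, 6 + len(element)))
```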
We then used the sequence of TPB3 to search the MIC genome for other copies. We
identified an additional two copies in the MIC genome and named them TPB4 and TPB5.
TPB3 and TPB4 are located in a 14 kb and an 11 kb IES, respectively, suggesting that
they were initially inserted into pre-existing IES regions (Fig. 17). Interestingly, the copy
of the TTAA DR adjacent to the 3’-ITR of TPB5 coincides with one boundary of the IES,
and the other IES boundary is only about 200 bp away from the 5’-ITR of TPB5 (Fig. 17
and 18), raising the possibility that this IES may actually be formed from a TPB5
invasion. TPB4 shows 88% DNA sequence identity to TPB3 and all three conserved
regions can be aligned (Fig. 19). However, there are frameshifts in TPB4 that cause loss
of the first and third D in the predicted amino acid sequence, suggesting TPB4 is a
dysfunctional transposase. Although the transposase part of TPB5 is destroyed as well,
the ITRs and the untranslated regions (UTRs) are still highly conserved among TPB3,
TPB4 and TPB5 (Fig. 19). It is notable that the ITRs among these three transposons are
very similar. The sequences of the 3’-ITRs of TPB3 and TPB4 are identical (Table 7),
suggesting that they are derived from the same origin. Interestingly, the first 11 bp of the
5’-ITR of TPB5 are identical to those of TcPLE1, a piggyBac transposon identified from the
red flour beetle Tribolium castaneum (71) but not to BmPBLEs from Bombyx mori (69).
This finding indicates that TPB3, TPB4 and TPB5 could have been derived from a
common origin that differs from the other piggyBac transposons and invaded the
Tetrahymena genome at different time points. Moreover, the evolutionary relationships
inferred from the truncated DDD domain show that TPB1, TPB2 and TPB6 are in the same clade,
and TPB2 is more closely related to the PiggyMac transposase of Paramecium (72).
However, TPB3 and TPB4 (the frameshift-corrected form) are in a different clade (Fig.
20). Further analysis shows that the TPB3 clade is more closely related to human
PGBD5 (Fig. 21).
2.1.7 Discussion
The aim of this study was to investigate the global regulatory mechanism of IES
boundaries by comparing different Tetrahymena strains. We observed that IES boundary
deletions were highly conserved among Tetrahymena strains, but the precision of the
deletion boundaries varied. When we compared IES precision between two different
inbred lines, strains B and C3, most of the deletion boundaries presented small variations,
indicating that the majority of IESs exhibit microheterogeneity at the deletion boundaries.
Nevertheless, nucleotide distributions near IES junctions showed significant
fluctuations within the 100-bp flanking regions, implying that the DNA sequences could
be linked to boundary determination. To understand whether IES elimination is regulated,
we searched for cis-acting elements among the IES flanking regions, such as the A5G5 IR
of the M element. Surprisingly, we found that all of the identified cis-acting elements are
more likely to be IRs. The same group of cis-acting candidates was located at rather
invariable distances to the IES boundaries and the distance of the two copies of IRs to
both IES ends differed only slightly, suggesting that the IRs could cooperate with each
other. Hence, we hypothesize that IRs can interact with the cis-acting element binding
proteins to set boundaries and interact with other related proteins to help Tpb2p to cut two
ends at the same time. Furthermore, after Tpb2p excision, the structure could protect the
MAC-destined region from the cleavage of endonuclease and maintain two double-strand
breaks in close proximity and thereby help to join ends by NHEJ (Fig. 22).
In addition, we found that a deficiency of Lia3 also affects IESs that have larger
boundary variations among different forms in WT strains. In the subset of the
Lia3-affected IESs with more than 20-bp boundary variations, the distance of the
cis-acting elements to the IES boundaries still varied only slightly, suggesting that
recognition of the cis-acting element was the determining factor of the IES boundary, i.e.
changing the position of the cis-acting element would change the IES boundary. This
scenario has also been described for the M element, which has 0.6-kb and 0.9-kb
deletion forms (73). The two forms have the same boundary at the 3’-end and a 0.3-kb
difference at the 5’-end. The two 5’-end boundaries both contain the 5’-A5G5 motif about
45 bp away (58). Furthermore, when we searched for IESs that had at least one boundary with
small variations (within 20 bp) among different forms in all three B strains, 6513 of 6599
IESs (98.70%) were identified. If we hypothesize that the small boundary variations are
controlled by the nearby cis-acting element, the number of IES forms is determined by
the number of cis-acting elements near the IES boundaries. Our results suggest that the
majority of IESs contain multiple cis-acting elements in the IES flanking regions.
We observed that there are two cis-acting elements in Lia3-affected IESs, namely
G-rich and C-rich IRs. More than 90% of Lia3-affected IESs contained at least one of the
cis-acting elements under our threshold of 60% identity of the PWM value, whereas
about 60% of them contained one of the cis-acting elements under a threshold of 75%
identity. This finding indicates that almost all Lia3-affected IESs contained G-rich or
C-rich IRs, though some exhibited lower similarity that may affect the binding ability of
Lia3. Interestingly, the distance of the two copies of IRs to the IES boundaries differed
between G-rich and C-rich IRs (51 bp and 38 bp, respectively). A recent study showed that
Lia3 preferentially binds to single-strand sequences with five guanine residues, which
forms a parallel G-quadruplex in vitro (59), but its ability to bind C-rich sequences is very
poor, suggesting that Lia3 could bind to the G strand in both G-rich and C-rich IRs.
However, the represented motif in our study was “GAGGG”, which has been shown to
have the most unstable form for maintaining the G-quadruplex structure (74), suggesting
that the distance variation between G-rich and C-rich IRs is less likely to be due to the
direction of the G-quadruplex. Hence, we suspect that the orientation of the G strand
toward the IESs may affect the direction of Lia3 dimerization and further alter the
conformation of the protein complex to shift the cutting site.
Although domesticated piggyBac transposases are responsible for the excision of
IESs, features of transposons (such as ITRs and the TTAA cutting site) remained only in
the TPB1-dependent IESs and were lost in the TPB2-dependent IESs. In order to
understand more about the piggyBac transposase in Tetrahymena, we investigated an
active transposon TPB3, which is an MIC-specific sequence. The ITR sequence and
evolutionary analysis of the transposase showed that TPB3 and its variants had origins
different from those of other TPBs. A previous study showed that TPB2 has an ortholog in