CHAPTER 1: Introduction
1.4. Hypothesis
We hypothesized that gene duplication and TE accumulation in the non-recombining Sb supergene may underlie some of the different phenotypes associated with a polygyne colony.
Duplicated genes may affect queen discrimination by workers, leading to selective mortality of SB/SB queens in polygyne colonies. For example, desaturases gene families in insects seem to frequently evolve by gene duplication and have been proposed to have a role in the production of cuticular hydrocarbons (CHCs) involved in the recognition of different species or castes (Helmkampf, Cash, & Gadau, 2015). Some duplicated gene could also be involved in one or more of the “polygyne syndrome”
phenotypes. Similarly, TEs may affect Sb associated phenotypes, such as by disrupting important genes or by providing cis-regulatory elements (E B Chuong, N C Elde, & C Feschotte, 2016; Edward B Chuong, Nels C Elde, & Cédric Feschotte, 2016; Ellison & Bachtrog, 2013; Lynch et al., 2015).
To understand gene duplication in the supergene, we first characterized the genetic elements that have different copy number between the SB and Sb-carrying individuals in S. invicta, by conducting a bioinformatics analysis of copy number variation, validated a subset by DNA quantitative PCR (qPCR) digital droplet PCR (ddPCR).
We then conducted over-representation analysis for the genes with copy number differences between SB and Sb genotypes. From this analysis, we confirmed a general trend of TE accumulation on the Sb allele, and we found many non-TE fire ant genes with greater copy number on Sb.
Interestingly, many these non-TE genes encode enzymes implicated in CHC synthesis, which may be involved in differential queen odors.
Furthermore, we found that the majority of the CHC enzyme genes with
virgin queens, and generally highly expressed in SB/Sb mature queens.
Among these genes, we selected two candidate genes possibly involved in differential odor production of SB/SB and SB/Sb queens.
In addition, we compared the expression of TEs with their copy number differences in SB and Sb, and found a weak correlation between TE proliferation and increased expression levels.
Mapping of select genes and TEs with copy number differences on the SB and Sb genomes showed that these genes were probably translocated from other chromosomes, and underwent multiple duplication and rearrangement events.
CHAPTER 2
Materials and Methods
2.1. Collection of transposable elements (TEs) and other fire ant genes to detect copy number variation (CNV)
2.1.1. Motivations
To detect TEs and non-TE genes with copy number difference between SB and Sb-carrying individuals, we first built a comprehensive, non-redundant gene list of 27,279 S. invicta genes (hereafter, nr-sinv-ref) by combining three TE and four fire ant gene datasets available at the start of the analysis (Figure 2.1).
We used this nr-sinv-ref, instead of the draft genome, to avoid some potential problems. First, our experimental design (comparing genome sequencing from SB and Sb haploid brothers, see below), targeted the supergene, which is poorly assembled due to repeats and re-arrangements.
Some TEs and non-TE genes, especially those that are Sb-specific (e.g., piggyBac transposon (Wang et al., 2008)) are not represented in the SB draft genome and thus would be missed in the analysis (Wang et al., 2013). Second, repeats in the genome can cause dilution of read depth and therefore potential underestimation of the number of duplicated genes for our following CNV analyses. Third, a non-redundant list permits a better estimate of which types of TEs are preferentially duplicated. A downside of this strategy is that it may miss calling CNV for some genes, presumably edge cases, such as those with many small exons.
2.1.2. Trinity de novo assembly of 13 RNA-seq libraries for the collection of fire ant TEs
To obtain as many fire ant TEs as possible, we decided to include expressed TEs found in a Trinity de novo assembly of 13 RNA-seq datasets available at the start of the experiment. These RNA-seq datasets included germline tissues, where “active” TEs should be expressed, as well as different somatic tissues because many experiments have revealed TE expression in non-germline tissue (Colborn, Byrd, Koita, & Krogstad, 2008).
We performed assembly of 13 RNA-seq libraries (Table 2.1) using Trinity-v2.2.0 (Haas et al., 2013). Prior to assembly we applied standard quality control filters. Each library quality was assessed by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). The raw reads were quality trimmed and contaminating adaptors were removed using cutadapt (Martin, 2011) [paired-end: cutadapt -q 20 --trim-n -m 50 -b <adapters> -B <adapters> -o outputR1.fastq -p outputR2.fastq inputR1.fastq inputR2.fastq; single-end: cutadapt -q 28 --trim-n -b
<adapters> input.fastq > output.fastq]. Reads were then trimmed of the first 15 bp at the 5’ end [paired-end: cutadapt –u 15 –U 15 –m 50 –o outputR1.fastq –p output R2.fastq inputR1.fastq inputR2.fastq; single-end:
cutadapt –u 15 –trim-n –m 50 input.fastq > output.fastq] due to bias in sequence composition that may lead to transcript artifacts in the de novo assembly.
De novo assembly was performed using Trinity-v2.2.0 (Haas et al., 2013) in paired-end mode [Trinity --seqType fq --left left.fastq --right right.fastq --min_contig_length 305 --min_kmer_cov 2]. Left reads from all seven paired-end libraries and reads from the six single-end libraries were combined as left reads, while only right reads from paired-end libraries were combined as right reads.
The basic statistics of the assembly results are in Table 2.2.
To assess the assembly quality, we conducted two analyses. First, the raw reads (Table 2.1) were mapped back to the de novo assembled fire ant transcriptome using bowtie-2.2.1 (Langmead & Salzberg, 2012) [paired-end: bowtie2 -local --no-unal –x Trinity_bowtieindex.fasta -p 10 -q -1 readsR1.fastq -2 readsR2.fastq -S output.sam; single-end: bowtie2 -local --no-unal –x Trinity_bowtieindex.fasta -p 10 -q -U reads.fastq -S output.sam]. Paired-end reads were mapped separately from single-end reads. Of a total of 191 million paired-end reads, 98.1% mapped back to the assembled transcriptome, and of these 97.1% as proper pairs while the remaining 2.9% mapped as singlets. For the single-end reads 82.8% of the 195 million single reads could be mapped (Table 2.3).
Second, to assess the completeness of the assembled transcriptome, we searched for the presence of 4,415 hymenoptera BUSCO (Benchmarking Universal Single-Copy Orthologs) genes (“hymenoptera_odb9 set”
downloaded on 2017/02/05. BUSCO genes are highly-conserved and expected to be present in divergent lineages (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). We used the software BUSCO_v2.0 (http://busco.ezlab.org/) to search for the BUSCO genes in our database [BUSCO.py -o <BUSCO genes on Trinity assembly output>
-i <Trinity assembly.fasta> -l hymenoptera_odb9 -m tran]. Among the 4,415 hymenoptera BUSCO genes, 4,200 Complete BUSCOs (95.2%) were present in our assembled transcriptome, while 124 (2.8%) were fragmented and 91 (2.0%) were missing. Among the complete BUSCO genes, 1,998 were single-copy and were used for following analysis.
Some of the missing genes may be due to their lack of expression in the tissues or life stages sampled, and conserved gene can be lost in some lineages. Our de novo assembled transcriptome identified a higher number of hymenoptera BUSCO genes compared to the fire ant OGS
(Wurm et al., 2011) and the NCBI-Sinv100 (93.8% and 72.7% BUSCO genes found, respectively).
Our Trinity assembly produced 91,409 transcript isoforms, corresponding to 62,189 genes. This gene number is an over-estimate because the genomes for several insects typically have between ~10,000 to ~20,000 genes (Simola et al., 2015). Trinity assemblies often exceed the predicted gene number by 3-4 fold due to redundant or incorrectly assembled transcripts. To avoid spurious transcripts, Trinity assemblies are often filtered for expression or presence of ORFs (Albertin et al., 2015; Chan et al., 2016; Singh et al., 2017; Theissinger et al., 2016), especially when a high number of reads are used (Ana Conesa et al., 2016). In our analysis, we avoided filtering transcripts to permit the identification of lowly expressed and partial TEs.
A total of 6,853 were identified among the transcripts ad putative TEs (tblastx search (Altschul, Gish, Miller, Myers, & Lipman, 1990) against the invertebrate repetitive elements from RepBase v21.01 (Bao, Kojima,
& Kohany, 2015) (E-value <1e-20). As the transcript include multiple isoform, we selected only the longest isoform, for a total of 4,128 TE transcripts. Among these, redundancies were filtered with CD-HIT (Fu, Niu, Zhu, Wu, & Li, 2012; W. Li & Godzik, 2006) (sequence identity 95%)), for a total of 3,665 non-redundant putative TEs that were used to augment the gene list for CNV detection (nr-sinv-ref).
2.1.3. Collection of fire ant TEs
The 3,665 TEs found using de novo assembly of fire ant RNA-seq databases, we combined with previous TE gene sets (247 S. invicta RepBase TEs (Bao et al., 2015) and 16 differentially expressed TEs (Nipitwattanaphon, Wang, Dijkstra, & Keller, 2013)). Redundancies among datasets were identified through a blastn search (percent identity
≥95% and bitscore ≥200) and discarded. We obtained a final list of 3,678 TEs.
2.1.4. Collection of fire ant non-TE genes
For the non-TE genes, we merged four fire ant gene sets:
1. NCBI release 100, which included 22,066 nuclear and 13 mitochondrial genes. For the nuclear genes, we selected only the longest isoforms, for a total of 15,058 genes.
2. Fire ant official gene set v2 (n = 16,569) (Wurm et al., 2011).
3. Fire ant expressed sequenced tag gene set (n = 12,859) (Wang et al., 2006).
4. Odorant-binding proteins (OBP) and chemosensory protein (CSP) from (González et al., 2009; Gotzek, Robertson, Wurm, &
Shoemaker, 2011; Xu et al., 2009) (n = 31).
Redundancies were removed among these 4 dataset (blastn, percent identity ≥95% and bitscore ≥200), for a total of 25,415 fire ant genes.
2.1.5. Collection of TEs and other fire ant genes (nr-sinv-ref) to detect copy number variation (CNV)
Last, we merged the TE gene set with the non-TE gene set, with a final removal of redundancies, to obtain a “non-redundant” list of 27,279 fire ant TEs and non-TE genes (hereafter nr-sinv-ref).
2.2. Copy number variation (CNV) analysis
2.2.1. SB versus Sb whole genome sequencing reads
To detect genes with copy number variation, we compared whole genome sequencing reads from eight pairs of sons (SB or Sb, ant males are
haploid) from unrelated SB/Sb queens. Low coverage sequence for seven pairs of sons were generated using 101 bp paired-end Illumina HiSeq 2000 reads (Wang et al., 2013). One more pair of SB (Wurm et al., 2011) and Sb (Wang et al., 2013) brothers with much greater sequencing read depth was included. These two were sequenced using multiple size libraries; we only used the 50 bp Illumina paired-end reads (Table 2.4).
Although the exact genetic location of each CNV gene would be unknown, by comparing eight pairs of brothers, we expected to preferentially detect CNV genes on the supergene.
2.2.2. DNA read alignment to nr-sinv-ref
Each of the 16 datasets (eight pairs of SB and Sb brothers) was aligned against our nr-sinv-ref using the Burrows-Wheeler Aligner, BWA-MEM-0.7.12 (H. Li & Durbin, 2009), with default settings [bwa mem <nr-sinv-ref.fasta> <DNA-seqR1.fastq> <DNA-seqR2.fastq>].
For each of the 16 datasets, 44.2 to 53.4% of the total raw reads mapped against the nr-sinv-ref (Figure 2.2b). Raw reads mapping against CenSol, the major fire ant centromeric satellite DNA repeat (Huang et al., 2016), made up for 21.8 to 29.8% of the total reads, while all other genes combined for 18.8 to 26.6% (Figure 2.2b). When mapping DNA reads against the transcriptome, reads coming from intergenic regions or unannotated genes will remain unmapped (Ana Conesa et al., 2016), therefore, we considered our mapping percentages acceptable.
Among the 27,279 genes used as the nr-sinv-ref, 26,307 were mapped for all 16 datasets (96.5%), and 27,047 were mapped for at least one dataset (99.2%). 740 genes showed no corresponding reads for at least one dataset, probably due to the lower coverage of some of the DNA-seq libraries used. The other 232 genes were unmapped in all 16 datasets,
some of which might be due to the presence of S. invicta viral genes (Wang et al., 2006) or other contaminants in the datasets used to build the reference.
2.2.3. Copy number variation (CNV) analysis
To determine the CNV for each gene between SB and Sb brothers, DNA read depths obtained from the BWA alignment were compared using CNV-seq (Xie & Tammi, 2009).
DNA reads aligned with BWA software were first converted into hits files with SAMtools (H. Li et al., 2009) [samtools view -b –S align.sam >
align.bam; samtools view -F 4 align.bam | perl -lane 'print
"$F[2]\t$F[3]"' > align.hits] and the compared with CNV-seq [perl cnv-seq.pl --test <Sb.hits> –ref <SB.hits> --genome-size 39534189 --log2-threshold 0.6 --p-value 0.001 --global-normalization --window-size=350;
where genome-size was 39,534,189 bp, corresponding to the total size of the genes used as a reference]. CNV-seq will choose a minimum window size for each datasets. However, we decided to use the same window size for all eight comparisons. We first asked CNV-seq to choose the window size for each comparison, and then we selected the one with lowest resolution (350 bp) for all eight datasets.
For each overlapping window, we calculated “log2-fold(+1)” values normalized for different read depths using the formula:
!"#2! !"#$(+1) = !"#! !"!!"#$!!"#$% + 1
!"!#$!!"!!"#$%
!"!!"#$!!"#$% + 1
!"!#$!!"!!"#$%
which was modified from (Xie & Tammi, 2009) by the addition of 1 to the SB and Sb read counts. This adjustment retains windows with zero reads, which are discarded by CNV-seq, as they may represent important genes that are deleted in either one of the two genotypes.
among both biological replicates and overlapping windows) and P ≤0.05 in 75% of biological replicates for at least 25% of the gene length (binomial test with false discovery rate using the Benjamini-Hochberg procedure) were selected as putative duplicated genes in the Sb or SB genome, respectively (i.e., CNV genes). The same analysis was repeated by using a log2-fold threshold of ≥0.6 or ≤-0.6 (~1.51 fold). This relaxed threshold likely includes more false positives, however, it may include modest expansions or contractions of gene families and partially duplicated genes.
Statistical analysis were performed in R (Ihaka & Gentleman, 1996).
2.3. Duplicated gene annotation and gene function over-representation analysis
We manually verified the annotation of the CNV genes using multiple methods: 1) RepeatMasker analysis with RepBase v21.01 (Kohany, Gentles, Hankus, & Jurka, 2006), where each gene was assigned to a TE class/family according to the best score; 2) Blast2GO search against the NCBI nr database (A Conesa et al., 2005), where the first best annotated hit was retained; and 3) blastx search against Uniref50 proteins (downloaded on 2017/05/17 (Suzek et al., 2015)), where the first best hit which was not an unknown protein was retained. In case of disagreement between methods, the annotation of CNV genes was manually curated by selecting the result from the method with the best scores and longest alignment length. Over-representation of gene function among CNV genes with respect to the nr-sinv-ref was tested only for gene categories with ≥3 genes, using the Fisher’s exact test with Bonferroni correction.
The number of genes for each category in our nr-sinv-ref was estimated
through RepeatMasker for TEs, and Uniref50 for non-TE genes, separately. Since the annotation of our CNV gene list was manually curated, thus creating a possible bias between the nr-sinv-ref and the CNV genes, we conducted statistical analyses with both a non-curated list of CNV genes and our final curated list.
2.4. DNA real-time quantitative PCR (qPCR) and digital droplet PCR (ddPCR) to detect copy number
2.4.1. Ant Sampling
To obtain ant samples, polygyne and monogyne colonies of S. invicta were collected from Taoyuan, Taiwan. The social form of the colonies was assessed by observing the number of queens in the colony and subsequently confirmed by a Gp-9 PCR-RFLP assay (M. Krieger & Ross, 2002) of 10-15 workers per colony. DNA was extracted from a pool of 3-6 males from the same colony with the same genotype using the DNeasy Blood & Tissue kit (Qiagen, Germany). Since polygyne colonies include both SB and Sb males, each individual male genotype was determined by a Gp-9 PCR-RFLP assay (M. Krieger & Ross, 2002).
We compared three haploid male types: SB from monogyne colonies (Mono SB), SB from polygyne colonies (Poly SB), and Sb from polygyne colonies (Poly Sb), each with three separate colonies (biological replicates).
2.4.2. DNA real-time quantitative PCR (qPCR): goals, primer design, and protocol
We conducted DNA qPCR to: 1) validate the bioinformatics results for CNV in TEs, and 2) examine CNV in TEs with known differential
(11 TEs and 1 non-TE gene, Tdpoz) and a non-CNV TE as a negative control. We also performed DNA qPCR to assess whether 13 TEs (different from above) with known DE between SB and Sb-carrying individuals (Nipitwattanaphon et al., 2013; Wang et al., 2008) showed corresponding copy number differences.
Primers were designed using Primer ExpressTM v3.0 (Thermo Fisher), targeting a region within the coding sequence (Tables 2.5 and 2.6). The coding sequences were determined using NCBI ORF Finder (Open Reading Frame Finder)
(https://www.ncbi.nlm.nih.gov/orffinder/) and RepBase CENSOR (http://www.girinst.org/censor/) (Kohany, Gentles, Hankus, & Jurka, 2006). Two single copy genes in both the SB and Sb genotypes were used as the internal controls: 40S ribosomal protein S9 and elongation factor 1-alpha. Ribosomal proteins and elongation factors are housekeeping genes often used as reference for gene expression qPCR (Lu et al., 2013; Sun, Lu, Tang, & Du, 2015). We verified that these genes are single copy in the fire ant by a blastn search against both SB and Sb assembled genomes.
Each primer pair was tested for qPCR efficiency using five four-fold dilutions of DNA. Only primer pairs with efficiency >90% and <110%
were retained, and primer efficiencies were adjusted using the formula:
2!!!"= !"#$%&!!!!!"#$%!
!"#$%&#!!!"#$%&'$( , where ! = 2!!! and s=slope of the plot, x-axis=Ct, and y-axis=log2(DNA concentration/DNA initial concentration), which was modified from (Pfaffl, 2001).
All the qPCR reactions were carried out by mixing 3 µl of DNA, 10 µl of Power SYBR Green PCR master mix (Applied Biosystem), 200 – 300 nM of each primer, and water to a final volume of 20 µl. For each primer pair reaction, three four-fold dilutions of DNA were used, each with three technical replicates. All qPCR reactions were performed on an
ABI-7300 Real-Time PCR System (Applied Biosystem) and the raw results were generated by the 7300 System SDS Software v 1.4 (Applied Biosystem). Copy number variation for each gene was calculated as 2-ΔΔCt normalized to the Mono SB genotype. Statistical analysis (ANOVA and post-hoc Tukey test) were performed in R (Ihaka & Gentleman, 1996).
2.4.3. DNA digital droplet PCR (ddPCR): goals, primer design, and protocol
We conducted droplet digital PCR (ddPCR) (Biorad Q200 Droplet Digital PCR System) to validate five additional CNV genes (1 TE and 4 non-TE genes). ddPCR is a newer technique that can more precisely detect smaller fold differences than qPCR (e.g., 2-fold or less).
Primers and TaqMan probes were designed by Integrated DNA Technologies (Table 2.7). The sequences of the two single copy genes (40S ribosomal protein S9 and elongation factor 1-alpha), already used as internal control for qPCR, were utilized for ddPCR primer and probe design. We used blastn to compare the designed primers and probes against the PacBio assembled SB and Sb genomes to ensure that at least one complete amplicon was found in these genomes. The efficiency of primer and probe sets were tested by gradient ddPCR (QX200, BioRad) by the Genomic Technology Core- Sequencing and Microarray Division of the Institute of Plant and Microbial Biology at Academia Sinica, using heterozygous virgin queen genomic DNA as template following the QX200 manufacturer’s protocol. ddPCR results were analyzed by QuantaSoft Analysis Pro. After obtaining the optimized annealing temperature for each primer/probe sets, multiplex PCR reactions were designed by reducing the primer/probe concentration of the genes with lower fluorescence signal in half. The optimized multiplex conditions
Copy number variation for each gene was shown as estimated actual number of genomic copies. Statistical analysis (ANOVA and post-hoc Tukey test) were performed in R (Ihaka & Gentleman, 1996).
2.5. Gene expression of CNV genes
2.5.1. Differential expression (DE) analysis of CNV genes
We compared the log2 ratio of Sb:SB copy number of CNV genes with their DE between SB/Sb and SB/SB virgin queens.
To assess DE in SB/Sb virgin queens, we generated four pairs of SB/Sb and SB/SB virgin queen RNA-seq libraries (Table 2.8). We first verified that all libraries were of adequate quality with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). Next, raw reads were quality trimmed and contaminating adaptors were removed using cutadapt (Martin, 2011):
cutadapt -q 20 --trim-n -m 50 –O 5 -b <adapters> -B <adapters> -o outputR1.fastq -p outputR2.fastq inputR1.fastq inputR2.fastq
To calculate gene expression within each RNA-seq dataset we used RSEM (B. Li & Dewey, 2011):
rsem-calculate-expression –paired-end reads_R1.fastq reads_R2.fastq
<reference gene list.fasta> output
The gene expression was calculated as log2(FPKM+1). We added 1 to the FPKM value obtained with RSEM to avoid discarding genes with
The gene expression was calculated as log2(FPKM+1). We added 1 to the FPKM value obtained with RSEM to avoid discarding genes with