CHAPTER 2: Materials and Methods
2.1. Collection of transposable elements (TEs) and other fire ant
2.1.1. Motivations
To detect TEs and non-TE genes with copy number difference between SB and Sb-carrying individuals, we first built a comprehensive, non-redundant gene list of 27,279 S. invicta genes (hereafter, nr-sinv-ref) by combining three TE and four fire ant gene datasets available at the start of the analysis (Figure 2.1).
We used this nr-sinv-ref, instead of the draft genome, to avoid some potential problems. First, our experimental design (comparing genome sequencing from SB and Sb haploid brothers, see below), targeted the supergene, which is poorly assembled due to repeats and re-arrangements.
Some TEs and non-TE genes, especially those that are Sb-specific (e.g., piggyBac transposon (Wang et al., 2008)) are not represented in the SB draft genome and thus would be missed in the analysis (Wang et al., 2013). Second, repeats in the genome can cause dilution of read depth and therefore potential underestimation of the number of duplicated genes for our following CNV analyses. Third, a non-redundant list permits a better estimate of which types of TEs are preferentially duplicated. A downside of this strategy is that it may miss calling CNV for some genes, presumably edge cases, such as those with many small exons.
2.1.2. Trinity de novo assembly of 13 RNA-seq libraries for the collection of fire ant TEs
To obtain as many fire ant TEs as possible, we decided to include expressed TEs found in a Trinity de novo assembly of 13 RNA-seq datasets available at the start of the experiment. These RNA-seq datasets included germline tissues, where “active” TEs should be expressed, as well as different somatic tissues because many experiments have revealed TE expression in non-germline tissue (Colborn, Byrd, Koita, & Krogstad, 2008).
We performed assembly of 13 RNA-seq libraries (Table 2.1) using Trinity-v2.2.0 (Haas et al., 2013). Prior to assembly we applied standard quality control filters. Each library quality was assessed by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). The raw reads were quality trimmed and contaminating adaptors were removed using cutadapt (Martin, 2011) [paired-end: cutadapt -q 20 --trim-n -m 50 -b <adapters> -B <adapters> -o outputR1.fastq -p outputR2.fastq inputR1.fastq inputR2.fastq; single-end: cutadapt -q 28 --trim-n -b
<adapters> input.fastq > output.fastq]. Reads were then trimmed of the first 15 bp at the 5’ end [paired-end: cutadapt –u 15 –U 15 –m 50 –o outputR1.fastq –p output R2.fastq inputR1.fastq inputR2.fastq; single-end:
cutadapt –u 15 –trim-n –m 50 input.fastq > output.fastq] due to bias in sequence composition that may lead to transcript artifacts in the de novo assembly.
De novo assembly was performed using Trinity-v2.2.0 (Haas et al., 2013) in paired-end mode [Trinity --seqType fq --left left.fastq --right right.fastq --min_contig_length 305 --min_kmer_cov 2]. Left reads from all seven paired-end libraries and reads from the six single-end libraries were combined as left reads, while only right reads from paired-end libraries were combined as right reads.
The basic statistics of the assembly results are in Table 2.2.
To assess the assembly quality, we conducted two analyses. First, the raw reads (Table 2.1) were mapped back to the de novo assembled fire ant transcriptome using bowtie-2.2.1 (Langmead & Salzberg, 2012) [paired-end: bowtie2 -local --no-unal –x Trinity_bowtieindex.fasta -p 10 -q -1 readsR1.fastq -2 readsR2.fastq -S output.sam; single-end: bowtie2 -local --no-unal –x Trinity_bowtieindex.fasta -p 10 -q -U reads.fastq -S output.sam]. Paired-end reads were mapped separately from single-end reads. Of a total of 191 million paired-end reads, 98.1% mapped back to the assembled transcriptome, and of these 97.1% as proper pairs while the remaining 2.9% mapped as singlets. For the single-end reads 82.8% of the 195 million single reads could be mapped (Table 2.3).
Second, to assess the completeness of the assembled transcriptome, we searched for the presence of 4,415 hymenoptera BUSCO (Benchmarking Universal Single-Copy Orthologs) genes (“hymenoptera_odb9 set”
downloaded on 2017/02/05. BUSCO genes are highly-conserved and expected to be present in divergent lineages (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). We used the software BUSCO_v2.0 (http://busco.ezlab.org/) to search for the BUSCO genes in our database [BUSCO.py -o <BUSCO genes on Trinity assembly output>
-i <Trinity assembly.fasta> -l hymenoptera_odb9 -m tran]. Among the 4,415 hymenoptera BUSCO genes, 4,200 Complete BUSCOs (95.2%) were present in our assembled transcriptome, while 124 (2.8%) were fragmented and 91 (2.0%) were missing. Among the complete BUSCO genes, 1,998 were single-copy and were used for following analysis.
Some of the missing genes may be due to their lack of expression in the tissues or life stages sampled, and conserved gene can be lost in some lineages. Our de novo assembled transcriptome identified a higher number of hymenoptera BUSCO genes compared to the fire ant OGS
(Wurm et al., 2011) and the NCBI-Sinv100 (93.8% and 72.7% BUSCO genes found, respectively).
Our Trinity assembly produced 91,409 transcript isoforms, corresponding to 62,189 genes. This gene number is an over-estimate because the genomes for several insects typically have between ~10,000 to ~20,000 genes (Simola et al., 2015). Trinity assemblies often exceed the predicted gene number by 3-4 fold due to redundant or incorrectly assembled transcripts. To avoid spurious transcripts, Trinity assemblies are often filtered for expression or presence of ORFs (Albertin et al., 2015; Chan et al., 2016; Singh et al., 2017; Theissinger et al., 2016), especially when a high number of reads are used (Ana Conesa et al., 2016). In our analysis, we avoided filtering transcripts to permit the identification of lowly expressed and partial TEs.
A total of 6,853 were identified among the transcripts ad putative TEs (tblastx search (Altschul, Gish, Miller, Myers, & Lipman, 1990) against the invertebrate repetitive elements from RepBase v21.01 (Bao, Kojima,
& Kohany, 2015) (E-value <1e-20). As the transcript include multiple isoform, we selected only the longest isoform, for a total of 4,128 TE transcripts. Among these, redundancies were filtered with CD-HIT (Fu, Niu, Zhu, Wu, & Li, 2012; W. Li & Godzik, 2006) (sequence identity 95%)), for a total of 3,665 non-redundant putative TEs that were used to augment the gene list for CNV detection (nr-sinv-ref).
2.1.3. Collection of fire ant TEs
The 3,665 TEs found using de novo assembly of fire ant RNA-seq databases, we combined with previous TE gene sets (247 S. invicta RepBase TEs (Bao et al., 2015) and 16 differentially expressed TEs (Nipitwattanaphon, Wang, Dijkstra, & Keller, 2013)). Redundancies among datasets were identified through a blastn search (percent identity
≥95% and bitscore ≥200) and discarded. We obtained a final list of 3,678 TEs.
2.1.4. Collection of fire ant non-TE genes
For the non-TE genes, we merged four fire ant gene sets:
1. NCBI release 100, which included 22,066 nuclear and 13 mitochondrial genes. For the nuclear genes, we selected only the longest isoforms, for a total of 15,058 genes.
2. Fire ant official gene set v2 (n = 16,569) (Wurm et al., 2011).
3. Fire ant expressed sequenced tag gene set (n = 12,859) (Wang et al., 2006).
4. Odorant-binding proteins (OBP) and chemosensory protein (CSP) from (González et al., 2009; Gotzek, Robertson, Wurm, &
Shoemaker, 2011; Xu et al., 2009) (n = 31).
Redundancies were removed among these 4 dataset (blastn, percent identity ≥95% and bitscore ≥200), for a total of 25,415 fire ant genes.
2.1.5. Collection of TEs and other fire ant genes (nr-sinv-ref) to detect copy number variation (CNV)
Last, we merged the TE gene set with the non-TE gene set, with a final removal of redundancies, to obtain a “non-redundant” list of 27,279 fire ant TEs and non-TE genes (hereafter nr-sinv-ref).