紅火蟻的社會超級基因有許多基因和跳躍子上的拷貝數變異

全文

(1)Department of Life Science, College of Science Taiwan International Graduate Program on Biodiversity, Academia Sinica. National Taiwan Normal University Doctoral Dissertation. The fire ant social supergene is characterized by extensive gene and transposable element copy number variation. FONTANA, Silvia. Advisor: WANG, John, Ph.D. 109. 2. February 2020.

(2) Acknowledgments First and foremost, it is my great honor, and pleasure, to thank my supervisor Dr. John Wang, for many things that will be superfluous and difficult to list here. I want to say thank you and sorry to my labmates. Thank you for helping with experiments and lab work in general, and sorry for being totally caught up in myself for many years, I should have put my head up from the table and actually see the people. Thanks to the three anonymous reviewers of my paper for the good insights, but also for making my life just a little gloomier for almost a year. I learned science, but mostly patience. Thank you to the members of my thesis committee, for reading my work through the years and being nice most of the time. Thank you Dr. David Chou for getting on the wagon without much prior notice. I wish to thank the Biodiversity Research Center, TIGP program, and NTNU for the financial support, and the many people dealing patiently with me. Thank you to the friends who were with me in the process: Feby W. for finding that reality is sad and funny at the same time, Juani W. for the bravery to stay and leave and stay still, Nia S. for the slow peace of life, the here and now. Thank you Rex C. for singing to me at night. Saborni C. for preventing total alienation. To Shashank K. for showing me the true meaning of Christmas. To Laura M. to alleviate my only-child syndrome. Francesca F. and Federica F. for accompanying me to treasure hunting. To Alvin E.L., Josh E., Myrron C.A. and all the friends who have been there from the beginning of this little adventure. Thank you to my therapist for letting me see again what’s important to me and remember my value as a person, these things are easy to lose in Academia. Thank you to my previous teachers: Francesca B. for believing in me during a time in my life long time ago. That really made a difference. Thank to Allen C. for being my teacher and for the love of Earth. Of course, thank you to my parents for providing valuable genetic materials, and for a couple more stuff. Last but not least, I would like to dedicate this work to the one closest to my heart: Puffetto. You are with me always. !. i.

(3) Abstract. The fire ant Solenopsis invicta shows two distinct social forms (monogyne and polygyne colonies, with one and multiple queens, respectively) under genetic control of a ~12.5 to 20 Mb supergene. The supergene includes ~600 genes linked together by multiple inversions, which prevents recombination between the two variants, ‘Social B’ (SB) and ‘Social b’ (Sb). In monogyne colonies all individuals carry only the SB allele, while in polygyne colonies, some individuals carry the Sb allele (all queens are heterozygotes SB/Sb, workers can be either SB/SB or SB/Sb and haploid males SB or Sb). In this study we characterized genes with copy number variation between SB and Sb-carrying individuals. We showed extensive acquisition of gene duplicates in the Sb genome, with some likely involved in polygyne-related phenotypes. We found 260 genes with copy number differences between SB and Sb, of which 239 have greater copy number in Sb. We observed transposable element (TE) accumulation on Sb, likely due to the accumulation of repetitive elements on the non-recombining chromosome. We found a weak correlation between TE copy number and differential expression, suggesting some TEs may still be proliferating in Sb while many of the duplicated TEs have already been silenced. Characterization of the TEs revealed showed the BEL family (LTR retrotransposons) was disproportionately duplicated in the supergene. While the presence of TEs in supergenes is well documented, little is known about duplication of non-TE genes and their possible adaptive role. Among the 115 non-TE genes with higher copy in Sb, we found genes. ii.

(4) with putative polygyne related functions, such as juvenile hormone synthesis, odorant reception, and innate immunity. Interestingly, enzymes responsible for cuticular hydrocarbon synthesis were highly represented. We identified candidate duplications, and conducted detailed examination of a desaturase and an elongase, as potentially responsible for different cuticular hydrocarbons profiles in SB/SB and SB/Sb queens. These genes, which are likely beneficial for polygyne ants and important for queen recognition, seem to have translocated into the supergene from other chromosomes and proliferated by multiple duplication events. Overall, our results suggest that gene duplications may be an important factor leading to monogyne and polygyne ant societies.. iii.

(5) Table of Contents Acknowledgments ...................................................................................... i Abstract ...................................................................................................... ii Table of Contents ...................................................................................... iv List of Tables ........................................................................................... vii List of Figures ......................................................................................... viii CHAPTER 1: Introduction ..................................................................... 1 1.1. The fire ant Solenopsis invicta ................................................. 1 1.2. Supergenes................................................................................ 3 1.3. Transposable elements (TEs) ................................................... 4 1.4. Hypothesis ................................................................................ 5 CHAPTER 2: Materials and Methods ................................................... 8 2.1. Collection of transposable elements (TEs) and other fire ant genes to detect copy number variation (CNV) ................................ 8 2.1.1. Motivations ................................................................. 8 2.1.2. Trinity de novo assembly of 13 RNA-seq libraries for the collection of fire ant TEs ................................................. 8 2.1.3. Collection of fire ant TEs .......................................... 11 2.1.4. Collection of fire ant non-TE genes .......................... 12 2.1.5. Collection of TEs and other fire ant genes (nr-sinv-ref) to detect copy number variation (CNV) .............................. 12 2.2. Copy number variation (CNV) analysis ................................. 12 2.2.1. SB versus Sb whole genome sequencing reads......... 12 2.2.2. DNA read alignment to nr-sinv-ref ........................... 13 iv.

(6) 2.2.3. Copy number variation (CNV) analysis .................... 14 2.3. Duplicated gene annotation and gene function over-representation analysis ........................................................ 15 2.4. DNA real-time quantitative PCR (qPCR) and digital droplet PCR (ddPCR) to detect copy number............................................ 16 2.4.1. Ant Sampling ............................................................ 16 2.4.2. DNA real-time quantitative PCR (qPCR): goals, primer design, and protocol................................................. 16 2.4.3. DNA digital droplet PCR (ddPCR): goals, primer design, and protocol ............................................................ 18 2.5. Gene expression of CNV genes.............................................. 19 2.5.1. Differential expression (DE) analysis of CNV genes 19 2.5.2. Absolute expression analysis of CNV genes ............ 21 2.6. Genomic analysis of CNV genes............................................ 21 2.6.1. Assembly and linkage group construction of SB and Sb genomes ......................................................................... 21 2.6.2. Genomic analysis of CNV genes .............................. 22 2.6.3. Placement of SB contigs 000611F and 000344F onto chr 5 and chr 3, respectively ............................................... 22 2.7. Analysis of desaturase and elongase copies ........................... 23 CHAPTER 3: Results ............................................................................ 25 3.1. CNV analysis: duplicated TE and non-TE gene discovery ........... 25 3.2. Validation of CNV results with quantitative real-time PCR (qPCR) and droplet digital PCR (ddPCR) ......................................................... 26 3.3. Higher transposon expression is weakly correlated to higher copy number ..................................................................................................... 27. v.

(7) 3.4. Enzymes involved in cuticular hydrocarbon (CHC) synthesis are highly duplicated in Sb and over-expressed in SB/Sb queens ............. 29 3.5. CHC enzyme genes have copies on supergene and non-supergene chromosomes ........................................................................................ 31 3.6. Desaturase and elongase have potentially functional copies on Sb32 CHAPTER 4: Discussion ...................................................................... 34 4.1. TEs accumulation on Sb ................................................................ 34 4.2. Different functions of Non-TE genes with CNV between the Sb and SB fire ant genomes .................................................................... 37 4.3. CHC gene copies may have Sb functions ................................... 41 4.4. The role of TEs in gene duplication .............................................. 44 CHAPTER 5: Conclusions .................................................................... 45 Tables ....................................................................................................... 47 Figures ..................................................................................................... 63 References................................................................................................ 86 Appendices .............................................................................................. 98. vi.

(8) List of Tables !. Table 2.1. Details of the 13 Illumina RNA-seq libraries used for de novo assembly of the S. invicta transcriptome ............................................... 47 Table 2.2. Statistics for the S. invicta transcriptome assembly ............... 48 Table 2.3. Results of re-mapping by bowtie2 of paired- and single-end RNA-seq reads on the assembled transcriptome of S. invicta ................. 49 Table 2.4. Information for the DNA-seq raw reads from 8 pairs of SB and Sb haploid brothers, used to detect copy number variation between the SB and Sb alleles ......................................................................................... 50 Table 2.5. Information and primers for the 13 genes and two internal controls used to validate CNV analysis results by DNA qPCR ............ 51 Table 2.6. Information and primers for TEs with known expression differences between SB/SB and SB/Sb queens, for which we evaluate copy number difference between SB and Sb by DNA qPCR ................ 52 Table 2.7. Information and primers for the five genes and two internal controls used to validate CNV analysis results by DNA ddPCR .......... 53 Table 2.8. Details of the 4 Illumina RNA-seq library pairs used for the gene expression and DE analysis of CNV genes ................................... 54 Table 3.1. Number of total genes per gene category in nr-sinv-ref compared to genes with CNV between the SB and Sb genomes, when using a log2-fold threshold of ≥1 or ≤-1 (2-fold) .................................... 55 Table 3.2. Number of total genes per gene category in nr-sinv-ref compared to genes with CNV between the SB and Sb genomes, when using a log2-fold threshold of ≥0.6 or ≤-0.6 (~1.51 fold) ....................... 56 Table 3.3. Number of total TEs per TE class/family in our nr-sinv-ref compared to TEs with CNV between the SB and Sb genomes, when using a log2-fold threshold of ≥1 or ≤-1 (2-fold) .............................................. 57 Table 3.4. Number of total TEs per TE class/family in our nr-sinv-ref compared to TEs with CNV between the SB and Sb genomes, when using a log2-fold threshold of ≥0.6 or ≤-0.6 (~1.51 fold) ................................. 60. vii.

(9) List of Figures! !. Figure 1.1. Depiction of the two social forms of fire ants (a) and the social chromosome (b) ............................................................................. 63 Figure 2.1. Flowchart of the strategy to obtain a collection of fire ant TEs and non-TE genes for CNV analysis ....................................................... 64 Figure 2.2. Percentage of raw reads mapping against the custom list of fire ant TEs and non-TE genes (nr-sinv-ref) for each DNA-seq library.. 65 Figure 3.1. Proportion of genes with copy number variation in SB or Sb, identified as transposable elements (TEs) or non-TE genes .................... 66 Figure 3.2. Results of DNA qPCR (a) and ddPCR (b) to verify copy number variation of the in silico CNV analysis....................................... 67 Figure 3.3. Results of DNA qPCR to detect copy number differences in 13 TEs with known DE between SB/SB and SB/Sb females ................. 68 Figure 3.4. Comparison of TE copy number ratio (Sb:SB) vs DE between SB/Sb and SB/SB virgin queens (a), absolute gene expression in SB/Sb virgin queens (b), and absolute gene expression in SB/Sb mature queens (c) ............................................................................................................. 69 Figure 3.5. Chemical pathway for insect cuticular hydrocarbon synthesis71 Figure 3.6. Comparison of copy number ratio (Sb:SB) vs DE between SB/Sb and SB/SB virgin queens (a), gene expression in SB/Sb virgin queens (b), and gene expression in SB/Sb mature queens (c) for 27 genes with greater copy number in Sb encoding for enzymes putatively involved in the synthesis of CHCs.......................................................................... 72 Figure 3.7. Genomic location of the 17 CHC genes with greater copy number in Sb and higher expression in SB/Sb relative to SB/SB virgin queens; and their most frequently flanking TEs ...................................... 74 Figure 3.8. Copies of the CNV desaturase (Acyl-CoA desaturase Δ11) gene in SB and Sb .................................................................................... 81 Figure 3.9. IGV visualization of reads mapping to the coding region of two copies of the CNV desaturase (Acyl-CoA desaturase Δ11): Sinv_desatA1_a and Sinv_desatA1_b ...................................................... 82. viii.

(10) Figure 3.10. Copies of the CNV elongase (elongation of very long chain fatty acids) gene in SB and Sb.................................................................. 83 Figure 3.11. IGV visualization of reads mapping to the coding region of two copies of the CNV elongase (elongation of very long chain fatty acids): Sinv_eloA and Sinv_eloB ............................................................. 84 Figure 3.12. RNA expression of the coding region of the two copies of the desaturase (Sinv_desatA1_a and Sinv_desatA1_b) and the elongase (Sinv_eloA and Sinv_eloB) genes ............................................................ 85!. ix.

(11) CHAPTER 1 Introduction. 1.1. The fire ant Solenopsis invicta The fire ant Solenopsis invicta is an invasive ant species, with a negative impact on agriculture and ecosystems worldwide, including Taiwan. The S. invicta native range is in Southern America, but it was inadvertently introduced to Southern USA during the last century, leading to a series of secondary introductions, including Taiwan, where it was discovered in 2003 (Ascunce et al., 2011; Chen, Shen, & Lee, 2006; Yang, Shoemaker, Wu, & Shih, 2007). S. invicta, as all ant species, is “eusocial”. Eusociality is a high level of social organization; eusocial species have mainly sterile working castes, cooperative brood care, and overlapping generations (Bert Hölldobler & Wilson, 1990). Ant colonies have fertile queen caste and worker caste with reduced fertility or completely infertile. The number of queen in each colony varies between species, and in some case, such in S. invicta, within the same species. Colonies with single queens are called monogynous, while multiple queens colonies are polygynous (B Hölldobler & Wilson, 1997). The two social forms of S. invicta are associated with a single Mendelian locus, Gp-9, encoding for a putative odor binding protein (Laurent Keller & Ross, 1998; K G Ross & Keller, 1998). More recently, the Gp-9 locus was shown to be part of a ~12.5 to 20 Mb supergene, called “Social B” (SB) and “Social b” (Sb). The fire ant supergene occupies ~55% of the fire ant chromosome 16 and contains ~600 genes, and show reduced recombination rate, similar to the XY system, caused by multiple 1.

(12) inversions between two variants (Huang, Dang, Chang, & Wang, 2018; Huang & Wang, 2013; Pracana, Priyam, Levantis, & Nichols, 2017; Wang et al., 2013). In homozygous SB/SB individuals recombination occurs normally, while in heterozygous SB/Sb individuals recombination is suppressed in the supergene. Recombination is unlikely in Sb/Sb individuals because most Sb/Sb females are inviable (Hallar, Krieger, & Ross, 2006). All ants in monogyne colonies are homozygous SB/SB (females diploid: queens and workers) or haploid SB (males). Only in polygyne colonies the Sb allele is found: all queens are heterozygotes (SB/Sb); this happened because SB/Sb queens trying to join a polygyne colony are recognized based to cuticular odor and can be accepted into the colony, while SB/SB are killed by the workers (L Keller & Ross, 1993; Laurent Keller & Ross, 1998; K G Ross & Keller, 1998). The workers (females, diploid) in a polygyne colony are either SB/SB or SB/Sb, and males (haploid) are either SB or Sb (Figure 1.1). Individuals in polygyne colonies show a series of phenotypes, shared with many other polygynous ants, comprehensively called “the polygyne syndrome” (Bourke & Franks, 1995; Keller, 1995). Polygyne SB/Sb reproductive queens have lower fat abdomen accumulation, lower fecundity, and lower dispersal ability than monogyne SB/SB queen, the latter leading to faster invasion of monogyne colonies into unoccupied habitats (DeHeer, 2002; DeHeer, Goodisman, & Ross, 1999; Huang & Wang, 2013; Keller & Passera, 1989; Keller & Ross, 1993b; 1995; Tschinkel, 2006). In polygyne colonies, workers are generally less aggressive and have smaller average size (Ross & Keller, 1995), and males have lower sperm count and lower weight than those from monogyne colonies (Lawson, Vander Meer, & Shoemaker, 2012). Sb is a. 2.

(13) recessive lethal: Sb/Sb homozygous queens do not survive to reproduce, and Sb/Sb homozygous workers are virtually absent in the colonies (Hallar et al., 2006). Such a variety of phenotypes is consistent with the presence of a non-recombining supergene, where multiple genes are in strong linkage disequilibrium and slightly deleterious mutations can accumulate. Characterization of the fire ant supergene revealed that many of the genes on the Sb allele have differences in gene expression, protein coding mutations, longer introns, and lower nucleotide diversity compared to SB (Huang & Wang, 2013; R Pracana et al., 2017; Wang, Ross, & Keller, 2008; Wang et al., 2013). There are also more repetitive elements on Sb; 17% of the known fire ants repetitive elements have higher abundance in Sb individuals compared to SB, while only 0.7% have higher abundance in SB (Wang et al., 2013). Recently it has been show that heterogeneous repetitive elements account for an Sb expansion of at least 30% (Stolle, Pracana, Howard, & biology, 2018).. 1.2. Supergenes Reduced recombination between chromosomes predicts the accumulation of deleterious mutations on the non-recombining chromosome, due to mechanisms such as Muller’s ratchet or genetic hitchhiking with a nearby beneficial mutation (Bachtrog, 2013; Charlesworth & Charlesworth, 2000). In organisms with XY (or ZW) sex chromosomes, the nonrecombining Y regions show gene degeneration, are gene-poor, and contain mainly simple repetitive sequences and transposable elements (TEs) (Bergero & Charlesworth, 2009). Accumulation of repetitive elements in non-recombining genomic regions. 3.

(14) has been observed in other organisms beside the fire ant (Bachtrog, 2003; Bartolomé, Maside, & Charlesworth, 2002; Bergero & Charlesworth, 2009; Peichel et al., 2004). For example, RNA transposons are more abundant on the neo-Y chromosome of Drosophila miranda (Bachtrog, 2003) and in genetic regions with lower recombination in D. melanogaster (Bartolomé et al., 2002; Rizzon, Marais, Gouy, & Biémont, 2002). Similarly, the Y chromosomes of the threespine stickleback (Gasterosteus aculeatus) (Peichel et al., 2004) and papaya (Carica papaya) (Liu et al., 2004) both contain more repetitive elements than their homologous regions on the X chromosome. In contrast to the well documented cases of TE accumulation, only a few examples of gene duplication and copy number variation in heteromorphic non-recombining chromosomes have been reported. For instance, multiple copy genes with male-specific functions are found on mammals and Drosophila Y chromosomes (Connallon & Clark, 2010). In birds, a case of a single segmental duplication was reported in the whitethroated sparrow supergene (Davis et al., 2011), and three deletions and two duplications of genes were found in the inverted region linked to different males morph in the ruff (Lamichhaney et al., 2015).. 1.3. Transposable elements (TEs) Transposable elements (TEs) are selfish genetic elements that propagate by copying themselves in the genome. TEs are divided into two classes depending. on. their. main. transposition. mechanism:. Class. I. (retrotransposons) are copy-and-paste genetic elements that use RNA intermediate and transpose via reverse transcription; Class II (DNA transposons) use a cut-and-paste mechanism, where a transposase create. 4.

(15) double or single DNA breaks and are able to move into different parts of the genome (Finnegan, 1989; Sultana, Zamborlini, Cristofari, & Lesage, 2017). Depending on encoded protein types and DNA sequences, and other specific features, TEs are further divided in subclasses and families. Class I elements include LTR (long terminal repeat) TEs, which encode for reverse transcriptase, intergrase, and other virus-like particles. Among non-LTR class I elements, LINE (long interspersed nuclear element) TEs are widespread in all eukaryotes, they shows ORFs with endonuclease and reverse transcriptase activities (Wicker et al., 2007). Their relative abundance is highly variable in different organisms. TE insertions may influence host genomes by increasing genome size, inducing genome rearrangement such as inversion by recombination among homologous copies, or by mutagenic insertion into host genes, altering gene expression. Some studies show TE insertions caused advantage by affecting gene expression (Feschotte & Pritham, 2006; Girard & Freeling, 1999; Kidwell & Lisch, 2001). However, random insertions are often neutral or mildly deleterious for the organisms, while highly deleterious or sub-lethal insertions may be subjected to purifying selection. Other molecular mechanisms can prevent TE proliferations. For example in Drosophila melanogaster, hybrid sterility arises if a strain carrying a specific TE is mated with a strain lacking if (hybrid dysgenesis) (Kidwell, Kidwell, & Sved, 1977). TE repressor alleles in the genome can silence TEs through small non-coding RNAs intermediate, such as piRNAs in the germline and endo-siRNAs in somatic cells, repressing TE through posttranscriptional gene silencing (Brennecke et al., 2007).. 1.4. Hypothesis 5.

(16) We hypothesized that gene duplication and TE accumulation in the nonrecombining Sb supergene may underlie some of the different phenotypes associated with a polygyne colony. Duplicated genes may affect queen discrimination by workers, leading to selective mortality of SB/SB queens in polygyne colonies. For example, desaturases gene families in insects seem to frequently evolve by gene duplication and have been proposed to have a role in the production of cuticular hydrocarbons (CHCs) involved in the recognition of different species or castes (Helmkampf, Cash, & Gadau, 2015). Some duplicated gene could also be involved in one or more of the “polygyne syndrome” phenotypes. Similarly, TEs may affect Sb associated phenotypes, such as by disrupting important genes or by providing cis-regulatory elements (E B Chuong, N C Elde, & C Feschotte, 2016; Edward B Chuong, Nels C Elde, & Cédric Feschotte, 2016; Ellison & Bachtrog, 2013; Lynch et al., 2015). To understand gene duplication in the supergene, we first characterized the genetic elements that have different copy number between the SB and Sb-carrying individuals in S. invicta, by conducting a bioinformatics analysis of copy number variation, validated a subset by DNA quantitative PCR (qPCR) digital droplet PCR (ddPCR). We then conducted over-representation analysis for the genes with copy number differences between SB and Sb genotypes. From this analysis, we confirmed a general trend of TE accumulation on the Sb allele, and we found many non-TE fire ant genes with greater copy number on Sb. Interestingly, many these non-TE genes encode enzymes implicated in CHC synthesis, which may be involved in differential queen odors. Furthermore, we found that the majority of the CHC enzyme genes with greater copy number in Sb are over-expressed in SB/Sb respect to SB/SB. 6.

(17) virgin queens, and generally highly expressed in SB/Sb mature queens. Among these genes, we selected two candidate genes possibly involved in differential odor production of SB/SB and SB/Sb queens. In addition, we compared the expression of TEs with their copy number differences in SB and Sb, and found a weak correlation between TE proliferation and increased expression levels. Mapping of select genes and TEs with copy number differences on the SB and Sb genomes showed that these genes were probably translocated from other chromosomes, and underwent multiple duplication and rearrangement events.. 7.

(18) CHAPTER 2 Materials and Methods. 2.1. Collection of transposable elements (TEs) and other fire ant genes to detect copy number variation (CNV) 2.1.1. Motivations To detect TEs and non-TE genes with copy number difference between SB and Sb-carrying individuals, we first built a comprehensive, nonredundant gene list of 27,279 S. invicta genes (hereafter, nr-sinv-ref) by combining three TE and four fire ant gene datasets available at the start of the analysis (Figure 2.1). We used this nr-sinv-ref, instead of the draft genome, to avoid some potential problems. First, our experimental design (comparing genome sequencing from SB and Sb haploid brothers, see below), targeted the supergene, which is poorly assembled due to repeats and re-arrangements. Some TEs and non-TE genes, especially those that are Sb-specific (e.g., piggyBac transposon (Wang et al., 2008)) are not represented in the SB draft genome and thus would be missed in the analysis (Wang et al., 2013). Second, repeats in the genome can cause dilution of read depth and therefore potential underestimation of the number of duplicated genes for our following CNV analyses. Third, a non-redundant list permits a better estimate of which types of TEs are preferentially duplicated. A downside of this strategy is that it may miss calling CNV for some genes, presumably edge cases, such as those with many small exons. 2.1.2. Trinity de novo assembly of 13 RNA-seq libraries for the collection of fire ant TEs. 8.

(19) To obtain as many fire ant TEs as possible, we decided to include expressed TEs found in a Trinity de novo assembly of 13 RNA-seq datasets available at the start of the experiment. These RNA-seq datasets included germline tissues, where “active” TEs should be expressed, as well as different somatic tissues because many experiments have revealed TE expression in non-germline tissue (Colborn, Byrd, Koita, & Krogstad, 2008). We performed assembly of 13 RNA-seq libraries (Table 2.1) using Trinity-v2.2.0 (Haas et al., 2013). Prior to assembly we applied standard quality control filters. Each library quality was assessed by FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc).. The. raw. reads were quality trimmed and contaminating adaptors were removed using cutadapt (Martin, 2011) [paired-end: cutadapt -q 20 --trim-n -m 50 -b <adapters> -B <adapters> -o outputR1.fastq -p outputR2.fastq inputR1.fastq inputR2.fastq; single-end: cutadapt -q 28 --trim-n -b <adapters> input.fastq > output.fastq]. Reads were then trimmed of the first 15 bp at the 5’ end [paired-end: cutadapt –u 15 –U 15 –m 50 –o outputR1.fastq –p output R2.fastq inputR1.fastq inputR2.fastq; single-end: cutadapt –u 15 –trim-n –m 50 input.fastq > output.fastq] due to bias in sequence composition that may lead to transcript artifacts in the de novo assembly. De novo assembly was performed using Trinity-v2.2.0 (Haas et al., 2013) in paired-end mode [Trinity --seqType fq --left left.fastq --right right.fastq --min_contig_length 305 --min_kmer_cov 2]. Left reads from all seven paired-end libraries and reads from the six single-end libraries were combined as left reads, while only right reads from paired-end libraries were combined as right reads. The basic statistics of the assembly results are in Table 2.2.. 9.

(20) To assess the assembly quality, we conducted two analyses. First, the raw reads (Table 2.1) were mapped back to the de novo assembled fire ant transcriptome using bowtie-2.2.1 (Langmead & Salzberg, 2012) [pairedend: bowtie2 -local --no-unal –x Trinity_bowtieindex.fasta -p 10 -q -1 readsR1.fastq -2 readsR2.fastq -S output.sam; single-end: bowtie2 -local --no-unal –x Trinity_bowtieindex.fasta -p 10 -q -U. reads.fastq -S. output.sam]. Paired-end reads were mapped separately from single-end reads. Of a total of 191 million paired-end reads, 98.1% mapped back to the assembled transcriptome, and of these 97.1% as proper pairs while the remaining 2.9% mapped as singlets. For the single-end reads 82.8% of the 195 million single reads could be mapped (Table 2.3). Second, to assess the completeness of the assembled transcriptome, we searched for the presence of 4,415 hymenoptera BUSCO (Benchmarking Universal Single-Copy Orthologs) genes (“hymenoptera_odb9 set” downloaded on 2017/02/05. BUSCO genes are highly-conserved and expected to be present in divergent lineages (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). We used the software BUSCO_v2.0 (http://busco.ezlab.org/) to search for the BUSCO genes in our database [BUSCO.py -o <BUSCO genes on Trinity assembly output> -i <Trinity assembly.fasta> -l hymenoptera_odb9 -m tran]. Among the 4,415 hymenoptera BUSCO genes, 4,200 Complete BUSCOs (95.2%) were present in our assembled transcriptome, while 124 (2.8%) were fragmented and 91 (2.0%) were missing. Among the complete BUSCO genes, 1,998 were single-copy and were used for following analysis. Some of the missing genes may be due to their lack of expression in the tissues or life stages sampled, and conserved gene can be lost in some lineages. Our de novo assembled transcriptome identified a higher number of hymenoptera BUSCO genes compared to the fire ant OGS. 10.

(21) (Wurm et al., 2011) and the NCBI-Sinv100 (93.8% and 72.7% BUSCO genes found, respectively). Our Trinity assembly produced 91,409 transcript isoforms, corresponding to 62,189 genes. This gene number is an over-estimate because the genomes for several insects typically have between ~10,000 to ~20,000 genes (Simola et al., 2015). Trinity assemblies often exceed the predicted gene number by 3-4 fold due to redundant or incorrectly assembled transcripts. To avoid spurious transcripts, Trinity assemblies are often filtered for expression or presence of ORFs (Albertin et al., 2015; Chan et al., 2016; Singh et al., 2017; Theissinger et al., 2016), especially when a high number of reads are used (Ana Conesa et al., 2016). In our analysis, we avoided filtering transcripts to permit the identification of lowly expressed and partial TEs. A total of 6,853 were identified among the transcripts ad putative TEs (tblastx search (Altschul, Gish, Miller, Myers, & Lipman, 1990) against the invertebrate repetitive elements from RepBase v21.01 (Bao, Kojima, & Kohany, 2015) (E-value <1e-20). As the transcript include multiple isoform, we selected only the longest isoform, for a total of 4,128 TE transcripts. Among these, redundancies were filtered with CD-HIT (Fu, Niu, Zhu, Wu, & Li, 2012; W. Li & Godzik, 2006) (sequence identity 95%)), for a total of 3,665 non-redundant putative TEs that were used to augment the gene list for CNV detection (nr-sinv-ref). 2.1.3. Collection of fire ant TEs The 3,665 TEs found using de novo assembly of fire ant RNA-seq databases, we combined with previous TE gene sets (247 S. invicta RepBase TEs (Bao et al., 2015) and 16 differentially expressed TEs (Nipitwattanaphon, Wang, Dijkstra, & Keller, 2013)). Redundancies among datasets were identified through a blastn search (percent identity 11.

(22) ≥95% and bitscore ≥200) and discarded. We obtained a final list of 3,678 TEs. 2.1.4. Collection of fire ant non-TE genes For the non-TE genes, we merged four fire ant gene sets: 1. NCBI release 100, which included 22,066 nuclear and 13 mitochondrial genes. For the nuclear genes, we selected only the longest isoforms, for a total of 15,058 genes. 2. Fire ant official gene set v2 (n = 16,569) (Wurm et al., 2011). 3. Fire ant expressed sequenced tag gene set (n = 12,859) (Wang et al., 2006). 4. Odorant-binding proteins (OBP) and chemosensory protein (CSP) from (González et al., 2009; Gotzek, Robertson, Wurm, & Shoemaker, 2011; Xu et al., 2009) (n = 31). Redundancies were removed among these 4 dataset (blastn, percent identity ≥95% and bitscore ≥200), for a total of 25,415 fire ant genes. 2.1.5. Collection of TEs and other fire ant genes (nr-sinv-ref) to detect copy number variation (CNV) Last, we merged the TE gene set with the non-TE gene set, with a final removal of redundancies, to obtain a “non-redundant” list of 27,279 fire ant TEs and non-TE genes (hereafter nr-sinv-ref).. 2.2. Copy number variation (CNV) analysis 2.2.1. SB versus Sb whole genome sequencing reads To detect genes with copy number variation, we compared whole genome sequencing reads from eight pairs of sons (SB or Sb, ant males are. 12.

(23) haploid) from unrelated SB/Sb queens. Low coverage sequence for seven pairs of sons were generated using 101 bp paired-end Illumina HiSeq 2000 reads (Wang et al., 2013). One more pair of SB (Wurm et al., 2011) and Sb (Wang et al., 2013) brothers with much greater sequencing read depth was included. These two were sequenced using multiple size libraries; we only used the 50 bp Illumina paired-end reads (Table 2.4). Although the exact genetic location of each CNV gene would be unknown, by comparing eight pairs of brothers, we expected to preferentially detect CNV genes on the supergene. 2.2.2. DNA read alignment to nr-sinv-ref Each of the 16 datasets (eight pairs of SB and Sb brothers) was aligned against our nr-sinv-ref using the Burrows-Wheeler Aligner, BWA-MEM0.7.12 (H. Li & Durbin, 2009), with default settings [bwa mem <nr-sinvref.fasta> <DNA-seqR1.fastq> <DNA-seqR2.fastq>]. For each of the 16 datasets, 44.2 to 53.4% of the total raw reads mapped against the nr-sinv-ref (Figure 2.2b). Raw reads mapping against CenSol, the major fire ant centromeric satellite DNA repeat (Huang et al., 2016), made up for 21.8 to 29.8% of the total reads, while all other genes combined for 18.8 to 26.6% (Figure 2.2b). When mapping DNA reads against the transcriptome, reads coming from intergenic regions or unannotated genes will remain unmapped (Ana Conesa et al., 2016), therefore, we considered our mapping percentages acceptable. Among the 27,279 genes used as the nr-sinv-ref, 26,307 were mapped for all 16 datasets (96.5%), and 27,047 were mapped for at least one dataset (99.2%). 740 genes showed no corresponding reads for at least one dataset, probably due to the lower coverage of some of the DNA-seq libraries used. The other 232 genes were unmapped in all 16 datasets,. 13.

(24) some of which might be due to the presence of S. invicta viral genes (Wang et al., 2006) or other contaminants in the datasets used to build the reference. 2.2.3. Copy number variation (CNV) analysis To determine the CNV for each gene between SB and Sb brothers, DNA read depths obtained from the BWA alignment were compared using CNV-seq (Xie & Tammi, 2009). DNA reads aligned with BWA software were first converted into hits files with SAMtools (H. Li et al., 2009) [samtools view -b –S align.sam > align.bam; samtools view -F 4 align.bam | perl -lane 'print "$F[2]\t$F[3]"' > align.hits] and the compared with CNV-seq [perl cnvseq.pl --test <Sb.hits> –ref <SB.hits> --genome-size 39534189 --log2threshold 0.6 --p-value 0.001 --global-normalization --window-size=350; where genome-size was 39,534,189 bp, corresponding to the total size of the genes used as a reference]. CNV-seq will choose a minimum window size for each datasets. However, we decided to use the same window size for all eight comparisons. We first asked CNV-seq to choose the window size for each comparison, and then we selected the one with lowest resolution (350 bp) for all eight datasets. For each overlapping window, we calculated “log2-fold(+1)” values normalized for different read depths using the formula: !"#2! !"#$(+1) = !"#!. !"!!"#$!!"#$% + 1 !"!!"#$!!"#$% + 1 !"!#$!!"!!"#$% !"!#$!!"!!"#$%. which was modified from (Xie & Tammi, 2009) by the addition of 1 to the SB and Sb read counts. This adjustment retains windows with zero reads, which are discarded by CNV-seq, as they may represent important genes that are deleted in either one of the two genotypes. Genes having an average value of log2-fold ≥1 or ≤-1 (≥2-fold difference; 14.

(25) among both biological replicates and overlapping windows) and P ≤0.05 in 75% of biological replicates for at least 25% of the gene length (binomial test with false discovery rate using the Benjamini-Hochberg procedure) were selected as putative duplicated genes in the Sb or SB genome, respectively (i.e., CNV genes). The same analysis was repeated by using a log2-fold threshold of ≥0.6 or ≤-0.6 (~1.51 fold). This relaxed threshold likely includes more false positives, however, it may include modest expansions or contractions of gene families and partially duplicated genes. Statistical analysis were performed in R (Ihaka & Gentleman, 1996).. 2.3. Duplicated gene annotation and gene function overrepresentation analysis We manually verified the annotation of the CNV genes using multiple methods: 1) RepeatMasker analysis with RepBase v21.01 (Kohany, Gentles, Hankus, & Jurka, 2006), where each gene was assigned to a TE class/family according to the best score; 2) Blast2GO search against the NCBI nr database (A Conesa et al., 2005), where the first best annotated hit was retained; and 3) blastx search against Uniref50 proteins (downloaded on 2017/05/17 (Suzek et al., 2015)), where the first best hit which was not an unknown protein was retained. In case of disagreement between methods, the annotation of CNV genes was manually curated by selecting the result from the method with the best scores and longest alignment length. Over-representation of gene function among CNV genes with respect to the nr-sinv-ref was tested only for gene categories with ≥3 genes, using the Fisher’s exact test with Bonferroni correction. The number of genes for each category in our nr-sinv-ref was estimated. 15.

(26) through RepeatMasker for TEs, and Uniref50 for non-TE genes, separately. Since the annotation of our CNV gene list was manually curated, thus creating a possible bias between the nr-sinv-ref and the CNV genes, we conducted statistical analyses with both a non-curated list of CNV genes and our final curated list.. 2.4. DNA real-time quantitative PCR (qPCR) and digital droplet PCR (ddPCR) to detect copy number 2.4.1. Ant Sampling To obtain ant samples, polygyne and monogyne colonies of S. invicta were collected from Taoyuan, Taiwan. The social form of the colonies was assessed by observing the number of queens in the colony and subsequently confirmed by a Gp-9 PCR-RFLP assay (M. Krieger & Ross, 2002) of 10-15 workers per colony. DNA was extracted from a pool of 36 males from the same colony with the same genotype using the DNeasy Blood & Tissue kit (Qiagen, Germany). Since polygyne colonies include both SB and Sb males, each individual male genotype was determined by a Gp-9 PCR-RFLP assay (M. Krieger & Ross, 2002). We compared three haploid male types: SB from monogyne colonies (Mono SB), SB from polygyne colonies (Poly SB), and Sb from polygyne colonies (Poly Sb), each with three separate colonies (biological replicates). 2.4.2. DNA real-time quantitative PCR (qPCR): goals, primer design, and protocol We conducted DNA qPCR to: 1) validate the bioinformatics results for CNV in TEs, and 2) examine CNV in TEs with known differential expression (DE). For CNV analysis validation, we tested 12 CNV genes 16.

(27) (11 TEs and 1 non-TE gene, Tdpoz) and a non-CNV TE as a negative control. We also performed DNA qPCR to assess whether 13 TEs (different from above) with known DE between SB and Sb-carrying individuals (Nipitwattanaphon et al., 2013; Wang et al., 2008) showed corresponding copy number differences. Primers were designed using Primer ExpressTM v3.0 (Thermo Fisher), targeting a region within the coding sequence (Tables 2.5 and 2.6). The coding. sequences. were. determined. using. NCBI. ORF Finder (Open Reading Frame Finder) (https://www.ncbi.nlm.nih.gov/orffinder/). and. RepBase. CENSOR. (http://www.girinst.org/censor/) (Kohany, Gentles, Hankus, & Jurka, 2006). Two single copy genes in both the SB and Sb genotypes were used as the internal controls: 40S ribosomal protein S9 and elongation factor 1alpha. Ribosomal proteins and elongation factors are housekeeping genes often used as reference for gene expression qPCR (Lu et al., 2013; Sun, Lu, Tang, & Du, 2015). We verified that these genes are single copy in the fire ant by a blastn search against both SB and Sb assembled genomes. Each primer pair was tested for qPCR efficiency using five four-fold dilutions of DNA. Only primer pairs with efficiency >90% and <110% were retained, and primer efficiencies were adjusted using the formula: 2!!!" =. !"#$%&! !!!!"#$%! !"#$%&#! !!"#$%&'$(. !. , where ! = 2!! and s=slope of the plot, x-axis=Ct,. and y-axis=log2(DNA concentration/DNA initial concentration), which was modified from (Pfaffl, 2001). All the qPCR reactions were carried out by mixing 3 µl of DNA, 10 µl of Power SYBR Green PCR master mix (Applied Biosystem), 200 – 300 nM of each primer, and water to a final volume of 20 µl. For each primer pair reaction, three four-fold dilutions of DNA were used, each with three technical replicates. All qPCR reactions were performed on an ABI17.

(28) 7300 Real-Time PCR System (Applied Biosystem) and the raw results were generated by the 7300 System SDS Software v 1.4 (Applied Biosystem). Copy number variation for each gene was calculated as 2-ΔΔCt normalized to the Mono SB genotype. Statistical analysis (ANOVA and post-hoc Tukey test) were performed in R (Ihaka & Gentleman, 1996). 2.4.3. DNA digital droplet PCR (ddPCR): goals, primer design, and protocol We conducted droplet digital PCR (ddPCR) (Biorad Q200 Droplet Digital PCR System) to validate five additional CNV genes (1 TE and 4 non-TE genes). ddPCR is a newer technique that can more precisely detect smaller fold differences than qPCR (e.g., 2-fold or less). Primers and TaqMan probes were designed by Integrated DNA Technologies (Table 2.7). The sequences of the two single copy genes (40S ribosomal protein S9 and elongation factor 1-alpha), already used as internal control for qPCR, were utilized for ddPCR primer and probe design. We used blastn to compare the designed primers and probes against the PacBio assembled SB and Sb genomes to ensure that at least one complete amplicon was found in these genomes. The efficiency of primer and probe sets were tested by gradient ddPCR (QX200, BioRad) by the Genomic Technology Core- Sequencing and Microarray Division of the Institute of Plant and Microbial Biology at Academia Sinica, using heterozygous virgin queen genomic DNA as template following the QX200 manufacturer’s protocol. ddPCR results were analyzed by QuantaSoft Analysis Pro. After obtaining the optimized annealing temperature for each primer/probe sets, multiplex PCR reactions were designed by reducing the primer/probe concentration of the genes with lower fluorescence signal in half. The optimized multiplex conditions were verified by using heterozygous virgin queen DNA.. 18.

(29) Copy number variation for each gene was shown as estimated actual number of genomic copies. Statistical analysis (ANOVA and post-hoc Tukey test) were performed in R (Ihaka & Gentleman, 1996).. 2.5. Gene expression of CNV genes 2.5.1. Differential expression (DE) analysis of CNV genes We compared the log2 ratio of Sb:SB copy number of CNV genes with their DE between SB/Sb and SB/SB virgin queens. To assess DE in SB/Sb virgin queens, we generated four pairs of SB/Sb and SB/SB virgin queen RNA-seq libraries (Table 2.8). We first verified that. all. libraries. were. of. adequate. quality. with. FastQC. (http://www.bioinformatics.babraham.ac.uk/projects/fastqc). Next, raw reads were quality trimmed and contaminating adaptors were removed using cutadapt (Martin, 2011): cutadapt -q 20 --trim-n -m 50 –O 5 -b <adapters> -B <adapters> -o outputR1.fastq -p outputR2.fastq inputR1.fastq inputR2.fastq To calculate gene expression within each RNA-seq dataset we used RSEM (B. Li & Dewey, 2011): rsem-calculate-expression –paired-end reads_R1.fastq reads_R2.fastq <reference gene list.fasta> output The gene expression was calculated as log2(FPKM+1). We added 1 to the FPKM value obtained with RSEM to avoid discarding genes with zero reads. DE between SB/Sb and SB/SB virgin queens was calculated from RSEM expression output, using edgeR (M. D. Robinson, McCarthy, & Smyth, 2009), and expressed as log2(fold-change) (logFC): 1. Create matrix from RSEM results using Trinity-v2.2.0 (Haas et al.,. 19.

(30) 2013) tool: abundance_estimates_to_matrix.pl --est_method RSEM -gene_trans_map none RSEM_1-big-vs-CnvRef.genes.results RSEM_1small-vs-CnvRef.genes.results RSEM_2-big-vs-CnvRef.genes.results RSEM_2-small-vs-CnvRef.genes.results RSEM_3-big-vsCnvRef.genes.results RSEM_3-small-vs-CnvRef.genes.results RSEM_4big-vs-CnvRef.genes.results RSEM_4-small-vs-CnvRef.genes.results 2. edgeR R scripts: FileName <c('RSEM_1.big.vs.CnvRef','RSEM_1.small.vs.CnvRef','RSEM_2.big.vs.Cn vRef', 'RSEM_2.small.vs.CnvRef','RSEM_3.big.vs.CnvRef','RSEM_3.small.vs.C nvRef','RSEM_4.big.vs.CnvRef','RSEM_4.small.vs.CnvRef') Subject <- c(1,1,2,2,3,3,4,4) Treatment <- c('big','small','big','small','big','small','big','small') targets <- data.frame(FileName,Subject,Treatment) Subject <- factor(targets$Subject) Treat <- factor(targets$Treatment, levels=c("big","small")) design <- model.matrix(~Subject+Treat). #Create a design. matrix y <- DGEList(counts=data) keep <- rowSums(cpm(y)>1) >= 2 y <- y[keep, keep.lib.sizes=FALSE]. #keep only genes with cpm > 1 in. at least 2 libraries y <- calcNormFactors(y). #Normalize library. y <- estimateDisp(y,design) fit <- glmQLFit(y, design). 20.

(31) qlf <- glmQLFTest(fit). #Perform test for paired. design 2.5.2. Absolute expression analysis of CNV genes The log2 ratio of Sb:SB copy number of CNV genes was also compared with their absolute gene expression in: 1) in virgin SB/Sb queens, and 2) reproductive SB/Sb polygyne queens. Absolute gene expression values (log2(FPKM+1)) in virgin SB/Sb queens were from the same data from above (Section 2.5.1). For the absolute expression in reproductive SB/Sb queens we used six published RNA-seq datasets (Wang et al., 2013) (Table 2.1, libraries 9-13). We calculated gene expression (log2(FPKM+1)) as above. As a baseline for each RNA-seq dataset, we used the average gene expression of 1,998 fire ant complete and single-copy BUSCO (Benchmarking Universal Single-Copy Orthologs) genes. Regression analyses to evaluate linear correlations were performed in R (Ihaka & Gentleman, 1996).. 2.6. Genomic analysis of CNV genes 2.6.1. Assembly and linkage group construction of SB and Sb genomes To identify the chromosomal locations of some selected CNV genes we used newly assembled SB and Sb PacBio genomes, obtained from SB and Sb pools of brothers from a single SB/Sb queen. SB and Sb fire ant genomes were obtained from SB and Sb pools of brothers from a single SB/Sb queen. Approximately 48x and 59x sequence coverage for the SB and Sb genotypes, respectively, were. 21.

(32) obtained on the PacBio RSII platform. Genomes were assembled using FALCON (SB, v0.4.0 with overlap filtering --min_cov=2, --max_cov=80; Sb, v0.5.0 with --min_cov=2, --max_cov=100; all others default), contigs ≤1000 bp were discarded as well as circular contigs, and then remaining contigs polished once with Quiver (Chin et al., 2013) (in SMRT analysis pipeline 2.3.0). After assembly, we constructed the 16 linkage groups (pseudochromosomes) by transferring a genetic map previously obtained with RAD tag sequencing (Wang et al., 2013) to the new genomes as follows: for each genetic position in the original map we extracted 100 bp from each side, and performed a blastn search into the new genome assembly. The blastn tabular output was utilized to assign the linkage group and genetic position to the new assemblies. We then use the new genetic maps to construct linkage groups for both SB and Sb genomes, using ALLMAPS (Tang et al., 2015) [python -m jcvi.assembly.allmaps merge <new. genetic. map.csv>. -o. <output.bed>. ;. python. -m. jcvi.assembly.allmaps path <output.bed> <linkage groups.fasta>]. With this method, we assigned 67.6% and 75.7% of the SB and Sb PacBio sequences to a linkage group, respectively. The remaining sequences were included in the analysis as unmapped contigs. 2.6.2. Genomic analysis of CNV genes Then we searched for the CNV genes in the newly assembled genomes using blast (Altschul, Gish, Miller, Myers, & Lipmal, 1990) and BLAT (Kent, 2002) searches, and visualized the results with the Integrative Genomics Viewer (IGV) (J. T. Robinson et al., 2011). 2.6.3. Placement of SB contigs 000611F and 000344F onto chr 5 and chr 3, respectively. 22.

(33) Because the Sb PacBio genome assembly placed more sequence into linkage groups, we used the Sb contigs to place two unmapped SB contigs: 000611F and 000344F, containing two CNV genes: the genes Sinv_desatA1_a and Sinv_eloA, respectively. Sinv_desatA1_a was found on Sb chr 5 while Sinv_eloA was found on Sb chr 3, suggesting that the two SB unmapped contigs 000611F and 000344F are part of chr 5 and chr 3, respectively. A blastn search showed that the 000611F and 000344F SB unmapped contigs could be fully aligned with Sb chr 5 and chr 3, respectively. Hence, we consider Sinv_desatA1_a to be on chr 5, and Sinv_eloA to be on chr 3, in in both SB and Sb genome.. 2.7. Analysis of desaturase and elongase copies We then focused on two selected CNV genes, possibly involved in the differential synthesys of CHC between SB/SB and SB/Sb queens: a desaturase (ID: gi|751224626|ref|XM_011167365.1, Acyl-CoA Delta(11) desaturase) and an elongase (ID: SI.CL.01.cl.0146.Contig1, elongation of very long chain fatty acids). To identify all copies and intron/exon structure of both genes, we conducted blastn (Altschul et al., 1990) similarity searches against the SB and Sb genomes. For all copies, we extracted the gene and 1 kb flanking regions from each side. All genomic copies, their open reading frames, and closest homologous proteins (silkworm B. mori and honeybee A. mellifera for desaturase, and the ants M. pharaonis, T. zeteki, V. emeryi and the honeybee for elongase; retrieved from the NCBI nr database) were aligned using MEGA7 (Kumar, Stecher, & Tamura, 2016). For each copy we examined expression in the 13 RNA-seq libraries used to build the nr-sinv-ref (Table 2.1). RNA-seq data were aligned against. 23.

(34) the coding region of both copies of the desaturase (Sinv_desatA1_a and Sinv_desatA1_b) and the elongase (Sinv_eloA and Sinv_eloB) genes using BWA-MEM-0.7.12 [paired-end: bwa mem <Copies of Elongase and Desaturase gene.fasta> <readsR1.fastq> < readsR2.fastq>; single-end: bwa. mem. <Copies. of. Elongase. and. Desaturase. gene.fasta>. <reads.fastq>]. The results were visualized with IGV (J. T. Robinson et al., 2011). We used SNPs unique to each copy to visually assess whether the reads were properly mapped onto: Sinv_desatA1_a or Sinv_desatA1_b, and Sinv_eloA or Sinv_eloB. Once we confirmed each read was uniquely mapped onto each copy, we used the mapped read numbers to calculate RPKM. For each RNA-seq dataset, we also reported the average expression level of 1,998 complete and single-copy fire ant BUSCO (Benchmarking Universal Single-Copy Orthologs) genes.. 24.

(35) CHAPTER 3 Results. 3.1. CNV analysis: duplicated TE and non-TE gene discovery To identify genes with copy number differences between the two fire ant supergene alleles we mapped whole genome sequencing reads onto a non-redundant gene set (nr-sinv-ref, see Methods) for eight pairs of sons (SB and Sb, ant males are haploid) from unrelated SB/Sb queens. We found 260 genes having copy number differences between SB and Sb (log2-fold ≥1 or ≤-1, i.e., 2-fold difference). Of the 260 CNV genes, 239 (91.9%) were in greater copy number in Sb, and only 21 (8.1%) in SB. Among the 239 genes with higher copy in Sb, 124 genes (52%) were TEs, while 115 (48%) were non-TE genes (Figure 3.1a). When using a relaxed log2-fold threshold of ≥0.6 or ≤-0.6 (~1.51 fold), a total of 475 CNV genes were found. TE and non-TE gene proportions were consistent with the 2-fold analysis: the majority of the genes had greater copy number in Sb (241 TEs and 189 non-TE genes; 90.5% of total), while fewer in SB (15 TEs and 30 non-TE genes, 9.5% of total) (Figure 3.1b). For the log-2 fold analysis, among the non-TE CNV genes, we found 21 genes with no known function and an additional 31 genes with putatively unique functions not shared with any other CNV genes. Of the remaining CNV genes, 11 non-TE gene categories had ≥3 genes: fatty acid synthases (FASs), reductases, cytochrome P450s, zinc fingers, kinases,. 25.

(36) farnesyl pyrophosphate synthases (FPPSs), histones, coiled-coil domaincontaining proteins, toll-like receptors, serine proteases, and OBP (Table 3.1). Similar results were found when using a log2-fold threshold of ≥0.6 or ≤-0.6 (~1.51 fold) (Table 3.2). Among the CNV TEs, we found 61 DNA transposons, 51 LTR retrotransposons, and 19 long interspersed nuclear element (LINE) retrotransposons. Only the BEL family (LTR retrotransposons) was overrepresented among TEs that have proliferated in the Sb genome (18% BEL among CNV TEs and 8% in the nr-sinv-ref, P <0.01, Fisher’s exact test with Bonferroni correction) (Table 3.3). The TE analysis performed with a relaxed thresholds ~1.51 fold was also consistent with the 2-fold CNV results, with BEL family over-represented among CNV genes. Using a more relaxed threshold showed overrepresentation in three additional TE families (Polinton, Crypton, and Copia), while the Mariner family (DNA transposons) was underrepresented (Table 3.4).. 3.2. Validation of CNV results with quantitative real-time PCR (qPCR) and droplet digital PCR (ddPCR) We validated the bioinformatics results with DNA qPCR or ddPCR on 3 independent sets of SB and Sb male samples from Taiwan: SB from monogyne colonies (Mono SB), SB from polygyne colonies (Poly SB), and Sb from polygyne colonies (Poly Sb). Since the CNV analysis was based on a single population (Georgia, USA), some duplicates may not be found in our population due to polymorphism. Initially, we used qPCR to test 12 genes (11 TEs and 1 non-TE) identified as having higher copy number in Poly Sb males compared to either Poly. 26.

(37) SB or Mono SB males and one non-CNV gene (negative control). Among these, nine showed higher copy number in Poly Sb males compared to both Mono SB and Poly SB (P <0.05, ANOVA and post-hoc Tukey test); and they had no difference between Mono SB and Poly SB males (P >0.7). In two of the three non-significant cases (Mariner11 and P26) the lowest Poly Sb value exceeded both Mono SB and Poly SB values, suggesting that the non-significant results were mainly due to high variation between biological replicates. This variation could be due to differences in TE copy number among individuals from different colonies. Finally, Tdpoz was not duplicated in our qPCR analysis indicating that it is either a false positive or the Taiwan and the Georgia populations differ for this gene. The non-CNV TE used as a negative control did not show any copy number difference in both the CNV analysis and by qPCR, as expected (P ≥0.2) (Figure 3.2a). We the used ddPCR to test one TE (F04BEA piggyBac, also tested with qPCR, hereafter, note, there are zero copies of the F04BEA1219 TE in the SB genome) and four genes associated with CHC synthesis with higher copy number in Sb relative to SB (Figure 3.2b). In all five cases, we confirmed the results of the bioinformatics analysis, with higher copy in Poly Sb males compared to both Mono SB and Poly SB (P <0.05, ANOVA and post-hoc Tukey test, Supplementary Figure 4B). Together these results show the reliability of our CNV analysis: 14 out of 17 genes (82%) showed significant copy number difference through qPCR or ddPCR.. 3.3. Higher transposon expression is weakly correlated to higher copy number. 27.

(38) Previous studies have found TEs with differential expression (DE) between SB and Sb-carrying individuals (Nipitwattanaphon et al., 2013) and we expected these TEs to show corresponding copy number differences. To this end, we assessed the CNV status for 13 TEs with known DE between SB and Sb-carrying individuals (Figure 3.3). Surprisingly, eleven of the 13 TEs showed no copy number variation among genotypes in both DNA qPCR and CNV analysis (nine showed no copy number variation among genotypes, and two showed marginally significant copy number variation). Only a DNA Harbinger and a piggyBac element were upregulated in both SB/Sb queens and workers, and both showed higher copy number in Sb. These results revealed that higher TE expression does not necessarily correspond to greater copy number. These TEs were expected to show corresponding copy number differences, however, We also examined the converse case: whether TEs with greater copy number in Sb have correspondingly higher expression in SB/Sb individuals (Figure 3.4). We correlated CNV TEs with gene expression and found a positive linear relationship for DNA TEs. When excluding one outlier datapoint, only the DE analysis (but not the absolute expression analyses) showed a linear correlation with copy number (p = 0.01, R2 = 0.342, Figure 3.4a). In all cases, low R2 values showed that the correlation between copy number and expression is not straightforward. We found no positive correlation when considering LTR and LINE elements. The observed weak correlation between the TE copy number and their expression suggests that more factors (e.g., age of TE in genome) need to be considered to explain the difference in TE copy number between SB and Sb.. 28.

(39) 3.4. Enzymes involved in cuticular hydrocarbon (CHC) synthesis are highly duplicated in Sb and over-expressed in SB/Sb queens We tested if “highly represented” gene categories (≥3 CNV genes) were over-represented among CNV genes. Eleven non-TE gene functions were highly represented, and of these, five were over-represented: genes encoding FASs, FPPSs, histones, toll-like receptors, and OBPs (Table 3.1, P <0.05, Fisher’s exact test with Bonferroni correction). FASs and two of the highly represented gene families (cytochrome P450s and reductases) are possibly involved in the synthesis of CHCs (Figure 3). Fatty acid synthases are responsible for incorporation of malonyl-CoA or methylmalonyl-CoA into an acetyl-CoA, thus forming a medium-chain fatty acyl-CoA. Desaturases and elongases add double bonds and elongate the carbon chain, thus regulating number and position of double bonds, and chain length, respectively. The resulting long-chain acyl-CoA is reduced to aldehyde by a reductase enzyme. Here we report the total number of reductases with copy number variation between SB and Sb, however, some of them may be responsible for reduction of other molecules than long-chain acyl-CoA. Specifically, three reductases duplicated in Sb were fatty acyl-CoA reductases, while the others were ambiguous. Finally, the hydrocarbon is produced from the aldehyde by a decarboxylase (which is thought to be a cytochrome P450 enzyme). As for reductases, we report the number of all CNV cytochrome P450. However, only the cytochrome P450 CYP4 family is responsible for fatty acid metabolism (Kirischian & Wilson, 2012). In particular, an enzyme of the CYP4G family has been shown to produce CHCs in D. melanogaster (Qiu et al., 2012), but among our CNV cytochrome P450 genes we only found enzymes from the families CYP4A, CYP4C, and CYP6. 29.

(40) Because fire ant queen acceptance or rejection by workers is based on an odor cue (Eliyahu, Ross, Haight, Keller, & Liebig, 2011; Trible & Ross, 2015), one or more of these duplicated genes could plausibly affect queen hydrocarbon profiles and consequently worker behavior, contributing to the social form differentiation. Given higher copy in Sb of these three gene families (25 genes) associated with CHCs, we next looked for additional genes putatively involved in the synthesis of CHCs with greater copy number in Sb and found two; an acyl–coenzyme A desaturase Δ11 and an elongase of very long chain fatty acid (Figure 2.5), for a total of 27. These two enzymes control the presence of double bonds (i.e., saturated or unsaturated hydrocarbons) and the length of the CHCs, respectively (Fang et al., 2009; Finck, Berdan, Mayer, Ronacher, & Geiselhardt, 2016; Helmkampf et al., 2015). We compared the log2 ratio of Sb:SB copy number with DE between SB/Sb and SB/SB virgin queens for the 27 CHC genes with greater copy number in Sb (Figure 3.6a). Only 20 passed the low expression threshold, and of these 17 (63%) were significantly over-expressed in SB/Sb (p < 0.05, FDR) and only one in SB/SB (p = 0.01, FDR). Taking all CNV genes into consideration, only 22.6% of non-TEs with greater copy number in Sb were over-expressed in SB/Sb queens. There was no positive correlation between copy number and DE (p = 0.93, F-test, linear regression). A similar analysis using absolute expression revealed no positive correlation between copy number and absolute gene expression in SB/Sb virgin queens (p = 0.07, F-test, linear regression) (Figure 3.6b) and SB/Sb reproductive queens (p = 0.17, F-test, linear regression) (Figure 3.6c). All 27 genes showed high expression in reproductive queens, with 25 of them. 30.

(41) having greater expression than the average of a set of conserved hymenoptera genes (BUSCO, Benchmarking Universal Single-Copy Orthologs, n =1,998). The desaturase gene was the most highly expressed gene in both SB/Sb virgin and reproductive queens.. 3.5. CHC enzyme genes have copies on supergene and nonsupergene chromosomes We searched for the copies of the 17 CNV CHC genes that were also more highly expressed in SB/Sb virgin queens relative to SB/SB in the SB and Sb genomes (Figure 3.7). All 17 genes have at least a copy on a chromosome scaffold. Three definitively have a copy on chr 16 (the supergene chromosome); the other 14 have copies on one of chr 1 to 15 as well as copies on unmapped contigs, consistent with being on the supergene. We suggest that the majority of Sb unmapped genes are part of the supergene for three reasons: first, the 16 chromosomes were constructed by transferring a genetic map obtained from a SB genome so would fail to incorporate chr 16 Sb-specific contigs; second, the supergene, by its repetitive nature, presents more difficulties in assembly than other chromosomes; and third, since the CHC genes were called as CNV, some of their containing contigs should be on the supergene, but this remains to be verified. All the CHC gene duplications (covering at least 2 exons) included introns and the flanking region. Thus, these duplications were unlikely due to gene retroposition (the insertion of a DNA fragment due to reverse transcription) but rather likely the product of segmental duplications. Because gene duplications could be caused by TEs (Feschotte & Pritham, 2006), we also looked for the presence of several TEs with greater copy. 31.

(42) number in Sb in the genomic regions flanking the CHC gene copies. While some TEs, such as Mariner family transposons, were found in close proximity with the CHC gene copies (< 5 kb), the majority of the flanking TEs were >10 kb away.. 3.6. Desaturase and elongase have potentially functional copies on Sb We further analyzed two candidate duplicate genes (the desaturase and the elongase) in more detail. This desaturase corresponds to Sinv_desatA1_a (Helmkampf et al., 2015), which encodes Acyl-CoA desaturase Δ11, an enzyme that introduces the first double bond at the 9th or 11th position in the acyl chain (Hashimoto et al., 2007). Sinv_desatA1_a is located on chromosome 5. Ten additional partial copies, with 90 to 98% nucleotide identity with Sinv_desatA1_a, were found on unmapped Sb contigs. The three longest copies are almost identical to each other in their coding regions (≥98.9% pairwise identity) and share at least 2 kb of flanking region. In comparison to Sinv_desatA1_a, they have share ~90% identity, have longer introns, and lack exon 1. The lack of exon 1 may be due to incomplete assembly as their respective contigs end within 1,200 bp upstream of exon 2 and putative. “orphan”. exon. 1’s. (87%. nucleotide. identity. to. the. Sinv_desatA1_a) exist on four different Sb contigs. These three copies correspond to, but are longer than Sinv_desatA1_b (Helmkampf et al., 2015) , a partial gene annotated with only exons 4 and 5. We call these three. additional. copies. Sinv_desatA1_b1,. Sinv_desatA1_b2,. and. Sinv_desatA1_b3. The seven other putative desaturase copies on Sb seem to be pseudogenes, missing critical exons containing essential histidine residues (Figure 3.8). 32.

(43) We assessed specific expression of the desaturase copies by mapping RNA-seq reads onto the Sinv_desatA1_a and the Sinv_desatA1_b coding regions (Figure 3.9). Sinv_desatA1_a is expressed in all analyzed datasets (16.5 to 93.9 RPKM), while Sinv_desatA1_b is expressed in polygyne reproductive queens and polygyne virgin queen ovaries (1,161.6 and 49.5 RPKM, respectively), consistent with a potential polygyne-specific role of Sinv_desatA1_b. These latter values are relatively high when compared to the average expression of the BUSCO genes for each dataset, which ranged from 25.9 to 35.3 RPKM. Very low expression of Sinv_desatA1_b (3.1 RPKM) was also detected in monogyne virgin queen ovaries. This result suggests that there may be polymorphism for Sinv_desatA1_b gene copies on SB, in addition to being a CNV gene between SB and Sb (Figure 3.12). The elongase gene has one copy on chromosome 3 (Sinv_eloA) and two additional copies on unmapped Sb contigs (Sinv_eloB). All of the copies encode full-length proteins with ≥96.9% nucleotide pairwise identity of the coding region (Figure 3.10). Among the 13 RNA-seq libraries examined (Figure 3.11), the expression of the full length of Sinv_eloA was not detected, and that of Sinv_eloB was detected only in polygyne reproductive queens (81.4 RPKM) (Figure 3.12).. 33.

(44) CHAPTER 4 Discussion. 4.1. TEs accumulation on Sb Among total 260 genes with copy number differences between SB and Sb-carrying individuals in the fire ant, we found around half (50.4%, n =131) to be TEs. When using a more relaxed threshold to identify CNV genes (log2-fold ≥0.6 or ≤-0.6 (~1.51 fold)), we found a similar proportion of TEs (53.9%, n = 256, out of 475 CNV genes). This accumulation of TEs is consistent with theoretical predictions for non-recombining genomic regions (Bachtrog, 2013; Bartolomé et al., 2002; Bergero & Charlesworth, 2009; Dolgin & Charlesworth, 2008; Liu et al., 2004; Peichel et al., 2004). We found 124 TEs with greater copy number in Sb, and only 7 in SB, confirming a general trend of accumulation of repetitive elements on the Sb genome (Wang et al., 2013). Since TEs accumulate in regions with suppressed recombination, finding of TEs on Sb is most probably a consequence of reduced recombination between the two supergene alleles in heterozygous SB/Sb queens coupled with no (or potentially infrequent) recombination between Sb alleles due to the essential absence of functional Sb/Sb queens. In the invasive range Sb/Sb queens are extremely rare (<1%) (Fritz, Vander Meer, & Preston, 2006; K. G. Ross, Krieger, Shoemaker, Vargo, & Keller, 1997) while in the native range (South America) none have been reported but plausibly could occur (~20% Sb/Sb polygyne queens have been observed for a related species, Solenopsis richteri) (Hallar et al., 2006; K G Ross, 1997). Since normal recombination occurs in SB/SB queens, we did not observe a major 34.

(45) accumulation of TEs on SB. Two models can explain the presence of TEs with greater copy number in Sb in relation the reduced recombination between SB and Sb variants of the supergene: 1) More TEs were able to proliferate after the formation of the supergene, due to lack of recombination and consequent reduced purging of deleterious repeats insertion. 2) Both SB and Sb possessed similar levels of TEs at the time of the supergene formation, and these TEs were lost on SB though recombination, but retained in Sb. Both processes may have occurred, perhaps on different TEs. Most of the TE duplications (and deletions) were likely already present in the native range. First, the introduction of S. invicta into the Southern United States was less than 100 years ago (Ascunce et al., 2011; Tschinkel, 2006), which seems too little time for the observed TE patterns. Second, while our bioinformatics analysis was for a population from Georgia, USA, we validated CNV TEs with a Taiwan population (10 out of 12 were confirmed). Finding similar TE duplications in both Georgia and Taiwan supports the idea that most of the Sb-linked TE likely predated invasion into the USA. Similar cases of inversions and lack of recombination, such as the animal Y or W sex chromosomes, show accumulation of TEs and deleterious mutation (Bachtrog, 2006; Peichel et al., 2004; Vicoso, Emerson, Zektser, Mahajan, & Bachtrog, 2013; Yoshida, Makino, & Kitano, 2016). In the case of the fire ant Sb allele, most of the TE insertions presumably have no fitness effect, i.e., they are neutral. However, some TE insertions may contribute to the deleterious phenotypes present in Sb-carrying. 35.

(46) individuals (e.g., lower sperm production in males (Huang & Wang, 2013; Lawson, Vander Meer, & Shoemaker, 2012), and low viability of Sb/Sb individuals (Hallar et al., 2006)). Another possibility is that some of the TEs contribute adaptively to Sb phenotypes by providing cis-regulatory elements or promoting host gene duplication. For example, in Drosophila miranda, helitron transposable elements mediate dosage compensation by creating MSL (male specific lethal) complex binding sites (Ellison & Bachtrog, 2013). Similarly, cis-regulatory elements from endogenous retrovirus and TEs have been co-opted to regulate the immune pathway and pregnancy-related genes in mammals (Edward B Chuong et al., 2016; Lynch et al., 2015). Furthermore, host genes can be captured by TEs, and in this way TEs can promote gene transduction (the process of transferring DNA to a new genetic location) and duplication (Jiang, Bao, Zhang, Eddy, & Wessler, 2004; Moran, DeBerardinis, & Kazazian, 1999). Along these lines, it is interesting to note that the BEL transposon family, with 18% (n =22) of the CNV TEs, seems to have disproportionately proliferated on the Sb supergene. TE from the Mariner family (DNA transposons), were under-represented among CNV genes when using a relaxed threshold of ~1.52 fold. TEs from the Mariner family, one of the first identified in the fire ant (M. J. B. Krieger & Ross, 2003), have shown expansion in ant genomes (Bonasio et al., 2010) and have high horizontal gene transfer rates in both insects and mammals (Lee & Wang, 2018; Oliveira, Bao, Martins, & Jurka, 2012; Peccoud, Loiseau, Cordaux, & Gilbert, 2017). One possible explanation for their under-representation among CNV genes is that the majority of the fire ant Mariner expansions predates the supergene, hence most of the Mariner elements were already inactive at the time of the supergene birth. The TEs that have accumulated in the Sb genome might have been. 36.

(47) expected to have higher gene expression, and conversely, increased copy number might have driven higher gene expression. While we found a weak positive correlation for DNA TEs between the ratio of Sb:SB copy number (i.e., the increase in copy number in Sb) with DE in SB/Sb versus SB/SB virgin queens, and with absolute gene expression in SB/Sb virgin and reproductive queens, we found none for LTR and LINE elements. Overall, this suggests that gene expression is not the only (or even main) driver of copy number and vice versa. Rather, more factors (e.g., age of TE in genome) need to be considered to explain the difference in TE copy number between SB and Sb. Many high copy number transposons had low gene expression levels. One possibility is suppression by the host. As we found no correlation of gene expression with methylation levels, we suggest that suppressed gene expression of some TEs is likely by other mechanisms, such as piRNAs (Brennecke et al., 2007). Alternatively, some of the TE copies may not be expressed because they lack a promoter or have partial protein sequence. Another potential explanation is that some TEs were co-opted in the soma, so their high expression would be uncorrelated with their germline expression level (E B Chuong et al., 2016).. 4.2. Different functions of Non-TE genes with CNV between the Sb and SB fire ant genomes Many non-TE genes had different copy number between SB and Sb (49.6%, n = 129): 115 genes had more copies in Sb and 14 in SB. As for the TEs, similar proportions were find when using a relaxed threshold of ~1.51 fold: 46.1% non-TE CNV genes, n = 219 (out of 475 CNV genes), of which 189 had more copies in Sb and 30 in SB.. 37.