CHAPTER 1 INTRODUCTION
1.3 GENOME-WIDE ASSOCIATION STUDY
1.3.1 The concept of GWAS
GWAS is essentially association mapping of a germplasm collection using markers distributed across the whole genome. A marker is declared significant when the phenotypes of its genotype classes differ statistically, usually as judged by a t-test or ANOVA. No linkage map is required in this process. Once a significant marker is found, the QTL is expected to lie within the LD interval around that marker. In other words, GWAS uses genome-wide markers to test which markers are associated with the phenotype under study. Compared with a bi-parental cross population, GWAS samples more alleles because a germplasm has accumulated mutations and recombination events throughout its history. Together with the falling cost of sequencing, which has made dense genotyping of natural populations widely accessible, this has led to an explosive growth of GWAS in plants (Huang & Han, 2014; Soto-Cerda & Cloutier, 2012; Zhu, Gore, Buckler, & Yu, 2008). It follows from this concept that the number of markers and the LD between markers and QTL in a given population determine the GWAS result (Korte & Farlow, 2013). More markers and more individuals capture more recombination events between markers and QTL, allowing more precise estimates of LD and QTL effects. Unfortunately, QTL controlled by small-effect and/or rare alleles cannot be detected in a small population because of limited statistical power (Ingvarsson & Street, 2011; Korte & Farlow, 2013; Visscher et al., 2017). Although many statistical models have been proposed to address this problem, the fundamental solution is a larger sample size.
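The single-marker test described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular GWAS package: the function names, the 0/1/2 genotype coding and the toy data are all assumptions made for the example. It groups phenotypes by genotype class at each marker and computes the one-way ANOVA F statistic; in practice the statistic would be converted to a p-value and corrected for multiple testing.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for a list of phenotype groups."""
    groups = [g for g in groups if g]            # ignore empty genotype classes
    k = len(groups)                              # number of genotype classes
    n = sum(len(g) for g in groups)              # number of individuals
    if k < 2 or n <= k:
        return float("nan")                      # test is undefined
    grand = mean(v for g in groups for v in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def scan_markers(genotypes, phenotype):
    """Return {marker name: F statistic} for every marker."""
    results = {}
    for marker, calls in genotypes.items():
        classes = {}
        for call, y in zip(calls, phenotype):
            classes.setdefault(call, []).append(y)
        results[marker] = anova_f(list(classes.values()))
    return results

# Hypothetical toy data: six individuals, two markers coded 0/2.
phenotype = [10.0, 11.0, 10.5, 15.0, 16.0, 15.5]
genotypes = {
    "snp1": [0, 0, 0, 2, 2, 2],   # genotype classes track the phenotype
    "snp2": [0, 2, 0, 2, 0, 2],   # genotype classes do not
}
f_stats = scan_markers(genotypes, phenotype)
for name, f in f_stats.items():
    print(name, round(f, 3))
```

In this toy data set snp1 separates the low and high phenotypes and therefore receives a much larger F statistic than snp2.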
1.3.2 LD determines the resolution of GWAS
LD is the non-random association between alleles at pairs of loci; it depends on allele frequency and recombination and is generally measured with two statistics, r² and D’. In brief, r² summarizes both recombination and mutation history, while D’ reflects only recombination history. A main concern with D’ is that it is heavily affected by allele frequency, especially in a small population, because haplotypes carrying a rare allele are unlikely to be sampled in every combination. In contrast, r² shows relatively little bias in a small population and, in addition, reflects the correlation between markers and QTL. Therefore, r² is used much more commonly in GWAS (Flint-Garcia, Thornsberry, & Buckler, 2003). Since allele frequency and recombination determine LD, any factor that affects either of them can influence LD and consequently GWAS results (Flint-Garcia et al., 2003; Slatkin, 2008). Because allele frequency is an essential parameter of population history, migration, mutation, selection and population subdivision are all reflected in LD. Generally, migration and mutation, which introduce new genetic material into a population, increase genetic diversity and consequently decrease LD. Strong selection or a genetic bottleneck decreases genetic diversity and thereby creates LD in a population (Flint-Garcia et al., 2003; Slatkin, 2008).
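The difference between the two statistics can be made concrete with a small calculation. The sketch below assumes phased, biallelic haplotypes coded 0/1 (an assumption of this example; unphased genotype data would need haplotype frequency estimation first) and uses the standard definitions D = p_AB - p_A p_B, D’ = |D| / D_max and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)). The function name and the toy haplotypes are hypothetical.

```python
def ld_stats(haplotypes):
    """Compute (D, D', r2) from phased haplotypes.

    haplotypes: list of (allele_at_locus_A, allele_at_locus_B), alleles coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n                 # freq of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n                 # freq of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    # D_max is the largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, d * d / denom

# Hypothetical sample of ten haplotypes: the allele at locus B is rare
# and is observed only once, on a background carrying allele 1 at locus A.
sample = [(1, 1)] + [(1, 0)] * 4 + [(0, 0)] * 5
d, d_prime, r2 = ld_stats(sample)
print(round(d, 4), round(d_prime, 4), round(r2, 4))
```

Here only three of the four possible haplotypes are observed, so D’ reaches its maximum of 1 even though r² is only about 0.11. This illustrates why D’ can overstate LD when one allele is rare and the sample is small.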
Recombination in a natural population is largely determined by the mating system. In selfing species, extensive regions of LD are generally observed because alleles tend to become fixed after selfing (Huang et al., 2012; Yano et al., 2016). In addition, strong selection during domestication has extended LD to hundreds of kb, leading to coarse resolution in GWAS (Bauchet et al., 2017; Sauvage et al., 2014). To overcome this inherent disadvantage of selfing plants, discovering new materials of high genetic diversity or designing diverse population panels has become a common strategy. Populations consisting of hybrid genomes, multi-parent advanced generation intercross populations, or populations that include wild relatives can increase genetic diversity and consequently improve GWAS resolution (Bauchet et al., 2017; Crowell et al., 2016; Huang et al., 2012; Ranc et al., 2012). In addition, genotyping more markers in a world-wide collection can also capture higher diversity, again resulting in better resolution (Kim et al., 2007).
1.3.3 Population structure and kinship cause confounding effects in GWAS
Any factor that contributes to LD can inflate the significance of GWAS results, because these results are determined by the associations between markers and phenotypes (Huang & Han, 2014; Soto-Cerda & Cloutier, 2012; Korte & Farlow, 2013). Confounding arises when LD is generated solely by differences in allele frequency among families or subpopulations. The two main confounding effects are population structure, which reflects the distant common ancestry within a population, and kinship, which is the relatedness among individuals whose relationships are unknown (Astle & Balding, 2010). So far, the mixed linear model has been the standard procedure for correcting both confounding factors (Astle & Balding, 2010; Korol, Ronin, Itskovich, Peng, & Nevo, 2001; Yu et al., 2006; Zhang et al., 2010). However, population structure and kinship reflect part of the genetic nature of the population under study rather than a mere nuisance, and applying a correction indiscriminately can lead to underestimation of genetic effects (Vilhjálmsson & Nordborg, 2013). Therefore, correction is strongly recommended for candidate gene research but may be optional when investigating the genetic architecture of a given trait (Korte & Farlow, 2013).
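For orientation, the mixed linear model mentioned above is often written in the following general form. The symbols are assumptions of this sketch and are not taken from the text: y is the vector of phenotypes, X contains the tested marker and any other fixed covariates with effects β, Q contains the population structure covariates with effects v, Z is an incidence matrix linking observations to individuals, u is the vector of random polygenic effects whose covariance is proportional to the kinship matrix K, and e is the residual. The scaling of K (for example, a factor of 2) differs between authors.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad
\mathbf{u} \sim N\!\left(\mathbf{0},\, \sigma_g^{2}\mathbf{K}\right),
\qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\, \sigma_e^{2}\mathbf{I}\right)
```

In this form, population structure enters as a fixed effect through Q and kinship enters as a random effect through K, so dropping either term gives a model that corrects for only one of the two confounding factors.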
The most practical way to correct for population structure in GWAS is to incorporate a matrix derived from principal component analysis (PCA) or from STRUCTURE and/or ADMIXTURE. PCA transforms a large set of possibly correlated variables into a smaller set of linearly uncorrelated principal components (PCs) (Patterson, Price, & Reich, 2006). The first PC accounts for the largest share of the variance in the observations, and each succeeding PC accounts for the largest remaining variance under the constraint that it is orthogonal to the preceding components. By reducing the number of variables, PCs capture the main patterns in the genotypic data and distinguish genetic differences among samples. Therefore, PCA is widely applied to cluster the subpopulations of a study population, and the PCs are added to GWAS as a matrix of fixed effects (Price, Zaitlen, Reich, & Patterson, 2010). STRUCTURE and ADMIXTURE, on the other hand, are model-based clustering algorithms used to estimate the best number of subpopulations (K); STRUCTURE does so using posterior probabilities (Pritchard, Stephens, & Donnelly, 2000). The algorithm identifies the simplest haplotype patterns among individuals and then assigns each individual to subpopulations probabilistically. The best K can be determined from the natural logarithm of the probability of the data at each K or from delta K (Evanno, Regnaut, & Goudet, 2005; Pritchard et al., 2000). Once K is determined, the probability with which each individual is assigned to each subpopulation also reflects the proportion of its genome derived from each subpopulation. This probability matrix can then be added to GWAS as a fixed effect.
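How a PC separates subpopulations can be illustrated with a small, dependency-free sketch. It assumes genotypes coded as 0/1/2 allele counts, centers each marker, and extracts only the first PC by power iteration on the individual-by-individual cross-product matrix. Power iteration is chosen here to keep the example short, and dedicated software would normally use a full eigendecomposition. The function names and the toy genotypes are hypothetical.

```python
import math

def center_columns(matrix):
    """Subtract each marker's mean so that every column has mean zero."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def first_pc(genotypes, iterations=200):
    """Return (PC1 scores per individual, fraction of variance explained)."""
    x = center_columns(genotypes)
    n = len(x)
    # Cross-product (Gram) matrix between individuals.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in x] for ri in x]
    v = [float(i + 1) for i in range(n)]         # deterministic start vector
    for _ in range(iterations):
        w = [sum(g * vj for g, vj in zip(row, v)) for row in gram]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            raise ValueError("genotype matrix has no variation")
        v = [c / norm for c in w]
    # Rayleigh quotient gives the leading eigenvalue for the unit vector v.
    eigenvalue = sum(vi * sum(g * vj for g, vj in zip(row, v))
                     for vi, row in zip(v, gram))
    total = sum(gram[i][i] for i in range(n))    # total variance (trace)
    scores = [vi * math.sqrt(eigenvalue) for vi in v]
    return scores, eigenvalue / total

# Hypothetical toy data: individuals 0-2 and 3-5 come from two subpopulations.
toy = [
    [0, 0, 2, 2],
    [0, 0, 2, 1],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 1],
    [2, 1, 0, 0],
]
scores, explained = first_pc(toy)
print([round(s, 2) for s in scores], round(explained, 2))
```

The first three individuals receive PC1 scores of one sign and the last three receive scores of the opposite sign, so a single column of scores already encodes the subpopulation split and could be supplied to GWAS as a fixed-effect covariate. The sign of a PC is arbitrary.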
Kinship refers to the degree of genetic relatedness and is traditionally estimated from identity by descent (IBD) when pedigree information is available (Jacquard, 1972). In a germplasm of unknown pedigree, two identical alleles may be either IBD or identical merely through random sampling from the gene pool. Hence, kinship can be adjusted for allele frequency and treated as the correlation coefficient between pairs of individuals (Anderson & Weir, 2007). Kinship is generally fitted as a random effect in GWAS because relatedness has traditionally been used to estimate the variance of heritable components (Astle & Balding, 2009; Yu et al., 2006; Zhang et al., 2010).
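The idea of kinship as an allele-frequency-adjusted correlation between individuals can be sketched as below. This is one common marker-based estimator, chosen for illustration; it is not necessarily the estimator used in the works cited above. Genotypes are assumed to be coded as 0/1/2 allele counts, each marker is standardized by its sample allele frequency, and the relatedness of two individuals is the average product of their standardized genotypes. The function name and the toy data are hypothetical.

```python
def kinship_matrix(genotypes):
    """Marker-based relatedness matrix from genotypes coded 0/1/2.

    genotypes: list of individuals, each a list of allele counts per marker.
    Monomorphic markers are skipped because they cannot be standardized.
    """
    n = len(genotypes)
    columns = list(zip(*genotypes))
    used = []
    for col in columns:
        p = sum(col) / (2 * n)                   # sample allele frequency
        if 0 < p < 1:
            used.append((col, p))
    if not used:
        raise ValueError("no polymorphic markers")
    k = [[0.0] * n for _ in range(n)]
    for col, p in used:
        sd = (2 * p * (1 - p)) ** 0.5            # expected genotype std. dev.
        z = [(g - 2 * p) / sd for g in col]      # standardized genotypes
        for i in range(n):
            for j in range(i, n):
                k[i][j] += z[i] * z[j]
    for i in range(n):
        for j in range(i, n):
            k[i][j] /= len(used)                 # average over markers
            k[j][i] = k[i][j]                    # fill the symmetric half
    return k

# Hypothetical toy data: individuals 0 and 1 share an identical genotype.
toy_genotypes = [
    [2, 2, 0, 1],
    [2, 2, 0, 1],
    [0, 0, 2, 1],
    [0, 1, 2, 0],
]
kin = kinship_matrix(toy_genotypes)
for row in kin:
    print([round(v, 2) for v in row])
```

Individuals 0 and 1 obtain a high pairwise value, while pairs that carry opposite alleles at most markers obtain negative values, meaning they are less similar than two individuals drawn at random from this sample. A matrix of this kind is what would define the covariance of the random effect in the mixed model.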