
3 MATERIALS AND METHODS

3.4 Gene-gene interaction detecting methods

All five methods were applied to our genotype-based and haplotype-based data to detect marginal effects and two-way and three-way interactions.

Chi-square test. A chi-square test (also chi-squared or χ2 test) is any statistical hypothesis test in which the test statistic has a chi-square distribution when the null hypothesis is true, or in which the distribution of the test statistic under the null hypothesis can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough. In this study, we used the chi-square test as a benchmark, applying it in a two-step approach: (i) all markers are individually tested and ranked for marginal association with disease; (ii) the markers with a p-value less than 0.05 are selected, and all two-way and three-way interactions among them are tested and ranked for association.
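The two-step procedure can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the 0/1/2 genotype coding, and the dictionary layout are assumptions. The marginal p-value uses the closed form exp(-x/2) for a chi-square with two degrees of freedom, which presumes that all three genotypes of a marker are observed.

```python
from itertools import combinations
from math import exp


def chi2_stat(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # skip empty rows/columns
                stat += (observed - expected) ** 2 / expected
    return stat


def marginal_table(genotypes, status):
    """2 x 3 table: rows = control (0) / case (1), columns = genotype 0/1/2."""
    table = [[0, 0, 0], [0, 0, 0]]
    for g, s in zip(genotypes, status):
        table[s][g] += 1
    return table


def two_step_candidates(data, status, alpha=0.05):
    """data: dict marker name -> list of genotype codes; status: list of 0/1.

    Step (i): rank markers by marginal p-value.
    Step (ii): keep markers with p < alpha and enumerate the two-way and
    three-way combinations that would then be tested.
    """
    ranked = []
    for name, genotypes in data.items():
        stat = chi2_stat(marginal_table(genotypes, status))
        p_value = exp(-stat / 2.0)  # chi-square upper tail for 2 df
        ranked.append((p_value, name))
    ranked.sort()
    kept = [name for p_value, name in ranked if p_value < alpha]
    pairs = list(combinations(kept, 2))
    triples = list(combinations(kept, 3))
    return ranked, kept, pairs, triples
```

The screening step keeps the number of interaction tests manageable, at the cost of missing interactions among markers that show no marginal effect.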

Here is an example of testing association with the χ2 test. To test for a two-way interaction between two biallelic markers (each with three genotypes), there are nine possible genotype combinations, so we can use a χ2 test with eight degrees of freedom. For higher-order interactions, the chi-square test faces the sparse data problem and the χ2 approximation can be poor. In this situation, we can use Fisher’s exact test or the Monte Carlo test provided in R (Hope, 1968), in which the simulation is done by random sampling from the set of all contingency tables with the given marginals.
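As a rough illustration of the two-way test, the sketch below builds the 2 × 9 table, evaluates the eight-degree-of-freedom χ2 tail probability, and computes a simulated p-value. It is not R's implementation: the function names and genotype coding are assumptions, and the simulation permutes case/control labels, which is one simple way to sample tables with the observed marginals held fixed.

```python
import random
from math import exp, factorial


def chi2_stat(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square with even df (closed form, e.g. df = 8)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only holds for positive even df")
    half = x / 2.0
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))


def two_way_table(g1, g2, status):
    """2 x 9 table: rows = control/case, columns = genotype pair 3*g1 + g2."""
    table = [[0] * 9 for _ in range(2)]
    for a, b, s in zip(g1, g2, status):
        table[s][3 * a + b] += 1
    return table


def monte_carlo_p(g1, g2, status, n_sim=2000, seed=0):
    """Simulated p-value: shuffle disease labels, keeping both margins fixed."""
    rng = random.Random(seed)
    observed = chi2_stat(two_way_table(g1, g2, status))
    shuffled = list(status)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if chi2_stat(two_way_table(g1, g2, shuffled)) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_sim + 1)
```

The simulated p-value does not rely on the large-sample approximation, so it remains usable when many of the nine cells have small counts.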

Logistic regression model. One traditional approach still widely used today is regression. In particular, logistic regression is used when the outcome variable is discrete, for example, disease status. Logistic regression enables direct modeling of the mathematical relationship of genetic and other risk factors to disease status.

However, this ‘workhorse’ suffers from the curse of dimensionality: as the distribution of data across numerous combinations of factors becomes sparse, the parameter estimates become unreasonably biased, particularly when the ratio of sample size to independent variables falls below ten to one [14].

To overcome this problem, we also used a two-step approach for the logistic regression model (LRM): (i) all markers are individually tested and ranked for marginal association with disease by LRM; (ii) the top 20% of markers are selected, and all two-way and three-way interactions among them are tested and ranked for association.

To illustrate the LRM method, we describe, for simplicity, the testing of two-way interactions in genotype-based data. Each marker has three possible genotypes, so we use two dummy variables for each SNP to fit the model:
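A model consistent with this description, written in assumed notation (the symbols $x_{A1}, x_{A2}$ for the two dummy variables of the first SNP, $x_{B1}, x_{B2}$ for those of the second, and the coefficient labels are illustrative and not taken from the original), would be a logistic model with main-effect and product terms:

```latex
\operatorname{logit} P(Y = 1)
  = \beta_0
  + \beta_1 x_{A1} + \beta_2 x_{A2}
  + \beta_3 x_{B1} + \beta_4 x_{B2}
  + \beta_5 x_{A1} x_{B1} + \beta_6 x_{A1} x_{B2}
  + \beta_7 x_{A2} x_{B1} + \beta_8 x_{A2} x_{B2}
```

Here $Y$ denotes disease status, and the four product terms carry the two-way interaction.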

Interaction effects were tested using a likelihood ratio test (LRT) statistic, which was compared with a χ2 distribution with four degrees of freedom. Note that the LRM differs from the chi-square test. The chi-square test tests main effects as well as interaction effects; that is, for a two-way model with strong main effects but only weak interaction effects, the chi-square test still gives a significant result. The LRM, in contrast, tests only the interaction effects.
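As a sketch of this test, assume a full model that contains the main effects of both SNPs plus the four products of their dummy variables, and a reduced model that contains the main effects only. Writing $\ell$ for the maximized log-likelihood (notation assumed here, not taken from the original), the LRT statistic would be:

```latex
\Lambda = -2 \left[ \ell_{\text{reduced}} - \ell_{\text{full}} \right],
\qquad
\Lambda \sim \chi^2_{4} \quad \text{approximately, under } H_0 \text{: all four interaction coefficients are zero.}
```

Because the main effects appear in both models, they cancel out of the comparison, which is why only the interaction is tested.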

The LRM still faces the sparse data problem, in which the LRT can have zero degrees of freedom. In this situation, the main effects explain all the variation, and we consider that there are no interaction effects.

BEAM. BEAM (Bayesian epistasis association mapping) uses Markov chain Monte Carlo (MCMC) to ‘interrogate’ each marker iteratively, conditional on the current status of the other markers, and outputs the posterior probability that each marker and/or epistatic interaction is associated with the disease.

The method can be used either in a ‘pure’ Bayesian sense or just as a tool to discover potential ‘hits’. In the former, one relies on the reported posterior probabilities to make inferential statements; in the latter, one takes the reported hits and uses another procedure to test whether they are statistically significant. The latter approach is more robust to model selection and prior assumptions (such as Dirichlet priors with arbitrary parameters) and is less prone to the slow mixing problem in the MCMC computational procedure. BEAM also proposes the B statistic to facilitate the latter approach [7]. Figure 1 shows an example of the posterior probabilities of association for each marker obtained by applying BEAM to our genotype-based data. Two SNPs, rsDAO_7 and rsDAO_8, have a posterior probability above 0.5.

Figure 1. Example of posterior probabilities of association for each marker by applying BEAM to our genotype-based data. Two SNPs, rsDAO_7 and rsDAO_8, have a posterior probability above 0.5.

We used BEAM to detect both single-marker and epistatic associations in our genotype-based and haplotype-based data. Markers that BEAM reported with a posterior probability of association with disease were then examined using the B statistic. We then ranked the associations by the B statistic for one-way, two-way, and three-way interactions.

CART. Decision trees date back to the early 1960s with the work of Morgan and Sonquist. Breiman and colleagues published the first comprehensive description of recursive partitioning methodology. As a powerful data analysis method, trees are used in many fields, such as epidemiology and medical diagnosis, and provide an alternative to more standard model-based regression techniques for multivariate analyses [1]. We use the S implementation [8] in the present study. Through binary recursive partitioning, a tree successively splits the data along the coordinate axes of the predictors such that, at each division, the resulting two subsets of data are as homogeneous as possible with respect to the response of interest. Deviance is a natural splitting criterion based on likelihood values.
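The deviance-based splitting step can be illustrated with a small sketch. It is not the S implementation: the function names and the count layout are assumptions, and it handles only a single split of one nominal predictor with a binary response. Node deviance is taken as -2 Σ n_k log(n_k / n) over the two classes, and the chosen split is the binary grouping of genotype levels that reduces the total deviance most.

```python
from itertools import combinations
from math import log


def node_deviance(n_case, n_control):
    """Deviance of a node: -2 * sum over classes of n_k * log(n_k / n)."""
    n = n_case + n_control
    deviance = 0.0
    for count in (n_case, n_control):
        if count > 0:  # an empty class contributes nothing
            deviance -= 2.0 * count * log(count / n)
    return deviance


def best_binary_split(counts):
    """counts: dict genotype level -> (n_case, n_control).

    Tries every way of dividing the levels into two nonempty groups and
    returns (deviance reduction, left levels, right levels) for the best one.
    """
    levels = sorted(counts)
    total = [sum(counts[g][i] for g in levels) for i in (0, 1)]
    parent = node_deviance(*total)
    first, rest = levels[0], levels[1:]
    best = None
    # The left group always contains `first`, so each partition is tried once;
    # taking fewer than len(rest) extra levels keeps the right group nonempty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = (first,) + extra
            right = tuple(g for g in levels if g not in left)
            left_n = [sum(counts[g][i] for g in left) for i in (0, 1)]
            right_n = [sum(counts[g][i] for g in right) for i in (0, 1)]
            gain = parent - node_deviance(*left_n) - node_deviance(*right_n)
            if best is None or gain > best[0]:
                best = (gain, left, right)
    return best
```

A full tree would apply the same search to every predictor at every node and recurse on the two resulting subsets until a stopping rule is met.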

We used the S defaults in our study: a node must include at least 10 observations, and the minimum node deviance required for tree growing to continue is 1% of the root node deviance. The subsets that are not split further are the terminal nodes. The SNP variables were treated as nominal categorical variables. We built the tree and then pruned it to a smaller tree using the deviance criterion (with the best tree size set to 5). Figure 2 is an example of applying CART to our genotype-based data.

Investigating the terminal nodes of the tree provides a natural way to identify interactions. For example, we can calculate the chi-square statistic for each terminal node and then rank the associations by that statistic. Note that we did not use CART to analyze our haplotype-based data because of computational limitations: in haplotype-based data, the block variables have too many categories, and S limits the number of levels a factor predictor variable may have.

Figure 2. Example of applying CART to genotype-based data

MDR. Multifactor dimensionality reduction (MDR) is a model-free, nonparametric approach: it does not assume any particular genetic model and does not estimate any parameters. With MDR, multilocus genotypes are pooled into high-risk and low-risk groups, effectively reducing the dimensionality of the genotype predictors from N dimensions to one. The new one-dimensional multilocus genotype variable is evaluated for its ability to classify and predict disease status using cross-validation and permutation testing. MDR identifies interactions through an exhaustive search; that is, it searches over all possible factor combinations to find those with an effect on the outcome variable.

We simply used the MDR default settings to detect gene-gene interactions in our two types of data. Note that we used the MDR v1.1.0 software in this study, which differs in some respects from the original version described in the paper [10]. In the current version, the interaction with the lowest classification error (averaged over the ten cross-validations) is selected as the best model for each k-way interaction, and the interaction that maximizes the testing accuracy is selected as the final best overall model across all k-way models.
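The pooling step of MDR can be sketched as below. This is an illustration under stated assumptions, not the MDR v1.1.0 software: the function names are hypothetical, the high-risk threshold of 1.0 on the case:control ratio and the treatment of ties as high risk are assumptions, unseen genotype combinations are arbitrarily predicted as low risk, and the folds are formed by simple interleaving rather than random assignment.

```python
from collections import defaultdict


def mdr_pool(genotypes, status, threshold=1.0):
    """Label each multilocus genotype cell as 'high' or 'low' risk.

    genotypes: list of tuples (one tuple of genotype codes per subject)
    status:    list of 0 (control) / 1 (case)
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for cell, s in zip(genotypes, status):
        counts[cell][s] += 1
    labels = {}
    for cell, (n_control, n_case) in counts.items():
        # Compare cases with threshold * controls to avoid dividing by zero.
        labels[cell] = "high" if n_case >= threshold * n_control else "low"
    return labels


def classification_error(labels, genotypes, status, default="low"):
    """Fraction of subjects whose status disagrees with their cell's label."""
    wrong = 0
    for cell, s in zip(genotypes, status):
        predicted = 1 if labels.get(cell, default) == "high" else 0
        if predicted != s:
            wrong += 1
    return wrong / len(status)


def cv_error(genotypes, status, k=10):
    """Average held-out error over k interleaved folds (needs >= k subjects)."""
    n = len(status)
    errors = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        labels = mdr_pool([genotypes[i] for i in train],
                          [status[i] for i in train])
        errors.append(classification_error(labels,
                                           [genotypes[i] for i in test],
                                           [status[i] for i in test]))
    return sum(errors) / k
```

An exhaustive search would call `cv_error` once for every combination of k markers and keep the combination with the lowest average error.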
