族群分類及/或錯誤資料下對基因-環境/基因-基因交互作用的效應及其調整

(1)

For Peer Review

A Robust TDT-Type Association Test under Informative Parental Missingness

Journal: Statistics in Medicine Manuscript ID: Draft

Wiley - Manuscript type: Paper Date Submitted by the

Author:

Complete List of Authors: Chen, Jin-Hua; China Medical University, Graduate Institute of Biostatistics

Cheng, K.F.; National Central University, Statistics(Biostatistics Branch); China medical University, Biostatistics Center and Department of Public Health

Keywords: Association test, Case-parents study, Informative missigness, Robustness, Transmission /disequilibrium test

(2)

For Peer Review

A Robust TDT-Type Association Test under Informative Parental

Missingness

J.H. Chena and K.F. Chengb,c

a

Biostatistics Center and Graduate Institute of Biostatistics, China Medical University,

Taichung, Taiwan (ROC) b

Biostatistics Center and College of Public Health, China Medical University, Taichung,

Taiwan (ROC) c

Graduate Institute of Statistics, National Central University, Chungli, Taiwan (ROC)

Short Title: Robust TDT-type Association Test

Correspondence to Professor: K.F. Cheng, Biostatistics Center, China Medical University,

Taichung, Taiwan (ROC). E-mail: [email protected].

Phone number: 886-4-2207-8539. Fax: 886-4-22078539.

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(3)

For Peer Review

Many family-based association tests rely on the random transmission of alleles from parents

to offspring. Among them, the transmission/disequilibrium test (TDT) may be considered to

be the most popular statistical test. The TDT statistic and its variations were proposed to

evaluate nonrandom transmission of alleles from parents to the diseased children. However, in

family studies, parental genotypes are not always available. Quite often, the offspring

genotype affects the severity of offspring phenotype or/and the age at onset and in turn affects

the parental missingness. In such case, the nonrandom transmission of alleles may also occur

even when the gene and disease are not associated. As a consequence, the usual TDT-type

tests would produce excessive false positive conclusions in association studies. In this paper,

we propose a TDT-type association test which is not only simple in computation but also

robust to the joint effect of population stratification and informative parental missingness. The

test statistic does not rely on any model and also allows for having different mechanisms of

parental missingness across subpopulations. We use a simulation study to compare the

performance of new test and the TDT and point out the advantage of the new method.

Keywords: Association test; Case-parents study; Informative missigness; Robustness;

Transmission /disequilibrium test

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(4)

For Peer Review

1. Introduction

Testing association between genetic markers and disease usually consists of a comparison of

genotypes from a sample of diseased individuals with those from a certain sample of

nondiseased individuals. The usual case-parents study suggests using genotype data of the

diseased children and their parents for making inference about gene-disease association. Well

known tests based on parental controls include the transmission/disequilibrium test (TDT)

proposed by Spielman et al. [1], and the conditional-on-parental-genotypes (CPG) tests

proposed by Schaid and Sommer [2] (see also [3]-[7]) for related approaches. The TDT and

the CPG tests are identical under additive genetic model. However, the CPG approach is

generally more powerful than the TDT approach under other genetic models.

In case-parents study, the cases and controls are matched in genetic ancestry. Thus, the

analysis based on the TDT or CPG tests is free of bias arising from population stratification.

This is an important property for valid association tests. However, these tests may still

produce biased results if informative parental missingness exists in the study. The effect of

missing parental genotype and its correction were studied by Clayton [8], Sun et al. [9],

Weinberg [10], Cervino and Hill [11], Allen et al. [12], and Chen [13] (see also Robinwitz and

Laird [14]; Robinwitz [15] for tests based on general families.) However, many of these

methods often require assumptions such as missing-at-random (MAR, conditional on

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(5)

For Peer Review

offspring and available parent, the genotype frequencies among missing parents and among

observed parents are the same) or missing-independent-of-offspring-genotype (MIOG,

conditional on parental genotypes, the parental missingness is independent of offspring’s

genotype). Since there is no genotype information available on the missing parents, thus these

important assumptions are usually difficult to justify in real applications. Another assumption

also often required in some association tests is that the response probabilities of parents can be

modeled by the same parametric function across all families in the study sample. For example,

Allen et al. [12] required that the response-odds parameters satisfy relatively simple models

across all studied families. This assumption may not be credible either, if the overall

population consists of several subpopulations and response rates have different forms across

subpopulations.

In the paper, we first point out that when there is no disease-gene association, and both

parents are observed, the probability of offspring’s genotype conditional on the parental

genotypes (general CPG probabilities) are no longer the same as the usual Mendelian

proportions, if the parental missingness also depends on the offspring genotype. In this case,

many tests such as TDT or its variations, depending on using the properties of

Schaid-Sommer’s CPG probability, would produce biased association results. This particular

case may occur when, for example, the offspring genotype affects the severity of offspring

phenotype or/and the age at onset and in turn affects the parental missingness.

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(6)

For Peer Review

According to the previous discussions, we find that in the literature, there exists no

association test which is simultaneously robust to the effects of population stratification and

general informative parental missingness. In this paper, we intend to propose a truly robust

association test based on case-parents data. The proposed test is novel, simple, and derived

from using the conditional probability of the offspring’s genotype given parental genotypes

when they are both observed. Thus, it is robust to the effect of population stratification. We

emphasize that the new test does not require any assumption or model for parental

missingness. That is, we let the probability of parental missingness simultaneously depend not

only on the parental genotypes but also the offspring’s genotype, and be model free. In the

case of population stratification, we also allow this probability depend on the ethnicity. Thus

the mechanism of the missingness considered in this note is the most general form of

informative parental missingness (GIPM), under which many important association tests may

become invalid. In this note, we also present simulation results to compare the performance of

the usual TDT test and the new test using only the complete case-parents data. Under some

scenarios where the MIOG condition fails, we show the TDT test tends to have excessive

false positive association results. This indicates that many approaches based on the

Schaid-Sommer’s CPG probability when both parental genotypes are observed [12, 13] may

be invalid too. In contrast, the new test has satisfactory performance in the sense that its type I

error can be approximately controlled at the desired significance level and its power is in

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(7)

For Peer Review

general sufficiently large so that at least moderate genetic effect can be detected using

reasonable number of family data. In the simulation study, we consider scenarios where

conditions such as MAR, MIOG or GIPM are satisfied. We also consider the situation where

the general population consists of two subpopulations with different allele frequencies at the

candidate marker and different mechanisms for the parental missingness. Under all conditions

studied in the simulation, we find out that the new test is insensitive to the joint effect of

population stratification and GIPM.

2. Method

We assume that the candidate gene has two alleles, coded as a (normal allele) and A

(candidate disease allele), or can be divided into two groups of alleles. The genotype of the

diseased offspring is denoted by G₀. The set of parental genotypes is denoted by (G G_m, _f ), where G_m is the maternal genotype and G_f is the paternal genotype. G₀ represents the number of copies of the A allele in the offspring genotype (taking the values 0, 1, and 2) with

the same convention for G_m and G_f . The missing pattern is denoted as (R_m,R_f), where R_m

(R_f) equals one if the maternal (paternal) genotype is available in the study and zero, otherwise.

In the following discussion, we focus on using complete family trios, where both parental

and maternal genotypes are observed. The probability of an offspring genotype G₀

conditional on his/her parental genotypes(G G_m, _f), parental missingness pattern (R_m,R_f) = (1,

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(8)

For Peer Review

1), and offspring’s phenotypeD₀ is given by

0 0 0 2 0 0 ( , ) ( , ) [ | , ] , ( , ) [ | , ] m f G m f G m f g m f g m f g G G G G P G G G G G P G g G G γ ϕ γ ϕ = =

∑ ∑

(2.1) where 0 [ 0| 0] [ 0| 0 0] G P D G P D G

ϕ = = are the genotype relative risk parameters, and

0( , ) [ 1, 1| 0, , , 0] [ 1, 1| 0 0, , , 0]

G G Gm f P Rm Rf G G Gm f D P Rm Rf G G Gm f D

γ = = = = = = is a ratio

of missingness probabilities under offspring genotype G₀versus that under baseline. Note that the general CPG probability (1) is derived under the usual assumption that the offspring’s

phenotype and parental genotypes are independent conditional on the offspring’s genotype. If

the overall population consists of several subpopulations, we require this assumption to be

held within each subpopulation too. We point out that the general CPG probability can be

reduced to the Schaid-Sommer’s CPG probability [2], if

0( , )

G G Gm f

γ is a constant with

respect to G₀. The latter condition holds when, for example, the MAR or MIOG condition holds. On the other hand, if

0( , )

G G Gm f

γ is not a constant, then any test based on the

Schaid-Sommer’s CPG probability may be invalid.

The general CPG probability depends on the relative risk parameters, ratios of missingness

probabilities, and Mendelian proportions. If we define ε_g(G G_m, _f)

0(G Gm, f) g(G Gm, f) 1 g(G Gm, f)

γ γ γ

= − = − and assume that with respect to g,

( , ) g G Gm f

ε are small and approximately equal (denoted asε(G G_m, _f) ) for each fixed

(G G_m, _f), then the general CPG probabilities can be greatly simplified after applying Taylor’s expansion. Note that this assumption essentially requires that the probability of parental

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(9)

For Peer Review

missingness do not deviate too much under different offspring’s genotypes. Simulation results

presented in this paper confirm that even the differences are moderate

( εg(G Gm, f)−ε(G Gm, f) <0.1 ), the test proposed in this paper still has satisfactory

performance. In contrast, the usual TDT has type I errors seriously inflated under this scenario.

In our formulation of the testing procedure we consider approximations of the general CPG

probabilities by ignoring all terms involving εg(G Gm, f) ,a a≥2 in their Taylor’s expansions.

Under the null hypothesis of no gene-disease association, the first-order approximations of the

general CPG probabilities are given in Table I.

In view of the approximation results of Table I, we consider association analysis using only

the informative family data. Let Pˆ_{2 3}₊ ( )i denote the sample proportion of an offspring carrying i risk alleles under parental mating types 2 or 3. Pˆ_{7 8}₊ ( )i and P iˆ ( )₆ represent similar sample proportions under parental mating types 7 or 8 and mating type 6, respectively. The results in

Table I imply that

2 3 2 3 6 6 7 8 7 8

2 3 (2) (1) 6 2 (2) (1) 7 8 (0) (1) 0,

S =N ₊ _P + −P + _+N _ P −P _+N ₊ _P + −P + _≈ (2.2) under null association. The variance estimate of Sis given by

2 3 2 3 _{2 3} 6 6 6 6 6 6 ₆ 7 8 7 8 _{7 8} = 4 (1)(1 (1)) 4 (2)(1 (2)) (1)(1 (1)) 4 (2) (1) 4 (1)(1 (1)) Var P P N P P P P P P N P P N + + ₊ + + ₊  ₋  ₊ ₋ ₊ ₋ ₊        +_ − _

where N N_k( _k₊_j)is the number of complete families with mating type k_{(mating types} kor

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(10)

For Peer Review

).

j Thus, a simple TDT-type association test can be defined as 2

/ .

T =S Var _{The P-value of}

the test is given by 2 1

Pr[χ ≥T],where 2 1

χ is a chi-square random variable with one degree of

freedom. We point out that the test is still valid under population stratification, where the

probability function of parental missingness differs in subpopulation.

3. Simulation Results

We have conducted a simulation study to investigate the performance of the new association

test Tand compared the results with those for the traditional TDT based on the complete trios.

According to Chen [13], the methods of Allen et al. [12] and Chen [13] had the best overall

performance under various missingness models satisfying MIOG condition. However, under

complete trios, the methods of Allen et al. and Chen are the same as or variations of the

traditional TDT, thus we excluded their methods in our simulation study. To study the

performance of type I error, we assumed the relative risks satisfied ϕ₁=ϕ₂ = in the 1

simulations. To study the power performance, we considered three genetic models: dominant

model with ϕ₁ =ϕ₂ =5 , recessive model with ϕ₁ =1,ϕ₂ =5 and additive model

withϕ₁=5,ϕ₂ = . 9

In the simulation study, we considered three missingness models satisfying MAR, MIOG,

or GIPM condition, respectively. We assumed that the joint missingness probability was the

product of maternal and paternal missingness probabilities:

( 1, 1| , , , 1) ( 1| , , 1) ( 1| , , 1). m f m m f f c c P R R G g G g G g D P R G g G g D P R G g G g D = = = = = = = = = = = × = = = = 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(11)

For Peer Review

We also assumed that each marginal missingness probability satisfied a logistic regression

model: 1 ( 1| , , 1) 1 exp( ) m m m c c m m m c m P R G g G g D g g α β γ = = = = = + − − − and 1 ( 1| , , 1) . 1 exp( ) f f f c c f f f c f P R G g G g D g g α β γ = = = = = + − − −

Under MAR condition, we assumed α_m=1.7346, α_f =1.0986, and the remaining parameter values were zeroes. This is equivalent to assuming maternal response rate equal to a constant

0.85 and paternal response rate equal to 0.75. Under MIOG condition, we assumed

m

α = 1.3863, β_m = −0.5390 , α_f =0.8473 , β_f = −0.4418 , and γ_m =γ_f =0. This is equivalent to having maternal response rate ranging from 0.5765 to 0.8000 and paternal

response rate ranging from 0.4909 to 0.7000. Two models satisfying GIPM condition were

assumed in the study. GIPM (1) model assumedα_m =1.7346, β_m = −0.2183, γ_m = −0.3445,

1.3863, f

α = β_f = −0.1206, and γ_f = −0.2559. This is equivalent to assuming maternal response rate ranging from 0.6523 to 0.8500 and paternal response rate ranging from 0.6532

to 0.8000. Note that this is a weak GIPM model. GIPM (2) model assumed

0.8473,

m

α = β_m =0.2513, γ_m =0.3466,α_f =0.4055, β_f =0.0827,and γ_f =0.1614. In this case, the range of maternal response rate is (0.5400, 0.7000) and that for the paternal response

rate is (0.4800, 0.6000). This is a moderate GIPM model.

We also studied the effect of population stratification. We assumed that the studied

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(12)

For Peer Review

population consisted of two subpopulations with high risk allele frequencies p₁ =0.4, and

2 0.2,

p = respectively, and each subpopulation satisfied Hardy-Weinberg equilibrium condition. We assumed the total complete trios for study is 300 and the proportion pof the

family trios is from subpopulation 1. If p= 1( p= 0) then the studied population was

subpopulation 1 (2) with allele frequency 0.4 (0.2). The simulation results reported in the

tables are based on 10,000 replications. Each size (or power) is the proportion of times that

10,000 simulated p-values<0.05.

In Tables II and III, we report the simulated sizes and powers of the association tests Tand

TDT under different combinations of missingness model and population structure. The results

in Table II were based on one population and therefore there was no effect of population

stratification. Under this case, the range of the size of the Ttest was (0.0506, 0.0565) and

that of the TDT was (0.0519, 0.2685), when the risk allele frequency was 0.4. On the other

hand, when the risk allele frequency became 0.2, the corresponding ranges changed to

(0.0534, 0.0762) and (0.0491, 0.1807), respectively. These results showed that the size of the

new test was basically consistent with the nominal value of 5% under most simulation

conditions. The exceptional case occurred when the allele frequency was small and GIPM

level was moderate. In contrast, the size of the TDT tended to be inflated under GIPM models.

The amount of increase in size also depends on the GIPM level. Under the same case, the

powers of the T test were in general greater than 0.9800. The exceptional cases occurred

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(13)

For Peer Review

when the allele frequency was high and the genetic model was additive or allele frequency

was low and genetic model was recessive. However, we pointed out that the power of the new

test was at least 0.70 under combinations of any genetic model and GIPM model. This

indicates that the new test is rather efficient. The results in Table III were derived under two

subpopulations with identical or different missingness models. Under these cases, the effects

of population stratification were present. Therefore, from Table III one can study the joint

effects of population stratification and GIPM when the new test T or TDT were used.

According to Table III, we first found out that the size of the new test ranged from 0.0530 to

0.0586 and that of TDT ranged from 0.0528 to 0.2075 under all study conditions. This means

that using the new test, we were able to control its type I error at the predetermined

significance level, while the TDT cannot. It is also of interest to point out that the new test

seems to have better power performance when there is population stratification, comparing

with that under no population stratification. Table III showed that the power of the new test

were in general greater than 0.900. The exceptional case happened under MAR and additive

genetic model where the smallest power was 0.7782. These results concluded that the new test

was efficient in detecting true associations under population stratification and any missingness

4. Real data analysis

We considered a real study to investigate the performance of the TDT and new association

test under null association. The study was to examine transforming growth factor beta-1 SNPs

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(14)

For Peer Review

in relation to asthma risk and degree of atopy among 546 case-parent triads ( Li et al.[16] ),

consisting of asthmatics aged 4-17 years and their parents in Mexico City. Five SNPs were

considered in the study. Here, we focus only on SNP rs8179181. Both TDT and the new test

showed that no statistically significant association exists between this SNP and asthma risk

(P-value=0.457901, and 0.797963, respectively). We used GIPM model

(α_m =1.9924,β_m = −0.2578,γ_m = −0.3180α_f =1.7346, β_f = −0.3483,and γ_f = −0.2685.) as described above to randomly generate incomplete family triads. Figure 1 shows the p-value

histograms for the TDT and the new test based on 10,000 replications. The original study

has136 informative families (consisting of at least one heterozygous parent) . Under our

missingness model, the averaged number of informative and complete families is 92. That is,

about 1/3 of the informative families have missing parental genotypes. The figure shows that

the TDT has excessive number of small p-values, indicating that the analysis based on the

TDT has produced too many false positive results. In contrast, the new association test still

maintains satisfactory performance under complicated missigness scenario.

5. Discussion

Several family-based tests of association or linkage of genetic marker and a diseases

susceptible locus have proposed in the literature. These tests have gained popularity because

of their insensitivity to population stratification. However, these tests may still be biased

because of missing parental information, which would be typical for diseases of old age.

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(15)

For Peer Review

Some of these tests accommodate missing parental information, but they also require

important assumptions such as MAR or MIOG. Unfortunately, these assumptions are difficult

to justify based on the incomplete family data, particularly when the population under study is

heterogeneous. Under our simulation settings, we found that if the parental missingness also

depended on the genotypic outcome of the diseased offspring, then the largest empirical type I

error rate of the usual TDT, based on using 300 complete trios, would be 0.2685, when in fact

the predetermined significance level was only 0.0500. Since many recently proposed tests for

correcting bias in case-parents studies, by Allen et al. [12], or Chen [13] for examples, were

the same as or a variation of the TDT under complete trios, therefore, one needs to be cautious

in using these tests. Guo et al. [16, 17] considered the missing parental haplotype problem

based on the EM algorithm approach. However, they also assumed that MAR or MIOG

conditions were satisfied.

We note that under general parental missingness, Rabinowitz [15] also developed an

analysis based on a regression-adjusted score statistic to adjust for population heterogeneity.

The proposed method provided a general framework for developing valid association tests

with incomplete family data. However, the test depends on the choice of score vector and

specification of the conditional probability of the missing genotype(s). Guidance on the

choice of these important functions and the related sensitivity analysis so far remain unsolved.

In this paper, we consider a simple TDT-type test based on complete families with at least

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(16)

For Peer Review

one heterozygous parent. The test statistic depends on the proportions of the transmission of

the risk allele from parents to their diseased children. Thus it is simple in computation and

robust to the effect of population stratification. The test allows the parental missingness

depending on all genotype information of the family and the subpopulations involved in the

study. It is also nonparametric in the sense that there is no model ever being used in the

analysis. We remark that our analysis is based on using those family data where both parents

respond to the study. In the development of the new test we have used a Taylor’s expansion

for the joint response probability conditional on the offspring’s genotype, with the

requirement that the conditional probability does not deviate too much with respect to the

offspring’s genotype. Thus, theoretically speaking, if the offspring’s genotypic outcome

would greatly influence the parental missingness, then the approximation used in the analysis

may not be valid and the new test could be biased too. However, according to our simulation

results, if the differences of these conditional response probabilities are less than 10%, the

performance of our new test is still satisfactory. We consider such differences to be rather

reasonable in practical applications, especially when the parental response rates are moderate

or high.

Many family-based association tests also include incomplete trios, such as dyads or monads,

in their analysis. However, the trade-off is that they also require strong assumptions such

MAR or MIOG be satisfied. To keep full robustness and model-free in our association

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(17)

For Peer Review

analysis, we find that the genotype data from incomplete families contribute no additional

information, if the approach for analyzing complete data was modified for incomplete data.

This is because that the probability of the offspring’s genotype conditional on the (one)

observed parent’s genotype still has two unknown parameters under the null hypothesis. So

far, it is not clear if there exists such a method that includes incomplete trios in the analysis

without making any assumption about the probability of missingness. It is of interest to

investigate this issue in the future.

Acknowledgements

This research was supported in part by a grand from National Science Council and a joint

research grand from China Medical University and Asia University.

References

1. Spielman, R. S., McGinnis, R. E. AND Ewens, W. J.. Transmission test for linkage

disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM).

American Journal of Human Genetics 1993; 52: 506–516.

2. Schaid, D. J. AND Sommer, S. S. Genotype relative risks: methods for design and

analysis of candidate-gene association studies. American Journal of Human Genetics

1993; 53: 1114–1126.

3. Ott, J. Statistical properties of the haplotype relative risk. Genetic Epidemiology 1989; 6:

127–130. 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(18)

For Peer Review

4. Terwilliger, J. D. AND Ott, J. A haplotype-based “halotype relative risk” approach to

detecting allelic associations. Human Heredity 1992; 42: 337–346.

5. Ewens, W. J. AND Spielman, R. S. The transmission /disequilibrium test: history,

subdivision and admixture. American Journal of Human Genetics 1995; 57: 455–464.

6. Thomson, G. Analysis of complex human genetic traits. An ordered-notation method and

new tests for model of inheritance. American Journal of Human Genetics 1995a; 57:

474–486.

7. Thomson, G. Mapping disease genes: Family-based association studies. American

Journal of Human Genetics 1995b; 57: 487–498.

8. Clayton, D. A generalization of the transmission/disequilibrium test for uncertain

haplotype transmission. American Journal of Human Genetics 1999; 65: 1170–1177.

9. Sun, F., Flanders, W. D., Yang, Q. AND Khoury, M. J. Transmission disequilibrium test

(TDT) when only one parent is available: the 1-TDT. American Journal of Epidemiology

1999; 150: 97–104.

10. Weinberg, C. R. Allowing for missing parents in genetic studies of case-parent triads.

American Journal of Human Genetics 1999; 64: 1186–1193.

11. Cervino, A. C. AND Hill, A. V. Comparison of tests for association and linkage in

incomplete families. American Journal of Human Genetics 2000; 67: 120–132.

12. Allen, A. S., Rathouz, P. J. AND Satten, G. A. Informative missingness in genetic

3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(19)

For Peer Review

association studies: case-parent designs. American Journal of Human Genetics 2003; 72:

671–680.

13. Chen, Y. H. New Approach to association testing in case-parent designs under

informative parental missingness. Genetic Epidemiology 2004; 27: 131–140.

14. Rabinowitz, D. AND Laird, N. A unified approach to adjusting association tests for

population admixture with arbitrary pedigree structure and arbitrary missing marker

information. Human Heredity 2000; 50: 211–223.

15. Rabinowitz, D. Adjusting for population heterogeneity and misspecified haplotype

frequencies when testing nonparametric null hypotheses in statistical genetics. Journal of

the American Statistical Association 2002; 97: 742–758.

16. Li, H., Romieu, I., Wu, H., Sienra-Monge, J.J., Ramírez-Aguilar, M., del Río-Navarro, B.E., del Lara-Sánchez, I.C., Kistner, E.O., Gjessing, H.K., London, S.J. Genetic

polymorphisms in transforming growth factor beta-1 (TGFB1) and childhood asthma and atopy. Human Genetics 2007; 121: 529–538.

17. Guo, C.Y., DeStefano, A.L., Lunetta, K.L., Dupuis, J., and Cupples, L. A. Expectation

maximization Algorithm based haplotype relative risk (EM-HRR) test of linkage

disequilibrium using incomplete case-parents trios. Human Heredity 2005; 59: 125-135.

18. Guo, C.Y., Gui, J., and Cupples, L. A. Impact of non-ignorable missingness on genetic

tests of linkage and/or association using case-parent trios. BMC Genetics 2005; 6: (Suppl

1):S90. 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(20)

For Peer Review

Table I. First-order approximations of the general CPG probabilities for the complete trio

under the null hypothesis of no association

Offspring genotype Mating type Parental genotype m f G ×G AA Aa aa 1 AA AA× 1 0 0 2 AA Aa× 1/2 1/2 0 3 Aa×AA 1/2 1/2 0 4 AA aa× 0 1 0 5 aa×AA 0 1 0 6 Aa×Aa [4 ( , )] 16 Aa Aa ε + [8 2 ( , )] 16 Aa Aa ε + [4 3 ( , )] 16 Aa Aa ε − 7 Aa aa× 0 1/2 1/2 8 aa×Aa 0 1/2 1/2 9 aa aa× 0 0 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(21)

For Peer Review

Table II. Sizes and Powers of the Association Tests Under One Population

Sampling proportion p=1

Sampling proportion p=0

Missingness Model Hypothesis

T TDT T TDT Null 0.0538 0.0519 0.0535 0.0491 Dominant (φ2 = 5) 0.9890 1.0000 0.9996 1.0000 Recessive (φ2 = 5) 1.0000 1.0000 0.7489 0.9898 MAR Additive (φ2 = 9) 0.4525 1.0000 0.9982 1.0000 Null 0.0506 0.0530 0.0534 0.0537 Dominant (φ2 = 5) 0.9972 1.0000 0.9956 1.0000 Recessive (φ2 = 5) 0.9997 1.0000 0.6033 0.9740 MIOG Additive (φ2 = 9) 0.7013 1.0000 0.9942 1.0000 Null 0.0565 0.1661 0.0692 0.1126 Dominant (φ2 = 5) 0.9984 1.0000 0.9978 1.0000 Recessive (φ2 = 5) 0.9998 1.0000 0.7172 0.9339 GIPM(1) Additive (φ2 = 9) 0.7773 1.0000 0.9962 1.0000 Null 0.0559 0.2685 0.0762 0.1807 Dominant (φ2 = 5) 0.9978 1.0000 0.9987 1.0000 Recessive (φ2 = 5) 0.9998 1.0000 0.7857 0.9335 GIPM(2) Additive (φ2 = 9) 0.7206 1.0000 0.9975 1.0000 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(22)

For Peer Review

Table III. Sizes and Powers of the Association Tests Under Two Populations

Missingness Model Sampling proportion

p=2/3 Sampling proportion p=1/3 Subpopulation 1 Subpopulation 2 Hypothesis T TDT T TDT Null 0.0558 0.5450 0.0586 0.0528 Dominant (φ2 = 5) 0.9973 1.0000 0.9997 1.0000 Recessive (φ2 = 5) 0.9989 1.0000 0.9725 0.9999 MAR MAR Additive (φ2 = 9) 0.7782 1.0000 0.9664 1.0000 Null 0.0539 0.0545 0.0536 0.0528 Dominant (φ2 = 5) 0.9994 1.0000 0.9999 1.0000 Recessive (φ2 = 5) 0.9933 1.0000 0.9725 0.9999 MIOG MIOG Additive (φ2 = 9) 0.9042 1.0000 0.9664 1.0000 Null 0.0554 0.1601 0.0546 0.1373 Dominant (φ2 = 5) 0.9994 1.0000 0.9999 1.0000 Recessive (φ2 = 5) 0.9926 0.9995 0.9455 0.9949 GIPM(1) GIPM(1) Additive (φ2 = 9) 0.9183 1.0000 0.9869 1.0000 Null 0.0550 0.2075 0.0530 0.1560 Dominant (φ2 = 5) 0.9990 1.0000 1.0000 1.0000 Recessive (φ2 = 5) 0.9937 0.9998 0.9485 0.9944 GIPM(2) GIPM(1) Additive (φ2 = 9) 0.9084 1.0000 0.9855 1.0000 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(23)

For Peer Review

317x171mm (96 x 96 DPI) 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58

(24)

KF Chenga,b, JY Leeb, and JH Chenc a

Biostatistics Center and Department of Public Health, China Medical University,bGraduate Institute of

Statistics, National Central University, Taiwan(ROC), andcBiostatistics Center and Graduate Institute

of Biostatistics, China Medical University, Taichung, Taiwan (ROC)

Short title: Assessing the Joint Effects of Population stratification and Sampling

Correspondence to: Professor KF Cheng, Biostatistics Center, China Medical University, Taichung,

Taiwan 40402, e-mail:[email protected], phone number: 886-4-2207-8539, fax number:

(25)

case-control studies producing spurious gene-disease associations. However, its impact on the study

of interactions is not clear. In this paper, we investigate the joint effect of population stratification

and sampling in case-control study of gene-gene or gene-environment interactions. We show that

the level of the PS bias in testing null interaction depends not only on the variations of the baseline

genotype (exposure) frequency, disease risk, and gene-gene (environment) odds ratio, levels of the

main effects, but also the sampling of the study. Using this result, we point out that the PS bias may

still exist even when the case and control data were matched in ethnicity. We also give general

conditions under which the PS bias is null and derive bounds for the bias when the structure of the

population is not known. The information about these bounds can be used to define conservative

tests. In this paper, we also quantify the PS bias to the genetic main effect and show how the bias is

modified by the presence of another factor. Finally, the performance of the usual tests is studied by

numerical examples.

Key words: bias; environmental risk factors; genes; interactions; population stratification; sampling

(26)

have been identified. Overwhelming evidence indicates that there are reasons to believe that relative

common polymorphisms in a wide spectrum of genes may modify the effect of environmental

agents [1, 2]. Several studies also have demonstrated the presence of gene-gene interaction in

complex human diseases [3-7]. Gene-gene interaction, epistasis, is also considered as a basic

genetic concept which has been widely used by biologists for a long time [8].

Many association designs have been proposed for studying gene-environment or gene-gene

interactions. Recently, Wang and Zhao [9] found that in the study of gene-gene interactions, the

unmatched case-control association design is more powerful than both the matched case-control

design and case-parents design. They also found that when a logistic regression model is fitted for

assessing gene-environment interactions based on case-parents sample [10], the approach may be

susceptible to the PS bias. However, case-control design is also well known to be susceptible to the

PS bias in the study of genetic effect, if the gene under study shows marked variation in allele

frequency across subgroups of the population and if these subgroups also differ in their base-line

disease risks [11-17]. Wang, et al. [18] recently provided numerical examples showing that when

the correlation between genetic and environmental factors is small or the linkage disequilibrium is

weak, and case-control data were collected according to a simple random sampling (SRS) scheme,

the PS biases in testing null interaction odds ratios are also small. However, the SRS scheme is

(27)

testing null interaction. We quantify the magnitude of the PS bias in terms of the baseline disease

risks, genotype frequencies (or exposure rates), their odds ratios (linkage disequilibrium

coefficients), the levels of the main effects, and sampling design used in the study. Based on this

result we are able to derive many important conditions under which the PS bias is null. We point out

that if the study does not follow an SRS design, the PS bias may be large even the genetic factors

under study are in linkage equilibrium. We also show that matching in ethnicity between case and

control samples cannot always eliminate the PS bias. However, our numerical examples indicate

that the biases are generally small when the genetic factors are in linkage equilibrium.

The PS bias often cannot be estimated, since wedon’tknow how many importantsubpopulations

involved in the studied population and/or which subpopulation a person belongs to. In this paper,

however, we are able to derive useful bounds to measure the maximal impact of the PS when the

structure of the population is not known. Sometimes, these bounds can be estimated so that more

careful conclusion about the significance of the test can be made; see Lee and Wang [19] for similar

suggestion in studies of gene-disease association. We give many numerical examples to show the

general performance of the usual tests.

In addition to the discussion about the PS bias in the study of interactions, we also discuss the

impact of the PS effect in testing the null genetic main effect when another genetic (environmental)

(28)

Methods

The Magnitudes of the Biases

We begin this section with the notation that will be used throughout this work. Disease status is

denoted as D with levels d=1, 0, indicating the presence and absence of the disease, respectively.

G=1 (0) represents the presence (absence) of the genotype of interest. H=1 (0) represents the

presence (absence) of the environmental exposure or another genotype of interest. All results shown

in this paper can be extended to any risk factor with finite number of levels. We let S denote a

general stratification variable with values s=1,…,K, representing K strata. S is not observable in

discussion of the effects of PS. In the subpopulation level, we assume that the risk model is given

by

logitP D( 1|Gg H, h S,      s)      _s g h gh.

s=1, g=0, and h=0 represent the referent subpopulation, genotype and environmental exposure,

respectively. For identifiability purpose, we define ₁ =0.  _s,s 1,...,K, are the subpopulation-specific parameters representing the potential heterogeneity of disease risk across

subpopulations. In this model, log-odds-ratio  measures the association between the genotype

and risk of disease, log-odds-ratio  measures the association between the environmental

(29)

s

OR to represent the corresponding baseline G H odds ratio (givenD0), G H_s( _s)to represent the baseline G H( )-frequency odds (givenD0, andH G( )0), D to represent the baseline_s

disease frequency odds (givenH 0,andG0) and

* * ( | 1) ( | 0) ( | 1) ( | 0) s P S s D P S s D DS P S s D P S s D          to

represent the deviation of the true sampling scheme of the study and the usual SRS scheme. Here

*

( | 1)

P S s D ( P S*( s D|  ) is the true sampling proportion of the subjects from0) subpopulation s in the cases (controls), and P S( s D| 1) ( P S( s D| 0) ) are the corresponding proportions under SRS scheme.

Since in the population level we only observe factors G and H in a case-control study, we show in

the Appendix that the main effects and interaction are given by

D G odds-ratio =exp(  *), DH odds-ratio =exp(  ,*) G H interaction =exp(  *), where * (1, 0) log (0, 0) K K      , * (0,1) log (0, 0) K K  _ _  , and * (1,1) (0, 0) log , (1, 0) (0,1) K K K K      and * 1 * 1 ( 0, 0 | , 0) ( | 0) ( , ) . ( 0, 0 | , 0) ( | 0) k g h g h s s s s s s k g h g h s s s s OR G H D DS P G H S s D P S s D K g h OR G H P G H S s D P S s D                 



We note that if D_sDS_sis constant across subpopulations then there is no PS bias of any kind. A sufficient condition for this to hold is when the baseline disease risk is identical across all

(30)

The maximal impact of the PS bias to the genetic main effect and conditions for the null bias

The maximal PS bias  to the genetic main effect depends on the variation of the baseline* genotype frequency odds, * max s min s,

s s

G  G G variation of the sampling

deviation * max s min s s s

DS  DS DS and variation of the baseline disease odds

*

max s min s s s

D  D D . In the Appendix, we also showed that the bias can be expressed as

* exp( =)









1 1 1 k s s s s s k k s s s s s s s G D DS w G w D DS w   



 

* * * * * * (1 ) , ( (1 ) )( (1 ) ) M M M m m m M m M M m m w G D DS w G D DS w G w G w D DS w D DS       

where w and_s w* are some constants satisfying 0w*, w_s 1and

1 1 K s s w  



. The inequality shows that the bias is the greatest when the number of subpopulations is 2. Note that the value of

constant w* is generally unknown, however, using the above inequality the bias can be shown to

be bounded above by











2 * * * * * * * * * * * * * * * 1 G D DS G D DS U G D DS G G D DS D DS      .

The bias is also bounded below byL_U_1. These bounds give the maximal impact of the PS in making inference about the genetic main effect. Under rare disease, the background disease rate is

(31)

if the data observed in the study follow an SRS scheme (DS*=1), and the disease risk is constant,

then the PS bias is also null. (However, if the sampling is not SRS, the results in Tables 1-2 show

that the PS bias may be non-null even when the subpopulations have constant disease risk); (3) if

the case and control data were matched in ethnicity, and H-main effect and interaction are null,

then there is the PS bias is null. (However, if there is non-null H-main effect (or non-null

interaction), the results in Tables 1-2 show that the PS bias is non-null)

The maximal impact of the PS bias to the interaction and conditions for the null bias

The PS bias to the interaction effect also depends on the variation of the baseline odds ratio*

*

max s min s.

s s

OR  OR OR We point out that OR* is not necessary to be 1 even the two genetic factors are in linkage equilibrium within each subpopulation. In fact, if the latter condition holds,

one can show









1 exp(1 exp() 1 exp(



) 1 exp( ) )



,

s s s s s OR                    where ( 0 | ) log . ( 0 | ) s s P D S s P D S s       

Thus, OR* also depends on the levels of the main effects and interaction. In the Appendix, we

showed that under SRS, the bias was bounded above by (1) *2

U_ D and bounded below

(1) *2 1

( )

L_ D . These are the same bounds derived by Wang et al. [18]. Unfortunately, these bounds are not valid if the SRS condition is not satisfied. Under general sampling design, we showed that

(32)

and bounded below by 1 U_(2) L(2)_ . We note that if the genetic factors are in linkage equilibrium within each subpopulation, and the G(orH)-main effect is null, then OR* 1in testing null interaction. Under this scenario, if the variation of the G(orH) frequency odds is small, the above

bounds show that the bias of the test cannot be expected to be large.

Using these bounds, some interesting conclusions about the PS bias can be reached: (1) if the

baseline G H odds ratio and G(orH)- frequency odds are constant across subpopulations then the PS bias is null; (2) if the sampling of the study is SRS, and the disease risk is constant, then* the PS bias is also null. (However, if the SRS condition does not hold, the latter conclusion is not

necessary valid; see Tables 1-2 for more results.)

RESULTS

True type I errors

In case-control studies, one often expects that the type I errors of the association tests can be

approximately controlled at some predetermined level. However, in the presence of PS, the usual

test statistic does not have a chi-square distribution under null hypothesis. Instead, it has a

non-central Chi-square distribution, with non-centrality parameter depending on the level of the PS

bias. Thus, the usual chi-square test tends to have inflated type I errors.

Suppose that the intended type I error rate of the chi-square test is  and let _1;12__represent



(33)

given by

 

2 * (1) (1) (1) (1) (0) (0) (0) (0) 11 01 10 00 11 01 10 00 . 1 1 1 1 1 1 1 1 ( ) ( ) n n n n n n n n           

where n( )_ghd is number of observations with outcome Gg H, h and disease status d. Then the true type I error of the usual chi-square test of null interaction is given by _P(₁2( _) _1;12__), which is always greater than or equal toIn the case of testing null genetic main effect, the.

non-centrality parameter is given by





2 * (1) (1) (0) (0) 10 00 10 00 log( ) . 1 1 1 1 ( ) n n n n       

The corresponding true type I error of the chi-square test is given by _P(₁2( _) _1;12__), which is also.

Tables 1 and 2 show values of the PS bias, bias*  and true type I errors of the usual* chi-square tests when the significance level is 0.05. We assumed that there are two subpopulations

(K 2),  0,0or 1 (thus OR =1 under linkage equilibrium),_s G(H) frequency of the first subpopulation was given by P G( 1|S  1) 0.51 ( P H( 1|S 1) 0.19) , the first subpopulation disease risk was P D( 1|S 1) 0.05, the proportion of subpopulation 1 in the overall population was 0.7, and case and control sample sizes both equaled to n= 500. We defined

1 2

( , ),

(34)

parameter values were determined from the variations G H D and*, *, * DS*given in the tables with the assumption that subpopulation 2 has the maximal baseline G(orH) frequency odds, disease

risk, and sampling deviation (this implies that P S*( 2 |D ranges from 0.0585 to o.7163).0) Finally, we note that in computing the non-centrality parameters, the sample frequencies d

gh

n were

replaced byn P G ( g H, h D|  .d)

According to the results in Table 1, the true type I error  ranges from 0.05 to 0.9998 under_ linkage equilibrium. If the SRS condition holds and0, the true type I error  ranges from_ 0.05 to 0.9602 with mean 0.4377 and standard error 0.3298. Under the same scenario but1, the corresponding range becomes (0.05, 0.9326) with mean 0.3822 and standard error 0.2969. On the

other hand, if the sampling is not SRS (DS*=3 or 5) and0, the range of  is (0.05, 0.9998)_ with mean 0.4747 and standard error 0.3978. Under non-SRS but 1, the corresponding range becomes (0.05, 0.9992) with mean 0.4425 and standard error 0.3678. These summarized results

indicate that the PS bias can be quite large and its level may be modified by the sampling design

and the level of H-main effect. In general, the PS bias is smaller when 1. We also observed that the PS bias may be non-null under perfect matching. For example, if the matching is perfect* and H-main effect 1, the largest true type I error is 0.1064, which occurs at the case with

* * *

5.

G H   This is contrary to our usual believe that matching between cases and controlsD

(35)

disequilibrium coefficient is small and the sampling is SRS. Our Table 1 also shows that under SRS

and linkage equilibrium, the true type I error  in testing null interaction ranges from 0.05 to_ 0.0659. This agrees with the finding by Wang et al. However, if the sampling is not SRS (DS*=3 or

5),  has range (0.05, 0.2656), mean 0.0840, and standard error 0.0516 when_ 0, and range (0.05, 0.2750), mean 0.0871, and standard error 0.0550 when1. These results indicate that PS also can cause serious problem in case-control study of gene-gene interactions even the two genes

are in linkage equilibrium. Under this scenario, the best way of reducing the bias is to match cases

and controls in ethnicity. We note that under perfect matching and linkage equilibrium, the range of



 is only between 0.05, and 0.0541.

Linkage disequilibrium between two genes or correlation between genetic and environmental

factors also play important role in determining the PS level in the studies of interaction. According

to results presented in Table 2, we find that the bias to the genetic main effect becomes smaller

when the linkage disequilibrium coefficient increases to 0.05. When 0, the mean of  is_ 0.3377 under SRS and 0.3843 under non-SRS, and when 1 the mean becomes 0.2716 and 0.3351, under SRS and non-SRS, respectively. On the contrary, the bias to the interaction effect

increases when the linkage disequilibrium coefficient increases to 0.05. Our results show that

when0, the mean of  is 0.1642 under SRS and 0.3841 under non-SRS, and when_ 1, the mean becomes 0.1706 and 0.3906, under SRS and non-SRS, respectively. In all, the PS bias *

(36)

perfectly matched.

Discussion

The impact of population stratification is considered by many to be important in case-control

studies of gene-disease association. Many authors have suggested quantitative methods to control

type I errors of the usual association test. The most popular treatments include the “genomic control” method [17, 21-26] and the “structure association” method [20, 27- 30]. Each of the

proposed methods requires typing extra polymorphic markers to generate an estimate of PS bias

which can be used to adjust the test statistic. The impact of PS in case-control studies of gene-gene

(environment) interaction is considered to be less important, when the genes under studied are in

linkage equilibrium or when the gene-environment correlation is weak [18, 31]. However, this

conclusion holds only when the sampling of the case and control data follows a simple random

sampling design. Unfortunately, there exists no formal test for testing the validity of the SRS

condition when the PS is present.

In practical applications, the SRS condition is rarely satisfied. For examples, when the

hospital-based cases (controls) are used in the study and they are not representative of the

population-based cases (controls) or when the non-response of the cases or/and controls occur in the

study, then the SRS condition may fail. In this paper, we show that under slight deviation of the

(37)

genetic factors are in linkage equilibrium or when the genetic and environmental factors are

uncorrelated. Large correlation or strong linkage disequilibrium could make the PS bias become

even larger. Also, small variation in disease risk cannot guarantee small PS bias to the interaction,

unless the sampling of the study follows an SRS design. Therefore, in applications, it is also

important to be able to measure the maximal impact of the PS bias. In this paper, we drive bounds

for the PS bias. If the bounds are estimable, then they can be used for making more conservative

inference.

We note that matching in ethnicity between cases and controls has been suggested by the

epidemiologists to control the PS bias in case-control gene-disease association study. However, the

PS bias () to the genetic main effect also depends on the sampling design and levels of another* main effect (value) or interaction effect. We found that if  0 then the residual bias to the main effect after matching is zero. However, if1, and0, the residual bias may still be substantial. Tables 1-2 also show that matching cannot remove the PS bias to the interaction effect.

However, under the cases considered, the biases are small.

Since the presence of PS may cause unacceptable bias to the usual interaction analysis, it is of

importance to have an efficient method to control the bias. Unfortunately, so far there exists no

effective method for controlling the PS bias in the studies of interaction. The major difficulty is that

(38)

wishes to test the significance of the gene-gene interaction, the genomic control method may be

applied to control for the PS bias. One can follow the idea of genomic control to type extra pairs of

null markers and apply the computed interaction levels to control the PS bias. In principle, if the

candidate markers are in linkage equilibrium, the selected pairs of null markers also need to be in

linkage equilibrium so that the important characteristics of the PS bias can be captured. On the other

hand, if the candidate markers are in linkage disequilibrium, the paired null markers also need to be

correlated. We are currently working on this important problem. Another approach is to reduce the

PS bias by matching the cases and controls in the study. We find that under perfect matching and

linkage equilibrium, the PS biases to the interaction effect are small in the discussed cases. More

study is needed in order to understand the impact of the residual bias when matching is not perfect.

Finally, we point out that one may also use extra null markers to determine the number of

subpopulations involved in the study and identify which subpopulation a person belongs to and use

this information to fit the risk model. However, the success of this approach also heavily depends

on the selection of the null markers.

Appendix

Following the usual Bayesian argument, the disease-risk model implies that



 



Pr Gg H, h S| s D, 1 Pr Gg H, h S| s D, 0 exp



       _s g e ge



, where  _s  _slog Pr





D0,S s

 

Pr D1,S s





, s=2,…,k. As a consequence,

(39)

On the other hand, the joint frequency distribution of G and H in the control population is given by







 

*



1 Pr , | 0 Pr , | , 0 P | 0 . k s G g H h D G g H h S s D S s D     



     

Thus their ratio is given by



 





  



 

   







* * * * * Pr , | 1 Pr , | 0 exp , exp . G g H h D G g H h D g h gh K g h g h gh                                 Here, we define *



*

 



log K 0, 0  ,

 

* * * 1, 0 log 0, 0 K K _{ } _    ,

 

* * * 0,1 log 0, 0 K K _{ } _    and

   

* * * * * 1,1 0, 0 log 0,1 1, 0 K K K K       , where

 



 

  



 



* * 1 * 1 Pr , | , 0 P | 1 exp , . Pr , | , 0 P | 0 k s s k s G g H h S s D S s D K g h G g H h S s D S s D                



It can be shown that *

 

 represents the PS bias to the genetic (environmental) main effect,* and is the PS bias to the interaction effect. We note that*

 

( 0 | 1, ) exp ( 0 | 0, ) s P G H D S s P G H D S s           ( 1| 0, ) ( | 0) ( 0 | 0, ) ( | 1) P D G H S s P S s D P D G H S s P S s D               ( 0) , ( 1) P D P D   

thus one can re-write *

 

, ( , ) ( 0). ( 1) P D K g h K g h P D   

(40)

where * * 1 ( 0, 0 | , 0) ( | 0) ( 0, 0 | , 0) ( | 0) s k s P G H S s D P S s D w P G H S s D P S s D                



.Simple algebra shows that there

exists some constant w*such that the bias is bounded above by

* * * * * * (1 ) ( (1 ) )( (1 ) ) M M M m m m M m M M m m w G D DS w G D DS w G w G w D DS w D DS       0 1 (1 ) max ( (1 ) )( (1 ) ) M M M m m m w M m M M m m wG D DS w G D DS wG w G wD DS w D DS         











2 * * * * * * * * * * * * * * * 1 . G D DS G D DS G D DS G G D DS D DS    

Also note that under SRS, DS_s  and therefore according to the definition of1 exp( , we*)

easily show that it is bounded above by D*2and bounded below byD* 2. However, under general sampling design, the bias is expressed as

* 1 1 1 1 1 1 exp( ) , k k k s s s s s s s s s s s k k k s s s s s s s s s s s OR G H w G w H w G w H w OR G H w               



 



where * * 1 ( | 0) ( | 0) s s s k s s s D DS P S s D w D DS P S  s D       



, * * 1 ( | 0) ( | 0) s k s P S s D w P S s D       



. By applying the same

approach for deriving bounds forexp( , we also can derive bounds for*) exp( .*)

Acknowledgements

This research was supported in part by a grand from National Science Council and a joint research

grand from China Medical University and Asia University.