• 沒有找到結果。

家族性病例對照資料之統計分析

N/A
N/A
Protected

Academic year: 2021

Share "家族性病例對照資料之統計分析"

Copied!
49
0
0

加載中.... (立即查看全文)

全文

(1)

統計學研究所

家族性病例對照資料之統計分析

Statistical Analysis for Familial Case-Control Data

研 究 生:蘇筱嵐

指導教授:王維菁 博士

(2)

家族性病例對照資料之統計分析

Statistical Analysis for Familial Case-Control Data

研 究 生:蘇筱嵐 Student:Hsiao-Lan Su

指導教授:王維菁 博士 Advisor:Dr. Weijing Wang

國 立 交 通 大 學

統 計 學 研 究 所

碩 士 論 文

A Thesis

Submitted to Institute of Statistics

College of Science

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Master

in

Statistics June 2010

Hsinchu, Taiwan, Republic of China

(3)

家族性病例對照資料之統計分析

學生:蘇筱嵐 指導教授:王維菁 博士

國立交通大學統計學研究所

家族性病例對照資料研究方法近年來常使用於探討疾病與致病因子之

關係。本論文回顧了分析家族資料的統計文獻方法:針對得病與否的反應

變數,考慮了邏輯斯迴歸模型; 針對得病時間的反應變數,考慮了 Cox 等

比風險模型。我們討論如何將建立在個別性資料上之研究方法推廣至家族

性資料,並探討如何將適用於前瞻性資料的方法修改為分析病例對照資料

所需做的假設與調整。此外,我們也透過模擬實驗來驗證推論過程中所需

要之條件與比較參數估計之表現。

關鍵字:家族性病例對照資料研究;前瞻性研究;邏輯斯迴歸模型;Cox PH

模型;Clayton 模型

(4)

Statistical Analysis for Familial Case-Control Data

Student: Hsiao-Lan Su

Advisor: Dr. Weijing Wang

Institute of Statistics

National Chiao Tung University

Hsinchu, Taiwan

Abstract

Familial case-control data are frequently used to study the relationship between disease

and risk factors. In the thesis, we review literature for analyzing familial data. The logistic

model is applied to model the probability of disease incidence. The Cox proportional hazards

model is applied to model the age at onset of the disease. For each model, we discuss how to

extend the method and model developed for individual data to familial data. In addition, we

discuss the criteria and modification from prospective data to case-control data. We also

propose simulation algorithms for generating case-control data and then, based on simulated

data, examine parameter estimates and crucial properties of the inference procedure.

Keywords: Familial case-control study; Prospective study; Logistic regression; Cox PH model;

(5)

碩士生涯的終點,也代表了十八年學生旅程的結束。這兩年能有所成果,

首先必須感謝我的指導老師─王維菁教授,我在理論的推導與邏輯思考

上,一直不是很敏感,反應也很慢,是老師像母親般不厭其煩的引導我從

大方向思考,也讓我學會了該如何去統整概念和如何論述的技巧,更幫助

我撐起整篇論文的架構,老師給予的訓練在我往後的人生中將會受用無窮。

再來要感謝所有交大統研 97 與悶騷的研究室同學們,因為有你們的一同

努力與分享,學習之路不至於乏味而無援。還有高中與大學的好友們,你

們的支持與鼓勵是我向前邁進的動力。

也必須感謝交大統計所所有的老師與郭姐,給予了我們良好的學習環

境,讓我在碩士這兩年學到了許多知識。還有我的口試委員徐南蓉教授、

黃信誠教授與洪慧念教授,謝謝你們的協助與討論,使論文更加完善。

最後,我要感謝我的母親,我知道我不是個貼心的好女兒,這幾年來讓

您辛苦了,您無條件的支持使我求學過程中完全無後顧之憂。最後,我想

將這篇論文獻給我們永遠懷念的父親,我會盡力達成您的期望。

感謝一路上曾幫助過我的貴人,希望大家未來的人生都順利而快樂。

蘇筱嵐 謹誌于

國立交通大學統計學研究所

中華民國九十九年六月

(6)

Contents

1 Introduction ... 1

1.1 Motivation ... 1

1.2 Outline ... 2

2 An Overview of Case-Control Designs ... 3

2.1 Conventional Case-Control Designs ... 3

2.2 Familial Case-Control Designs ... 3

2.3 The Issue of Matching in Case-Control Designs ... 4

3 Logistic Regression on Different Designs ... 5

3.1 Conventional Case-Control Designs ... 5

3.2 Familial Case-Control Designs ... 7

3.2.1 Likelihood analysis based on familial prospective data ... 8

3.2.2 Likelihood analysis based on familial case-control data ... 10

4 Simulations for Logistic Regression Analysis ... 12

4.1 Data Generation for Individual Data ... 12

4.1.1 Prospective data of the true population ... 12

4.1.2 Case-control data from the true population ... 12

4.2 Analysis on Individual Data ... 13

4.3 Data Generation for Familial Data ... 15

4.3.1 Familial prospective data of the true population ... 15

4.3.2 Familial case-control data from the true population ... 15

4.4 Analysis on Familial Data ... 15

5 Regression Analysis Based on Familial Data ... 24

5.1 Likelihood Analysis Based on Probands ... 24

5.2 Likelihood Analysis Based on Familial data ... 27

6 Simulations for Analysis of Age-onset Data from Case-Control Family Studies ... 33

(7)

6.1.1 Prospective data of the true population ... 33

6.1.2 Case-control data from the true population ... 34

6.2 Analysis on Individual Data ... 34

6.3 Data Generation for Familial Data ... 35

6.3.1 Familial prospective data of the true population ... 35

6.3.2 Familial case-control data from the true population ... 36

6.4 Analysis on Familial Data ... 36

7 Concluding Remarks ... 38

(8)

List of Tables

Table 3.1 Joint probability for

Y Y given p, r

Zp,Zr

... 10

Table 4.1 Logistic regression analysis of case-control data ... 14

Table 4.2.A Checking reproducible properties of the population data (p=0.3) ... 17

Table 4.2.B Checking reproducible properties based on case-control data (p=0.3) ... 17

Table 4.2.C Checking reproducible properties based on case-control data (p=0.3) ... 18

Table 4.2.D Checking reproducible properties based on case-control data (p=0.3) ... 18

Table 4.3.A Checking reproducible properties of the population data (p=0.5) ... 19

Table 4.3.B Checking reproducible properties based on case-control data (p=0.5) ... 19

Table 4.3.C Checking reproducible properties based on case-control data (p=0.5) ... 20

Table 4.3.D Checking reproducible properties based on case-control data (p=0.5) ... 20

Table 4.4.A Checking reproducible properties of the population data (p=0.7) ... 21

Table 4.4.B Checking reproducible properties based on case-control data (p=0.7) ... 21

Table 4.4.C Checking reproducible properties based on case-control data (p=0.7) ... 22

Table 4.4.D Checking reproducible properties based on case-control data (p=0.7) ... 22

Table 4.5 The MLE of  based on case-control familial data ... 23

Table 5.1 Age-matched case-control data ... 25

Table 5.2 Age-matched case-control familial data ... 28

Table 6.1 Analysis of age-onset data based on case-control studies ... 35

(9)

List of Figures

(10)

Chapter 1 Introduction

1.1 Motivation

Scientists are interested in studying the roles of genetic and environmental factors on the

development of a disease. Besides the information about whether the disease is present or not,

age-at-onset has been viewed as a useful quantitative trait for some commonly-seen complex

diseases. For example, early onset of breast cancer has been viewed as an important hallmark

for genetic predisposition. Figure 1 highlights the scientific background which motivates this

thesis. For a quantitative trait, statisticians can perform regression analysis which the effects

of the explanatory variables on the response.

Figure 1: Scientific Background

We focus on two quantitative traits, namely disease incidence and age-at-onset. Disease

incidence can be coded as a binary variable. Age-onset variables are continuous but may be

censored due to termination of the study or loss to follow-up. Genetic, environmental and

individual factors are treated as observed covariates. Their influences on the chosen response

variable are of major interest. If disease incidence is the response variable, logistic regression

models can be adopted. If age-at-onset is studied, failure-time regression models such as Cox

proportional hazards models can be applied. When genetic information is not directly Environmental factors Individual factors Quantitative Traits:  Incidence  Age-onset  Others Genetic factors

(11)

measured, familial data can be used to detect its influence. Familial aggregation often

indicates that genetic or shared environmental factors play some role in the development of

the disease.

From the aspect of data design, the case-control sampling study is often applied to gather

the information of rare diseases. It has the advantage that sufficient number of cases can be

obtained and hence is cheaper and more convenient in comparison with a prospective study.

In recent years, familial case-control designs have become a popular choice in genetic

epidemiology. However statistical inference based on familial case-control data deserves

careful investigation since the underlying probability structure is not straightforward.

1.2 Outline

The purpose of the thesis is to review related literature under a unified framework and

examine some theoretical statements via simulations. In Chapter 2, we provide some

background for different types of case-control designs. In Chapter 3, we review literature on

logistic regression for familial prospective studies and case-control designs. Chapter 4

contains some simulation results which are conducted to verify crucial probability statements

for logistic regression analysis. In Chapter 5, we review literature on familial age-onset data

based on case-control designs. Chapter 6 contains simulation studies for checking the

assumptions that are required in analysis of age-onset data from case-control family studies.

(12)

Chapter 2 An Overview of Case-control Designs

Case-control designs are preferable because they are cheaper and more convenient. In

this chapter, we focus on two common case-control designs: namely the conventional and

familial designs.

2.1 Conventional Case-control Designs

Conventional case-control designs begin by recruiting a group of individuals with a specific

disease as “cases” and the other group of non-diseased individuals as “controls”. Cases and

controls are compared based on risk factors including familial history of the disease. Here

positive familial history is defined as presence of the disease in one or more first-degree

relatives. However potential bias may arise due to incorrect information of recall.

Furthermore individuals may differ in their family sizes so that positive family history is more

likely to occur in a larger family. The family sizes differ in cases and controls can lead to false

results. Liang (2000) discussed potential biases for conventional case-control designs in

details.

2.2 Familial Case-Control Designs

Familial data obtained from case-control designs are frequently used to detect disease

aggregation in families. This design begins by identifying a sample of diseased cases and an

independent sample of disease-free controls, and for each individual, hereafter called a

“proband”, determines his/her covariates, the family structure, and the disease status and covariates of relatives in the family. The disease status of relatives is treated as one part of the

responses in the model.

A major difference between the two designs lies in the sampling unit. The sampling unit in

familial case-control designs is a pre-defined set of family members. Compared to the

conventional design, familial case-control designs provide direct evaluations of the relatives

and can avoid misclassification of family history. It is also useful for genetic counseling.

(13)

2.3 The Issue of Matching in Case-Control Designs

In case-control designs, there are some confounding variables that may affect the

evaluation of the association between disease incidence and risk factors. So, sometimes we

must consider the necessity to match these confounding variables in the design stage. The

purpose of matching is to let the units between cases and controls have more comparability.

The matching method includes frequency matching and individual matching. In

conventional case-control design, if individual matching is part of the design, the conditional

logistic regression method mentioned in Breslow and Day (1980) may be adopted. When in a

familial case-control design, we note that the sampling units are families. So the matching

between case probands and control probands doesn’t guarantee the matching between case

relatives and control relatives. Thus the matching procedure in such studies should be subject

to some modification. First, the matching in design stage must be run under the condition that

the confounding variables are familial, for example: races. Second, correlations among

relatives have to be dealt with. Liang (1987) proposed a method for analyzing the matched

designs which accounts for the within-family correlation. For age-onset responses, Li et al.

(1998) also discussed situations under familial structure and matched procedure.

Finally, Sturmer and Brenner (2000) discussed the issue of the balance between power

(14)

Chapter 3 Logistic Regression on Different Designs

Logistic regression models are commonly adopted for modeling the relationship between

a binary response and covariates. We first discuss statistical inference based on prospective

studies which can be easily understood. Then we discuss how to construct the likelihood

function if the sample is obtained from a case-control design. Finally we will review the

literature on logistic regression analysis for familial case-control studies.

Denote Y as a binary indicator for disease status. Specifically Y 1 represents that the individual is diseased while Y 0 indicates that the individual is free of the disease. Denote

Z as a p1 vector of covariates. Consider the following logistic regression model:

exp( ) Pr( 1| ) 1 exp( ) T T Z Y Z Z          . (3.1)

Let {( ,Y Zi i) (i1,..., )}n denote the observed sample. If the data are collected from a

prospective design, the likelihood function can be written as

1 1 exp( ) 1 1 exp( ) 1 exp( ) i i Y Y T n i T T i i i Z Z Z                     

. (3.2)

A case-control study, by contrast, identify a sample of diseased cases: Y 1 and another independent sample of non-diseased controls: Y 0 . The covariate Z is measured

afterwards. Notice that the distribution of data from a case-control study is based on

Pr Z Y instead of | Pr

Y Z as given in (3.1). However logistic regression analysis can |

still be applied to both sampling designs (Prentice and Pyke, 1979). In Sections 3.1 and 3.2,

we will review the results of Whittemore (1995) in which the probability structure under

conventional and familial case-control designs is well examined.

3.1 Conventional Case-Control Designs

Let  be the target population. The logistic regression model in (3.1) is equivalent to

Pr( 1| , ) log Pr( 0 | , ) T Y Z Z Y Z         (3.3)

(15)

where  is the intercept that represents the log odds for developing the disease of the baseline group, and  is the log odds ratio between a subject with covariate Z and a subject of the baseline group. Since  reflects the effect of Z on Y , it is the parameter of

major interest.

As mention earlier, a sample based on a case-control design involves Pr

Z Y| ,

. Applying Baye’s rule, we obtain

Pr( | 1, ) Pr( 0 | ) exp( ) exp( ). Pr( | 0, ) Pr( 1| ) T Z Y Y Z Z Y Y           (3.4a)

Whittemore (1995) mentioned that one can imagine a hypothetical population denoted as * in which the covariate distribution is the same as in  such that

* * Pr( | 1, ) Pr( | 1, ) Pr( 0 | ) exp( ) exp( ) Pr( | 0, ) Pr( | 0, ) Pr( 1| ) T Z Y Z Y Y Z Z Y Z Y Y               . (3.4b) Define * * Pr( 0 | ) Pr( 1| ) exp( ) exp( ) Pr( 1| ) Pr( 0 | ) Y Y Y Y              .

One can rewrite (3.4b) as

* * * * Pr( | 1, ) Pr( 0 | ) exp( ) exp( ) Pr( | 0, ) Pr( 1| ) T Z Y Y Z Z Y Y           . (3.5)

From (3.5), we can construct the following logistic model based on *:

* * Pr( 1| , ) log Pr( 0 | , ) T Y Z Z Y Z         . (3.6)

Now we discuss the implication of the above analysis. Comparing the two models in (3.3)

and (3.6), they differ in the intercept parameter but have the same slope parameter, which is of

major interest. In a case-control design, the sampling distribution is based on

*

Pr Z Y|  1, and Pr

Z Y|  0, *

, where

*

*

*

* Pr | Pr | 1, Pr 1| , Pr 1| Z Z Y Y Z Y          ;

(16)

*

*

*

* Pr | Pr | 0, Pr 0 | , Pr 0 | Z Z Y Y Z Y          . Notice that

* * Pr | Pr | Z Y

 is independent with parameters. The likelihood function for

case-control data can be constructed based on model (3.6). Accordingly case-control data can

be treated as prospective data from * if the following condition holds:

* * Pr( | 1, ) Pr( | 1, ) Pr( | 0, ) Pr( | 0, ) Z Y Z Y Z Y Z Y          . (3.7)

As long as (3.7) is satisfied in collecting the case-control sample, one can proceed the

regression analysis, by pretending that the sample is from a prospective study, to obtain an

estimate of  which is still reliable. We will examine the crucial condition in (3.7) via simulations.

3.2 Familial Case-Control Designs

In analysis of familial data, some studies ignored probands’ information and only focus

on relatives’ data. Such an approach may lose efficiency by ignoring useful information in

probands’ data. Whittemore (1995) applied multivariate techniques to analyze familial

case-control data. Specifically she proposed a two-stage sampling procedure. Specifically in

the first stage, two types of probands (case and control) are sampled and then, in the second

stage, their relatives are sampled. To simplify the discussion, we focus on bivariate analysis

which means that only one relative is sampled based on each proband. The resulting

likelihood analysis contains two components. One involves the logistic model on probands as

introduced earlier. The other component is related to the model which measures the

dependence between a proband and his/her relatives.

Let (Y Zp, p) and ( ,Y Zr r) be the disease status and covariates for a proband and his/her

relative respectively. Denote Y (Y Yp, r) and Z (Zp,Zr). We will first discuss likelihood

(17)

design.

3.2.1 Likelihood analysis based on familial prospective data

A prospective study involves sampling from

Pr( ,Y Zr|Zp)Pr( | ) Pr(Y Z Zr|Zp). (3.8)

When only one relative is involved, Pr( | )Y Z Pr(Ypy Yp, ryr|Z Zp, r) for y* 0,1 and

* = ,p r . Note that

Pr( | )Y Z =Pr(Yp|Z Zp, r) Pr(Y Y Zr| p, ).

Whittemore (1995) mentioned that a reasonable joint model should satisfy the so-called

“reproducible” assumption such that

1 0 Pr( , | , ) r r y p p r r p r y Y y Y y Z Z    

Pr(Ypyp|Zp); (3.9a) 1 0 Pr( , | , ) p p y p p r r p r y Y y Y y Z Z    

Pr(Yryr|Zr). (3.9b)

That is, the covariate of a person is sufficient to determine his/her disease status and hence the

relative’s covariate does not contribute extra information. The paper examines the plausibility

of the reproducible assumption. Suppose that the dependence between Yp and Y within the r

same family may also be attributed to some un-measured latent variable denoted as U . If Pr(U Z| )Pr( )U , the reproducible assumption can be achieved. Whether this assumption makes sense depends on the scientific problem at hand.

When (3.9a) is true, it follows that

Pr( | )Y Z =Pr(Yp|Zp) Pr(Y Y Zr | p, ). (3.10)

(18)

log Pr( 1| ) Pr( 0 | ) p p T p p p Y Z Z Y Z       .

The second quantity Pr(Y Y Zr | p, )Pr(Y Y Zr | p, p,Zr) in (3.10) involves the dependence between a proband and his/her relative which is of major interest. Denote observed data

as{(Y Y Zpi, ri, pi,Zri) (i1,..., )}n . If the data is collected from a prospective sampling design,

the likelihood function can be written as

1 , , Pr , , | i i i i n p r r p i L    Y Y Z Z  

1 1 Pr | Pr | , i i i i n n p p r p i i i Y Z Y Y Z   

(1) (2) , , , L   L      , (3.11)

where L(1)

 ,

has the form as in (3.2) and  denotes additional parameter in

Pr(Y Y Zr | p, ) . Additional joint model assumption is required to specify the form of

Pr(Y Y Zr | p, ).

One model choice is the following model first proposed by Bahadur (1961):

1 1 Pr(( , ) | ( , )) ( ) (1yp ) yp( ) (1yr ) yr(1 ) p r p r p p r r p r Y Y Z Zpppp  t t (3.12) where

* * * * * * , 1 y p t p r p p     , and * * * * * exp( ) Pr( 1| ) * , 1 exp( ) T T Z p Y Z p r Z            .

The coefficient  satisfies the following constraint:

1 1 1 1 min , min , . 1 1 1 1 p r r p p r p r p r r p p r p r p p p p p p p p p p p p p pp p                      

(19)

We will check whether this model satisfies the reproducible assumption via simulations. The

following table summarizes the joint probability of (Yp,Yr) given (Zp,Zr).

1 r YYr 0 1 p Y  1 ( p)( r) Pp p 1 1 (1 ) (1 ) (1 ) p r p p r r p p p p p p       3 ( p)(1 r) Ppp 1 (1 ) (1 ) (1 ) p r p p r r p p p p p p       0 p Y  2 (1 p)( r) P  p p 1 (1 ) (1 ) (1 ) p r p p r r p p p p p p       4 (1 p)(1 r) P  pp (1 ) (1 ) (1 ) p r p p r r p p p p p p      

Table 3.1 Joint probability for (Y Yp, r) given (Zp,Zr)

3.2.2 Likelihood analysis based on familial case-control data

A case-control study involves two independent samples from Pr( ,Y Z Yr | p 1) and

Pr( ,Y Z Yr | p 0). Notice that Pr( ,Y Z Yr | p)Pr(Y Z Yr | , p) Pr( |Z Yp) and

Pr( |Z Yp)Pr(Z Z Yp, r | p)Pr(Zp|Yp) Pr(Z Y Zr | p, p).

The reproducible assumption implies that, given Zp, Yp and Z are independent. Hence r

Pr( |Z Yp)Pr(Zp |Yp) Pr(Zr|Zp). In summary we have

 

 

Pr Y Z Yr, | p Pr Y Z Yr, | p Pr Zp|Yp Pr Zr|Zp

Pr(Zp|Yp)



Pr(Y Z Yr| , p) Pr(Zr |Zp)

 . (3.13)

Recall that Pr(Zp|Yp) can be analyzed assuming that the data is from a prospective sample

(20)

* * Pr( 1| , ) log Pr( 0 | , ) p p T p p p Y Z Z Y Z         .

Accordingly the retrospective likelihood function is given by

* 1 , , , Pr , | i i n r i p i L     Y Z Y  

1 1 Pr | Pr | , i i i i n n p p r p i i i Z Y Y Y Z   

*(1) (2) , , , L   L      .

Notice that the information of  is also contained in familial case-control data when the reproducible assumption holds for the joint model.

(21)

Chapter 4 Simulations for Logistic Regression Analysis

In this chapter, we propose data generation algorithms to simulate case-control data for

logistic regression analysis. Some crucial probability statements will be examined to verify

whether the simulated data are reliable for statistical inference.

4.1 Data Generation for Individual Data 4.1.1 Prospective data of the true population

First of all, we generate population data from the model:

exp

Pr 1| , 1 exp Z Y Z Z           .

Then set the values of the parameters:  ,  and p. The algorithm is summarized below:  Step 1: Generate Zi Bernoulli( )p

 Step 2: Given Z , generate i

exp Bernoulli . 1 exp i i i Z Y Z             

The procedure is repeated for i1,...,N for very large N , say N10000. Denote

{( ,Y Zi i)(i 1,...,N)}

   .

4.1.2 Case-control data from the true population

Suppose that we generate nN observations from the population with n persons 1

from the case group with Y 1 and n0  n n1 persons from the control group Y 0. The procedure is stated as follows.

 Step 1: Randomly select n subjects from the case group and record their values of 1 Zi;

 Step 2: Randomly select n subjects from the control group and record their values of 0

i

(22)

We briefly discuss how to implement Step 1 since Step 2 follows a similar procedure. First

identify the case population:  1 {(Yi 1,Zi) (i1,...,N1)}where 1

1 ( 1) N i i N I Y  

 . The

objective is to select n observations from 1 N subjects. Label the subjects in 11 from 1 to

1

N . At the first time, generate U U

 

0,1 and define s

N1U

, where

 

is the Gauss function. A subject with label “s ” is selected into the case-control sample and removed

from 1. The procedure is repeated n times. Specifically at the 1 kth time, generate

 

0,1

U U and a subject with a re-defined label s

(N1  k 1) U

is selected from the remaining case population containing N1 k 1 subjects. Finally the case sample is formed and denoted as {(Yk 1,Zk) (k 1,..., )}n1 . The control sample can be generated in a similar way.

4.2. Analysis on Individual Data

We examine whether the proposed case-control sampling procedure produces reliable

data. We let 1 1 * 1 1 ( *, 1) / ( 1) ( *, 0) / ( 0) N N i i i i i N N i i i i i I Z Y I Y R I Z Y I Y           

*0,1

and

1 1 * 1 1 *, 1 / *, 0 / n i i i n i i i I Z Y n r I Z Y n       

* 0 , 1

be the empirical estimates of

Pr * | 1, Pr * | 0, Z Y Z Y       and

* * Pr * | 1, Pr * | 0, Z Y Z Y       respectively.

The first criteria to evaluate the quality of data is checking whether r is close to * R . The *

(23)

the combined case-control data: {(Yk 1,Zk) (k 1,..., )}n1 and {(Yk 0,Zk) (k 1,...,n0)}. The MLE of ˆ and ˆ are obtained. By checking whether ˆ is close to the true value, we can examine whether the case-control data provide reliable information of . The results are summarized in Tables 4.1. We observe that the empirical estimate r is close to * R *

obtained from the population data, the estimations of ˆ are also stable and close to the true value.

Table 4.1: Logistic regression analysis of case-control data

0.5,N 10000, Replications 100     1 100, 0 100 nn  0 0 r R 1 1 r R

 

3 ˆ 10    SE of ˆ 0.3 p 1.013132 1.018201 22.336209 0.034299 0.5 p 1.024408 0.992022 33.755393 0.027108 0.7 p 1.021804 1.003206 1.477217 0.028456 1 50, 0 150 nn  0 0 r R 1 1 r R

 

3 ˆ 10    SE of ˆ 0.3 p 0.991638 1.044951 31.116517 0.033287 0.5 p 0.978128 1.033878 56.818868 0.032211 0.7 p 1.013979 1.004441 20.400237 0.036247 1 150, 0 50 nn  0 0 r R 1 1 r R

 

3 ˆ 10    SE of ˆ 0.3 p 1.018330 1.038011 17.547938 0.036607 0.5 p 1.037552 1.021409 27.831059 0.038455 0.7 p 1.042732 1.009384 6.449529 0.038106

(24)

4.3 Data Generation for Familial Data

4.3.1 Familial prospective data of the true population

We first generate data following the model proposed by Bahadur (1961). First set the

values of the parameters:  , ,  and p. The algorithm is summarized below:  Step 1: Generate

i p

Z following Bernoulli

 

p and

i r

Z independently also following

Bernoulli

 

p ;  Step 2: Given i p Z and i r

Z , compute P P P P mentioned in Table 3.1; 1, 2, 3, 4  Step 3: Generate Ui Uniform 0,1

 

;

 Step 4: Set 1 1 1 2 1 2 1 2 3 1 2 3 1, 1 if 0 0, 1 if 1, 0 if 0, 0 if 1 i i i i i i i i p r i p r i p r i p r i Y Y U P Y Y P U P P Y Y P P U P P P Y Y P P P U                  .

The procedure is repeated for i1,...,N for N10000.

4.3.2 Familial case-control data from the true population

The procedure is stated as follows.

 Step 1: Randomly select n probands from the case families with 1 Ypi 1 and record the

values of ( , , )

i i i

p r r

Z Y Z ;

 Step 2: Randomly select n probands from the control families with 0 Ypi 0 and record

their values of ( , , )

i i i

p r r

Z Y Z .

4.4. Analysis on Familial Data

We first examine whether the algorithm for generating perspective data satisfies the

(25)

1 1 1 ( , , ) ( , , ) ( , ) N pi p pi p ri r i p p r N pi p ri r i I Y y Z z Z z q y z z I Z z Z z        

, and 1 1 1 ( , ) ( , ) ( ) N pi p pi p i p p N pi p i I Y y Z z q y z I Z z      

; 1 2 1 ( , , ) ( , , ) ( , ) N ri r pi p ri r i r p r N pi p ri r i I Y y Z z Z z q y z z I Z z Z z        

, and 2( r, r) q y z 1 1 ( , ) ( ) N ri r ri r i N ri r i I Y y Z z I Z z      

.

The reproducible condition should imply that q y z1( p, p,zr) q y z1( p, p) and q y z2( r, p,zr) 2( r, r)

q y z

 . The results of these quantities based on prospective data and case-control data from the true population are recorded in Table 4.2 ~ 4.4. In analyzing the case-control familial

data, we assume  and  are known and then obtain the MLE of . By checking whether ˆ is close to the true value, we can examine whether the familial case-control data provide reliable information of the association in a family. This result is given in Table 4.5.

We observe that the performance of the reproducible properties is good in our population

data which means that the model is appropriate. But sometimes the reproducible properties do

not reflected in the simulated case-control data. Accordingly the estimations of ˆ will have worse results in these situations.

(26)

Table 4.2.A: Checking reproducible properties of the population data (p=0.3) 10000 N (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.734649 q10.630283 q1 0.265351 q10.369717 0 r zq1 0.729743 q10.638770 q1 0.270257 q10.361230 1 q 0.731267 q10.636183 q1 0.268733 q10.363817 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.718202 q2 0.607213 q2 0.281798 q2 0.392787 0 p zq2 0.752438 q2 0.642232 q2 0.247562 q2 0.357768 2 q 0.742251 q2 0.632012 q2 0.257749 q2 0.367988

Table 4.2.B: Checking reproducible properties based on case-control data (p=0.3)

1 0 10000, 100, 100 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.545455 q10.547170 q1 0.454545 q10.452830 0 r zq1 0.696970 q10.391304 q1 0.303030 q10.608696 1 q 0.636364 q10.448276 q1 0.363636 q10.551724 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.590909 q2 0.575758 q2 0.409091 q2 0.424242 0 p zq2 0.698113 q2 0.521739 q2 0.301887 q2 0.478261 2 q 0.666667 q2 0.536000 q2 0.333333 q2 0.464000

(27)

Table 4.2.C: Checking reproducible properties based on case-control data (p=0.3) 1 0 10000, 50, 150 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.363636 q10.282609 q1 0.636364 q10.717391 0 r zq1 0.222222 q10.218750 q1 0.777778 q10.781250 1 q 0.275862 q10.239437 q1 0.724138 q10.760563 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.590909 q2 0.222222 q2 0.409091 q2 0.777778 0 p zq2 0.652174 q2 0.427083 q2 0.347826 q2 0.572917 2 q 0.632353 q2 0.371212 q2 0.367647 q2 0.628788

Table 4.2.D: Checking reproducible properties based on case-control data (p=0.3)

1 0 10000, 150, 50 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.782609 q10.763158 q1 0.217391 q10.236842 0 r zq1 0.818182 q10.690476 q1 0.181818 q10.309524 1 q 0.807692 q1 0.713115 q1 0.192308 q10.286885 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.695652 q2 0.527273 q2 0.304348 q2 0.472727 0 p zq2 0.842105 q2 0.619048 q2 0.157895 q2 0.380952 q 0.786885 q 0.582734 q 0.213115 q 0.417266

(28)

Table 4.3.A: Checking reproducible properties of the population data (p=0.5) 10000 N (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.725902 q10.611177 q1 0.274098 q10.388823 0 r zq1 0.739654 q10.627195 q1 0.260346 q10.372805 1 q 0.732735 q10.619038 q1 0.267265 q10.380962 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.729869 q2 0.611089 q2 0.270131 q2 0.388911 0 p zq2 0.726092 q2 0.616170 q2 0.273908 q2 0.383830 2 q 0.727973 q2 0.613609 q2 0.272027 q2 0.386391

Table 4.3.B: Checking reproducible properties based on case-control data (p=0.5)

1 0 10000, 100, 100 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.500000 q10.390244 q1 0.500000 q10.609756 0 r zq1 0.548387 q10.529412 q1 0.451613 q10.470588 1 q 0.527778 q10.467391 q1 0.472222 q10.532609 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.695652 q2 0.532258 q2 0.304348 q2 0.467742 0 p zq2 0.609756 q2 0.470588 q2 0.390244 q2 0.529412 2 q 0.655172 q2 0.504425 q2 0.344828 q2 0.495575

(29)

Table 4.3.C: Checking reproducible properties based on case-control data (p=0.5) 1 0 10000, 50, 150 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.325581 q10.245614 q1 0.674419 q10.754386 0 r zq1 0.306122 q10.137255 q1 0.693878 q10.862745 1 q 0.315217 q10.194444 q1 0.684783 q10.805556 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.604651 q2 0.244898 q2 0.395349 q2 0.755102 0 p zq2 0.614035 q2 0.215686 q2 0.385965 q2 0.784314 2 q 0.610000 q2 0.230000 q2 0.390000 q2 0.770000

Table 4.3.D: Checking reproducible properties based on case-control data (p=0.5)

1 0 10000, 150, 50 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.825397 q10.636364 q1 0.174603 q10.363636 0 r zq1 0.733333 q10.787879 q1 0.266667 q10.212121 1 q 0.780488 q10.701299 q1 0.219512 q10.298701 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.777778 q2 0.550000 q2 0.222222 q2 0.450000 0 p zq2 0.727273 q2 0.727273 q2 0.272727 q2 0.272727 q 0.757009 q 0.612903 q 0.242991 q 0.387097

(30)

Table 4.4.A: Checking reproducible properties of the population data (p=0.7) 10000 N (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.725744 q10.631256 q1 0.274256 q10.368744 0 r zq1 0.713138 q10.625140 q1 0.286862 q10.374860 1 q 0.721928 q10.629445 q1 0.278072 q10.370555 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.718154 q2 0.607750 q2 0.281846 q2 0.392250 0 p zq2 0.731822 q2 0.622896 q2 0.268178 q2 0.377104 2 q 0.722294 q2 0.612238 q2 0.277706 q2 0.387762

Table 4.4.B: Checking reproducible properties based on case-control data (p=0.7)

1 0 10000, 100, 100 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.505155 q10.428571 q1 0.494845 q10.571429 0 r zq1 0.577778 q10.437500 q1 0.422222 q10.562500 1 q 0.528169 q10.431034 q1 0.471831 q10.568966 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.659794 q2 0.422222 q2 0.340206 q2 0.577778 0 p zq2 0.642857 q2 0.375000 q2 0.357143 q2 0.625000 2 q 0.654676 q2 0.409836 q2 0.345324 q2 0.590164

(31)

Table 4.4.C: Checking reproducible properties based on case-control data (p=0.7) 1 0 10000, 50, 150 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.261364 q10.111111 q1 0.738636 q10.888889 0 r zq1 0.363636 q10.260870 q1 0.636364 q10.739130 1 q 0.295455 q10.161765 q1 0.704545 q10.838235 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.431818 q2 0.386364 q2 0.568182 q2 0.613636 0 p zq2 0.600000 q2 0.521739 q2 0.400000 q2 0.478261 2 q 0.488722 q2 0.432836 q2 0.511278 q2 0.567164

Table 4.4.D: Checking reproducible properties based on case-control data (p=0.7)

1 0 10000, 150, 50 Nnn  (y zp, p)(1,1) (y zp, p)(1, 0) (y zp, p)(0,1) (y zp, p)(0, 0) 1 r zq1 0.724490 q10.575758 q1 0.275510 q10.424242 0 r zq1 0.857143 q10.900000 q1 0.142857 q10.100000 1 q 0.768707 q1 0.698113 q1 0.231293 q10.301887 (y zr, r)(1,1) (y zr, r)(1, 0) (y zr, r)(0,1) (y zr, r)(0, 0) 1 p zq2 0.734694 q2 0.755102 q2 0.265306 q2 0.244898 0 p zq2 0.787879 q2 0.800000 q2 0.212121 q2 0.200000 q 0.748092 q 0.768116 q 0.251908 q 0.231884

(32)

Table 4.5: The MLE of based on case-control familial data 10000, 0.5 N   0.5   1 100, 0 100 nnn150,n0 150 n1 150,n0 50 0.3 p ˆ 0.522000 ˆ 0.400637 ˆ 0.473103 0.5 p ˆ 0.467715 ˆ 0.614172 ˆ 0.519416 0.7 p ˆ 0.562765 ˆ 0.534639 ˆ 0.399379

(33)

Chapter 5 Regression Analysis Based on Familial Data

For those who have developed the disease, the age at onset may be informative. As

mentioned in Li et al. (1998), early age of onset has been a hallmark for genetic predisposition

in most of diseases that aggregate in families. When age-at-onset is chosen as the primary

response, the effect of censoring has to be considered in the analysis.

In this chapter, we discuss several important issues on analyzing familial age-onset data.

Specifically denote T as the age-onset variable and Z as a p1 vector of covariates. The Cox proportional hazards model is the most well-known model for failure time variables

which can be written as

|

0

 

exp

,

T

t Z t Z

   (5.1) where 0( )t is the baseline hazard function and  measures the effect of Z on the hazard and is of major interest. In familial failure-time analysis, the Cox model is imposed on

probands. For inference of , we first review the analysis based on a prospective sample and then extend the discussion to a valid case-control sample. Finally we will discuss the

modeling and inference frameworks when familial case-control data are collected.

5.1 Likelihood Analysis Based on Probands

Under right censoring, let C be the censoring variable. One observes that X  T C,

( )

I T C

   and covariates Z. In prospective studies, we identify a sample of individuals with specified covariates: Z and then determine their observed time and disease status: i

(Xi, )i for i1,...,n. At time t , the risk set can be denoted as R t( ){ :i Xit i, 1,..., }n .

Given the risk set information, a subject failing at time t with covariate Zj given jR t( )

will contribute to the partial likelihood by

 

 

0 0 exp exp | . | exp exp T T j j j T T i i i t Z Z t Z t Z t Z Z          

(5.2)

(34)

Prentice and Breslow (1978) discussed the likelihood formulation based on case-control

age-onset data. It is important to first introduce the sampling procedure which involves how

to match a case subject with a control subject. Specifically at time t , m t observations are ( ) sampled from the case population containing those who develop the disease at time t and,

independently, n t( ) observations are sampled from the control population containing those

who have not developed the disease up to time t . Observed data can be summarized in Table

5.1.

Time:t i Case:

Xti, 1

Control:

Xti, 0

1 t m t individuals ( )1 n t individuals ( )1 i t m t individuals ( )i n t individuals ( )i k t m t( )k individuals n t( )k individuals

Table 5.1 Age-matched case-control data

The case-control design for collecting age-onset data considers sampling from the

conditional distribution of Z based on ( , )X  . To establish the relationship between prospective and retrospective samples, Prentice and Breslow (1978) extended the result of

Cornfield (1951) to age-onset data and derived the following condition:

Pr( | , 1) / Pr( 0 | , 1) Pr( | , 0) / Pr( 0 | , 0) Pr( , 1| ) / Pr( , 1| 0) . Pr( , 0 | ) / Pr( , 0 | 0) Z X t Z X t Z X t Z X t X t Z X t Z X t Z X t Z                              (5.3)

Notice that when C is independent of both T and Z , we have

Pr( , 1| ) Pr( , | ) Pr( | ) Pr( ) ( ) ; Pr( , 0 | ) Pr( , | ) Pr( | ) Pr( ) C( ) X t Z T t C t Z T t Z C t t X t Z T t C t Z T t Z C t t                    Pr( , 1| 0) Pr( | 0) Pr( ) 0( ) . Pr( , 0 | 0) Pr( | 0) Pr( ) C( ) t X t Z T t Z C t X t Z T t Z C t t                  

(35)

Hence the right-hand side of (5.3) equals ( ) /t0( )t and, under the proportional hazard model, (5.3) becomes Pr( | , 1) / Pr( 0 | , 1) exp( ) Pr( | , 0) / Pr( 0 | , 0) T Z X t Z X t Z Z X t Z X t               . (5.4) Rearranging (5.4), we obtain Pr( 0 | , 1) Pr( | , 1) Pr( | , 0) exp( ) Pr( 0 | , 0) T Z X t Z X t Z X t Z Z X t                

which is equation (2) in Li et al. (1998). The left-hand side of (5.4) is identifiable based on

case-control data which implies that  is also identifiable based on such data.

Prentice and Breslow (1978) proposed a conditional likelihood approach for estimating

based on case-control data. At time t , define R m t n t( ( ), ( )) as a set of all subsets of size ( )

m t from a total of m t( )n t( ) subjects. Given this risk set information, the first m t ( ) subjects with covariates Z1,...,Zm t( ) respectively actually belonging to the case group will contribute the probability

( ) 1 ( ) ( ( ), ( )) 1 | , | m t i i m t lj l R m t n t j t Z t Z     

 

(5.5)

where Zlj denotes the covariate value for j th subject in the l th combinations. Notice that

( ) ( ) 0 1 ( ) 1 | { ( )} exp{ ( ... )} m t m t T i m t i t Z t Z Z       

and

( ) ( ) 0 1 ( ) 1 | { ( )} exp{ ( ... )}. m t m t T lj l lm t j t Z t Z Z       

It follows that

( ) 1 ( ) ( ( ), ( )) ( ( ), ( )) 1 | exp( ) , exp( ) | m t i T i m t T l lj l R m t n t l R m t n t j t Z s s t Z       

 

(5.6)

(36)

where sZ1 ... Zm t( ) and slZl1 ... Zlm t( ). Finally the likelihood can be written as 1 ( ( ), ( )) exp( ) , exp( ) j j T k T j l l R m t n t s s    

(5.7)

where t1 ... tk denote observed failure times for the case group. It is important to note that

( ( ), ( ))

R m t n t only includes subjects who are sampled from the retrospective study at time t .

Hence it does not have the nested property of a regular risk set such as R t( )R t( ). It is important to mention that computation of (5.7) involves all possible permutations in the

denominator which is very time-consuming if m t( )n t( ) is not small. Several authors proposed algorithms to approximate the likelihood.

We provide a numerical example to illustrate construction of R m n( , ) in which the label

t is ignored to simply the presentation. Suppose the case-sample contains subjects with

covariates Z and 1 Z respectively and the matched control-sample contains subjects with 2

covariates Z and 3 Z respectively. Hence 4 R(2, 2) consists of 4 2

   

  combinations which

can be labeled by l1,..., 6 corresponding to

1 2 1 3 1 4 2 3 2 4 3 4

(Z Z, ), (Z Z, ), (Z Z, ), (Z Z, ), (Z Z, ), (Z Z, ) sets of covariates respectively. For example Z22Z3 corresponds to the second covariate with l2 . It follows that

1 2

sZZ , s1Z1Z2 , s2Z1Z3 , s3Z1Z4 , s4Z2Z3 , s5Z2Z4 and

6 3 4 sZZ .

5.2 Likelihood Analysis Based on Familial Data

Table 5.2 summarizes observed case-control familial data in which probands’ times

(onset or censored) are matched. Specifically at time t , we sample i m t case probands

 

i

and their relatives and matched with n t control probands and their relatives. Denote

 

i ( p, r)

XX X as observed times and  ( p, r) as the corresponding indicators for a proband and his/her relative respectively.

(37)

Time Case family:

Xpti,p 1

Control family:

Xpti,p 0

1

t m t probands and their

 

1

relatives

 

1

n t probands and their

relatives

i

t m t probands and their

 

i

relatives

 

i

n t probands and their

relatives

k

t m t

 

k probands and their

relatives

 

k

n t probands and their relatives

Table 5.2 Age-matched case-control familial data

To simply the presentation, assume that there are two members in one family (one

proband and one relative). Observed information for a case subject includes

(Xpt,p 1,Zp,Xr,r,Zr) while the information for the corresponding age-matched control subject includes (Xpt,p 0,Zp,Xr,r,Zr) . Two samples from

Pr( Xr,r ,Z Zp, r | Xpt,p 1 ) and Pr(

Xr,r

,Z Zp, r |

Xpt,p 0 )

are drawn

independently.

Li et al. (1998) extended the discussions in Whittemore (1995) from binary data to

age-onset data. The model assumption consists of two stages. In the first stage, the model on

the proband namely Pr(Zp|

Xp,p

), is assumed to follow the Cox model. In the second stage, the model on Pr(

Xr,r

|

Xp,p

, )Z is constructed where Z (Zp,Zr). It should

數據

Figure 1: Scientific Background
Table 3.1 Joint probability for  ( Y Y p , r )   given  ( Z p , Z r )
Table 4.1: Logistic regression analysis of case-control data  0.5, N 10000, Replications 100 1 100, 0 100nn 0 0rR 1 1rR  ˆ 10 3  SE of  ˆ 0.3p 1.013132  1.018201   22.336209  0.034299  0.5p 1.024408  0.992022   33.755393  0.027108  0.7p 1
Table 4.5: The MLE of     based on case-control familial data  10000, 0.5N   0.5 1 100, 0 100nn n 1  50, n 0  150 n 1  150, n 0  50 0.3p  ˆ  0.522000  ˆ  0.400637  ˆ  0.473103  0.5p ˆ  0.467715  ˆ  0.614172  ˆ  0.519416  0.7p ˆ
+4

參考文獻

相關文件

The new control is also similar to an R-format instruction, because we want to write the result of the ALU into a register (and thus MemtoReg = 0, RegWrite = 1) and of course we

• bool CanIntersect() returns whether this shape can do intersection test; if not, the shape must provide.

Depending on the specified transfer protocol and data format, this action may return the InstanceID of an AVTransport service that the Control Point can use to control the flow of

A factorization method for reconstructing an impenetrable obstacle in a homogeneous medium (Helmholtz equation) using the spectral data of the far-field operator was developed

A factorization method for reconstructing an impenetrable obstacle in a homogeneous medium (Helmholtz equation) using the spectral data of the far-eld operator was developed

This theorem does not establish the existence of a consis- tent estimator sequence since, with the true value θ 0 unknown, the data do not tell us which root to choose so as to obtain

In this paper, we evaluate whether adaptive penalty selection procedure proposed in Shen and Ye (2002) leads to a consistent model selector or just reduce the overfitting of

Based on [BL], by checking the strong pseudoconvexity and the transmission conditions in a neighborhood of a fixed point at the interface, we can derive a Car- leman estimate for