
Motivation and Background

In the thesis, we consider a pair of failure times (X,Y) which can be included in the sample only if X ≤ Y. The variable Y is said to be “left truncated” by X and X is said to be “right truncated” by Y. In many applications, one variable is of major interest while the other is a nuisance. Klein and Moeschberger (2003) gave an example which studied the survival distribution of elderly residents in a retirement center. In the example, X denotes a subject’s age at entry into the retirement community and Y denotes the person’s lifetime. Only those who had lived long enough to be eligible for joining the retirement community could be included in the sample. Therefore the truncation scheme has to be taken into account in the development of inference methods for Y. Most nonparametric inference methods for truncation data assume independence between X and Y (e.g. Lynden-Bell, 1971 and Woodroofe, 1985). Under this assumption, Lynden-Bell suggested estimating Pr(Y > t) based on the product-limit expression of this quantity, and thereafter many nice properties of the Kaplan-Meier estimator for right censored data have been extended to the truncation setting.

Tsai (1990) claimed that the independence assumption can be relaxed to the weaker assumption of “quasi-independence”. Unlike the independent censorship assumption under right censoring, which is not testable, quasi-independence can be verified nonparametrically.

Tsai (1990) introduced a measure of “conditional Kendall’s tau” which was later applied to different truncation settings by Martin and Betensky (2005). Tsai also proposed a test of quasi-independence based on this measure. Alternatively, Chen, Tsai, and Chao (1996) suggested a conditional version of Pearson’s product-moment correlation coefficient, denoted as ρc, to measure the association between X and Y. Based on the sample version of ρc, they proposed a test for quasi-independence. However, the method based on ρc cannot be extended to the more general situation that also includes right-censoring.

Figure 1.a: individuals with X ≤ Y can be observed.

Figure 1.b: individuals with X > Y cannot be observed.

In some applications, X and Y may be correlated and their dependence relationship is of interest. Tsai (1990) applied his testing procedure to an example from a transfusion-related AIDS study. Let T be the infection time of an individual, measured from the beginning of the study, and X be the incubation period from the time of infection to AIDS. Only individuals who developed AIDS by the end of the study can be observed (see Figure 1.a). Since the total study period is 102 months, individuals with T + X ≤ 102 were included in the sample.

Using the notation Y = 102 − T, we view X as being right truncated by Y. Primary interest in this study focuses on the distribution of the incubation period X. Dependence between X and Y might be of secondary interest. However, when Tsai’s method (1990) was applied, the assumption of quasi-independence was rejected. Positive association between X and Y means negative association between T and X. That is, the earlier the infection time, the longer the incubation period. This surprising finding might shed some light on the study of population dynamics of AIDS.

Recently Chaieb et al. (2006) proposed a semi-parametric inference approach to assessing the dependence between X and Y under the assumption that the two variables jointly follow a modified version of an Archimedean copula (AC) model which adapts to the nature of truncation. Copula models have the nice feature that the dependence structure is modeled separately from the marginal effects. Semiparametric inference for copula models has received substantial attention in the literature. There exist several ways of estimating the association parameter, for a specific copula model or a class of copula models, without specifying the marginal distributions. One popular approach, which has been taken by Oakes (1986) for right censored data and by Fine et al. (2001) for semi-competing risks data, is to utilize the concordance or discordance information for pairs of observations. This idea has been taken by Chaieb et al. (2006) in the analysis of dependent truncation data. Compared with the previous results, the new challenge is that the association parameter cannot be estimated without knowing the truncation probability. Hence Chaieb et al. (2006) also considered estimation of the truncation probability and the marginal functions. Their proposed algorithm can be considered as an extension of the method by Rivest and Wells (2001), who considered the situation of dependent censoring.

The dissertation contains two parts, both of which deal with possibly correlated truncation data. The first project was motivated by the paper of Chaieb et al. (2006), but a different inference approach is proposed. Besides providing a new method, we also aim to unify the two different types of inference approaches under a general framework. In the second project we study the problem of testing quasi-independence. Specifically we construct a testing procedure similar to the setup of the weighted log-rank statistics, constructed from a series of two-by-two tables. The proposed test is nonparametric in the sense that no model assumption is needed. We also derive an equivalent expression of the proposed test statistic which allows us to compare different methods under the same framework. It turns out that the proposed test statistic can be viewed as a generalized version of some existing tests, including Tsai’s test (1990). Furthermore, in both projects, the likelihood information is utilized to improve the efficiency of the proposed estimator or the power of the proposed test.

1.2 Overview of the Dissertation

Literature review is given in Chapter 2. The first part focuses on bivariate analysis in which some common association measures and models for lifetime variables are introduced and related inference results are reviewed. In particular the family of copula models and its sub-class, Archimedean copula models, are discussed. Different semi-parametric inference approaches developed for analyzing data which follow copula models are examined.

Specifically we focus on three methods of constructing an estimating function for the copula association parameter. One is the conditional likelihood approach, which first appeared in the landmark paper of Clayton (1978) for bivariate censored data. The second approach utilizes concordance information of paired observations and has been applied to bivariate censored data by Oakes (1986), to semi-competing risks data by Fine et al. (2001) and to dependent truncation data by Chaieb et al. (2006). The third approach constructs estimating functions based on a series of two-by-two tables, and has been applied by Day et al. (1997) and Wang (2003) in the analysis of semi-competing risks data. In the second part of Chapter 2, we review the literature on marginal estimation. The idea of the product-limit expression has been used to construct the Kaplan-Meier estimator and the Lynden-Bell estimator under independent censoring and (quasi-) independent truncation respectively. Many papers have studied the situation when the assumption of independence fails. We will review the papers which use copula models to specify the dependence relationship.

Chapters 3 and 4 contain our results for the two projects. Specifically, in Chapter 3, we consider semi-parametric inference based on semi-survival AC models under the framework proposed by Chaieb et al. (2006). Besides proposing a new inference approach which turns out to be more efficient, we also establish the relationships among different estimating functions. The unified framework allows us to compare different methods in a systematic way, and hopefully such analysis can facilitate future development of statistical methodology or inference theory. In Chapter 4, we consider the problem of testing quasi-independence for truncation data. We propose a general class of test statistics which includes some existing tests as special cases. In addition, we discuss how to incorporate additional likelihood information provided by the alternative hypothesis to improve the power of the test.

Chapter 2 Literature Review

2.1 Association Measures and Copula Models

To simplify the analysis, let (X,Y) be a pair of continuous failure time variables.

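Kendall’s tau, denoted as $\tau$, is a rank-correlation measure which is often used to describe the level of global association between $X$ and $Y$. Let $(X_i,Y_i)$ and $(X_j,Y_j)$ be two independent replications of $(X,Y)$ and let $\Delta_{ij} = I\{(X_i - X_j)(Y_i - Y_j) > 0\}$ be the concordance indicator. Then

$$\tau = 2E(\Delta_{ij}) - 1. \qquad (2.1)$$

We note that $\tau$ has the nice property of rank invariance since its value is unchanged by both linear and nonlinear increasing transformations. For measuring local dependence or time-varying association, Oakes (1989) proposed the following cross-ratio function:

$$\tilde\theta(x,y) = \frac{S(x,y)\,\partial^2 S(x,y)/\partial x\,\partial y}{\{\partial S(x,y)/\partial x\}\{\partial S(x,y)/\partial y\}}, \qquad S(x,y) = \Pr(X > x, Y > y). \qquad (2.2)$$

Here $\tilde\theta(x,y) = 1$ corresponds to independence, while $\tilde\theta(x,y) > 1$ implies positive association and $\tilde\theta(x,y) < 1$ implies negative association respectively. Oakes also derived another useful expression of $\tilde\theta(x,y)$:

$$\tilde\theta(x,y) = \frac{\Pr(\Delta_{ij} = 1 \mid \tilde X_{ij} = x, \tilde Y_{ij} = y)}{\Pr(\Delta_{ij} = 0 \mid \tilde X_{ij} = x, \tilde Y_{ij} = y)}, \qquad (2.3)$$

where $\tilde X_{ij} = X_i \wedge X_j$ and $\tilde Y_{ij} = Y_i \wedge Y_j$.

The two expressions in (2.2) and (2.3) are useful in the development of inference methods for copula models, which will be introduced later.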

Modeling provides a systematic way of describing the behavior of random variables. Copulas form a class of bivariate distribution functions whose marginals are uniform on the unit interval (Genest and MacKay, 1986). In applications of lifetime data analysis, the copula structure is usually imposed on the joint survival function such that one can write

$$\Pr(X > x, Y > y) = C\{\Pr(X > x), \Pr(Y > y)\},$$

where the function $C(u,v): [0,1]^2 \to [0,1]$ can be viewed as the survival copula of $(X,Y)$ (Nelsen, 1999, p.28). When the copula function is parameterized as $C_\alpha(u,v)$, the parameter $\alpha$ is related to Kendall’s tau such that

$$\tau = 4\int_0^1\!\!\int_0^1 C_\alpha(u,v)\, C_\alpha(du,dv) - 1.$$

The copula family has the nice feature that the dependence structure can be studied separately from the marginal distributions. In practical applications, the association parameter α is often of major interest and can be estimated without specifying the marginal distributions. We will review existing semi-parametric inference methods developed for copula models later.

Archimedean copulas (AC) are special copula models which possess useful analytical properties. For an AC model, the bivariate copula function $C_\alpha(u,v)$ can be further simplified as

$$C_\alpha(u,v) = \phi_\alpha^{-1}\{\phi_\alpha(u) + \phi_\alpha(v)\} \quad \text{for } u,v \in [0,1], \qquad (2.4)$$

where $\phi_\alpha(\cdot): [0,1] \to [0,\infty]$ is a univariate function which has two continuous derivatives satisfying $\phi_\alpha(1) = 0$, $\phi_\alpha'(t) = \partial\phi_\alpha(t)/\partial t < 0$ and $\phi_\alpha''(t) = \partial^2\phi_\alpha(t)/\partial t^2 > 0$. A special property of AC models is that the bivariate relationship can be summarized by the univariate function $\phi_\alpha(\cdot)$. In applications, selecting an appropriate Archimedean copula model amounts to identifying the form of $\phi_\alpha(\cdot)$. For an AC model indexed by $\phi_\alpha(\cdot)$, Oakes (1989) showed that

$$\tilde\theta(x,y) = \theta_\alpha\{\Pr(X > x, Y > y)\},$$

where $\theta_\alpha(\cdot)$ is a univariate function satisfying

$$\theta_\alpha(v) = -v\,\phi_\alpha''(v)/\phi_\alpha'(v). \qquad (2.5)$$

When $\phi_\alpha(t) = -\log(t)$, $X$ and $Y$ are independent. For the Clayton model with $\phi_\alpha(t) = t^{-(\alpha-1)} - 1$ $(\alpha > 1)$, it can be shown that $\theta_\alpha(v) = \alpha$.
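The relation between the generator and the cross-ratio function in (2.5) can be checked numerically. The following is only a sketch: the function names are hypothetical, and the derivatives are approximated by central differences rather than computed analytically.

```python
import math

def clayton_phi(t, alpha):
    """Clayton generator phi_alpha(t) = t**-(alpha - 1) - 1, for alpha > 1."""
    return t ** (-(alpha - 1.0)) - 1.0

def clayton_phi_inv(s, alpha):
    """Inverse of the Clayton generator."""
    return (1.0 + s) ** (-1.0 / (alpha - 1.0))

def ac_copula(u, v, alpha):
    """Archimedean copula C_alpha(u, v) = phi^{-1}{phi(u) + phi(v)} as in (2.4)."""
    return clayton_phi_inv(clayton_phi(u, alpha) + clayton_phi(v, alpha), alpha)

def theta(phi, v, h=1e-4):
    """theta(v) = -v * phi''(v) / phi'(v) as in (2.5), by central differences."""
    d1 = (phi(v + h) - phi(v - h)) / (2.0 * h)
    d2 = (phi(v + h) - 2.0 * phi(v) + phi(v - h)) / h ** 2
    return -v * d2 / d1

# Clayton generator: theta is constant and equal to alpha.
print(theta(lambda t: clayton_phi(t, 3.0), 0.5))
# Independence generator -log(t): theta is equal to 1.
print(theta(lambda t: -math.log(t), 0.4))
```

Because the Clayton cross ratio does not depend on v, the same value is returned (up to numerical error) at any point of (0,1), which is the property used repeatedly in the later sections.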

2.2 Semi-parametric Inference for Survival-copula Models

There has been substantial interest in developing inference methods for estimating the association parameter of a copula model without specifying the marginal distributions. Most results have been derived for survival copula models, in which the copula structure is imposed on the joint survival function as mentioned earlier. Early work focused on the Clayton model (Clayton, 1978), a member of the AC family with $\phi_\alpha(t) = t^{-(\alpha-1)} - 1$ $(\alpha > 1)$ and $\tilde\theta(x,y) = \alpha$. Clayton (1978) proposed to maximize a product of conditional probabilities, and later his estimator was re-expressed by Clayton and Cuzick (1985) as a weighted form of Oakes’ concordance estimator (Oakes, 1982). The new representation is related to a U-statistic, which turns out to be useful in establishing asymptotic properties (Oakes, 1986).

There has been a trend to develop unified inference approaches suitable for a class of copula models rather than a single member, say the Clayton model. The approach of two-stage estimation has been adopted by Genest et al. (1995), Shih and Louis (1995) and Wang and Ding (2000) for complete data, bivariate right censored data and current status data respectively. Specifically $C_\alpha(u,v)$ can be viewed as the joint distribution function of $U = S_X(X)$ and $V = S_Y(Y)$, where $S_X(t) = \Pr(X > t)$ and $S_Y(t) = \Pr(Y > t)$. If the marginals were completely specified, then a random sample of $(U,V)$, denoted as $(U_i,V_i) = (S_X(X_i), S_Y(Y_i))$ $(i = 1,\ldots,n)$, or its censored version could be used to construct the likelihood for $\alpha$. However, since the marginals are unspecified, a random sample of $(U,V)$ is not available. These papers suggested a two-stage estimation procedure. In the first stage, the marginal distributions are estimated by applying existing nonparametric methods. In the second stage, the marginal estimators are treated as “pseudo observations” in the likelihood constructed based on $C_\alpha(u,v)$. Despite its simplicity, this approach becomes infeasible when the data involve dependent censoring or other complicated situations in which the marginal distributions become nonparametrically non-identifiable.
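The two-stage idea can be sketched for complete (uncensored, untruncated) data under the Clayton model. This is an illustration under stated assumptions, not the procedure of any of the cited papers: the function names are hypothetical, the empirical survival function is rescaled by n + 1 so that the pseudo observations stay inside (0,1), and the second stage uses a plain grid search over α.

```python
import math
import random

def clayton_log_density(u, v, alpha):
    """Log density of the Clayton copula with generator t**-(alpha - 1) - 1."""
    th = alpha - 1.0
    a = u ** (-th) + v ** (-th) - 1.0
    return (math.log(1.0 + th)
            - (th + 1.0) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / th) * math.log(a))

def pseudo_survival(values):
    """Stage 1: rescaled empirical survival function at each observation."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order):       # rank 0 is the smallest value
        out[i] = (n - rank) / (n + 1.0)    # strictly inside (0, 1)
    return out

def two_stage_alpha(xs, ys, grid=None):
    """Stage 2: maximise the pseudo log-likelihood over a grid of alpha."""
    if grid is None:
        grid = [1.05 + 0.05 * k for k in range(180)]   # 1.05, ..., 10.0
    u, v = pseudo_survival(xs), pseudo_survival(ys)
    def loglik(alpha):
        return sum(clayton_log_density(ui, vi, alpha) for ui, vi in zip(u, v))
    return max(grid, key=loglik)

def simulate_clayton(n, alpha, rng):
    """Draw (X, Y) with unit exponential margins and a Clayton survival copula."""
    th = alpha - 1.0
    xs, ys = [], []
    for _ in range(n):
        u = 1.0 - rng.random()             # uniform on (0, 1]
        w = 1.0 - rng.random()
        v = ((w ** (-th / (1.0 + th)) - 1.0) * u ** (-th) + 1.0) ** (-1.0 / th)
        xs.append(-math.log(u))            # S_X(X) = exp(-X) = u
        ys.append(-math.log(v))
    return xs, ys

xs, ys = simulate_clayton(500, 3.0, random.Random(0))
print(two_stage_alpha(xs, ys))             # should be in the vicinity of 3
```

The grid search is chosen only for transparency; any one-dimensional optimizer could replace it. Note that the first stage relies on the marginal survival functions being estimable, which is exactly what fails under dependent censoring or dependent truncation.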

Semi-competing risks data provide such an example, in which one variable is a competing risk for the other but not vice versa, and hence the aforementioned two-stage estimation procedure is not applicable. For semi-competing risks data, two different approaches have been adopted. Specifically Day et al. (1997) and Wang (2003) constructed estimating functions, in the form of the log-rank statistics, based on a series of two-by-two tables in which the odds ratio of the table reveals the information of association. Day et al. (1997) considered the Clayton model with $\tilde\theta(x,y) = \alpha$, and Wang (2003) extended the idea to the whole AC family using the properties of (2.5). The second approach was proposed by Fine et al. (2001), who utilized equation (2.3) to construct an estimating function for the Clayton model based on the concordance indicator $\Delta_{ij}$, whose expected value contains the information of $\alpha$.

2.3 Association Measures and Copula Models Suitable for Truncation Data

For truncation data, we observe $(X,Y)$ only if $X \le Y$. Hence joint analysis has to be restricted to the upper wedge $R_U = \{(x,y): 0 \le x \le y < \infty\}$. Consequently the aforementioned descriptive measures and models may not be directly applicable to describe $(X,Y)$ if they have a truncation relationship.

Kendall’s tau defined in (2.1) is obviously not identifiable for truncation data. Tsai (1990) suggested considering the event $A_{ij} = \{\omega: \breve X_{ij}(\omega) \le \tilde Y_{ij}(\omega)\}$, where $\breve X_{ij} = X_i \vee X_j$ and $\tilde Y_{ij} = Y_i \wedge Y_j$. Notice that under the truncation scheme, as long as $(\breve X_{ij}, \tilde Y_{ij}) \in R_U$, or equivalently $\breve X_{ij} \le \tilde Y_{ij}$, it follows that $(X_i,Y_i)$ and $(X_j,Y_j)$ are both in $R_U$. By conditioning on the event $A_{ij}$, Tsai proposed the modified version of Kendall’s tau such that

$$\tau_a = 2E(\Delta_{ij} \mid A_{ij}) - 1, \qquad (2.6)$$

where $(X_i,Y_i)$ and $(X_j,Y_j)$ are two independent replications of $(X,Y)$, which are known to satisfy the truncation scheme with $X_i \le Y_i$ and $X_j \le Y_j$ given $A_{ij}$. The measure $\tau_a$ is a well-defined measure for truncation data.

To measure local dependence for truncation data, Chaieb et al. (2006) adopted Tsai’s idea to modify equation (2.3). Specifically for $x \le y$ they proposed to consider

$$\theta^*(x,y) = \frac{\Pr(\Delta_{ij} = 1 \mid (\breve X_{ij}, \tilde Y_{ij}) = (x,y))}{\Pr(\Delta_{ij} = 0 \mid (\breve X_{ij}, \tilde Y_{ij}) = (x,y))}. \qquad (2.7)$$

The value of $\theta^*(x,y)$ can be interpreted in the same way as $\tilde\theta(x,y)$. Notice that $\theta^*(x,y)$ in (2.7) and $\tilde\theta(x,y)$ in (2.3) differ in the way of choosing the corner position. Specifically for $\tilde\theta(x,y)$, the corner is chosen to be $(\tilde X_{ij}, \tilde Y_{ij}) = (X_i \wedge X_j, Y_i \wedge Y_j)$ while, for truncation data, the corner is $(\breve X_{ij}, \tilde Y_{ij}) = (X_i \vee X_j, Y_i \wedge Y_j)$. The measure $\tilde\theta(x,y)$ is not appropriate for truncation data since, given $(\tilde X_{ij}, \tilde Y_{ij}) \in R_U$, it is still possible that $(X_i,Y_i)$ or $(X_j,Y_j)$ may fall outside $R_U$. In contrast, by choosing $(\breve X_{ij}, \tilde Y_{ij})$ as the target in making the conditioning arguments, the two points will fall in $R_U$.

For truncation data, Chaieb et al. (2006) suggested imposing the model structure on the “semi-survival” function, defined as $\Pr(X \le x, Y > y)$ $(x \le y)$, which is a more natural descriptive measure than the joint survival function $\Pr(X > x, Y > y)$. Furthermore, since no information is available in the lower wedge $\{(x,y): 0 \le y < x < \infty\}$, the function $\pi(x,y) = \Pr(X \le x, Y > y \mid X \le Y)$ is nonparametrically identifiable while $\Pr(X \le x, Y > y)$ is not. Accordingly, adapting to the nature of truncation, Chaieb et al. (2006) suggested imposing the AC structure on $\pi(x,y)$ such that

$$\pi(x,y) = \phi_\alpha^{-1}[\phi_\alpha\{F_X(x)\} + \phi_\alpha\{S_Y(y)\}]/c \quad (x \le y), \qquad (2.8)$$

where $F_X(\cdot)$ and $S_Y(\cdot)$ are continuous distribution and survival functions respectively and $c$ is an unknown normalizing constant satisfying

$$c = \iint_{x < y} -\frac{\partial^2}{\partial x\,\partial y}\,\phi_\alpha^{-1}[\phi_\alpha\{F_X(x)\} + \phi_\alpha\{S_Y(y)\}]\,dx\,dy. \qquad (2.9)$$

Note that under model (2.8), the normalizing constant $c$ may not be the truncation proportion $\Pr(X \le Y)$, but it ensures that model (2.8) has a valid density function. Note also that when $\phi_\alpha(t) = -\log(t)$, quasi-independence between $X$ and $Y$ holds.

2.4 Statistical Inference for Truncated Data under Quasi-Independence

For truncation data, we observe $(X,Y)$ only if $X \le Y$. Replications of $(X,Y)$ are located in the upper wedge $R_U = \{(x,y): 0 \le x \le y < \infty\}$. The sample consists of $\{(X_j,Y_j)\ (j = 1,\ldots,n)\}$ subject to $X_j \le Y_j$. We can consider the sample $\{(X_j,Y_j)\ (j = 1,\ldots,n)\}$ as iid from the cumulative distribution function $H(x,y) = \Pr(X \le x, Y \le y \mid X \le Y)$. Let $X$ and $Y$ be positive independent random variables having the marginal distribution functions $\Pr(X \le x)$ and $\Pr(Y \le y)$. The independence between $X$ and $Y$ cannot be tested from data since the information for the lower wedge is unavailable. Thus, the independence assumption

$$H(x,y) = \frac{\int_0^x\!\int_0^y I(u \le v)\,d\Pr(X \le u)\,d\Pr(Y \le v)}{\int_0^\infty\!\int_0^\infty I(u \le v)\,d\Pr(X \le u)\,d\Pr(Y \le v)}$$

may not be acceptable unless independence between $X$ and $Y$ is known from prior knowledge. Instead, Wang, Jewell and Tsai (1986) assumed the model

$$H_0: H(x,y) = \int_0^x\!\int_0^y I(u \le v)\,dF_X(u)\,dF_Y(v)/c,$$

where $F_X$ and $F_Y$ are arbitrary distribution functions and $c$ is the normalizing constant satisfying

$$c = \iint_{x \le y} dF_X(x)\,dF_Y(y).$$

Tsai (1990) called the assumption under $H_0$ “quasi-independence”.

Using the semi-survival function, the assumption of quasi-independence can be simplified as

$$H_0: \Pr(X \le x, Y > y \mid X \le Y) = F_X(x)\,S_Y(y)/c,$$

where $F_X$ and $S_Y$ are arbitrary right continuous distribution and survival functions, and $c$ is the normalizing constant satisfying

$$c = -\iint_{x \le y} dF_X(x)\,dS_Y(y).$$

Define the support of $X$ as $[x_L, x_U]$, where $x_L = \inf\{u; F_X(u) > 0\}$ and $x_U = \sup\{u; F_X(u) < 1\}$. Similarly define the support of $Y$ as $[y_L, y_U]$, where $y_L = \inf\{u; S_Y(u) < 1\}$ and $y_U = \sup\{u; S_Y(u) > 0\}$. It is usually assumed that $x_L \le y_U$ so that $c > 0$. In general, the true distributions $F_X$ and $S_Y$ cannot be estimated nonparametrically without further assumptions. However the following conditional distributions are estimable:

$$F_X^0(x) = \Pr(X \le x \mid X \le y_U, Y \ge x_L), \qquad S_Y^0(y) = \Pr(Y > y \mid X \le y_U, Y \ge x_L).$$

Under the assumption of quasi-independence, Lynden-Bell (1971) derived the nonparametric maximum likelihood estimators (NPMLE) for the two marginal distributions, which can be expressed by the following explicit formulas:

$$\hat F_X(x) = \prod_{u > x}\left\{1 - \frac{R(u,0) - R(u-,0)}{R(u,u)}\right\}, \qquad \hat S_Y(y) = \prod_{u \le y}\left\{1 - \frac{R(\infty,u) - R(\infty,u+)}{R(u,u)}\right\}, \qquad (2.10)$$

where $R(x,y) = \sum_{j=1}^n I(X_j \le x, Y_j \ge y)$. Woodroofe (1985) showed the uniform consistency results

$$\sup_{x>0}|\hat F_X(x) - F_X^0(x)| \xrightarrow{P} 0; \qquad \sup_{y>0}|\hat S_Y(y) - S_Y^0(y)| \xrightarrow{P} 0.$$

Wang et al. (1986) derived a simple asymptotic variance for the Lynden-Bell estimator, which turns out to be an analogue of the asymptotic variance of the Kaplan-Meier estimator. A necessary condition for the above Lynden-Bell estimators to be consistent estimators for $F_X$ and $S_Y$ is that $x_U < y_U$ and $x_L < y_L$, so that $F_X = F_X^0$ and $S_Y = S_Y^0$. In other words, there exist two positive numbers $y_L < x_U$ such that $F_X(y_L) > 0$, $S_Y(y_L) = 1$, $F_X(x_U) = 1$ and $S_Y(x_U) > 0$.
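The product-limit formulas in (2.10) translate directly into code. The sketch below is a naive transcription with hypothetical function names; it assumes that every pair satisfies X ≤ Y and makes no attempt at efficiency.

```python
def risk_count(xs, ys, x, y):
    """R(x, y) = number of pairs with X_j <= x and Y_j >= y."""
    return sum(1 for xj, yj in zip(xs, ys) if xj <= x and yj >= y)

def lynden_bell_FX(xs, ys, x):
    """Product-limit estimator of F_X(x): product over observed X values u > x."""
    est = 1.0
    for u in sorted(set(xs)):
        if u > x:
            jump = sum(1 for xj in xs if xj == u)   # R(u, 0) - R(u-, 0)
            est *= 1.0 - jump / risk_count(xs, ys, u, u)
    return est

def lynden_bell_SY(xs, ys, y):
    """Product-limit estimator of S_Y(y): product over observed Y values u <= y."""
    est = 1.0
    for u in sorted(set(ys)):
        if u <= y:
            jump = sum(1 for yj in ys if yj == u)   # R(inf, u) - R(inf, u+)
            est *= 1.0 - jump / risk_count(xs, ys, u, u)
    return est

# Toy truncated sample: every pair satisfies X <= Y.
xs = [1.0, 2.0, 4.0]
ys = [3.0, 5.0, 6.0]
print(lynden_bell_FX(xs, ys, 2.0))   # 0.5
print(lynden_bell_SY(xs, ys, 5.0))   # 0.25
```

In the toy sample the risk set at u = 4 contains the two pairs (2,5) and (4,6), so the single factor for the first estimate at x = 2 is 1 − 1/2. The second estimate at y = 5 multiplies the factors at u = 3 and u = 5, each equal to 1/2.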

2.5 Statistical Inference for Dependent Truncation Data

Recall the modified version of Kendall’s tau proposed by Tsai in (2.6):

$$\tau_a = 2E(\Delta_{ij} \mid A_{ij}) - 1.$$

Based on the sample $\{(X_j,Y_j)\ (j = 1,\ldots,n)\}$ subject to $X_j \le Y_j$, Tsai (1990) proposed to estimate $\tau_a$ by

$$\hat\tau_a = \frac{\sum_{i<j} \mathrm{sgn}\{(X_i - X_j)(Y_i - Y_j)\}\, I\{A_{ij}\}}{\sum_{i<j} I\{A_{ij}\}} = \frac{\sum_{i<j} I\{A_{ij}\}\cdot\{2\Delta_{ij} - 1\}}{\sum_{i<j} I\{A_{ij}\}}. \qquad (2.11)$$
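The estimator in (2.11) only requires a loop over pairs. A minimal sketch follows, with a hypothetical function name and assuming no ties among the observations:

```python
from itertools import combinations

def conditional_kendall_tau(xs, ys):
    """Estimate tau_a as in (2.11) from a truncated sample with X_j <= Y_j."""
    numerator = 0
    comparable = 0
    for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2):
        # Event A_ij: the larger X does not exceed the smaller Y.
        if max(xi, xj) <= min(yi, yj):
            product = (xi - xj) * (yi - yj)
            numerator += (product > 0) - (product < 0)   # sign of the product
            comparable += 1
    if comparable == 0:
        raise ValueError("no comparable pairs in the sample")
    return numerator / comparable

xs = [1.0, 2.0, 4.0, 3.0]
ys = [3.0, 5.0, 6.0, 4.0]
print(conditional_kendall_tau(xs, ys))   # 0.6
```

In this toy sample five of the six pairs satisfy the comparability event; four of them are concordant and one is discordant, which gives (4 − 1)/5 = 0.6. The pair (1,3) and (4,6) is excluded because 4 > 3.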

Under the semi-survival AC assumption in (2.8), Chaieb et al. (2006) proposed to estimate $\alpha$ by utilizing the concordance information provided by $\Delta_{ij}$, since its (conditional) expected value reveals the information of $\alpha$. Their idea can be viewed as an extension of the methods by Clayton and Cuzick (1985) for bivariate right censored data and by Fine et al. (2001) for semi-competing risks data. Specifically under the semi-survival AC model assumption, it follows that

$$E(\Delta_{ij} \mid (\breve X_{ij}, \tilde Y_{ij}) = (x,y), A_{ij}) = \frac{\theta_\alpha\{c\,\pi(x,y)\}}{1 + \theta_\alpha\{c\,\pi(x,y)\}},$$

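where the relationship between $\theta_\alpha(\cdot)$ and $\phi_\alpha(\cdot)$ is given in equation (2.5). Accordingly they proposed the following estimating function:

$$U_{\tilde w}(\alpha,c) = \sum_{i<j} \tilde w_{\alpha,c}(\breve X_{ij}, \tilde Y_{ij})\, I\{A_{ij}\}\left[\Delta_{ij} - \frac{\theta_\alpha\{c\,\hat\pi(\breve X_{ij}, \tilde Y_{ij})\}}{1 + \theta_\alpha\{c\,\hat\pi(\breve X_{ij}, \tilde Y_{ij})\}}\right], \qquad (2.12)$$

where $\tilde w_{\alpha,c}(x,y)$ is a weight function and $\hat\pi(x,y) = \sum_{i=1}^n I(X_i \le x, Y_i > y)/n$.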

Note that when $\tilde w_{\alpha,c}(x,y) = 1$, solving the above estimating function is equivalent to solving

$$\hat\tau_a = \frac{\sum_{i<j} I\{A_{ij}\}\,[\theta_\alpha\{c\,\hat\pi(\breve X_{ij}, \tilde Y_{ij})\} - 1]/[\theta_\alpha\{c\,\hat\pi(\breve X_{ij}, \tilde Y_{ij})\} + 1]}{\sum_{i<j} I\{A_{ij}\}}, \qquad (2.13)$$

where the right-hand side can be viewed as a model-based estimator of $\tau_a$. Notice that $U_{\tilde w}(\alpha,c)$ involves the truncation proportion parameter $c$, which is unknown. In the special case of the Clayton model with $\phi_\alpha(t) = t^{-(\alpha-1)} - 1$ $(\alpha > 1)$ and $\theta_\alpha(v) = \alpha$, $U_{\tilde w}(\alpha,c)$ depends only on $\alpha$. For other members of the AC family, however, $U_{\tilde w}(\alpha,c)$ depends on both $\alpha$ and $c$, which implies that $U_{\tilde w}(\alpha,c)$ alone is not enough for estimation of $\alpha$. Chaieb et al. (2006) therefore proposed a second estimating procedure, motivated by the paper of Rivest and Wells (2001) on marginal estimation for dependent censored data, which in turn was inspired by the paper of Zheng and Klein (1995).

Now we describe the second estimation procedure proposed by Chaieb et al. (2006). Let $t_1 < \cdots < t_{2n}$ be the ordered observed points of $(X_1,\ldots,X_n,Y_1,\ldots,Y_n)$ and $t_0 = 0$. Define $R(t,t) = \sum_j I(X_j \le t, Y_j \ge t)$. Replacing $\pi(t,t)$ by $R(t,t+)/n$ in equation (2.8), they obtained a set of estimating equations:

$$\phi_\alpha\left\{\frac{c\,R(t_i,t_i+)}{n}\right\} = \phi_\alpha\{F_X(t_i)\} + \phi_\alpha\{S_Y(t_i)\} \quad (i = 1,\ldots,2n-1). \qquad (2.14)$$

To solve the above equations, Chaieb et al. (2006) modified the algorithm of Rivest and Wells (2001), originally proposed for dependent censored data. Specifically they first estimated the jumps, $\phi_\alpha\{S_Y(t_i)\} - \phi_\alpha\{S_Y(t_i-)\}$ and $\phi_\alpha\{F_X(t_i)\} - \phi_\alpha\{F_X(t_i+)\}$, and then summed them up over all the failure times prior to $t$ to obtain the estimators for $\phi_\alpha\{F_X(t)\}$ and $\phi_\alpha\{S_Y(t)\}$. Then by plugging all the marginal estimators into the equations in (2.14), an estimating function for $c$ can be obtained. In Section 3 and Section 4, we propose different methods for estimating $(\alpha,c)$ and solving the equations in (2.14), respectively.

Chapter 3 The Proposed Approach for Semi-parametric Inference

In this chapter, we develop a new inference approach for analyzing semi-survival AC models of the form in (2.8). Specifically two types of estimating functions are needed to estimate the unknown parameters $\alpha$, $c$, $F_X(\cdot)$ and $S_Y(\cdot)$. One is for estimating the association parameter and the other is related to marginal estimation. The present method is semiparametric in the sense that we do not specify the forms of $F_X(\cdot)$ and $S_Y(\cdot)$, but specify the functional form of $\phi_\alpha(\cdot)$.

3.1 Estimation of Association

3.1.1. Conditional Likelihood Approach

In this section, we consider estimation of $\alpha$ under the semi-survival AC model in (2.8). To simplify the analysis, we assume that there are no ties and, temporarily, we ignore external censoring. The sample consists of $\{(X_j,Y_j)\ (j = 1,\ldots,n)\}$ subject to $X_j \le Y_j$. Here we generalize Clayton’s likelihood approach (Clayton, 1978) to truncation data. Define the set of grid points as follows:

$$\varphi = \left\{(x,y) \,\Big|\, x \le y,\ \sum_{j=1}^n I(X_j \le x, Y_j = y) = 1,\ \sum_{j=1}^n I(X_j = x, Y_j \ge y) = 1\right\}.$$

For a point $(x,y)$ in $\varphi$, we can define the “risk set” $\Re(x,y) = \{i; X_i \le x, Y_i \ge y\}$. Denote $R(x,y) = \sum_{j=1}^n I(X_j \le x, Y_j \ge y)$ as the number of observations in $\Re(x,y)$. Let $\Delta(x,y) = \sum_{j=1}^n I(X_j = x, Y_j = y)$, which indicates whether a failure occurs at $(x,y)$. Given $R(x,y) = r$ for $(x,y) \in \varphi$ and under model (2.8), the variable $\Delta(x,y)$ follows a Bernoulli distribution with the probability

)} about α , we can construct the following conditional likelihood function

However for other members in the AC family, estimation of α requires the information of c. It is important to note that, for most models, ∂logL(α,c)/∂c yields the same estimating

