Regression Analysis for Cure Rate under a Random
Cure Time Model
Weijing Wang
NSC midterm report: 92-2118-M-009-007
Email: wjwang@stat.nctu.edu.tw
Institute of Statistics, National Chiao-Tung University, Hsin-Chu, Taiwan, R.O.C.
May, 2004
Summary
This article considers regression analysis for cure models in the presence of competing risks. The model is formulated by a mixture representation. The main interest is in the incidence part, which measures the probability of a specific type of failure, or cure rate. Assuming a binary regression model, several inference methods for estimating the regression parameters are proposed to handle the missing cure status due to censoring. The latency distributions, although of less interest, play an essential role in utilizing the partial but biased information provided by censored data. Alternatively, a distribution-free procedure is also developed, provided that the quality of the data is good enough in terms of sufficient follow-up.
Key words: Cause-specific hazard; Competing risks; Cure models; EM algorithm; Illness-death models; Imputation; Logistic regression; Missing data; Mixture model; Sufficient follow-up; Susceptibility.
1 Introduction
Cure models allow for the possibility of not developing the event of interest despite long-term follow-up. Several versions of cure models have appeared in the literature, which differ in how "cure" is defined. Under the classical setting, cure is not clearly specified in the sense that cured (or immune) individuals are not directly observed but always mixed with temporarily censored but susceptible ones. The book by Maller and Zhou (1996) is an excellent reference on the subject. Another type of model assumes that a patient is cured if he/she does not develop the event of interest within a pre-specified time period. One can refer to Laska and Meisner (1992) for further references. For the model considered in this article, cure is determined by the order of competing events. Throughout the paper, we will use the severe acute respiratory syndrome (SARS) as an illustrating example of such models. SARS is a life-threatening acute disease that resulted in a global outbreak in 2003. The phenomenon can be described by a two-path model, sometimes known as an illness-death model, depicted in Figure 1, where subjects may follow two different paths, 1 → 2 → 3 or 1 → 3. For the SARS example, the intermediate state refers to hospital discharge with recovery and the absorbing state is death. Betensky and Schoenfeld (2001) call this third type cure models with random cure times.
Cure models are often expressed by a mixture formulation that contains two components: one component is related to long-term incidence and the other is related to the latency distribution given the cure status. Covariates may have different influences on the two parts, and the mixture formulation provides a flexible way to study the effects separately. For the classical type of model, Farewell (1982) assumed a logistic/Weibull model. Several authors, including Kuk and Chen (1992), Sy and Taylor (2000) and Peng and Dear (2000), have considered logistic/Cox regression models. Their methods differ in how the baseline hazard function is handled in the estimation. Taylor (1995) proposed a logistic/Kaplan-Meier approach to analyze the incidence model by leaving the latency distribution unspecified. However, his method imposes a rigid assumption that the latency distribution does not depend on the covariates. For the cure model with fixed cure time, the focus is on the incidence part since the latency time is a fixed constant. Jung (1996) and Subramanian (2001) proposed inference methods to estimate parameters in a binary regression model under censoring.
In this paper, we consider regression analysis for the third type of cure models, in which cure is determined by the order of competing risks. The primary goal is to assess the effects of covariates on the cure rate. Using the SARS example, let $T_1$ be the time to hospital discharge and $T_2$ be the time to death. Define $\delta = I(T_1 \le T_2)$ as the indicator for failure type, which is also the indicator for the cure event. Given the value of $\delta$, two latency distributions can be defined, namely $Q_1(t) = \Pr(T_1 > t \mid T_1 \le T_2)$ and $Q_2(t) = \Pr(T_2 > t \mid T_1 > T_2)$. Notice that $Q_j(t)$ ($j = 1, 2$) also represent the survival functions of the sojourn times in the two-path model and the cause-specific survival functions in the context of competing risks. We assume that
$$\Pr(T_1 \le T_2 \mid Z) = \Pr(\delta = 1 \mid Z) = \pi(\beta' Z),$$
where $Z : p \times 1$ denotes a vector of covariates and $\pi^{-1}(\cdot)$ is the link function, which is monotonic and differentiable. A common example is the logistic regression model
$$\pi(\beta' Z) = \frac{\exp(\beta' Z)}{1 + \exp(\beta' Z)}.$$
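As a quick numerical illustration, the logistic link and its derivative $\dot\pi(t) = \pi(t)\{1 - \pi(t)\}$, which appears later in the score equation, can be coded directly (a minimal sketch; the function names are ours):

```python
import math

def pi_link(t):
    """Logistic link pi(t) = exp(t) / (1 + exp(t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def pi_dot(t):
    """Derivative of the logistic link: pi(t) * (1 - pi(t))."""
    p = pi_link(t)
    return p * (1.0 - p)
```

For the logistic link, the identity $\dot\pi(t) = \pi(t)\bar\pi(t)$ simplifies several of the estimating functions considered later.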
In general, covariates also affect the latency distributions, and thus we should write $Q_{1Z}(t) = \Pr(T_1 > t \mid T_1 \le T_2, Z)$ and $Q_{2Z}(t) = \Pr(T_2 > t \mid T_1 > T_2, Z)$. In the SARS example, it seems more important to investigate what factors affect whether a patient can be cured than to study how long it takes to recover. Therefore we prefer not to make rigid assumptions on the forms of $Q_{jZ}(t)$ ($j = 1, 2$).
The major objective is to develop inference methods for estimating $\beta$ when the cure status, $\delta$, may be unknown due to censoring. In Section 2, we first review the inference procedure when complete information on $\delta$ is available, and then discuss likelihood inference given censored data. In Section 3, several inference methods for estimating $\beta$ are proposed. Section 4 contains concluding remarks.
2 Preliminary Analysis
Let $\{(T_{1i}, T_{2i}, \delta_i)\ (i = 1, \ldots, n)\}$ be independent and identically distributed replications of $(T_1, T_2, \delta)$. When the observation period is long enough such that $\delta_i$ ($i = 1, \ldots, n$) are completely observed, standard techniques for generalized linear models can be applied to estimate $\beta$. Based on complete data, $\{(\delta_i, Z_i)\ (i = 1, \ldots, n)\}$, where $\delta_i = I(T_{1i} \le T_{2i})$, the likelihood function becomes
$$L(\beta) = \prod_{i=1}^n \pi(\beta' Z_i)^{\delta_i} \{1 - \pi(\beta' Z_i)\}^{1 - \delta_i}, \qquad (1)$$
which gives the score equation
$$\tilde U(\beta) = \sum_{i=1}^n \{\delta_i - \pi(\beta' Z_i)\}\, \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (2)$$
where $\dot\pi(t) = \partial \pi(t)/\partial t$ and $\bar\pi(t) = 1 - \pi(t)$.
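With complete data and the logistic link, $\dot\pi = \pi\bar\pi$, so the weight in (2) cancels and solving $\tilde U(\beta) = 0$ is ordinary logistic regression. A minimal Newton-Raphson sketch (function names are ours):

```python
import numpy as np

def score(beta, delta, Z):
    """Score function (2) under the logistic link: sum_i (delta_i - pi_i) Z_i."""
    p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
    return Z.T @ (delta - p)

def fit_complete(delta, Z, n_iter=25):
    """Newton-Raphson solution of the score equation (logistic regression MLE)."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        info = Z.T @ (Z * (p * (1.0 - p))[:, None])   # observed information
        beta = beta + np.linalg.solve(info, Z.T @ (delta - p))
    return beta
```

Any standard GLM routine gives the same answer; the explicit iteration is shown because the same loop reappears with modified responses in Section 3.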
In practice, subjects may drop out of the study or, at the end of the follow-up, some patients may still not have developed the event of interest. Let $C$ be the censoring variable and assume that it is independent of both $T_1$ and $T_2$. Under competing risks, one observes $X_i = T_{1i} \wedge T_{2i} \wedge C_i$, $\eta_{1i} = I(T_{1i} \le T_{2i} \wedge C_i)$ and $\eta_{2i} = I(T_{2i} \le T_{1i} \wedge C_i)$ for $i = 1, \ldots, n$. Letting $\eta_{3i} = I(C_i \le T_{1i} \wedge T_{2i})$, we have $\eta_{1i} + \eta_{2i} + \eta_{3i} = 1$. Note that $\eta_{1i} = 1$ implies $\delta_i = 1$, while $\eta_{2i} = 1$ implies $\delta_i = 0$. However, the value of $\delta_i$ is unknown if $\eta_{3i} = 1$.
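The observable quantities $(X_i, \eta_{1i}, \eta_{2i}, \eta_{3i})$ are easy to generate in a small simulation (a sketch with hypothetical exponential rates, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
T1 = rng.exponential(scale=1.0, size=n)   # time to discharge (cure event)
T2 = rng.exponential(scale=2.0, size=n)   # time to death
C  = rng.exponential(scale=3.0, size=n)   # independent censoring time

X    = np.minimum(np.minimum(T1, T2), C)          # X = T1 ^ T2 ^ C
eta1 = (T1 <= np.minimum(T2, C)).astype(int)      # cure event observed first
eta2 = (T2 <= np.minimum(T1, C)).astype(int)      # death observed first
eta3 = 1 - eta1 - eta2                            # censored: delta unknown
```

With continuous variables, ties occur with probability zero, so exactly one of $\eta_{1i}, \eta_{2i}, \eta_{3i}$ equals one for each subject.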
Without loss of generality, assume that $T_1$, $T_2$ and $C$ are all continuous variables. Based on censored data, the likelihood function becomes
$$L_C \propto \prod_{i=1}^n \left\{ [\pi(\beta' Z_i)\, f_{1Z}(x_i)]^{\eta_{1i}}\, [\bar\pi(\beta' Z_i)\, f_{2Z}(x_i)]^{\eta_{2i}}\, [S_Z(x_i)]^{\eta_{3i}} \right\}, \qquad (3)$$
where $f_{jZ}(x) = -\frac{\partial}{\partial x} Q_{jZ}(x)$ for $j = 1, 2$ and
$$S_Z(x_i) = \pi(\beta' Z_i)\, Q_{1Z}(x_i) + \bar\pi(\beta' Z_i)\, Q_{2Z}(x_i).$$
Notice that when censoring exists, the likelihood function of $\beta$ contains nuisance parameters related to $Q_{jZ}(t)$ ($j = 1, 2$), which implies that additional assumptions on the latency distributions are required if likelihood-based inference is pursued. Parametric regression models may be assumed for $Q_{jZ}(t)$ ($j = 1, 2$). Larson and Dinse (1985) considered the same framework as discussed here and assumed a logistic/piecewise-exponential model. In the next section, we review maximum likelihood estimation and then propose other inference methods under more flexible assumptions.
3 The Proposed Methods
3.1 EM algorithm for maximum likelihood estimation
The previous analysis implies that, when censoring is present, likelihood estimation of $\beta$ requires additional knowledge of the latency distributions. Assume temporarily that a parametric form is imposed on $Q_{jZ}(t)$ ($j = 1, 2$). Since direct maximization of $L_C$ in (3) is difficult, the EM algorithm can be used to obtain the maximum likelihood estimator. The likelihood based on the pseudo-data, $\{(\delta_i, \eta_{1i}, \eta_{2i}, X_i, Z_i)\ (i = 1, \ldots, n)\}$, is
$$L_f \propto \prod_{i=1}^n \left\{ \pi(\beta' Z_i)\, [h_{1Z_i}(x_i)]^{\bar\eta_{3i}}\, Q_{1Z_i}(x_i) \right\}^{\delta_i} \left\{ \bar\pi(\beta' Z_i)\, [h_{2Z_i}(x_i)]^{\bar\eta_{3i}}\, Q_{2Z_i}(x_i) \right\}^{1 - \delta_i},$$
where $\bar\eta_{3i} = 1 - \eta_{3i} = \eta_{1i} + \eta_{2i}$ and $h_{jZ}(x) = \left[-\frac{\partial}{\partial x} Q_{jZ}(x)\right] \big/ Q_{jZ}(x)$ for $j = 1, 2$. The E-step takes the expectation of $\log L_f$ with respect to the distribution of the unobserved $\delta_i$'s, given the observed data and the current parameter values. The expected log-likelihood, denoted as $l_f(\beta, Q_1, Q_2, w^{(m)})$, becomes
$$\sum_{i=1}^n \left\{ \sum_{j=1}^2 \eta_{ji} \ln[h_{jZ_i}(x_i)] + w_i^{(m)} \ln[\pi(\beta' Z_i)\, Q_{1Z_i}(x_i)] + (1 - w_i^{(m)}) \ln[\bar\pi(\beta' Z_i)\, Q_{2Z_i}(x_i)] \right\},$$
where
$$w_i^{(m)} = \eta_{1i} + I(\eta_{1i} = \eta_{2i} = 0)\, \frac{\pi(\beta^{(m)\prime} Z_i)\, Q^{(m)}_{1Z_i}(x_i)}{\pi(\beta^{(m)\prime} Z_i)\, Q^{(m)}_{1Z_i}(x_i) + \bar\pi(\beta^{(m)\prime} Z_i)\, Q^{(m)}_{2Z_i}(x_i)}$$
and $\beta^{(m)}$, $Q^{(m)}_{jZ_i}$ ($j = 1, 2$) denote the current parameter values at the $m$th iteration. Note that the last term in $w_i^{(m)}$ is the conditional probability that the $i$th patient will eventually be cured, given that the cure event has not occurred by time $x_i$.
Further, we can write
$$l_f(\beta, Q_1, Q_2, w^{(m)}) = L_{Q_1} + L_{Q_2} + L_\beta,$$
where
$$L_{Q_1} = \sum_{i=1}^n \left\{ \eta_{1i} \ln[h_{1Z_i}(x_i)] + w_i^{(m)} \ln[Q_{1Z_i}(x_i)] \right\},$$
$$L_{Q_2} = \sum_{i=1}^n \left\{ \eta_{2i} \ln[h_{2Z_i}(x_i)] + (1 - w_i^{(m)}) \ln[Q_{2Z_i}(x_i)] \right\}$$
and
$$L_\beta = \sum_{i=1}^n \left\{ w_i^{(m)} \ln[\pi(\beta' Z_i)] + (1 - w_i^{(m)}) \ln[\bar\pi(\beta' Z_i)] \right\}.$$
It is important to note that, in the M-step of the algorithm, $L_{Q_1}$, $L_{Q_2}$ and $L_\beta$ can be maximized separately by treating $w_i^{(m)}$ as a fixed constant. The EM procedure is iterative in that the estimates obtained previously are used to update the value of $w_i^{(m)}$ in the current maximization step. Maximization of $L_\beta$ is straightforward, while maximization of $L_{Q_j}$ depends on the form of $Q_{jZ}(x)$ ($j = 1, 2$). For parametric analysis, convergence and properties of the resulting estimates follow from standard results.
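The E-step reduces to computing the weights $w_i^{(m)}$. A sketch under the logistic link, where `Q1x` and `Q2x` hold the current latency survivor values $Q^{(m)}_{jZ_i}(x_i)$ (array and function names are ours):

```python
import numpy as np

def em_weights(beta, Q1x, Q2x, eta1, eta2, Z):
    """E-step weight w_i: eta1_i plus, for censored cases (eta1 = eta2 = 0),
    the conditional probability that subject i will eventually be cured."""
    p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
    censored = (eta1 == 0) & (eta2 == 0)
    num = p * Q1x
    return eta1 + censored * num / (num + (1.0 - p) * Q2x)
```

Each M-step then maximizes $L_\beta$, $L_{Q_1}$ and $L_{Q_2}$ separately with these weights held fixed, and the updated estimates feed back into the next E-step.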
3.2 Imputation by Conditional Mean
One alternative way to adjust for censoring is to directly modify the score equation in (2). Based on the available data, we can impute the missing $\delta_i$, if $\eta_{3i} = 1$, by its conditional mean, which equals
$$\tilde\delta_i = E[\delta_i \mid T_{1i} \wedge T_{2i} > C_i, C_i = x_i, Z_i] = \frac{Q_{1Z_i}(x_i)\, \pi(\beta' Z_i)}{Q_{1Z_i}(x_i)\, \pi(\beta' Z_i) + Q_{2Z_i}(x_i)\, \bar\pi(\beta' Z_i)}.$$
Here we propose two estimating functions of $\beta$ which avoid going through the maximization of $L_{Q_j}$ ($j = 1, 2$). Both methods modify the results in Wang (2003) by assuming that the covariates take a finite number of values. By partitioning the sample according to the value of $Z$, we can use Wang's method to obtain the nonparametric estimators $\hat Q_{j z_k}(t)$ ($j = 1, 2$) and $\hat p_{z_k}(x)$ for $k = 1, \ldots, J$, where $\hat p_Z(x)$ is an estimator of
$$p_z(x) = E(\delta \mid \eta_3 = 1, X = x, Z = z).$$
An estimating function by model-based imputation is given by
$$U_1(\beta) = \sum_{i=1}^n \left\{ \hat\delta_i - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (4)$$
where $\hat\delta_i = 1$ if $\eta_{1i} = 1$, $\hat\delta_i = 0$ if $\eta_{2i} = 1$, and if $\eta_{3i} = 1$ we set
$$\hat\delta_i = \frac{\hat Q_{1Z_i}(x_i)\, \pi(\beta' Z_i)}{\hat Q_{1Z_i}(x_i)\, \pi(\beta' Z_i) + \hat Q_{2Z_i}(x_i)\, \bar\pi(\beta' Z_i)}.$$
An estimating function by model-free imputation is given by
$$U_2(\beta) = \sum_{i=1}^n \left\{ \tilde\delta_i - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (5)$$
where $\tilde\delta_i = 1$ if $\eta_{1i} = 1$, $\tilde\delta_i = 0$ if $\eta_{2i} = 1$, and if $\eta_{3i} = 1$ we set $\tilde\delta_i = \hat p_{z_i}(x_i)$. Denote $\hat\beta_j$ as the solution of $U_j(\beta) = 0$ for $j = 1, 2$.
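Since $\tilde\delta_i$ in $U_2$ does not involve $\beta$, solving $U_2(\beta) = 0$ under the logistic link is just a logistic score equation with fractional responses, and the same Newton iteration as in the complete-data case applies (a sketch; names are ours):

```python
import numpy as np

def solve_u2(delta_tilde, Z, n_iter=25):
    """Root of U2(beta) = 0 for the logistic link: Newton-Raphson with the
    imputed responses delta_tilde in place of the unobserved delta_i."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        info = Z.T @ (Z * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(info, Z.T @ (delta_tilde - p))
    return beta
```

$U_1$ can be handled the same way once $\hat\delta_i$ is evaluated at a preliminary estimate of $\beta$, as discussed in Section 3.5.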
3.3 Inverse Probability Weighting
When censoring is present, there is a higher chance of first observing an event associated with a smaller failure time. To see this, one can express $\eta_1 = I(T_1 \le T_2 \wedge C)$, which implies that $\eta_1$ is a biased proxy of $\delta$ in the presence of censoring. The larger the value of the failure time $T_1$, the higher the chance that it will be censored. When covariates exist, it can be shown that
$$E[I(\eta_1 = 1) \mid T_1, T_2, Z] = E[I(T_1 \le T_2 \wedge C) \mid T_1, T_2, Z] = I(T_1 \le T_2)\, G_Z(T_1),$$
where $G_Z(t) = \Pr(C > t \mid Z)$. It follows that
$$E(\delta \mid Z) = E\!\left( \frac{\eta_1}{G_Z(X)} \,\Big|\, Z \right) = E\!\left[ E\!\left( \frac{I(T_1 \le T_2 \wedge C)}{G_Z(T_1)} \,\Big|\, T_1, T_2, Z \right) \Big|\, Z \right].$$
To simplify the analysis, we let $G_Z(t) = \Pr(C > t)$, which is estimated by the Kaplan-Meier estimator
$$\hat G(t) = \prod_{u \le t} \left\{ 1 - \frac{\sum_{k=1}^n I(X_k = u,\, \eta_{3k} = 1)}{\sum_{k=1}^n I(X_k \ge u)} \right\}.$$
Replacing $\delta_i$ by $\eta_{1i}/\hat G(X_i)$, one obtains the following estimating function:
$$U_3(\beta) = \sum_{i=1}^n \left\{ \frac{I(\eta_{1i} = 1)}{\hat G(X_i)} - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i.$$
The estimator of $\beta$ can be obtained as a solution of $U_3(\beta) = 0$, denoted as $\hat\beta_3$. The technique of inverse probability weighting has been widely used in the recent survival analysis literature, such as the papers by Jung (1996) and Lin, Sun and Ying (1999).
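A sketch of $\hat G$ and the IPW pseudo-responses $\eta_{1i}/\hat G(X_i)$ used in $U_3$, assuming no ties among the $X_i$ (which holds almost surely for continuous data; names are ours):

```python
import numpy as np

def km_censoring(X, eta3):
    """Kaplan-Meier estimator G_hat(t) of Pr(C > t); the censoring indicators
    eta3 play the role of 'events' here."""
    order = np.argsort(X)
    Xs, d = X[order], eta3[order]
    at_risk = np.arange(len(Xs), 0, -1)      # risk-set sizes n, n-1, ..., 1
    surv = np.cumprod(1.0 - d / at_risk)

    def G(t):
        idx = np.searchsorted(Xs, t, side="right") - 1
        return 1.0 if idx < 0 else surv[idx]
    return G

def ipw_responses(X, eta1, eta3):
    """Pseudo-responses eta1_i / G_hat(X_i) replacing delta_i in U3."""
    G = km_censoring(X, eta3)
    return np.array([e1 / max(G(x), 1e-12) for x, e1 in zip(X, eta1)])
```

Once the pseudo-responses are formed, $U_3(\beta) = 0$ is solved by the same Newton iteration used for the complete-data score.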
The proposed bias adjustment by inverse weighting implicitly assumes that there is no information about $\delta$ beyond the observational period. Specifically, define $\tau_t = \sup\{t : \Pr(T_1 \wedge T_2 > t) > 0\}$ and $\tau_c = \sup\{t : \Pr(C > t) > 0\}$. If $\tau_t > \tau_c$, which implies that $\Pr(T_1 \wedge T_2 > \tau_c) > 0$, we have
$$E\!\left( \frac{\eta_1}{G(X)} \,\Big|\, T_1, T_2 \right) = \begin{cases} \delta & \text{if } T_1 \wedge T_2 \le \tau_c, \\ 0 & \text{if } T_1 \wedge T_2 > \tau_c, \end{cases} \;=\; \delta\, I(T_1 \wedge T_2 \le \tau_c) \ne \delta.$$
The requirement $\tau_t \le \tau_c$ is a condition of sufficient follow-up. We can show that, when $\tau_t \le \tau_c$ and under some regularity conditions, $\hat\beta_3$ is consistent and $\sqrt{n}(\hat\beta_3 - \beta_0)$ converges to a mean-zero normal random variable, where $\beta_0$ is the true value of $\beta$. The asymptotic variance of $\hat\beta_3$ can also be derived.
Alternatively, we can use $(\eta_1, \eta_2)$ jointly to construct an estimating equation. By the same idea, we have the estimating function
$$U_4(\beta) = \sum_{i=1}^n \left\{ \frac{\eta_{1i} - \eta_{2i}}{\hat G(X_i)} - \left(\pi(\beta' Z_i) - \bar\pi(\beta' Z_i)\right) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i.$$
The finite-sample comparison of $U_3(\beta)$ and $U_4(\beta)$ could be explored in a simulation study.
3.4 Imputation by Unconditional Mean
In the context of competing risks, $\Pr(\delta = 1)$ can be measured from the incidence function. Specifically, let $T_1^* = T_1$ if $T_1 \le T_2$ and $T_1^* = \infty$ if $T_1 > T_2$. It follows that
$$S^*(t) = \Pr(T_1^* > t) = \Pr(T_1 > t,\, T_1 \le T_2) + \Pr(T_1 > T_2).$$
Hence
$$\lim_{t \to \infty} \Pr(T_1^* > t) = \Pr(T_1 > T_2) = \Pr(\delta = 0).$$
When a nonparametric estimator of $S^*(t)$ is available, denoted as $\hat S^*_{np}(t)$, we may consider the estimating equation
$$U_5(\beta) = \sum_{i=1}^n \left\{ \left(1 - \hat S^*_{np}(x_{\max})\right) - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (6)$$
where $x_{\max}$ is the maximum value of $X_i$ ($i = 1, \ldots, n$) with $\eta_{1i} = 1$. There exist several nonparametric estimators of $S^*(t)$.
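One concrete choice is an Aalen-Johansen type estimator of the cumulative incidence $1 - S^*(t) = \Pr(T_1 \le t,\, T_1 \le T_2)$ of the cure event, evaluated at $x_{\max}$. A sketch assuming no ties (the function name is ours):

```python
import numpy as np

def one_minus_S_star(X, eta1, eta2):
    """Estimate 1 - S*_np(x_max): cumulative incidence of the cure event
    (cause 1) at the largest observed cause-1 event time."""
    order = np.argsort(X)
    Xs, d1, d2 = X[order], eta1[order], eta2[order]
    at_risk = np.arange(len(Xs), 0, -1)
    # Kaplan-Meier of T1 ^ T2; km_prev is its value just before each time
    km = np.cumprod(1.0 - (d1 + d2) / at_risk)
    km_prev = np.concatenate(([1.0], km[:-1]))
    increments = km_prev * d1 / at_risk      # jumps of the incidence curve
    x_max = Xs[d1 == 1].max()
    return increments[Xs <= x_max].sum()
```

Plugging this constant into (6) calibrates $\pi(\beta' Z)$ to the estimated overall incidence of the cure event.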
3.5 Comparison of the Methods
The formulation of mixture models suggests that the incidence probability and the latency distributions can be modeled separately. Due to the identifiability problem discussed in Li et al. (2001), classical cure models rely on joint estimation of the two components even if only one part is of major interest. For the cure model considered here, the event of cure is explicitly defined and hence identifiability is not an issue. Consequently, the requirement on the length of the follow-up period is less strict. In Section 2 we have seen that, as long as $C_i > T_{1i} \wedge T_{2i}$ for all $i = 1, \ldots, n$, it is natural to estimate the two components separately. The focus here is on the incidence part. When there exist some observations with $C_i < T_{1i} \wedge T_{2i}$, additional information is needed besides the binary regression model itself.
In the likelihood-based analysis, model specification of $Q_{jZ}(t)$ ($j = 1, 2$) is required. Specifically, for a censored observation with $\eta_{1i} = \eta_{2i} = 0$, knowledge of the latency distributions is used in $w_i$ or $\tilde\delta_i$ for weight assignment, since the observed sojourn time as well as the covariate still reveal useful information. Ignoring such information would lead to biased results. However, making additional assumptions increases the possibility of making mistakes. Parametric modeling of the latency distributions is the usual standard approach, and the flexibility of the imposed models is an important concern. In this article, we discuss nonparametric analysis for the latency distributions. It is important to note that, although no distributional assumption is made, nonparametric analysis requires that the subjects be independently and identically distributed, or homogeneous in some sense. That is why we need to assume that $Q_{jZ}(t)$ ($j = 1, 2$) do not depend on $Z$, or that $Z$ is discrete so that the sample can be partitioned into homogeneous subgroups. The likelihood-based estimator of $\beta$ using the logistic/Kaplan-Meier approach of Taylor (1995) is computationally intensive, since it involves back-and-forth iterations between $\beta$ and the high-dimensional parameters in $Q_{jZ}(t)$ ($j = 1, 2$). The proposed estimators $\hat\beta_1$ and $\hat\beta_2$, which use the plug-in approach to handle the nuisance parameters separately, are easier to implement.
It is possible that finding the root of $U_1(\beta) = 0$ is difficult, since it is a complicated function of $\beta$. One alternative is to modify $\hat\delta_i$ as
$$\hat\delta_i = \frac{\hat Q_{1Z_i}(x_i)\, \pi(\hat\beta_{(k-1)}' Z_i)}{\hat Q_{1Z_i}(x_i)\, \pi(\hat\beta_{(k-1)}' Z_i) + \hat Q_{2Z_i}(x_i)\, \bar\pi(\hat\beta_{(k-1)}' Z_i)},$$
where $\hat\beta_{(k-1)}$ is the estimate of $\beta$ from the previous step for $k \ge 1$. A reasonable choice of $\hat\beta_{(0)}$ is $\hat\beta_2$.
Recall that $\tau_t = \sup\{t : \Pr(T_1 \wedge T_2 > t) > 0\}$ and $\tau_c = \sup\{t : \Pr(C > t) > 0\}$. The estimators $\hat\beta_j$ ($j = 1, 2, 3$) all rely on the assumption of sufficient follow-up, that is, $\tau_t \le \tau_c$. The validity of $\hat\beta_3$ depends strongly on this condition, since
$$E\!\left( \frac{\eta_1}{G_Z(X)} \,\Big|\, T_1, T_2, Z \right) = \delta\, I(T_1 \wedge T_2 \le \tau_c).$$
In general,
$$E\!\left( \frac{\eta_1}{G_Z(X)} \,\Big|\, Z \right) = E\{\delta\, I(T_1 \wedge T_2 \le \tau_c) \mid Z\} \ne E(\delta \mid Z).$$
Since Wang's nonparametric estimators rely on the assumption of sufficient follow-up, $\hat\beta_1$ and $\hat\beta_2$ are also affected via the imputed values. However, Wang (2003) proposed a modification for when this condition is violated, which can also be used here to correct the problem to some degree.
4 Concluding Remarks
The literature on illness-death models is abundant. There has been increasing research interest in analyzing the problem in the framework of competing risks or mixture models. One advantage of the mixture formulation is that it allows separate modeling of the incidence and latency distributions. From our analysis, the purpose of joint estimation is to correctly utilize the partial information provided by censored data. One can avoid making model assumptions on the latency part if follow-up is sufficient. We see that there is a trade-off between making strong assumptions on the quality of the data or on the model. The techniques used to handle the effect of censoring can be viewed as applications of the principles for handling missing data reviewed in the book by Little and Rubin (2002).
References
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993) Statistical Models Based on Counting Processes. New York: Springer-Verlag.
Betensky, R. A. and Schoenfeld, D. A. (2001) Nonparametric estimation in a cure model with random cure times. Biometrics, 57, 282-286.
Crowley, J. and Hu, M. (1977) Covariance analysis of heart transplant survival data. J. Am. Statist. Ass., 72, 27-36.
Farewell, V. T. (1982) The use of mixture models for the analysis of survival data with long-term survivors. Biometrics, 38, 1041-1046.
Fine, J. P., Jiang, H. and Chappell, R. (2001) On semi-competing risks data. Biometrika, 88, 907-919.
Hougaard, P. (2000) Analysis of Multivariate Survival Data. New York: Springer-Verlag.
Klein, J. P. and Moeschberger, M. L. (1997) Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer-Verlag.
Kuk, A. Y. C. and Chen, C. (1992) A mixture model combining logistic regression with proportional hazards regression. Biometrika, 79, 531-541.
Laska, E. M. and Meisner, M. J. (1992) Nonparametric estimation and testing in a cure model. Biometrics, 48, 1223-1234.
Larson, M. G. and Dinse, G. E. (1985) A mixture model for the regression analysis of competing risks data. Applied Statistics, 34, 201-211.
Li, C.-S., Taylor, J. M. G. and Sy, J. P. (2001) Identifiability of cure models. Statistics & Probability Letters, 54, 389-395.
Lin, D. Y., Sun, W. and Ying, Z. (1999) Nonparametric estimation of the gap time distributions for serial events with censored data. Biometrika, 86, 59-70.
Maller, R. A. and Zhou, S. (1992) Estimating the proportion of immunes in a censored sample. Biometrika, 79, 731-739.
Maller, R. A. and Zhou, S. (1994) Testing for sufficient follow-up and outliers in survival data. J. Am. Statist. Assoc., 89, 1499-1506.
Maller, R. A. and Zhou, S. (1996) Survival Analysis with Long-Term Survivors. New York: Wiley.
Peng, Y. and Dear, K. B. G. (2000) A nonparametric mixture model for cure rate estimation. Biometrics, 56, 237-243.
Prentice, R. L., Kalbfleisch, J. D., Peterson, A. V., Flournoy, N., Farewell, V. T. and Breslow, N. E. (1978) The analysis of failure times in the presence of competing risks. Biometrics, 34, 541-554.
Taylor, J. M. G. (1995) Semi-parametric estimation in failure time mixture models. Biometrics, 51, 899-907.
Wang, W. (2003) Inference on the association parameters for copula models under dependent censoring. J. R. Statist. Soc. B, 65, 257-273.
Zhao, L. P. and LeMarchand, L. (1992) An analytical method for assessing patterns of familial aggregation in case-control studies. Genetic Epidemiology,