Regression Analysis for Cure Rate under a Random
Cure Time Model
Weijing Wang
NSC midterm report: 92-2118-M-009-007
Email: wjwang@stat.nctu.edu.tw
Institute of Statistics, National Chiao-Tung University, Hsin-Chu, Taiwan, R.O.C.
May, 2004
Summary
This article considers regression analysis for cure models in the presence of competing risks. The model is formulated by a mixture representation. The main interest is in the incidence part, which measures the probability of a specific type of failure, or cure rate. Assuming a binary regression model, several inference methods for estimating the regression parameters are proposed to handle the missing cure status due to censoring. The latency distributions, although of less interest, play an essential role in utilizing the partial but biased information provided by censored data. Alternatively, a distribution-free procedure is also developed, provided that the quality of the data is good enough in terms of sufficient follow-up.
Key words: Cause-specific hazard; Competing risks; Cure models; EM algorithm; Illness-death models; Imputation; Logistic regression; Missing data; Mixture model; Sufficient follow-up; Susceptibility.
1 Introduction
Cure models allow for the possibility of not developing the event of interest despite long-term follow-up. Several versions of cure models have appeared in the literature, which differ in how "cure" is defined. Under the classical setting, cure is not clearly specified in the sense that cured (or immune) individuals are not directly observed but always mixed with temporarily censored but susceptible ones. The book by Maller and Zhou (1996) is an excellent reference on the subject. Another type of model assumes that a patient is cured if he/she does not develop the event of interest within a pre-specified time period. One can refer to Laska and Meisner (1992) for further references. For the model considered in this article, cure is determined by the order of competing events. Throughout the paper, we will use the severe acute respiratory syndrome (SARS) as an illustrating example of such models. SARS is a life-threatening acute disease that resulted in a global outbreak in 2003. The phenomenon can be described by a two-path model, sometimes known as an illness-death model, depicted in Figure 1, where subjects may follow two different paths, 1 → 2 → 3 or 1 → 3. For the SARS example, the intermediate state refers to hospital discharge with recovery and the absorbing state is death. Betensky and Schoenfeld (2001) call this third type cure models with random cure times.
Cure models are often expressed by a mixture formulation that contains two components: one component is related to long-term incidence and the other is related to the latency distribution given the cure status. Covariates may have different influences on the two parts, and the mixture formulation provides a flexible way to study the effects separately. For the classical type of model, Farewell (1982) assumed a logistic/Weibull model. Several authors, including Kuk and Chen (1992), Sy and Taylor (2000) and Peng and Dear (2000), have considered logistic/Cox regression models. Their methods differ in how the baseline hazard function is handled in the estimation. Taylor (1995) proposed a logistic/Kaplan-Meier approach to analyze the incidence model by leaving the latency distribution unspecified. However, his method imposes a rigid assumption that the latency distribution does not depend on the covariates. For the cure model with fixed cure time, the focus is on the incidence part since the latency time is a fixed constant. Jung (1996) and Subramanian (2001) proposed inference methods to estimate parameters in a binary regression model under censoring.
In this paper, we consider regression analysis for the third type of cure models, in which cure is determined by the order of competing risks. The primary goal is to assess the effects of covariates on the cure rate. Using the SARS example, let $T_1$ be the time to hospital discharge and $T_2$ be the time to death. Define $\delta = I(T_1 \le T_2)$ as the indicator for failure type, which is also the indicator for the cure event. Given the value of $\delta$, two latency distributions can be defined, namely $Q_1(t) = \Pr(T_1 > t \mid T_1 \le T_2)$ and $Q_2(t) = \Pr(T_2 > t \mid T_1 > T_2)$. Notice that $Q_j(t)$ ($j = 1, 2$) also represent the survival functions of the sojourn times in the two-path model and the cause-specific survival functions in the context of competing risks. We assume that
$$\Pr(T_1 \le T_2 \mid Z) = \Pr(\delta = 1 \mid Z) = \pi(\beta' Z),$$
where $Z : p \times 1$ denotes a vector of covariates and $\pi^{-1}(\cdot)$ is the link function, which is monotonic and differentiable. A common example is the logistic regression model
$$\pi(\beta' Z) = \frac{\exp(\beta' Z)}{1 + \exp(\beta' Z)}.$$
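As a quick numerical illustration, the logistic link and its derivative $\dot\pi(t) = \pi(t)\{1 - \pi(t)\}$, which appears later in the score equation, can be coded directly (a minimal sketch; the function names are ours):

```python
import math

def pi_link(t):
    """Logistic link pi(t) = exp(t) / (1 + exp(t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def pi_dot(t):
    """Derivative of the logistic link: pi(t) * (1 - pi(t))."""
    p = pi_link(t)
    return p * (1.0 - p)
```

For the logistic link, the identity $\dot\pi(t) = \pi(t)\bar\pi(t)$ simplifies several of the estimating functions considered later.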
In general, covariates also affect the latency distributions, and thus we should write $Q_{1Z}(t) = \Pr(T_1 > t \mid T_1 \le T_2, Z)$ and $Q_{2Z}(t) = \Pr(T_2 > t \mid T_1 > T_2, Z)$. In the SARS example, it seems more important to investigate what factors affect whether a patient can be cured than to study how long it takes to recover. Therefore we prefer not to make rigid assumptions on the forms of $Q_{jZ}(t)$ ($j = 1, 2$).
The major objective is to develop inference methods for estimating $\beta$ when the cure status, $\delta$, may be unknown due to censoring. In Section 2, we first review the inference procedure when complete information on $\delta$ is available, and then discuss likelihood inference given censored data. In Section 3, several inference methods for estimating $\beta$ are proposed. Section 4 contains concluding remarks.
2 Preliminary Analysis
Let $\{(T_{1i}, T_{2i}, \delta_i)\ (i = 1, \ldots, n)\}$ be independent and identically distributed replications of $(T_1, T_2, \delta)$. When the observation period is long enough such that $\delta_i$ ($i = 1, \ldots, n$) are completely observed, standard techniques for generalized linear models can be applied to estimate $\beta$. Based on complete data, $\{(\delta_i, Z_i)\ (i = 1, \ldots, n)\}$, where $\delta_i = I(T_{1i} \le T_{2i})$, the likelihood function becomes
$$L(\beta) = \prod_{i=1}^n \pi(\beta' Z_i)^{\delta_i} \{1 - \pi(\beta' Z_i)\}^{1 - \delta_i}, \qquad (1)$$
which gives the score equation
$$\tilde U(\beta) = \sum_{i=1}^n \{\delta_i - \pi(\beta' Z_i)\}\, \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (2)$$
where $\dot\pi(t) = \partial \pi(t)/\partial t$ and $\bar\pi(t) = 1 - \pi(t)$.
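With complete data and the logistic link, $\dot\pi = \pi\bar\pi$, so the weight in (2) cancels and solving $\tilde U(\beta) = 0$ is ordinary logistic regression. A minimal Newton-Raphson sketch (function names are ours):

```python
import numpy as np

def score(beta, delta, Z):
    """Score function (2) under the logistic link: sum_i (delta_i - pi_i) Z_i."""
    p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
    return Z.T @ (delta - p)

def fit_complete(delta, Z, n_iter=25):
    """Newton-Raphson solution of the score equation (logistic regression MLE)."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        info = Z.T @ (Z * (p * (1.0 - p))[:, None])   # observed information
        beta = beta + np.linalg.solve(info, Z.T @ (delta - p))
    return beta
```

Any standard GLM routine gives the same answer; the explicit iteration is shown because the same loop reappears with modified responses in Section 3.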
In practice, subjects may drop out of the study or, at the end of the follow-up, some patients may still not have developed the event of interest. Let $C$ be the censoring variable and assume that it is independent of both $T_1$ and $T_2$. Under competing risks, one observes $X_i = T_{1i} \wedge T_{2i} \wedge C_i$, $\eta_{1i} = I(T_{1i} \le T_{2i} \wedge C_i)$ and $\eta_{2i} = I(T_{2i} \le T_{1i} \wedge C_i)$ for $i = 1, \ldots, n$. Letting $\eta_{3i} = I(C_i \le T_{1i} \wedge T_{2i})$, we have $\eta_{1i} + \eta_{2i} + \eta_{3i} = 1$. Note that $\eta_{1i} = 1$ implies $\delta_i = 1$, while $\eta_{2i} = 1$ implies $\delta_i = 0$. However, the value of $\delta_i$ is unknown if $\eta_{3i} = 1$.
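The observable quantities $(X_i, \eta_{1i}, \eta_{2i}, \eta_{3i})$ are easy to generate in a small simulation (a sketch with hypothetical exponential rates, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
T1 = rng.exponential(scale=1.0, size=n)   # time to discharge (cure event)
T2 = rng.exponential(scale=2.0, size=n)   # time to death
C  = rng.exponential(scale=3.0, size=n)   # independent censoring time

X    = np.minimum(np.minimum(T1, T2), C)          # X = T1 ^ T2 ^ C
eta1 = (T1 <= np.minimum(T2, C)).astype(int)      # cure event observed first
eta2 = (T2 <= np.minimum(T1, C)).astype(int)      # death observed first
eta3 = 1 - eta1 - eta2                            # censored: delta unknown
```

With continuous variables, ties occur with probability zero, so exactly one of $\eta_{1i}, \eta_{2i}, \eta_{3i}$ equals one for each subject.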
Without loss of generality, assume that $T_1$, $T_2$ and $C$ are all continuous variables. Based on censored data, the likelihood function becomes
$$L_C \propto \prod_{i=1}^n \left\{ [\pi(\beta' Z_i)\, f_{1Z}(x_i)]^{\eta_{1i}}\, [\bar\pi(\beta' Z_i)\, f_{2Z}(x_i)]^{\eta_{2i}}\, [S_Z(x_i)]^{\eta_{3i}} \right\}, \qquad (3)$$
where $f_{jZ}(x) = -\frac{\partial}{\partial x} Q_{jZ}(x)$ for $j = 1, 2$ and
$$S_Z(x_i) = \pi(\beta' Z_i)\, Q_{1Z}(x_i) + \bar\pi(\beta' Z_i)\, Q_{2Z}(x_i).$$
Notice that when censoring exists, the likelihood function of $\beta$ contains nuisance parameters related to $Q_{jZ}(t)$ ($j = 1, 2$), which implies that additional assumptions on the latency distributions are required if likelihood-based inference is pursued. Parametric regression models may be assumed for $Q_{jZ}(t)$ ($j = 1, 2$). Larson and Dinse (1985) considered the same framework as discussed here and assumed a logistic/piecewise-exponential model. In the next section, we review maximum likelihood estimation and then propose other inference methods under more flexible assumptions.
3 The Proposed Methods
3.1 EM algorithm for maximum likelihood estimation
The previous analysis implies that, when censoring is present, likelihood estimation of $\beta$ requires additional knowledge of the latency distributions. Assume temporarily that a parametric form is imposed on $Q_{jZ}(t)$ ($j = 1, 2$). Since direct maximization of $L_C$ in (3) is difficult, the EM algorithm can be used to obtain the maximum likelihood estimator. The likelihood based on the pseudo-data, $\{(\delta_i, \eta_{1i}, \eta_{2i}, X_i, Z_i)\ (i = 1, \ldots, n)\}$, is
$$L_f \propto \prod_{i=1}^n \left\{ \pi(\beta' Z_i)\, [h_{1Z_i}(x_i)]^{\bar\eta_{3i}}\, Q_{1Z_i}(x_i) \right\}^{\delta_i} \left\{ \bar\pi(\beta' Z_i)\, [h_{2Z_i}(x_i)]^{\bar\eta_{3i}}\, Q_{2Z_i}(x_i) \right\}^{1 - \delta_i},$$
where $\bar\eta_{3i} = 1 - \eta_{3i} = \eta_{1i} + \eta_{2i}$ and $h_{jZ}(x) = \left[-\frac{\partial}{\partial x} Q_{jZ}(x)\right] \big/ Q_{jZ}(x)$ for $j = 1, 2$. The E-step takes the expectation of $\log L_f$ with respect to the distribution of the unobserved $\delta_i$'s, given the observed data and the current parameter values. The expected log-likelihood, denoted as $l_f(\beta, Q_1, Q_2, w^{(m)})$, becomes
$$\sum_{i=1}^n \left\{ \sum_{j=1}^2 \eta_{ji} \ln[h_{jZ_i}(x_i)] + w_i^{(m)} \ln[\pi(\beta' Z_i)\, Q_{1Z_i}(x_i)] + (1 - w_i^{(m)}) \ln[\bar\pi(\beta' Z_i)\, Q_{2Z_i}(x_i)] \right\},$$
where
$$w_i^{(m)} = \eta_{1i} + I(\eta_{1i} = \eta_{2i} = 0)\, \frac{\pi(\beta^{(m)\prime} Z_i)\, Q^{(m)}_{1Z_i}(x_i)}{\pi(\beta^{(m)\prime} Z_i)\, Q^{(m)}_{1Z_i}(x_i) + \bar\pi(\beta^{(m)\prime} Z_i)\, Q^{(m)}_{2Z_i}(x_i)}$$
and $\beta^{(m)}$, $Q^{(m)}_{jZ_i}$ ($j = 1, 2$) denote the current parameter values at the $m$th iteration. Note that the last term in $w_i^{(m)}$ is the conditional probability that the $i$th patient will eventually be cured, given that the cure event has not occurred by time $x_i$.
Further, we can write
$$l_f(\beta, Q_1, Q_2, w^{(m)}) = L_{Q_1} + L_{Q_2} + L_\beta,$$
where
$$L_{Q_1} = \sum_{i=1}^n \left\{ \eta_{1i} \ln[h_{1Z_i}(x_i)] + w_i^{(m)} \ln[Q_{1Z_i}(x_i)] \right\},$$
$$L_{Q_2} = \sum_{i=1}^n \left\{ \eta_{2i} \ln[h_{2Z_i}(x_i)] + (1 - w_i^{(m)}) \ln[Q_{2Z_i}(x_i)] \right\}$$
and
$$L_\beta = \sum_{i=1}^n \left\{ w_i^{(m)} \ln[\pi(\beta' Z_i)] + (1 - w_i^{(m)}) \ln[\bar\pi(\beta' Z_i)] \right\}.$$
It is important to note that, in the M-step of the algorithm, $L_{Q_1}$, $L_{Q_2}$ and $L_\beta$ can be maximized separately by treating $w_i^{(m)}$ as a fixed constant. The EM procedure is iterative in that the estimates obtained previously are used to update the value of $w_i^{(m)}$ in the current maximization step. Maximization of $L_\beta$ is straightforward, while maximization of $L_{Q_j}$ depends on the form of $Q_{jZ}(x)$ ($j = 1, 2$). For parametric analysis, convergence and properties of the resulting estimates follow from standard results.
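The E-step reduces to computing the weights $w_i^{(m)}$. A sketch under the logistic link, where `Q1x` and `Q2x` hold the current latency survivor values $Q^{(m)}_{jZ_i}(x_i)$ (array and function names are ours):

```python
import numpy as np

def em_weights(beta, Q1x, Q2x, eta1, eta2, Z):
    """E-step weight w_i: eta1_i plus, for censored cases (eta1 = eta2 = 0),
    the conditional probability that subject i will eventually be cured."""
    p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
    censored = (eta1 == 0) & (eta2 == 0)
    num = p * Q1x
    return eta1 + censored * num / (num + (1.0 - p) * Q2x)
```

Each M-step then maximizes $L_\beta$, $L_{Q_1}$ and $L_{Q_2}$ separately with these weights held fixed, and the updated estimates feed back into the next E-step.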
3.2 Imputation by Conditional Mean
One alternative way to adjust for censoring is to directly modify the score equation in (2). Based on the available data, we can impute the missing $\delta_i$, if $\eta_{3i} = 1$, by its conditional mean, which equals
$$\tilde\delta_i = E[\delta_i \mid T_{1i} \wedge T_{2i} > C_i, C_i = x_i, Z_i] = \frac{Q_{1Z_i}(x_i)\, \pi(\beta' Z_i)}{Q_{1Z_i}(x_i)\, \pi(\beta' Z_i) + Q_{2Z_i}(x_i)\, \bar\pi(\beta' Z_i)}.$$
Here we propose two estimating functions of $\beta$ which avoid going through the maximization of $L_{Q_j}$ ($j = 1, 2$). Both methods modify the results in Wang (2003) by assuming that the covariates take a finite number of values. By partitioning the sample according to the value of $Z$, we can use Wang's method to obtain the nonparametric estimators $\hat Q_{j z_k}(t)$ ($j = 1, 2$) and $\hat p_{z_k}(x)$ for $k = 1, \ldots, J$, where $\hat p_Z(x)$ is an estimator of
$$p_z(x) = E(\delta \mid \eta_3 = 1, X = x, Z = z).$$
An estimating function by model-based imputation is given by
$$U_1(\beta) = \sum_{i=1}^n \left\{ \hat\delta_i - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (4)$$
where $\hat\delta_i = 1$ if $\eta_{1i} = 1$, $\hat\delta_i = 0$ if $\eta_{2i} = 1$, and if $\eta_{3i} = 1$ we set
$$\hat\delta_i = \frac{\hat Q_{1Z_i}(x_i)\, \pi(\beta' Z_i)}{\hat Q_{1Z_i}(x_i)\, \pi(\beta' Z_i) + \hat Q_{2Z_i}(x_i)\, \bar\pi(\beta' Z_i)}.$$
An estimating function by model-free imputation is given by
$$U_2(\beta) = \sum_{i=1}^n \left\{ \tilde\delta_i - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (5)$$
where $\tilde\delta_i = 1$ if $\eta_{1i} = 1$, $\tilde\delta_i = 0$ if $\eta_{2i} = 1$, and if $\eta_{3i} = 1$ we set $\tilde\delta_i = \hat p_{z_i}(x_i)$. Denote $\hat\beta_j$ as the solution of $U_j(\beta) = 0$ for $j = 1, 2$.
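Since $\tilde\delta_i$ in $U_2$ does not involve $\beta$, solving $U_2(\beta) = 0$ under the logistic link is just a logistic score equation with fractional responses, and the same Newton iteration as in the complete-data case applies (a sketch; names are ours):

```python
import numpy as np

def solve_u2(delta_tilde, Z, n_iter=25):
    """Root of U2(beta) = 0 for the logistic link: Newton-Raphson with the
    imputed responses delta_tilde in place of the unobserved delta_i."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        info = Z.T @ (Z * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(info, Z.T @ (delta_tilde - p))
    return beta
```

$U_1$ can be handled the same way once $\hat\delta_i$ is evaluated at a preliminary estimate of $\beta$, as discussed in Section 3.5.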
3.3 Inverse Probability Weighting
When censoring is present, there is a higher chance of first observing an event associated with a smaller failure time. To see this, one can express $\eta_1 = I(T_1 \le T_2 \wedge C)$, which implies that $\eta_1$ is a biased proxy of $\delta$ in the presence of censoring. The larger the value of the failure time $T_1$, the higher the chance that it will be censored. When covariates exist, it can be shown that
$$E[I(\eta_1 = 1) \mid T_1, T_2, Z] = E[I(T_1 \le T_2 \wedge C) \mid T_1, T_2, Z] = I(T_1 \le T_2)\, G_Z(T_1),$$
where $G_Z(t) = \Pr(C > t \mid Z)$. It follows that
$$E(\delta \mid Z) = E\!\left( \frac{\eta_1}{G_Z(X)} \,\Big|\, Z \right) = E\!\left[ E\!\left( \frac{I(T_1 \le T_2 \wedge C)}{G_Z(T_1)} \,\Big|\, T_1, T_2, Z \right) \Big|\, Z \right].$$
To simplify the analysis, we let $G_Z(t) = \Pr(C > t)$, which is estimated by the Kaplan-Meier estimator
$$\hat G(t) = \prod_{u \le t} \left\{ 1 - \frac{\sum_{k=1}^n I(X_k = u,\, \eta_{3k} = 1)}{\sum_{k=1}^n I(X_k \ge u)} \right\}.$$
Replacing $\delta_i$ by $\eta_{1i}/\hat G(X_i)$, one obtains the following estimating function:
$$U_3(\beta) = \sum_{i=1}^n \left\{ \frac{I(\eta_{1i} = 1)}{\hat G(X_i)} - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i.$$
The estimator of $\beta$ can be obtained as a solution of $U_3(\beta) = 0$, denoted as $\hat\beta_3$. The technique of inverse probability weighting has been widely used in the recent survival analysis literature, such as the papers by Jung (1996) and Lin, Sun and Ying (1999).
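A sketch of $\hat G$ and the IPW pseudo-responses $\eta_{1i}/\hat G(X_i)$ used in $U_3$, assuming no ties among the $X_i$ (which holds almost surely for continuous data; names are ours):

```python
import numpy as np

def km_censoring(X, eta3):
    """Kaplan-Meier estimator G_hat(t) of Pr(C > t); the censoring indicators
    eta3 play the role of 'events' here."""
    order = np.argsort(X)
    Xs, d = X[order], eta3[order]
    at_risk = np.arange(len(Xs), 0, -1)      # risk-set sizes n, n-1, ..., 1
    surv = np.cumprod(1.0 - d / at_risk)

    def G(t):
        idx = np.searchsorted(Xs, t, side="right") - 1
        return 1.0 if idx < 0 else surv[idx]
    return G

def ipw_responses(X, eta1, eta3):
    """Pseudo-responses eta1_i / G_hat(X_i) replacing delta_i in U3."""
    G = km_censoring(X, eta3)
    return np.array([e1 / max(G(x), 1e-12) for x, e1 in zip(X, eta1)])
```

Once the pseudo-responses are formed, $U_3(\beta) = 0$ is solved by the same Newton iteration used for the complete-data score.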
The proposed bias adjustment by inverse weighting implicitly assumes that there is no information about $\delta$ beyond the observational period. Specifically, define $\tau_t = \sup\{t : \Pr(T_1 \wedge T_2 > t) > 0\}$ and $\tau_c = \sup\{t : \Pr(C > t) > 0\}$. If $\tau_t > \tau_c$, which implies that $\Pr(T_1 \wedge T_2 > \tau_c) > 0$, we have
$$E\!\left( \frac{\eta_1}{G(X)} \,\Big|\, T_1, T_2 \right) = \begin{cases} \delta & \text{if } T_1 \wedge T_2 \le \tau_c, \\ 0 & \text{if } T_1 \wedge T_2 > \tau_c, \end{cases} \;=\; \delta\, I(T_1 \wedge T_2 \le \tau_c) \ne \delta.$$
The requirement $\tau_t \le \tau_c$ is a condition of sufficient follow-up. We can show that, when $\tau_t \le \tau_c$ and under some regularity conditions, $\hat\beta_3$ is consistent and $\sqrt{n}(\hat\beta_3 - \beta_0)$ converges to a mean-zero normal random variable, where $\beta_0$ is the true value of $\beta$. The asymptotic variance of $\hat\beta_3$ can also be derived.
Alternatively, we can use $(\eta_1, \eta_2)$ jointly to construct an estimating equation. By the same idea, we have the estimating function
$$U_4(\beta) = \sum_{i=1}^n \left\{ \frac{\eta_{1i} - \eta_{2i}}{\hat G(X_i)} - \left(\pi(\beta' Z_i) - \bar\pi(\beta' Z_i)\right) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i.$$
The finite-sample comparison of $U_3(\beta)$ and $U_4(\beta)$ could be explored in a simulation study.
3.4 Imputation by Unconditional Mean
In the context of competing risks, $\Pr(\delta = 1)$ can be measured from the incidence function. Specifically, let $T_1^* = T_1$ if $T_1 \le T_2$ and $T_1^* = \infty$ if $T_1 > T_2$. It follows that
$$S^*(t) = \Pr(T_1^* > t) = \Pr(T_1 > t,\, T_1 \le T_2) + \Pr(T_1 > T_2).$$
Hence
$$\lim_{t \to \infty} \Pr(T_1^* > t) = \Pr(T_1 > T_2) = \Pr(\delta = 0).$$
When a nonparametric estimator of $S^*(t)$ is available, denoted as $\hat S^*_{np}(t)$, we may consider the estimating equation
$$U_5(\beta) = \sum_{i=1}^n \left\{ \left(1 - \hat S^*_{np}(x_{\max})\right) - \pi(\beta' Z_i) \right\} \frac{\dot\pi(\beta' Z_i)}{\pi(\beta' Z_i)\,\bar\pi(\beta' Z_i)}\, Z_i, \qquad (6)$$
where $x_{\max}$ is the maximum value of $X_i$ ($i = 1, \ldots, n$) with $\eta_{1i} = 1$. There exist several nonparametric estimators of $S^*(t)$.
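One concrete choice is an Aalen-Johansen type estimator of the cumulative incidence $1 - S^*(t) = \Pr(T_1 \le t,\, T_1 \le T_2)$ of the cure event, evaluated at $x_{\max}$. A sketch assuming no ties (the function name is ours):

```python
import numpy as np

def one_minus_S_star(X, eta1, eta2):
    """Estimate 1 - S*_np(x_max): cumulative incidence of the cure event
    (cause 1) at the largest observed cause-1 event time."""
    order = np.argsort(X)
    Xs, d1, d2 = X[order], eta1[order], eta2[order]
    at_risk = np.arange(len(Xs), 0, -1)
    # Kaplan-Meier of T1 ^ T2; km_prev is its value just before each time
    km = np.cumprod(1.0 - (d1 + d2) / at_risk)
    km_prev = np.concatenate(([1.0], km[:-1]))
    increments = km_prev * d1 / at_risk      # jumps of the incidence curve
    x_max = Xs[d1 == 1].max()
    return increments[Xs <= x_max].sum()
```

Plugging this constant into (6) calibrates $\pi(\beta' Z)$ to the estimated overall incidence of the cure event.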
3.5 Comparison of the Methods
The formulation of mixture models suggests that the incidence probability and the latency distributions can be modeled separately. Due to the identifiability problem discussed in Li et al. (2001), classical cure models rely on joint estimation of the two components even if only one part is of major interest. For the cure model considered here, the event of cure is explicitly defined and hence identifiability is not an issue. Consequently, the requirement on the length of the follow-up period is less strict. In Section 2 we have seen that, as long as $C_i > T_{1i} \wedge T_{2i}$ for all $i = 1, \ldots, n$, it is natural to estimate the two components separately. The focus here is on the incidence part. When there exist some observations with $C_i < T_{1i} \wedge T_{2i}$, additional information is needed besides the binary regression model itself.
In the likelihood-based analysis, model specification of $Q_{jZ}(t)$ ($j = 1, 2$) is required. Specifically, for a censored observation with $\eta_{1i} = \eta_{2i} = 0$, knowledge of the latency distributions is used in $w_i$ or $\tilde\delta_i$ for weight assignment, since the observed sojourn time as well as the covariate still reveal useful information. Ignoring such information would lead to biased results. However, making additional assumptions increases the possibility of making mistakes. Parametric modeling of the latency distributions is the usual standard approach, and the flexibility of the imposed models is an important concern. In this article, we discuss nonparametric analysis for the latency distributions. It is important to note that, although no distributional assumption is made, nonparametric analysis requires that the subjects be independently and identically distributed, or homogeneous in some sense. That is why we need to assume that $Q_{jZ}(t)$ ($j = 1, 2$) do not depend on $Z$, or that $Z$ is discrete so that the sample can be partitioned into homogeneous subgroups. The likelihood-based estimator of $\beta$ using the logistic/Kaplan-Meier approach of Taylor (1995) is computationally intensive, since it involves back-and-forth iterations between $\beta$ and the high-dimensional parameters in $Q_{jZ}(t)$ ($j = 1, 2$). The proposed estimators $\hat\beta_1$ and $\hat\beta_2$, which use the plug-in approach to handle the nuisance parameters separately, are easier to implement.
It is possible that finding the root of $U_1(\beta) = 0$ is difficult, since it is a complicated function of $\beta$. One alternative is to modify $\hat\delta_i$ as
$$\hat\delta_i = \frac{\hat Q_{1Z_i}(x_i)\, \pi(\hat\beta_{(k-1)}' Z_i)}{\hat Q_{1Z_i}(x_i)\, \pi(\hat\beta_{(k-1)}' Z_i) + \hat Q_{2Z_i}(x_i)\, \bar\pi(\hat\beta_{(k-1)}' Z_i)},$$
where $\hat\beta_{(k-1)}$ is the estimate of $\beta$ from the previous step for $k \ge 1$. A reasonable choice of $\hat\beta_{(0)}$ is $\hat\beta_2$.
Recall that $\tau_t = \sup\{t : \Pr(T_1 \wedge T_2 > t) > 0\}$ and $\tau_c = \sup\{t : \Pr(C > t) > 0\}$. The estimators $\hat\beta_j$ ($j = 1, 2, 3$) all rely on the assumption of sufficient follow-up, that is, $\tau_t \le \tau_c$. The validity of $\hat\beta_3$ depends strongly on this condition, since
$$E\!\left( \frac{\eta_1}{G_Z(X)} \,\Big|\, T_1, T_2, Z \right) = \delta\, I(T_1 \wedge T_2 \le \tau_c).$$
In general,
$$E\!\left( \frac{\eta_1}{G_Z(X)} \,\Big|\, Z \right) = E\{\delta\, I(T_1 \wedge T_2 \le \tau_c) \mid Z\} \ne E(\delta \mid Z).$$
Since Wang's nonparametric estimators rely on the assumption of sufficient follow-up, $\hat\beta_1$ and $\hat\beta_2$ are also affected via the imputed values. However, Wang (2003) proposed a modification for when this condition is violated, which can also be used here to correct the problem to some degree.
4 Concluding Remarks
The literature on illness-death models is abundant. There has been increasing research interest in analyzing the problem in the framework of competing risks or mixture models. One advantage of the mixture formulation is that it allows separate modeling of the incidence and latency distributions. From our analysis, the purpose of joint estimation is to correctly utilize the partial information provided by censored data. One can avoid making model assumptions on the latency part if follow-up is sufficient. We see that there is a trade-off between making strong assumptions on the quality of the data or on the model. The techniques used to handle the effect of censoring can be viewed as applications of the principles for handling missing data reviewed in the book by Little and Rubin (2002).
References
Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993) Statistical Models Based on Counting Processes. New York: Springer-Verlag.
Betensky, R. A. and Schoenfeld, D. A. (2001) Nonparametric estimation in a cure model with random cure times. Biometrics, 57, 282-286.
Crowley, J. and Hu, M. (1977) Covariance analysis of heart transplant survival data. J. Am. Statist. Ass., 72, 27-36.
Farewell, V. T. (1982) The use of mixture models for the analysis of survival data with long-term survivors. Biometrics, 38, 1041-1046.
Fine, J. P., Jiang, H. and Chappell, R. (2001) On semi-competing risks data. Biometrika, 88, 907-919.
Hougaard, P. (2000) Analysis of Multivariate Survival Data. New York: Springer-Verlag.
Klein, J. P. and Moeschberger, M. L. (1997) Survival Analysis: Techniques for Censored and Truncated Data. New York: Springer-Verlag.
Kuk, A. Y. C. and Chen, C. (1992) A mixture model combining logistic regression with proportional hazards regression. Biometrika, 79, 531-541.
Laska, E. M. and Meisner, M. J. (1992) Nonparametric estimation and testing in a cure model. Biometrics, 48, 1223-1234.
Larson, M. G. and Dinse, G. E. (1985) A mixture model for the regression analysis of competing risks data. Applied Statistics, 34, 201-211.
Li, C.-S., Taylor, J. M. G. and Sy, J. P. (2001) Identifiability of cure models. Statistics & Probability Letters, 54, 389-395.
Lin, D. Y., Sun, W. and Ying, Z. (1999) Nonparametric estimation of the gap time distributions for serial events with censored data. Biometrika, 86, 59-70.
Maller, R. A. and Zhou, S. (1992) Estimating the proportion of immunes in a censored sample. Biometrika, 79, 731-739.
Maller, R. A. and Zhou, S. (1994) Testing for sufficient follow-up and outliers in survival data. J. Am. Statist. Assoc., 89, 1499-1506.
Maller, R. A. and Zhou, S. (1996) Survival Analysis with Long-Term Survivors. New York: Wiley.
Peng, Y. and Dear, K. B. G. (2000) A nonparametric mixture model for cure rate estimation. Biometrics, 56, 237-243.
Prentice, R. L., Kalbfleisch, J. D., Peterson, A. V., Flournoy, N., Farewell, V. T. and Breslow, N. E. (1978) The analysis of failure times in the presence of competing risks. Biometrics, 34, 541-554.
Taylor, J. M. G. (1995) Semi-parametric estimation in failure time mixture models. Biometrics, 51, 899-907.
Wang, W. (2003) Inference on the association parameters for copula models under dependent censoring. J. R. Statist. Soc. B, 65, 257-273.
Zhao, L. P. and LeMarchand, L. (1992) An analytical method for assessing patterns of familial aggregation in case-control studies. Genetic Epidemiology,