NONPARAMETRIC INFERENCE UNDER DEPENDENT TRUNCATION Ming-Yen Cheng

(1)

UNDER DEPENDENT TRUNCATION

Ming-Yen Cheng^1,2 Peter Hall¹ You-Jun Yang²

Dedicated to S´andor Cs¨org˝o on the occasion of his 60th Anniversary

ABSTRACT. Data truncation is a problem in scientific investigations. So far, statistical models and inferences are mostly based on the assumption that the survival and truncation times are independent, which can be unrealistic in applications.

In a nonparametric setting, we discuss identifiability of the conditional and unconditional survival and hazard functions when the survival times are subject to dependent truncation, namely, the survival time is dependent on the truncation time. Nonparametric kernel estimators of these unknowns are proposed. Useful- ness of the nonparametric estimators are demonstrated through their theoretical properties, an application and a simulation study.

KEYWORDS. Conditional distribution, dependent truncation, hazard rate, identifiability, kernel, nonparametric estimation, survival function, truncation.

AMS SUBJECT CLASSIFICATION: Primary 62N02, Secondary 62F12.

SHORT TITLE. Dependent Truncation.

1 Centre for Mathematics and its Applications, Australian National University, Canberra, ACT 0200, Australia

2 Department of Mathematics, National Taiwan University, Taipei 106, Taiwan

(2)

1. INTRODUCTION

We consider nonparametric inference in truncated survival-data problems, where the time of onset of a disease, and the time of death, are observable if and only if the onset time falls to the left of a time-point t, and the time of death lies to the right of t. Models of this type address settings, commonly seen in prevalent cohort studies, where a population is surveyed at time t, and only individuals who have experienced the disease and still have the disease at that time are followed up.

Under this setting, the survival time between the two events is left truncated by the truncation time from the initiating event to enrollment.

In this context there exists a literature on parametric approaches to inference, and also on nonparametric methods, in cases where the onset time and survival time are stochastically independent or quasi-independent. We shall discuss shortly this work. However, this quasi-independent, or quasi-stationary, assumption may be violated in practice. For example, advances in medical treatment and patient management may reduce the risk of death for patients with later onset times. To cope with such dependence, Wang et al. (1993) included a parametric component in their hazard regression models. Nevertheless, the more general nonparametric setting, where t is fixed and no assumptions are made about the relationship between onset time and survival time, is not well understood.

At first sight it might appear that there is little hope of making progress with such a completely general nonparametric approach, since the distributions of onset time and survival time are distorted so much by the operation of truncation at t. However, a closer inspection reveals that a surprising amount of detail can be recovered.

For example, it is possible to nonparametrically identify, up to a constant of proportionality and to the right of t, the survival-time density and survival-time

(3)

distribution function. This may be achieved without making any structural assumptions, such as the independence hypothesis mentioned earlier. However, the constant of proportionality, and the density to the left of t, are not identifiable from a nonparametric viewpoint. Nevertheless, a “working value” of the constant of proportionality is readily obtained by either fitting a model, or proceeding nonparametrically under the assumption of independence. In this way, fully nonparametric methods can be used to provide a check on the validity of parametric or structural approaches. For example, a model, or the independence assumption, can be used to estimate both the constant of proportionality and the density and distribution to the right of t. Then, nonparametric estimators of these quantities can be compared with the “predictions” of the model.

Moreover, the conditional hazard rate for survival time s, given that the onset time lies to the right of t − s, is fully identifiable in nonparametric terms, as too is a version of the cumulative hazard rate. We shall suggest fully nonparametric estimators of these quantities and those discussed in the previous paragraph, and outline their properties.

A related setting is that where t is random. Here, significantly more information is available for inference than in the fixed-t case, and greater progress can be made under relatively weak assumptions. See, for example, the work of Wang et al. (1993), which addressed models for prevalent cohort data where patients are recruited at different, random time points. In this research the survival-time distribution was modelled under both independence and dependence assumptions on the initiating time.

In many contexts the term “survival time” should be replaced by “incubation time,” or “induction time,” or “lag time.” In statistical and mathematical terms these quantities have the same role in the model as survival time. Thus, the descrip-

(4)

tion given by Lagakos et al. (1988) of nonparametric inference for induction time in the setting of HIV infection, involves an assumption that is functionally equivalent to the independence of onset time and survival time in a survival-data problem.

Kalbfleisch and Lawless (1989) presented parametric and nonparametric methodology under the same independence condition. The nonparametric induction-time distribution estimators in both papers are in the spirit of methods for nonparametric maximum-likelihood estimation discussed by Woodroofe (1985); see also Li (1995).

If a time-reversal argument is used (see e.g. Lagakos et al., 1988; Kalbfleisch and Lawless, 1989) then, after appropriate adjustments, our methods for the conditional survival and hazard functions can be applied to problems arising from retrospec- tive studies, where only individuals who have experienced the second event before time t are observed. Another closely related situation is that of estimating age- specific incidence, where the time horizon is age instead of calendar time. Outside the realm of biomedical and epidemiological studies, truncated data frequently occur in actuarial, demographic and physical research. Therefore, our results about identifiability, and our nonparametric methods, have a wide range of applications.

Keiding (1991) considered estimating age-specific incidence of certain diseases under independent truncation. Kalbfleisch and Lawless (1991) treated regression of the survival distribution on other covariates. There further exist many related parametric, semiparametric and nonparametric approaches, based on the independence assumption, for various problems of inference about the survival distribution, cumulative hazard and hazard functions, etc. See, for example, Becker and Marschner (1993), Pagano et al. (1994), Gross and Lai (1996), Cui (1999), Grigoletto and Akritas (1999), Sun and Zhou (2001) and Li et al. (2002). Tests for the indepen- dence assumption have been proposed by, for example, Tsai (1990), Kalbfkeisch and Lawless (1991), Efron and Petrosian (1994) and Martin and Betensky (2005).

(5)

Since our estimator of the conditional survival distribution conveys the shape of the true curve correctly, it can be used to check the quasi-independence condition and to offer a guild for parametric modeling of the dependence. Further, the proposed methods may be incorporated in semiparametric hazard or survival regression when nonparametric modeling of the dependence is desired.

Section 2 discusses identifiability of the conditional and unconditional survival and hazard functions for both the truncated and original populations. Estimators of these quantities are proposed in section 3. In section 4, an application to a real-data set arising from a cancer study and a simulation study are presented. There, data arise from the right truncation model mentioned two paragraphs before. Section 5 gives theoretical properties of estimators of the conditional distribution, hazard and density functions in the truncated population. Appendix I contains some notation and Appendix II gives a proof of the main theorem.

2. IDENTIFIABILITY ISSUES

2.1. Model, main problems, and definitions. Assume the random vector (X^∗, Y^∗) is distributed in the triangular region 0 < x < y < ∞. We observe (X^∗, Y^∗) only if X^∗ ≤ t < Y^∗, where t > 0 is fixed. Thus, we have no access to information about the distribution of (X^∗, Y^∗) conditional on X^∗ > t, and so we shall condition on X^∗ ≤ t. That is, we shall assume that X^∗ ≤ t. This mathematical model describes data resulting from a medical survey, at time t, of a population, where X^∗ and Y^∗ denote respectively the time of onset, and time of cessation, of a particular medical condition in a given individual. Each member of the population who is observed at time t to exhibit the condition, is monitored until the condition ceases to be present.

Let (X, Y ) have the distribution of (X^∗, Y^∗) conditional on Y^∗ > t, and let

(6)

(X₁, Y₁), . . . , (X_n, Y_n) be independent random two-vectors distributed as (X, Y ).

Using these data we wish to estimate the distribution of the time duration, S^∗ = Y^∗− X^∗, of the medical condition. Below, we shall refer to S^∗ as the survival time;

if the medical condition is fatal, in which case Y^∗ denotes the time of death, then this terminology is standard. We also seek estimators of the hazard function and cumulative hazard function, each of them conditional on onset time.

Put S = Y − X, and let fS^∗X^∗(s, x) and fS^∗|X^∗(s | x) denote, respectively, the joint density of S^∗ and X^∗ and the conditional density of S^∗ given X^∗. Define f_X^∗(x), f_X(x), etc, analogously. We shall replace f by F when referring to the corresponding distribution function. In this notation, the survival time distribution is FS^∗, and the conditional hazard function and its cumulative are, respectively,

λ(s | x) = f_S^∗_|X^∗(s | x)

1 − F_S^∗_|X^∗(s | x), Λ(s | x) = Z _s

0

λ(u | x) du . (2.1)

2.2. Formulae for the probability of observing (X^∗, Y^∗). The definition of the distribution of X, and the fact that fS^∗X^∗ = fX^∗f_S^∗_|X^∗, entail, for a constant q ≥ 1,

fX(x) = q Z _∞

t−x

fS^∗X^∗(s, x) ds = q fX^∗(x) Z _∞

t−x

f_S^∗_|X^∗(s | x) ds

= q fX^∗(x) {1 − F_S^∗_|X^∗(t − x | x)} ,

for 0 < x ≤ t. Therefore, provided

P (S^∗ > t − x | X^∗ = x) > 0 for all 0 < x ≤ t , (2.2)

we have, for 0 < x ≤ t,

fX^∗(x) = p fX(x)

1 − F_S^∗_|X^∗(t − x | x), (2.3) where p = q⁻¹ is determined by the fact thatR

0<x≤tfX^∗(x) dx = 1.

(7)

Note that

p = Z _t

0

{1 − FS^∗|X^∗(t − x | x)} fX^∗(x) dx = P {(X^∗, Y^∗) is observed} .

That is, p equals the proportion of data (X^∗, Y^∗) that are observed as pairs (X, Y ).

In general the value of p is not identifiable merely from the distribution of (X, Y ).

Indeed, we may make p as close to zero, or as close to 1, as we like, by altering the distribution of (X^∗, Y^∗) but without affecting the (X, Y ) distribution. However, under more stringent conditions on the distribution of (X^∗, Y^∗), p can be identified.

In particular, if

S^∗ and X^∗ are statistically independent (2.4)

then we may estimate FS^∗ root-n consistently from n data on (X, Y ) (see e.g.

Woodroofe, 1985), and thence we may estimate

p =

½ Z _t

0

fX(x)

1 − F_S^∗(t − x) dx

¾₋₁

(2.5)

at the same rate, from the same data. Result (2.5) follows from (2.3), provided (2.2) and (2.4) hold. In the context of independence, (2.2) is equivalent to P (S^∗ > t) > 0.

He and Yang (1998) suggested an estimator of the truncation probability 1−p using a representation different from (2.5).

An alternative approach, more parametric in that it demands particular dis- tributions for S^∗ and X^∗, but less structured from the viewpoint that it does not assume independence, is to fit a model to (X^∗, Y^∗) and estimate model parameters from the data (Xi, Yi). The value of p can then be computed for the model with parameters fitted.

In practice, estimating p under either (2.4) or a parametric model provides a “working guide” to choice of p, even if (2.4), or the model, may be false. A

(8)

graph of nonparametric estimators of F_S^∗, for a range of values of p centred around the working estimator of p, provides information where none might otherwise be available.

2.3. Identifying conditional hazard rate. It follows from the definition of (X, Y ) that, provided that s ≥ t − x,

P (S > s | X = x) = P (S^∗ > s | S^∗ > t − x , X^∗ = x) , or equivalently, that

F¯_S|X(s | x) ≡ 1 − F_S|X(s | x) = 1 − FS^∗|X^∗(s | x)

1 − F_S^∗_|X^∗(t − x | x). (2.6) Noting the definitions of λ(s | x) and Λ(s | x) at (2.1), and that ¯F_S|X(t − x | x) = 1, we may deduce from (2.6) that for s ≥ t − x,

λ(s | x) = −∂ ¯F_S|X(s | x)/∂s

F¯_S|X(s | x) , (2.7)

L(s, x) ≡ Λ(s | x) − Λ(t − x | x) = − log©F¯_S|X(s | x)ª

. (2.8)

Since, for this range of values of s, ¯F_S|X(s | x) is identifiable from data on (X, Y ), then it follows from (2.7) and (2.8) that within the same range, λ(s | x) and Λ(s | x)−

Λ(t − x | x) are also identifiable from data on (X, Y ). However, since we can alter the distribution of (X^∗, S^∗) in the range s < t − x without affecting the distribution of (X, Y ), then neither λ(s | x) nor Λ(s | x) − Λ(t − x | x) is identifiable for s < t − x, and Λ(s | x) is not identifiable for any s > 0.

Of course, λ(s | x) and Λ(s | x) are both identifiable, for s > 0 and 0 < x ≤ t, under the independence assumption, (2.4). In this context the conditioning on x is irrelevant, and λ and Λ are functionals of FS^∗.

2.4. Identifying overall hazard and distribution functions. Assuming (2.2) holds, we may use (2.3) and (2.6) to obtain the formulae,

FS^∗(s) = p Z _t

0

fX(x) FS^∗|X^∗(s | x)

1 − F_S^∗_|X^∗(t − x | x)dx for s > 0 (2.9)

(9)

= 1 − p Z _t

0

fX(x) ¯F_S|X(s | x) dx for s ≥ t , (2.10) f_S^∗(s) = p

Z _t

0

f_X(x) f_S|X(s | x) dx for s ≥ t . (2.11)

Let us assume, for the time being, that p is known. Then, in view of (2.9), FS^∗ is identifiable from data on (X, Y ), if and only if the integral on the right-hand side of (2.9) is identifiable. Since knowing (X, Y ) is equivalent to knowing (S, X), which is constrained by S ≥ t − X, then we cannot, in general, identify F_S^∗_|X^∗(s | x) outside the range s ≥ t − x. However, if s < t then the integral at (2.9) requires knowledge of F_S^∗_|X^∗(s | x) for s to the left of t−x. Therefore, if the only information we have is in terms of the distribution of (X, Y ), we cannot identify FS^∗(s) outside the range s ≥ t. Indeed, for the reasons just given it is possible to construct two examples of smooth, non-pathological distributions of (X^∗, Y^∗) for which p, and the distribution of (X, Y ), are identical, but the values FS^∗(s) differ for s < t. Within the identifiable range s ≥ t, (2.10) provides an explicit formula for F_S^∗ in terms of p and the identifiable functions f_X and F_S|X.

Formula (2.10) may equivalently be written as

FS^∗(s) = 1 − π P (S > s | S > t) for s ≥ t , (2.12)

where the function P (S > s | S > t) is of course identifiable for s ≥ t, and the constant π ≡ P (S^∗ > t) is not identifiable from data on (X, Y ) alone. We argued in section 2.2 that a “working approximation” to p might be constructed by estimating p under the assumption that S^∗ and X^∗are independent, or via a parametric model.

The same approach can be taken for π, and the function P (S > s | S > t) estimated root-n consistently from data on (X, Y ).

Note particularly that (2.11), and also (2.12), imply that we can identify, up to a constant of proportionality and to the right of t, the survival-time density f_S^∗.

(10)

In addition, from (2.10) and (2.11), the overall hazard rate

λ_S^∗(s) = f_S^∗(s) 1 − FS^∗(s) is identifiable from data on (X, Y ) for all s ≥ t.

3. METHODOLOGY

3.1. Outline of methodology. We shall suggest kernel-based methods for estimating F_S|X(s | x), using the data (Xi, Yi). The estimators are of Nelson–Aalen or product- limit type. They can be applied directly to estimate L(s, x), given at (2.8), and indirectly to estimate λ(s | x) and FS^∗(s). Our methods allow these quantities to be estimated throughout the range where they are identifiable, which for L(s, x) and λ(s | x) is s ≥ t − x and, for F_S^∗(s), is s ≥ t.

The estimator of F_S^∗(s) may be smoothed, and then differentiated, to give an estimator of fS^∗(s) for s ≥ t. An alternative approach to estimating FS^∗ and fS^∗ is to proceed via the representation (2.12). That method too will be discussed. Our estimators enjoy optimal nonparametric convergence rates.

3.2. Estimators of F_S|X and L. We shall treat only second-order methods, designed for estimating conditional distributions or densities with two bounded derivatives.

Higher-order estimators may be treated similarly.

Let K, a kernel function, be a bounded, compactly supported, symmetric prob- ability density, and let h denote a bandwidth. Put Kh(u) = h⁻¹K(u/h). Two estimators of F_S|X, consistent in the range s ≥ t − x and 0 < x ≤ t, are:

Fb_S|X(s | x) = 1 − exp

½

− Xn

i=1

I(S_i≤ s) K_h(x − X_i) P

j≤nI(t − Xj ≤ Si ≤ Sj) Kh(x − Xj)

¾

, (3.1)

FeS|X(s | x) = 1 − Yn i=1

½

1 − K_h(x − X_i)

P

j≤nI(t − Xj ≤ Si ≤ Sj) Kh(x − Xj)

¾_I(S_i_≤s) .

(11)

The first estimator, based on (2.8), derives from a Nelson-Aalen estimator of L(s, x):

L(s, x) =b Xn i=1

I(S_i ≤ s) K_h(x − X_i) P

j≤nI(t − Xj ≤ Si ≤ Sj) Kh(x − Xj).

The second estimator, eF_S|X(s | x), is of product-limit type, and may be viewed as an approximation by Taylor expansion to bL(s, x). It is analogous to another estimator of L(s, x), eL(s, x) say.

Result (5.5), below, will show that bL converges to L at rate (nh)^−1/2 + h², where the first term derives from error-about-the-mean and the second from bias.

A central limit theorem for bL is obtainable from a bivariate form of (5.7).

3.3. Smoothing bF_S|X and eF_S|X, and estimating λ. We cannot immediately differ- entiate either bF_S|X(s | x) or eF_S|X(s | x) in s, since they are not smooth functions of that variable. However, this difficulty can be eliminated by replacing the indica- tor function I(Si ≤ s), in the definitions of both estimators, by M {(s − Si)/H}, where M denotes the distribution function corresponding to the density K, and H is another bandwidth. Provided K is continuous, the functions bFS|X(s | x) and FeS|X(s | x) now have continuous derivatives with respect to s, and these derivatives can be viewed as estimators of f_S|X(s | x).

If we were interested only in smoothing bF_S|X and eF_S|X to remove visually distracting jumps in graphs of function estimates, such as bL(s, x), we would choose H by eye. If we wished to estimate f_S|X(s | x) or λ_S|X(s | x), we would generally take H = h, so that only one bandwidth had to be chosen. This approach gives estimators, ˆf_S|X and ˆλ, say, of f_S|X and λ, respectively, with optimal convergence rates. Indeed, under the assumption that fSX has two bounded derivatives, using the above smoothing technique, and choosing H = h = h(n) → 0 and nh² → ∞ as n → ∞, the pointwise convergence rates of ˆfS|X ≡ ∂ bFS|X/∂s and ∂ eFS|X/∂s to f_S|X(s | x) equal O_p{(nh²)^−1/2+ h²}. Taking h to be of size n^−1/6, this gives the

(12)

optimal mean-square convergence rate of n^−2/3, for functions with two bounded derivatives. In the case of ˆf_S|X these results follow from (5.6) below, which also implies a central limit theorem for the estimator.

In more detail, substitution into (2.7) gives

λ(s | x) =ˆ ∂ bF_S|X(s | x)/∂s 1 − bF_S|X(s | x) =

Xn i=1

Kh(s − Si) Kh(x − Xi) P

j≤nI(t − Xj ≤ Si≤ Sj) Kh(x − Xj), (3.2) fˆ_S|X(s | x) = ©

1 − bF_S|X(s | x)ªλ(s | x) .ˆ

Here, different values of the smoothing parameter h could be used to construct λ(s | x) and bˆ F_S|X(s | x); the bandwidths would be of sizes n^−1/6 and n^−1/5, respectively, reflecting the fact that the curve estimation problems are bivariate for the former and univariate for the latter.

3.4. Alternative estimator of F_S|X, and edge effects. Another approach to estimat- ing FS|X is to proceed via a kernel estimator, ˇfSX say, of fSX:

fˇSX(s, x) = 1 nh²

Xn i=1

K

µs − Si

h

¶ K

µx − Xi

h

¶ .

Since, for s ≥ t − x,

FS|X(s | x) = R

t−x<u<sfSX(u, x) du R

u>t−xfSX(u, x) du , then, for appropriate choice of the bandwidth h,

Fˇ_S|X(s | x) = R

t−x<u<sfˇSX(u, x) du R

u>t−xfˇSX(u, x) du is a consistent estimator of FS|X(s | x).

A disadvantage of this approach, however, is that it is relatively sensitive to the impact of edge effects, for example those that occur near the diagonal boundary defined by s+x = t. These effects can have substantial influence on the bias of ˇF_S|X,

(13)

with the result that ˇF_S|X(s | x) is, in general, not consistent at either the diagonal boundary or the vertical boundaries defined by x = 0 and x = t. Corrections for edge effects need to be incorporated, in particular if ˇF_S|X is used as the basis for an estimator of FS^∗. Those adjustments can be tedious and cumbersome.

The estimators bF_S|X and eF_S|X are substantially more robust than ˇF_S|X against edge effects due to bias. See, for example, the first part of the theorem in section 5, where the order of bias, O(h²), in (5.5) is valid along the diagonal boundary and within h of the vertical boundary. Along the vertical boundaries the order of bias is O(h), rather than O(1) as in the case of ˇF_S|X, and so the estimators bF_S|X and Fe_S|X are consistent along the vertical boundaries.

However, bF_S|X and eF_S|X can suffer problems in the upper tail of the distribu- tion, in particular if it happens that for some (s, x),

F_S|X(s | x) = 1 . (3.3)

Condition (5.1), in our discussion below of theoretical properties of bF_S|X(s | x), reflects this point. However, provided the distribution of S given X is unbounded to the right, the identity (3.3) will never arise, and the upper limit of values of s for which methods, based on bF_S|X or eF_S|X, can estimate F_S|X effectively, will steadily increase with sample size.

3.5. Estimators of FS^∗, fS^∗and λS^∗. One approach to estimating FS^∗ is to construct an estimator of FS|X, as suggested in section 3.2, then compute an estimator of ˆfX

of f_X using standard kernel methods, and combine them into an estimator of F_S^∗, for a particular value of p, by substituting into (2.10):

Fb_S^∗(s) = 1 − p Z _t

0

fˆ_X(x) {1 − bF_S|X(s | x)} dx

for s ≥ t. Provided we appropriately undersmooth when constructing ˆf_X and bF_S|X,

(14)

which means choosing a bandwidth h satisfying n^−1+δ ≤ h = O(n^−1/4) for some

1

2 < δ ≤ ³₄, the estimator bFS^∗ is root-n consistent for FS^∗. It is simpler, however, to work from (2.12), and take

FbS^∗(s) = 1 − π bG(s, t) (3.4)

for s ≥ t, where

G(s, t) =b P

i I(Si > s) P

i I(S_i > t) (3.5)

estimates G(s, t) ≡ P (S > s | S > t). Provided P (S > s) > 0, the estimator bG is root-n consistent for G.

Discontinuities in s can be removed using the method suggested in section 3.3.

Here this amounts to replacing the indicator function I(S_i > s) in (3.5) by 1 − M {(s − S_i)/H}, where M is a distribution function and H denotes a bandwidth.

As in section 3.3, if our reason for smoothing was to remove jumps in bG( · , t) then H would be small, of order n⁻¹, and would be chosen by eye. On the other hand, if our aim was to estimate fS^∗ then we would use a larger bandwidth. Taking H = h in this case, we differentiate (3.4) with respect to s, obtaining a kernel estimator of fS^∗(s):

fˆS^∗(s) = π P

i Kh(s − Si) P

i I(Si > t) (3.6)

for s ≥ t, where K = M⁰. Choosing h to be of size n^−1/5 gives an estimator fˆS^∗ with optimal mean-square convergence rate n^−4/5, provided π assumes its true value. At the very least, without any attempt to evaluate π, (3.6) gives, for s > t, a completely nonparametric estimator of a function, fS^∗(s)/π, that is proportional to the true survival density, f_S^∗(s).

As suggested in section 2.4, one approach to choosing π would be to use either a parametric model, or the structural, but nonparametric, assumption that S^∗ and

(15)

X^∗ are independent, to produce a working value of the non-identifiable quantity π = P (S^∗ > t); and to plot the estimates ˆfS^∗(s), s ≥ t, for values of π on either side of the working value.

Forming ratio of ˆfS^∗(s) and 1 − bFS^∗(s) we have an estimator λˆS^∗(s) =

P

i Kh(s − Si) P

i I(Si > s) (3.7)

of the overall hazard rate λS^∗(s) for all s to the right of t.

The estimators at (3.6) and (3.7) can suffer edge effects at the left- hand bound- ary, i.e. near s = t. These can be removed by conventional means, for example through using a boundary kernel there or by binning the data and fitting a local linear smoother.

4. NUMERICAL STUDY

4.1. An application. We present here an application to a dataset from a breast cancer study conducted by the National Cancer Institute of Canada Clinical Trials Group. Between April 1998 and July 1999, 305 eligible patients, who had inoperable metastatic or recurrent breast cancer with an Eastern Cooperative Oncology Group score of 0 to 2, were entered onto the study from 38 centres in Eastern Europe (n = 192), Canada (n = 46), South Africa (n = 34), Western Europe (n = 24), and Australia (n = 9).

Let eX^∗, eY^∗ and t (= 355), all expressed in months and calculated from 1st January 1970, respectively denote the time a patient was first diagnosed with breast cancer, the time the patient had the first recurrence of breast cancer, and the time to the end of the study. Of major interest is the survival time, or progression free time, eS^∗ = eY^∗− eX^∗ from the first incidence of breast cancer to the first recurrence.

In this study, only individuals with eY^∗ ≤ t can be observed. Equivalently, the survival time eS^∗ is subject to right truncation by t − eX^∗. Let ( eX, eS) denote

(16)

the truncated version of ( eX^∗, eS^∗), i.e. ( eX, eS) equals ( eX^∗, eS^∗) conditional on eS^∗ ≤ t − eX^∗. The left panel of Figure 1 plots eX + eS against eX for the 287 recurrent breast cancer patients, among the total of 305 metastatic or recurrent breast cancer cases.

If the time-reversal transformation

X^∗ = t − eX^∗, S^∗ = t − eS^∗, X = t − eX , S = t − eS , (4.1)

is employed, then S^∗ is subject to left truncation and (X, S) has the distribution of (X^∗, S^∗) conditional on S^∗ ≥ t − X^∗. Hence, the methods proposed in section 3 can be applied to analyse this data set. The right panel of Figure 1 shows the transformed data on the truncated population of (X, S). Note that one isolated point, corresponding to a much earlier initial time of original breast cancer than the others, has been deleted.

In section 4.1.1, the conditional survival and hazard functions of eS given eX are estimated using (4.1) and estimators of the conditional survival and hazard functions of S given X given in section 3.2. Section 4.1.2 studies the original population ( eX^∗, eS^∗). Under the model in section 2, FS^∗(s) is identifiable up to a constant of proportionality and λS^∗(s) is identifiable for s ≥ t; F_S^∗_|X^∗(s|x) is identifiable up to a constant of proportionality and λ_S^∗_|X^∗(s|x) is identifiable for s > t − x. Under the model here, neither FeS^∗(es) nor λeS^∗(es) is identifiable for any 0 < es < t; FeS^∗|Xe^∗(es|ex) is identifiable up to a constant of proportionality and λeS^∗|Xe^∗(es|ex) is identifiable up to a constant for all 0 < es < t − ex.

4.1.1. Truncated distribution. Following from the time-reversal transformation (4.1), for all 0 ≤ es ≤ t − ex,

FeS|Xe(es | ex) = 1 − F_{S | X}(s⁻| x) , feS|Xe(es | ex) = f_S|X(s | x) ,

(17)

Observe from Figure 1 that truncation is heavy for large ex values, and that there are very few points with small ex values. Also, there are edge effects. Therefore, in the analysis of conditional survival distributions, ex was fixed at ten middle values 220, 230,. . . , 310. The kernel function was K(u) = ¹⁵₁₆(1 − u²)²I(|u| ≤ 1). The bandwidth h was taken as 15 for the first five ex values, and 7.5 for the others.

Figure 2 depicts the estimates of the truncated conditional distribution func- tion, FeS|Xe(es | ex), over the range 0 ≤ es ≤ t− ex where it is identifiable. The truncated conditional distributions are noticeably different. Therefore, analysis based on the overall distribution of the true survival time eS^∗, instead of on its conditional distribution on eX^∗, could be quite misleading.

F_S|X(s⁻| x) = λ_{S | X}(s | x) 1 − FS|X(s | x) F_S|X(s⁻| x) ,

4.1.2. Original distribution. The survival distribution in the original population ( eX^∗, eS^∗) is of significant interest. Lagakos et al. (1988) derived the product-limit estimator of FeS^∗(es) using (4.1) and assuming independence between eS^∗ and eX^∗. Then one can obtain the corresponding Nelson-Aalen estimator of the cumulative hazard. The product-limit estimate of FeS^∗(es) is shown in Figure 3. On the other hand, for all 0 < es < t,

FeS^∗(es) = P (S^∗ ≥ s) = p Z _t

0

fX(x) ¯FS^∗|X^∗(s⁻| x)

1 − F_S^∗_|X^∗(t − x | x)dx . (4.2)

(18)

Hence, without the independence condition, neither FeS^∗(es) nor λeS^∗(es) is identifiable using data on ( eS, eX) for any 0 < es < t. Even with information on p, this difficulty remains as the integral in (4.2) cannot be changed to the form in (2.10) for any 0 < es < t. This in can be seen from the data structure eX^∗ < eX^∗+ eS^∗ < t where t is a constant.

From (2.6) and (4.1), for 0 ≤ es ≤ t − ex,

FeS^∗|Xe^∗(es | ex) = ¯F_S^∗_|X^∗(s⁻| x) = ¯F_S|X(s⁻| x) ¯F_S^∗_|X^∗(t − x | x) . (4.3)

The conditional truncation probability F_S^∗_|X^∗(t−x | x) is not identifiable from data on the truncated version ( eS, eX). Under the independence assumption, ¯F_S^∗_|X^∗(t − x | x) equals ¯FS^∗(t − x), or equivalently FeS^∗(t − ex), which can be estimated by the product-limit method described before. Combining this working value of ¯F_S^∗_|X^∗(t−

x | x) and our estimator of ¯F_S|X(s⁻| x) we obtain an estimator of FeS^∗|Xe^∗(es | ex) over the range 0 ≤ es ≤ t − ex. The results are also depicted in Figure 3. It shows that the conditional distribution of the survival time eS^∗ given the initial time eX^∗ varies over the initial time. And the product-limit estimate of the unconditional distribution function differs from the estimates of the conditional distribution functions in shape.

Often, the hazard function gives better insight into the survival distribution than the survival function does. From (4.3), for 0 ≤ es ≤ t − ex,

λeS^∗|Xe^∗(es | ex) = λS^∗|X^∗(s | x) F¯_S^∗_|X^∗(s | x) F_S^∗_|X^∗(s⁻| x)

= λ_S^∗_|X^∗(s | x) F¯_S|X(s | x) ¯F_S^∗_|X^∗(t − x | x) 1 − ¯F_S|X(s⁻| x) ¯F_S^∗_|X^∗(t − x | x).

Therefore, while λ_S^∗_|X^∗(s | x) is identifiable for all s ≥ t − x, λeS^∗|Xe^∗(es | ex) is not identifiable in the region 0 ≤ es ≤ t − ex. An estimator is available using our estimators of FS|X(s | x) and λS^∗|X^∗(s | x) (= λS|X(s | x) for s ≥ t − x) given in (3.1) and (3.2), and a working value of F_S^∗_|X^∗(t − x | x), e.g. the product-limit estimator

(19)

of F_S^∗(t − x), by assuming independence. The results are depicted in Figure 4.

When ex = 260 or 270, the first peak in the conditional hazard estimates occurs at around es = 40 or 20, much earlier than the first peak (around es = 65) in the other curves. Alternatively, under independence of eS^∗ and eX^∗, kernel smoothing the Nelson-Aalen estimator yields an estimator of the overall hazard function λeS^∗(es).

This overall hazard rate estimate, shown in Figure 4, does not resemble any of the conditional hazard rate estimates.

The observations made from Figures 3 and 4 suggest that, for this data set, dependence between the survival time eS^∗ and the truncation time t − eX^∗ exists. In that case, traditional approaches that rely on the independence assumption, such as the product-limit and Nelson-Aalen estimators, may not be useful. By contrast, our methods provide a better description of the dependence by illustrating how the conditional survival and hazard change over the truncation time.

4.2. Simulation study. Here we use a simulation study to demonstrate that, in the case of dependent truncation, our methods indeed recover information about the original distribution from the truncated data more accurately than methods based on the independence assumption. We consider a setting similar to the breast- cancer data: n = 300 observations were collected on ( eX, eY ), which equals ( eX^∗, eY^∗) conditional on eX^∗ ≤ eY^∗ ≤ t. Let the truncation time t be 350. Take eX^∗ = 350 − Z, where Z follows an Exponential distribution with rate 0.03 and is constrained by Z ≤ t (= 350). Suppose the conditional distribution of the survival time, eS^∗ = Ye^∗ − eX^∗, given eX^∗ = ex is a log-logistic distribution with density function

s/α(ex)ª_β(e^x)−1 h

1 +© e

s/α(ex)ª_β(ex)i₂ , es > 0 .

(20)

The survival and hazard functions are respectively Fe¯S^∗|Xe^∗(es | ex) =

h 1 +©

e

s/α(ex)ª_β(e^x)i₋₁

s/α(ex)ª_β(ex)−1

h 1 +©

e

s/α(ex)ª_β(ex)i . We set α(ex) = exp(ex/70) and β(ex) = (ex + 100)/100. The methods described in section 4.1 were applied to 1000 random samples generate from this model.

Panel (a) of Figure 5 plots the overall hazard function λeS^∗(es) and the con- ditional hazard function λeS^∗|Xe^∗(es | ex) for ex = 210, 230, 250 and 270. Panel (b) of Figure 5 gives the pointwise 10th, 50th and 90th percentiles of the product-limit estimator (by independence assumption) of λeS^∗(es) and the true curve. As expected, this estimator fails here since independence does not hold and the overall hazard function does not resemble any of the conditional hazard functions. Given in panel (c) of Figure 5 are five typical realizations of the product-limit estimator that cor- respond to the 10th, 30th, 50th, 70th and 90th percentiles of the 1000 integrated squared errors, calculated from zero to the 95th quantile of the true distribution of Se^∗. All of them depart from the true curve dramatically.

Figure 6 plots the pointwise 10th, 50th and 90th percentiles of our estimator of λeS^∗|Xe^∗(es | ex), described in section 4.1.2, for ex = 210, 230, 250 and 270. Although the product-limit estimator, used as a working value of F_S^∗_|X^∗(t − x | x) in our method, is biased without the independence assumption, our estimator performs quite well.

Comparing Figures 5 and 6, it is clear that our observation in section 4.1.2 are not just artifacts of random fluctuation. Instead, our estimators indeed recover shapes of the underlying curves accurately.

We also experimented with replacing the working value of F_S^∗_|X^∗(t − x | x) by the true value in the estimation of λeS^∗|Xe^∗(es | ex). The outcomes are not given here, but this approach always does better. However, only for large values of es or ex is it significantly better than our estimator.

(21)

As stated in section 4.1.1, the conditional hazard function λeS|Xe(es | ex) is identifiable; we gave an estimator of it there. Figure 7, presenting performance of our estimator of λeS|Xe(es | ex), is an analog of Figure 6. It validates our estimator for a relatively small sample size, n = 300.

5. THEORETICAL PROPERTIES

First we discuss the estimator bF_S|X, introduced in section 3.2. Results similar to those below may be derived in the case of eFS|X. Our main theoretical formula, from which our other results will follow, will describe uniform convergence properties of Fb_S|X(s | x) for values of (s, x) satisfying 0 < t − x < s < s₀, where s₀ has the property:

sup

0<x<t F (s0| x) < 1 . (5.1)

Since our estimators are based on second-order kernel methods then their biases are of order h². This is true for both density and distribution estimators. Concise expressions for the order h² bias terms are especially complex, however, and so we shall not give them here; they will be represented simply as O(h²) in asymptotic approximations. On the other hand, we shall be relatively concise about error- about-the mean terms.

Next we introduce notation. Let c_j and β_j denote functions, not depending on n or h, which are introduced in Appendix I. Put

a1(s, x ; s⁰, x⁰| h) = Kh(x − x⁰) c1(s, x ; s⁰, x⁰) ,

a2(s, x ; s⁰, x⁰| h) = Kh(x − x⁰) Kh(s − s⁰) c2(s, x ; s⁰, x⁰)

and bj(s, x ; s⁰, x⁰| h) = aj(s, x ; s⁰, x⁰| h) − E{aj(s, x ; S, X | h)}. We shall make the following assumptions:

the distribution of (S, X) is continuous, with density fSX, which has two continuous derivatives on its support; the support of fXequals the interval [0, t], on which f_X is bounded away from zero;

(5.2)

(22)

K is a bounded, symmetric, H¨older-continuous probability density, sup-

ported on [−1, 1]; (5.3)

for some δ ∈ (0, 1), h = h(n) satisfies n^δh → 0 and n^1−δh → ∞. (5.4)

Theorem 1. Assume (5.2)–(5.4), and that s0 satisfies (5.1). Define bF_S|X by (3.1).

Then, uniformly in (s, x) in the range that 0 < t − x < s < s0 with h ≤ x ≤ t − h or s ≤ min(t, t − x + h)

Fb_S|X(s | x) − F_S|X(s | x)

F¯_S|X(s | x) = h²β₁(s, x) + 1 n

Xn i=1

(nh)^−1/2+ h²ª

. (5.5)

Theorem 2. Assume the conditions in Theorem 1. If, in the definition at (3.1), we replace I(S_i ≤ s) in the numerator on the right-hand side by M {(s − S_i)/h}, where M denotes the distribution function corresponding to density K; if we then define fˆ_S|X(s | x) as the derivative of bF_S|X(s | x) with respect to s and ˆλ(s | x) as in (3.2);

and if we replace the assumption “n^1−δh → ∞” in (5.4) by “n^1−δh² → ∞”; then, for (s, x) satisfying 0 < t − x < s < s0 with h ≤ x ≤ t − h and s > min(t, t − x + h),

λ(s | x) − λ(s | x) = λ(s | x)ˆ

½

h²β2(s, x) + 1 n

Xn i=1

b2(s, x ; Si, Xi| h)

¾

+ op

, (5.6)

fˆ_S|X(s | x) − f_S|X(s | x) = f_S|X(s | x)

½

h²β3(s, x) + 1 n

Xn i=1

b2(s, x ; Si, Xi| h)

¾

+ op

. (5.7)

Theorem 1 does not specifically address the order of the bias of bF_S|X within h of the vertical boundaries, where x = 0 or x = t. However, the bias there can be shown to be of order h. Formulae for the error-about-the-mean of bFS|X in those places are given in (A.8).

(23)

The series on the right-hand side of (5.5) is of course a sum of independent and identically distributed random variables, and so (5.5) can be used readily to derive a central limit theorem for bF_S|X. Indeed, n⁻¹ P

i b1(x, s; Si, Xi| h) is asymp- totically normally distributed with zero mean and variance (nh)⁻¹σ1(s, x)², where σ1(s, x)² = κ fX(x) E{c1(s, x; S, X)²} and κ =R

(nh)^−1/2σ1(s, x) Nn(s, x) + h²β1(s, x)ª + op

¡h²¢

, (5.8) where the random variable Nn(s, x) is asymptotically normally distributed with zero mean and unit variance.

Standard arguments may be used to prove that, under the conditions of the the- orem, n⁻¹ P

i b₁(x, s; S_i, X_i| h) is of order (nh)^−1/2(log n)^1/2, uniformly in (s, x) in the range 0 < t − x < s < s0. Hence, (5.5) may also be used to prove that

sup

(s,x) : 0<t−x<s<s0, h≤x≤t−h

¯¯

¯ bF_S|X(s | x) − F_S|X(s | x)

¯¯

¯ = Op

and ©

are each asymptotically normally distributed with respective asymp- totic means h²β2(s, x) and h²β3(s, x) and common variance (nh²)⁻¹σ₂²(s, x), where σ₂²(s, x) = κ²f_SX(s, x)E©

c₂(s, x; S, X)²ª .

APPENDIX I: DEFINITIONS OF FUNCTIONS c_j AND β_j Put

κj(h, s, x) = Z

v : min(0,t−s)<x−vh<t

v^jK(v) dv , ρj(h, s, x) = κj(h, s, x) κ₀(h, s, x), c_j(s, x ; s⁰, x⁰) = I(j = 1) I(s⁰ ≤ s) + I(j = 2)

I(j = 1) I(t − x⁰ ≤ s) + I(j = 2)ª

× I(t − x⁰ ≤ s⁰) Z _s⁰

t−x⁰

f_SX(u, x) κ0(h, u, x) r(u, x)² du ,