Binary Response Models with M-phase Case-Control Data

30  Download (0)

Full text

(1)

PREPRINT

國立臺灣大學 數學系 預印本 Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/ ~ mathlib/preprint/2011- 17.pdf

Binary Response Models with M-phase Case-Control Data

Chin-Tsang Chiang, Ren-Hong Bai, and Ming-Yueh Huang

December 1, 2011

(2)

Binary Response Models with M -phase Case-Control Data

Chin-Tsang Chiang, Ren-Hong Bai, and Ming-Yueh Huang Department of Mathematics, National Taiwan University

December 1, 2011

Abstract

The present study aimed to characterize the relationship between a binary response and covariates of interest through a more general single-index regression model. With M -phase (M ≥ 2) case-control data supplemented by information on a response and certain covariates, we primarily propose a pseudo likelihood estimation for the index coefficients of this type of semiparametric model. Additionally, our approach can be readily adopted to accommodate case-control sampling with a continuous response and outcome-dependent sampling. With the considered data setting in the receiver operating characteristic curve analysis, an estima- tion for the accuracy measure is provided and is borrowed to seek an optimal linear predictor in the class of potential linear predictors. To check model correctness, a pseudo least squares approach is further employed as an aid to devising suitable testing procedures. The general theoretical frameworks for the proposed estimators and the bootstrap inference are also de- veloped in this article. Finally, extensive simulations and two empirical applications are used to illustrate the applicability of our methodology.

1 Introduction

To account for the relationship between potential covariates X = (X1, . . . , Xd)> and a binary response Y (1: case; 0: control), researchers and practitioners try to better understand the con- ditional probability of Y. We can reasonably argue that fully parametric models might be mis- specified and fully nonparametric ones usually suffer from the curse of dimensionality (Bellman 1961). Thus, a more flexible semiparametric model becomes a great interest and compromise.

With prospective studies, recent research has developed a variety of estimation approaches and

Keywords: accuracy measure; bootstrap; case-control; cross-validation; missing data; M-phase; pseudo least squares; pseudo maximum likelihood estimator; receiver operating characteristic curve; single-index model.

(3)

the related inferences for certain semiparametric models. In application, it is often impractical to obtain this type of data due to the high cost or tremendous amount of time in collecting expensive variables. An alternative cost-effective strategy to solve this difficulty is to adopt supplementary M-phase sampling plans.

In the simple case-control design, Anderson (1972) and Pyke (1979) showed that the covari- ate effects in logistic regression models can be estimated in a normal manner as if data were coming from a prospective study. However, this attractive feature cannot carry over to arbi- trary regression models without some degree of information on the marginal probability of Y . To overcome the confronting problem and further enhance study efficiency, the supplementary M-phase sampling mechanism has been adopted in several clinical and epidemiological studies.

Let (X(1)>, . . . , X(M )>)> be an alternative expression of X with categorical or discrete covariates X(m) having finite support {x(m)1 , . . . , x(m)Jm} and being designed to be observed at the mth phase, and Rm+1 indicate the missing status of X(m+1), which is defined to be 1 if X(m+1) is observed and 0 otherwise, at the (m + 1)th phase, m = 1, . . . , M − 1. Using these missing data nota- tions, the so-called M-phase case-control sampling design can be concisely described as below:

First of all, a prospective cohort of size n is stratified by the values of (Y, X(1)) at the first phase.

For each individual with the combination value {Y = y, X(1) = x(1)j1 , . . . , X(m) = x(m)jm } at the mth phase, the sampling indicator Rm+1 is randomly generated from the independent Bernoulli or fi- nite population sampling scheme with the corresponding selection probability q(y, xb (1)j1 , . . . , x(m)jm ) = n(y, x(1)j1 , . . . , x(m)jm )/Pn

i=1QimI(Yi = y, Xi(1) = x(1)j1 , . . . , Xi(m) = x(m)jm ), where n(y, x(1)j1 , . . . , x(m)jm ) is the expected or specified subsample size at the (m + 1)th phase and Qm+1 =Qm+1

k=2 Rk. For the sake of simplification, both of case-control sample and its population are treated as the first phase without distinction (cf. Breslow and Holubkov 1997). Under stratified two-phase case-control studies, Scott and Wild (1991) proposed a maximum likelihood estimation for logistic regression.

Meanwhile, Wild (1991) provided a maximum likelihood approach as well as two pseudo maximum likelihood ones for the regression parameters of an arbitrary model with the link function being

(4)

known up to the parameters and general covariates. It was further shown by Scott and Wild (1997) that the maximum likelihood estimator can be obtained by iteratively updating the derived pseudo-likelihood equation. As one can see, the feasibility of their approaches mainly relies on the monotonic assumption of a response function or another suitable condition. Based on the con- tinuous response Yo with supplementary information on its dichotomized response Y = I(Yo ≤ yc) for some threshold value yc, a more efficient estimation was developed by Jiang, Scott, and Wild (2009) under the validity of a parametric model for the conditional distribution of Yo. Recently, Lee, Scott, and Wild (2010) also presented an efficient profile likelihood approach for multi-phase case-control data and unified earlier developments.

In this article, we focus on a fairly useful single-index model (SIM):

P (Y = 1|X = x) = Gβ0(xβ0), (1.1)

where Gβ(·) is an unknown function and xβ = x1+ (x2, . . . , xd)>β with β = (β2, . . . , βd)> and β0 = (β02, . . . , β0d)>being a vector of true index coefficients. The specification of one for the coefficient of a significant covariate, sayX1, is mainly to solve the problem of identifiability. With a prospective data collection, readers can refer the profile likelihood approach of Klein and Spady (1993) and the pseudo least squares one of Ichimura (1993). Based on supplementary M-phase case-control data, we developed a pseudo likelihood estimation criterion for β0 by fully utilizing available data at each phase. Under some suitable conditions, the resulting pseudo maximum likelihood estimator (PMLE) is shown to be √

n-consistent, asymptotic normal, and the most efficient among the proposed estimators. The frequency distribution of a bootstrap analogue is further employed to estimate the sampling distribution of the PMLE and make inferences on β0 or the related parameters of interest. As for the case-control data with a continuous outcome (cf. Jiang, Scott, and Wild 2009) or the supplementary continuous-outcome-dependent data (cf. Zhou el al. 2002, 2007; Song, Zhou, and Kosorok 2009), a more flexible semiparametric model

fYo(yo|x) = fβ0(yo, xβ0) (1.2) can be adopted to characterize the conditional density function ofYo, wherefβ(·, ·)is an unspecified

(5)

bivariate function. Of interest is the fact that the induced model for the dichotomized outcome Y has the same form of that in (1.1). In this study, we introduced a new estimation procedure for this type of data and addressed its theoretical merit and practical limitation. When Gβ(υ) is strictly increasing in υ and the components of X are continuous or mixed discrete-continuous, the receiver operating characteristic (ROC) curve of the single-indexXβ0 has been well known to be the highest among those of scalar-valued functions ofX. An alternative estimation criterion was aimed at maximizing the proposed estimators for the areas under the ROC curves (AUC) with respect to all potential linear predictors. To check the correctness of the SIM in (1.1), our pseudo least squares estimation (PLSE) approach is considerable in the establishment of testing procedures.

The test rules of Xia (2009) were further extended with some modifications to accommodate the considered sampling scheme. There are some interesting features in our proposed approaches:

First, the PMLE is the asymptotically most efficient among the other ones and has remarkably good finite sample performance. Second, the estimation approach of the accuracy measure provides an estimator for the optimal composite markerXβ0. Third, thePLSEcriterion is found to be useful in model checking although the resulting estimator is inefficient relative to the PMLE.

Under the M-phase case-control design, we outline a class of pseudo estimation procedures for β0 and devise testing procedures to check model correctness in Section 2. Section 3 is devoted to the related large sample properties and the bootstrap inference. In Section 4, simulations and applications to two empirical data are used to illustrate the practicality of our methodology. Some concluding remarks and future research are given in Section 5. Finally, the proofs of the technical lemmas and the main results are postponed to the appendices.

2 Estimation and Model Checking

To simplify the presentation, we let X[m] = (X(1)>, . . . , X(m)>)>, X[−m] = (X(m+1)>, . . . , X(M )>)>, p(y, xb [m]) = Qm

k=1q(y, xb [k]) with q(y, xb [m]) = n(y, x[m])/(nqb1(y, x[m])), R(y, x[m]) = Rm+1/q(y, xb [m]), Q(y, x[m]) =Qm

k=1R(y, x[k]), and ψ(υ) represent the joint probability density function of {Y = y}

and X . The M-phase case-control design can be depicted as the following flowchart:

(6)

(Y, X)

yrandom sample or study population

D11= {(Yi, Xi[1])}ni=1 −−−−→ DR2=1 12 R3=1

−−−−→ . . . −−−−−−→ DRM −1=1 1(M −1) RM=1

−−−−→ D1M,

yR2=0

yR3=0

yRM=0

D02 D03 D0M

whereD0(m+1)= {(Yi, Xi[m], Qim, Ri(m+1)) : Qim= 1, Ri(m+1)= 0}andD1(m+1)= {(Yi, Xi[m+1], Qi(m+1)) : Qi(m+1) = 1}. It is noted from Rm = 0 that the subsequent sampling indicators are automatically to be zero. In addition, one suitable condition is assumed in the sequel:

(A1) X[−m] ⊥ Rm+1 | D1m, m = 1, . . . , (M − 1).

2.1 Pseudo Maximum Likelihood Estimation

From the perspective of missing-data, the corresponding likelihood function can be derived to be proportional to

L(β, Gβ, F ) =

n

Y

i=1 M

Y

m=1

[IGβ(Yi, Xi[m])]Qim(1−Ri(m+1)), (2.1) where IGβ(y, x[m]) = R Gyβ(xβ)(1 − Gβ(xβ)1−ydx[−m]Fm(x) and RM +1 ≡ 0 with Fm(x) = P (X[m] = x[m], X[−m] ≤ x[−m]) and IGβ(y, x[M ]) = Gyβ(xβ)(1 − Gβ(xβ))1−y. In our estimation procedure, Fm(x)’s andGβ(υ) are treated as nuisance parameters. By using the inverse probability weighting technique on members with complete covariates, Gβ(υ) = ψ(υ)/P1

y=0ψ(υ), and the equality Fm(x) = E[Q(Y, X[M −1])I(X[m]= x[m], X[−m] ≤ x[−m]]), it enables us to construct the estimators for Gβ(X[m][m]+ x[−m]β[−m]) and Fm(Xi[m], x[−m]) in IGβ(Yi, Xi[m])’s as

Gbβ(X[m][m]+ x[−m]β[−m]) =

ψb(X[m][m]+ x[−m]β[−m]) P1

y=0ψb(X[m][m]+ x[−m]β[−m])

= P

j6=iQ(Yj, Xj[M −1])YjKh(X− (X[m]

[m] + x[−m]β[−m]) P

j6=iQ(Yj, Xj[M −1])Kh(X− (X[m]

[m]+ x[−m]β[−m])) and Fbm(Xi[m], x[−m]) = 1

n − 1 X

j6=i

Q(Yj, Xj[M −1])I(Xj[m]= Xi[m], Xj[−m]≤ x[−m]),

where Kh(u) = K(u/h)/h, K(u) is a bounded symmetric density function with bounded support and R K(2)(u)du = 0, and h is a positive-valued bandwidth. Thus, the PMLE bβh is naturally proposed to be the maximizer of the pseudo log-likelihood function

`1p(β) = 1 n

n

X

i=1 M

X

m=1

Qim(1 − Ri(m+1)) ln cIGβ(Yi, Xi[m]). (2.2)

(7)

The reason for the specification of a particular second-order kernel function is mainly to achieve the

√n-consistency of bβh. In application, the quartic kernel function K(ν) = (15/16)(1 − ν2)2I(|ν| ≤ 1) is frequently adopted. As for the bandwidth selection, an optimal smoother hopt is sought by maximizing

`1p( bβh) with respect to h, i.e. hopt = arg maxh`1p( bβh).

In this subsection, we further consider another design in which the binary status Y = I(Yo≤ yc) of the continuous response Yo is known for each individual at the first phase and the values of Yo are measured at the second phase. The proxy of the log-likelihood function can be derived to be

`2p(β) = 1 n

n

X

i=1

{QiMln bfβ(Yio, X) + (1 − Ri2)[Yilnpbc+ (1 − Yi) ln(1 −pbc))]

+

M −1

X

m=2

Qim(1 − Ri(m+1)) ln Z

fbβ(Yio, X[m][m]+ x[−m]β[−m])dx[−m]Fbm(Xi[m], x[−m])}, (2.3) where pbc=R [R{y

0>yc}fbβ(yo, xβ)dyo]d bFM(x) is an estimator of pc= P (Yo > yc) and

fbβ(Yio, X[m][m]+ x[−m]β[−m]) = P

j6=iQ(Yj, Xj[M −1])Kh1(Yjo− Yio)Kh2(X− (X[m]

[m]+ x[−m]β[−m])) P

j6=iQ(Yj, Xj[M −1])Kh2(X− (X[m]

[m]+ x[−m]

β[−m])) , i = 1, . . . , n, with Khk(u) = K4(u/hk)/hk, k = 1, 2, and K4(u) being the fourth-order kernel function such as K4(u) = (105/64)(1 − 5u2 + 7u4 − 3u6)I(|u| ≤ 1). The corresponding maximizer, say ˘βh, of the pseudo log- likelihood function `2p(β) is naturally proposed as an estimator of β0. It is noted that the validity of (2.3) is ascertained by modifying assumption (A1) as (A10) (Yo, X[−m]) ⊥ Rm+1 | D1m, m = 1, . . . , (M − 1).

Paralleling with the proof steps of bβh, the fourth-order kernel function is required to ensure the n−1/2 convergence rate of ˘βh. A more attractive feature of this estimator is that ˘βh should be asymptotically more efficient than bβh, which can be treated as the maximizer of the pseudo likelihood function of the induced response Y . However, the merit might disappear in small-sample case due to the numerical instability caused by negative weights for some regions near the ends of the interval [−1, 1]. Furthermore, the estimation procedure is likely to be computationally inefficient because the random quantitypbcstrongly relies on some complicated integration methods in the calculation of R

{y0>yc}fbβ(yo, xβ)dyo. As for the outcome-dependent sampling with a continuous response, our estimation criterion can also be adapted via the induced counting process I(Yo ≤ yo) and an appropriate weight function over possible yo.

(8)

2.2 Estimation Through an Accuracy Measure

When Gβ(υ) is strictly increasing in υ with continuous or mixed discrete-continuous X, an alternative estimation approach is suggested for β0. It is easy to derive that the AUC of Xβ has the following probability expression:

A(β) = P (X > X|Yi = 1, Yj = 0), i 6= j. (2.4) By applying the Neyman-Pearson lemma, the linear predictor Xβ0 can be shown to have the highest ROC curve and, hence, the largest AUC among all scalar transformations of X.

Under the supplementary M -phase case-control sampling and assumption (A1), we have

E[Q(Yi, Xi[M −1])Q(Yj, Xj[M −1])Aij(β)] = p(1 − p)A(β), (2.5) where Aij(β) = I(X > X)Yi(1−Yj). The above equality indicates that only the final-phase subsample, i.e. D1M, is utilized and all the sampled members are inversely weighted by their selection probabilities.

Thus, for each fixed β, the accuracy measure A(β) in (2.4) can be estimated by its sample analogue and the resulting estimator is given by

Ab1(β) = Pn

i=1

P

j6=iQ(Yi, Xi[M −1])Q(Yj, Xj[M −1])Aij(β) Pn

i=1

P

j6=iYi(1 − Yj) . (2.6)

Since β0 is the unique maximizer of A(β), the estimator eβ is naturally defined to be a maximizer of bA1(β) in a quite straightforward manner. When the monotonicity of Gβ(υ) or the SIM in (1.1) is violated, the estimated linear predictor X

βe is still optimal in the class of all potential linear predictors. In contrast with the pseudo-likelihood estimation approach, the estimated selection probabilities play central roles in the asymptotic variances of bA1(β) and eβ. It is evidenced through our numerical experiments that βbh outperforms eβ in terms of the mean squared error although bA1( eβ) ≥ bA1( bβh). The AUC estimator Ab1( eβ) is further explored to have a better performance than bA1( bβh), whereas the difference between both estimators is not apparent. Due to the non-differentiability of bA1(β), the PMLE can directly provide an essential ingredient in the estimation of A(β0). In addition, the numerical results show that bA1( bβh) possesses an important advantage in computational speed and cost, especially in accommodating high- dimensional covariate spaces.

Let Ψ(xβ) = yR

{υ<xβ}ψ(υ)dυ + (1 − y)R

{υ>xβ}ψ(υ)dυ, y=0,1. Another perspective for the

(9)

estimation of A(β) is to include individuals with fully observable covariates either from cases or controls.

One further derives that E[Q(Yj, Xj[M −1])Q(Yj, Xj[M −1])Aij(β)+Q(Yi, Xi[M −1])(ΨYi(X)+Ψ1−Y i(X))] = 3p(1 − p)A(β). Substituting bΨ(X) = Pn

j6=iQ(Yj, Xj[M −1])(yI(X < X)(1 − Yj) + (1 − y)I(X >

X)Yj)/(n − 1) for Ψ(X) and bp(1 −p) =b Pn i=1

P

j6=iYi(1 − Yj)/n(n − 1) for p(1 − p), A(β) can also be estimated by

Ab2(β) = Pn

i=1

P

j6=iQ(Yi, Xi[M −1])(Q(Yj, Xj[M −1])Aij(β) + bΨYi(X) + bΨ1−Y i(X)) 3Pn

i=1

P

j6=iYi(1 − Yj) ∀β. (2.7)

After some tedious algebra, bA2(β) can be derived to be the same as that in (2.6).

2.3 Pseudo Least Squares Estimation and Model Checking

One of the major aims in this study was to check the correctness of the SIM in (1.1). This testing problem is of particular importance in practice and can be formulated as the following hypotheses:





H0: P (Y = 1|X = x) = Gβ0(xβ0) for all x ∈ X ; HA: P (Y = 1|X = x) 6= Gβ0(xβ0) for some x ∈ X .

(2.8)

Our testing procedures are based on the error εR(β) = QM(Y − GR(Xβ)) of the final-phase data, where GR(xβ) = Gβ(xβ)E[p(1, xb [M −1])]

P1

y=0(1 − Gβ(xβ))1−yGyβ(xβ)E[p(y, xb [M −1])] (2.9) and E[Y |X = x, QM = 1] = GR(xβ0) when H0is true. The reason for doing so is that the error ε(β) = Y − Gβ(Xβ) is not available for subjects with incomplete covariates even Gβ(υ) is known. Under the validity of the SIM, it is easy to derive from the equality E[ε2R(β)|X = x, QM = 1] = E[ε2R0)|X = x, QM = 1] + (GR(xβ) − GR(xβ0))2 that the minimizer βR0= arg minβE[ε2R(β)] is equal to the true parameter β0. On the other hand, if the considered model is incorrect, εRR0) can be further projected into another linear combination of covariates, i.e. there exists a non-zero minimizer γ0 of E[(εRR0) − QMν(Xγ))2] for some function ν(·).

By Substituting

R(xβ) = Gbβ(xβ)p(1, xb [M −1]) P1

y=0(1 − bGβ(xβ))1−yGbyβ(xβ)p(y, xb [M −1]) (2.10) for GR(xβ), the estimator ¯βh of βR0 is defined to be the minimizer of the following pseudo sum of squares:

(10)

SSp(β) =

n

X

i=1

QiM(Yi− ¯GR(X))2. (2.11) Since εRR0) is an unknown random quantity, the corresponding residual eR( ¯βh) = QM(Yi− ¯GR(Xβ¯h)) is naturally used to build testing standards. When the SIM is correct, the PLSE ¯βh can be shown to be a consistent estimator of β0, while it is asymptotically inefficient relative to the PMLE bβh. Similarly, an estimator for γ0 can be obtained as

¯

γ = arg min

γ RSSn(γ), (2.12)

where RSSn(γ) = Pn

i=1(eiR( ¯βh) − QiMν(X¯ ))2/Pn

i=1QiM with ¯ν(X) =Pn

j6=iejR( ¯βh)Kh(X − X) /Pn

j6=iQjMKh(X− X) being a smoothing estimator of ν(X), i = 1, . . . , n.

To set up a test rule, we consider the test statistic Fn= RSSn(¯γ)/T SSn, where T SSn=Pn

i=1(eiR( ¯βh)

−¯eR( ¯βh))2/Pn

i=1QiM and ¯eR( ¯β) =Pn

i=1eiR( ¯βh)/Pn

i=1QiM. The null hypothesis H0 in (2.8) is rejected if Fn≤ cα with a significance level α. The critical value cα can be determined by employing the frequency distribution of a wild-bootstrap estimator. The bootstrap analogue ebiR( ¯βh) of the ith residual eiR( ¯βh) is drawn from a two-point distribution (5 +√

5)/10δ(1−5)e

iR( ¯βh)/2+ (5 −√

5)/10δ(1+5)e

iR( ¯βh)/2 and the bootstrap estimator Fnb is computed by using a bootstrap sample {ebiR( ¯βh)}ni=1 in both of RSSn(¯γ) and T SSn. Thus, the αth, 0 < α < 1, quantile of B independent bootstrap replicates {Fnb}Bb=1, denoted by qα(Fnb), can be treated as the Monte-Carlo critical value in the test criterion. An alternative approach is to extend the single-indexing cross-validation (SIC) measure of Xia (2009) in the M -phase sampling scheme. The modified SIC measure is proposed as

SICn = Pn

i=1(eiR( ¯βh) − QiMν¯−i(xγi))2 Pn

i=1QiM , (2.13)

where ¯γi = arg minγPn

j6=i(ejR( ¯βh) − QjMν¯−i(X))2and ¯ν−i(xγ) is the same as ¯ν(xγ) with the ith subject being deleted, i = 1, . . . , n. The test rule is then established with being rejected if SICn< T SSn. Remark 1. Under the validity of the SIM in (1.1), the efficiency of the PLSE can be further improved through a pseudo weighted sum of squares with the weights 1/[ ¯GR(X)(1 − ¯GR(X))], i = 1, . . . , n. The pseudo weighted least squares estimator (PWLSE), say bβh, can also be obtained by maximizing

`1p(β) = 1 n

n

X

i=1

QiM[Yiln ¯GR(X) + (1 − Yi) ln(1 − ¯GR(X))]. (2.14)

(11)

3 Asymptotic Properties

In this section, the theoretical development is mainly based on the independent Bernoulli sampling scheme in the M -phase case-control design. To simplify the complicated mathematical expressions, some concise notations are introduced below. Let q(y, x[m]) = q0(y, x[m])/q1(y, x[m]) with q`(y, x[m]) = P (Y = y, X[m] = x[m], Q(m−`) = 1) and q(Y, X[M ]) = 0, p(y, x[m]) = Qm

k=1q(y, x[k]), and Mβk(x, υ) = E[( eX −ex)⊗k|Xβ = υ]ψβ(υ) with eX = (X2, . . . , Xd)>, ( eX −x)e ⊗0 = 1, ( eX −x)e ⊗1= ( eX −ex), and ( eX −x)e ⊗2 = ( eX −x)( ee X −ex)>. In addition, the target functions of the estimators ∂βkψb(xβ), k = 0, 1, 2, and ∂βGbβ(xβ) are denoted by ψ(k)(x, υ) = ∂υk(Gyβ(υ)(1 − Gβ(υ))1−yMβk(x, υ)) and Hβ(x, υ) = ψ(1)(x, υ)

P1

y=0ψ(υ)− Gβ(υ) P1

y=0ψ(1)(x, υ) (P1

y=0ψβ(υ))2.

3.1 Consistency and Limiting Distribution

For the convenience of derivation of the main results, the following conditions are assumed throughout this article:

(A2) |n(y, x[m])/n − q0(y, x[m])| = Op(n−1/2) and q1(y, x[m]) > 0 ∀(y, x[m])), m = 1, . . . , (M − 1). (A3) X and B are compact and β0 is an interior of B.

(A4) ∂υ3ψ(υ), and ∂υ3Mβ`(x, υ)are uniformly Lipschitz continuous in (υ, x, β). (A5) P1y=0ψ0(υ) > 0 and 0 < Gβ0(υ) < 1 for all υ.

(A6) h = h0n−ζ for some positive constantsh0 and ζ ∈ (1/8, 1/5). (A7) E[Hβ⊗2

0 (X, Xβ0)] is non-singular.

The consistency of bβh and the limiting distribution of √

n( bβh− β0) are established below.

Theorem 1. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, bβh

−→ βa.s. 0 and

√n( bβh− β0)−→ Nd d−1(0, Σ10)), where Σ10) = V2−10)V10)V2−10) with

V1(β) = V2(β) +

M −1

X

m=1

E[1 − q(Y, X[M −m+1]) p(Y, X[M −m+1])

m

X

k=1

p2(Y, X[M −k+1])(Jβ⊗2(Y, X[M −k+1]) − Jβ⊗2(Y, X[M −k]))],

V2(β) =

M

X

m=1

E[(1 − q(Y, X[m+1]))p(Y, X[m])Jβ⊗2(Y, X[m])], Jβ(y, x[m]) = (−1)1−yR Hβ(x, xβ)dx[−m]Fm(x) IGβ(y, x[m]) .

(12)

When Fm(x) or any√

n-consistent estimator is in place of bFm(x), the resulting estimator can also achieve the efficiency bound of bβh. More precisely speaking, the variation of bFm(x) does not influence the asymp- totic variance of bβh. For the asymptotic behaviors of bβh and ¯βh, their derivations are very similar to the proof of Theorem 1 and are omitted here.

Theorem 2. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, (i) ¯βh

−→ βa.s. 0 and√

n( ¯βh− β0)−→ Nd d−1(0, Σ200)), and (ii) bβh−→ βa.s. 0 and √

n( bβh− β0)−→ Nd d−1(0, Σ210)), where Σ2`0) = V22`−10)V21`0)V22`−10), ` = 0, 1, with

V21`(β) = V22`(β) + E[

M −1

X

m=1

1 − q(Y, X[m])

p(Y, X[m]) p2(Y, X[M −1])(Y − GR(Xβ))2HR⊗2(X, Xβ) (GR(Xβ)(1 − GR(Xβ))2` ], V22`(β) = E[p(Y, X[M −1])(Y − GR(Xβ))2HR⊗2(X, Xβ)

(GR(Xβ)(1 − GR(Xβ)))2` ], HR(x, xβ) = GR(xβ)(1 − GR(xβ))

Gβ(xβ)(1 − Gβ(xβ))Hβ(x, xβ).

We write A  B if the matrix A − B is positive semidefinite. By using V220−10) − V221−10)  0 and E[

1

X

`=0

(−1)`(

M −1

X

m=1

1 − q(Y, X[m])

p(Y, X[m]) )p2(Y, X[M −1])(Y − GR(Xβ0))2HR⊗2(X, Xβ0)

(GR(Xβ0)(1 − GR(Xβ0))2−` ]  0, (3.1) it follows that

0  E[(

1

X

`=0

(−1)`V22`−10)(

M −1

X

m=1

1 − q(Y, X[m])

p(Y, X[m]) )1/2p(Y, X[M −1])(Y − GR(Xβ0))HR(X, Xβ0) (GR(Xβ0)(1 − GR(Xβ0))` )⊗2]

 (Σ200) − Σ210)) − (V210−10) − V211−10))  (Σ200) − Σ210)), (3.2)

which implies that bβhis asymptotically more efficient than ¯βh. Along the same line as those in (3.1)-(3.2), βbh can also be shown to be asymptotically more efficient than bβh. With a full random sample, i.e. the selection probabilities are all equal to one, we further conclude from Theorems 1-2 that√

n( bβh− β0) and

√n( bβh− β0) have the same limiting distribution Nd−1(0, (E[H⊗2(X, Xβ0)/Gβ0(Xβ0)(1 − Gβ0(Xβ0))])−1) and √

n( ¯βh− β0)→ Nd d−1(0, (E[Gβ0(Xβ0)(1 − Gβ0(Xβ0))H⊗2(X, Xβ0)])−1).

Obviously, one can see from (2.4) and (2.6) that β0and eβ are also the maximizers of A0(β) = P (X >

X, Yi = 1, Yj = 0) and bA0(β) = Pn i=1

P

j6=iQ(Yi, Xi[M −1])Q(Yj, Xj[M −1])Aij(β)/n(n − 1), respectively.

In the theoretical development, the uniform convergence and the asymptotic approximation of the U- statistic eA0(β) =Pn

i=1

P

j6=iAij(β)/n(n − 1) are repeatedly used. As for the difference in the asymptotic variances of bA0(β) and eA0(β), it is mainly caused by the variation from selection probabilities. Two

(13)

assumptions are further imposed for the derivation of the large sample properties of eβ and bA1( eβ):

(A8) ∂β2Ψ(υ), y = 0, 1, are uniformly Lipschitz continuous in(υ, β). (A9) V2A0) is non-singular.

Theorem 3. Suppose that assumptions (A1)-(A3) and (A8)-(A9) are satisfied. When n −→ ∞, (i) eβ −→ βp 0 and √

n( eβ − β0)−→ Nd d−1(0, Σ30)), and (ii) bA1( eβ)−→ A(βp 0) and√

n( bA1( eβ) − A(β0))−→ N (0, σd 320)),

where Σ30) = V2A−10)V1A0)V2A−10) and σ320) = [σ020)+V ar(P1

y=0Ψ0(Xβ0)−A(β0)(1−2p)(Y − p))]/[p(1 − p)]2 with

σ020) =

M −1

X

m=1

E[1 − q(Y, X[m]) p(Y, X[m]) (

1

X

y=0

Ψ0(Xβ0))2], V0A0) = E[(

1

X

y=0

βΨ0(Xβ0))⊗2],

V1A0) = V0A0) +

M −1

X

m=1

E[1 − q(Y, X[m]) p(Y, X[m]) (

1

X

y=0

βΨ0(Xβ0))⊗2], V2A0) = 0.5E[

1

X

y=0

β2Ψ0(Xβ0)].

We note that the asymptotic properties of eβ still hold even the monotonic assumption on G(υ) is violated. Following along the same lines as the proof of Theorem 3(ii), both bA1( bβh) and bA1( ¯βh) can be shown to have the same asymptotic distribution as that of bA1( eβ). As stressed in the foregoing, βbh avoids the computational cost and the complexity in the maximization of bA1(β) with respect to β.

With a full random sample, the asymptotic variances Σ30) and σ230) in Theorem 3 will reduce to V2A−10)V0A0)V2A−10) and [V ar(A(β0)(1 − 2p)(Y − p) +P1

y=0Ψ0(Xβ0))]/[p(1 − p)]2. Using the final- phase data in estimation, the asymptotic semiparametric efficiency of bβh is evidenced by the argument of Ichimuri (1993). As claimed by Sherman (1993), one can also conclude that eβ does not achieve the semiparametric efficiency bound. Conclusively, bβhis the asymptotically most efficient among the proposed estimators.

3.2 Bootstrap Inferences

Because a direct estimation of the asymptotic variance Σ10) is complicated and impractical, we employ a bootstrap technique to construct confidence intervals for β0 or the related parameters of interest. Suppose that ξ , . . . , ξ are randomly drawn from a multinomial distribution M ulti(1, n−1, . . . , n−1) with ξ =

(14)

j1, . . . , ξjn)> and wk =Pn

j=1ξjk/n, j, k = 1, . . . , n. The bootstrap analogue bβwh of bβh is defined to be the maximizer of the following bootstrap pseudo log-likelihood:

`wp(β) =

n

X

i=1 M

X

m=1

wiQim(1 − Ri(m+1)) ln cIG(Yi, Xi[m]), (3.3)

where bG(Xβ[m][m]+ x[−m]β[−m]) and bFwm(X[m], x[−m]) are the bootstrap analogues of bGβ(Xβ[m][m]+ x[−m]β[−m]) and Fbm(X[m], x[−m]) with Qw(y, x[M −1]) = wQM/bpw(y, x[M −1]) andpbw(y, x[M −1]) =QM −1

k=1 qbw(y, x) substitut- ing for Q(y, x[M −1])/n and p(y, xb [M −1]), bqw(y, x[m−1]) = n(y, x[m−1])/qbw1(y, x[m−1]), andqbw1(y, x[m−1]) = Pn

j=1wjQj(m−1)I(Yj = y, Xj[m−1] = x[m−1]). It is derived in the next theorem that the sampling distri- bution of bβh can be approximated by the frequency distribution of bβwh.

Theorem 4. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, Pw(√

n( bβwh− bβh) ≤ u) − P (√

n( bβh− β0) ≤ u)−→ 0p ∀u = (u2, . . . , ud)T, (3.4) where Pw(·) = P (·|(SM

m=2D0m)S D1M).

The validity of Theorems 1 and 4 enables us to give the normal-type and quantile-type bootstrap confidence intervals for β0k by

βbkh± z1−α/2sew( bβwkh− bβkh) and ( bβkh− qw(1−α/2)( bβwkh− bβkh), bβkh− qwα/2( bβwkh− bβkh)), (3.5) k = 2, . . . , (d − 1), where z1−α/2 is the (1 − α/2)th quantile value of the standard normal distribution, and sew(·) and q(·) denote the standard error and the 100ςth percentile of B bootstrap estimates. Generally, appropriate bootstrap replications are required to ensure the numerical stability.

4 Numerical Experiments and Applications

Under the setting of two-phase sampling, we conducted a class of simulation experiments to assess the performances of the proposed estimators and the related inference procedures. In the simulation process, data were generated from 1000 replications and the bootstrap inferences were based on 500 bootstrap samples, which enable us to obtain stable numerical results. In the last scenario of the simulation, 1000 random samples of size 1500 were repeatedly taken from the full data of Wilms’ tumor studies and three-phase subsampling of a smaller size was further undertaken from each sample. We quantify and investigate the efficiency loss incurred by the PMLE for a three-phase random sample relative to that for

(15)

a full random sample. Finally, a two-phase subsample drawn from a low birthweight study was used to illustrate the applicability of our developed approaches.

4.1 Simulation I - Estimation and Bootstrap Confidence Intervals

The covariates X = X(2) = (X1, X2, X3)> were designed to follow a trivariate normal distribution with mean (0, 0, 0)>, standard deviation (1, 1, 1)>, and correlation coefficient ρ of 0.5. The conditional distri- bution of the binary response Y was set to be the logistic regression model

(Model 1) Gβ0(xβ0) = exp(−5.8 + 6xβ0) 1 + exp(−5.8 + 6xβ0)

with the index coefficients β0 = (1, −0.6, −0.4)>. Data with sample sizes of 250 and 500 were simulated according to the above design. With the correlation of 0.5, the marginal probability of Y is computed to be about 0.16. Under the two-phase case-control sampling with n(y), y = 0, 1, being specified at the second phase, the combinations of subsamples (n(1), n(0)) were set to be (25k, 50k), (25k, 100k), (50k, 50k), or (50k, 100k) for n = 250k, k = 1, 2. The corresponding response selective probabilities (q(1),b q(0)) willb approach (0.5, 0.25), (0.5, 0.5), (1, 0.25), or (1, 0.5) as the sample size increases and the specified conditions are similar in nature to those found in application.

Table 4.1 displays the means and the standard deviations of 1000 estimates of bβh, ¯βh, and eβ. One can expect that the standard deviations of these estimators tend to be smaller for a larger sample size.

Generally speaking, the considered sampling scheme will cause relatively small subsample sizes of the second-phase covariate data and greater variations in estimators. Compared with the pseudo likelihood and the pseudo least squares estimation approaches, the estimator obtained from the optimal AUC crite- rion tends to have a relatively larger bias, particularly in small subsamples. It is indicated from this table that bβhis generally the most efficient relative to ¯βh and eβ. As for the accuracy measure A(β0), both of the estimators bA1( bβh) and bA1( eβ) have fairly similar performance in terms of mean squared errors although the variation of bA1( eβ) is slightly smaller (Table 4.2). In addition, the CPU time for the computation of βbh is much shorter than those for eβ in the numerical experiments.

Table 4.3 further provides the bootstrap estimates for the standard deviations of the PMLEs, the empirical coverage probabilities, the 0.95 quantile intervals of 1000 estimates, and the averages of 1000 normal-type and quantile-type bootstrap confidence intervals. It seems that the asymptotic variance

(16)

is slightly overestimated by the bootstrap estimator while its accuracy is improved as the sample size increases. Although the constructed confidence intervals are typically wider than the corresponding true quantile intervals, their differences are not strongly associated with the sample size. The empirical coverage probabilities of normal-type bootstrap intervals are also detected to be higher than those of the quantile-type ones. Overall, the constructed 95% confidence intervals possess coverage probabilities which are quite close to the nominal value for appropriate sample sizes.

4.2 Simulation II - Model Checking

In this simulation scenario, the covariates were specified to be the same as those in Section 4.1. We assess the performance of our testing procedures for model correctness through Model 1 and the following model:

(Model 2) P (Y = 1|X = x) = exp(−5 + 6(x1x2− 0.6x33)) 1 + exp(−5 + 6(x1x2− 0.6x33)).

It is noted that Model 2 deviates from the structure of the SIM in (1.1). Although ¯βhis not asymptotically efficient relative to bβh and eβ when Gβ0(υ) is a strictly increasing function, the pseudo least squares estimation procedure is useful in the development of test rules.

With different sample sizes and combinations of subsamples at the second phase, the estimated sizes and powers, and the rejection proportions based on the SIC measure are provided in Table 4.4. When the SIM is valid, the simulation results indicate that the estimated sizes of the wild bootstrap test are all less than the nominal size at 0.05. In addition, the rejection probabilities of the SIC criterion are higher than the estimated sizes and powers of the wild bootstrap test, especially in a small sample. It is further found that the rejection probabilities may not be readily decreased as the sample size increases. Conclusively, a larger size of the second-phase covariate sample is typically associated with an increase in power.

4.3 Investigation of Efficiency Loss in the PMLE

The analyzed data were collected from the National Wilms’ Tumor Study Group and were further provided by Kulich and Lin (2004). The Wilms’ tumor, also called nephroblastoma, is a medical diagnosis and is the most common malignant (cancerous) tumor of kidney in childhood. A total of 3915 children, who had been treated for Wilms’ tumor, participated in the third and fourth Wilms’ tumor studies (NWTS-3 and NWTS-4). Measurements taken in our data analysis included: relapse within three years (relaps),

(17)

local histological type of the tumor (lhist) with unfavorable histologies being coding as 1 and favorable histologies being 0, stage of disease at diagnosis (stage) with stages 1 to 4, levels of age at diagnosis with age1 and age2 indicating the statuses of age≤ 1 year and age in (1,4) years, central histology (thist), which is the true histological type of the tumor and is coded as lhist, and tumor diameter (diam) in centimeters. In addition, the numbers of cases and controls in the cohort were 603 and 3312, respectively.

The primary research interest is to investigate the influences of the factors stage, (age1, age2), thist, and diam on the binary response variable relapse. Since the paired covariates (thist, this ∗ age2), (stage, stage ∗ diam), and (diam, stage ∗ diam) are highly correlated with correlation coefficients of 0.641, 0.671, and 0.848, respectively, the SIM with the linear predictor thist + β02stage + β03age1 + β04age2+ β05(thist ∗ age1) + β06diam is used to characterize the conditional probability of relapse. In this numerical experiment, we drew a random sample of size 1500 from the full data and repeated this process 1000 times. A subsample from each simulated data was obtained by further undertaking the three-phase case-control sampling. In the initial phase, there were 24 strata defined by the values of X(1) = (relapse, lhist, stage, age1, age2). The selection probabilities in the strata with {relapse = 1}

were specified to be 1 in the subsequent phases. For those with {relapse = 0}, 100 units were sampled from each stratum with size ≥ 100 and all units were taken with size < 100 at the second phase. In this phase, the values of the expensive covariate X(2) = thist were obtained for sampled individuals. At the third phase, 25 units and all units were sampled with sizes of strata ≥ 25 and < 25, respectively. There were, on average, about 1170 children in the second-phase covariate data and 600 ones in the complete covariate data. Our another aim is to assess the loss of estimation efficiency caused by such a sampling design. In Table 4.5, we provide the averages of 1000 parameter estimates and the corresponding standard errors in the sample data analysis as well as in three-phase data analyses. The averages of 1000 PMLEs for a full random sample and three-phase data are quite close to each other. As one can expect, the efficiency loss of the PMLE is related with the size of subsample taken for each ascertained covariate.

4.4 Application to Auckland Birthweight Collaborative Study

The Auckland Birthweight Collaborative (ABC) study was conducted from 1995 to 1997. The primary research interest focuses on exploring potential demographic characteristics and risk factors of pregnant

Figure

Updating...

References

Related subjects :