Binary Response Models with M-phase Case-Control Data

(1)

PREPRINT

國立臺灣大學數學系預印本 Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/ ~ mathlib/preprint/2011- 17.pdf

Binary Response Models with M-phase Case-Control Data

Chin-Tsang Chiang, Ren-Hong Bai, and Ming-Yueh Huang

December 1, 2011

(2)

Binary Response Models with M -phase Case-Control Data

Chin-Tsang Chiang, Ren-Hong Bai, and Ming-Yueh Huang Department of Mathematics, National Taiwan University

December 1, 2011

Abstract

The present study aimed to characterize the relationship between a binary response and covariates of interest through a more general single-index regression model. With M -phase (M ≥ 2) case-control data supplemented by information on a response and certain covariates, we primarily propose a pseudo likelihood estimation for the index coefficients of this type of semiparametric model. Additionally, our approach can be readily adopted to accommodate case-control sampling with a continuous response and outcome-dependent sampling. With the considered data setting in the receiver operating characteristic curve analysis, an estimation for the accuracy measure is provided and is borrowed to seek an optimal linear predictor in the class of potential linear predictors. To check model correctness, a pseudo least squares approach is further employed as an aid to devising suitable testing procedures. The general theoretical frameworks for the proposed estimators and the bootstrap inference are also developed in this article. Finally, extensive simulations and two empirical applications are used to illustrate the applicability of our methodology.

1 Introduction

To account for the relationship between potential covariates X = (X₁, . . . , X_d)^> and a binary response Y (1: case; 0: control), researchers and practitioners try to better understand the conditional probability of Y. We can reasonably argue that fully parametric models might be mis- specified and fully nonparametric ones usually suffer from the curse of dimensionality (Bellman 1961). Thus, a more flexible semiparametric model becomes a great interest and compromise.

With prospective studies, recent research has developed a variety of estimation approaches and

Keywords: accuracy measure; bootstrap; case-control; cross-validation; missing data; M-phase; pseudo least squares; pseudo maximum likelihood estimator; receiver operating characteristic curve; single-index model.

(3)

the related inferences for certain semiparametric models. In application, it is often impractical to obtain this type of data due to the high cost or tremendous amount of time in collecting expensive variables. An alternative cost-effective strategy to solve this difficulty is to adopt supplementary M-phase sampling plans.

In the simple case-control design, Anderson (1972) and Pyke (1979) showed that the covariate effects in logistic regression models can be estimated in a normal manner as if data were coming from a prospective study. However, this attractive feature cannot carry over to arbitrary regression models without some degree of information on the marginal probability of Y . To overcome the confronting problem and further enhance study efficiency, the supplementary M-phase sampling mechanism has been adopted in several clinical and epidemiological studies.

Let (X^(1)>, . . . , X^{(M )>})^> be an alternative expression of X with categorical or discrete covariates X^(m) having finite support {x^(m)₁ , . . . , x^(m)_J_m} and being designed to be observed at the mth phase, and Rm+1 indicate the missing status of X^(m+1), which is defined to be 1 if X^(m+1) is observed and 0 otherwise, at the (m + 1)th phase, m = 1, . . . , M − 1. Using these missing data notations, the so-called M-phase case-control sampling design can be concisely described as below:

First of all, a prospective cohort of size n is stratified by the values of (Y, X⁽¹⁾) at the first phase.

For each individual with the combination value {Y = y, X⁽¹⁾ = x⁽¹⁾_j₁ , . . . , X^(m) = x^(m)_j_m } at the mth phase, the sampling indicator Rm+1 is randomly generated from the independent Bernoulli or finite population sampling scheme with the corresponding selection probability q(y, x_b ⁽¹⁾_j₁ , . . . , x^(m)_j_m ) = n(y, x⁽¹⁾_j₁ , . . . , x^(m)_j_m )/Pn

i=1QimI(Yi = y, X_i⁽¹⁾ = x⁽¹⁾_j₁ , . . . , X_i^(m) = x^(m)_j_m ), where n(y, x⁽¹⁾_j₁ , . . . , x^(m)_j_m ) is the expected or specified subsample size at the (m + 1)th phase and Qm+1 =Qm+1

k=2 Rk. For the sake of simplification, both of case-control sample and its population are treated as the first phase without distinction (cf. Breslow and Holubkov 1997). Under stratified two-phase case-control studies, Scott and Wild (1991) proposed a maximum likelihood estimation for logistic regression.

Meanwhile, Wild (1991) provided a maximum likelihood approach as well as two pseudo maximum likelihood ones for the regression parameters of an arbitrary model with the link function being

(4)

known up to the parameters and general covariates. It was further shown by Scott and Wild (1997) that the maximum likelihood estimator can be obtained by iteratively updating the derived pseudo-likelihood equation. As one can see, the feasibility of their approaches mainly relies on the monotonic assumption of a response function or another suitable condition. Based on the continuous response Yo with supplementary information on its dichotomized response Y = I(Yo ≤ y_c) for some threshold value yc, a more efficient estimation was developed by Jiang, Scott, and Wild (2009) under the validity of a parametric model for the conditional distribution of Yo. Recently, Lee, Scott, and Wild (2010) also presented an efficient profile likelihood approach for multi-phase case-control data and unified earlier developments.

In this article, we focus on a fairly useful single-index model (SIM):

P (Y = 1|X = x) = Gβ0(xβ0), (1.1)

where G_β(·) is an unknown function and x_β = x1+ (x2, . . . , x_d)^>β with β = (β2, . . . , β_d)^> and β0 = (β02, . . . , β_0d)^>being a vector of true index coefficients. The specification of one for the coefficient of a significant covariate, sayX1, is mainly to solve the problem of identifiability. With a prospective data collection, readers can refer the profile likelihood approach of Klein and Spady (1993) and the pseudo least squares one of Ichimura (1993). Based on supplementary M-phase case-control data, we developed a pseudo likelihood estimation criterion for β0 by fully utilizing available data at each phase. Under some suitable conditions, the resulting pseudo maximum likelihood estimator (PMLE) is shown to be √

n-consistent, asymptotic normal, and the most efficient among the proposed estimators. The frequency distribution of a bootstrap analogue is further employed to estimate the sampling distribution of the PMLE and make inferences on β₀ or the related parameters of interest. As for the case-control data with a continuous outcome (cf. Jiang, Scott, and Wild 2009) or the supplementary continuous-outcome-dependent data (cf. Zhou el al. 2002, 2007; Song, Zhou, and Kosorok 2009), a more flexible semiparametric model

f_Y_o(y_o|x) = f_β₀(y_o, x_β₀) (1.2) can be adopted to characterize the conditional density function ofYo, wherefβ(·, ·)is an unspecified

(5)

bivariate function. Of interest is the fact that the induced model for the dichotomized outcome Y has the same form of that in (1.1). In this study, we introduced a new estimation procedure for this type of data and addressed its theoretical merit and practical limitation. When Gβ(υ) is strictly increasing in υ and the components of X are continuous or mixed discrete-continuous, the receiver operating characteristic (ROC) curve of the single-indexXβ0 has been well known to be the highest among those of scalar-valued functions ofX. An alternative estimation criterion was aimed at maximizing the proposed estimators for the areas under the ROC curves (AUC) with respect to all potential linear predictors. To check the correctness of the SIM in (1.1), our pseudo least squares estimation (PLSE) approach is considerable in the establishment of testing procedures.

The test rules of Xia (2009) were further extended with some modifications to accommodate the considered sampling scheme. There are some interesting features in our proposed approaches:

First, the PMLE is the asymptotically most efficient among the other ones and has remarkably good finite sample performance. Second, the estimation approach of the accuracy measure provides an estimator for the optimal composite markerXβ0. Third, thePLSEcriterion is found to be useful in model checking although the resulting estimator is inefficient relative to the PMLE.

Under the M-phase case-control design, we outline a class of pseudo estimation procedures for β0 and devise testing procedures to check model correctness in Section 2. Section 3 is devoted to the related large sample properties and the bootstrap inference. In Section 4, simulations and applications to two empirical data are used to illustrate the practicality of our methodology. Some concluding remarks and future research are given in Section 5. Finally, the proofs of the technical lemmas and the main results are postponed to the appendices.

2 Estimation and Model Checking

To simplify the presentation, we let X^[m] = (X^(1)>, . . . , X^(m)>)^>, X^[−m] = (X^(m+1)>, . . . , X^{(M )>})^>, p(y, xb ^[m]) = Qm

k=1q(y, xb ^[k]) with q(y, x_b ^[m]) = n(y, x^[m])/(nqb₁(y, x^[m])), R(y, x^[m]) = R_m+1/q(y, xb ^[m]), Q(y, x^[m]) =Qm

k=1R(y, x^[k]), and ψyβ(υ) represent the joint probability density function of {Y = y}

and X . The M-phase case-control design can be depicted as the following flowchart:

(6)

(Y, X)



yrandom sample or study population

D11= {(Yi, X_i^[1])}ⁿ_i=1 −−−−→ D^R²⁼¹ 12 R₃=1

−−−−→ . . . −−−−−−→ D^R^{M −1}⁼¹ 1(M −1) R_M=1

−−−−→ D1M,



 y^R²⁼⁰



 y^R³⁼⁰



 y^R^M⁼⁰

D₀₂ D₀₃ D_0M

whereD_0(m+1)= {(Yi, X_i^[m], Qim, R_i(m+1)) : Qim= 1, R_i(m+1)= 0}andD_1(m+1)= {(Yi, X_i^[m+1], Q_i(m+1)) : Q_i(m+1) = 1}. It is noted from Rm = 0 that the subsequent sampling indicators are automatically to be zero. In addition, one suitable condition is assumed in the sequel:

(A1) X^[−m] ⊥ R_m+1 | D_1m, m = 1, . . . , (M − 1).

2.1 Pseudo Maximum Likelihood Estimation

From the perspective of missing-data, the corresponding likelihood function can be derived to be proportional to

L(β, Gβ, F ) =

n

Y

i=1 M

Y

m=1

[IGβ(Yi, X_i^[m])]^Q^im^(1−R^i(m+1)⁾, (2.1) where IG_β(y, x^[m]) = R G^y_β(x_β)(1 − G_β(x_β)^1−yd_x[−m]F_m(x) and R_{M +1} ≡ 0 with F_m(x) = P (X^[m] = x^[m], X^[−m] ≤ x^[−m]) and IGβ(y, x^{[M ]}) = G^y_β(xβ)(1 − Gβ(xβ))^1−y. In our estimation procedure, F_m(x)’s andG_β(υ) are treated as nuisance parameters. By using the inverse probability weighting technique on members with complete covariates, G_β(υ) = ψ_1β(υ)/P1

y=0ψ_yβ(υ), and the equality F_m(x) = E[Q(Y, X^{[M −1]})I(X^[m]= x^[m], X^[−m] ≤ x^[−m]]), it enables us to construct the estimators for G_β(X_iβ^[m]_[m]+ x^[−m]_β_[−m]) and F_m(X_i^[m], x^[−m]) in IG_β(Y_i, X_i^[m])’s as

Gb_β(X_iβ^[m]_[m]+ x^[−m]_β_[−m]) =

ψb_1β(X_iβ^[m]_[m]+ x^[−m]_β_[−m]) P1

y=0ψbyβ(X_iβ^[m][m]+ x^[−m]_β[−m])

= P

j6=iQ(Yj, X_j^{[M −1]})YjK_h(X_jβ− (X^[m]

iβ^[m] + x^[−m]_β_[−m]) P

j6=iQ(Yj, X_j^{[M −1]})Kh(Xjβ− (X^[m]

iβ^[m]+ x^[−m]_β[−m])) and Fbm(X_i^[m], x^[−m]) = 1

n − 1 X

j6=i

Q(Yj, X_j^{[M −1]})I(X_j^[m]= X_i^[m], X_j^[−m]≤ x^[−m]),

where K_h(u) = K(u/h)/h, K(u) is a bounded symmetric density function with bounded support and R K⁽²⁾(u)du = 0, and h is a positive-valued bandwidth. Thus, the PMLE bβ_h is naturally proposed to be the maximizer of the pseudo log-likelihood function

`_1p(β) = 1 n

n

X

i=1 M

X

m=1

Q_im(1 − R_i(m+1)) ln cIG_β(Y_i, X_i^[m]). (2.2)

(7)

The reason for the specification of a particular second-order kernel function is mainly to achieve the

√n-consistency of bβ_h. In application, the quartic kernel function K(ν) = (15/16)(1 − ν²)²I(|ν| ≤ 1) is frequently adopted. As for the bandwidth selection, an optimal smoother hopt is sought by maximizing

`_1p( bβ_h) with respect to h, i.e. h_opt = arg max_h`_1p( bβ_h).

In this subsection, we further consider another design in which the binary status Y = I(Y_o≤ y_c) of the continuous response Y_o is known for each individual at the first phase and the values of Y_o are measured at the second phase. The proxy of the log-likelihood function can be derived to be

`_2p(β) = 1 n

n

X

i=1

{Q_iMln bf_β(Y_io, X_iβ) + (1 − R_i2)[Y_ilnpb_c+ (1 − Y_i) ln(1 −pb_c))]

+

M −1

X

m=2

Q_im(1 − R_i(m+1)) ln Z

fb_β(Y_io, X_iβ^[m]_[m]+ x^[−m]_β_[−m])d_x[−m]Fb_m(X_i^[m], x^[−m])}, (2.3) where pb_c=R [R_{y

0>yc}fb_β(y_o, x_β)dy_o]d bF_M(x) is an estimator of p_c= P (Y_o > y_c) and

fbβ(Yio, X_iβ^[m][m]+ x^[−m]_β[−m]) = P

j6=iQ(Yj, X_j^{[M −1]})Kh1(Yjo− Y_io)Kh2(Xjβ− (X^[m]

iβ^[m]+ x^[−m]_β[−m])) P

j6=iQ(Y_j, X_j^{[M −1]})K_h₂(X_jβ− (X^[m]

iβ^[m]+ x^[−m]

β^[−m])) , i = 1, . . . , n, with K_h_k(u) = K₄(u/h_k)/h_k, k = 1, 2, and K₄(u) being the fourth-order kernel function such as K₄(u) = (105/64)(1 − 5u² + 7u⁴ − 3u⁶)I(|u| ≤ 1). The corresponding maximizer, say ˘βh, of the pseudo log- likelihood function `_2p(β) is naturally proposed as an estimator of β₀. It is noted that the validity of (2.3) is ascertained by modifying assumption (A1) as (A1⁰) (Yo, X^[−m]) ⊥ Rm+1 | D_1m, m = 1, . . . , (M − 1).

Paralleling with the proof steps of bβ_h, the fourth-order kernel function is required to ensure the n^−1/2 convergence rate of ˘β_h. A more attractive feature of this estimator is that ˘β_h should be asymptotically more efficient than bβh, which can be treated as the maximizer of the pseudo likelihood function of the induced response Y . However, the merit might disappear in small-sample case due to the numerical instability caused by negative weights for some regions near the ends of the interval [−1, 1]. Furthermore, the estimation procedure is likely to be computationally inefficient because the random quantitypb_cstrongly relies on some complicated integration methods in the calculation of R

{y0>yc}fb_β(yo, x_β)dyo. As for the outcome-dependent sampling with a continuous response, our estimation criterion can also be adapted via the induced counting process I(Y_o ≤ y_o) and an appropriate weight function over possible y_o.

(8)

2.2 Estimation Through an Accuracy Measure

When G_β(υ) is strictly increasing in υ with continuous or mixed discrete-continuous X, an alternative estimation approach is suggested for β₀. It is easy to derive that the AUC of X_β has the following probability expression:

A(β) = P (Xiβ > Xjβ|Yi = 1, Yj = 0), i 6= j. (2.4) By applying the Neyman-Pearson lemma, the linear predictor X_β₀ can be shown to have the highest ROC curve and, hence, the largest AUC among all scalar transformations of X.

Under the supplementary M -phase case-control sampling and assumption (A1), we have

E[Q(Y_i, X_i^{[M −1]})Q(Y_j, X_j^{[M −1]})A_ij(β)] = p(1 − p)A(β), (2.5) where Aij(β) = I(X_iβ > X_jβ)Yi(1−Yj). The above equality indicates that only the final-phase subsample, i.e. D1M, is utilized and all the sampled members are inversely weighted by their selection probabilities.

Thus, for each fixed β, the accuracy measure A(β) in (2.4) can be estimated by its sample analogue and the resulting estimator is given by

Ab₁(β) = Pn

i=1

P

j6=iQ(Y_i, X_i^{[M −1]})Q(Y_j, X_j^{[M −1]})A_ij(β) Pn

i=1

P

j6=iY_i(1 − Y_j) . (2.6)

Since β₀ is the unique maximizer of A(β), the estimator eβ is naturally defined to be a maximizer of bA₁(β) in a quite straightforward manner. When the monotonicity of G_β(υ) or the SIM in (1.1) is violated, the estimated linear predictor X

βe is still optimal in the class of all potential linear predictors. In contrast with the pseudo-likelihood estimation approach, the estimated selection probabilities play central roles in the asymptotic variances of bA₁(β) and eβ. It is evidenced through our numerical experiments that βb_h outperforms eβ in terms of the mean squared error although bA₁( eβ) ≥ bA₁( bβ_h). The AUC estimator Ab₁( eβ) is further explored to have a better performance than bA₁( bβh), whereas the difference between both estimators is not apparent. Due to the non-differentiability of bA₁(β), the PMLE can directly provide an essential ingredient in the estimation of A(β0). In addition, the numerical results show that bA₁( bβ_h) possesses an important advantage in computational speed and cost, especially in accommodating high- dimensional covariate spaces.

Let Ψ_yβ(x_β) = yR

{υ<x_β}ψ_0β(υ)dυ + (1 − y)R

{υ>x_β}ψ_1β(υ)dυ, y=0,1. Another perspective for the

(9)

estimation of A(β) is to include individuals with fully observable covariates either from cases or controls.

One further derives that E[Q(Yj, X_j^{[M −1]})Q(Yj, X_j^{[M −1]})Aij(β)+Q(Yi, X_i^{[M −1]})(Ψ^Y_0βⁱ(Xiβ)+Ψ^1−Y_1β ⁱ(Xiβ))] = 3p(1 − p)A(β). Substituting bΨyβ(Xiβ) = Pn

j6=iQ(Yj, X_j^{[M −1]})(yI(Xjβ < Xiβ)(1 − Yj) + (1 − y)I(Xjβ >

Xiβ)Yj)/(n − 1) for Ψyβ(Xiβ) and bp(1 −p) =b Pn i=1

P

j6=iYi(1 − Yj)/n(n − 1) for p(1 − p), A(β) can also be estimated by

Ab₂(β) = Pn

i=1

P

j6=iQ(Y_i, X_i^{[M −1]})(Q(Y_j, X_j^{[M −1]})A_ij(β) + bΨ^Y_0βⁱ(X_iβ) + bΨ^1−Y_1β ⁱ(X_iβ)) 3Pn

i=1

P

j6=iY_i(1 − Y_j) ∀β. (2.7)

After some tedious algebra, bA₂(β) can be derived to be the same as that in (2.6).

2.3 Pseudo Least Squares Estimation and Model Checking

One of the major aims in this study was to check the correctness of the SIM in (1.1). This testing problem is of particular importance in practice and can be formulated as the following hypotheses:







H₀: P (Y = 1|X = x) = G_β₀(x_β₀) for all x ∈ X ; H_A: P (Y = 1|X = x) 6= G_β₀(x_β₀) for some x ∈ X .

(2.8)

Our testing procedures are based on the error ε_R(β) = Q_M(Y − G_R(X_β)) of the final-phase data, where G_R(x_β) = Gβ(xβ)E[p(1, xb ^{[M −1]})]

P1

y=0(1 − G_β(x_β))^1−yG^y_β(x_β)E[p(y, xb ^{[M −1]})] (2.9) and E[Y |X = x, Q_M = 1] = G_R(x_β₀) when H₀is true. The reason for doing so is that the error ε(β) = Y − G_β(X_β) is not available for subjects with incomplete covariates even G_β(υ) is known. Under the validity of the SIM, it is easy to derive from the equality E[ε²_R(β)|X = x, Q_M = 1] = E[ε²_R(β₀)|X = x, Q_M = 1] + (G_R(x_β) − G_R(x_β₀))² that the minimizer β_R0= arg min_βE[ε²_R(β)] is equal to the true parameter β₀. On the other hand, if the considered model is incorrect, εR(βR0) can be further projected into another linear combination of covariates, i.e. there exists a non-zero minimizer γ₀ of E[(ε_R(β_R0) − Q_Mν(X_γ))²] for some function ν(·).

By Substituting

G¯R(xβ) = Gb_β(x_β)p(1, xb ^{[M −1]}) P1

y=0(1 − bG_β(x_β))^1−yGb^y_β(x_β)p(y, xb ^{[M −1]}) (2.10) for GR(xβ), the estimator ¯βh of βR0 is defined to be the minimizer of the following pseudo sum of squares:

(10)

SS_p(β) =

n

X

i=1

Q_iM(Y_i− ¯G_R(X_iβ))². (2.11) Since ε_R(β_R0) is an unknown random quantity, the corresponding residual e_R( ¯β_h) = Q_M(Yi− ¯G_R(Xβ¯_h)) is naturally used to build testing standards. When the SIM is correct, the PLSE ¯β_h can be shown to be a consistent estimator of β0, while it is asymptotically inefficient relative to the PMLE bβ_h. Similarly, an estimator for γ0 can be obtained as

¯

γ = arg min

γ RSS_n(γ), (2.12)

where RSSn(γ) = Pn

i=1(eiR( ¯βh) − QiMν(X¯ iγ))²/Pn

i=1QiM with ¯ν(Xiγ) =Pn

j6=iejR( ¯βh)Kh(Xjγ − X_iγ) /Pn

j6=iQ_jMK_h(X_jγ− X_iγ) being a smoothing estimator of ν(X_iγ), i = 1, . . . , n.

To set up a test rule, we consider the test statistic F_n= RSS_n(¯γ)/T SS_n, where T SS_n=Pn

i=1(e_iR( ¯β_h)

−¯e_R( ¯β_h))²/Pn

i=1Q_iM and ¯e_R( ¯β) =Pn

i=1e_iR( ¯β_h)/Pn

i=1Q_iM. The null hypothesis H₀ in (2.8) is rejected if F_n≤ c_α with a significance level α. The critical value c_α can be determined by employing the frequency distribution of a wild-bootstrap estimator. The bootstrap analogue e^b_iR( ¯β_h) of the ith residual e_iR( ¯β_h) is drawn from a two-point distribution (5 +√

5)/10δ₍₁₋^√_5)e

iR( ¯βh)/2+ (5 −√

5)/10δ₍₁₊^√_5)e

iR( ¯βh)/2 and the bootstrap estimator F_n^b is computed by using a bootstrap sample {e^b_iR( ¯β_h)}ⁿ_i=1 in both of RSS_n(¯γ) and T SS_n. Thus, the αth, 0 < α < 1, quantile of B independent bootstrap replicates {F_n^b}^B_b=1, denoted by q_α(F_n^b), can be treated as the Monte-Carlo critical value in the test criterion. An alternative approach is to extend the single-indexing cross-validation (SIC) measure of Xia (2009) in the M -phase sampling scheme. The modified SIC measure is proposed as

SIC_n = Pn

i=1(e_iR( ¯β_h) − Q_iMν¯−i(x_i¯_γ_i))² Pn

i=1Q_iM , (2.13)

where ¯γi = arg minγPn

j6=i(ejR( ¯β_h) − QjMν¯−i(Xjγ))²and ¯ν−i(xγ) is the same as ¯ν(xγ) with the ith subject being deleted, i = 1, . . . , n. The test rule is then established with being rejected if SIC_n< T SS_n. Remark 1. Under the validity of the SIM in (1.1), the efficiency of the PLSE can be further improved through a pseudo weighted sum of squares with the weights 1/[ ¯G_R(X_iβ)(1 − ¯G_R(X_iβ))], i = 1, . . . , n. The pseudo weighted least squares estimator (PWLSE), say bβ_h^∗, can also be obtained by maximizing

`^∗_1p(β) = 1 n

n

X

i=1

Q_iM[Y_iln ¯G_R(X_iβ) + (1 − Y_i) ln(1 − ¯G_R(X_iβ))]. (2.14)

(11)

3 Asymptotic Properties

In this section, the theoretical development is mainly based on the independent Bernoulli sampling scheme in the M -phase case-control design. To simplify the complicated mathematical expressions, some concise notations are introduced below. Let q(y, x^[m]) = q0(y, x^[m])/q1(y, x^[m]) with q`(y, x^[m]) = P (Y = y, X^[m] = x^[m], Q_(m−`) = 1) and q(Y, X^{[M ]}) = 0, p(y, x^[m]) = Qm

k=1q(y, x^[k]), and M_β^k(x, υ) = E[( eX −ex)^⊗k|X_β = υ]ψβ(υ) with eX = (X2, . . . , Xd)^>, ( eX −x)e ^⊗0 = 1, ( eX −x)e ^⊗1= ( eX −ex), and ( eX −x)e ^⊗2 = ( eX −x)( ee X −ex)^>. In addition, the target functions of the estimators ∂_β^kψb_yβ(x_β), k = 0, 1, 2, and ∂_βGb_β(x_β) are denoted by ψ_yβ^(k)(x, υ) = ∂_υ^k(G^y_β(υ)(1 − G_β(υ))^1−yM_β^k(x, υ)) and H_β(x, υ) = ψ_1β⁽¹⁾(x, υ)

P1

y=0ψ_yβ(υ)− G_β(υ) P1

y=0ψ_yβ⁽¹⁾(x, υ) (P1

y=0ψ_β(υ))².

3.1 Consistency and Limiting Distribution

For the convenience of derivation of the main results, the following conditions are assumed throughout this article:

(A2) |n(y, x^[m])/n − q₀(y, x^[m])| = O_p(n^−1/2) and q₁(y, x^[m]) > 0 ∀(y, x^[m])), m = 1, . . . , (M − 1). (A3) X and B are compact and β0 is an interior of B.

(A4) ∂_υ³ψ_yβ(υ), and ∂_υ³M_β^`(x, υ)are uniformly Lipschitz continuous in (υ, x, β). (A5) ^P¹_y=0ψ_yβ₀(υ) > 0 and 0 < G_β₀(υ) < 1 for all υ.

(A6) h = h0n^−ζ for some positive constantsh0 and ζ ∈ (1/8, 1/5). (A7) E[H_β^⊗2

0 (X, X_β₀)] is non-singular.

The consistency of bβ_h and the limiting distribution of √

n( bβ_h− β₀) are established below.

Theorem 1. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, bβh

−→ βa.s. ₀ and

√n( bβ_h− β₀)−→ N^d _d−1(0, Σ₁(β₀)), where Σ₁(β₀) = V₂⁻¹(β₀)V₁(β₀)V₂⁻¹(β₀) with

V1(β) = V2(β) +

M −1

X

m=1

E[1 − q(Y, X^{[M −m+1]}) p(Y, X^{[M −m+1]})

m

X

k=1

p²(Y, X^{[M −k+1]})(J_β^⊗2(Y, X^{[M −k+1]}) − J_β^⊗2(Y, X^{[M −k]}))],

V2(β) =

M

X

m=1

E[(1 − q(Y, X^[m+1]))p(Y, X^[m])J_β^⊗2(Y, X^[m])], Jβ(y, x^[m]) = (−1)^1−yR H_β(x, x_β)d_x[−m]F_m(x) IGβ(y, x^[m]) .

(12)

When Fm(x) or any√

n-consistent estimator is in place of bFm(x), the resulting estimator can also achieve the efficiency bound of bβ_h. More precisely speaking, the variation of bF_m(x) does not influence the asymptotic variance of bβ_h. For the asymptotic behaviors of bβ^∗_h and ¯β_h, their derivations are very similar to the proof of Theorem 1 and are omitted here.

Theorem 2. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, (i) ¯βh

−→ βa.s. ₀ and√

n( ¯βh− β₀)−→ N^d _d−1(0, Σ20(β0)), and (ii) bβ_h^∗−→ β^a.s. ₀ and √

n( bβ_h^∗− β₀)−→ N^d _d−1(0, Σ₂₁(β₀)), where Σ2`(β0) = V_22`⁻¹(β0)V21`(β0)V_22`⁻¹(β0), ` = 0, 1, with

V_21`(β) = V_22`(β) + E[

M −1

X

m=1

1 − q(Y, X^[m])

p(Y, X^[m]) p²(Y, X^{[M −1]})(Y − G_R(X_β))²H_R^⊗2(X, X_β) (GR(Xβ)(1 − GR(Xβ))^2` ], V_22`(β) = E[p(Y, X^{[M −1]})(Y − G_R(X_β))²H_R^⊗2(X, X_β)

(G_R(X_β)(1 − G_R(X_β)))^2` ], H_R(x, x_β) = GR(xβ)(1 − GR(xβ))

G_β(x_β)(1 − G_β(x_β))H_β(x, x_β).

We write A B if the matrix A − B is positive semidefinite. By using V₂₂₀⁻¹(β₀) − V₂₂₁⁻¹(β₀) 0 and E[

1

X

`=0

(−1)^`(

M −1

X

m=1

1 − q(Y, X^[m])

p(Y, X^[m]) )p²(Y, X^{[M −1]})(Y − G_R(X_β₀))²H_R^⊗2(X, X_β₀)

(G_R(X_β₀)(1 − G_R(X_β₀))^2−` ] 0, (3.1) it follows that

0 E[(

1

X

`=0

(−1)^`V_22`⁻¹(β₀)(

M −1

X

m=1

1 − q(Y, X^[m])

p(Y, X^[m]) )^1/2p(Y, X^{[M −1]})(Y − G_R(X_β₀))H_R(X, X_β₀) (G_R(X_β₀)(1 − G_R(X_β₀))^` )^⊗2]

(Σ₂₀(β₀) − Σ₂₁(β₀)) − (V₂₁₀⁻¹(β₀) − V₂₁₁⁻¹(β₀)) (Σ₂₀(β₀) − Σ₂₁(β₀)), (3.2)

which implies that bβ_h^∗is asymptotically more efficient than ¯β_h. Along the same line as those in (3.1)-(3.2), βb_h can also be shown to be asymptotically more efficient than bβ_h^∗. With a full random sample, i.e. the selection probabilities are all equal to one, we further conclude from Theorems 1-2 that√

n( bβ_h− β₀) and

√n( bβ_h^∗− β₀) have the same limiting distribution Nd−1(0, (E[H^⊗2(X, Xβ0)/Gβ0(Xβ0)(1 − Gβ0(Xβ0))])⁻¹) and √

n( ¯β_h− β₀)→ N^d _d−1(0, (E[G_β₀(X_β₀)(1 − G_β₀(X_β₀))H^⊗2(X, X_β₀)])⁻¹).

Obviously, one can see from (2.4) and (2.6) that β₀and eβ are also the maximizers of A₀(β) = P (X_iβ >

X_jβ, Y_i = 1, Y_j = 0) and bA₀(β) = Pn i=1

P

j6=iQ(Y_i, X_i^{[M −1]})Q(Y_j, X_j^{[M −1]})A_ij(β)/n(n − 1), respectively.

In the theoretical development, the uniform convergence and the asymptotic approximation of the U- statistic eA₀(β) =Pn

i=1

P

j6=iA_ij(β)/n(n − 1) are repeatedly used. As for the difference in the asymptotic variances of bA₀(β) and eA₀(β), it is mainly caused by the variation from selection probabilities. Two

(13)

assumptions are further imposed for the derivation of the large sample properties of eβ and bA₁( eβ):

(A8) ∂_β²Ψyβ(υ), y = 0, 1, are uniformly Lipschitz continuous in(υ, β). (A9) V_2A(β₀) is non-singular.

Theorem 3. Suppose that assumptions (A1)-(A3) and (A8)-(A9) are satisfied. When n −→ ∞, (i) eβ −→ β^p ₀ and √

n( eβ − β₀)−→ N^d _d−1(0, Σ₃(β₀)), and (ii) bA₁( eβ)−→ A(β^p ₀) and√

n( bA₁( eβ) − A(β₀))−→ N (0, σ^d ₃²(β₀)),

where Σ₃(β₀) = V_2A⁻¹(β₀)V_1A(β₀)V_2A⁻¹(β₀) and σ₃²(β₀) = [σ₀²(β₀)+V ar(P₁

y=0Ψ_yβ₀(X_β₀)−A(β₀)(1−2p)(Y − p))]/[p(1 − p)]² with

σ₀²(β₀) =

M −1

X

m=1

E[1 − q(Y, X^[m]) p(Y, X^[m]) (

1

X

y=0

Ψ_yβ₀(X_β₀))²], V_0A(β₀) = E[(

1

X

y=0

∂_βΨ_yβ₀(X_β₀))^⊗2],

V_1A(β₀) = V_0A(β₀) +

M −1

X

m=1

E[1 − q(Y, X^[m]) p(Y, X^[m]) (

1

X

y=0

∂_βΨ_yβ₀(X_β₀))^⊗2], V_2A(β₀) = 0.5E[

1

X

y=0

∂_β²Ψ_yβ₀(X_β₀)].

We note that the asymptotic properties of eβ still hold even the monotonic assumption on G(υ) is violated. Following along the same lines as the proof of Theorem 3(ii), both bA₁( bβ_h) and bA₁( ¯β_h) can be shown to have the same asymptotic distribution as that of bA₁( eβ). As stressed in the foregoing, βb_h avoids the computational cost and the complexity in the maximization of bA₁(β) with respect to β.

With a full random sample, the asymptotic variances Σ3(β0) and σ²₃(β0) in Theorem 3 will reduce to V_2A⁻¹(β₀)V_0A(β₀)V_2A⁻¹(β₀) and [V ar(A(β₀)(1 − 2p)(Y − p) +P1

y=0Ψ_yβ₀(X_β₀))]/[p(1 − p)]². Using the final- phase data in estimation, the asymptotic semiparametric efficiency of bβ^∗_h is evidenced by the argument of Ichimuri (1993). As claimed by Sherman (1993), one can also conclude that eβ does not achieve the semiparametric efficiency bound. Conclusively, bβ_his the asymptotically most efficient among the proposed estimators.

3.2 Bootstrap Inferences

Because a direct estimation of the asymptotic variance Σ₁(β₀) is complicated and impractical, we employ a bootstrap technique to construct confidence intervals for β₀ or the related parameters of interest. Suppose that ξ , . . . , ξ are randomly drawn from a multinomial distribution M ulti(1, n⁻¹, . . . , n⁻¹) with ξ =

(14)

(ξj1, . . . , ξjn)^> and wk =Pn

j=1ξjk/n, j, k = 1, . . . , n. The bootstrap analogue bβwh of bβh is defined to be the maximizer of the following bootstrap pseudo log-likelihood:

`_wp(β) =

n

X

i=1 M

X

m=1

w_iQ_im(1 − R_i(m+1)) ln cIG_wβ(Y_i, X_i^[m]), (3.3)

where bGwβ(X_β^[m][m]+ x^[−m]_β[−m]) and bFwm(X^[m], x^[−m]) are the bootstrap analogues of bGβ(X_β^[m][m]+ x^[−m]_β[−m]) and Fb_m(X^[m], x^[−m]) with Q_w(y, x^{[M −1]}) = wQ_M/bp_w(y, x^{[M −1]}) andpb_w(y, x^{[M −1]}) =QM −1

k=1 qb_w(y, x) substituting for Q(y, x^{[M −1]})/n and p(y, xb ^{[M −1]}), bqw(y, x^[m−1]) = n(y, x^[m−1])/qbw1(y, x^[m−1]), andqbw1(y, x^[m−1]) = Pn

j=1w_jQ_j(m−1)I(Y_j = y, X_j^[m−1] = x^[m−1]). It is derived in the next theorem that the sampling distribution of bβ_h can be approximated by the frequency distribution of bβ_wh.

Theorem 4. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, P_w(√

n( bβ_wh− bβ_h) ≤ u) − P (√

n( bβ_h− β₀) ≤ u)−→ 0^p ∀u = (u₂, . . . , u_d)^T, (3.4) where Pw(·) = P (·|(SM

m=2D_0m)S D_1M).

The validity of Theorems 1 and 4 enables us to give the normal-type and quantile-type bootstrap confidence intervals for β0k by

βb_kh± z_1−α/2se_w( bβ_wkh− bβ_kh) and ( bβ_kh− q_w(1−α/2)( bβ_wkh− bβ_kh), bβ_kh− q_wα/2( bβ_wkh− bβ_kh)), (3.5) k = 2, . . . , (d − 1), where z_1−α/2 is the (1 − α/2)th quantile value of the standard normal distribution, and se_w(·) and q_wς(·) denote the standard error and the 100ςth percentile of B bootstrap estimates. Generally, appropriate bootstrap replications are required to ensure the numerical stability.

4 Numerical Experiments and Applications

Under the setting of two-phase sampling, we conducted a class of simulation experiments to assess the performances of the proposed estimators and the related inference procedures. In the simulation process, data were generated from 1000 replications and the bootstrap inferences were based on 500 bootstrap samples, which enable us to obtain stable numerical results. In the last scenario of the simulation, 1000 random samples of size 1500 were repeatedly taken from the full data of Wilms’ tumor studies and three-phase subsampling of a smaller size was further undertaken from each sample. We quantify and investigate the efficiency loss incurred by the PMLE for a three-phase random sample relative to that for

(15)

a full random sample. Finally, a two-phase subsample drawn from a low birthweight study was used to illustrate the applicability of our developed approaches.

4.1 Simulation I - Estimation and Bootstrap Confidence Intervals

The covariates X = X⁽²⁾ = (X₁, X₂, X₃)^> were designed to follow a trivariate normal distribution with mean (0, 0, 0)^>, standard deviation (1, 1, 1)^>, and correlation coefficient ρ of 0.5. The conditional distribution of the binary response Y was set to be the logistic regression model

(Model 1) G_β₀(x_β₀) = exp(−5.8 + 6x_β₀) 1 + exp(−5.8 + 6x_β₀)

with the index coefficients β₀ = (1, −0.6, −0.4)^>. Data with sample sizes of 250 and 500 were simulated according to the above design. With the correlation of 0.5, the marginal probability of Y is computed to be about 0.16. Under the two-phase case-control sampling with n(y), y = 0, 1, being specified at the second phase, the combinations of subsamples (n(1), n(0)) were set to be (25k, 50k), (25k, 100k), (50k, 50k), or (50k, 100k) for n = 250k, k = 1, 2. The corresponding response selective probabilities (q(1),b q(0)) willb approach (0.5, 0.25), (0.5, 0.5), (1, 0.25), or (1, 0.5) as the sample size increases and the specified conditions are similar in nature to those found in application.

Table 4.1 displays the means and the standard deviations of 1000 estimates of bβ_h, ¯β_h, and eβ. One can expect that the standard deviations of these estimators tend to be smaller for a larger sample size.

Generally speaking, the considered sampling scheme will cause relatively small subsample sizes of the second-phase covariate data and greater variations in estimators. Compared with the pseudo likelihood and the pseudo least squares estimation approaches, the estimator obtained from the optimal AUC criterion tends to have a relatively larger bias, particularly in small subsamples. It is indicated from this table that bβhis generally the most efficient relative to ¯βh and eβ. As for the accuracy measure A(β0), both of the estimators bA₁( bβ_h) and bA₁( eβ) have fairly similar performance in terms of mean squared errors although the variation of bA₁( eβ) is slightly smaller (Table 4.2). In addition, the CPU time for the computation of βb_h is much shorter than those for eβ in the numerical experiments.

Table 4.3 further provides the bootstrap estimates for the standard deviations of the PMLEs, the empirical coverage probabilities, the 0.95 quantile intervals of 1000 estimates, and the averages of 1000 normal-type and quantile-type bootstrap confidence intervals. It seems that the asymptotic variance

(16)

is slightly overestimated by the bootstrap estimator while its accuracy is improved as the sample size increases. Although the constructed confidence intervals are typically wider than the corresponding true quantile intervals, their differences are not strongly associated with the sample size. The empirical coverage probabilities of normal-type bootstrap intervals are also detected to be higher than those of the quantile-type ones. Overall, the constructed 95% confidence intervals possess coverage probabilities which are quite close to the nominal value for appropriate sample sizes.

4.2 Simulation II - Model Checking

In this simulation scenario, the covariates were specified to be the same as those in Section 4.1. We assess the performance of our testing procedures for model correctness through Model 1 and the following model:

(Model 2) P (Y = 1|X = x) = exp(−5 + 6(x₁x₂− 0.6x³₃)) 1 + exp(−5 + 6(x1x2− 0.6x³₃)).

It is noted that Model 2 deviates from the structure of the SIM in (1.1). Although ¯β_his not asymptotically efficient relative to bβ_h and eβ when G_β₀(υ) is a strictly increasing function, the pseudo least squares estimation procedure is useful in the development of test rules.

With different sample sizes and combinations of subsamples at the second phase, the estimated sizes and powers, and the rejection proportions based on the SIC measure are provided in Table 4.4. When the SIM is valid, the simulation results indicate that the estimated sizes of the wild bootstrap test are all less than the nominal size at 0.05. In addition, the rejection probabilities of the SIC criterion are higher than the estimated sizes and powers of the wild bootstrap test, especially in a small sample. It is further found that the rejection probabilities may not be readily decreased as the sample size increases. Conclusively, a larger size of the second-phase covariate sample is typically associated with an increase in power.

4.3 Investigation of Efficiency Loss in the PMLE

The analyzed data were collected from the National Wilms’ Tumor Study Group and were further provided by Kulich and Lin (2004). The Wilms’ tumor, also called nephroblastoma, is a medical diagnosis and is the most common malignant (cancerous) tumor of kidney in childhood. A total of 3915 children, who had been treated for Wilms’ tumor, participated in the third and fourth Wilms’ tumor studies (NWTS-3 and NWTS-4). Measurements taken in our data analysis included: relapse within three years (relaps),

(17)

local histological type of the tumor (lhist) with unfavorable histologies being coding as 1 and favorable histologies being 0, stage of disease at diagnosis (stage) with stages 1 to 4, levels of age at diagnosis with age1 and age2 indicating the statuses of age≤ 1 year and age in (1,4) years, central histology (thist), which is the true histological type of the tumor and is coded as lhist, and tumor diameter (diam) in centimeters. In addition, the numbers of cases and controls in the cohort were 603 and 3312, respectively.

The primary research interest is to investigate the influences of the factors stage, (age1, age2), thist, and diam on the binary response variable relapse. Since the paired covariates (thist, this ∗ age₂), (stage, stage ∗ diam), and (diam, stage ∗ diam) are highly correlated with correlation coefficients of 0.641, 0.671, and 0.848, respectively, the SIM with the linear predictor thist + β₀₂stage + β₀₃age₁ + β04age2+ β05(thist ∗ age1) + β06diam is used to characterize the conditional probability of relapse. In this numerical experiment, we drew a random sample of size 1500 from the full data and repeated this process 1000 times. A subsample from each simulated data was obtained by further undertaking the three-phase case-control sampling. In the initial phase, there were 24 strata defined by the values of X⁽¹⁾ = (relapse, lhist, stage, age₁, age₂). The selection probabilities in the strata with {relapse = 1}

were specified to be 1 in the subsequent phases. For those with {relapse = 0}, 100 units were sampled from each stratum with size ≥ 100 and all units were taken with size < 100 at the second phase. In this phase, the values of the expensive covariate X⁽²⁾ = thist were obtained for sampled individuals. At the third phase, 25 units and all units were sampled with sizes of strata ≥ 25 and < 25, respectively. There were, on average, about 1170 children in the second-phase covariate data and 600 ones in the complete covariate data. Our another aim is to assess the loss of estimation efficiency caused by such a sampling design. In Table 4.5, we provide the averages of 1000 parameter estimates and the corresponding standard errors in the sample data analysis as well as in three-phase data analyses. The averages of 1000 PMLEs for a full random sample and three-phase data are quite close to each other. As one can expect, the efficiency loss of the PMLE is related with the size of subsample taken for each ascertained covariate.

4.4 Application to Auckland Birthweight Collaborative Study

The Auckland Birthweight Collaborative (ABC) study was conducted from 1995 to 1997. The primary research interest focuses on exploring potential demographic characteristics and risk factors of pregnant