**PREPRINT**

國立臺灣大學 數學系 預印本 Department of Mathematics, National Taiwan University

### www.math.ntu.edu.tw/ ~ mathlib/preprint/2011- 17.pdf

## Binary Response Models with M-phase Case-Control Data

### Chin-Tsang Chiang, Ren-Hong Bai, and Ming-Yueh Huang

### December 1, 2011

### Binary Response Models with M -phase Case-Control Data

### Chin-Tsang Chiang, Ren-Hong Bai, and Ming-Yueh Huang Department of Mathematics, National Taiwan University

### December 1, 2011

Abstract

The present study aimed to characterize the relationship between a binary response and covariates of interest through a more general single-index regression model. With M -phase (M ≥ 2) case-control data supplemented by information on a response and certain covariates, we primarily propose a pseudo likelihood estimation for the index coefficients of this type of semiparametric model. Additionally, our approach can be readily adopted to accommodate case-control sampling with a continuous response and outcome-dependent sampling. With the considered data setting in the receiver operating characteristic curve analysis, an estima- tion for the accuracy measure is provided and is borrowed to seek an optimal linear predictor in the class of potential linear predictors. To check model correctness, a pseudo least squares approach is further employed as an aid to devising suitable testing procedures. The general theoretical frameworks for the proposed estimators and the bootstrap inference are also de- veloped in this article. Finally, extensive simulations and two empirical applications are used to illustrate the applicability of our methodology.

### 1 Introduction

To account for the relationship between potential covariates X = (X_{1}, . . . , X_{d})^{>} and a binary
response Y (1: case; 0: control), researchers and practitioners try to better understand the con-
ditional probability of Y. We can reasonably argue that fully parametric models might be mis-
specified and fully nonparametric ones usually suffer from the curse of dimensionality (Bellman
1961). Thus, a more flexible semiparametric model becomes a great interest and compromise.

With prospective studies, recent research has developed a variety of estimation approaches and

Keywords: accuracy measure; bootstrap; case-control; cross-validation; missing data; M-phase; pseudo least squares; pseudo maximum likelihood estimator; receiver operating characteristic curve; single-index model.

the related inferences for certain semiparametric models. In application, it is often impractical to obtain this type of data due to the high cost or tremendous amount of time in collecting expensive variables. An alternative cost-effective strategy to solve this difficulty is to adopt supplementary M-phase sampling plans.

In the simple case-control design, Anderson (1972) and Pyke (1979) showed that the covari- ate effects in logistic regression models can be estimated in a normal manner as if data were coming from a prospective study. However, this attractive feature cannot carry over to arbi- trary regression models without some degree of information on the marginal probability of Y . To overcome the confronting problem and further enhance study efficiency, the supplementary M-phase sampling mechanism has been adopted in several clinical and epidemiological studies.

Let (X^{(1)>}, . . . , X^{(M )>})^{>} be an alternative expression of X with categorical or discrete covariates
X^{(m)} having finite support {x^{(m)}_{1} , . . . , x^{(m)}_{J}_{m}} and being designed to be observed at the mth phase,
and Rm+1 indicate the missing status of X^{(m+1)}, which is defined to be 1 if X^{(m+1)} is observed
and 0 otherwise, at the (m + 1)th phase, m = 1, . . . , M − 1. Using these missing data nota-
tions, the so-called M-phase case-control sampling design can be concisely described as below:

First of all, a prospective cohort of size n is stratified by the values of (Y, X^{(1)}) at the first phase.

For each individual with the combination value {Y = y, X^{(1)} = x^{(1)}_{j}_{1} , . . . , X^{(m)} = x^{(m)}_{j}_{m} } at the mth
phase, the sampling indicator Rm+1 is randomly generated from the independent Bernoulli or fi-
nite population sampling scheme with the corresponding selection probability q(y, x_{b} ^{(1)}_{j}_{1} , . . . , x^{(m)}_{j}_{m} ) =
n(y, x^{(1)}_{j}_{1} , . . . , x^{(m)}_{j}_{m} )/Pn

i=1QimI(Yi = y, X_{i}^{(1)} = x^{(1)}_{j}_{1} , . . . , X_{i}^{(m)} = x^{(m)}_{j}_{m} ), where n(y, x^{(1)}_{j}_{1} , . . . , x^{(m)}_{j}_{m} ) is the
expected or specified subsample size at the (m + 1)th phase and Qm+1 =Qm+1

k=2 Rk. For the sake of simplification, both of case-control sample and its population are treated as the first phase without distinction (cf. Breslow and Holubkov 1997). Under stratified two-phase case-control studies, Scott and Wild (1991) proposed a maximum likelihood estimation for logistic regression.

Meanwhile, Wild (1991) provided a maximum likelihood approach as well as two pseudo maximum likelihood ones for the regression parameters of an arbitrary model with the link function being

known up to the parameters and general covariates. It was further shown by Scott and Wild
(1997) that the maximum likelihood estimator can be obtained by iteratively updating the derived
pseudo-likelihood equation. As one can see, the feasibility of their approaches mainly relies on the
monotonic assumption of a response function or another suitable condition. Based on the con-
tinuous response Yo with supplementary information on its dichotomized response Y = I(Yo ≤ y_{c})
for some threshold value yc, a more efficient estimation was developed by Jiang, Scott, and Wild
(2009) under the validity of a parametric model for the conditional distribution of Yo. Recently,
Lee, Scott, and Wild (2010) also presented an efficient profile likelihood approach for multi-phase
case-control data and unified earlier developments.

In this article, we focus on a fairly useful single-index model (SIM):

P (Y = 1|X = x) = Gβ0(xβ0), (1.1)

where G_{β}(·) is an unknown function and x_{β} = x1+ (x2, . . . , x_{d})^{>}β with β = (β2, . . . , β_{d})^{>} and β0 =
(β02, . . . , β_{0d})^{>}being a vector of true index coefficients. The specification of one for the coefficient of
a significant covariate, sayX1, is mainly to solve the problem of identifiability. With a prospective
data collection, readers can refer the profile likelihood approach of Klein and Spady (1993) and
the pseudo least squares one of Ichimura (1993). Based on supplementary M-phase case-control
data, we developed a pseudo likelihood estimation criterion for β0 by fully utilizing available
data at each phase. Under some suitable conditions, the resulting pseudo maximum likelihood
estimator (PMLE) is shown to be √

n-consistent, asymptotic normal, and the most efficient among
the proposed estimators. The frequency distribution of a bootstrap analogue is further employed
to estimate the sampling distribution of the PMLE and make inferences on β_{0} or the related
parameters of interest. As for the case-control data with a continuous outcome (cf. Jiang, Scott,
and Wild 2009) or the supplementary continuous-outcome-dependent data (cf. Zhou el al. 2002,
2007; Song, Zhou, and Kosorok 2009), a more flexible semiparametric model

f_{Y}_{o}(y_{o}|x) = f_{β}_{0}(y_{o}, x_{β}_{0}) (1.2)
can be adopted to characterize the conditional density function ofYo, wherefβ(·, ·)is an unspecified

bivariate function. Of interest is the fact that the induced model for the dichotomized outcome Y has the same form of that in (1.1). In this study, we introduced a new estimation procedure for this type of data and addressed its theoretical merit and practical limitation. When Gβ(υ) is strictly increasing in υ and the components of X are continuous or mixed discrete-continuous, the receiver operating characteristic (ROC) curve of the single-indexXβ0 has been well known to be the highest among those of scalar-valued functions ofX. An alternative estimation criterion was aimed at maximizing the proposed estimators for the areas under the ROC curves (AUC) with respect to all potential linear predictors. To check the correctness of the SIM in (1.1), our pseudo least squares estimation (PLSE) approach is considerable in the establishment of testing procedures.

The test rules of Xia (2009) were further extended with some modifications to accommodate the considered sampling scheme. There are some interesting features in our proposed approaches:

First, the PMLE is the asymptotically most efficient among the other ones and has remarkably good finite sample performance. Second, the estimation approach of the accuracy measure provides an estimator for the optimal composite markerXβ0. Third, thePLSEcriterion is found to be useful in model checking although the resulting estimator is inefficient relative to the PMLE.

Under the M-phase case-control design, we outline a class of pseudo estimation procedures for β0 and devise testing procedures to check model correctness in Section 2. Section 3 is devoted to the related large sample properties and the bootstrap inference. In Section 4, simulations and applications to two empirical data are used to illustrate the practicality of our methodology. Some concluding remarks and future research are given in Section 5. Finally, the proofs of the technical lemmas and the main results are postponed to the appendices.

### 2 Estimation and Model Checking

To simplify the presentation, we let X^{[m]} = (X^{(1)>}, . . . , X^{(m)>})^{>}, X^{[−m]} = (X^{(m+1)>}, . . . , X^{(M )>})^{>},
p(y, xb ^{[m]}) = Qm

k=1q(y, xb ^{[k]}) with q(y, x_{b} ^{[m]}) = n(y, x^{[m]})/(nqb_{1}(y, x^{[m]})), R(y, x^{[m]}) = R_{m+1}/q(y, xb ^{[m]}),
Q(y, x^{[m]}) =Qm

k=1R(y, x^{[k]}), and ψyβ(υ) represent the joint probability density function of {Y = y}

and X . The M-phase case-control design can be depicted as the following flowchart:

(Y, X)

yrandom sample or study population

D11= {(Yi, X_{i}^{[1]})}^{n}_{i=1} −−−−→ D^{R}^{2}^{=1} 12
R_{3}=1

−−−−→ . . . −−−−−−→ D^{R}^{M −1}^{=1} 1(M −1)
R_{M}=1

−−−−→ D1M,

y^{R}^{2}^{=0}

y^{R}^{3}^{=0}

y^{R}^{M}^{=0}

D_{02} D_{03} D_{0M}

whereD_{0(m+1)}= {(Yi, X_{i}^{[m]}, Qim, R_{i(m+1)}) : Qim= 1, R_{i(m+1)}= 0}andD_{1(m+1)}= {(Yi, X_{i}^{[m+1]}, Q_{i(m+1)}) :
Q_{i(m+1)} = 1}. It is noted from Rm = 0 that the subsequent sampling indicators are automatically
to be zero. In addition, one suitable condition is assumed in the sequel:

(A1) X^{[−m]} ⊥ R_{m+1} | D_{1m}, m = 1, . . . , (M − 1).

### 2.1 Pseudo Maximum Likelihood Estimation

From the perspective of missing-data, the corresponding likelihood function can be derived to be proportional to

L(β, Gβ, F ) =

n

Y

i=1 M

Y

m=1

[IGβ(Yi, X_{i}^{[m]})]^{Q}^{im}^{(1−R}^{i(m+1)}^{)}, (2.1)
where IG_{β}(y, x^{[m]}) = R G^{y}_{β}(x_{β})(1 − G_{β}(x_{β})^{1−y}d_{x}[−m]F_{m}(x) and R_{M +1} ≡ 0 with F_{m}(x) = P (X^{[m]} =
x^{[m]}, X^{[−m]} ≤ x^{[−m]}) and IGβ(y, x^{[M ]}) = G^{y}_{β}(xβ)(1 − Gβ(xβ))^{1−y}. In our estimation procedure,
F_{m}(x)’s andG_{β}(υ) are treated as nuisance parameters. By using the inverse probability weighting
technique on members with complete covariates, G_{β}(υ) = ψ_{1β}(υ)/P1

y=0ψ_{yβ}(υ), and the equality
F_{m}(x) = E[Q(Y, X^{[M −1]})I(X^{[m]}= x^{[m]}, X^{[−m]} ≤ x^{[−m]}]), it enables us to construct the estimators for
G_{β}(X_{iβ}^{[m]}_{[m]}+ x^{[−m]}_{β}_{[−m]}) and F_{m}(X_{i}^{[m]}, x^{[−m]}) in IG_{β}(Y_{i}, X_{i}^{[m]})’s as

Gb_{β}(X_{iβ}^{[m]}_{[m]}+ x^{[−m]}_{β}_{[−m]}) =

ψb_{1β}(X_{iβ}^{[m]}_{[m]}+ x^{[−m]}_{β}_{[−m]})
P1

y=0ψbyβ(X_{iβ}^{[m]}[m]+ x^{[−m]}_{β}[−m])

= P

j6=iQ(Yj, X_{j}^{[M −1]})YjK_{h}(X_{jβ}− (X^{[m]}

iβ^{[m]} + x^{[−m]}_{β}_{[−m]})
P

j6=iQ(Yj, X_{j}^{[M −1]})Kh(Xjβ− (X^{[m]}

iβ^{[m]}+ x^{[−m]}_{β}[−m]))
and Fbm(X_{i}^{[m]}, x^{[−m]}) = 1

n − 1 X

j6=i

Q(Yj, X_{j}^{[M −1]})I(X_{j}^{[m]}= X_{i}^{[m]}, X_{j}^{[−m]}≤ x^{[−m]}),

where K_{h}(u) = K(u/h)/h, K(u) is a bounded symmetric density function with bounded support and
R K^{(2)}(u)du = 0, and h is a positive-valued bandwidth. Thus, the PMLE bβ_{h} is naturally proposed to be
the maximizer of the pseudo log-likelihood function

`_{1p}(β) = 1
n

n

X

i=1 M

X

m=1

Q_{im}(1 − R_{i(m+1)}) ln cIG_{β}(Y_{i}, X_{i}^{[m]}). (2.2)

The reason for the specification of a particular second-order kernel function is mainly to achieve the

√n-consistency of bβ_{h}. In application, the quartic kernel function K(ν) = (15/16)(1 − ν^{2})^{2}I(|ν| ≤ 1) is
frequently adopted. As for the bandwidth selection, an optimal smoother hopt is sought by maximizing

`_{1p}( bβ_{h}) with respect to h, i.e. h_{opt} = arg max_{h}`_{1p}( bβ_{h}).

In this subsection, we further consider another design in which the binary status Y = I(Y_{o}≤ y_{c}) of the
continuous response Y_{o} is known for each individual at the first phase and the values of Y_{o} are measured
at the second phase. The proxy of the log-likelihood function can be derived to be

`_{2p}(β) = 1
n

n

X

i=1

{Q_{iM}ln bf_{β}(Y_{io}, X_{iβ}) + (1 − R_{i2})[Y_{i}lnpb_{c}+ (1 − Y_{i}) ln(1 −pb_{c}))]

+

M −1

X

m=2

Q_{im}(1 − R_{i(m+1)}) ln
Z

fb_{β}(Y_{io}, X_{iβ}^{[m]}_{[m]}+ x^{[−m]}_{β}_{[−m]})d_{x}[−m]Fb_{m}(X_{i}^{[m]}, x^{[−m]})}, (2.3)
where pb_{c}=R [R_{{y}

0>yc}fb_{β}(y_{o}, x_{β})dy_{o}]d bF_{M}(x) is an estimator of p_{c}= P (Y_{o} > y_{c}) and

fbβ(Yio, X_{iβ}^{[m]}[m]+ x^{[−m]}_{β}[−m]) =
P

j6=iQ(Yj, X_{j}^{[M −1]})Kh1(Yjo− Y_{io})Kh2(Xjβ− (X^{[m]}

iβ^{[m]}+ x^{[−m]}_{β}[−m]))
P

j6=iQ(Y_{j}, X_{j}^{[M −1]})K_{h}_{2}(X_{jβ}− (X^{[m]}

iβ^{[m]}+ x^{[−m]}

β^{[−m]})) , i = 1, . . . , n,
with K_{h}_{k}(u) = K_{4}(u/h_{k})/h_{k}, k = 1, 2, and K_{4}(u) being the fourth-order kernel function such as K_{4}(u) =
(105/64)(1 − 5u^{2} + 7u^{4} − 3u^{6})I(|u| ≤ 1). The corresponding maximizer, say ˘βh, of the pseudo log-
likelihood function `_{2p}(β) is naturally proposed as an estimator of β_{0}. It is noted that the validity of (2.3)
is ascertained by modifying assumption (A1) as (A1^{0}) (Yo, X^{[−m]}) ⊥ Rm+1 | D_{1m}, m = 1, . . . , (M − 1).

Paralleling with the proof steps of bβ_{h}, the fourth-order kernel function is required to ensure the n^{−1/2}
convergence rate of ˘β_{h}. A more attractive feature of this estimator is that ˘β_{h} should be asymptotically
more efficient than bβh, which can be treated as the maximizer of the pseudo likelihood function of the
induced response Y . However, the merit might disappear in small-sample case due to the numerical
instability caused by negative weights for some regions near the ends of the interval [−1, 1]. Furthermore,
the estimation procedure is likely to be computationally inefficient because the random quantitypb_{c}strongly
relies on some complicated integration methods in the calculation of R

{y0>yc}fb_{β}(yo, x_{β})dyo. As for the
outcome-dependent sampling with a continuous response, our estimation criterion can also be adapted
via the induced counting process I(Y_{o} ≤ y_{o}) and an appropriate weight function over possible y_{o}.

### 2.2 Estimation Through an Accuracy Measure

When G_{β}(υ) is strictly increasing in υ with continuous or mixed discrete-continuous X, an alternative
estimation approach is suggested for β_{0}. It is easy to derive that the AUC of X_{β} has the following
probability expression:

A(β) = P (Xiβ > Xjβ|Yi = 1, Yj = 0), i 6= j. (2.4)
By applying the Neyman-Pearson lemma, the linear predictor X_{β}_{0} can be shown to have the highest ROC
curve and, hence, the largest AUC among all scalar transformations of X.

Under the supplementary M -phase case-control sampling and assumption (A1), we have

E[Q(Y_{i}, X_{i}^{[M −1]})Q(Y_{j}, X_{j}^{[M −1]})A_{ij}(β)] = p(1 − p)A(β), (2.5)
where Aij(β) = I(X_{iβ} > X_{jβ})Yi(1−Yj). The above equality indicates that only the final-phase subsample,
i.e. D1M, is utilized and all the sampled members are inversely weighted by their selection probabilities.

Thus, for each fixed β, the accuracy measure A(β) in (2.4) can be estimated by its sample analogue and the resulting estimator is given by

Ab_{1}(β) =
Pn

i=1

P

j6=iQ(Y_{i}, X_{i}^{[M −1]})Q(Y_{j}, X_{j}^{[M −1]})A_{ij}(β)
Pn

i=1

P

j6=iY_{i}(1 − Y_{j}) . (2.6)

Since β_{0} is the unique maximizer of A(β), the estimator eβ is naturally defined to be a maximizer of bA_{1}(β)
in a quite straightforward manner. When the monotonicity of G_{β}(υ) or the SIM in (1.1) is violated, the
estimated linear predictor X

βe is still optimal in the class of all potential linear predictors. In contrast
with the pseudo-likelihood estimation approach, the estimated selection probabilities play central roles
in the asymptotic variances of bA_{1}(β) and eβ. It is evidenced through our numerical experiments that
βb_{h} outperforms eβ in terms of the mean squared error although bA_{1}( eβ) ≥ bA_{1}( bβ_{h}). The AUC estimator
Ab_{1}( eβ) is further explored to have a better performance than bA_{1}( bβh), whereas the difference between both
estimators is not apparent. Due to the non-differentiability of bA_{1}(β), the PMLE can directly provide
an essential ingredient in the estimation of A(β0). In addition, the numerical results show that bA_{1}( bβ_{h})
possesses an important advantage in computational speed and cost, especially in accommodating high-
dimensional covariate spaces.

Let Ψ_{yβ}(x_{β}) = yR

{υ<x_{β}}ψ_{0β}(υ)dυ + (1 − y)R

{υ>x_{β}}ψ_{1β}(υ)dυ, y=0,1. Another perspective for the

estimation of A(β) is to include individuals with fully observable covariates either from cases or controls.

One further derives that E[Q(Yj, X_{j}^{[M −1]})Q(Yj, X_{j}^{[M −1]})Aij(β)+Q(Yi, X_{i}^{[M −1]})(Ψ^{Y}_{0β}^{i}(Xiβ)+Ψ^{1−Y}_{1β} ^{i}(Xiβ))] =
3p(1 − p)A(β). Substituting bΨyβ(Xiβ) = Pn

j6=iQ(Yj, X_{j}^{[M −1]})(yI(Xjβ < Xiβ)(1 − Yj) + (1 − y)I(Xjβ >

Xiβ)Yj)/(n − 1) for Ψyβ(Xiβ) and bp(1 −p) =b Pn i=1

P

j6=iYi(1 − Yj)/n(n − 1) for p(1 − p), A(β) can also be estimated by

Ab_{2}(β) =
Pn

i=1

P

j6=iQ(Y_{i}, X_{i}^{[M −1]})(Q(Y_{j}, X_{j}^{[M −1]})A_{ij}(β) + bΨ^{Y}_{0β}^{i}(X_{iβ}) + bΨ^{1−Y}_{1β} ^{i}(X_{iβ}))
3Pn

i=1

P

j6=iY_{i}(1 − Y_{j}) ∀β. (2.7)

After some tedious algebra, bA_{2}(β) can be derived to be the same as that in (2.6).

### 2.3 Pseudo Least Squares Estimation and Model Checking

One of the major aims in this study was to check the correctness of the SIM in (1.1). This testing problem is of particular importance in practice and can be formulated as the following hypotheses:

H_{0}: P (Y = 1|X = x) = G_{β}_{0}(x_{β}_{0}) for all x ∈ X ;
H_{A}: P (Y = 1|X = x) 6= G_{β}_{0}(x_{β}_{0}) for some x ∈ X .

(2.8)

Our testing procedures are based on the error ε_{R}(β) = Q_{M}(Y − G_{R}(X_{β})) of the final-phase data, where
G_{R}(x_{β}) = Gβ(xβ)E[p(1, xb ^{[M −1]})]

P1

y=0(1 − G_{β}(x_{β}))^{1−y}G^{y}_{β}(x_{β})E[p(y, xb ^{[M −1]})] (2.9)
and E[Y |X = x, Q_{M} = 1] = G_{R}(x_{β}_{0}) when H_{0}is true. The reason for doing so is that the error ε(β) = Y −
G_{β}(X_{β}) is not available for subjects with incomplete covariates even G_{β}(υ) is known. Under the validity
of the SIM, it is easy to derive from the equality E[ε^{2}_{R}(β)|X = x, Q_{M} = 1] = E[ε^{2}_{R}(β_{0})|X = x, Q_{M} =
1] + (G_{R}(x_{β}) − G_{R}(x_{β}_{0}))^{2} that the minimizer β_{R0}= arg min_{β}E[ε^{2}_{R}(β)] is equal to the true parameter β_{0}.
On the other hand, if the considered model is incorrect, εR(βR0) can be further projected into another
linear combination of covariates, i.e. there exists a non-zero minimizer γ_{0} of E[(ε_{R}(β_{R0}) − Q_{M}ν(X_{γ}))^{2}]
for some function ν(·).

By Substituting

G¯R(xβ) = Gb_{β}(x_{β})p(1, xb ^{[M −1]})
P1

y=0(1 − bG_{β}(x_{β}))^{1−y}Gb^{y}_{β}(x_{β})p(y, xb ^{[M −1]}) (2.10)
for GR(xβ), the estimator ¯βh of βR0 is defined to be the minimizer of the following pseudo sum of squares:

SS_{p}(β) =

n

X

i=1

Q_{iM}(Y_{i}− ¯G_{R}(X_{iβ}))^{2}. (2.11)
Since ε_{R}(β_{R0}) is an unknown random quantity, the corresponding residual e_{R}( ¯β_{h}) = Q_{M}(Yi− ¯G_{R}(Xβ¯_{h}))
is naturally used to build testing standards. When the SIM is correct, the PLSE ¯β_{h} can be shown to be
a consistent estimator of β0, while it is asymptotically inefficient relative to the PMLE bβ_{h}. Similarly, an
estimator for γ0 can be obtained as

¯

γ = arg min

γ RSS_{n}(γ), (2.12)

where RSSn(γ) = Pn

i=1(eiR( ¯βh) − QiMν(X¯ iγ))^{2}/Pn

i=1QiM with ¯ν(Xiγ) =Pn

j6=iejR( ¯βh)Kh(Xjγ − X_{iγ})
/Pn

j6=iQ_{jM}K_{h}(X_{jγ}− X_{iγ}) being a smoothing estimator of ν(X_{iγ}), i = 1, . . . , n.

To set up a test rule, we consider the test statistic F_{n}= RSS_{n}(¯γ)/T SS_{n}, where T SS_{n}=Pn

i=1(e_{iR}( ¯β_{h})

−¯e_{R}( ¯β_{h}))^{2}/Pn

i=1Q_{iM} and ¯e_{R}( ¯β) =Pn

i=1e_{iR}( ¯β_{h})/Pn

i=1Q_{iM}. The null hypothesis H_{0} in (2.8) is rejected
if F_{n}≤ c_{α} with a significance level α. The critical value c_{α} can be determined by employing the frequency
distribution of a wild-bootstrap estimator. The bootstrap analogue e^{b}_{iR}( ¯β_{h}) of the ith residual e_{iR}( ¯β_{h})
is drawn from a two-point distribution (5 +√

5)/10δ_{(1−}^{√}_{5)e}

iR( ¯βh)/2+ (5 −√

5)/10δ_{(1+}^{√}_{5)e}

iR( ¯βh)/2 and the
bootstrap estimator F_{n}^{b} is computed by using a bootstrap sample {e^{b}_{iR}( ¯β_{h})}^{n}_{i=1} in both of RSS_{n}(¯γ) and
T SS_{n}. Thus, the αth, 0 < α < 1, quantile of B independent bootstrap replicates {F_{n}^{b}}^{B}_{b=1}, denoted by
q_{α}(F_{n}^{b}), can be treated as the Monte-Carlo critical value in the test criterion. An alternative approach
is to extend the single-indexing cross-validation (SIC) measure of Xia (2009) in the M -phase sampling
scheme. The modified SIC measure is proposed as

SIC_{n} =
Pn

i=1(e_{iR}( ¯β_{h}) − Q_{iM}ν¯−i(x_{i¯}_{γ}_{i}))^{2}
Pn

i=1Q_{iM} , (2.13)

where ¯γi = arg minγPn

j6=i(ejR( ¯β_{h}) − QjMν¯−i(Xjγ))^{2}and ¯ν−i(xγ) is the same as ¯ν(xγ) with the ith subject
being deleted, i = 1, . . . , n. The test rule is then established with being rejected if SIC_{n}< T SS_{n}.
Remark 1. Under the validity of the SIM in (1.1), the efficiency of the PLSE can be further improved
through a pseudo weighted sum of squares with the weights 1/[ ¯G_{R}(X_{iβ})(1 − ¯G_{R}(X_{iβ}))], i = 1, . . . , n. The
pseudo weighted least squares estimator (PWLSE), say bβ_{h}^{∗}, can also be obtained by maximizing

`^{∗}_{1p}(β) = 1
n

n

X

i=1

Q_{iM}[Y_{i}ln ¯G_{R}(X_{iβ}) + (1 − Y_{i}) ln(1 − ¯G_{R}(X_{iβ}))]. (2.14)

### 3 Asymptotic Properties

In this section, the theoretical development is mainly based on the independent Bernoulli sampling scheme
in the M -phase case-control design. To simplify the complicated mathematical expressions, some concise
notations are introduced below. Let q(y, x^{[m]}) = q0(y, x^{[m]})/q1(y, x^{[m]}) with q`(y, x^{[m]}) = P (Y = y, X^{[m]} =
x^{[m]}, Q_{(m−`)} = 1) and q(Y, X^{[M ]}) = 0, p(y, x^{[m]}) = Qm

k=1q(y, x^{[k]}), and M_{β}^{k}(x, υ) = E[( eX −ex)^{⊗k}|X_{β} =
υ]ψβ(υ) with eX = (X2, . . . , Xd)^{>}, ( eX −x)e ^{⊗0} = 1, ( eX −x)e ^{⊗1}= ( eX −ex), and ( eX −x)e ^{⊗2} = ( eX −x)( ee X −ex)^{>}.
In addition, the target functions of the estimators ∂_{β}^{k}ψb_{yβ}(x_{β}), k = 0, 1, 2, and ∂_{β}Gb_{β}(x_{β}) are denoted by
ψ_{yβ}^{(k)}(x, υ) = ∂_{υ}^{k}(G^{y}_{β}(υ)(1 − G_{β}(υ))^{1−y}M_{β}^{k}(x, υ)) and H_{β}(x, υ) = ψ_{1β}^{(1)}(x, υ)

P1

y=0ψ_{yβ}(υ)− G_{β}(υ)
P1

y=0ψ_{yβ}^{(1)}(x, υ)
(P1

y=0ψ_{β}(υ))^{2}.

### 3.1 Consistency and Limiting Distribution

For the convenience of derivation of the main results, the following conditions are assumed throughout this article:

(A2) |n(y, x^{[m]})/n − q_{0}(y, x^{[m]})| = O_{p}(n^{−1/2}) and q_{1}(y, x^{[m]}) > 0 ∀(y, x^{[m])}), m = 1, . . . , (M − 1).
(A3) X and B are compact and β0 is an interior of B.

(A4) ∂_{υ}^{3}ψ_{yβ}(υ), and ∂_{υ}^{3}M_{β}^{`}(x, υ)are uniformly Lipschitz continuous in (υ, x, β).
(A5) ^{P}^{1}_{y=0}ψ_{yβ}_{0}(υ) > 0 and 0 < G_{β}_{0}(υ) < 1 for all υ.

(A6) h = h0n^{−ζ} for some positive constantsh0 and ζ ∈ (1/8, 1/5).
(A7) E[H_{β}^{⊗2}

0 (X, X_{β}_{0})] is non-singular.

The consistency of bβ_{h} and the limiting distribution of √

n( bβ_{h}− β_{0}) are established below.

Theorem 1. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, bβh

−→ βa.s. _{0} and

√n( bβ_{h}− β_{0})−→ N^{d} _{d−1}(0, Σ_{1}(β_{0})), where Σ_{1}(β_{0}) = V_{2}^{−1}(β_{0})V_{1}(β_{0})V_{2}^{−1}(β_{0}) with

V1(β) = V2(β) +

M −1

X

m=1

E[1 − q(Y, X^{[M −m+1]})
p(Y, X^{[M −m+1]})

m

X

k=1

p^{2}(Y, X^{[M −k+1]})(J_{β}^{⊗2}(Y, X^{[M −k+1]}) − J_{β}^{⊗2}(Y, X^{[M −k]}))],

V2(β) =

M

X

m=1

E[(1 − q(Y, X^{[m+1]}))p(Y, X^{[m]})J_{β}^{⊗2}(Y, X^{[m]})], Jβ(y, x^{[m]}) = (−1)^{1−y}R H_{β}(x, x_{β})d_{x}[−m]F_{m}(x)
IGβ(y, x^{[m]}) .

When Fm(x) or any√

n-consistent estimator is in place of bFm(x), the resulting estimator can also achieve
the efficiency bound of bβ_{h}. More precisely speaking, the variation of bF_{m}(x) does not influence the asymp-
totic variance of bβ_{h}. For the asymptotic behaviors of bβ^{∗}_{h} and ¯β_{h}, their derivations are very similar to the
proof of Theorem 1 and are omitted here.

Theorem 2. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞, (i) ¯βh

−→ βa.s. _{0} and√

n( ¯βh− β_{0})−→ N^{d} _{d−1}(0, Σ20(β0)), and
(ii) bβ_{h}^{∗}−→ β^{a.s.} _{0} and √

n( bβ_{h}^{∗}− β_{0})−→ N^{d} _{d−1}(0, Σ_{21}(β_{0})),
where Σ2`(β0) = V_{22`}^{−1}(β0)V21`(β0)V_{22`}^{−1}(β0), ` = 0, 1, with

V_{21`}(β) = V_{22`}(β) + E[

M −1

X

m=1

1 − q(Y, X^{[m]})

p(Y, X^{[m]}) p^{2}(Y, X^{[M −1]})(Y − G_{R}(X_{β}))^{2}H_{R}^{⊗2}(X, X_{β})
(GR(Xβ)(1 − GR(Xβ))^{2`} ],
V_{22`}(β) = E[p(Y, X^{[M −1]})(Y − G_{R}(X_{β}))^{2}H_{R}^{⊗2}(X, X_{β})

(G_{R}(X_{β})(1 − G_{R}(X_{β})))^{2`} ], H_{R}(x, x_{β}) = GR(xβ)(1 − GR(xβ))

G_{β}(x_{β})(1 − G_{β}(x_{β}))H_{β}(x, x_{β}).

We write A B if the matrix A − B is positive semidefinite. By using V_{220}^{−1}(β_{0}) − V_{221}^{−1}(β_{0}) 0 and
E[

1

X

`=0

(−1)^{`}(

M −1

X

m=1

1 − q(Y, X^{[m]})

p(Y, X^{[m]}) )p^{2}(Y, X^{[M −1]})(Y − G_{R}(X_{β}_{0}))^{2}H_{R}^{⊗2}(X, X_{β}_{0})

(G_{R}(X_{β}_{0})(1 − G_{R}(X_{β}_{0}))^{2−`} ] 0, (3.1)
it follows that

0 E[(

1

X

`=0

(−1)^{`}V_{22`}^{−1}(β_{0})(

M −1

X

m=1

1 − q(Y, X^{[m]})

p(Y, X^{[m]}) )^{1/2}p(Y, X^{[M −1]})(Y − G_{R}(X_{β}_{0}))H_{R}(X, X_{β}_{0})
(G_{R}(X_{β}_{0})(1 − G_{R}(X_{β}_{0}))^{`} )^{⊗2}]

(Σ_{20}(β_{0}) − Σ_{21}(β_{0})) − (V_{210}^{−1}(β_{0}) − V_{211}^{−1}(β_{0})) (Σ_{20}(β_{0}) − Σ_{21}(β_{0})), (3.2)

which implies that bβ_{h}^{∗}is asymptotically more efficient than ¯β_{h}. Along the same line as those in (3.1)-(3.2),
βb_{h} can also be shown to be asymptotically more efficient than bβ_{h}^{∗}. With a full random sample, i.e. the
selection probabilities are all equal to one, we further conclude from Theorems 1-2 that√

n( bβ_{h}− β_{0}) and

√n( bβ_{h}^{∗}− β_{0}) have the same limiting distribution Nd−1(0, (E[H^{⊗2}(X, Xβ0)/Gβ0(Xβ0)(1 − Gβ0(Xβ0))])^{−1})
and √

n( ¯β_{h}− β_{0})→ N^{d} _{d−1}(0, (E[G_{β}_{0}(X_{β}_{0})(1 − G_{β}_{0}(X_{β}_{0}))H^{⊗2}(X, X_{β}_{0})])^{−1}).

Obviously, one can see from (2.4) and (2.6) that β_{0}and eβ are also the maximizers of A_{0}(β) = P (X_{iβ} >

X_{jβ}, Y_{i} = 1, Y_{j} = 0) and bA_{0}(β) = Pn
i=1

P

j6=iQ(Y_{i}, X_{i}^{[M −1]})Q(Y_{j}, X_{j}^{[M −1]})A_{ij}(β)/n(n − 1), respectively.

In the theoretical development, the uniform convergence and the asymptotic approximation of the U-
statistic eA_{0}(β) =Pn

i=1

P

j6=iA_{ij}(β)/n(n − 1) are repeatedly used. As for the difference in the asymptotic
variances of bA_{0}(β) and eA_{0}(β), it is mainly caused by the variation from selection probabilities. Two

assumptions are further imposed for the derivation of the large sample properties of eβ and bA_{1}( eβ):

(A8) ∂_{β}^{2}Ψyβ(υ), y = 0, 1, are uniformly Lipschitz continuous in(υ, β).
(A9) V_{2A}(β_{0}) is non-singular.

Theorem 3. Suppose that assumptions (A1)-(A3) and (A8)-(A9) are satisfied. When n −→ ∞,
(i) eβ −→ β^{p} _{0} and √

n( eβ − β_{0})−→ N^{d} _{d−1}(0, Σ_{3}(β_{0})), and
(ii) bA_{1}( eβ)−→ A(β^{p} _{0}) and√

n( bA_{1}( eβ) − A(β_{0}))−→ N (0, σ^{d} _{3}^{2}(β_{0})),

where Σ_{3}(β_{0}) = V_{2A}^{−1}(β_{0})V_{1A}(β_{0})V_{2A}^{−1}(β_{0}) and σ_{3}^{2}(β_{0}) = [σ_{0}^{2}(β_{0})+V ar(P_{1}

y=0Ψ_{yβ}_{0}(X_{β}_{0})−A(β_{0})(1−2p)(Y −
p))]/[p(1 − p)]^{2} with

σ_{0}^{2}(β_{0}) =

M −1

X

m=1

E[1 − q(Y, X^{[m]})
p(Y, X^{[m]}) (

1

X

y=0

Ψ_{yβ}_{0}(X_{β}_{0}))^{2}], V_{0A}(β_{0}) = E[(

1

X

y=0

∂_{β}Ψ_{yβ}_{0}(X_{β}_{0}))^{⊗2}],

V_{1A}(β_{0}) = V_{0A}(β_{0}) +

M −1

X

m=1

E[1 − q(Y, X^{[m]})
p(Y, X^{[m]}) (

1

X

y=0

∂_{β}Ψ_{yβ}_{0}(X_{β}_{0}))^{⊗2}], V_{2A}(β_{0}) = 0.5E[

1

X

y=0

∂_{β}^{2}Ψ_{yβ}_{0}(X_{β}_{0})].

We note that the asymptotic properties of eβ still hold even the monotonic assumption on G(υ) is
violated. Following along the same lines as the proof of Theorem 3(ii), both bA_{1}( bβ_{h}) and bA_{1}( ¯β_{h}) can
be shown to have the same asymptotic distribution as that of bA_{1}( eβ). As stressed in the foregoing,
βb_{h} avoids the computational cost and the complexity in the maximization of bA_{1}(β) with respect to β.

With a full random sample, the asymptotic variances Σ3(β0) and σ^{2}_{3}(β0) in Theorem 3 will reduce to
V_{2A}^{−1}(β_{0})V_{0A}(β_{0})V_{2A}^{−1}(β_{0}) and [V ar(A(β_{0})(1 − 2p)(Y − p) +P1

y=0Ψ_{yβ}_{0}(X_{β}_{0}))]/[p(1 − p)]^{2}. Using the final-
phase data in estimation, the asymptotic semiparametric efficiency of bβ^{∗}_{h} is evidenced by the argument
of Ichimuri (1993). As claimed by Sherman (1993), one can also conclude that eβ does not achieve the
semiparametric efficiency bound. Conclusively, bβ_{h}is the asymptotically most efficient among the proposed
estimators.

### 3.2 Bootstrap Inferences

Because a direct estimation of the asymptotic variance Σ_{1}(β_{0}) is complicated and impractical, we employ a
bootstrap technique to construct confidence intervals for β_{0} or the related parameters of interest. Suppose
that ξ , . . . , ξ are randomly drawn from a multinomial distribution M ulti(1, n^{−1}, . . . , n^{−1}) with ξ =

(ξj1, . . . , ξjn)^{>} and wk =Pn

j=1ξjk/n, j, k = 1, . . . , n. The bootstrap analogue bβwh of bβh is defined to be the maximizer of the following bootstrap pseudo log-likelihood:

`_{wp}(β) =

n

X

i=1 M

X

m=1

w_{i}Q_{im}(1 − R_{i(m+1)}) ln cIG_{wβ}(Y_{i}, X_{i}^{[m]}), (3.3)

where bGwβ(X_{β}^{[m]}[m]+ x^{[−m]}_{β}[−m]) and bFwm(X^{[m]}, x^{[−m]}) are the bootstrap analogues of bGβ(X_{β}^{[m]}[m]+ x^{[−m]}_{β}[−m]) and
Fb_{m}(X^{[m]}, x^{[−m]}) with Q_{w}(y, x^{[M −1]}) = wQ_{M}/bp_{w}(y, x^{[M −1]}) andpb_{w}(y, x^{[M −1]}) =QM −1

k=1 qb_{w}(y, x) substitut-
ing for Q(y, x^{[M −1]})/n and p(y, xb ^{[M −1]}), bqw(y, x^{[m−1]}) = n(y, x^{[m−1]})/qbw1(y, x^{[m−1]}), andqbw1(y, x^{[m−1]}) =
Pn

j=1w_{j}Q_{j(m−1)}I(Y_{j} = y, X_{j}^{[m−1]} = x^{[m−1]}). It is derived in the next theorem that the sampling distri-
bution of bβ_{h} can be approximated by the frequency distribution of bβ_{wh}.

Theorem 4. Suppose that assumptions (A1)-(A7) are satisfied. When n −→ ∞,
P_{w}(√

n( bβ_{wh}− bβ_{h}) ≤ u) − P (√

n( bβ_{h}− β_{0}) ≤ u)−→ 0^{p} ∀u = (u_{2}, . . . , u_{d})^{T}, (3.4)
where Pw(·) = P (·|(SM

m=2D_{0m})S D_{1M}).

The validity of Theorems 1 and 4 enables us to give the normal-type and quantile-type bootstrap confidence intervals for β0k by

βb_{kh}± z_{1−α/2}se_{w}( bβ_{wkh}− bβ_{kh}) and ( bβ_{kh}− q_{w(1−α/2)}( bβ_{wkh}− bβ_{kh}), bβ_{kh}− q_{wα/2}( bβ_{wkh}− bβ_{kh})), (3.5)
k = 2, . . . , (d − 1), where z_{1−α/2} is the (1 − α/2)th quantile value of the standard normal distribution, and
se_{w}(·) and q_{wς}(·) denote the standard error and the 100ςth percentile of B bootstrap estimates. Generally,
appropriate bootstrap replications are required to ensure the numerical stability.

### 4 Numerical Experiments and Applications

Under the setting of two-phase sampling, we conducted a class of simulation experiments to assess the performances of the proposed estimators and the related inference procedures. In the simulation process, data were generated from 1000 replications and the bootstrap inferences were based on 500 bootstrap samples, which enable us to obtain stable numerical results. In the last scenario of the simulation, 1000 random samples of size 1500 were repeatedly taken from the full data of Wilms’ tumor studies and three-phase subsampling of a smaller size was further undertaken from each sample. We quantify and investigate the efficiency loss incurred by the PMLE for a three-phase random sample relative to that for

a full random sample. Finally, a two-phase subsample drawn from a low birthweight study was used to illustrate the applicability of our developed approaches.

### 4.1 Simulation I - Estimation and Bootstrap Confidence Intervals

The covariates X = X^{(2)} = (X_{1}, X_{2}, X_{3})^{>} were designed to follow a trivariate normal distribution with
mean (0, 0, 0)^{>}, standard deviation (1, 1, 1)^{>}, and correlation coefficient ρ of 0.5. The conditional distri-
bution of the binary response Y was set to be the logistic regression model

(Model 1) G_{β}_{0}(x_{β}_{0}) = exp(−5.8 + 6x_{β}_{0})
1 + exp(−5.8 + 6x_{β}_{0})

with the index coefficients β_{0} = (1, −0.6, −0.4)^{>}. Data with sample sizes of 250 and 500 were simulated
according to the above design. With the correlation of 0.5, the marginal probability of Y is computed to be
about 0.16. Under the two-phase case-control sampling with n(y), y = 0, 1, being specified at the second
phase, the combinations of subsamples (n(1), n(0)) were set to be (25k, 50k), (25k, 100k), (50k, 50k), or
(50k, 100k) for n = 250k, k = 1, 2. The corresponding response selective probabilities (q(1),b q(0)) willb
approach (0.5, 0.25), (0.5, 0.5), (1, 0.25), or (1, 0.5) as the sample size increases and the specified conditions
are similar in nature to those found in application.

Table 4.1 displays the means and the standard deviations of 1000 estimates of bβ_{h}, ¯β_{h}, and eβ. One
can expect that the standard deviations of these estimators tend to be smaller for a larger sample size.

Generally speaking, the considered sampling scheme will cause relatively small subsample sizes of the
second-phase covariate data and greater variations in estimators. Compared with the pseudo likelihood
and the pseudo least squares estimation approaches, the estimator obtained from the optimal AUC crite-
rion tends to have a relatively larger bias, particularly in small subsamples. It is indicated from this table
that bβhis generally the most efficient relative to ¯βh and eβ. As for the accuracy measure A(β0), both of the
estimators bA_{1}( bβ_{h}) and bA_{1}( eβ) have fairly similar performance in terms of mean squared errors although
the variation of bA_{1}( eβ) is slightly smaller (Table 4.2). In addition, the CPU time for the computation of
βb_{h} is much shorter than those for eβ in the numerical experiments.

Table 4.3 further provides the bootstrap estimates for the standard deviations of the PMLEs, the empirical coverage probabilities, the 0.95 quantile intervals of 1000 estimates, and the averages of 1000 normal-type and quantile-type bootstrap confidence intervals. It seems that the asymptotic variance

is slightly overestimated by the bootstrap estimator while its accuracy is improved as the sample size increases. Although the constructed confidence intervals are typically wider than the corresponding true quantile intervals, their differences are not strongly associated with the sample size. The empirical coverage probabilities of normal-type bootstrap intervals are also detected to be higher than those of the quantile-type ones. Overall, the constructed 95% confidence intervals possess coverage probabilities which are quite close to the nominal value for appropriate sample sizes.

### 4.2 Simulation II - Model Checking

In this simulation scenario, the covariates were specified to be the same as those in Section 4.1. We assess the performance of our testing procedures for model correctness through Model 1 and the following model:

(Model 2) P (Y = 1|X = x) = exp(−5 + 6(x_{1}x_{2}− 0.6x^{3}_{3}))
1 + exp(−5 + 6(x1x2− 0.6x^{3}_{3})).

It is noted that Model 2 deviates from the structure of the SIM in (1.1). Although ¯β_{h}is not asymptotically
efficient relative to bβ_{h} and eβ when G_{β}_{0}(υ) is a strictly increasing function, the pseudo least squares
estimation procedure is useful in the development of test rules.

With different sample sizes and combinations of subsamples at the second phase, the estimated sizes and powers, and the rejection proportions based on the SIC measure are provided in Table 4.4. When the SIM is valid, the simulation results indicate that the estimated sizes of the wild bootstrap test are all less than the nominal size at 0.05. In addition, the rejection probabilities of the SIC criterion are higher than the estimated sizes and powers of the wild bootstrap test, especially in a small sample. It is further found that the rejection probabilities may not be readily decreased as the sample size increases. Conclusively, a larger size of the second-phase covariate sample is typically associated with an increase in power.

### 4.3 Investigation of Efficiency Loss in the PMLE

The analyzed data were collected from the National Wilms’ Tumor Study Group and were further provided by Kulich and Lin (2004). The Wilms’ tumor, also called nephroblastoma, is a medical diagnosis and is the most common malignant (cancerous) tumor of kidney in childhood. A total of 3915 children, who had been treated for Wilms’ tumor, participated in the third and fourth Wilms’ tumor studies (NWTS-3 and NWTS-4). Measurements taken in our data analysis included: relapse within three years (relaps),

local histological type of the tumor (lhist) with unfavorable histologies being coding as 1 and favorable histologies being 0, stage of disease at diagnosis (stage) with stages 1 to 4, levels of age at diagnosis with age1 and age2 indicating the statuses of age≤ 1 year and age in (1,4) years, central histology (thist), which is the true histological type of the tumor and is coded as lhist, and tumor diameter (diam) in centimeters. In addition, the numbers of cases and controls in the cohort were 603 and 3312, respectively.

The primary research interest is to investigate the influences of the factors stage, (age1, age2), thist,
and diam on the binary response variable relapse. Since the paired covariates (thist, this ∗ age_{2}),
(stage, stage ∗ diam), and (diam, stage ∗ diam) are highly correlated with correlation coefficients of
0.641, 0.671, and 0.848, respectively, the SIM with the linear predictor thist + β_{02}stage + β_{03}age_{1} +
β04age2+ β05(thist ∗ age1) + β06diam is used to characterize the conditional probability of relapse. In
this numerical experiment, we drew a random sample of size 1500 from the full data and repeated this
process 1000 times. A subsample from each simulated data was obtained by further undertaking the
three-phase case-control sampling. In the initial phase, there were 24 strata defined by the values of
X^{(1)} = (relapse, lhist, stage, age_{1}, age_{2}). The selection probabilities in the strata with {relapse = 1}

were specified to be 1 in the subsequent phases. For those with {relapse = 0}, 100 units were sampled
from each stratum with size ≥ 100 and all units were taken with size < 100 at the second phase. In this
phase, the values of the expensive covariate X^{(2)} = thist were obtained for sampled individuals. At the
third phase, 25 units and all units were sampled with sizes of strata ≥ 25 and < 25, respectively. There
were, on average, about 1170 children in the second-phase covariate data and 600 ones in the complete
covariate data. Our another aim is to assess the loss of estimation efficiency caused by such a sampling
design. In Table 4.5, we provide the averages of 1000 parameter estimates and the corresponding standard
errors in the sample data analysis as well as in three-phase data analyses. The averages of 1000 PMLEs
for a full random sample and three-phase data are quite close to each other. As one can expect, the
efficiency loss of the PMLE is related with the size of subsample taken for each ascertained covariate.

### 4.4 Application to Auckland Birthweight Collaborative Study

The Auckland Birthweight Collaborative (ABC) study was conducted from 1995 to 1997. The primary research interest focuses on exploring potential demographic characteristics and risk factors of pregnant