
ESTIMATION OF A LOGISTIC REGRESSION MODEL WITH MISMEASURED OBSERVATIONS

K. F. Cheng and H. M. Hsueh

National Central University and National Chengchi University

Abstract: We consider the estimation problem of a logistic regression model. We assume the response observations and covariate values are both subject to measurement errors. We discuss some parametric and semiparametric estimation methods using mismeasured observations with validation data and derive their asymptotic distributions. Our results are extensions of some well known results in the literature. Comparisons of the asymptotic covariance matrices of the studied estimators are made, and some lower and upper bounds for the asymptotic relative efficiencies are given to show the advantages of the semiparametric method. Some simulation results also show the method performs well.

Key words and phrases: Kernel estimation, estimated likelihood, logistic regression, measurement error, misclassification.

1. Introduction

Logistic regression is the most popular form of binary regression; see Cox (1970) and Pregibon (1981). Researchers often use logistic regression to estimate the effect of various predictors on some binary outcome of interest. Basically, the model assumes that the log of the odds of the outcome is a linear function of the predictors. That is, suppose that the variables $(X_i, Y_i)$ follow the model
$$\Pr(Y_i = 1 \mid X_i = x_i) = \frac{\exp(x_i^T\beta_0)}{1 + \exp(x_i^T\beta_0)} \equiv F(x_i^T\beta_0),$$
where the $X_i$ are random $d$-vector predictors and the $Y_i$ are Bernoulli response variables. Usually, the maximum likelihood (ML) method is used to estimate the regression coefficients $\beta_0$. Under some regularity conditions, the maximum likelihood estimator of $\beta_0$ is asymptotically normal.
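For concreteness, here is a minimal sketch (ours, not part of the paper) of this ML fit via Newton-Raphson; all variable names are illustrative.

```python
import numpy as np

def fit_logistic_ml(X, y, n_iter=25):
    """Maximum likelihood for logistic regression via Newton-Raphson.

    X : (n, d) design matrix (include a column of ones for the intercept).
    y : (n,) array of 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # F(x_i^T beta)
        score = X.T @ (y - p)                      # sum_i {y_i - F(x_i^T beta)} x_i
        info = (X * (p * (1 - p))[:, None]).T @ X  # observed information
        beta += np.linalg.solve(info, score)
    return beta

# Example: simulate from Pr(Y = 1 | x) = F(beta0 + beta1 * x1) with (beta0, beta1) = (0, 1)
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([np.ones(500), x1])
y = rng.binomial(1, 1 / (1 + np.exp(-x1)))
print(fit_logistic_ml(X, y))  # approximately (0, 1)
```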

The ML method requires that the data consist of precise measurements of the binary outcomes and the predictors. However, the data are often not measured perfectly. For instance, Golm, Halloran and Longini (1998, 1999) mentioned that the exposure-to-infection information collected for estimating vaccine efficacy may be mismeasured. On the other hand, Albert, Hunsberger and Bird (1997) and Bollinger and David (1997) gave examples showing that the binary outcome of interest may also be misclassified. It is generally true that the usual analyses based on mismeasured observations lead to inconsistent estimation.

The topic of binary regression when the predictors $X_i$ are measured with error has been the subject of several recent papers; see Carroll, Spiegelman, Lan, Bailey and Abbott (1984), Carroll and Wand (1991), Reilly and Pepe (1995), and Lawless, Kalbfleisch and Wild (1999), among others. When the binary responses $Y_i$ are subject to misclassification, Pepe (1992) and Cheng and Hsueh (1999) discussed bias correction methods for the estimation of logistic regression parameters. In this paper, we study the estimation of $\beta_0$ when both the observations $Y_i$ and the predictors $X_i$ are measured with error. Parametric and semiparametric methods are discussed. We find that the proposed semiparametric estimation method is a generalization of the pseudolikelihood method of Carroll and Wand (1991) and the estimated likelihood method of Pepe (1992). Many other estimating approaches have been proposed in the literature. A method based on the mean score was proposed by Pepe, Reilly and Fleming (1994) and Reilly and Pepe (1995) for the mismeasured outcome data problem and the mismeasured covariate data problem, respectively. However, Golm et al. (1999) and Lawless et al. (1999) argued that this semiparametric method is less efficient. Robins, Rotnitzky and Zhao (1994) proposed a class of semiparametric efficient estimators for the model with mismeasured predictors based on the inverse probability weighted estimating equation approach. When both the response variable and the predictors are subject to measurement error, the semiparametric efficient estimator has not been formally derived yet. In this paper we emphasize only the various imputation approaches; weighting methods, such as the semiparametric efficient estimation derived by Robins et al. (1994), will not be discussed further.

We assume in this paper that the complete data set consists of a primary sample plus a smaller validation subsample obtained by a double sampling scheme. An extension in which the selection probabilities for the validation data set depend on the observed surrogate covariates is discussed briefly in Section 3. Asymptotically, we also suppose that the validation subsample size is a fraction of the primary sample size. The estimators under discussion are formally defined in Section 2. In Section 3, asymptotic results are derived for the semiparametric and parametric estimators of $\beta_0$. Comparisons of their asymptotic covariance matrices are given in Section 4. Further, finite sample properties are explored through a simulation study in Section 5.

2. Estimation Methods

Suppose the true random variables $(Y, X)$ are subject to mismeasurement and the surrogate observations are represented by $(Y^0, W)$, where $X = (1, X_1)^T$. A smaller validation subsample is also observed in order to understand the mismeasurement structure. The sampling scheme is to randomly select $k$ units from the primary sample, and at the selected units the true measurement devices are used to obtain the validation data. Here simple random sampling is considered and, for the sake of simplicity, the first $k$ units of the primary sample are assumed to be the selected units. Thus we have the validation subsample $\{(Y^0_i, Y_i, X_i, W_i),\ i = 1, \ldots, k\}$. Later, in Section 3, the theoretical results will be given for the case where the sampling scheme depends on the value of the surrogate predictors $W$.

From the primary sample, we denote the regression function of $Y^0$ on $W = w$ as $\pi^0(w) = \Pr(Y^0 = 1 \mid w)$. Assuming $Y$ and $W$ are mutually independent given $X$, and $Y^0$ and $X$ are mutually independent given $(Y, W)$, the above regression function can be rewritten as $\pi^0(w) = \pi(w, \beta_0)$, where
$$\pi(w, \beta) = \{1 - \theta^0(w)\}\,E\{F(X^T\beta) \mid w\} + \phi^0(w)\,E\{\bar F(X^T\beta) \mid w\} \equiv 1 - \bar\pi(w, \beta).$$
Here the expectation is taken with respect to the conditional density $f^0_{X|W}$, $\bar F(\cdot) = 1 - F(\cdot)$, and the misclassification probability functions $\theta^0(w)$ and $\phi^0(w)$ are defined by $\theta^0(w) = \Pr(Y^0 = 0 \mid Y = 1, w)$ and $\phi^0(w) = \Pr(Y^0 = 1 \mid Y = 0, w)$. Note that the surrogate $W$ is often a fallible measurement of $X$ and corresponds to a coarser partition of the sample space. The conditional independence of $Y^0$ and $X$ given $(Y, W)$ is implied if the misclassification scheme depends on $X$ only through the value of $W$. It is clear that the expectation $E[\{Y^0 - F(W^T\beta_0)\}W^T]$ is not necessarily zero. Hence it is inappropriate to apply the usual likelihood method to the primary sample for inference about $\beta_0$. In the following, we discuss some consistent estimates based on different imputation methods.
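As a quick numerical illustration of this inconsistency (ours, with an arbitrarily chosen error model), one can simulate surrogates $(Y^0, W)$ and fit the naive logistic regression to them, reusing the `fit_logistic_ml` sketch above; the fitted slope is strongly attenuated away from the true value 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
x1 = rng.normal(0.0, 0.5, n)                 # true predictor
y = rng.binomial(1, 1 / (1 + np.exp(-x1)))   # true model: beta = (0, 1)
w1 = x1 + rng.normal(0.0, 0.5, n)            # surrogate predictor (measurement error)
flip = rng.random(n) < 0.1                   # 10% response misclassification
y0 = np.where(flip, 1 - y, y)                # surrogate response

W = np.column_stack([np.ones(n), w1])
print(fit_logistic_ml(W, y0))                # slope well below 1: the naive fit is biased
```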

Suppose first that $f^0_{X|W}$, $\theta^0(w)$ and $\phi^0(w)$ are known a priori. Then for an observation $(y^0_j, w_j)$, $j = k+1, \ldots, n$, in the primary sample, the corresponding log-likelihood and score function are
$$L^*(\beta \mid y^0_j, w_j) = y^0_j \ln \pi(w_j, \beta) + (1 - y^0_j) \ln \bar\pi(w_j, \beta),$$
$$S^*(\beta \mid y^0_j, w_j) \equiv \frac{\partial L^*(\beta \mid y^0_j, w_j)}{\partial\beta} = \frac{y^0_j - \pi(w_j, \beta)}{\pi(w_j, \beta)\,\bar\pi(w_j, \beta)}\,\{1 - \theta^0(w_j) - \phi^0(w_j)\}\,E\{F(X_j^T\beta)\bar F(X_j^T\beta)X_j \mid w_j\}.$$
Since
$$\frac{y^0_j - \pi(w_j, \beta)}{\pi(w_j, \beta)\,\bar\pi(w_j, \beta)} = \frac{2y^0_j - 1}{y^0_j\,\pi(w_j, \beta) + (1 - y^0_j)\,\bar\pi(w_j, \beta)},$$
$$1 - \theta^0(w_j) - \phi^0(w_j) = \{1 - \theta^0(w_j) - \pi(w_j, \beta)\}\,\frac{E\{F(X_j^T\beta) \mid w_j\}}{E\{F(X_j^T\beta) \mid w_j\}\,E\{\bar F(X_j^T\beta) \mid w_j\}},$$
we have
$$\frac{y^0_j - \pi(w_j, \beta)}{\pi(w_j, \beta)\,\bar\pi(w_j, \beta)}\,\{1 - \theta^0(w_j) - \phi^0(w_j)\} = \frac{E(Y_j \mid y^0_j, w_j) - E\{F(X_j^T\beta) \mid w_j\}}{E\{F(X_j^T\beta) \mid w_j\}\,E\{\bar F(X_j^T\beta) \mid w_j\}}.$$
Consequently,
$$S^*(\beta \mid y^0_j, w_j) = \frac{E(Y_j \mid y^0_j, w_j) - E\{F(X_j^T\beta) \mid w_j\}}{E\{F(X_j^T\beta) \mid w_j\}\,E\{\bar F(X_j^T\beta) \mid w_j\}}\,E\{F(X_j^T\beta)\bar F(X_j^T\beta)X_j \mid w_j\} = \frac{\{A^0_1(y^0_j, w_j) - A^0_2(y^0_j, w_j)\}\,E\{F(X_j^T\beta)\bar F(X_j^T\beta)X_j \mid w_j\}}{A^0_1(y^0_j, w_j)\,E\{F(X_j^T\beta) \mid w_j\} + A^0_2(y^0_j, w_j)\,E\{\bar F(X_j^T\beta) \mid w_j\}},$$
where $A^0_1(y^0_j, w_j) = \Pr(Y^0_j = y^0_j \mid Y_j = 1, w_j) = \{\theta^0(w_j)\}^{1-y^0_j}\{1 - \theta^0(w_j)\}^{y^0_j}$ and $A^0_2(y^0_j, w_j) = \Pr(Y^0_j = y^0_j \mid Y_j = 0, w_j) = \{\phi^0(w_j)\}^{y^0_j}\{1 - \phi^0(w_j)\}^{1-y^0_j}$. Using this result, we easily see that the MLE $\hat\beta_f$ of $\beta_0$ can be obtained by solving the likelihood equations
$$\sum_{i=1}^{k} S(\beta \mid y_i, x_i) + \sum_{j=k+1}^{n} S^*(\beta \mid y^0_j, w_j) = 0, \qquad (1)$$
where $S(\beta \mid y_i, x_i) = \{y_i - F(x_i^T\beta)\}\,x_i$.

Unfortunately, in applications $f^0_{X|W}(\cdot)$, $\theta^0(\cdot)$ and $\phi^0(\cdot)$ are rarely known and $\hat\beta_f$ cannot be obtained. If the subsample size $k$ is large enough, a simple estimate $\hat\beta_s$ can be obtained by using only the validation subsample; i.e., $\hat\beta_s$ satisfies $\sum_{i=1}^{k} S(\beta \mid y_i, x_i) = 0$. Such a simple estimate is in general not efficient; see Cheng and Hsueh (1999). The basic reason is that the information contained in the primary sample is not properly exploited. To exploit it, assume there exist parametric functions and parameters $(\gamma_0, \alpha_0)$, independent of $\beta_0$, such that $\theta^0(W) = \theta(W; \alpha_0)$, $\phi^0(W) = \phi(W; \alpha_0)$ and $f^0_{X|W}(X \mid W) = f(X \mid W; \gamma_0)$ almost surely; see Carroll and Wand (1991) and Cheng and Hsueh (1999) for related discussions. Then one can employ the usual approach to obtain the joint MLE $(\hat\beta_m, \hat\alpha_m, \hat\gamma_m)$. On the other hand, the maximum pseudolikelihood estimate (MPLE) $\hat\beta_p$ can be obtained by solving (1) with $\alpha_0$ and $\gamma_0$ in $S^*(\beta \mid y^0_j, w_j)$ replaced by their MLEs $\hat\alpha_p$ and $\hat\gamma_p$ derived from the validation data.

In general, $\hat\beta_m$ can be shown to be asymptotically more efficient than $\hat\beta_p$. However, in solving the equations for finite samples, the computational complexity grows (and numerical stability deteriorates) with the number of unknown parameters, so that $\hat\beta_m$ may not perform properly; see the simulation results in Section 5.

The last approach to be discussed is a semiparametric method. Assuming $f^0_{X|W}(\cdot \mid w)$, $\theta^0(w)$ and $\phi^0(w)$ in $S^*(\beta \mid y^0_j, w_j)$ of (1) are unknown functions, a nonparametric method is used to estimate them. Formally, we propose the MPLE $\hat\beta_{sp}$ solving the pseudolikelihood equations
$$\sum_{i=1}^{k} S(\beta \mid y_i, x_i) + \sum_{j=k+1}^{n} \hat S^*(\beta \mid y^0_j, w_j) = 0,$$
$$\hat S^*(\beta \mid y^0, w) = \frac{\{\hat A_1(y^0, w) - \hat A_2(y^0, w)\}\,\hat E\{F(X^T\beta)\bar F(X^T\beta)X \mid w\}}{\hat A_1(y^0, w)\,\hat E\{F(X^T\beta) \mid w\} + \hat A_2(y^0, w)\,\hat E\{\bar F(X^T\beta) \mid w\}}.$$
Here, for any integrable function $g(x; \beta)$, the estimated expectation is the kernel regression estimate
$$\hat E\{g(X; \beta) \mid w\} = \frac{\sum_{i=1}^{k} K_h(w - w_i)\,g(x_i; \beta)}{\sum_{i=1}^{k} K_h(w - w_i)}.$$
Moreover, $\hat A_1(y^0, w)$ and $\hat A_2(y^0, w)$ are estimates of $A^0_1(y^0, w)$ and $A^0_2(y^0, w)$, with $\theta^0(w)$ and $\phi^0(w)$ replaced by their nonparametric estimates
$$\hat\theta(w) = \frac{\sum_{i=1}^{k} K_h(w - w_i)\,y_i(1 - y^0_i)}{\sum_{i=1}^{k} K_h(w - w_i)\,y_i}, \qquad \hat\phi(w) = \frac{\sum_{i=1}^{k} K_h(w - w_i)\,(1 - y_i)\,y^0_i}{\sum_{i=1}^{k} K_h(w - w_i)\,(1 - y_i)}.$$
The kernel function $K(t)$ is taken to be a density function, $K_h(t) = h^{-1}K(t/h)$, and the bandwidth $h$ depends on $n$ and tends to zero as $n \to \infty$.
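To make the kernel quantities concrete, here is a minimal sketch (ours) of $K_h$, $\hat E\{g(X;\beta) \mid w\}$, $\hat\theta(w)$ and $\hat\phi(w)$ computed from the validation sample; a Gaussian kernel is used purely for simplicity (the simulations in Section 5 use the Epanechnikov kernel).

```python
import numpy as np

def K_h(t, h):
    """Scaled kernel K_h(t) = K(t/h)/h, with K a standard normal density."""
    u = t / h
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)

def E_hat(g_vals, w, w_val, h):
    """Kernel regression estimate of E{g(X; beta) | w}.

    g_vals : values g(x_i; beta) at the validation covariates x_1, ..., x_k.
    w_val  : validation surrogates w_1, ..., w_k.
    """
    wts = K_h(w - w_val, h)
    return np.sum(wts * g_vals) / np.sum(wts)

def theta_hat(w, w_val, y_val, y0_val, h):
    """Nonparametric estimate of theta^0(w) = Pr(Y^0 = 0 | Y = 1, w)."""
    wts = K_h(w - w_val, h)
    return np.sum(wts * y_val * (1 - y0_val)) / np.sum(wts * y_val)

def phi_hat(w, w_val, y_val, y0_val, h):
    """Nonparametric estimate of phi^0(w) = Pr(Y^0 = 1 | Y = 0, w)."""
    wts = K_h(w - w_val, h)
    return np.sum(wts * (1 - y_val) * y0_val) / np.sum(wts * (1 - y_val))
```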

We remark that the proposed semiparametric estimation method generalizes the methods of Carroll and Wand (1991) and Pepe (1992). If $Y = Y^0$ with probability one, i.e., only measurement error in the predictor occurs, $\hat\beta_{sp}$ is the MPLE of Carroll and Wand (1991). On the other hand, if $X = W$ with probability one, i.e., only misclassification of the response occurs, $\hat\beta_{sp}$ reduces to the maximum estimated likelihood estimate of Pepe (1992).

3. Asymptotic Distributions

The asymptotic distributions of the estimators proposed in the preceding section will be presented here for the case d = 1. Extension to general d is simple and the related results will be given in a remark. The asymptotic properties depend on certain regularity conditions.

A.1 $\beta_0 \in \Lambda$, an open set in $R^2$.

A.2 $E(X^2) < \infty$.

A.3 The misclassification probability functions $\theta^0(w), \phi^0(w) \in (0, 1)$ and the density function $f^0_W(w)$ of $W$ are strictly positive on the support of $W$. Furthermore, these functions and their $l$-th derivatives, $l = 1, 2$, are in $L_4$ and also satisfy a weighted Lipschitz condition. (A function $\eta(\cdot)$ is said to satisfy a weighted Lipschitz condition if there exist a constant $c$ and a bounded function $\psi$ in $L_4$ such that $|\eta(x) - \eta(y)| < \psi(x)|x - y|$ for all $x, y$.)

A.4 The parametric functions $\theta(w; \alpha)$, $\phi(w; \alpha)$ and $f(x \mid w; \gamma)$, and their partial derivatives $\partial\theta(w; \alpha)/\partial\alpha$, $\partial\phi(w; \alpha)/\partial\alpha$ and $\partial f(x \mid w; \gamma)/\partial\gamma$, are uniformly continuous at $(\alpha_0, \gamma_0)$ for all $x$ and $w$ in the supports of $X$ and $W$.

A.5 The function $K(\cdot)$ is a second-order kernel (see Gasser and Müller (1979)).

A.6 $h = h_n \to 0$ with $nh^2 \to \infty$ and $nh^4 \to 0$ as $n \to \infty$. Also, as $n \to \infty$, $k = rn\{1 + O(h^2)\}$, where $r \in (0, 1]$.

In the following, we give the basic asymptotic results under the assumption that $\alpha = (\alpha_0, \alpha_1)^T$ and $\gamma = (\gamma_0, \gamma_1)^T$. Except for the asymptotic normality of $\hat\beta_{sp}$, whose proof is given in the Appendix, the proofs for the other estimators are standard and thus omitted.

Theorem 1. Suppose conditions (A.1)−(A.6) hold and let $\tilde\beta$ be any estimator discussed in Section 2. Then as $n \to \infty$, $n^{1/2}(\tilde\beta - \beta_0)$ converges in distribution to a normal with mean zero. Their asymptotic covariance matrices are
$$\lim_{n\to\infty}\mathrm{Cov}\{n^{1/2}(\hat\beta_f - \beta_0)\} = \Sigma_f = \{r I_v + (1-r) I_v^c\}^{-1},$$
$$\lim_{n\to\infty}\mathrm{Cov}\{n^{1/2}(\hat\beta_s - \beta_0)\} = \Sigma_s = (r I_v)^{-1},$$
$$\lim_{n\to\infty}\mathrm{Cov}\{n^{1/2}(\hat\beta_m - \beta_0)\} = \Sigma_m = \Sigma_f + \frac{(1-r)^2}{r}\,\Sigma_f I_m \Sigma_f,$$
$$\lim_{n\to\infty}\mathrm{Cov}\{n^{1/2}(\hat\beta_p - \beta_0)\} = \Sigma_p = \Sigma_f + \frac{(1-r)^2}{r}\,\Sigma_f I_p \Sigma_f,$$
$$\lim_{n\to\infty}\mathrm{Cov}\{n^{1/2}(\hat\beta_{sp} - \beta_0)\} = \Sigma_{sp} = \Sigma_f + \frac{(1-r)^2}{r}\,\Sigma_f I_{sp} \Sigma_f.$$
Here $I_v = E\{F(X^T\beta_0)\bar F(X^T\beta_0)X^{\otimes 2}\}$ and
$$I_v^c = E\left[\frac{\{1-\theta^0(W)-\phi^0(W)\}^2\,[E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}]^{\otimes 2}}{\pi^0(W)\{1-\pi^0(W)\}}\right],$$
$$I_p = E\left[\frac{\{1-\theta^0(W)-\phi^0(W)\}\,E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}}{\pi^0(W)\{1-\pi^0(W)\}}\begin{pmatrix}\dot\pi^0_\alpha(W;\alpha_0,\gamma_0)\\ \dot\pi^0_\gamma(W;\alpha_0,\gamma_0)\end{pmatrix}^T\right] \times \begin{pmatrix} E\left\{\dfrac{F(X^T\beta_0)\,\dot\theta^{\otimes 2}(W;\alpha_0)}{\theta^0(W)\{1-\theta^0(W)\}} + \dfrac{\bar F(X^T\beta_0)\,\dot\phi^{\otimes 2}(W;\alpha_0)}{\phi^0(W)\{1-\phi^0(W)\}}\right\} & 0\\ 0 & E[\{\dot f(X \mid W;\gamma_0)\}^{\otimes 2}]\end{pmatrix}^{-1} \times E\left[\frac{\{1-\theta^0(W)-\phi^0(W)\}\,E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}}{\pi^0(W)\{1-\pi^0(W)\}}\begin{pmatrix}\dot\pi^0_\alpha(W;\alpha_0,\gamma_0)\\ \dot\pi^0_\gamma(W;\alpha_0,\gamma_0)\end{pmatrix}^T\right]^T,$$
$$I_{sp} = E\bigg(\frac{\{1-\theta^0(W)-\phi^0(W)\}^2\,[E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}]^{\otimes 2}}{[\pi^0(W)\{1-\pi^0(W)\}]^2}\Big[E\{F(X^T\beta_0) \mid W\}\,\theta^0(W)\{1-\theta^0(W)\} + E\{\bar F(X^T\beta_0) \mid W\}\,\phi^0(W)\{1-\phi^0(W)\} + \{1-\theta^0(W)-\phi^0(W)\}^2\,[E\{F^2(X^T\beta_0) \mid W\} - E^2\{F(X^T\beta_0) \mid W\}]\Big]\bigg),$$
$$I_m = \frac{r}{(1-r)^2}\,\Delta_{(\beta_0,\alpha_0,\gamma_0)}\{\Delta_{(\alpha_0,\gamma_0)} - \Delta^T_{(\beta_0,\alpha_0,\gamma_0)}\Sigma_f\Delta_{(\beta_0,\alpha_0,\gamma_0)}\}^{-1}\Delta^T_{(\beta_0,\alpha_0,\gamma_0)},$$
$$\Delta_{(\beta_0,\alpha_0,\gamma_0)} = (1-r)\,E\left[\frac{\{1-\theta^0(W)-\phi^0(W)\}\,E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}}{\pi^0(W)\{1-\pi^0(W)\}}\begin{pmatrix}\dot\pi^0_\alpha(W;\alpha_0,\gamma_0)\\ \dot\pi^0_\gamma(W;\alpha_0,\gamma_0)\end{pmatrix}^T\right],$$
$$\Delta_{(\alpha_0,\gamma_0)} = r\begin{pmatrix} E\left\{\dfrac{F(X^T\beta_0)\,\dot\theta^{\otimes 2}(W;\alpha_0)}{\theta^0(W)\{1-\theta^0(W)\}} + \dfrac{\bar F(X^T\beta_0)\,\dot\phi^{\otimes 2}(W;\alpha_0)}{\phi^0(W)\{1-\phi^0(W)\}}\right\} & 0\\ 0 & E[\{\dot f(X \mid W;\gamma_0)\}^{\otimes 2}]\end{pmatrix} + (1-r)\,E\left[\frac{1}{\pi^0(W)\{1-\pi^0(W)\}}\begin{pmatrix}\dot\pi^0_\alpha(W;\alpha_0,\gamma_0)\\ \dot\pi^0_\gamma(W;\alpha_0,\gamma_0)\end{pmatrix}^{\otimes 2}\right].$$
In these expressions,
$$\dot\pi^0_\alpha(W;\alpha_0,\gamma_0) = \frac{\partial\pi^0(W)}{\partial\alpha}\bigg|_{(\alpha_0,\gamma_0)} = E\{\bar F(X^T\beta_0) \mid W\}\,\dot\phi(W;\alpha_0) - E\{F(X^T\beta_0) \mid W\}\,\dot\theta(W;\alpha_0),$$
$$\dot\pi^0_\gamma(W;\alpha_0,\gamma_0) = \frac{\partial\pi^0(W)}{\partial\gamma}\bigg|_{(\alpha_0,\gamma_0)} = \{1-\theta^0(W)-\phi^0(W)\}\,E\{F(X^T\beta_0)\,\dot f(X \mid W;\gamma_0) \mid W\},$$
$$\dot\theta(W;\alpha_0) = \frac{\partial\theta(W;\alpha)}{\partial\alpha}\bigg|_{\alpha_0}, \qquad \dot\phi(W;\alpha_0) = \frac{\partial\phi(W;\alpha)}{\partial\alpha}\bigg|_{\alpha_0}, \qquad \dot f(X \mid W;\gamma_0) = \frac{\partial\ln f(X \mid W;\gamma)}{\partial\gamma}\bigg|_{\gamma_0},$$
where, for any column vector $A$, $A^{\otimes 2} = AA^T$.

Remark 1. It can be seen that $I_m$, $I_p$ and $I_{sp}$ are nonnegative definite matrices, and thus $\frac{(1-r)^2}{r}\Sigma_f I_m \Sigma_f$, $\frac{(1-r)^2}{r}\Sigma_f I_p \Sigma_f$ and $\frac{(1-r)^2}{r}\Sigma_f I_{sp} \Sigma_f$ represent the additional variations due to estimating the unknown functions $\theta^0(W)$, $\phi^0(W)$ and $f^0_{X|W}$ by parametric and nonparametric methods.

Remark 2. It is well known that the information matrix $I_v$ is the variance of the score; that is, $I_v = \mathrm{Var}\{S(\beta_0 \mid Y, X)\} = E[\mathrm{Var}\{S(\beta_0 \mid Y, X) \mid X\}]$. We see that the information matrices $I_v^c$ and $I_{sp}$ also have similar expressions:
$$I_v^c = E[\mathrm{Var}\{S^*(\beta_0 \mid Y^0, W) \mid W\}] = E[\rho(Y, Y^0 \mid W)\,\mathrm{Var}\{S^*(\beta_0 \mid Y, W) \mid W\}],$$
and
$$I_{sp} = E\big(\rho(Y, Y^0 \mid W)\,[1 - \rho(Y, Y^0 \mid W)\{1 - \rho(Y, F(X^T\beta_0) \mid W)\}]\,\mathrm{Var}\{S^*(\beta_0 \mid Y, W) \mid W\}\big),$$
where $S^*(\beta \mid Y, W) = \partial\ln\Pr(Y \mid W)/\partial\beta$,
$$\rho(Y, Y^0 \mid W) = \mathrm{Corr}^2(Y, Y^0 \mid W) = \frac{\{1-\theta^0(W)-\phi^0(W)\}^2\,E\{F(X^T\beta_0) \mid W\}\,E\{\bar F(X^T\beta_0) \mid W\}}{\pi^0(W)\{1-\pi^0(W)\}},$$
$$\rho(Y, F(X^T\beta_0) \mid W) = \mathrm{Corr}^2(Y, F(X^T\beta_0) \mid W) = \frac{E\{F^2(X^T\beta_0) \mid W\} - [E\{F(X^T\beta_0) \mid W\}]^2}{E\{F(X^T\beta_0) \mid W\}\,E\{\bar F(X^T\beta_0) \mid W\}}.$$
Thus these information matrices depend not only on the score $S^*(\beta_0 \mid Y, W)$ but also on the squared correlation functions.

Remark 3. For obtaining the validation subsample, a simple random sampling design is used. However, the selection probabilities may depend on the value of $W$. For example, suppose $g(w)$ is the selection probability. Then the asymptotic covariance matrix of $\hat\beta_{sp}$ becomes $\Sigma^*_{sp} = \Sigma^* + \Sigma^* I^*_{sp}\Sigma^*$, where
$$\Sigma^{*-1} = E\big[g(W)\,E\{F(X^T\beta_0)\bar F(X^T\beta_0)X^{\otimes 2} \mid W\}\big] + E\left[\frac{\{1-g(W)\}\{1-\theta^0(W)-\phi^0(W)\}^2\,[E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}]^{\otimes 2}}{\pi^0(W)\{1-\pi^0(W)\}}\right],$$
$$I^*_{sp} = E\bigg(\frac{\{1-g(W)\}^2}{g(W)}\,\frac{\{1-\theta^0(W)-\phi^0(W)\}^2\,[E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}]^{\otimes 2}}{[\pi^0(W)\{1-\pi^0(W)\}]^2}\Big[E\{F(X^T\beta_0) \mid W\}\,\theta^0(W)\{1-\theta^0(W)\} + E\{\bar F(X^T\beta_0) \mid W\}\,\phi^0(W)\{1-\phi^0(W)\} + \{1-\theta^0(W)-\phi^0(W)\}^2\,[E\{F^2(X^T\beta_0) \mid W\} - E^2\{F(X^T\beta_0) \mid W\}]\Big]\bigg).$$
Other asymptotic covariance matrices in Theorem 1 can be modified accordingly.

Remark 4. It is clear that $\Sigma_{sp}$ reduces to the asymptotic covariance matrix of Carroll and Wand's (1991) estimator provided $Y = Y^0$ with probability one. If $X = W$ with probability one, then $\Sigma_{sp}$ is the asymptotic covariance matrix of the estimator of Pepe (1992).

Remark 5. The results in Theorem 1 can be extended to $d > 1$ vector predictors, provided the function $K$ is a $p$th order kernel, $p > d$, and the bandwidth parameter satisfies $nh^{2d} \to \infty$ and $nh^{2p} \to 0$ as $n \to \infty$.

Remark 6. All the asymptotic covariance matrices can be estimated by moment type estimates.
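As one concrete instance of Remark 6 (our sketch, not spelled out in the paper), $I_v$ can be estimated by replacing the expectation with a sample average over the validation data, with $\hat\beta$ any consistent estimate; $\Sigma_s$ is then estimated by $(r\hat I_v)^{-1}$.

```python
import numpy as np

def I_v_hat(X_val, beta_hat):
    """Moment estimate of I_v = E{F(X'b) Fbar(X'b) X X^T} from validation data."""
    p = 1.0 / (1.0 + np.exp(-X_val @ beta_hat))    # F(x_i^T beta_hat)
    return (X_val * (p * (1 - p))[:, None]).T @ X_val / X_val.shape[0]
```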

4. Comparisons of Asymptotic Covariance Matrices

In this section, we discuss the behavior of the different estimators by comparing their asymptotic covariance matrices. The general results basically agree with our expectations, but some of the proofs are not trivial. For any matrices $A$ and $B$, we write $A \ge B$ if $A - B$ is a nonnegative definite matrix.

4.1. Results under general conditions

First, from Remark 1 of Section 3, we have $\Sigma_m \ge \Sigma_f$, $\Sigma_p \ge \Sigma_f$ and $\Sigma_{sp} \ge \Sigma_f$. Further, let $\{\Delta^v_{(\alpha_0,\gamma_0)}\}^{-1}$ be the asymptotic covariance matrix of the MLE $(\hat\alpha_p, \hat\gamma_p)$ obtained from the validation subsample. Then one can rewrite $\Sigma_p$ as
$$\Sigma_p = \Sigma_f\big[\Sigma_f^{-1} + \Delta_{(\beta_0,\alpha_0,\gamma_0)}\{\Delta^v_{(\alpha_0,\gamma_0)}\}^{-1}\Delta^T_{(\beta_0,\alpha_0,\gamma_0)}\big]\Sigma_f,$$
and consequently
$$\Sigma_p - \Sigma_m = \Sigma_f\,\Delta_{(\beta_0,\alpha_0,\gamma_0)}\Big[\{\Delta^v_{(\alpha_0,\gamma_0)}\}^{-1} - \{\Delta_{(\alpha_0,\gamma_0)} - \Delta^T_{(\beta_0,\alpha_0,\gamma_0)}\Sigma_f\Delta_{(\beta_0,\alpha_0,\gamma_0)}\}^{-1}\Big]\Delta^T_{(\beta_0,\alpha_0,\gamma_0)}\Sigma_f.$$
Since $\{\Delta^v_{(\alpha_0,\gamma_0)}\}^{-1}$ and $\{\Delta_{(\alpha_0,\gamma_0)} - \Delta^T_{(\beta_0,\alpha_0,\gamma_0)}\Sigma_f\Delta_{(\beta_0,\alpha_0,\gamma_0)}\}^{-1}$ are, respectively, the asymptotic covariance matrices of $(\hat\alpha_p, \hat\gamma_p)$ and $(\hat\alpha_m, \hat\gamma_m)$, the above results yield the following theorem.

Theorem 2. Given conditions (A.1)−(A.6), we have $\Sigma_s \ge \Sigma_f$, $\Sigma_{sp} \ge \Sigma_f$ and $\Sigma_p \ge \Sigma_m \ge \Sigma_f$.

4.2. Results under special constraints

Cheng and Hsueh (1999) compared the asymptotic covariance matrices when $X = W$ with probability one. Here the comparisons focus on the case when $Y = Y^0$ with probability one.

As a consequence of Theorem 1, if $Y = Y^0$ with probability one, the matrices $\Sigma_f$ and $I_{sp}$ simplify to
$$\Sigma_f = \left[r\,E\{F(X^T\beta_0)\bar F(X^T\beta_0)X^{\otimes 2}\} + (1-r)\,E\frac{[E\{F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W\}]^{\otimes 2}}{E\{F(X^T\beta_0) \mid W\}\,E\{\bar F(X^T\beta_0) \mid W\}}\right]^{-1},$$
$$I_{sp} = E\Big[\rho(Y, F(X^T\beta_0) \mid W)\,E\big[\{S^*(\beta_0 \mid Y, W)\}^{\otimes 2} \mid W\big]\Big].$$
Then Corollary 1 follows easily from
$$\Sigma_s - \Sigma_{sp} = \Sigma_f\bigg((1-r)\,E\{S^*(\beta_0 \mid Y, W)\}^{\otimes 2} - \frac{(1-r)^2}{r}\,E\Big[\rho(Y, F(X^T\beta_0) \mid W)\,E[\{S^*(\beta_0 \mid Y, W)\}^{\otimes 2} \mid W]\Big] + \frac{(1-r)^2}{r}\,E\{S^*(\beta_0 \mid Y, W)\}^{\otimes 2}\big[E\{S(\beta_0 \mid Y, X)\}^{\otimes 2}\big]^{-1}E\{S^*(\beta_0 \mid Y, W)\}^{\otimes 2}\bigg)\Sigma_f.$$

Corollary 1. Suppose conditions (A.1)−(A.6) are satisfied, that $Y = Y^0$ with probability one, and that $\rho(Y, F(X^T\beta_0) \mid W) \le \min\{r/(1-r), 1\}$. Then $\Sigma_s \ge \Sigma_{sp}$.

Accordingly, the semiparametric estimator $\hat\beta_{sp}$ is always better than the simple MLE $\hat\beta_s$ under our conditions. A similar conclusion can be derived for the case that $X = W$ with probability one. Now suppose $\beta_0$ is a scalar. Then, under the conditions of Theorem 1 and $\rho(Y, F(X^T\beta_0) \mid W) \le \rho^*$ with probability one for some constant $\rho^*$, the ARE of $\hat\beta_s$ with respect to $\hat\beta_{sp}$ is always smaller than
$$e_1(\rho^*, I, r) = 1 - \frac{(1-r)^2\,I\,[I - \{\rho^* - r/(1-r)\}]}{\{r + (1-r)I\}^2},$$
where the coefficient of reliability $I = E\{S^*(\beta_0 \mid Y, W)\}^2 / E\{S(\beta_0 \mid Y, X)\}^2 \le 1$ can be used to measure the quality of the surrogate predictors $W$. Some curves of $e_1$ are given in Figure 1 for $\rho^* = 0.3$. The results clearly show that if the coefficient $I$ is large enough, then even for a small sampling fraction $r$ the semiparametric estimator $\hat\beta_{sp}$ is still much more efficient than the simple MLE $\hat\beta_s$.

Figure 1. For $\rho^* = 0.3$, curves of $e_1(\rho^*, I, r)$ at various $I$ ($I = 0.1, 0.3, 0.5, 0.7, 0.9, 1.0$), plotting the bound of the ARE against $r$.

Similar results can be derived for the ARE of $\hat\beta_{sp}$ with respect to $\hat\beta_f$. We note that under the same conditions, a lower bound for the ARE of $\hat\beta_{sp}$ with respect to $\hat\beta_f$ is
$$e_2(\rho^*, I, r) = 1 - \frac{(1-r)^2\,I\rho^*}{r^2 + r(1-r)I + (1-r)^2\,I\rho^*}.$$
Some curves of $e_2$ are also given in Figure 2 for $\rho^* = 0.3$. The ARE is at least 0.60 for $r \ge 0.3$. Further, for fixed $r$, $e_2$ increases as the coefficient of reliability $I$ decreases.

Figure 2. For $\rho^* = 0.3$, curves of $e_2(\rho^*, I, r)$ at various $I$ ($I = 0.1, 0.3, 0.5, 0.7, 1.0$), plotting the bound of the ARE against $r$.
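Both bounds, as reconstructed above, are easy to evaluate directly; the snippet below (ours) tabulates $e_1$ and $e_2$ at $\rho^* = 0.3$ and, for instance, gives $e_2 \approx 0.67$ at $r = 0.3$, $I = 1$, consistent with the statement that the ARE is at least 0.60 for $r \ge 0.3$.

```python
def e1(rho, I, r):
    """Upper bound for the ARE of beta_s relative to beta_sp."""
    return 1 - (1 - r)**2 * I * (I - (rho - r / (1 - r))) / (r + (1 - r) * I)**2

def e2(rho, I, r):
    """Lower bound for the ARE of beta_sp relative to beta_f."""
    return 1 - (1 - r)**2 * I * rho / (r**2 + r * (1 - r) * I + (1 - r)**2 * I * rho)

for I in (0.1, 0.5, 1.0):
    print(I, round(e1(0.3, I, 0.3), 3), round(e2(0.3, I, 0.3), 3))
```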

5. Simulation Studies

In order to study the finite sample performance of the estimates, some empirical studies were carried out. The logistic regression model considered was $F(x^T\beta_0) = \exp(x)/\{1 + \exp(x)\}$, so $(\beta^0_0, \beta^0_1) = (0, 1)$. A linear error model for the predictor $X$ was assumed: $W_i = X_i + v\,U_i$ for some $v$, where $X_i$ and $U_i$ were pseudo-independent $N(0, 0.25)$ random variables. Thus, given $W_i = w_i$, $X_i$ is a $N(w_i/(1+v^2),\ 0.25v^2/(1+v^2))$ variate. We also assumed that, given $W_i = w_i$, the misclassification probabilities were independent of $X_i$ and $Y_i$, with $\theta^0(w_i) = \phi^0(w_i)$ for all $w_i$, where $\theta^0(w_i) = \theta(w_i, \alpha_0)$ and $\phi^0(w_i) = \phi(w_i, \alpha_0)$. Four different models were considered.

(1a) $W_i = X_i + 0.1U_i$ and $Y_i = Y^0_i$ with probability one. In this case, no misclassification occurs, and hence the only functional form needed for estimating $\hat\beta_m$ and $\hat\beta_p$ is $f(x \mid w; \gamma) = N(w/(1+4\gamma), \gamma/(1+4\gamma))$, $\gamma > 0$.

(1b) $W_i = X_i + 0.1U_i$ and $\theta^0(w_i) = \phi^0(w_i) = 0.1$. In this case, the functional forms needed for computing $\hat\beta_m$ and $\hat\beta_p$ are $\theta(w, \alpha) = \alpha_0$, $\phi(w, \alpha) = \alpha_1$, and $f(x \mid w; \gamma) = N(w/(1+4\gamma), \gamma/(1+4\gamma))$, $\gamma > 0$.

(2a) $W_i = X_i + U_i$ and $Y_i = Y^0_i$ with probability one. In this case, no misclassification occurs, and hence the only functional form needed for estimating $\hat\beta_m$ and $\hat\beta_p$ is $f(x \mid w; \gamma) = N(w/(1+4\gamma), \gamma/(1+4\gamma))$, $\gamma > 0$.

(2b) $W_i = X_i + U_i$ and $\theta^0(w_i) = \phi^0(w_i) = \exp(w_i - 2.5)/\{1 + \exp(w_i - 2.5)\}$. In this case, the misclassification probabilities follow a logit model with regression coefficients $\alpha^0_0 = -2.5$, $\alpha^0_1 = 1$. The functional forms needed for estimating $\hat\beta_m$ and $\hat\beta_p$ are $\theta(w, \alpha) = \exp(\alpha_0 + \alpha_1 w)/\{1 + \exp(\alpha_0 + \alpha_1 w)\}$, $\phi(w, \alpha) = \exp(\alpha_0 + \alpha_1 w)/\{1 + \exp(\alpha_0 + \alpha_1 w)\}$, and $f(x \mid w; \gamma) = N(w/(1+4\gamma), \gamma/(1+4\gamma))$, $\gamma > 0$.

Model (1b) features small measurement error in both the responses and the covariates; Model (2b), on the other hand, has more severe measurement error. The primary sample size used was $n = 150$, and the sampling fractions for the validation subsamples were $r = 0.2$, $0.4$ and $0.6$. One thousand pseudo data sets were generated to compute the simulated mean squared errors.
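A minimal data-generation sketch (ours) for model (2b), following the description above; the other three models differ only in the error scale $v$ and the misclassification probabilities.

```python
import numpy as np

def generate_model_2b(n=150, r=0.4, seed=0):
    """One pseudo data set from model (2b): W = X + U with logit misclassification."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 0.5, n)                      # X ~ N(0, 0.25)
    u = rng.normal(0.0, 0.5, n)                      # U ~ N(0, 0.25)
    w = x + u                                        # surrogate covariate (v = 1)
    y = rng.binomial(1, 1 / (1 + np.exp(-x)))        # true response, beta = (0, 1)
    mis = 1 / (1 + np.exp(-(w - 2.5)))               # theta^0(w) = phi^0(w)
    y0 = np.where(rng.random(n) < mis, 1 - y, y)     # surrogate response
    k = int(r * n)                                   # validation subsample size
    return dict(w=w, y0=y0, x_val=x[:k], y_val=y[:k])  # (X, Y) kept only for i <= k
```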

The function $K(\cdot)$ used in computing the nonparametric regression estimates was the Epanechnikov kernel, $K(t) = (3/4)(1 - t^2)\,I_{[-1,1]}(t)$; see Eubank (1988). Several bandwidths $h = a\hat\sigma_w k^{-1/3}$ were used in our simulations, where $\hat\sigma_w$ is the sample standard deviation of $W$ based on the validation data set and $a$ is some constant. Such a choice of bandwidth was justified by Sepanski, Knickerbocker and Carroll (1994); see also Carroll and Wand (1991) and Wang and Wang (1997).
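For completeness, the kernel and the bandwidth rule used in the simulations can be written down directly (a sketch under the stated choices):

```python
import numpy as np

def epanechnikov(t):
    """K(t) = (3/4)(1 - t^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - t**2) * (np.abs(t) <= 1)

def bandwidth(w_val, a=1.0):
    """h = a * sigma_hat_w * k^(-1/3), with sigma_hat_w from the validation data."""
    k = len(w_val)
    return a * np.std(w_val, ddof=1) * k ** (-1.0 / 3.0)
```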

First, we investigate the performance of $\hat\beta_{sp}$ for various choices of $a$. The performance of $\hat\beta_{sp} = (\hat\beta_{sp,0}, \hat\beta_{sp,1})^T$ is measured by its simulated total mean squared error (TMSE), defined as $MSE(\hat\beta_{sp,0}) + MSE(\hat\beta_{sp,1})$. Table 1 reports the simulation results. It is seen that the performance of $\hat\beta_{sp}$ is not sensitive to the choice of $a$, which agrees with many findings for semiparametric estimation using kernel regression. Here, however, a better choice is $a = 1$, and hence we use this value for $h$ in the remaining simulations.

Table 1. Simulated TMSE of $\hat\beta_{sp}$ for different $a$.

                  W = X + .1U                              W = X + U
        θ^0(W) = 0         θ^0(W) = .1         θ^0(W) = 0         θ^0(W) = e^{W-2.5}/(1+e^{W-2.5})
  a    r=.2  r=.4  r=.6   r=.2  r=.4  r=.6    r=.2  r=.4  r=.6   r=.2   r=.4   r=.6
 0.5  .2005 .1860 .1742  .2270 .1972 .1748   .3418 .2444 .2011  .3810  .2584  .2110
 1.0  .1899 .1863 .1625  .2112 .1813 .1843   .3091 .2121 .1989  .3158  .2470  .2046
 1.5  .1947 .1900 .1715  .1991 .1901 .1870   .3592 .2604 .2182  .3612  .2496  .2238
 2.0  .2027 .1900 .1767  .2453 .1867 .1832   .3645 .2665 .2254  .3374  .2505  .2268

Next, we compare the simulated mean squared errors of the different estimates of $\beta = (\beta_0, \beta_1)^T$. The results are tabulated in Table 2. Obviously, $\hat\beta_{sp}$ has the best overall performance among the competing estimates. Moreover, the simulated TMSE of $\hat\beta_{sp}$ seems insensitive to the selection of $r$; this is particularly true when the model has only minor measurement errors. Unreported calculations also show that the simulated standard errors (SE) for $\hat\beta_{sp,0}$ and $\hat\beta_{sp,1}$ are stable: over all cases considered, the simulated SEs range from 0.166 to 0.188 for $\hat\beta_{sp,0}$ and from 0.366 to 0.506 for $\hat\beta_{sp,1}$. In general, $\hat\beta_{sp}$ performs much better than the other estimates, especially when $r \le 0.4$. From Table 2, we also see that the estimates $\hat\beta_f$ and $\hat\beta_p$ are quite competitive and better than the simple estimate $\hat\beta_s$; the latter statement is especially clear when $r = 0.2$. However, the simulated SEs for $\hat\beta_{f,0}$, $\hat\beta_{p,0}$ and $\hat\beta_{s,0}$ are not very different; these values range from 0.219 to 0.421. On the other hand, the simulated SEs for $\hat\beta_{f,1}$ and $\hat\beta_{p,1}$ are smaller than those of $\hat\beta_{s,1}$, particularly when $r = 0.2$: the largest simulated SE for $\hat\beta_{f,1}$ and $\hat\beta_{p,1}$ is 0.894, compared with the corresponding value 1.292 for $\hat\beta_{s,1}$.

Table 2. Simulated TMSE of different estimators.

                                        β̂_s      β̂_m      β̂_p     β̂_f     β̂_sp
W = X + .1U, θ^0(W) = 0     r = .2   1.2917  14.6527   .8332   .8318   .1899
                            r = .4    .4917   6.4017   .4144   .4570   .1863
                            r = .6    .2840   1.5523   .2808   .2799   .1625
W = X + .1U, θ^0(W) = 0.1   r = .2   1.2917  13.6781   .8721   .8039   .2112
                            r = .4    .4917   4.7618   .4284   .4488   .1813
                            r = .6    .2840   2.4016   .2967   .2987   .1843
W = X + U, θ^0(W) = 0       r = .2   1.2917  17.0013   .8311   .8945   .3091
                            r = .4    .4917   5.3543   .4606   .4339   .2121
                            r = .6    .2840   1.4972   .2697   .2476   .1989
W = X + U, θ^0(W) = e^{W-2.5}/(1+e^{W-2.5})
                            r = .2   1.2917  36.7846   .8622   .8203   .3158
                            r = .4    .4917   7.0655   .3740   .3992   .2470
                            r = .6    .2840   2.2060   .2754   .2530   .2046

Finally, we comment on the comparison between $\hat\beta_m$ and $\hat\beta_p$. In general, $\hat\beta_p$ is better than $\hat\beta_m$ unless $r = 0.6$. This happens because more parameters must be estimated simultaneously in computing $\hat\beta_m$, while the validation subsample size $k$ is not large enough. The largest simulated SE is 3.103 for $\hat\beta_{m,1}$ and 1.769 for $\hat\beta_{m,0}$, showing that the estimates $\hat\beta_m$ are not very stable. The performance of $\hat\beta_m$ and $\hat\beta_p$ is expected to improve if $k$ is sufficiently large and the parametric functions $\theta(w, \alpha)$, $\phi(w, \alpha)$ and $f_{X|W}(x \mid w; \gamma)$ are correctly modeled. Cheng and Hsueh (1999) reported that the performance of $\hat\beta_m$ and $\hat\beta_p$ depends heavily on the correct choice of the models for the misclassification probabilities when $X = W$ with probability one. Therefore, $\hat\beta_m$ and $\hat\beta_p$ are not very robust, and extra care is needed when applying the parametric methods.

Acknowledgements

This research was supported in part by the National Science Council, R.O.C. The authors thank the Editor, an associate editor and the referees for their useful comments and suggestions, which greatly improved the presentation of this paper.

Appendix

Proof of Theorem 1. By Taylor's theorem, for $n \to \infty$ we have
$$\sqrt n(\hat\beta_{sp} - \beta_0) = \left[\frac1n\left\{-\frac{\partial^2\hat l(\beta_0)}{\partial\beta^2}\right\} - \frac{1}{2n}\left\{-\frac{\partial^3\hat l(\beta^*)}{\partial\beta^3}\right\}(\hat\beta_{sp} - \beta_0)\right]^{-1}\frac{1}{\sqrt n}\left\{\frac{\partial\hat l(\beta_0)}{\partial\beta}\right\} + o_p(1),$$
where $\beta^*$ is some quantity such that $|\beta^* - \beta_0| \le |\hat\beta_{sp} - \beta_0|$. Here $\hat l$ is the pseudolikelihood, and
$$\frac{\partial\hat l(\beta_0)}{\partial\beta} = \sum_{i=1}^{k}\left\{\frac{\partial l_{1,i}(\beta_0)}{\partial\beta}\right\} + \sum_{j=k+1}^{n}\left\{\frac{\partial\hat l_{2,j}(\beta_0)}{\partial\beta}\right\}, \qquad \frac{\partial l_{1,i}(\beta_0)}{\partial\beta} = \{Y_i - F(X_i^T\beta_0)\}X_i,$$
$$\frac{\partial\hat l_{2,j}(\beta_0)}{\partial\beta} = \frac{[\hat A_1(Y^0_j, W_j) - \hat A_2(Y^0_j, W_j)]\,\hat E[F(X_j^T\beta_0)\bar F(X_j^T\beta_0)X_j \mid W_j]}{\hat E\{F(X_j^T\beta_0) \mid W_j\}\hat A_1(Y^0_j, W_j) + \hat E\{\bar F(X_j^T\beta_0) \mid W_j\}\hat A_2(Y^0_j, W_j)}.$$
Further,
$$-\frac{\partial^2\hat l(\beta_0)}{\partial\beta^2} = \sum_{i=1}^{k}\left\{-\frac{\partial^2 l_{1,i}(\beta_0)}{\partial\beta^2}\right\} + \sum_{j=k+1}^{n}\left\{-\frac{\partial^2\hat l_{2,j}(\beta_0)}{\partial\beta^2}\right\}, \qquad -\frac{\partial^2 l_{1,i}(\beta_0)}{\partial\beta^2} = F(X_i^T\beta_0)\bar F(X_i^T\beta_0)X_iX_i^T,$$
$$-\frac{\partial^2\hat l_{2,j}(\beta_0)}{\partial\beta^2} = \left(\frac{[\hat A_1(Y^0_j, W_j) - \hat A_2(Y^0_j, W_j)]\,\hat E[F(X_j^T\beta_0)\bar F(X_j^T\beta_0)X_j \mid W_j]}{\hat E\{F(X_j^T\beta_0) \mid W_j\}\hat A_1(Y^0_j, W_j) + \hat E\{\bar F(X_j^T\beta_0) \mid W_j\}\hat A_2(Y^0_j, W_j)}\right)^{\otimes 2} - \frac{[\hat A_1(Y^0_j, W_j) - \hat A_2(Y^0_j, W_j)]\,\hat E[\{1 - 2F(X_j^T\beta_0)\}F(X_j^T\beta_0)\bar F(X_j^T\beta_0)X_jX_j^T \mid W_j]}{\hat E\{F(X_j^T\beta_0) \mid W_j\}\hat A_1(Y^0_j, W_j) + \hat E\{\bar F(X_j^T\beta_0) \mid W_j\}\hat A_2(Y^0_j, W_j)}.$$
By the law of large numbers,
$$\frac1n\left\{-\frac{\partial^2\hat l(\beta_0)}{\partial\beta^2}\right\} = \frac1n\left\{-\frac{\partial^2 l(\beta_0)}{\partial\beta^2}\right\} + \left[\frac1n\left\{-\frac{\partial^2\hat l(\beta_0)}{\partial\beta^2}\right\} - \frac1n\left\{-\frac{\partial^2 l(\beta_0)}{\partial\beta^2}\right\}\right] = rE\{F(X^T\beta_0)\bar F(X^T\beta_0)XX^T\} + (1-r)E\left[\frac{\{1-\theta^0(W)-\phi^0(W)\}^2}{\pi^0(W)\{1-\pi^0(W)\}}\big(E[F(X^T\beta_0)\bar F(X^T\beta_0)X \mid W]\big)^{\otimes 2}\right] + o_p(1) + O_p\Big(\frac{1}{n\sqrt h} + \frac{1}{\sqrt n} + h^2\Big) = \Sigma_f^{-1} + o_p(1),$$
as $n \to \infty$, $h \to 0$ and $n^2h \to \infty$. Then, as $n \to \infty$, $h \to 0$ and $n^2h \to \infty$,
$$\sqrt n(\hat\beta_{sp} - \beta_0) = \{\Sigma_f + o_p(1)\}\,\frac{1}{\sqrt n}\left\{\frac{\partial\hat l(\beta_0)}{\partial\beta}\right\} + o_p(1).$$
Further, by Taylor's theorem again, as $n \to \infty$,
$$\sum_{j=k+1}^{n}\left\{\frac{\partial\hat l_{2,j}(\beta_0)}{\partial\beta}\right\} = \sum_{j=k+1}^{n}\left\{\frac{\partial l_{2,j}(\beta_0)}{\partial\beta} + H_{n,j}\right\} + O_p\Big(\frac{1}{nh^{3/2}} + nh^4 + \frac1h + \sqrt n\,h^{3/2}\Big),$$
where $H_{n,j} = k^{-1}\sum_{i=1}^{k} h_{i,j}$ and
$$h_{i,j} = \frac{K_h(W_i - W_j)}{f(W_j)}\Bigg(\left\{\frac{\partial^2 l_{2,j}(\beta_0)}{\partial\beta\,\partial\theta^0(W_j)}\right\}\frac{Y_i\{1 - Y^0_i - \theta^0(W_j)\}}{E\{F(X_j^T\beta_0) \mid W_j\}} + \left\{\frac{\partial^2 l_{2,j}(\beta_0)}{\partial\beta\,\partial\phi^0(W_j)}\right\}\frac{(1-Y_i)\{Y^0_i - \phi^0(W_j)\}}{E\{\bar F(X_j^T\beta_0) \mid W_j\}} + \left\{\frac{\partial^2 l_{2,j}(\beta_0)}{\partial\beta\,\partial E\{F(X_j^T\beta_0) \mid W_j\}}\right\}\big[F(X_i^T\beta_0) - E\{F(X_j^T\beta_0) \mid W_j\}\big] + \left\{\frac{\partial^2 l_{2,j}(\beta_0)}{\partial\beta\,\partial E[F(X_j^T\beta_0)\bar F(X_j^T\beta_0)X_j \mid W_j]}\right\}\big[F(X_i^T\beta_0)\bar F(X_i^T\beta_0)X_i - E[F(X_j^T\beta_0)\bar F(X_j^T\beta_0)X_j \mid W_j]\big]\Bigg).$$
As a consequence, for $n \to \infty$,
$$\frac{1}{\sqrt n}\left\{\frac{\partial\hat l(\beta_0)}{\partial\beta}\right\} = \frac{\sqrt n}{k(n-k)}\sum_{i=1}^{k}\sum_{j=k+1}^{n}\left[\frac kn\left\{\frac{\partial l_{1,i}(\beta_0)}{\partial\beta}\right\} + \frac{n-k}{n}\left\{\frac{\partial l_{2,j}(\beta_0)}{\partial\beta}\right\} + \frac{n-k}{n}h_{i,j}\right] + O_p\Big(\frac{1}{n^{3/2}h^{3/2}} + n^{1/2}h^4 + \frac{1}{n^{1/2}h} + h^{3/2}\Big).$$
Define the i.i.d. random vectors $Z_i = (X_i, W_i, Y_i, Y^0_i)$, $i = 1, \ldots, k$, and $Z^*_j = (W_j, Y^0_j)$, $j = k+1, \ldots, n$, and let
$$Q_n(Z_i; Z^*_j) = \frac kn\left\{\frac{\partial l_{1,i}(\beta_0)}{\partial\beta}\right\} + \frac{n-k}{n}\left\{\frac{\partial l_{2,j}(\beta_0)}{\partial\beta}\right\} + \frac{n-k}{n}h_{i,j}.$$
Then $k^{-1}(n-k)^{-1}\sum_{i=1}^{k}\sum_{j=k+1}^{n} Q_n(Z_i; Z^*_j)$ is a generalized U-statistic. Applying the central limit theorem for generalized U-statistics (see Serfling (1980)), we have, for $n \to \infty$,
$$\frac{\sqrt n}{k(n-k)}\sum_{i=1}^{k}\sum_{j=k+1}^{n} Q_n(Z_i; Z^*_j) = \sqrt n\left[\frac1k\sum_{i=1}^{k}E\{Q_n(Z_i; Z^*_j) \mid Z_i\} + \frac{1}{n-k}\sum_{j=k+1}^{n}E\{Q_n(Z_i; Z^*_j) \mid Z^*_j\}\right] + o_p(1) \xrightarrow{d} N\Big(0,\ \Sigma_f^{-1} + \frac{(1-r)^2}{r}I_{sp}\Big),$$
where
$$E\{Q_n(Z_i; Z^*_j) \mid Z_i\} = \frac kn\left\{\frac{\partial l_{1,i}(\beta_0)}{\partial\beta}\right\} + \frac{n-k}{n}\,\frac{\{1-\theta^0(W_i)-\phi^0(W_i)\}\,E[F(X_i^T\beta_0)\bar F(X_i^T\beta_0)X_i \mid W_i]}{\pi^0(W_i)\{1-\pi^0(W_i)\}}\Big(Y_i\{1 - Y^0_i - \theta^0(W_i)\} - (1-Y_i)\{Y^0_i - \phi^0(W_i)\} - \{1-\theta^0(W_i)-\phi^0(W_i)\}\big[F(X_i^T\beta_0) - E\{F(X_i^T\beta_0) \mid W_i\}\big]\Big) + O_p(h^2),$$
$$E\{Q_n(Z_i; Z^*_j) \mid Z^*_j\} = \frac{n-k}{n}\left\{\frac{\partial l_{2,j}(\beta_0)}{\partial\beta}\right\}.$$
Consequently, as $n \to \infty$, $\sqrt n(\hat\beta_{sp} - \beta_0) \xrightarrow{d} N\big(0,\ \Sigma_f + \frac{(1-r)^2}{r}\Sigma_f I_{sp}\Sigma_f\big)$.

References

Albert, P. S., Hunsberger, S. A. and Bird, F. M. (1997). Modeling repeated measures with monotonic ordinal responses and misclassification, with applications to studying maturation. J. Amer. Statist. Assoc. 92, 1304-1311.

Bollinger, C. R. and David, M. H. (1997). Modeling discrete choice with response error: Food stamp participation. J. Amer. Statist. Assoc. 92, 827-835.

Carroll, R. J., Spiegelman, C. H., Lan, K. K. G., Bailey, K. T. and Abbott, R. D. (1984). On errors-in-variables for binary regression models. Biometrika 71, 19-26.

Carroll, R. J. and Wand, M. P. (1991). Semiparametric estimation in logistic measurement error models. J. Roy. Statist. Soc. Ser. B 53, 573-585.

Cheng, K. F. and Hsueh, H. M. (1999). Correcting bias due to misclassification in the estimation of logistic regression models. Statist. Probab. Lett. 44, 229-240.


Cox, D. R. (1970). The Analysis of Binary Data. Chapman and Hall, London.

Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York.

Gasser, T. and Müller, H. G. (1979). Kernel estimation of regression functions. In Smoothing Techniques for Curve Estimation (Edited by T. Gasser and M. Rosenblatt), 23-68. Lecture Notes in Mathematics 757, Springer-Verlag, Berlin.

Golm, G. T., Halloran, M. E. and Longini, I. M. Jr. (1998). Semi-parametric models for mismeasured exposure information in vaccine trials. Statist. Medicine 17, 2335-2352.

Golm, G. T., Halloran, M. E. and Longini, I. M. Jr. (1999). Semiparametric methods for multiple exposure mismeasurement and bivariate outcome in HIV vaccine trials. Biometrics 55, 94-101.

Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999). Semiparametric methods for response-selective and missing data problems in regression. J. Roy. Statist. Soc. Ser. B 61, 413-438.

Pepe, M. S. (1992). Inference using surrogate outcome data and a validation sample. Biometrika 79, 355-365.

Pepe, M. S., Reilly, M. and Fleming, T. R. (1994). Auxiliary outcome data and the mean score method. J. Statist. Plann. Inference 42, 137-160.

Pregibon, D. (1981). Logistic regression diagnostics. Ann. Statist. 9, 705-724.

Reilly, M. and Pepe, M. S. (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82, 299-314.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89, 846-866.

Sepanski, J. H., Knickerbocker, R. and Carroll, R. J. (1994). A semiparametric correction for attenuation. J. Amer. Statist. Assoc. 89, 1366-1373.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley, New York.

Wang, C. Y. and Wang, S. (1997). Semiparametric methods in logistic regression with measurement error. Statist. Sinica 7, 1103-1120.

Graduate Institute of Statistics, National Central University, Chungli, Taiwan, R.O.C. E-mail: kfcheng@cc.ncu.edu.tw

Department of Statistics, National Chengchi University, Taipei, Taiwan, R.O.C. E-mail: hsueh@nccu.edu.tw
