**PREPRINT**

國立臺灣大學 數學系 預印本 Department of Mathematics, National Taiwan University

### www.math.ntu.edu.tw/ ~ mathlib/preprint/2011- 15.pdf

## New Estimation and Inference Procedures for a Single-Index Conditional Distribution Model

### Chin-Tsang Chiang and Ming-Yueh Huang

### July 31, 2011

### New Estimation and Inference Procedures for a Single-Index Conditional Distribution Model

### Chin-Tsang Chiang and Ming-Yueh Huang

### Department of Mathematics, National Taiwan University, Taipei 10617, Taiwan, ROC

### July 31, 2011

Abstract

In this article, a more flexible single-index regression model is employed to char- acterize the conditional distribution. The pseudo least integrated squares approach is proposed to estimate the index coefficients. As shown in the numerical results, our esti- mator outperforms the existing ones in terms of the mean squared error. Moreover, we provide the generalized cross-validation criteria for bandwidth selection and utilize the frequency distributions of weighted bootstrap analogues for the estimation of asymp- totic variance and the construction of confidence intervals. With a defined residual process, a test rule is established to check the adequacy of an applied single-index condi- tional distribution model. To tackle with the problem of sparse variables, a multi-stage adaptive Lasso algorithm is developed to enhance the ability of identifying significant variables. All of our procedures are found to be easily implemented, numerically stable, and highly adaptive to a variety of data structures. In addition, we assess the finite sam- ple performances of the proposed estimation and inference procedures through extensive simulation experiments. Two empirical examples from the house-price study in Boston and the environmental study in New York are further used to illustrate applications of the methodology.

Keywords: Adaptive Lasso; Cross-validation; Curse of dimensionality; Multi-stage adaptive Lasso; Naive bootstrap; Oracle properties; Single-index; Pseudo least integrated squares estimator; Pseudo least squares estimator; Pseudo maximum likelihood estimator; Random weighted bootstrap; Residual process.

### 1 Introduction

We consider the conditional distribution F_{Y}(y|x) of a real-valued response Y given continuous
covariates X = x, where X = (X1, · · · , Xd)^{T} and x = (x1, · · · , xd)^{T}. In regression analysis, a
wide cross-section of research interests is pursued in the study of the conditional mean E[Y |x].

A more complete methodology and theoretical framework related to fully nonparametric and semiparametric distribution models still remains and a further investigation is necessary. As one can see, with a large number of covariatess, a fully nonparametric distribution usually suffers from the curse of dimensionality (Bellman (1961)). Although parametric models have played prominent roles in applications, they are frequently detected to be inadequate in many studies. Consequently, a more flexible semiparametric model becomes a great interest to characterize the dependence of Y on X and avoids the impact of misspecification of parametric models and the difficulty in the estimation of nonparametric distributions.

One of the most popular extension of parametric models is the single-index (SI) conditional distribution model:

F_{Y}(y|x) = G(y, x_{θ}_{0}), (1)

where G(·, ·) is an unknown bivariate function, x_{θ} = x_{1}+ (x_{2}, · · · , x_{d})^{T}θ, and θ_{0} is a vector of
true index coefficients. The most significant covariate is assumed, without loss of generality,
to be X_{1} and the setting of its coefficient is mainly to deal with the problem of identifia-
bility. When the conditional mean exists, it can be easily obtained from the above model
that E[Y |x] = m(x_{θ}_{0}) with m(·) being some unspecified function. Based on the conditional
mean model, Powell, Stock, and Stoker (1989) utilized the estimation of the density-weighted

average derivative to estimate θ_{0}. Although the estimator was shown to be √

n-consistent,
asymptotically normal, and computationally simple, the numerical instability is usually seen
as a consequence of high-dimensional kernel smoothing. To overcome such a weakness with
practice, Ichimura (1993) developed a semiparametric least squares approach and derived its
asymptotic properties. Meanwhile, H¨ardle, Hall, and Ichimura (1993) recommended a cross-
validation criterion to simultaneously estimate bandwidths and index coefficients. Under the
validity of model (1) with a continuous response, Delecroix, H¨ardle, and Hristache (2003)
introduced the pseudo likelihood (PL) estimation for θ_{0}. Without moment and continuous
conditions on Y , Hall and Yao (2005) suggested an estimation criterion on the basis of the
average squared difference between the empirical estimator and the model-based estimator of
the joint probability of (Y, X^{T})^{T}. As one can see, the good performance of their estimation
procedure is connected to an appropriate number of spheres and the corresponding radii used
in the integral approximation. Currently, there is still no standard rule to determine the val-
ues of these two quantities. Furthermore, the established algorithm is often computationally
slow and intensive, especially in high-dimensional covariate spaces. Confronted with these
problems, we proposed a new type of estimation criterion, which is simple and easily imple-
mented, for θ_{0}. The basic rationale behind this approach is to define the response process
N (y) = I(Y ≤ y) and to directly use the difference between N (y) and its conditional mean
G(y, x_{θ}_{0}) over the support of Y . Under some suitable conditions, the asymptotic distribution
of the pseudo least integrated squares estimator (PLISE) is derived to be multivariate normal.

To make inferences related to θ_{0}, the frequency distributions of its bootstrap analogues are
used to estimate the asymptotic variance of the PLISE because a sandwich-type estimator
tends to provide a very poor approximation. With the proposed residual process, the method

of Xia (2009) is extended to establish a test rule to check the adequacy of model (1). There are two features of the PLISE: Firstly, our estimation approach can be applied to different types of response variable and outperforms the existing ones; secondly, the foregoing inferences can be easily adopted and generalized to the considered problems in this article.

When the true underlying model has a sparse representation, identifying significant co- variates becomes an important issue to enhance the accuracy in prediction. The traditional best-subset selection algorithms are usually computationally infeasible in the presence of a po- tentially high-dimensional covariate space. The ridge regression estimation is another variance- stabilizing technique, which shrinks the least square estimator toward zero but not does not identify significant covariates cleverly. To simultaneously select significant variables and to estimate the parameters in regression models, Tibshirani (1996) introduced a least absolute shrinkage and selection operator (Lasso). Since Lasso variable selection might be inconsistent, Fan and Li (2001) and Zou (2006) proposed a smoothly clipped absolute deviation (SCAD) penalty and an adaptive Lasso instead. In their model specifications, the adaptive Lasso fur- ther avoids the problem of nonconcavity in the SCAD penalty although both of the procedures enjoy the oracle properties. By extending the adaptive Lasso in generalized linear models to our framework, we propose the penalized pseudo least integrated squares estimator (PPLISE) and derive the corresponding oracle properties. Moreover, in a small sample size scenario, a multi-stage adaptive Lasso estimation procedure is proposed to improve the possible selection inconsistency and predictive inaccuracy in the PPLISE.

The rest of this article is organized as follows. In Section 2, we propose the PLISE for θ_{0}
and the cross-validation criteria for bandwidth selection. Moreover, the weighted bootstrap

inference procedures are introduced to estimate the asymptotic variance of the PLISE and construct the confidence regions for parameters of interests. A test rule and a multi-stage adaptive Lasso procedure are established in Section 3. In Sections 4-5, the simulation exper- iments are conducted and the proposed approaches are applied to two empirical examples.

Some concluding remarks and future research topics are provided in Section 6 and the proofs of the main results are placed in the Appendix.

### 2 Estimation and Inference Procedures

Based on a random sample of the form {(X_{i}, Y_{i})}^{n}_{i=1}, the PILSE of θ_{0} and the bandwidth
selection criteria are proposed. The frequency distributions of bootstrap analogues are fully
employed to estimate the asymptotic variance of the PILSE and construct the confidence
intervals for the parameters of interest.

### 2.1 Estimation and Bandwidth Selection

For each fixed (y, xθ), the approach of Hall, Wolff, and Yao (1999) can be applied for the
estimation of G(y, x_{θ}). Let K(u) denote a kernel density, h be a positive-valued bandwidth,
N_{`}(y, X_{iθ}) = P

j6=i(N_{j})^{`}(y)K_{h}(X_{jθ} − X_{iθ})/(n − 1), i = 1, · · · , n, ` = 0, 1, and K_{h}(u) =
h^{−1}K(u/h). The Nadaraya-Watson estimator for G(y, X_{iθ}) is given by bG(y, X_{iθ}) = N_{1}(y, X_{iθ})/

N_{0}(y, X_{iθ}). By using the response process N (y) and a consistent estimator of G(y, x_{θ}), the
PLISE bθ is proposed to be a minimizer of the pseudo sum of integrated squares (PSIS):

SS(θ) = 1 n

n

X

i=1

Z

Y

e^{2}_{i}(y; θ)dW_{ni}(y), (2)

where Y is the support of Y or the interval of interest, e_{i}(y; θ) = N_{i}(y)− bG(y, X_{iθ}), and W_{ni}(y)
is a non-negative weight function. In practical implementation, bG(y, x_{θ}) is set to be zero if
the denominator N_{0}(y, x_{θ}) is zero. Although a local linear estimator of G(y, x_{θ}) can be used
in the PSIS, it does not share the properties of a cumulative distribution function and might
cause some complications in the above estimation procedure.

It follows from a direct algebraic calculation that

E[(N (y) − G(y, X_{θ}))^{2}] = E[(N (y) − F_{Y}(y|X))^{2}] + E[(F_{Y}(y|X) − G(y, X_{θ}))^{2}]. (3)

Since the first term at the right-hand side of (3) does not depend on θ, both minimizers of E[R

Y(N (y) − G(y, Xθ))^{2}dW (y)] and E[R

Y(FY(y|X) − G(y, Xθ))^{2}dW (y)] can be shown to be θ0

under the validity of model (1), where W (y) is a convergent function of W_{n}(y). Thus, minimiz-
ing SS(θ) is on average approximated by minimizing E[R

Y(F_{Y}(y|X) − G(y, X_{θ}))^{2}dW (y)] with
respect to θ. In our theoretical development and numerical implementation, the quartic kernel
K(u) = (15/16)(1−u^{2})^{2}I(|u| ≤ 1) is specified. The advantage of such a density function is that
bθ can achieve the√

n-consistency. As a spacial case, the uniform distribution or the empirical
distribution of Y can be specified for W_{ni}(y)’s in (2). In the case where G(y, x_{θ}) is known, the
optimal weight for w_{ni}(y) = dW_{ni}(y)/dy is proportional to {G(y, X_{iθ})(1 − G(y, X_{iθ}))}^{−1}, the
reciprocal of the conditional variance of N_{i}(y), at each fixed y. We further replace G(y, x_{θ}) by
a consistent estimator bG(y, x_{b}_{θ}) and iteratively update the weight estimation. The resulting
estimator coincides with the maximizer of the following log-pseudo likelihood function for a
random sample {N_{i}(y) : 1 ≤ i ≤ n}:

l_{p}(θ) = 1
n

n

X

i=1

Z

Y

(N_{i}(y) ln( bG(y, X_{iθ})) + (1 − N_{i}(y)) ln(1 − bG(y, X_{iθ}))dy. (4)

Let y_{(1)} < · · · < y_{(m)} denote the distinct order statistics of {Y_{1}, · · · , Y_{n}} and W_{(j)} =
Ry(j+1)

y(j) dW_{ni}(y). Since N_{i}(y)’s are zero-one processes and bG(y, X_{iθ})’s are step functions with
jumps occurring at {y_{(1)}, · · · , y_{(m)}}, the PSIS in (2) has a computationally more attractive
alternative as follows:

SS(θ) = 1 n

n

X

i=1 m−1

X

j=1

e^{2}_{i}(y_{(j)}; θ)W_{(j)}. (5)

In contrast, the estimation procedure of Hall and Yao (2005) is often computationally inten-
sive. When the response variable Y is discrete and has a finite support, the above estimation
criterion can also be applied. As for the binary response with values in {0, 1}, the PSIS will
automatically reduce to the sum of squares in Ichimura (1993). In kernel estimation, a crite-
rion for bandwidth selection is provided via generalizing the most commonly used “leave one
subject out” cross-validation procedure of Rice and Silverman (1991). The optimal bandwidth
h_{cv} is naturally defined to be the unique minimizer of

CV_{1}(h) = 1
n

n

X

i=1 m−1

X

j

e^{2}_{i}(y_{(j)}; bθ_{i})W_{(j)} (6)

with bθ_{i} = arg min{(n − 1)^{−1}P

l6=i

Pm−1

j=1 e^{2}_{l}(y_{(j)}; θ)W_{(j)}}. Another criterion developed by
H¨ardle, Hall, and Ichimura (1993) is further adopted and extended to our framework. The
estimators of h and θ_{0} can be simultaneously obtained via minimizing CV_{2}(h, θ) = SS(θ).

### 2.2 Asymptotic Properties

Suppose that X has a compact support X and θ_{0} is an interior point of the compact pa-
rameter space Θ ⊆ R^{d−1}. Let f (xθ) be the density function of Xθ on Xθ = {xθ : x ∈
X , θ ∈ Θ}, M_{l}_{1}_{l}_{2}(y, x_{θ}) = E[G^{l}^{1}(y, X_{θ}_{0})(x − X)^{⊗l}^{2}|x_{θ}], l_{1} = 0, 1, l_{2} = 0, 1, 2, H_{1}(y, x_{θ}_{0}) =

M_{01}(y, x_{θ}_{0})∂_{x}_{θ}G(y, x_{θ}_{0}), V_{1θ}_{0} = 2E[(R

Yε(y; θ_{0})H_{1}(y, X_{θ}_{0})dW (y))^{⊗2}], and V_{2θ}_{0} = 4E[R

Y H_{1}^{⊗2}(y,
X_{θ}_{0})dW (y)]. To establish the asymptotic properties of bθ, some suitable conditions are made
before establishing the asymptotic properties of bθ as follows:

(A1) inf_{X}_{θ}f (x_{θ}) > 0.

(A2) d^{3}_{x}

θf (x_{θ}) and ∂_{x}^{3}

θM_{12}(y, x_{θ}) are Lipschitz continuous in x_{θ} with the Lipschitz constants
being independent of (y, x_{θ}).

(A3) h = h_{0}n^{−ς}^{1} for ς_{1} ∈ (1/8, 1/5) and some positive constant h_{0}.
(A4) V_{1θ}_{0} and V_{2θ}_{0} are nonsingular.

Since the classes of kernel functions indexed by (h, x_{θ}) are Euclidean (Pakes and Pollard
(1989)), the imposed conditions entail that the considered classes of functions are Euclidean.

By applying Theorem II.37 of Pollard (1984), one has

sup

Y×X_{θ}

|∂_{θ}^{l}^{2}N_{l}_{1}(y, x_{θ}) − ∂_{x}^{l}^{2}

θ(f (x_{θ})M_{l}_{1}_{l}_{2}(y, x_{θ}))| = o(p

ln n/nh^{2l}^{2}^{+1}) + O(h^{2}) a.s. (7)

For simplicity, the consistency and asymptotic normality of bθ are established in the following
theorem with the case of deterministic weight functions W_{i}(y)’s.

Theorem 1. Suppose that assumptions (A1)-(A4) are satisfied. Then, bθ → θ^{p} 0 and √
n(bθ −
θ_{0})→ N (0, Σ^{d} _{θ}_{0}) as n → ∞, where Σ_{θ}_{0} = V_{1θ}^{−1}

0V_{2θ}_{0}V_{1θ}^{−1}

0.

For a continuous response Y , the PMLE eθ of Delecroix, H¨ardle, and Hristache (2003) is obtained by maximizing

l(θ) = 1 n

n

X

i=1

ln(

P

j6=iK_{4h}_{2}(Y_{j}− Y_{i})K_{4h}_{1}(X_{jθ}− X_{iθ})
P

j6=iK_{4h}_{1}(X_{jθ}− X_{iθ}) ), (8)

where K_{4h}(u) = K_{4}(u/h)/h and K_{4}(u) = (105/64)(1 − 5u^{2}+ 7u^{4}− 3u^{6})I(|u| ≤ 1) (cf. Gørgens
(2004)). The fourth-order kernel function is required to ensure the 1/√

n convergence rate.

The authors concluded that the proposed estimator can achieve the asymptotic efficiency. We further examined some mistakes in the theoretical derivations and that eθ only reaches the semiparametric efficiency bound. Let g(y, xθ0) = fY(y|x),

W_{1θ}_{0} = E[(∂_{x}^{2}

θg(Y, X_{θ}_{0})

g(Y, X_{θ}_{0}) + 2dx_{θ}f (Xθ0)∂x_{θ}g(Y, Xθ0)

f (X_{θ}_{0}) )M_{02}(y, X_{θ}_{0}) + ∂x_{θ}g(Y, Xθ0)
g(Y, X_{θ}_{0})

·(2∂x_{θ}M02(y, Xθ0) − M_{01}^{⊗2}(y, Xθ0))], and W2θ0 = E[(∂_{x}_{θ}g(Y, X_{θ}_{0})

g(Y, X_{θ}_{0}) )^{2}(X − E[X|Xθ0])^{⊗2})].

The proofs for Theorem 1 are processed in the same manner for eθ_{1} → θ^{p} _{0} and √

n(eθ_{1}− θ_{0})→^{d}
N (0, W_{1θ}^{−1}

0W_{2θ}_{0}W_{1θ}^{−1}

0) under assumption (A1) and the assumptions:

(B1) d^{3}_{x}_{θ}f (xθ), ∂_{x}^{3}_{θ}M02(y, xθ), and ∂_{x}^{3}_{θ}E[g(Y, Xθ0)(x − X)^{⊗2}|xθ] are Lipschitz continuous in xθ

with the Lipschitz constants being independent of (y, x_{θ}).

(B2) h_{k}= h_{0k}n^{−ς}^{2}, k = 1, 2, for some positive constants h_{0k}’s and ς_{2} ∈ (1/16, 1/6).

(B3) W1θ0 and W2θ0 are nonsingular.

### 2.3 Bootstrap Inferences

Based on the limiting distribution of√

n(bθ−θ_{0}), a general rule in the construction of confidence
intervals usually relies on an appropriate estimator of Σ_{θ}_{0}. One of the most widely used
estimators is the sandwich-type estimator bΣ

bθ = bV^{−1}

1bθ Vb_{2b}_{θ}Vb^{−1}

1bθ , where
Vb_{1b}_{θ} = 2

n

n

X

i=1 m−1

X

j=1

(∂_{θ}G(yb _{(j)}, X_{ib}_{θ}))^{⊗2}W_{(j)} and bV_{2b}_{θ} = 4
n

n

X

i=1

(

m−1

X

j=1

e_{i}(y_{(j)}; bθ)∂_{θ}G(yb _{(j)}, X_{ib}_{θ})W_{(j)})^{⊗2}.

In practical implementation, a sufficiently good performance of bΣ

θb essentially requires an adequate bandwidth. Although the smoother in bΣ

θb can be chosen different from that in
G(y, xb _{θ}_{b}), there is still no standard criterion for doing so.

An alternative approach to avoid encountering such a situation is to employ the frequency
distribution of bootstrap replications. A natural resampling approach is to draw independent
bootstrap random vectors U_{1}^{nb}, · · · , U_{n}^{nb} from the empirical distribution P_{n,U} = n^{−1}Pn

i=1I_{U}_{i}
with U_{i} = (X_{i}, Y_{i}), i = 1, · · · , n. The bootstrap analogue bθ^{nb} is straightforward created
via solving SS(θ) in (2) based on a bootstrap sample {U_{1}^{nb}, · · · , U_{n}^{nb}}. Without requiring
drawing observations from the collected data, we further adopt general weighted bootstrap
approximations for the sampling distribution of bθ. Let ξ_{1}, · · · , ξ_{n} be independently generated
from a common distribution with mean µ and variance σ^{2}. The random weighted bootstrap
estimator bθ^{rw} is then defined to be the minimizer of

SS^{rw}(θ) =

n

X

i=1

D_{i}

m−1

X

j=1

(e^{rw}_{i} (y_{(j)}; θ))^{2}W_{(j)}, (9)

where D_{i} = ξ_{i}/Pn

j=1ξ_{j}, e^{rw}_{i} (y; θ) = N_{i}(y)− bG^{rw}(y, X_{iθ}), bG^{rw}(y, X_{iθ}) = N_{1}^{rw}(y, X_{iθ})/N_{0}^{rw}(y, X_{iθ}),
and N_{l}^{rw}(y, Xiθ) =P

j6=iDj N_{j}^{l}(y)Kh(Xjθ− Xiθ). This bootstrapping approach can be treated
as the naive bootstrap one with a measure P_{n,U}^{rw} =Pn

i=1D_{i}I_{U}_{i}. It is interesting to note that
the dependent weights D_{i}’s can also be replaced with the independent weights ξ_{i}/(nµ)’s.

The random bootstrap confidence intervals for θ_{l}, l = 2, · · · , (d − 1), are constructed by

bθ_{l}± ρz_{1−α/2}se^{rw}(bθ_{l}^{rw}− bθ_{l}) or (bθ_{l}− ρq_{1−α/2}^{rw} (bθ_{l}^{rw}− bθ_{l}), bθ_{l}− ρq_{α/2}^{rw}(bθ^{rw}_{l} − bθ_{l})), (10)
where ρ = µ/σ is a scale factor modification for the variability in the weights, zp is the pth
quantile value of the standard normal distribution, and se^{rw}(·) and q^{rw}(·) denote the standard

error and the 100pth percentile of B bθ^{rw}’s, respectively. Let P^{∗}(·) represent the probability
measure conditioning on {U_{1}, · · · , U_{n}}. The validity of (10) is given in the next theorem.

Theorem 2. Suppose that assumptions in Theorem 1 are satisfied. Then,

P^{∗}(√

nρ(bθ^{rw}− bθ) ≤ w) − P (√

n(bθ − θ_{0}) ≤ w)→ 0 for all w = (w^{p} _{2}, · · · , w_{d})^{T} as n → ∞. (11)

### 3 Model Test and Sparse Models

In this section, a test rule is established for the adequacy of model (1). The PPLISE is proposed to tackle with the problem of sparse variables. A multi-stage adaptive Lasso procedure is further developed to improve the accuracy of variable selection.

### 3.1 Model Checking

Let ε(y; θ) = N (y) − G(y, X_{θ}) and θ_{1} be the minimizer of R

YE[ε^{2}(y; θ)]dW (y). It is straight-
forward to yield θ_{1} = θ_{0} and E[ε(y; θ_{1})] = 0 under model (1). If the considered model
is inadequate, ε_{y}(y; θ_{1}) can be projected into some linear combinations of covariates, i.e.

ε(y; θ_{1}) = ν(y, x_{θ}_{2}) + ζ(y) with E[ζ(y)] = 0 for some y and θ_{2} = arg min_{θ}R

Ymin{E[(ε(y; θ_{1}) −
ν(y, x_{θ}))^{2}], E[ε^{2}(y; θ_{1})]}dW (y). The parameter θ_{2} is naturally estimated by the minimizer ˘θ
of RSS_{n}(θ), where

RSS_{n}(θ) =

m−1

X

j=1

1 n

n

X

i=1

min{(e_{i}(y_{(j)}; bθ) −bν(y_{(j)}, X_{iθ}))^{2}, (e_{i}(y_{(j)}; bθ) − ¯e(y_{(j)}; bθ))^{2}}

!
W_{(j)}

with ν(y, Xb _{iθ}) =
P

j6=ie_{j}(y; bθ)K_{h}(X_{jθ}− X_{iθ})
P

j6=iK_{h}(X_{jθ}− X_{iθ}) and ¯e(y; bθ) = 1
n

n

X

i=1

e_{i}(y; bθ).

By further computing TSS_{n} = n^{−1}Pn
i=1

Pm−1

j=1 (e_{i}(y_{(j)}; bθ)−¯e(y_{(j)}; bθ))^{2}W_{(j)}, the test statistic
F_{n}= RSS_{n}(˘θ)/TSS_{n} is used to check model (1):

(H_{0} : F_{Y}(y|x) = G(y, x_{θ}_{0}) for all (x, y)
H_{A}: F_{Y}(y|x) 6= G(y, x_{θ}_{0}) for some (x, y) .

The null hypothesis is rejected with a significance level α whenever F_{n} ≤ q_{α}(F_{n}^{b}), where F_{n}^{b}
is the bootstrap analogue of F_{n} with e^{∗}_{i}(y; bθ)’s substituting for e_{i}(y; bθ)’s and each e^{∗}_{i}(y; bθ)
being independently drawn from a two-point distribution: ((5 +√

5)/10)δ_{(1−}^{√}_{5)e}

i(y;bθ)/2+ ((5 −

√5)/10)δ_{(1+}^{√}_{5)e}

i(y;bθ)/2 (cf. H¨ardle (1989)). As expected, the developed test rule is generally more powerful than those based on X, especially when its dimension is high. Similar to the single-indexing cross-validation value of Xia (2009), we consider the measure

SCV_{n} =

m−1

X

j=1

1 n

n

X

i=1

min{(e_{i}(y_{(j)}; bθ) −bν(y_{(j)}, X_{i˘}_{θ}

i))^{2}, (e_{i}(y_{(j)}; bθ) − ¯e(y_{(j)}; bθ))^{2}}

!

W_{(j)}, (12)

where ˘θ_{i} = arg min_{θ}Pm−1
j=1 (Pn

i=1min{(e_{i}(y_{(j)}; bθ) −bν_{i}(y_{(j)}, X_{iθ}))^{2}, (e_{i}(y_{(j)}; bθ) − ¯e(y_{(j)}; bθ))^{2}})W_{(j)}
if it exists and νb_{i}(y, X_{iθ}) is computed as bν(y, X_{iθ}) with the ith subject being deleted, i =
1, · · · , n. Following the argument of this author, one can also conclude that as the sample
size approaches infinity, SCV_{n}= TSS_{n} if model (1) is adequate and SCV_{n}< TSS_{n} otherwise.

### 3.2 Adaptive Lasso Estimation and Oracle Properties

The PPLISE bθ_{(p)} of θ_{0} is obtained via minimizing the penalized pseudo sum of integrated
squares (PPSIS):

PSS(θ) = SS(θ) + λ

d

X

l=2

|θ_{l}|

|bθ_{l}|, (13)

where λ is a nonnegative regularization or tuning parameter. In this variable selection and estimation procedure, significant covariates receive smaller penalties and tend to have nonzero

coefficient estimates while nonsignificant coefficients will be shrunk into zero. The above
optimization problem entails that the true underlying model can be consistently identified
and bθ_{(p)A}_{0} has the same asymptotic distribution as bθA_{0}, where A_{0} = {l|θ_{0l} 6= 0}.

Theorem 3. Suppose that assumptions (A1)-(A4) are satisfied and λ = λ_{0}n^{−ς}^{3} for ς_{3} ∈
(1/2, 1) and some positive constant λ_{0}. Then, P ( bA = A_{0}) → 1 and √

n(bθ_{(p)A}_{0} − θ_{A}_{0}) →^{d}
N (0, Σ_{θ}_{A0}) as n → ∞, where bA = {l|bθ_{l} 6= 0} and Σ_{θ}_{A0} is the asymptotic variance of bθA_{0}.

It is revealed in our simulation experiments that the one-stage adaptive Lasso estimation
in (13) usually cannot achieve the variable selection well in small sample applications. To
conquer this shortcoming, we develop a multi-stage adaptive Lasso estimation scheme. Let
eθ_{(m)} represent the vector of nonzero estimates at the mth stage, θ_{(m)} = (θ_{(m)l−}^{T} , θ_{(m)l}, θ^{T}_{(m)l+})^{T}
be the corresponding parameter vector with a length of d_{m}, and θ_{(m)l−} and θ_{(m)l+} denote the
vectors of coefficients with sub-indices smaller and greater than l, respectively. Moreover,
SS_{(m)}(θ_{(m)}) is defined as SS(θ) with θ being replaced with θ_{(m)}. The estimation procedure is
implemented through the following steps:

S1. (eθ_{(1)l}^{(1)}, eh^{(1)}_{(1)l}) = arg min_{θ}_{l}_{,h}SS_{(1)}(eθ_{(1)l−}^{(1)} , θ_{(1)l}, eθ_{(0)l+}) + λ|θ_{l}|/|eθ_{(1)l}^{(0)}| with eθ^{(0)}_{(1)l} = bθ_{l} and eθ^{(1)}_{(1)l} = 0
whenever |eθ^{(1)}_{(1)l}| < ε_{0}, l = 2, · · · , d_{1}, for some sufficiently small positive value ε_{0}.

S2. Set (eθ^{(k)}_{(1)l}, eh^{(k)}_{(1)l}) = (0, eh^{(k−1)}_{(1)l} ) if eθ^{(k−1)}_{(1)l} = 0; otherwise, (eθ^{(k)}_{(1)l}, eh^{(k)}_{(1)l}) = arg minθ_{l},hSS_{(1)}
(eθ^{(k)}_{(1)l−}, θ_{(1)l}, eθ_{(1)l+}^{(k−1)}) + λ|θ_{l}|/|eθ_{(0)l}^{(k−1)}| and eθ_{(1)l}^{(k)} = 0 whenever |eθ^{(k)}_{(1)l}| < ε_{0}, k = 2, · · · .

S3. Iterations are stopped if keθ^{(k)}_{(1)}− eθ_{(1)}^{(k−1)}k < ε_{1} for some pre-chosen small value ε_{1} > 0, and
θe_{(1)λ} is set to be non-zero components of eθ_{(1)}^{(k)}.

S4. eθ_{(1)} = eθ_{(1)λ}_{1} with λ_{1} = arg min_{λ}GCV(λ) and

GCV(λ) = SS_{(1)}(eθ_{(1)λ})
(1 − n^{−1}tr(( bV_{1e}_{θ}

(1)λ+ diag(λ/neθ^{2}_{(1)λ}))^{−1}Vb_{1e}_{θ}

(1)λ))^{2}.

S5. Repeat steps 1-4 M times until keθ_{(M )}− eθ_{(M −1)}k < ε_{2} for some small value ε_{2} > 0.

### 4 Monte Carlo Experiments

The performances of proposed estimation and inference procedures were assessed through a class of simulations with a variety of sample sizes, correlation structures of covariates, and error processes. The simulations were based on 1000 replications and the bootstrap inferences were drawn from 500 bootstrap samples, which enable us to obtain stable numerical results.

### 4.1 Assessment of Estimators and Inference Procedures

In this simulation scenario, the covariates X = (X_{1}, X_{2}, X_{3})^{T} were generated from a trivariate
normal distribution with mean zero, standard deviation of 1, and pairwise correlations of 0.2
or 0.5. The response variable Y was generated from the following two models:

M1. Y = X_{θ}_{0}+ ε with θ_{0} = (1, 0.8, −0.5)^{T} and ε ∼ N (0, 0.25).

M2. Y = X_{θ}_{0} + ε with θ_{0} = (1, 0.8, −0.5)^{T} and ε ∼ N (0, 0.25X_{θ}^{2}_{0}).

The uniform distribution and the empirical distribution of Y were used as the weights in (5)
with the resulting estimators being denoted by bθ and bθ^{emp}, respectively. We compared the
finite sample properties of our estimators bθ and bθ^{emp} with the estimator ˘θ of Hall and Yao

(2005), the pseudo maximum likelihood estimator (PMLE) eθ, and the pseudo least squares estimator (PLSE) ¯θ. The inference procedures based on the asymptotic normality of bθ and the frequency distributions of its bootstrap analogues were also studied by simulations. With the limitation of number of pages, the exchangeable random weights were only investigated through independent and identically distributed Gamma(4, 2) random variables, which have better numerical results than others.

Table 4.1 displays the means and the standard deviations of 1000 estimates with the sample
sizes (n) of 100, 250, and 500, and the correlation coefficients (ρ) of 0.2 and 0.5. The biases of
compared estimators are generally inapparent except for eθ and ¯θ under (n, ρ) = (100, 0.5) and
θ under (n, ρ) = (100, 0.5) and model M1. As one can expect, the standard deviations of these˘
estimators decrease as n increases or as ρ becomes small. We further detect in this table that eθ
tends to have a substantially large variance when n is small. The high variability in eθ is mainly
caused by the use of the fourth-order kernel function. In practical applications, the second-
order kernel function is often used to overcome this shortcoming. In addition, the simulation
results indicate that bθ^{emp} has the smallest variance under the validity of heterogeneous error
model and bθ is comparable with ¯θ in the case of a constant error process. Even if ¯θ performs
satisfactorily in the homogeneous error model, it has a relatively large variance among all
estimators in the heterogeneous one. The CPU time for the computation of bθ^{emp} is much
shorter than that for ˘θ although both estimators have a similar performance.

The sandwich-type estimate (SANE), the naive bootstrap estimate (NBE), and the random weighted bootstrap estimate (RWE) of the standard deviation of bθ are provided in table 4.2.

Overall, the SANE underestimates the asymptotic variance in a more pronounced fashion even

for a sufficiently large n. We further found that the bootstrap estimates slightly overestimate the asymptotic variance but their accuracies significantly improve as the sample size increases.

Apparently, the RWE is closer to the asymptotic variance than the NBE. Tables 4.2 also presents the empirical coverage probabilities, the average lengths of 0.95 quantile intervals of 1000 estimates, and the average lengths of 1000 bootstrap normal approximated and quantile confidence intervals (LBNCI, LBQCI). It is revealed that all the bootstrap confidence intervals are wider than the true qunatile intervals and approach the expected ones with adequate sample size. The empirical coverage probabilities of normal approximated confidence intervals (BNCI) tend to be higher than the nominal level whereas the bootstrap quantile intervals (BQI) are slightly smaller than the nominal one. In general, these constructed confidence intervals have fairly accurate coverage probabilities and provide greater precision as the sample size increases.

### 4.2 Assessment of Model Checking and Adaptive Lasso

The performances of testing procedures were studied through models M2 and
M3. Y = X_{1} + 0.8X_{2}^{2}− 0.5(1 + |X_{3}|)^{−1}+ ε with ε ∼ N (0, 0.25X_{θ}^{2}_{0}).

Table 4.3 summarizes the estimated sizes and powers for the hypothesis of model correctness
and the rejection proportions of the single-indexing cross-validation method. Under the va-
lidity of model M2, the simulation results indicate that the estimated sizes are all smaller
than the nominal size at 0.05. The power performance under model M3 is fairly good and
the high power is generally associated with the large sample size. From the simulation exper-
iments, the measure SCV_{n} tends to have high rejection rates for model M3 and, except for

(n, ρ) = (250, 0.2), to some extent higher ones for model M2.

Table 4.3

The estimated sizes and powers, and the rejection proportions (RP) based on 1000 (SCV_{n}) values

M2 M3

(n, ρ) α RP β RP

(100, 0.2) 0.001 0.063 0.903 0.952 (250, 0.2) 0.000 0.050 0.984 0.994 (100, 0.5) 0.001 0.060 0.936 0.991 (250, 0.5) 0.000 0.066 1.000 1.000

We further assessed the multi-stage adaptive Lasso algorithm through models M1-M2
with X = (X_{1}, X_{2}, X_{3}, X_{4}, X_{5}, X_{6}, X_{7})^{T} and θ_{0} = (1, 0.8, −0.5, 0, 0, 0, 0)^{T}. Since the average
numbers of selecting incorrect zero coefficients are zero in all approaches, we only displayed
those of correct zero coefficients. The mean squared error E[(θ_{est} − θ_{0})^{T}Σ_{X}(θ_{est} − θ_{0})] of
any estimator θest was utilized to evaluate the predictive accuracy, where ΣX is the variance-
covariance matrix of X. Table 4.4 gives the average numbers of correct zeros, the mean squared
errors of estimators, and the proportions of variable selection. For the sample size of 100, the
second-stage adaptive Lasso is found to have more accurate magnitudes of zero coefficients
than the first-stage one while no significant improvement is detected in the third-stage one.

To sum up, the performance of the PLISE is the worst in selecting important covariates.

Again, the covariates with zero coefficients are rarely selected in our multi-stage adaptive Lasso estimation procedure. Interestingly, the influences of correlation coefficient and error structure are inapparent in the second-stage and third-stage ones. The performances of all the adaptive Lasso estimation procedures become undifferentiated to each other as the sample size increases. Moreover, an estimator obtained from the multi-stage adaptive Lasso yields a smaller mean squared error than those of the others.

### 5 Empirical Examples

The PLISE and the corresponding inference procedures are applied to the Boston house-price data. The multi-stage adaptive Lasso procedure is further adopted to identify significant meteorological variables on ozone concentration in the New York metropolitan area. Moreover, all the considered variables are all standardized to have mean of zero and variance of one.

### 5.1 Application to a Study of House Price

The first analyzed data were collected by the U.S. Census Service in the area of Boston.

A total of 506 observations on 14 attributes are contained in this data set. Measurements of interest include median value of owner-occupied homes in $1,000’s (medv), logarithm of percentage of lower status of the population (lstat), average number of rooms per dwelling (rm), logarithm of full-value property-tax rate per $10,000 (tax), pupil teacher ratio by town (ptrat), and weighted distances to five Boston employment centres (dis).

The PLISE (−0.715, 0.470, 0.350, 0.201) is obtained with the bootstrap standard errors
(0.2116, 0.1194, 0.0645, 0.0420). The cross-validation bandwidth h_{cv} = 1.0765 is chosen for
the coefficient estimates of the SI: lstat + θ_{2}rm + θ_{3}tax + θ_{4}ptrat + θ_{5}dis. The PMLE
(−0.843, 0.615, 0.433, 0.144) and the PLSE (−0.302, 0.217, 0.270, 0.302) are also found to have
very similar explanations of the meaning of predictors. The 0.95 bootstrap normal approx-
imated and quantile confidence intervals are (−1.1302, −0.3007) and (−1.3029, −0.3170) for
rm, (0.2360, 0.7040) and (0.2363, 0.6862) for tax, (0.2240, 0.4766) and (0.2191, 0.4723) for
ptrat, and (0.1189, 0.2834) and (0.1290, 0.2906) for dis. These variables are detected to be

significantly associated with medv. It concurs with the simulation finding that the constructed
confidence intervals were fairly close when the sample size is large. In addition, the values of
SCV_{n} and the test statistic F_{n}are computed to be 1.987 and 0.878, respectively. As evidenced
from SCV_{n}< TSS_{n}(= 2.108) and the bootstrap p-value of 0.006, the fitted SI model might be
too simple. A more thorough investigation is needed to spot the relationship of the covariates
on the conditional distribution of medv.

### 5.2 Application to a Study of Air Quality

The second data contain the measurements of daily ozone concentration (ozone), wind speed (wind), daily maximum temperature (temp), and solar radiation level (solar) on 111 successive days at meteorological stations from May to September 1973 in New York metropolitan area.

The variables wind, temp, solar, wind^{2}, temp^{2}, solar^{2}, wind ∗ temp, wind ∗ solar, and temp ∗
solar were included in the initial model fitting with the first one being the baseline covariate.

Our primary interest is to detect significant meteorological factors on ozone.

The PLISE (−1.417, −0.211, −0.482, 0.087, 0.560, −0.984, 0.077, −0.471) is obtained with
the cross-validation bandwidth h_{cv} = 0.7517 being chosen. The PLISE and the PMLE
(−1.351, −0.162, −0.582, −0.033, 0.441, −0.871, −0.032, −0.381) are more close, whereas the
PLSE (−1.840, 0.073, −0.283, 0.324, 0.393, −0.947, 0.290, −0.632) is slightly different. We fur-
ther implement the multiple-stage adaptive Lasso estimation and identify that the predic-
tors solar, temp^{2}, and wind ∗ solar have zero coefficients. Note that the coefficients with
zero estimates from the first-stage are the same as those from the second-stage. The es-
timates of coefficients, the bootstrap standard errors, and the bootstrap confidence inter-

vals are presented in table 5.1. We apply the conditional distribution model with the SI:

wind + θ_{2}temp + θ_{3}wind^{2}+ θ_{4}solar^{2}+ θ_{5}wind ∗ temp + θ_{6}temp ∗ solar and test the adequacy
of the fitted model. The value of the test statistic F_{n} is approximately 0.993 (bootstrap p-
value=0.382) and SCV_{n}= 8.387 is greater than TSS_{n}= 8.185. Thus, no significant evidence
is identified to reject the considered model.

Table 5.1

The PPLISE, the bootstrap standard errors, and the 0.95 bootstrap confidence intervals

Variable PPLISE RWE BNCI BQCI

temp -1.401 0.1778 (-1.7496,-1.0525) (-1.7453,-1.0689)
wind^{2} -0.497 0.0707 (-0.6360,-0.3588) (-0.6218,-0.3859)
solar^{2} 0.528 0.1227 ( 0.2871, 0.7680) ( 0.3252, 0.8155)
wind ∗ temp -1.020 0.0910 (-1.1988,-0.8420) (-1.1234,-0.8614)
temp ∗ solar -0.356 0.0803 (-0.5136,-0.1987) (-0.5060,-0.2067)

### 6 Concluding Remarks and Further Extensions

This article presents an appealing estimation procedure for index coefficients and outperforms
the existing ones. Compared with the PMLE, an important advantage of the PLISE is that it
only requires a lower-order kernel in a one-dimensional bandwidth space. The modified cross-
validation scores and residual process are also provided for bandwidth selection and model
checking. Because of the poor approximation of the sandwich-type estimator, we employ
random weighted bootstrap analogues of the asymptotic variance of the PLISE. The L^{1}-
penalty with random weights is further adopted into the PLISE criterion to improve estimation
and variable selection simultaneously in sparse high-dimensional models. Under the partial
orthogonality condition of Huang, Ma, and Zhang (2008), our PPLISE still enjoys the oracle

property when the number of covariates increases exponentially with the sample size.

In some applications, the predictive abilities of covariates might depend on the values of a response variable. It is more realistic to consider the following varying-index model:

F_{Y}(y|x) = G(y, x_{θ}_{0}_{(y)}), (14)

where θ_{0}(y) is a vector of index coefficient functions of y. This modelling approach is especially
useful to handle an ordinal response variable and for quantile forecasting. The PSIS in (5)
and the PPSIS in (13) can be modified as

SS(θ(y)) = 1 n

n

X

i=1

(N_{i}(y) − bG(y, X_{iθ(y)}))^{2} and PSS(θ(y)) = SS(θ(y)) + λ_{y}

d

X

k=2

|θ_{k}(y)|

|bθ_{k}(y)|. (15)
In survival analysis, the response measurement represents the time to a particular event. It
is worthy to note that the considered model includes more acceptable proportional hazards
and accelerated failure time models. A major challenge in dealing with this issue is that the
failure times of some individuals might not be available due to censoring. Our results should
be valuable in the development of related inferences.

### APPENDIX: PROOFS

Proof of Theorem 1.

By assumptions (A1) and (A3), we can derive from (7) that sup

Y×X_{θ}

| bG(y, x_{θ}) − G(y, x_{θ}_{0})| ≤ 1

infX_{θ}f (x_{θ}) sup

Y×X_{θ}

|N_{1}(y, x_{θ}) − f (x_{θ})M_{10}(y, x_{θ})|

+sup_{Y×X}

θ|N_{1}(y, x_{θ})|

infX_{θ}N_{0}(y, x_{θ}) sup

X_{θ}

|N_{0}(y, x_{θ}) − f (x_{θ})| = o_{p}(p

ln n/nh) + O(h^{2}) = o_{p}(1). (16)

It follows immediately from (16) that

sup

θ

|SS(θ) − Z

Y

E[(N (y) − M_{10}(y, X_{θ}))^{2}]dW (y)| = o_{p}(1). (17)

Moreover, θ_{0} can be shown to be a unique minimizer of R

YE[(N (y) − M_{10}(y, X_{θ}))^{2}]dW (y)
through the inequalityR

Y E[(N (y) − M10(y, Xθ))^{2}]dW (y) ≥R

YE[ε^{2}(y; θ0)]dW (y). With (17),
the consistency of bθ is ensured via applying Theorem 5.1 of Ichimura (1993).

Along the same lines as the proof in (16), one has

sup

Y×X

|∂_{θ}G(y, xb _{θ}_{0}) − H_{1}(y, x_{θ}_{0})| = o_{p}(p

ln n/nh^{3}) + O(h^{2}). (18)

The score function√

nS(θ0) =√

n∂θSS(θ0) can be further decomposed as

√nS(θ_{0}) =

1

X

l1=0 1

X

l2=0

√nS_{l}_{1}_{l}_{2}(θ_{0}), (19)

where S_{l}_{1}_{l}_{2}(θ_{0}) = −(2/n)Pn
i=1

R

Yε^{1−l}_{i} ^{1}(y; θ_{0})H_{1}^{1−l}^{2}(y, X_{iθ}_{0})(G(y, X_{iθ}_{0}) − bG(y, X_{iθ}_{0}))^{l}^{1}(∂_{θ}G(y,b
Xiθ0) − H1(y, Xiθ0))^{l}^{2}dWni(y). It is implied from (16) and (18) that

√nS_{11}(θ_{0}) = o_{p}(1). (20)

Let A_{1mi}= −M_{01}^{2−m}(y, Xiθ0)∂_{x}^{2−m}_{θ} G(y, Xiθ0)

f (X_{iθ}_{0}) , A_{2mi} = ε_{i}(y; θ_{0})G^{2−m}(y, X_{iθ}_{0})

f (X_{iθ}_{0}) , m = 1, 2,
A_{2mi}= 1

f^{2}(X_{iθ}_{0})(ε_{i}(y; θ_{0})∂_{x}_{θ}(f (X_{iθ}_{0})M_{11}(y, X_{iθ}_{0}))G^{4−m}(y, X_{iθ}_{0}) + (m − 4)f (X_{iθ}_{0})M_{11}(y, X_{iθ}_{0})

∂_{x}_{θ}G(y, X_{iθ}_{0})), m = 3, 4, φ_{kmij} = N_{j}^{2−m}(y)K_{h}(X_{jθ}_{0}− X_{iθ}_{0}) − G^{2−m}(y, X_{iθ}_{0})f (X_{iθ}_{0}), k, m = 1, 2,
and φ_{2mij} = N_{j}^{4−m}(y)∂_{θ}K_{h}(X_{jθ}_{0}− X_{iθ}_{0}) − G^{2−m}(y, X_{iθ}_{0})∂_{x}_{θ}(f (X_{iθ}_{0})M_{01}(y, X_{iθ}_{0})), m = 3, 4.

The terms √

nS_{l}_{1}_{l}_{2}(θ_{0}), l_{1} 6= l_{2}, in (19) can be rewritten as

√nSl1l2(θ0) = −2

√n

2k

X

m=1

X

i6=j

Z

Y

AkmiφkmijdWni(y) + op(1), k = l1+ 2l2. (21)

A little tedious but straightforward algebra yields E[A_{kmi}|x_{iθ}_{0}] = 0 and E[φ_{kmij}|x_{i}, y_{i}] =
O(h^{2}), k = 1, 2, which imply that

P (sup

Y

√n|S_{l}_{1}_{l}_{2}(θ_{0})| > ε) ≤ 1
ε^{2}n(n − 1)^{2}

2k

X

m=1

sup

Y

(X

i6=j

(E[A^{2}_{kmi}φ^{2}_{kmij}] +X

l6=i,j

E[A^{2}_{kmi}φ_{kmij}φ_{kmil}])

= O(1/n) + O(h^{4}), k = l_{1} + 2l_{2}, for any ε > 0. (22)

Combining (20)-(22), one has √

nS(θ_{0}) = √

nS_{00}(θ_{0}) + o_{p}(1). By applying the central limit
theorem, it is easily to show that

√nS(θ_{0})→ N (0, V^{d} _{1θ}_{0}). (23)

Similar to the proof of (16) and (18), there exist functions H_{2}(y, x_{θ}) and H_{2}(θ) = E[R

Y(H_{1}^{⊗2}(y
X_{θ}) − ε(y; θ)H_{2}(y, X_{θ}))dW_{ni}(y)] satisfying

sup

Y×X_{θ}

|∂_{θ}^{2}G(y, xb _{θ}) − H_{2}(y, x_{θ})| = o_{p}(p

ln n/nh^{5}) + O(h^{2}) and (24)

sup

θ

|I(θ) − H_{2}(θ)| = o_{p}(p

ln n/nh^{5}) + O(h^{2}). (25)
From (18), (24), and the law of large numbers, a direct calculation yields

I(θ_{0}) = V_{2θ}_{0} − 2
n

n

X

i=1

Z

Y

ε_{iy}(θ_{0})H_{2}(y, X_{iθ}_{0})dW_{ni}(y) + o_{p}(1) = V_{2θ}_{0} + o_{p}(1). (26)
By assumption (A1), there exists an open set Θ_{0} ⊂ Θ such that sup_{θ∈Θ}_{0}kH_{2}(θ) − H_{2}(θ_{0})k

< ε/3 for any ε > 0. We can further derive from (25) that P ( sup

θ∈Θ0

kI(θ) − I(θ_{0})k > ε) ≤ P ( sup

θ∈Θ0

kI(θ) − H_{2}(θ)k > ε

3) + P (kH_{2}(θ_{0}) − I(θ_{0})k > ε
3).

and, hence, P ( sup

θ∈Θ0

kI(θ) − I(θ_{0})k > ε) → 0 as n → ∞. (27)

Applying the Taylor expansion to S(bθ) around θ_{0} to the second order, one has

S(θ_{0}) + I(θ^{∗})(bθ − θ_{0}) = 0, (28)

where θ^{∗} lies on the line segment between bθ and θ_{0}. Finally, the limiting distribution of

√n(bθ − θ0) is obtained by the Slutsky’s theorem, the consistency of bθ, (23), and (27)-(28).

Proof of Theorem 2.

Let S^{rw}(θ) and I^{rw}(θ) denote the first and second derivatives of SS^{rw}(θ). Paralleling the
proof steps in (7), we can show that

sup

Y×X_{θ}

|∂_{θ}^{l}^{2}N_{l}^{rw}_{1} (y, xθ) − ∂_{θ}^{l}^{2}Nl1(y, xθ)| = op^{∗}(1/nh^{2l}^{2}^{+1}). (29)

By the Taylor expansion and (29), one has

sup

θ

|SS^{rw}(θ) − SS(θ)| ≤ 1
n

n

X

i=1

(ξ_{i}

µ − 1) sup

θ

| Z

Y

e^{2}_{i}(y; θ)dW_{ni}(y)| + o_{p}^{∗}(1) = o_{p}^{∗}(1). (30)

The same argument for the convergence of bθ to θ implies that (bθ^{rw}− bθ) = o_{p}^{∗}(1). From (29)
and S(bθ) = 0, the score function S^{rw}(bθ) can be expressed as

S^{rw}(bθ) = S^{rw}_{00}(bθ) + S^{rw}_{01}(bθ) + S^{rw}_{10}(bθ) + o_{p}^{∗}(p

1/n) (31)

with IS^{rw}_{l}_{1}_{l}_{2}(bθ) = −(2/n)Pn

i=1(ξ_{i}/µ)R

Ye^{1−l}_{i} ^{1}(y; θ)(∂_{θ}G(y, Xb _{ib}_{θ}))^{1−l}^{2}(∂_{θ}Gb^{rw}(y, X_{ib}_{θ})−∂_{θ}G(y, Xb _{ib}_{θ}))^{l}^{1}
( bG(y, X_{ib}_{θ}) − bG^{rw}(y, X_{ib}_{θ}))^{l}^{2}dW_{ni}(y). We further confirm from E^{∗}[S^{rw}_{00}(bθ)] = 0, E^{∗}[(S^{rw}_{00}(bθ))^{2}]→^{p}
ρ^{−2}V2θ0, and the central limit theorem for independent random vectors (Serfling (1980)) that

P^{∗}(ρV_{2}^{−1/2}√

nS^{rw}_{00}(bθ) ≤ w)→ Π^{p} ^{d}_{l=2}Φ(w_{l}). (32)