New Estimation and Inference Procedures for a Single-Index Conditional Distribution Model

PREPRINT

Preprint of the Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/~mathlib/preprint/2011-15.pdf

Chin-Tsang Chiang and Ming-Yueh Huang

July 31, 2011

Department of Mathematics, National Taiwan University, Taipei 10617, Taiwan, ROC

Abstract

In this article, a more flexible single-index regression model is employed to characterize the conditional distribution. The pseudo least integrated squares approach is proposed to estimate the index coefficients. As shown in the numerical results, our estimator outperforms the existing ones in terms of the mean squared error. Moreover, we provide generalized cross-validation criteria for bandwidth selection and utilize the frequency distributions of weighted bootstrap analogues for the estimation of the asymptotic variance and the construction of confidence intervals. With a defined residual process, a test rule is established to check the adequacy of an applied single-index conditional distribution model. To tackle the problem of sparse variables, a multi-stage adaptive Lasso algorithm is developed to enhance the ability to identify significant variables. All of our procedures are found to be easily implemented, numerically stable, and highly adaptive to a variety of data structures. In addition, we assess the finite sample performances of the proposed estimation and inference procedures through extensive simulation experiments. Two empirical examples from the house-price study in Boston and the environmental study in New York are further used to illustrate applications of the methodology.

Keywords: Adaptive Lasso; Cross-validation; Curse of dimensionality; Multi-stage adaptive Lasso; Naive bootstrap; Oracle properties; Single-index; Pseudo least integrated squares estimator; Pseudo least squares estimator; Pseudo maximum likelihood estimator; Random weighted bootstrap; Residual process.

1 Introduction

We consider the conditional distribution F_Y(y|x) of a real-valued response Y given continuous covariates X = x, where X = (X₁, · · · , X_d)^T and x = (x₁, · · · , x_d)^T. In regression analysis, a wide range of research has focused on the conditional mean E[Y|x].

A more complete methodology and theoretical framework for fully nonparametric and semiparametric distribution models is still lacking, and further investigation is necessary. With a large number of covariates, a fully nonparametric distribution estimator usually suffers from the curse of dimensionality (Bellman (1961)). Although parametric models have played prominent roles in applications, they are frequently found to be inadequate in many studies. Consequently, a more flexible semiparametric model is of great interest: it characterizes the dependence of Y on X while avoiding both the impact of misspecified parametric models and the difficulty of estimating nonparametric distributions.

One of the most popular extensions of parametric models is the single-index (SI) conditional distribution model:

F_Y(y|x) = G(y, x_θ₀), (1)

where G(·, ·) is an unknown bivariate function, x_θ = x₁ + (x₂, · · · , x_d)^Tθ, and θ₀ is a vector of true index coefficients. The most significant covariate is assumed, without loss of generality, to be X₁, and its coefficient is fixed at one mainly to deal with the problem of identifiability. When the conditional mean exists, it follows easily from the above model that E[Y|x] = m(x_θ₀) with m(·) being some unspecified function. Based on the conditional mean model, Powell, Stock, and Stoker (1989) utilized the estimation of the density-weighted average derivative to estimate θ₀. Although the estimator was shown to be √n-consistent, asymptotically normal, and computationally simple, numerical instability is often observed as a consequence of high-dimensional kernel smoothing. To overcome this weakness in practice, Ichimura (1993) developed a semiparametric least squares approach and derived its asymptotic properties. Meanwhile, Härdle, Hall, and Ichimura (1993) recommended a cross-validation criterion to simultaneously estimate bandwidths and index coefficients. Under the validity of model (1) with a continuous response, Delecroix, Härdle, and Hristache (2003) introduced the pseudo likelihood (PL) estimation for θ₀. Without moment and continuity conditions on Y, Hall and Yao (2005) suggested an estimation criterion on the basis of the average squared difference between the empirical estimator and the model-based estimator of the joint probability of (Y, X^T)^T. The good performance of their estimation procedure depends on an appropriate number of spheres and the corresponding radii used in the integral approximation. Currently, there is still no standard rule to determine the values of these two quantities. Furthermore, the established algorithm is often computationally slow and intensive, especially in high-dimensional covariate spaces. Confronted with these problems, we propose a new type of estimation criterion for θ₀, which is simple and easily implemented. The basic rationale behind this approach is to define the response process N(y) = I(Y ≤ y) and to directly use the difference between N(y) and its conditional mean G(y, x_θ₀) over the support of Y. Under some suitable conditions, the asymptotic distribution of the pseudo least integrated squares estimator (PLISE) is derived to be multivariate normal.

To make inferences related to θ₀, the frequency distributions of its bootstrap analogues are used to estimate the asymptotic variance of the PLISE because a sandwich-type estimator tends to provide a very poor approximation. With the proposed residual process, the method of Xia (2009) is extended to establish a test rule to check the adequacy of model (1). There are two features of the PLISE: first, our estimation approach can be applied to different types of response variable and outperforms the existing ones; second, the foregoing inferences can be easily adopted and generalized to the problems considered in this article.

When the true underlying model has a sparse representation, identifying significant covariates becomes an important issue for enhancing the accuracy of prediction. The traditional best-subset selection algorithms are usually computationally infeasible in the presence of a potentially high-dimensional covariate space. Ridge regression is another variance-stabilizing technique, which shrinks the least squares estimator toward zero but does not identify significant covariates. To simultaneously select significant variables and estimate the parameters in regression models, Tibshirani (1996) introduced the least absolute shrinkage and selection operator (Lasso). Since Lasso variable selection might be inconsistent, Fan and Li (2001) and Zou (2006) proposed a smoothly clipped absolute deviation (SCAD) penalty and an adaptive Lasso, respectively. In their model specifications, the adaptive Lasso further avoids the problem of nonconcavity in the SCAD penalty, although both procedures enjoy the oracle properties. By extending the adaptive Lasso in generalized linear models to our framework, we propose the penalized pseudo least integrated squares estimator (PPLISE) and derive the corresponding oracle properties. Moreover, in a small sample size scenario, a multi-stage adaptive Lasso estimation procedure is proposed to remedy the possible selection inconsistency and predictive inaccuracy of the PPLISE.

The rest of this article is organized as follows. In Section 2, we propose the PLISE for θ₀ and the cross-validation criteria for bandwidth selection. Moreover, the weighted bootstrap inference procedures are introduced to estimate the asymptotic variance of the PLISE and construct the confidence regions for parameters of interest. A test rule and a multi-stage adaptive Lasso procedure are established in Section 3. In Sections 4-5, the simulation experiments are conducted and the proposed approaches are applied to two empirical examples. Some concluding remarks and future research topics are provided in Section 6 and the proofs of the main results are placed in the Appendix.

2 Estimation and Inference Procedures

Based on a random sample of the form {(X_i, Y_i)}, i = 1, · · · , n, the PLISE of θ₀ and the bandwidth selection criteria are proposed. The frequency distributions of bootstrap analogues are fully employed to estimate the asymptotic variance of the PLISE and construct the confidence intervals for the parameters of interest.

2.1 Estimation and Bandwidth Selection

For each fixed (y, x_θ), the approach of Hall, Wolff, and Yao (1999) can be applied for the estimation of G(y, x_θ). Let K(u) denote a kernel density, h be a positive-valued bandwidth,

$$N_\ell(y, X_{i\theta}) = \frac{1}{n-1}\sum_{j \neq i} N_j^{\ell}(y)\, K_h(X_{j\theta} - X_{i\theta}), \quad i = 1, \dots, n, \ \ell = 0, 1,$$

and $K_h(u) = h^{-1}K(u/h)$. The Nadaraya-Watson estimator for $G(y, X_{i\theta})$ is given by $\hat G(y, X_{i\theta}) = N_1(y, X_{i\theta})/N_0(y, X_{i\theta})$. By using the response process N(y) and a consistent estimator of G(y, x_θ), the PLISE $\hat\theta$ is proposed to be a minimizer of the pseudo sum of integrated squares (PSIS):

$$SS(\theta) = \frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{Y}} e_i^2(y; \theta)\, dW_{ni}(y), \tag{2}$$

where $\mathcal{Y}$ is the support of Y or the interval of interest, $e_i(y; \theta) = N_i(y) - \hat G(y, X_{i\theta})$, and $W_{ni}(y)$ is a non-negative weight function. In practical implementation, $\hat G(y, x_\theta)$ is set to zero if the denominator $N_0(y, x_\theta)$ is zero. Although a local linear estimator of G(y, x_θ) can be used in the PSIS, it does not share the properties of a cumulative distribution function and might cause some complications in the above estimation procedure.
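As a minimal sketch of the leave-one-out Nadaraya-Watson estimator described above, the following Python fragment evaluates the estimate of G(y, X_iθ) with the quartic kernel used later in this section. The function names (`quartic_kernel`, `index_values`, `nw_cdf`) and the list-based data layout are illustrative assumptions and not part of the original procedure. The common factor 1/(n − 1) and the factor 1/h in K_h are omitted because they cancel in the ratio N₁/N₀.

```python
def quartic_kernel(u):
    """Quartic kernel K(u) = (15/16)(1 - u^2)^2 on |u| <= 1, zero elsewhere."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def index_values(X, theta):
    """Single index x_theta = x_1 + (x_2, ..., x_d)^T theta for each row of X."""
    return [row[0] + sum(t * v for t, v in zip(theta, row[1:])) for row in X]


def nw_cdf(y, i, index, Y, h):
    """Leave-one-out Nadaraya-Watson estimate of G(y, X_i theta).

    Returns 0.0 when the denominator N_0 is zero, as stated in the text.
    """
    num = den = 0.0
    for j, (zj, yj) in enumerate(zip(index, Y)):
        if j == i:
            continue  # the i-th subject is left out
        w = quartic_kernel((zj - index[i]) / h)
        den += w
        num += w * (1.0 if yj <= y else 0.0)  # N_j(y) = I(Y_j <= y)
    return num / den if den > 0.0 else 0.0


# Toy data (hypothetical): two covariates, theta is the coefficient of x_2.
X = [[0.0, 1.0], [0.1, 0.9], [0.2, 1.1], [2.0, 0.0]]
Y = [0.5, 0.7, 0.2, 3.0]
idx = index_values(X, [0.5])
g_hat = nw_cdf(0.6, 0, idx, Y, h=0.5)
```

In this toy example only the second and third subjects fall inside the kernel window of the first one, so `g_hat` is the kernel-weighted share of those neighbours with Y_j ≤ 0.6.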

It follows from a direct algebraic calculation that

E[(N (y) − G(y, X_θ))²] = E[(N (y) − F_Y(y|X))²] + E[(F_Y(y|X) − G(y, X_θ))²]. (3)
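The step behind (3) can be spelled out as follows. This is a sketch of the standard argument, assuming only that E[N(y)|X] = P(Y ≤ y|X) = F_Y(y|X): expanding the square around F_Y(y|X) leaves a cross term, which vanishes after conditioning on X.

```latex
\begin{align*}
E\big[(N(y) - G(y, X_\theta))^2\big]
 &= E\big[(N(y) - F_Y(y|X))^2\big] + E\big[(F_Y(y|X) - G(y, X_\theta))^2\big] \\
 &\quad + 2E\big[(N(y) - F_Y(y|X))(F_Y(y|X) - G(y, X_\theta))\big], \\
E\big[(N(y) - F_Y(y|X))(F_Y(y|X) - G(y, X_\theta))\big]
 &= E\big[\,E[N(y) - F_Y(y|X) \mid X]\,(F_Y(y|X) - G(y, X_\theta))\big] = 0 .
\end{align*}
```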

Since the first term on the right-hand side of (3) does not depend on θ, the minimizers of $E[\int_{\mathcal{Y}}(N(y) - G(y, X_\theta))^2 dW(y)]$ and $E[\int_{\mathcal{Y}}(F_Y(y|X) - G(y, X_\theta))^2 dW(y)]$ can both be shown to be θ₀ under the validity of model (1), where W(y) is the limiting function of W_n(y). Thus, minimizing SS(θ) approximates, on average, minimizing $E[\int_{\mathcal{Y}}(F_Y(y|X) - G(y, X_\theta))^2 dW(y)]$ with respect to θ. In our theoretical development and numerical implementation, the quartic kernel K(u) = (15/16)(1 − u²)²I(|u| ≤ 1) is specified. The advantage of such a density function is that $\hat\theta$ can achieve √n-consistency. As a special case, the uniform distribution or the empirical distribution of Y can be specified for the W_ni(y)'s in (2). In the case where G(y, x_θ) is known, the optimal weight w_ni(y) = dW_ni(y)/dy is proportional to $\{G(y, X_{i\theta})(1 - G(y, X_{i\theta}))\}^{-1}$, the reciprocal of the conditional variance of N_i(y), at each fixed y. We further replace G(y, x_θ) by a consistent estimator $\hat G(y, x_{\hat\theta})$ and iteratively update the weight estimation. The resulting estimator coincides with the maximizer of the following log-pseudo likelihood function for a random sample {N_i(y) : 1 ≤ i ≤ n}:

$$l_p(\theta) = \frac{1}{n}\sum_{i=1}^{n}\int_{\mathcal{Y}} \Big( N_i(y)\ln \hat G(y, X_{i\theta}) + (1 - N_i(y))\ln\big(1 - \hat G(y, X_{i\theta})\big) \Big)\, dy. \tag{4}$$

Let $y_{(1)} < \cdots < y_{(m)}$ denote the distinct order statistics of {Y₁, · · · , Y_n} and $W_{(j)} = \int_{y_{(j)}}^{y_{(j+1)}} dW_{ni}(y)$. Since the $N_i(y)$'s are zero-one processes and the $\hat G(y, X_{i\theta})$'s are step functions with jumps occurring at $\{y_{(1)}, \dots, y_{(m)}\}$, the PSIS in (2) has a computationally more attractive alternative as follows:

$$SS(\theta) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m-1} e_i^2(y_{(j)}; \theta)\, W_{(j)}. \tag{5}$$
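The step-function form (5) translates almost line by line into code. The sketch below is a hedged illustration rather than the authors' implementation: the function name `psis` is hypothetical, the weight is assumed to be the uniform distribution on the observed range of Y, so that W_(j) = (y_(j+1) − y_(j))/(y_(m) − y_(1)), and a plain double loop is used for readability rather than speed.

```python
def quartic_kernel(u):
    """Quartic kernel K(u) = (15/16)(1 - u^2)^2 on |u| <= 1, zero elsewhere."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def psis(theta, X, Y, h):
    """Pseudo sum of integrated squares in its step form (5).

    Assumption: uniform weights on [y_(1), y_(m)], which sum to one.
    """
    n = len(Y)
    index = [row[0] + sum(t * v for t, v in zip(theta, row[1:])) for row in X]
    ys = sorted(set(Y))                      # distinct order statistics
    if len(ys) < 2:
        return 0.0                           # no interval to integrate over
    span = ys[-1] - ys[0]
    weights = [(b - a) / span for a, b in zip(ys, ys[1:])]   # W_(1..m-1)
    total = 0.0
    for i in range(n):
        # Leave-one-out kernel weights around the i-th index value.
        kern = [0.0 if j == i else quartic_kernel((index[j] - index[i]) / h)
                for j in range(n)]
        den = sum(kern)
        for y, w in zip(ys, weights):        # pairs y_(j) with W_(j), j < m
            num = sum(k for k, yj in zip(kern, Y) if yj <= y)
            g_hat = num / den if den > 0.0 else 0.0
            resid = (1.0 if Y[i] <= y else 0.0) - g_hat   # e_i(y_(j); theta)
            total += resid * resid * w
    return total / n
```

Because each squared residual is at most one and the assumed weights sum to one, the returned value always lies in [0, 1]. In practice `psis` would be passed to a numerical optimizer over θ (and, for CV₂, over h as well).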

In contrast, the estimation procedure of Hall and Yao (2005) is often computationally intensive. When the response variable Y is discrete and has a finite support, the above estimation criterion can also be applied. As for the binary response with values in {0, 1}, the PSIS will automatically reduce to the sum of squares in Ichimura (1993). In kernel estimation, a criterion for bandwidth selection is provided via generalizing the most commonly used “leave one subject out” cross-validation procedure of Rice and Silverman (1991). The optimal bandwidth h_cv is naturally defined to be the unique minimizer of

$$CV_1(h) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m-1} e_i^2(y_{(j)}; \hat\theta_i)\, W_{(j)} \tag{6}$$

with $\hat\theta_i = \arg\min_\theta \{(n-1)^{-1}\sum_{l \neq i}\sum_{j=1}^{m-1} e_l^2(y_{(j)}; \theta)\, W_{(j)}\}$. Another criterion developed by Härdle, Hall, and Ichimura (1993) is further adopted and extended to our framework. The estimators of h and θ₀ can be simultaneously obtained via minimizing CV₂(h, θ) = SS(θ).

2.2 Asymptotic Properties

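Suppose that X has a compact support $\mathcal{X}$ and θ₀ is an interior point of the compact parameter space $\Theta \subseteq \mathbb{R}^{d-1}$. Let $f(x_\theta)$ be the density function of $X_\theta$ on $\mathcal{X}_\theta = \{x_\theta : x \in \mathcal{X}, \theta \in \Theta\}$,

$$M_{l_1 l_2}(y, x_\theta) = E\big[G^{l_1}(y, X_{\theta_0})(x - X)^{\otimes l_2} \mid x_\theta\big], \quad l_1 = 0, 1, \ l_2 = 0, 1, 2,$$

$$H_1(y, x_{\theta_0}) = M_{01}(y, x_{\theta_0})\,\partial_{x_\theta} G(y, x_{\theta_0}),$$

$$V_{1\theta_0} = 2E\Big[\Big(\int_{\mathcal{Y}} \varepsilon(y; \theta_0) H_1(y, X_{\theta_0})\, dW(y)\Big)^{\otimes 2}\Big], \quad \text{and} \quad V_{2\theta_0} = 4E\Big[\int_{\mathcal{Y}} H_1^{\otimes 2}(y, X_{\theta_0})\, dW(y)\Big].$$

To establish the asymptotic properties of $\hat\theta$, the following conditions are imposed: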

(A1) $\inf_{\mathcal{X}_\theta} f(x_\theta) > 0$.

(A2) $d^3_{x_\theta} f(x_\theta)$ and $\partial^3_{x_\theta} M_{12}(y, x_\theta)$ are Lipschitz continuous in $x_\theta$ with the Lipschitz constants being independent of $(y, x_\theta)$.

(A3) $h = h_0 n^{-\varsigma_1}$ for $\varsigma_1 \in (1/8, 1/5)$ and some positive constant $h_0$.

(A4) $V_{1\theta_0}$ and $V_{2\theta_0}$ are nonsingular.

Since the classes of kernel functions indexed by (h, x_θ) are Euclidean (Pakes and Pollard (1989)), the imposed conditions entail that the considered classes of functions are Euclidean.

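By applying Theorem II.37 of Pollard (1984), one has

$$\sup_{\mathcal{Y} \times \mathcal{X}_\theta} \Big| \partial_\theta^{l_2} N_{l_1}(y, x_\theta) - \partial_{x_\theta}^{l_2}\big(f(x_\theta) M_{l_1 l_2}(y, x_\theta)\big) \Big| = o\Big(\sqrt{\ln n / (n h^{2 l_2 + 1})}\Big) + O(h^2) \quad \text{a.s.} \tag{7}$$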

For simplicity, the consistency and asymptotic normality of $\hat\theta$ are established in the following theorem for the case of deterministic weight functions $W_i(y)$.

Theorem 1. Suppose that assumptions (A1)-(A4) are satisfied. Then, $\hat\theta \xrightarrow{p} \theta_0$ and $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N(0, \Sigma_{\theta_0})$ as $n \to \infty$, where $\Sigma_{\theta_0} = V_{1\theta_0}^{-1} V_{2\theta_0} V_{1\theta_0}^{-1}$.

For a continuous response Y, the PMLE $\tilde\theta$ of Delecroix, Härdle, and Hristache (2003) is obtained by maximizing

$$l(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ln\left( \frac{\sum_{j \neq i} K_{4h_2}(Y_j - Y_i)\, K_{4h_1}(X_{j\theta} - X_{i\theta})}{\sum_{j \neq i} K_{4h_1}(X_{j\theta} - X_{i\theta})} \right), \tag{8}$$

where $K_{4h}(u) = K_4(u/h)/h$ and $K_4(u) = (105/64)(1 - 5u^2 + 7u^4 - 3u^6) I(|u| \le 1)$ (cf. Gørgens (2004)). The fourth-order kernel function is required to ensure the $1/\sqrt{n}$ convergence rate.

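The authors concluded that the proposed estimator can achieve the asymptotic efficiency. On closer examination, we found some mistakes in the theoretical derivations and that $\tilde\theta$ only reaches the semiparametric efficiency bound. Let $g(y, x_{\theta_0}) = f_Y(y|x)$,

$$W_{1\theta_0} = E\Big[\Big(\frac{\partial^2_{x_\theta} g(Y, X_{\theta_0})}{g(Y, X_{\theta_0})} + \frac{2\, d_{x_\theta} f(X_{\theta_0})\, \partial_{x_\theta} g(Y, X_{\theta_0})}{f(X_{\theta_0})}\Big) M_{02}(y, X_{\theta_0}) + \frac{\partial_{x_\theta} g(Y, X_{\theta_0})}{g(Y, X_{\theta_0})} \cdot \big(2\partial_{x_\theta} M_{02}(y, X_{\theta_0}) - M_{01}^{\otimes 2}(y, X_{\theta_0})\big)\Big],$$

and

$$W_{2\theta_0} = E\Big[\Big(\frac{\partial_{x_\theta} g(Y, X_{\theta_0})}{g(Y, X_{\theta_0})}\Big)^2 (X - E[X \mid X_{\theta_0}])^{\otimes 2}\Big].$$

The proof of Theorem 1 carries over in the same manner to show that $\tilde\theta_1 \xrightarrow{p} \theta_0$ and $\sqrt{n}(\tilde\theta_1 - \theta_0) \xrightarrow{d} N(0, W_{1\theta_0}^{-1} W_{2\theta_0} W_{1\theta_0}^{-1})$ under assumption (A1) and the assumptions:

(B1) $d^3_{x_\theta} f(x_\theta)$, $\partial^3_{x_\theta} M_{02}(y, x_\theta)$, and $\partial^3_{x_\theta} E[g(Y, X_{\theta_0})(x - X)^{\otimes 2} \mid x_\theta]$ are Lipschitz continuous in $x_\theta$ with the Lipschitz constants being independent of $(y, x_\theta)$.

(B2) $h_k = h_{0k} n^{-\varsigma_2}$, k = 1, 2, for some positive constants $h_{0k}$ and $\varsigma_2 \in (1/16, 1/6)$.

(B3) $W_{1\theta_0}$ and $W_{2\theta_0}$ are nonsingular.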

2.3 Bootstrap Inferences

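Based on the limiting distribution of $\sqrt{n}(\hat\theta - \theta_0)$, the construction of confidence intervals usually relies on an appropriate estimator of $\Sigma_{\theta_0}$. One of the most widely used estimators is the sandwich-type estimator $\hat\Sigma_{\hat\theta} = \hat V_{1\hat\theta}^{-1} \hat V_{2\hat\theta} \hat V_{1\hat\theta}^{-1}$, where

$$\hat V_{1\hat\theta} = \frac{2}{n}\sum_{i=1}^{n}\sum_{j=1}^{m-1} \big(\partial_\theta \hat G(y_{(j)}, X_{i\hat\theta})\big)^{\otimes 2} W_{(j)} \quad \text{and} \quad \hat V_{2\hat\theta} = \frac{4}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{m-1} e_i(y_{(j)}; \hat\theta)\, \partial_\theta \hat G(y_{(j)}, X_{i\hat\theta})\, W_{(j)}\Big)^{\otimes 2}.$$

In practical implementation, a sufficiently good performance of $\hat\Sigma_{\hat\theta}$ essentially requires an adequate bandwidth. Although the smoother in $\hat\Sigma_{\hat\theta}$ can be chosen to differ from that in $\hat G(y, x_{\hat\theta})$, there is still no standard criterion for doing so.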

An alternative approach that avoids this difficulty is to employ the frequency distribution of bootstrap replications. A natural resampling approach is to draw independent bootstrap random vectors $U_1^{nb}, \dots, U_n^{nb}$ from the empirical distribution $P_{n,U} = n^{-1}\sum_{i=1}^{n} I_{U_i}$ with $U_i = (X_i, Y_i)$, i = 1, · · · , n. The bootstrap analogue $\hat\theta^{nb}$ is obtained directly by minimizing SS(θ) in (2) based on a bootstrap sample $\{U_1^{nb}, \dots, U_n^{nb}\}$. Without drawing observations from the collected data, we further adopt general weighted bootstrap approximations for the sampling distribution of $\hat\theta$. Let ξ₁, · · · , ξ_n be independently generated from a common distribution with mean µ and variance σ². The random weighted bootstrap estimator $\hat\theta^{rw}$ is then defined to be the minimizer of

$$SS^{rw}(\theta) = \sum_{i=1}^{n} D_i \sum_{j=1}^{m-1} \big(e_i^{rw}(y_{(j)}; \theta)\big)^2 W_{(j)}, \tag{9}$$

where $D_i = \xi_i / \sum_{j=1}^{n} \xi_j$, $e_i^{rw}(y; \theta) = N_i(y) - \hat G^{rw}(y, X_{i\theta})$, $\hat G^{rw}(y, X_{i\theta}) = N_1^{rw}(y, X_{i\theta})/N_0^{rw}(y, X_{i\theta})$, and $N_l^{rw}(y, X_{i\theta}) = \sum_{j \neq i} D_j N_j^{l}(y) K_h(X_{j\theta} - X_{i\theta})$. This bootstrapping approach can be treated as the naive bootstrap with the measure $P_{n,U}^{rw} = \sum_{i=1}^{n} D_i I_{U_i}$. It is interesting to note that the dependent weights $D_i$ can also be replaced with the independent weights $\xi_i/(n\mu)$.
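The random weighting step can be sketched in a few lines of Python. The fragment below is an illustration under stated assumptions: the names `random_weights` and `weighted_nw_cdf` are hypothetical, the ξ_i are drawn from a Gamma distribution with shape 4 and scale 2 (the text later uses Gamma(4, 2) weights but does not say whether the second parameter is a scale or a rate; either way ρ = µ/σ = 2), and the factor 1/h in K_h is dropped because it cancels in the ratio.

```python
import random


def quartic_kernel(u):
    """Quartic kernel K(u) = (15/16)(1 - u^2)^2 on |u| <= 1, zero elsewhere."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def random_weights(n, shape=4.0, scale=2.0, rng=random):
    """Normalized weights D_i = xi_i / sum_j xi_j with xi_i ~ Gamma(shape, scale)."""
    xi = [rng.gammavariate(shape, scale) for _ in range(n)]
    total = sum(xi)
    return [x / total for x in xi]


def weighted_nw_cdf(y, i, index, Y, D, h):
    """Randomly weighted leave-one-out estimate N_1^rw / N_0^rw of G(y, X_i theta)."""
    num = den = 0.0
    for j, (zj, yj) in enumerate(zip(index, Y)):
        if j == i:
            continue
        w = D[j] * quartic_kernel((zj - index[i]) / h)
        den += w
        num += w * (1.0 if yj <= y else 0.0)
    return num / den if den > 0.0 else 0.0


# One bootstrap draw of the weights for a sample of size 100 (seed is arbitrary).
D = random_weights(100, rng=random.Random(2011))
```

Each bootstrap replicate would draw a fresh vector `D`, plug the weighted estimate into (9), and minimize over θ. With equal weights D_i = 1/n the weighted estimate reduces to the unweighted Nadaraya-Watson estimate.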

The random weighted bootstrap confidence intervals for θ_l, l = 2, · · · , (d − 1), are constructed by

$$\hat\theta_l \pm \rho\, z_{1-\alpha/2}\, se^{rw}(\hat\theta_l^{rw} - \hat\theta_l) \quad \text{or} \quad \big(\hat\theta_l - \rho\, q^{rw}_{1-\alpha/2}(\hat\theta_l^{rw} - \hat\theta_l),\ \hat\theta_l - \rho\, q^{rw}_{\alpha/2}(\hat\theta_l^{rw} - \hat\theta_l)\big), \tag{10}$$

where ρ = µ/σ is a scale factor modification for the variability in the weights, $z_p$ is the pth quantile of the standard normal distribution, and $se^{rw}(\cdot)$ and $q_p^{rw}(\cdot)$ denote the standard error and the 100pth percentile of the B bootstrap replicates $\hat\theta^{rw}$, respectively. Let P*(·) represent the probability measure conditioning on {U₁, · · · , U_n}. The validity of (10) is given in the next theorem.

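Theorem 2. Suppose that the assumptions in Theorem 1 are satisfied. Then,

$$P^*\big(\sqrt{n}\rho(\hat\theta^{rw} - \hat\theta) \le w\big) - P\big(\sqrt{n}(\hat\theta - \theta_0) \le w\big) \xrightarrow{p} 0 \quad \text{for all } w = (w_2, \dots, w_d)^T \text{ as } n \to \infty. \tag{11}$$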
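Given B bootstrap replicates of one coefficient, the two intervals in (10) can be computed as in the following sketch. The function names are hypothetical, the sample standard deviation (divisor B − 1) is assumed for the bootstrap standard error, and a linearly interpolated empirical quantile is assumed for the percentiles, since the text does not fix either convention.

```python
import statistics
from statistics import NormalDist


def empirical_quantile(sorted_vals, p):
    """Linearly interpolated empirical quantile (an assumed convention)."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def bootstrap_intervals(theta_hat, boot, mu, sigma, alpha=0.05):
    """Normal-approximated and quantile intervals of (10) for one coefficient.

    theta_hat: point estimate; boot: bootstrap replicates of the same
    coefficient; mu, sigma: mean and standard deviation of the weights xi_i.
    """
    rho = mu / sigma                                  # scale factor in (10)
    diffs = sorted(b - theta_hat for b in boot)       # theta^rw - theta_hat
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = statistics.stdev(diffs)
    normal_ci = (theta_hat - rho * z * se, theta_hat + rho * z * se)
    q_lo = empirical_quantile(diffs, alpha / 2.0)
    q_hi = empirical_quantile(diffs, 1.0 - alpha / 2.0)
    quantile_ci = (theta_hat - rho * q_hi, theta_hat - rho * q_lo)
    return normal_ci, quantile_ci
```

For Gamma weights with shape 4, a call such as `bootstrap_intervals(theta_hat, boot, mu=8.0, sigma=4.0)` applies the scale factor ρ = 2. Note that the quantile interval subtracts the upper percentile to obtain the lower limit, exactly as in (10).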

3 Model Test and Sparse Models

In this section, a test rule is established for the adequacy of model (1). The PPLISE is proposed to tackle the problem of sparse variables. A multi-stage adaptive Lasso procedure is further developed to improve the accuracy of variable selection.

3.1 Model Checking

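Let ε(y; θ) = N(y) − G(y, X_θ) and θ₁ be the minimizer of $\int_{\mathcal{Y}} E[\varepsilon^2(y; \theta)]\, dW(y)$. It is straightforward to show that θ₁ = θ₀ and E[ε(y; θ₁)] = 0 under model (1). If the considered model is inadequate, ε(y; θ₁) can be projected onto some linear combination of the covariates, i.e. ε(y; θ₁) = ν(y, x_θ₂) + ζ(y) with E[ζ(y)] = 0 for some y and

$$\theta_2 = \arg\min_\theta \int_{\mathcal{Y}} \min\big\{E[(\varepsilon(y; \theta_1) - \nu(y, x_\theta))^2],\ E[\varepsilon^2(y; \theta_1)]\big\}\, dW(y).$$

The parameter θ₂ is naturally estimated by the minimizer $\breve\theta$ of $RSS_n(\theta)$, where

$$RSS_n(\theta) = \sum_{j=1}^{m-1}\Big(\frac{1}{n}\sum_{i=1}^{n} \min\big\{(e_i(y_{(j)}; \hat\theta) - \hat\nu(y_{(j)}, X_{i\theta}))^2,\ (e_i(y_{(j)}; \hat\theta) - \bar e(y_{(j)}; \hat\theta))^2\big\}\Big) W_{(j)}$$

with

$$\hat\nu(y, X_{i\theta}) = \frac{\sum_{j \neq i} e_j(y; \hat\theta)\, K_h(X_{j\theta} - X_{i\theta})}{\sum_{j \neq i} K_h(X_{j\theta} - X_{i\theta})} \quad \text{and} \quad \bar e(y; \hat\theta) = \frac{1}{n}\sum_{i=1}^{n} e_i(y; \hat\theta).$$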

By further computing $TSS_n = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{m-1} (e_i(y_{(j)}; \hat\theta) - \bar e(y_{(j)}; \hat\theta))^2 W_{(j)}$, the test statistic $F_n = RSS_n(\breve\theta)/TSS_n$ is used to check model (1):

$$H_0: F_Y(y|x) = G(y, x_{\theta_0}) \text{ for all } (x, y) \quad \text{versus} \quad H_A: F_Y(y|x) \neq G(y, x_{\theta_0}) \text{ for some } (x, y).$$

The null hypothesis is rejected at significance level α whenever $F_n \le q_\alpha(F_n^b)$, where $F_n^b$ is the bootstrap analogue of $F_n$ with the $e_i^*(y; \hat\theta)$'s substituting for the $e_i(y; \hat\theta)$'s and each $e_i^*(y; \hat\theta)$ being independently drawn from the two-point distribution

$$\frac{5 + \sqrt{5}}{10}\,\delta_{(1 - \sqrt{5})\, e_i(y; \hat\theta)/2} + \frac{5 - \sqrt{5}}{10}\,\delta_{(1 + \sqrt{5})\, e_i(y; \hat\theta)/2}$$

(cf. Härdle (1989)). As expected, the developed test rule is generally more powerful than those based on X, especially when its dimension is high. Similar to the single-indexing cross-validation value of Xia (2009), we consider the measure

$$SCV_n = \sum_{j=1}^{m-1}\Big(\frac{1}{n}\sum_{i=1}^{n} \min\big\{(e_i(y_{(j)}; \hat\theta) - \hat\nu(y_{(j)}, X_{i\breve\theta_i}))^2,\ (e_i(y_{(j)}; \hat\theta) - \bar e(y_{(j)}; \hat\theta))^2\big\}\Big) W_{(j)}, \tag{12}$$

where $\breve\theta_i = \arg\min_\theta \sum_{j=1}^{m-1}\big(\sum_{i=1}^{n} \min\{(e_i(y_{(j)}; \hat\theta) - \hat\nu_i(y_{(j)}, X_{i\theta}))^2, (e_i(y_{(j)}; \hat\theta) - \bar e(y_{(j)}; \hat\theta))^2\}\big) W_{(j)}$ if it exists and $\hat\nu_i(y, X_{i\theta})$ is computed as $\hat\nu(y, X_{i\theta})$ with the ith subject being deleted, i = 1, · · · , n. Following the argument of Xia (2009), one can also conclude that, as the sample size approaches infinity, $SCV_n = TSS_n$ if model (1) is adequate and $SCV_n < TSS_n$ otherwise.
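The two-point distribution used for the bootstrap residuals in the test above can be sampled directly, and its first three moments checked in closed form. The sketch below uses hypothetical names (`wild_residual`, `multiplier_moments`) and draws one multiplier per call; whether a single multiplier is shared across all y for a given subject is not stated in the text and is left to the caller.

```python
import math
import random

SQRT5 = math.sqrt(5.0)
LOW = (1.0 - SQRT5) / 2.0        # multiplier taken with probability P_LOW
HIGH = (1.0 + SQRT5) / 2.0       # multiplier taken with probability 1 - P_LOW
P_LOW = (5.0 + SQRT5) / 10.0


def wild_residual(e, rng=random):
    """Draw e* from the two-point distribution centred on the residual e."""
    multiplier = LOW if rng.random() < P_LOW else HIGH
    return multiplier * e


def multiplier_moments():
    """First three moments of the multiplier, computed from its two atoms."""
    p_high = 1.0 - P_LOW
    return tuple(P_LOW * LOW ** k + p_high * HIGH ** k for k in (1, 2, 3))
```

The multiplier has mean zero, second moment one, and third moment one, so the bootstrap residuals reproduce the first three conditional moments of the original residuals.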

3.2 Adaptive Lasso Estimation and Oracle Properties

The PPLISE $\hat\theta_{(p)}$ of θ₀ is obtained via minimizing the penalized pseudo sum of integrated squares (PPSIS):

$$PSS(\theta) = SS(\theta) + \lambda \sum_{l=2}^{d} \frac{|\theta_l|}{|\hat\theta_l|}, \tag{13}$$

where λ is a nonnegative regularization or tuning parameter. In this variable selection and estimation procedure, significant covariates receive smaller penalties and tend to have nonzero coefficient estimates, while nonsignificant coefficients will be shrunk to zero. The above optimization problem entails that the true underlying model can be consistently identified and $\hat\theta_{(p)A_0}$ has the same asymptotic distribution as $\hat\theta_{A_0}$, where $A_0 = \{l \mid \theta_{0l} \neq 0\}$.

Theorem 3. Suppose that assumptions (A1)-(A4) are satisfied and $\lambda = \lambda_0 n^{-\varsigma_3}$ for $\varsigma_3 \in (1/2, 1)$ and some positive constant $\lambda_0$. Then, $P(\hat A = A_0) \to 1$ and $\sqrt{n}(\hat\theta_{(p)A_0} - \theta_{A_0}) \xrightarrow{d} N(0, \Sigma_{\theta_{A_0}})$ as $n \to \infty$, where $\hat A = \{l \mid \hat\theta_l \neq 0\}$ and $\Sigma_{\theta_{A_0}}$ is the asymptotic variance of $\hat\theta_{A_0}$.

Our simulation experiments reveal that the one-stage adaptive Lasso estimation in (13) usually does not perform variable selection well in small sample applications. To overcome this shortcoming, we develop a multi-stage adaptive Lasso estimation scheme. Let $\tilde\theta_{(m)}$ represent the vector of nonzero estimates at the mth stage, $\theta_{(m)} = (\theta_{(m)l-}^T, \theta_{(m)l}, \theta_{(m)l+}^T)^T$ be the corresponding parameter vector with a length of $d_m$, and $\theta_{(m)l-}$ and $\theta_{(m)l+}$ denote the vectors of coefficients with sub-indices smaller and greater than l, respectively. Moreover, $SS_{(m)}(\theta_{(m)})$ is defined as SS(θ) with θ being replaced with $\theta_{(m)}$. The estimation procedure is implemented through the following steps:

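S1. $(\tilde\theta^{(1)}_{(1)l}, \tilde h^{(1)}_{(1)l}) = \arg\min_{\theta_l, h} SS_{(1)}(\tilde\theta^{(1)}_{(1)l-}, \theta_{(1)l}, \tilde\theta^{(0)}_{(1)l+}) + \lambda|\theta_l|/|\tilde\theta^{(0)}_{(1)l}|$ with $\tilde\theta^{(0)}_{(1)l} = \hat\theta_l$, and $\tilde\theta^{(1)}_{(1)l} = 0$ whenever $|\tilde\theta^{(1)}_{(1)l}| < \varepsilon_0$, l = 2, · · · , d₁, for some sufficiently small positive value ε₀.

S2. Set $(\tilde\theta^{(k)}_{(1)l}, \tilde h^{(k)}_{(1)l}) = (0, \tilde h^{(k-1)}_{(1)l})$ if $\tilde\theta^{(k-1)}_{(1)l} = 0$; otherwise, $(\tilde\theta^{(k)}_{(1)l}, \tilde h^{(k)}_{(1)l}) = \arg\min_{\theta_l, h} SS_{(1)}(\tilde\theta^{(k)}_{(1)l-}, \theta_{(1)l}, \tilde\theta^{(k-1)}_{(1)l+}) + \lambda|\theta_l|/|\tilde\theta^{(k-1)}_{(1)l}|$, and $\tilde\theta^{(k)}_{(1)l} = 0$ whenever $|\tilde\theta^{(k)}_{(1)l}| < \varepsilon_0$, k = 2, · · · .

S3. Iterations are stopped if $\|\tilde\theta^{(k)}_{(1)} - \tilde\theta^{(k-1)}_{(1)}\| < \varepsilon_1$ for some pre-chosen small value ε₁ > 0, and $\tilde\theta_{(1)\lambda}$ is set to be the non-zero components of $\tilde\theta^{(k)}_{(1)}$.

S4. $\tilde\theta_{(1)} = \tilde\theta_{(1)\lambda_1}$ with $\lambda_1 = \arg\min_\lambda GCV(\lambda)$ and

$$GCV(\lambda) = \frac{SS_{(1)}(\tilde\theta_{(1)\lambda})}{\Big(1 - n^{-1}\,\mathrm{tr}\big((\hat V_{1\tilde\theta_{(1)\lambda}} + \mathrm{diag}(\lambda/(n\tilde\theta^2_{(1)\lambda})))^{-1}\hat V_{1\tilde\theta_{(1)\lambda}}\big)\Big)^2}.$$

S5. Repeat steps S1-S4 M times until $\|\tilde\theta_{(M)} - \tilde\theta_{(M-1)}\| < \varepsilon_2$ for some small value ε₂ > 0.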
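Steps S1-S3 amount to a coordinate-wise, iteratively reweighted penalized minimization. The sketch below illustrates one stage for a fixed λ under several simplifying assumptions that are not in the original: the bandwidth h is held fixed inside a user-supplied objective `ss`, each one-dimensional problem is solved by a golden-section search on the bracket [−bound, bound] (which presumes the penalized objective is unimodal in each coordinate), and the penalty weight at iteration k uses the previous iterate. The GCV choice of λ in S4 and the outer repetition in S5 are omitted, and all function names are hypothetical.

```python
import math


def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2.0


def adaptive_lasso_stage(ss, theta_init, lam, eps0=1e-3, eps1=1e-6,
                         bound=5.0, max_iter=100):
    """One stage (S1-S3) of the coordinate-wise adaptive Lasso for fixed lam.

    ss: callable returning the unpenalized criterion for a coefficient list;
    theta_init: initial estimates, e.g. the unpenalized PLISE.
    """
    prev = list(theta_init)
    for _ in range(max_iter):
        cur = list(prev)
        for l in range(len(cur)):
            if prev[l] == 0.0:
                continue                      # S2: a zeroed coefficient stays zero
            weight = lam / abs(prev[l])       # adaptive penalty weight

            def objective(t, l=l, weight=weight):
                # Coordinates before l are already updated, those after l are not.
                trial = cur[:l] + [t] + cur[l + 1:]
                return ss(trial) + weight * abs(t)

            t_hat = golden_section_min(objective, -bound, bound)
            cur[l] = 0.0 if abs(t_hat) < eps0 else t_hat   # threshold at eps0
        change = math.sqrt(sum((c - p) ** 2 for c, p in zip(cur, prev)))
        prev = cur
        if change < eps1:                     # S3: stopping rule
            break
    return prev


# Toy criterion (a stand-in for SS): separable quadratic around (0.8, -0.5, 0.02).
target = (0.8, -0.5, 0.02)
toy_ss = lambda th: sum((t - c) ** 2 for t, c in zip(th, target))
estimate = adaptive_lasso_stage(toy_ss, list(target), lam=0.05)
```

In the toy example the small third coefficient receives a large penalty weight and is set to zero in the first pass, while the two larger coefficients are only mildly shrunk.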

4 Monte Carlo Experiments

The performances of proposed estimation and inference procedures were assessed through a class of simulations with a variety of sample sizes, correlation structures of covariates, and error processes. The simulations were based on 1000 replications and the bootstrap inferences were drawn from 500 bootstrap samples, which enable us to obtain stable numerical results.

4.1 Assessment of Estimators and Inference Procedures

In this simulation scenario, the covariates X = (X₁, X₂, X₃)^T were generated from a trivariate normal distribution with mean zero, standard deviation of 1, and pairwise correlations of 0.2 or 0.5. The response variable Y was generated from the following two models:

M1. $Y = X_{\theta_0} + \varepsilon$ with θ₀ = (1, 0.8, −0.5)^T and ε ∼ N(0, 0.25).

M2. $Y = X_{\theta_0} + \varepsilon$ with θ₀ = (1, 0.8, −0.5)^T and $\varepsilon \sim N(0, 0.25X^2_{\theta_0})$.
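For readers who wish to reproduce a comparable design, the two models can be simulated as in the sketch below. It is an illustration under stated assumptions: the function name `simulate` is hypothetical, the second argument of N(·, ·) is read as a variance (so the error standard deviation is 0.5 under M1 and 0.5|X_θ₀| under M2), and the equicorrelated covariates are generated through a shared normal factor, which requires 0 ≤ ρ < 1.

```python
import math
import random


def simulate(n, rho, model="M1", theta0=(1.0, 0.8, -0.5), seed=None):
    """Generate (X, Y) from model M1 or M2 with equicorrelated N(0, 1) covariates."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)          # common factor inducing correlation rho
        x = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in theta0]
        index = sum(t * v for t, v in zip(theta0, x))     # X_theta0
        sd = 0.5 if model == "M1" else 0.5 * abs(index)   # sqrt of the error variance
        X.append(x)
        Y.append(index + rng.gauss(0.0, sd))
    return X, Y


# One replication with n = 250 and pairwise correlation 0.5 (seed is arbitrary).
X, Y = simulate(250, 0.5, model="M2", seed=2011)
```

Each covariate has unit variance because ρ + (1 − ρ) = 1, and any two covariates share only the common factor, which gives the pairwise correlation ρ.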

The uniform distribution and the empirical distribution of Y were used as the weights in (5) with the resulting estimators being denoted by $\hat\theta$ and $\hat\theta^{emp}$, respectively. We compared the finite sample properties of our estimators $\hat\theta$ and $\hat\theta^{emp}$ with the estimator $\breve\theta$ of Hall and Yao (2005), the pseudo maximum likelihood estimator (PMLE) $\tilde\theta$, and the pseudo least squares estimator (PLSE) $\bar\theta$. The inference procedures based on the asymptotic normality of $\hat\theta$ and the frequency distributions of its bootstrap analogues were also studied by simulations. Owing to space limitations, the exchangeable random weights were only investigated through independent and identically distributed Gamma(4, 2) random variables, which gave better numerical results than the others.

Table 4.1 displays the means and the standard deviations of 1000 estimates with the sample sizes (n) of 100, 250, and 500, and the correlation coefficients (ρ) of 0.2 and 0.5. The biases of the compared estimators are generally negligible except for $\tilde\theta$ and $\bar\theta$ under (n, ρ) = (100, 0.5) and $\breve\theta$ under (n, ρ) = (100, 0.5) and model M1. As one can expect, the standard deviations of these estimators decrease as n increases or as ρ becomes small. We further detect in this table that $\tilde\theta$ tends to have a substantially large variance when n is small. The high variability in $\tilde\theta$ is mainly caused by the use of the fourth-order kernel function. In practical applications, the second-order kernel function is often used to overcome this shortcoming. In addition, the simulation results indicate that $\hat\theta^{emp}$ has the smallest variance under the heterogeneous error model and $\hat\theta$ is comparable with $\bar\theta$ in the case of a constant error process. Even if $\bar\theta$ performs satisfactorily in the homogeneous error model, it has a relatively large variance among all estimators in the heterogeneous one. The CPU time for the computation of $\hat\theta^{emp}$ is much shorter than that for $\breve\theta$ although both estimators have a similar performance.

The sandwich-type estimate (SANE), the naive bootstrap estimate (NBE), and the random weighted bootstrap estimate (RWE) of the standard deviation of $\hat\theta$ are provided in Table 4.2. Overall, the SANE markedly underestimates the asymptotic variance even for a sufficiently large n. We further found that the bootstrap estimates slightly overestimate the asymptotic variance but their accuracies significantly improve as the sample size increases. The RWE is closer to the asymptotic variance than the NBE. Table 4.2 also presents the empirical coverage probabilities, the average lengths of 0.95 quantile intervals of 1000 estimates, and the average lengths of 1000 bootstrap normal approximated and quantile confidence intervals (LBNCI, LBQCI). It is revealed that all the bootstrap confidence intervals are wider than the true quantile intervals and approach the expected ones with an adequate sample size. The empirical coverage probabilities of the normal approximated confidence intervals (BNCI) tend to be higher than the nominal level, whereas those of the bootstrap quantile intervals (BQI) are slightly smaller than the nominal one. In general, these constructed confidence intervals have fairly accurate coverage probabilities and provide greater precision as the sample size increases.

4.2 Assessment of Model Checking and Adaptive Lasso

The performances of the testing procedures were studied through models M2 and

M3. $Y = X_1 + 0.8X_2^2 - 0.5(1 + |X_3|)^{-1} + \varepsilon$ with $\varepsilon \sim N(0, 0.25X^2_{\theta_0})$.

Table 4.3 summarizes the estimated sizes and powers for the hypothesis of model correctness and the rejection proportions of the single-indexing cross-validation method. Under the validity of model M2, the simulation results indicate that the estimated sizes are all smaller than the nominal size of 0.05. The power performance under model M3 is fairly good and high power is generally associated with a large sample size. From the simulation experiments, the measure $SCV_n$ tends to have high rejection rates for model M3 and, except for (n, ρ) = (250, 0.2), somewhat higher ones than the nominal size for model M2.

Table 4.3

The estimated sizes (α) and powers (β), and the rejection proportions (RP) based on 1000 SCV_n values

                  M2                M3
(n, ρ)        α        RP       β        RP
(100, 0.2)    0.001    0.063    0.903    0.952
(250, 0.2)    0.000    0.050    0.984    0.994
(100, 0.5)    0.001    0.060    0.936    0.991
(250, 0.5)    0.000    0.066    1.000    1.000

We further assessed the multi-stage adaptive Lasso algorithm through models M1-M2 with X = (X₁, X₂, X₃, X₄, X₅, X₆, X₇)^T and θ₀ = (1, 0.8, −0.5, 0, 0, 0, 0)^T. Since the average numbers of incorrectly zeroed coefficients are zero for all approaches, we only report the numbers of correct zero coefficients. The mean squared error $E[(\theta_{est} - \theta_0)^T \Sigma_X (\theta_{est} - \theta_0)]$ of any estimator $\theta_{est}$ was utilized to evaluate the predictive accuracy, where $\Sigma_X$ is the variance-covariance matrix of X. Table 4.4 gives the average numbers of correct zeros, the mean squared errors of estimators, and the proportions of variable selection. For the sample size of 100, the second-stage adaptive Lasso is found to recover the zero coefficients more accurately than the first-stage one, while no significant improvement is detected in the third-stage one. To sum up, the performance of the PLISE is the worst in selecting important covariates. Again, the covariates with zero coefficients are rarely selected in our multi-stage adaptive Lasso estimation procedure. Interestingly, the influences of the correlation coefficient and the error structure are negligible in the second-stage and third-stage ones. The performances of all the adaptive Lasso estimation procedures become indistinguishable from each other as the sample size increases. Moreover, an estimator obtained from the multi-stage adaptive Lasso yields a smaller mean squared error than the others.
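The mean squared error criterion used for the comparison above is a quadratic form in the estimation error, averaged over replications. A minimal sketch, with hypothetical function names and with the covariance matrix passed as a nested list, is:

```python
def quadratic_loss(theta_est, theta0, sigma_x):
    """(theta_est - theta0)^T Sigma_X (theta_est - theta0) for one estimate."""
    d = [a - b for a, b in zip(theta_est, theta0)]
    k = len(d)
    return sum(d[r] * sigma_x[r][c] * d[c] for r in range(k) for c in range(k))


def mean_squared_error(estimates, theta0, sigma_x):
    """Monte Carlo average of the quadratic loss over replicated estimates."""
    return sum(quadratic_loss(e, theta0, sigma_x) for e in estimates) / len(estimates)


# Two covariates with unit variances and correlation 0.5 (illustrative values).
sigma = [[1.0, 0.5], [0.5, 1.0]]
loss = quadratic_loss([0.9, -0.6], [0.8, -0.5], sigma)
```

Weighting the error by Σ_X measures how far the estimated index is from the true index on the scale of the covariates, so errors along highly variable directions count for more.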

5 Empirical Examples

The PLISE and the corresponding inference procedures are applied to the Boston house-price data. The multi-stage adaptive Lasso procedure is further adopted to identify the meteorological variables that significantly affect ozone concentration in the New York metropolitan area. Moreover, all the considered variables are standardized to have a mean of zero and a variance of one.

5.1 Application to a Study of House Price

The first analyzed data were collected by the U.S. Census Service in the area of Boston.

A total of 506 observations on 14 attributes are contained in this data set. Measurements of interest include median value of owner-occupied homes in $1,000’s (medv), logarithm of percentage of lower status of the population (lstat), average number of rooms per dwelling (rm), logarithm of full-value property-tax rate per $10,000 (tax), pupil teacher ratio by town (ptrat), and weighted distances to five Boston employment centres (dis).

The PLISE (−0.715, 0.470, 0.350, 0.201) is obtained with the bootstrap standard errors (0.2116, 0.1194, 0.0645, 0.0420). The cross-validation bandwidth h_cv = 1.0765 is chosen for the coefficient estimates of the SI: lstat + θ₂rm + θ₃tax + θ₄ptrat + θ₅dis. The PMLE (−0.843, 0.615, 0.433, 0.144) and the PLSE (−0.302, 0.217, 0.270, 0.302) are also found to lead to very similar interpretations of the predictors. The 0.95 bootstrap normal approximated and quantile confidence intervals are (−1.1302, −0.3007) and (−1.3029, −0.3170) for rm, (0.2360, 0.7040) and (0.2363, 0.6862) for tax, (0.2240, 0.4766) and (0.2191, 0.4723) for ptrat, and (0.1189, 0.2834) and (0.1290, 0.2906) for dis. These variables are detected to be significantly associated with medv. This agrees with the simulation finding that the two types of confidence intervals are fairly close when the sample size is large. In addition, the values of SCV_n and the test statistic F_n are computed to be 1.987 and 0.878, respectively. As evidenced by SCV_n < TSS_n (= 2.108) and the bootstrap p-value of 0.006, the fitted SI model might be too simple. A more thorough investigation is needed to explore the relationship between the covariates and the conditional distribution of medv.

5.2 Application to a Study of Air Quality

The second data set contains measurements of daily ozone concentration (ozone), wind speed (wind), daily maximum temperature (temp), and solar radiation level (solar) on 111 successive days from May to September 1973 at meteorological stations in the New York metropolitan area.

The variables wind, temp, solar, wind², temp², solar², wind ∗ temp, wind ∗ solar, and temp ∗ solar were included in the initial model fitting with the first one being the baseline covariate.
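The nine candidate predictors can be assembled as in the following sketch, which builds the main effects, squares, and pairwise interactions with wind in the first column as the baseline covariate. The function name `expand_terms`, the synthetic inputs, and the choice to standardize after forming the squares and products are assumptions; the source does not specify the order of these two steps.

```python
import numpy as np

def expand_terms(wind, temp, solar):
    """Build main effects, squares, and pairwise interactions (wind first)."""
    cols = {
        "wind": wind, "temp": temp, "solar": solar,
        "wind^2": wind ** 2, "temp^2": temp ** 2, "solar^2": solar ** 2,
        "wind*temp": wind * temp, "wind*solar": wind * solar,
        "temp*solar": temp * solar,
    }
    names = list(cols)
    X = np.column_stack([cols[k] for k in names])
    # Assumed convention: standardize every constructed column (ddof=1).
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    return names, X

# Synthetic stand-in for 111 daily measurements (not the New York data).
rng = np.random.default_rng(2)
wind = rng.uniform(2.0, 20.0, size=111)
temp = rng.uniform(55.0, 100.0, size=111)
solar = rng.uniform(5.0, 335.0, size=111)
names, X = expand_terms(wind, temp, solar)
print(names, X.shape)
```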

Our primary interest is to detect the meteorological factors that are significantly associated with ozone.

The PLISE (−1.417, −0.211, −0.482, 0.087, 0.560, −0.984, 0.077, −0.471) is obtained with the cross-validation bandwidth h_cv = 0.7517. The PLISE and the PMLE (−1.351, −0.162, −0.582, −0.033, 0.441, −0.871, −0.032, −0.381) are close to each other, whereas the PLSE (−1.840, 0.073, −0.283, 0.324, 0.393, −0.947, 0.290, −0.632) is slightly different. We further implement the multi-stage adaptive Lasso estimation and identify that the predictors solar, temp², and wind ∗ solar have zero coefficients. Note that the coefficients with zero estimates from the first stage are the same as those from the second stage. The estimates of the coefficients, the bootstrap standard errors, and the bootstrap confidence intervals are presented in Table 5.1. We apply the conditional distribution model with the SI: wind + θ₂temp + θ₃wind² + θ₄solar² + θ₅wind ∗ temp + θ₆temp ∗ solar and test the adequacy of the fitted model. The value of the test statistic F_n is approximately 0.993 (bootstrap p-value = 0.382), and SCV_n = 8.387 is greater than TSS_n = 8.185. Thus, no significant evidence is identified to reject the considered model.

Table 5.1

The PPLISE, the bootstrap standard errors, and the 0.95 bootstrap confidence intervals

Variable        PPLISE    RWE       BNCI                  BQCI
temp            -1.401    0.1778    (-1.7496, -1.0525)    (-1.7453, -1.0689)
wind²           -0.497    0.0707    (-0.6360, -0.3588)    (-0.6218, -0.3859)
solar²           0.528    0.1227    ( 0.2871,  0.7680)    ( 0.3252,  0.8155)
wind ∗ temp     -1.020    0.0910    (-1.1988, -0.8420)    (-1.1234, -0.8614)
temp ∗ solar    -0.356    0.0803    (-0.5136, -0.1987)    (-0.5060, -0.2067)

6 Concluding Remarks and Further Extensions

This article presents an appealing estimation procedure for the index coefficients, and the resulting PLISE outperforms the existing estimators in terms of the mean squared error in our numerical studies. Compared with the PMLE, an important advantage of the PLISE is that it only requires a lower-order kernel in a one-dimensional bandwidth space. The modified cross-validation scores and residual process are also provided for bandwidth selection and model checking. Because of the poor approximation of the sandwich-type estimator, we employ random weighted bootstrap analogues to estimate the asymptotic variance of the PLISE. The L¹-penalty with random weights is further incorporated into the PLISE criterion to improve estimation and variable selection simultaneously in sparse high-dimensional models. Under the partial orthogonality condition of Huang, Ma, and Zhang (2008), our PPLISE still enjoys the oracle property when the number of covariates increases exponentially with the sample size.

In some applications, the predictive abilities of covariates might depend on the values of a response variable. It is then more realistic to consider the following varying-index model:

$$F_Y(y \mid x) = G\big(y, x_{\theta_0(y)}\big), \tag{14}$$

where $\theta_0(y)$ is a vector of index coefficient functions of $y$. This modelling approach is especially useful for handling an ordinal response variable and for quantile forecasting. The PSIS in (5) and the PPSIS in (13) can be modified as

$$SS(\theta(y)) = \frac{1}{n}\sum_{i=1}^{n}\big(N_i(y) - \widehat{G}(y, X_{i\theta(y)})\big)^2 \quad\text{and}\quad PSS(\theta(y)) = SS(\theta(y)) + \lambda_y\sum_{k=2}^{d}\frac{|\theta_k(y)|}{|\widehat{\theta}_k(y)|}. \tag{15}$$

In survival analysis, the response measurement represents the time to a particular event. It is worth noting that the considered model includes the more widely accepted proportional hazards and accelerated failure time models. A major challenge in dealing with this issue is that the failure times of some individuals might not be available due to censoring. Our results should be valuable in the development of related inferences.
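The penalized criterion in (15) can be sketched numerically at a fixed value of y. The sketch below rests on several assumptions that are not taken from the source: N_i(y) is treated as the indicator 1(Y_i ≤ y), the estimator of G is replaced by a leave-one-out Nadaraya-Watson estimator with a Gaussian kernel, the first index coefficient is fixed at one, and all function names and the simulated data are hypothetical.

```python
import numpy as np

def index(X, theta_free):
    """Single index with the first (baseline) coefficient fixed at 1."""
    theta = np.concatenate(([1.0], theta_free))
    return X @ theta

def loo_nw_cdf(y, Y, u, h):
    """Leave-one-out Nadaraya-Watson estimate of P(Y <= y | index = u_i)."""
    K = np.exp(-0.5 * ((u[:, None] - u[None, :]) / h) ** 2)
    np.fill_diagonal(K, 0.0)  # drop the i-th observation from its own fit
    ind = (Y <= y).astype(float)
    return (K @ ind) / K.sum(axis=1)

def ss(theta_free, y, X, Y, h):
    """Unpenalized sum of squares at a fixed y, as in the first part of (15)."""
    u = index(X, theta_free)
    resid = (Y <= y).astype(float) - loo_nw_cdf(y, Y, u, h)
    return float(np.mean(resid ** 2))

def pss(theta_free, theta_init, y, X, Y, h, lam):
    """Add the adaptive-Lasso penalty, weighted by an initial estimate."""
    penalty = float(np.sum(np.abs(theta_free) / np.abs(theta_init)))
    return ss(theta_free, y, X, Y, h) + lam * penalty

# Simulated data: true index X1 + 0.5 * X2 + 0 * X3 (illustration only).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
Y = X @ np.array([1.0, 0.5, 0.0]) + 0.3 * rng.normal(size=200)
theta = np.array([0.5, 0.0])
theta_init = np.array([0.4, 0.1])  # hypothetical initial estimate
base = ss(theta, 0.0, X, Y, h=0.5)
penalized = pss(theta, theta_init, 0.0, X, Y, h=0.5, lam=0.1)
print(base, penalized)
```

Because the penalty weight for each coefficient is the reciprocal of its initial estimate, coefficients with small initial estimates are penalized more heavily, which is what drives them to exactly zero when the criterion is minimized.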

APPENDIX: PROOFS

Proof of Theorem 1.

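By assumptions (A1) and (A3), we can derive from (7) that

$$
\begin{aligned}
\sup_{\mathcal{Y}\times\mathcal{X}_\theta}\big|\widehat{G}(y, x_\theta) - G(y, x_{\theta_0})\big|
&\le \frac{1}{\inf_{\mathcal{X}_\theta} f(x_\theta)}\sup_{\mathcal{Y}\times\mathcal{X}_\theta}\big|N_1(y, x_\theta) - f(x_\theta)M_{10}(y, x_\theta)\big| \\
&\quad + \frac{\sup_{\mathcal{Y}\times\mathcal{X}_\theta}|N_1(y, x_\theta)|}{\inf_{\mathcal{X}_\theta} N_0(y, x_\theta)}\sup_{\mathcal{X}_\theta}\big|N_0(y, x_\theta) - f(x_\theta)\big| \\
&= o_p\big(\sqrt{\ln n/(nh)}\big) + O(h^2) = o_p(1).
\end{aligned} \tag{16}
$$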


It follows immediately from (16) that

$$\sup_{\theta}\Big|SS(\theta) - \int_{\mathcal{Y}} E\big[(N(y) - M_{10}(y, X_\theta))^2\big]\,dW(y)\Big| = o_p(1). \tag{17}$$

Moreover, $\theta_0$ can be shown to be the unique minimizer of $\int_{\mathcal{Y}} E[(N(y) - M_{10}(y, X_\theta))^2]\,dW(y)$ through the inequality $\int_{\mathcal{Y}} E[(N(y) - M_{10}(y, X_\theta))^2]\,dW(y) \ge \int_{\mathcal{Y}} E[\varepsilon^2(y; \theta_0)]\,dW(y)$. With (17), the consistency of $\widehat{\theta}$ is ensured by applying Theorem 5.1 of Ichimura (1993).

Along the same lines as the proof of (16), one has

$$\sup_{\mathcal{Y}\times\mathcal{X}}\big|\partial_\theta\widehat{G}(y, x_{\theta_0}) - H_1(y, x_{\theta_0})\big| = o_p\big(\sqrt{\ln n/(nh^3)}\big) + O(h^2). \tag{18}$$

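The score function $\sqrt{n}S(\theta_0) = \sqrt{n}\,\partial_\theta SS(\theta_0)$ can be further decomposed as

$$\sqrt{n}S(\theta_0) = \sum_{l_1=0}^{1}\sum_{l_2=0}^{1}\sqrt{n}S_{l_1 l_2}(\theta_0), \tag{19}$$

where

$$S_{l_1 l_2}(\theta_0) = -\frac{2}{n}\sum_{i=1}^{n}\int_{\mathcal{Y}}\varepsilon_i^{1-l_1}(y; \theta_0)\,H_1^{1-l_2}(y, X_{i\theta_0})\big(G(y, X_{i\theta_0}) - \widehat{G}(y, X_{i\theta_0})\big)^{l_1}\big(\partial_\theta\widehat{G}(y, X_{i\theta_0}) - H_1(y, X_{i\theta_0})\big)^{l_2}\,dW_{ni}(y).$$

It is implied from (16) and (18) that

$$\sqrt{n}S_{11}(\theta_0) = o_p(1). \tag{20}$$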

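Let

$$A_{1mi} = -\frac{M_{01}^{2-m}(y, X_{i\theta_0})\,\partial_{x_\theta}^{2-m}G(y, X_{i\theta_0})}{f(X_{i\theta_0})}, \qquad A_{2mi} = \frac{\varepsilon_i(y; \theta_0)\,G^{2-m}(y, X_{i\theta_0})}{f(X_{i\theta_0})}, \quad m = 1, 2,$$

$$A_{2mi} = \frac{1}{f^2(X_{i\theta_0})}\Big(\varepsilon_i(y; \theta_0)\,\partial_{x_\theta}\big(f(X_{i\theta_0})M_{11}(y, X_{i\theta_0})\big)G^{4-m}(y, X_{i\theta_0}) + (m - 4)f(X_{i\theta_0})M_{11}(y, X_{i\theta_0})\,\partial_{x_\theta}G(y, X_{i\theta_0})\Big), \quad m = 3, 4,$$

$$\phi_{kmij} = N_j^{2-m}(y)K_h(X_{j\theta_0} - X_{i\theta_0}) - G^{2-m}(y, X_{i\theta_0})f(X_{i\theta_0}), \quad k, m = 1, 2,$$

and

$$\phi_{2mij} = N_j^{4-m}(y)\,\partial_\theta K_h(X_{j\theta_0} - X_{i\theta_0}) - G^{2-m}(y, X_{i\theta_0})\,\partial_{x_\theta}\big(f(X_{i\theta_0})M_{01}(y, X_{i\theta_0})\big), \quad m = 3, 4.$$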

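The terms $\sqrt{n}S_{l_1 l_2}(\theta_0)$, $l_1 \neq l_2$, in (19) can be rewritten as

$$\sqrt{n}S_{l_1 l_2}(\theta_0) = -\frac{2}{\sqrt{n}}\sum_{m=1}^{2k}\sum_{i\neq j}\int_{\mathcal{Y}} A_{kmi}\,\phi_{kmij}\,dW_{ni}(y) + o_p(1), \quad k = l_1 + 2l_2. \tag{21}$$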


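A little tedious but straightforward algebra yields $E[A_{kmi} \mid x_{i\theta_0}] = 0$ and $E[\phi_{kmij} \mid x_i, y_i] = O(h^2)$, $k = 1, 2$, which imply that

$$
\begin{aligned}
P\Big(\sup_{\mathcal{Y}}\sqrt{n}\,|S_{l_1 l_2}(\theta_0)| > \epsilon\Big)
&\le \frac{1}{\epsilon^2 n(n-1)^2}\sum_{m=1}^{2k}\sup_{\mathcal{Y}}\Big(\sum_{i\neq j}\Big(E[A_{kmi}^2\phi_{kmij}^2] + \sum_{l\neq i,j}E[A_{kmi}^2\phi_{kmij}\phi_{kmil}]\Big)\Big) \\
&= O(1/n) + O(h^4), \quad k = l_1 + 2l_2,
\end{aligned} \tag{22}
$$

for any $\epsilon > 0$.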

Combining (20)-(22), one has $\sqrt{n}S(\theta_0) = \sqrt{n}S_{00}(\theta_0) + o_p(1)$. By applying the central limit theorem, it is easy to show that

$$\sqrt{n}S(\theta_0) \xrightarrow{d} N(0, V_{1\theta_0}). \tag{23}$$

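Similar to the proofs of (16) and (18), there exist functions $H_2(y, x_\theta)$ and $H_2(\theta) = E\big[\int_{\mathcal{Y}}\big(H_1^{\otimes 2}(y, X_\theta) - \varepsilon(y; \theta)H_2(y, X_\theta)\big)\,dW_{ni}(y)\big]$ satisfying

$$\sup_{\mathcal{Y}\times\mathcal{X}_\theta}\big|\partial_\theta^2\widehat{G}(y, x_\theta) - H_2(y, x_\theta)\big| = o_p\big(\sqrt{\ln n/(nh^5)}\big) + O(h^2) \tag{24}$$

and

$$\sup_{\theta}|I(\theta) - H_2(\theta)| = o_p\big(\sqrt{\ln n/(nh^5)}\big) + O(h^2). \tag{25}$$

From (18), (24), and the law of large numbers, a direct calculation yields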

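$$I(\theta_0) = V_{2\theta_0} - \frac{2}{n}\sum_{i=1}^{n}\int_{\mathcal{Y}}\varepsilon_i(y; \theta_0)H_2(y, X_{i\theta_0})\,dW_{ni}(y) + o_p(1) = V_{2\theta_0} + o_p(1). \tag{26}$$

By assumption (A1), for any $\epsilon > 0$, there exists an open set $\Theta_0 \subset \Theta$ such that $\sup_{\theta\in\Theta_0}\|H_2(\theta) - H_2(\theta_0)\| < \epsilon/3$. We can further derive from (25) that

$$P\Big(\sup_{\theta\in\Theta_0}\|I(\theta) - I(\theta_0)\| > \epsilon\Big) \le P\Big(\sup_{\theta\in\Theta_0}\|I(\theta) - H_2(\theta)\| > \frac{\epsilon}{3}\Big) + P\Big(\|H_2(\theta_0) - I(\theta_0)\| > \frac{\epsilon}{3}\Big)$$

and, hence,

$$P\Big(\sup_{\theta\in\Theta_0}\|I(\theta) - I(\theta_0)\| > \epsilon\Big) \to 0 \quad\text{as } n \to \infty. \tag{27}$$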


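Applying the Taylor expansion to $S(\widehat{\theta})$ around $\theta_0$, which corresponds to a second-order expansion of $SS$, one has

$$S(\theta_0) + I(\theta^*)(\widehat{\theta} - \theta_0) = 0, \tag{28}$$

where $\theta^*$ lies on the line segment between $\widehat{\theta}$ and $\theta_0$. Finally, the limiting distribution of $\sqrt{n}(\widehat{\theta} - \theta_0)$ is obtained by Slutsky's theorem, the consistency of $\widehat{\theta}$, (23), and (27)-(28).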
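The final step can be made explicit by solving (28) for the estimation error. The display below is a sketch that assumes $V_{2\theta_0}$ is nonsingular and symmetric; the sandwich form of the limiting covariance is what (23), (26), and (27) suggest under these assumptions and is not quoted from the statement of Theorem 1.

```latex
\begin{aligned}
\sqrt{n}(\widehat{\theta} - \theta_0)
  &= -I(\theta^*)^{-1}\sqrt{n}S(\theta_0)
     && \text{by (28)} \\
  &= -V_{2\theta_0}^{-1}\sqrt{n}S(\theta_0) + o_p(1)
     && \text{by (26), (27), and the consistency of } \widehat{\theta} \\
  &\xrightarrow{d} N\big(0,\; V_{2\theta_0}^{-1}V_{1\theta_0}V_{2\theta_0}^{-1}\big)
     && \text{by (23) and Slutsky's theorem.}
\end{aligned}
```

The second line uses the fact that $\theta^*$ lies between $\widehat{\theta}$ and $\theta_0$, so it falls in $\Theta_0$ with probability tending to one and $I(\theta^*) = I(\theta_0) + o_p(1) = V_{2\theta_0} + o_p(1)$.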

Proof of Theorem 2.

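Let $S^{rw}(\theta)$ and $I^{rw}(\theta)$ denote the first and second derivatives of $SS^{rw}(\theta)$. Paralleling the proof steps in (7), we can show that

$$\sup_{\mathcal{Y}\times\mathcal{X}_\theta}\big|\partial_\theta^{l_2}N_{l_1}^{rw}(y, x_\theta) - \partial_\theta^{l_2}N_{l_1}(y, x_\theta)\big| = o_{p^*}\big(1/(nh^{2l_2+1})\big). \tag{29}$$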

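By the Taylor expansion and (29), one has

$$\sup_{\theta}|SS^{rw}(\theta) - SS(\theta)| \le \frac{1}{n}\sum_{i=1}^{n}\Big(\frac{\xi_i}{\mu} - 1\Big)\sup_{\theta}\Big|\int_{\mathcal{Y}}e_i^2(y; \theta)\,dW_{ni}(y)\Big| + o_{p^*}(1) = o_{p^*}(1). \tag{30}$$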

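The same argument as for the convergence of $\widehat{\theta}$ to $\theta_0$ implies that $(\widehat{\theta}^{rw} - \widehat{\theta}) = o_{p^*}(1)$. From (29) and $S(\widehat{\theta}) = 0$, the score function $S^{rw}(\widehat{\theta})$ can be expressed as

$$S^{rw}(\widehat{\theta}) = S^{rw}_{00}(\widehat{\theta}) + S^{rw}_{01}(\widehat{\theta}) + S^{rw}_{10}(\widehat{\theta}) + o_{p^*}\big(\sqrt{1/n}\big) \tag{31}$$

with

$$S^{rw}_{l_1 l_2}(\widehat{\theta}) = -\frac{2}{n}\sum_{i=1}^{n}\frac{\xi_i}{\mu}\int_{\mathcal{Y}}e_i^{1-l_1}(y; \theta)\big(\partial_\theta\widehat{G}(y, X_{i\widehat{\theta}})\big)^{1-l_2}\big(\partial_\theta\widehat{G}^{rw}(y, X_{i\widehat{\theta}}) - \partial_\theta\widehat{G}(y, X_{i\widehat{\theta}})\big)^{l_1}\big(\widehat{G}(y, X_{i\widehat{\theta}}) - \widehat{G}^{rw}(y, X_{i\widehat{\theta}})\big)^{l_2}\,dW_{ni}(y).$$

We further confirm from $E^*[S^{rw}_{00}(\widehat{\theta})] = 0$, $E^*[(S^{rw}_{00}(\widehat{\theta}))^2] \xrightarrow{p} \rho^{-2}V_{2\theta_0}$, and the central limit theorem for independent random vectors (Serfling (1980)) that

$$P^*\big(\rho V_2^{-1/2}\sqrt{n}S^{rw}_{00}(\widehat{\theta}) \le w\big) \xrightarrow{p} \prod_{l=2}^{d}\Phi(w_l). \tag{32}$$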