New Estimation and Inference Procedures for a Single-Index Conditional Distribution Model

33  Download (0)

Full text



國立臺灣大學 數學系 預印本 Department of Mathematics, National Taiwan University ~ mathlib/preprint/2011- 15.pdf

New Estimation and Inference Procedures for a Single-Index Conditional Distribution Model

Chin-Tsang Chiang and Ming-Yueh Huang

July 31, 2011


New Estimation and Inference Procedures for a Single-Index Conditional Distribution Model

Chin-Tsang Chiang and Ming-Yueh Huang

Department of Mathematics, National Taiwan University, Taipei 10617, Taiwan, ROC

July 31, 2011


In this article, a more flexible single-index regression model is employed to char- acterize the conditional distribution. The pseudo least integrated squares approach is proposed to estimate the index coefficients. As shown in the numerical results, our esti- mator outperforms the existing ones in terms of the mean squared error. Moreover, we provide the generalized cross-validation criteria for bandwidth selection and utilize the frequency distributions of weighted bootstrap analogues for the estimation of asymp- totic variance and the construction of confidence intervals. With a defined residual process, a test rule is established to check the adequacy of an applied single-index condi- tional distribution model. To tackle with the problem of sparse variables, a multi-stage adaptive Lasso algorithm is developed to enhance the ability of identifying significant variables. All of our procedures are found to be easily implemented, numerically stable, and highly adaptive to a variety of data structures. In addition, we assess the finite sam- ple performances of the proposed estimation and inference procedures through extensive simulation experiments. Two empirical examples from the house-price study in Boston and the environmental study in New York are further used to illustrate applications of the methodology.

Keywords: Adaptive Lasso; Cross-validation; Curse of dimensionality; Multi-stage adaptive Lasso; Naive bootstrap; Oracle properties; Single-index; Pseudo least integrated squares estimator; Pseudo least squares estimator; Pseudo maximum likelihood estimator; Random weighted bootstrap; Residual process.


1 Introduction

We consider the conditional distribution FY(y|x) of a real-valued response Y given continuous covariates X = x, where X = (X1, · · · , Xd)T and x = (x1, · · · , xd)T. In regression analysis, a wide cross-section of research interests is pursued in the study of the conditional mean E[Y |x].

A more complete methodology and theoretical framework related to fully nonparametric and semiparametric distribution models still remains and a further investigation is necessary. As one can see, with a large number of covariatess, a fully nonparametric distribution usually suffers from the curse of dimensionality (Bellman (1961)). Although parametric models have played prominent roles in applications, they are frequently detected to be inadequate in many studies. Consequently, a more flexible semiparametric model becomes a great interest to characterize the dependence of Y on X and avoids the impact of misspecification of parametric models and the difficulty in the estimation of nonparametric distributions.

One of the most popular extension of parametric models is the single-index (SI) conditional distribution model:

FY(y|x) = G(y, xθ0), (1)

where G(·, ·) is an unknown bivariate function, xθ = x1+ (x2, · · · , xd)Tθ, and θ0 is a vector of true index coefficients. The most significant covariate is assumed, without loss of generality, to be X1 and the setting of its coefficient is mainly to deal with the problem of identifia- bility. When the conditional mean exists, it can be easily obtained from the above model that E[Y |x] = m(xθ0) with m(·) being some unspecified function. Based on the conditional mean model, Powell, Stock, and Stoker (1989) utilized the estimation of the density-weighted


average derivative to estimate θ0. Although the estimator was shown to be √

n-consistent, asymptotically normal, and computationally simple, the numerical instability is usually seen as a consequence of high-dimensional kernel smoothing. To overcome such a weakness with practice, Ichimura (1993) developed a semiparametric least squares approach and derived its asymptotic properties. Meanwhile, H¨ardle, Hall, and Ichimura (1993) recommended a cross- validation criterion to simultaneously estimate bandwidths and index coefficients. Under the validity of model (1) with a continuous response, Delecroix, H¨ardle, and Hristache (2003) introduced the pseudo likelihood (PL) estimation for θ0. Without moment and continuous conditions on Y , Hall and Yao (2005) suggested an estimation criterion on the basis of the average squared difference between the empirical estimator and the model-based estimator of the joint probability of (Y, XT)T. As one can see, the good performance of their estimation procedure is connected to an appropriate number of spheres and the corresponding radii used in the integral approximation. Currently, there is still no standard rule to determine the val- ues of these two quantities. Furthermore, the established algorithm is often computationally slow and intensive, especially in high-dimensional covariate spaces. Confronted with these problems, we proposed a new type of estimation criterion, which is simple and easily imple- mented, for θ0. The basic rationale behind this approach is to define the response process N (y) = I(Y ≤ y) and to directly use the difference between N (y) and its conditional mean G(y, xθ0) over the support of Y . Under some suitable conditions, the asymptotic distribution of the pseudo least integrated squares estimator (PLISE) is derived to be multivariate normal.

To make inferences related to θ0, the frequency distributions of its bootstrap analogues are used to estimate the asymptotic variance of the PLISE because a sandwich-type estimator tends to provide a very poor approximation. With the proposed residual process, the method


of Xia (2009) is extended to establish a test rule to check the adequacy of model (1). There are two features of the PLISE: Firstly, our estimation approach can be applied to different types of response variable and outperforms the existing ones; secondly, the foregoing inferences can be easily adopted and generalized to the considered problems in this article.

When the true underlying model has a sparse representation, identifying significant co- variates becomes an important issue to enhance the accuracy in prediction. The traditional best-subset selection algorithms are usually computationally infeasible in the presence of a po- tentially high-dimensional covariate space. The ridge regression estimation is another variance- stabilizing technique, which shrinks the least square estimator toward zero but not does not identify significant covariates cleverly. To simultaneously select significant variables and to estimate the parameters in regression models, Tibshirani (1996) introduced a least absolute shrinkage and selection operator (Lasso). Since Lasso variable selection might be inconsistent, Fan and Li (2001) and Zou (2006) proposed a smoothly clipped absolute deviation (SCAD) penalty and an adaptive Lasso instead. In their model specifications, the adaptive Lasso fur- ther avoids the problem of nonconcavity in the SCAD penalty although both of the procedures enjoy the oracle properties. By extending the adaptive Lasso in generalized linear models to our framework, we propose the penalized pseudo least integrated squares estimator (PPLISE) and derive the corresponding oracle properties. Moreover, in a small sample size scenario, a multi-stage adaptive Lasso estimation procedure is proposed to improve the possible selection inconsistency and predictive inaccuracy in the PPLISE.

The rest of this article is organized as follows. In Section 2, we propose the PLISE for θ0 and the cross-validation criteria for bandwidth selection. Moreover, the weighted bootstrap


inference procedures are introduced to estimate the asymptotic variance of the PLISE and construct the confidence regions for parameters of interests. A test rule and a multi-stage adaptive Lasso procedure are established in Section 3. In Sections 4-5, the simulation exper- iments are conducted and the proposed approaches are applied to two empirical examples.

Some concluding remarks and future research topics are provided in Section 6 and the proofs of the main results are placed in the Appendix.

2 Estimation and Inference Procedures

Based on a random sample of the form {(Xi, Yi)}ni=1, the PILSE of θ0 and the bandwidth selection criteria are proposed. The frequency distributions of bootstrap analogues are fully employed to estimate the asymptotic variance of the PILSE and construct the confidence intervals for the parameters of interest.

2.1 Estimation and Bandwidth Selection

For each fixed (y, xθ), the approach of Hall, Wolff, and Yao (1999) can be applied for the estimation of G(y, xθ). Let K(u) denote a kernel density, h be a positive-valued bandwidth, N`(y, X) = P

j6=i(Nj)`(y)Kh(X − X)/(n − 1), i = 1, · · · , n, ` = 0, 1, and Kh(u) = h−1K(u/h). The Nadaraya-Watson estimator for G(y, X) is given by bG(y, X) = N1(y, X)/

N0(y, X). By using the response process N (y) and a consistent estimator of G(y, xθ), the PLISE bθ is proposed to be a minimizer of the pseudo sum of integrated squares (PSIS):

SS(θ) = 1 n






e2i(y; θ)dWni(y), (2)


where Y is the support of Y or the interval of interest, ei(y; θ) = Ni(y)− bG(y, X), and Wni(y) is a non-negative weight function. In practical implementation, bG(y, xθ) is set to be zero if the denominator N0(y, xθ) is zero. Although a local linear estimator of G(y, xθ) can be used in the PSIS, it does not share the properties of a cumulative distribution function and might cause some complications in the above estimation procedure.

It follows from a direct algebraic calculation that

E[(N (y) − G(y, Xθ))2] = E[(N (y) − FY(y|X))2] + E[(FY(y|X) − G(y, Xθ))2]. (3)

Since the first term at the right-hand side of (3) does not depend on θ, both minimizers of E[R

Y(N (y) − G(y, Xθ))2dW (y)] and E[R

Y(FY(y|X) − G(y, Xθ))2dW (y)] can be shown to be θ0

under the validity of model (1), where W (y) is a convergent function of Wn(y). Thus, minimiz- ing SS(θ) is on average approximated by minimizing E[R

Y(FY(y|X) − G(y, Xθ))2dW (y)] with respect to θ. In our theoretical development and numerical implementation, the quartic kernel K(u) = (15/16)(1−u2)2I(|u| ≤ 1) is specified. The advantage of such a density function is that bθ can achieve the√

n-consistency. As a spacial case, the uniform distribution or the empirical distribution of Y can be specified for Wni(y)’s in (2). In the case where G(y, xθ) is known, the optimal weight for wni(y) = dWni(y)/dy is proportional to {G(y, X)(1 − G(y, X))}−1, the reciprocal of the conditional variance of Ni(y), at each fixed y. We further replace G(y, xθ) by a consistent estimator bG(y, xbθ) and iteratively update the weight estimation. The resulting estimator coincides with the maximizer of the following log-pseudo likelihood function for a random sample {Ni(y) : 1 ≤ i ≤ n}:

lp(θ) = 1 n






(Ni(y) ln( bG(y, X)) + (1 − Ni(y)) ln(1 − bG(y, X))dy. (4)


Let y(1) < · · · < y(m) denote the distinct order statistics of {Y1, · · · , Yn} and W(j) = Ry(j+1)

y(j) dWni(y). Since Ni(y)’s are zero-one processes and bG(y, X)’s are step functions with jumps occurring at {y(1), · · · , y(m)}, the PSIS in (2) has a computationally more attractive alternative as follows:

SS(θ) = 1 n



i=1 m−1



e2i(y(j); θ)W(j). (5)

In contrast, the estimation procedure of Hall and Yao (2005) is often computationally inten- sive. When the response variable Y is discrete and has a finite support, the above estimation criterion can also be applied. As for the binary response with values in {0, 1}, the PSIS will automatically reduce to the sum of squares in Ichimura (1993). In kernel estimation, a crite- rion for bandwidth selection is provided via generalizing the most commonly used “leave one subject out” cross-validation procedure of Rice and Silverman (1991). The optimal bandwidth hcv is naturally defined to be the unique minimizer of

CV1(h) = 1 n



i=1 m−1



e2i(y(j); bθi)W(j) (6)

with bθi = arg min{(n − 1)−1P



j=1 e2l(y(j); θ)W(j)}. Another criterion developed by H¨ardle, Hall, and Ichimura (1993) is further adopted and extended to our framework. The estimators of h and θ0 can be simultaneously obtained via minimizing CV2(h, θ) = SS(θ).

2.2 Asymptotic Properties

Suppose that X has a compact support X and θ0 is an interior point of the compact pa- rameter space Θ ⊆ Rd−1. Let f (xθ) be the density function of Xθ on Xθ = {xθ : x ∈ X , θ ∈ Θ}, Ml1l2(y, xθ) = E[Gl1(y, Xθ0)(x − X)⊗l2|xθ], l1 = 0, 1, l2 = 0, 1, 2, H1(y, xθ0) =


M01(y, xθ0)∂xθG(y, xθ0), V0 = 2E[(R

Yε(y; θ0)H1(y, Xθ0)dW (y))⊗2], and V0 = 4E[R

Y H1⊗2(y, Xθ0)dW (y)]. To establish the asymptotic properties of bθ, some suitable conditions are made before establishing the asymptotic properties of bθ as follows:

(A1) infXθf (xθ) > 0.

(A2) d3x

θf (xθ) and ∂x3

θM12(y, xθ) are Lipschitz continuous in xθ with the Lipschitz constants being independent of (y, xθ).

(A3) h = h0n−ς1 for ς1 ∈ (1/8, 1/5) and some positive constant h0. (A4) V0 and V0 are nonsingular.

Since the classes of kernel functions indexed by (h, xθ) are Euclidean (Pakes and Pollard (1989)), the imposed conditions entail that the considered classes of functions are Euclidean.

By applying Theorem II.37 of Pollard (1984), one has



|∂θl2Nl1(y, xθ) − ∂xl2

θ(f (xθ)Ml1l2(y, xθ))| = o(p

ln n/nh2l2+1) + O(h2) a.s. (7)

For simplicity, the consistency and asymptotic normality of bθ are established in the following theorem with the case of deterministic weight functions Wi(y)’s.

Theorem 1. Suppose that assumptions (A1)-(A4) are satisfied. Then, bθ → θp 0 and √ n(bθ − θ0)→ N (0, Σd θ0) as n → ∞, where Σθ0 = V−1



For a continuous response Y , the PMLE eθ of Delecroix, H¨ardle, and Hristache (2003) is obtained by maximizing

l(θ) = 1 n






j6=iK4h2(Yj− Yi)K4h1(X− X) P

j6=iK4h1(X− X) ), (8)


where K4h(u) = K4(u/h)/h and K4(u) = (105/64)(1 − 5u2+ 7u4− 3u6)I(|u| ≤ 1) (cf. Gørgens (2004)). The fourth-order kernel function is required to ensure the 1/√

n convergence rate.

The authors concluded that the proposed estimator can achieve the asymptotic efficiency. We further examined some mistakes in the theoretical derivations and that eθ only reaches the semiparametric efficiency bound. Let g(y, xθ0) = fY(y|x),

W0 = E[(∂x2

θg(Y, Xθ0)

g(Y, Xθ0) + 2dxθf (Xθ0)∂xθg(Y, Xθ0)

f (Xθ0) )M02(y, Xθ0) + ∂xθg(Y, Xθ0) g(Y, Xθ0)

·(2∂xθM02(y, Xθ0) − M01⊗2(y, Xθ0))], and W0 = E[(∂xθg(Y, Xθ0)

g(Y, Xθ0) )2(X − E[X|Xθ0])⊗2)].

The proofs for Theorem 1 are processed in the same manner for eθ1 → θp 0 and √

n(eθ1− θ0)→d N (0, W−1


0) under assumption (A1) and the assumptions:

(B1) d3xθf (xθ), ∂x3θM02(y, xθ), and ∂x3θE[g(Y, Xθ0)(x − X)⊗2|xθ] are Lipschitz continuous in xθ

with the Lipschitz constants being independent of (y, xθ).

(B2) hk= h0kn−ς2, k = 1, 2, for some positive constants h0k’s and ς2 ∈ (1/16, 1/6).

(B3) W0 and W0 are nonsingular.

2.3 Bootstrap Inferences

Based on the limiting distribution of√

n(bθ−θ0), a general rule in the construction of confidence intervals usually relies on an appropriate estimator of Σθ0. One of the most widely used estimators is the sandwich-type estimator bΣ

bθ = bV−1

1bθ Vb2bθVb−1

1bθ , where Vb1bθ = 2




i=1 m−1



(∂θG(yb (j), Xibθ))⊗2W(j) and bV2bθ = 4 n








ei(y(j); bθ)∂θG(yb (j), Xibθ)W(j))⊗2.


In practical implementation, a sufficiently good performance of bΣ

θb essentially requires an adequate bandwidth. Although the smoother in bΣ

θb can be chosen different from that in G(y, xb θb), there is still no standard criterion for doing so.

An alternative approach to avoid encountering such a situation is to employ the frequency distribution of bootstrap replications. A natural resampling approach is to draw independent bootstrap random vectors U1nb, · · · , Unnb from the empirical distribution Pn,U = n−1Pn

i=1IUi with Ui = (Xi, Yi), i = 1, · · · , n. The bootstrap analogue bθnb is straightforward created via solving SS(θ) in (2) based on a bootstrap sample {U1nb, · · · , Unnb}. Without requiring drawing observations from the collected data, we further adopt general weighted bootstrap approximations for the sampling distribution of bθ. Let ξ1, · · · , ξn be independently generated from a common distribution with mean µ and variance σ2. The random weighted bootstrap estimator bθrw is then defined to be the minimizer of

SSrw(θ) =








(erwi (y(j); θ))2W(j), (9)

where Di = ξi/Pn

j=1ξj, erwi (y; θ) = Ni(y)− bGrw(y, X), bGrw(y, X) = N1rw(y, X)/N0rw(y, X), and Nlrw(y, X) =P

j6=iDj Njl(y)Kh(X− X). This bootstrapping approach can be treated as the naive bootstrap one with a measure Pn,Urw =Pn

i=1DiIUi. It is interesting to note that the dependent weights Di’s can also be replaced with the independent weights ξi/(nµ)’s.

The random bootstrap confidence intervals for θl, l = 2, · · · , (d − 1), are constructed by

l± ρz1−α/2serw(bθlrw− bθl) or (bθl− ρq1−α/2rw (bθlrw− bθl), bθl− ρqα/2rw(bθrwl − bθl)), (10) where ρ = µ/σ is a scale factor modification for the variability in the weights, zp is the pth quantile value of the standard normal distribution, and serw(·) and qrw(·) denote the standard


error and the 100pth percentile of B bθrw’s, respectively. Let P(·) represent the probability measure conditioning on {U1, · · · , Un}. The validity of (10) is given in the next theorem.

Theorem 2. Suppose that assumptions in Theorem 1 are satisfied. Then,


nρ(bθrw− bθ) ≤ w) − P (√

n(bθ − θ0) ≤ w)→ 0 for all w = (wp 2, · · · , wd)T as n → ∞. (11)

3 Model Test and Sparse Models

In this section, a test rule is established for the adequacy of model (1). The PPLISE is proposed to tackle with the problem of sparse variables. A multi-stage adaptive Lasso procedure is further developed to improve the accuracy of variable selection.

3.1 Model Checking

Let ε(y; θ) = N (y) − G(y, Xθ) and θ1 be the minimizer of R

YE[ε2(y; θ)]dW (y). It is straight- forward to yield θ1 = θ0 and E[ε(y; θ1)] = 0 under model (1). If the considered model is inadequate, εy(y; θ1) can be projected into some linear combinations of covariates, i.e.

ε(y; θ1) = ν(y, xθ2) + ζ(y) with E[ζ(y)] = 0 for some y and θ2 = arg minθR

Ymin{E[(ε(y; θ1) − ν(y, xθ))2], E[ε2(y; θ1)]}dW (y). The parameter θ2 is naturally estimated by the minimizer ˘θ of RSSn(θ), where

RSSn(θ) =




1 n




min{(ei(y(j); bθ) −bν(y(j), X))2, (ei(y(j); bθ) − ¯e(y(j); bθ))2}

! W(j)

with ν(y, Xb ) = P

j6=iej(y; bθ)Kh(X− X) P

j6=iKh(X− X) and ¯e(y; bθ) = 1 n




ei(y; bθ).


By further computing TSSn = n−1Pn i=1


j=1 (ei(y(j); bθ)−¯e(y(j); bθ))2W(j), the test statistic Fn= RSSn(˘θ)/TSSn is used to check model (1):

(H0 : FY(y|x) = G(y, xθ0) for all (x, y) HA: FY(y|x) 6= G(y, xθ0) for some (x, y) .

The null hypothesis is rejected with a significance level α whenever Fn ≤ qα(Fnb), where Fnb is the bootstrap analogue of Fn with ei(y; bθ)’s substituting for ei(y; bθ)’s and each ei(y; bθ) being independently drawn from a two-point distribution: ((5 +√


i(y;bθ)/2+ ((5 −


i(y;bθ)/2 (cf. H¨ardle (1989)). As expected, the developed test rule is generally more powerful than those based on X, especially when its dimension is high. Similar to the single-indexing cross-validation value of Xia (2009), we consider the measure

SCVn =




1 n




min{(ei(y(j); bθ) −bν(y(j), Xθ

i))2, (ei(y(j); bθ) − ¯e(y(j); bθ))2}


W(j), (12)

where ˘θi = arg minθPm−1 j=1 (Pn

i=1min{(ei(y(j); bθ) −bνi(y(j), X))2, (ei(y(j); bθ) − ¯e(y(j); bθ))2})W(j) if it exists and νbi(y, X) is computed as bν(y, X) with the ith subject being deleted, i = 1, · · · , n. Following the argument of this author, one can also conclude that as the sample size approaches infinity, SCVn= TSSn if model (1) is adequate and SCVn< TSSn otherwise.

3.2 Adaptive Lasso Estimation and Oracle Properties

The PPLISE bθ(p) of θ0 is obtained via minimizing the penalized pseudo sum of integrated squares (PPSIS):

PSS(θ) = SS(θ) + λ





|bθl|, (13)

where λ is a nonnegative regularization or tuning parameter. In this variable selection and estimation procedure, significant covariates receive smaller penalties and tend to have nonzero


coefficient estimates while nonsignificant coefficients will be shrunk into zero. The above optimization problem entails that the true underlying model can be consistently identified and bθ(p)A0 has the same asymptotic distribution as bθA0, where A0 = {l|θ0l 6= 0}.

Theorem 3. Suppose that assumptions (A1)-(A4) are satisfied and λ = λ0n−ς3 for ς3 ∈ (1/2, 1) and some positive constant λ0. Then, P ( bA = A0) → 1 and √

n(bθ(p)A0 − θA0) →d N (0, ΣθA0) as n → ∞, where bA = {l|bθl 6= 0} and ΣθA0 is the asymptotic variance of bθA0.

It is revealed in our simulation experiments that the one-stage adaptive Lasso estimation in (13) usually cannot achieve the variable selection well in small sample applications. To conquer this shortcoming, we develop a multi-stage adaptive Lasso estimation scheme. Let eθ(m) represent the vector of nonzero estimates at the mth stage, θ(m) = (θ(m)l−T , θ(m)l, θT(m)l+)T be the corresponding parameter vector with a length of dm, and θ(m)l− and θ(m)l+ denote the vectors of coefficients with sub-indices smaller and greater than l, respectively. Moreover, SS(m)(m)) is defined as SS(θ) with θ being replaced with θ(m). The estimation procedure is implemented through the following steps:

S1. (eθ(1)l(1), eh(1)(1)l) = arg minθl,hSS(1)(eθ(1)l−(1) , θ(1)l, eθ(0)l+) + λ|θl|/|eθ(1)l(0)| with eθ(0)(1)l = bθl and eθ(1)(1)l = 0 whenever |eθ(1)(1)l| < ε0, l = 2, · · · , d1, for some sufficiently small positive value ε0.

S2. Set (eθ(k)(1)l, eh(k)(1)l) = (0, eh(k−1)(1)l ) if eθ(k−1)(1)l = 0; otherwise, (eθ(k)(1)l, eh(k)(1)l) = arg minθl,hSS(1) (eθ(k)(1)l−, θ(1)l, eθ(1)l+(k−1)) + λ|θl|/|eθ(0)l(k−1)| and eθ(1)l(k) = 0 whenever |eθ(k)(1)l| < ε0, k = 2, · · · .

S3. Iterations are stopped if keθ(k)(1)− eθ(1)(k−1)k < ε1 for some pre-chosen small value ε1 > 0, and θe(1)λ is set to be non-zero components of eθ(1)(k).


S4. eθ(1) = eθ(1)λ1 with λ1 = arg minλGCV(λ) and

GCV(λ) = SS(1)(eθ(1)λ) (1 − n−1tr(( bV1eθ

(1)λ+ diag(λ/neθ2(1)λ))−1Vb1eθ


S5. Repeat steps 1-4 M times until keθ(M )− eθ(M −1)k < ε2 for some small value ε2 > 0.

4 Monte Carlo Experiments

The performances of proposed estimation and inference procedures were assessed through a class of simulations with a variety of sample sizes, correlation structures of covariates, and error processes. The simulations were based on 1000 replications and the bootstrap inferences were drawn from 500 bootstrap samples, which enable us to obtain stable numerical results.

4.1 Assessment of Estimators and Inference Procedures

In this simulation scenario, the covariates X = (X1, X2, X3)T were generated from a trivariate normal distribution with mean zero, standard deviation of 1, and pairwise correlations of 0.2 or 0.5. The response variable Y was generated from the following two models:

M1. Y = Xθ0+ ε with θ0 = (1, 0.8, −0.5)T and ε ∼ N (0, 0.25).

M2. Y = Xθ0 + ε with θ0 = (1, 0.8, −0.5)T and ε ∼ N (0, 0.25Xθ20).

The uniform distribution and the empirical distribution of Y were used as the weights in (5) with the resulting estimators being denoted by bθ and bθemp, respectively. We compared the finite sample properties of our estimators bθ and bθemp with the estimator ˘θ of Hall and Yao


(2005), the pseudo maximum likelihood estimator (PMLE) eθ, and the pseudo least squares estimator (PLSE) ¯θ. The inference procedures based on the asymptotic normality of bθ and the frequency distributions of its bootstrap analogues were also studied by simulations. With the limitation of number of pages, the exchangeable random weights were only investigated through independent and identically distributed Gamma(4, 2) random variables, which have better numerical results than others.

Table 4.1 displays the means and the standard deviations of 1000 estimates with the sample sizes (n) of 100, 250, and 500, and the correlation coefficients (ρ) of 0.2 and 0.5. The biases of compared estimators are generally inapparent except for eθ and ¯θ under (n, ρ) = (100, 0.5) and θ under (n, ρ) = (100, 0.5) and model M1. As one can expect, the standard deviations of these˘ estimators decrease as n increases or as ρ becomes small. We further detect in this table that eθ tends to have a substantially large variance when n is small. The high variability in eθ is mainly caused by the use of the fourth-order kernel function. In practical applications, the second- order kernel function is often used to overcome this shortcoming. In addition, the simulation results indicate that bθemp has the smallest variance under the validity of heterogeneous error model and bθ is comparable with ¯θ in the case of a constant error process. Even if ¯θ performs satisfactorily in the homogeneous error model, it has a relatively large variance among all estimators in the heterogeneous one. The CPU time for the computation of bθemp is much shorter than that for ˘θ although both estimators have a similar performance.

The sandwich-type estimate (SANE), the naive bootstrap estimate (NBE), and the random weighted bootstrap estimate (RWE) of the standard deviation of bθ are provided in table 4.2.

Overall, the SANE underestimates the asymptotic variance in a more pronounced fashion even


for a sufficiently large n. We further found that the bootstrap estimates slightly overestimate the asymptotic variance but their accuracies significantly improve as the sample size increases.

Apparently, the RWE is closer to the asymptotic variance than the NBE. Tables 4.2 also presents the empirical coverage probabilities, the average lengths of 0.95 quantile intervals of 1000 estimates, and the average lengths of 1000 bootstrap normal approximated and quantile confidence intervals (LBNCI, LBQCI). It is revealed that all the bootstrap confidence intervals are wider than the true qunatile intervals and approach the expected ones with adequate sample size. The empirical coverage probabilities of normal approximated confidence intervals (BNCI) tend to be higher than the nominal level whereas the bootstrap quantile intervals (BQI) are slightly smaller than the nominal one. In general, these constructed confidence intervals have fairly accurate coverage probabilities and provide greater precision as the sample size increases.

4.2 Assessment of Model Checking and Adaptive Lasso

The performances of testing procedures were studied through models M2 and M3. Y = X1 + 0.8X22− 0.5(1 + |X3|)−1+ ε with ε ∼ N (0, 0.25Xθ20).

Table 4.3 summarizes the estimated sizes and powers for the hypothesis of model correctness and the rejection proportions of the single-indexing cross-validation method. Under the va- lidity of model M2, the simulation results indicate that the estimated sizes are all smaller than the nominal size at 0.05. The power performance under model M3 is fairly good and the high power is generally associated with the large sample size. From the simulation exper- iments, the measure SCVn tends to have high rejection rates for model M3 and, except for


(n, ρ) = (250, 0.2), to some extent higher ones for model M2.

Table 4.3

The estimated sizes and powers, and the rejection proportions (RP) based on 1000 (SCVn) values

M2 M3

(n, ρ) α RP β RP

(100, 0.2) 0.001 0.063 0.903 0.952 (250, 0.2) 0.000 0.050 0.984 0.994 (100, 0.5) 0.001 0.060 0.936 0.991 (250, 0.5) 0.000 0.066 1.000 1.000

We further assessed the multi-stage adaptive Lasso algorithm through models M1-M2 with X = (X1, X2, X3, X4, X5, X6, X7)T and θ0 = (1, 0.8, −0.5, 0, 0, 0, 0)T. Since the average numbers of selecting incorrect zero coefficients are zero in all approaches, we only displayed those of correct zero coefficients. The mean squared error E[(θest − θ0)TΣXest − θ0)] of any estimator θest was utilized to evaluate the predictive accuracy, where ΣX is the variance- covariance matrix of X. Table 4.4 gives the average numbers of correct zeros, the mean squared errors of estimators, and the proportions of variable selection. For the sample size of 100, the second-stage adaptive Lasso is found to have more accurate magnitudes of zero coefficients than the first-stage one while no significant improvement is detected in the third-stage one.

To sum up, the performance of the PLISE is the worst in selecting important covariates.

Again, the covariates with zero coefficients are rarely selected in our multi-stage adaptive Lasso estimation procedure. Interestingly, the influences of correlation coefficient and error structure are inapparent in the second-stage and third-stage ones. The performances of all the adaptive Lasso estimation procedures become undifferentiated to each other as the sample size increases. Moreover, an estimator obtained from the multi-stage adaptive Lasso yields a smaller mean squared error than those of the others.


5 Empirical Examples

The PLISE and the corresponding inference procedures are applied to the Boston house-price data. The multi-stage adaptive Lasso procedure is further adopted to identify significant meteorological variables on ozone concentration in the New York metropolitan area. Moreover, all the considered variables are all standardized to have mean of zero and variance of one.

5.1 Application to a Study of House Price

The first analyzed data were collected by the U.S. Census Service in the area of Boston.

A total of 506 observations on 14 attributes are contained in this data set. Measurements of interest include median value of owner-occupied homes in $1,000’s (medv), logarithm of percentage of lower status of the population (lstat), average number of rooms per dwelling (rm), logarithm of full-value property-tax rate per $10,000 (tax), pupil teacher ratio by town (ptrat), and weighted distances to five Boston employment centres (dis).

The PLISE (−0.715, 0.470, 0.350, 0.201) is obtained with the bootstrap standard errors (0.2116, 0.1194, 0.0645, 0.0420). The cross-validation bandwidth hcv = 1.0765 is chosen for the coefficient estimates of the SI: lstat + θ2rm + θ3tax + θ4ptrat + θ5dis. The PMLE (−0.843, 0.615, 0.433, 0.144) and the PLSE (−0.302, 0.217, 0.270, 0.302) are also found to have very similar explanations of the meaning of predictors. The 0.95 bootstrap normal approx- imated and quantile confidence intervals are (−1.1302, −0.3007) and (−1.3029, −0.3170) for rm, (0.2360, 0.7040) and (0.2363, 0.6862) for tax, (0.2240, 0.4766) and (0.2191, 0.4723) for ptrat, and (0.1189, 0.2834) and (0.1290, 0.2906) for dis. These variables are detected to be


significantly associated with medv. It concurs with the simulation finding that the constructed confidence intervals were fairly close when the sample size is large. In addition, the values of SCVn and the test statistic Fnare computed to be 1.987 and 0.878, respectively. As evidenced from SCVn< TSSn(= 2.108) and the bootstrap p-value of 0.006, the fitted SI model might be too simple. A more thorough investigation is needed to spot the relationship of the covariates on the conditional distribution of medv.

5.2 Application to a Study of Air Quality

The second data contain the measurements of daily ozone concentration (ozone), wind speed (wind), daily maximum temperature (temp), and solar radiation level (solar) on 111 successive days at meteorological stations from May to September 1973 in New York metropolitan area.

The variables wind, temp, solar, wind2, temp2, solar2, wind ∗ temp, wind ∗ solar, and temp ∗ solar were included in the initial model fitting with the first one being the baseline covariate.

Our primary interest is to detect significant meteorological factors on ozone.

The PLISE (−1.417, −0.211, −0.482, 0.087, 0.560, −0.984, 0.077, −0.471) is obtained with the cross-validation bandwidth hcv = 0.7517 being chosen. The PLISE and the PMLE (−1.351, −0.162, −0.582, −0.033, 0.441, −0.871, −0.032, −0.381) are more close, whereas the PLSE (−1.840, 0.073, −0.283, 0.324, 0.393, −0.947, 0.290, −0.632) is slightly different. We fur- ther implement the multiple-stage adaptive Lasso estimation and identify that the predic- tors solar, temp2, and wind ∗ solar have zero coefficients. Note that the coefficients with zero estimates from the first-stage are the same as those from the second-stage. The es- timates of coefficients, the bootstrap standard errors, and the bootstrap confidence inter-


vals are presented in table 5.1. We apply the conditional distribution model with the SI:

wind + θ2temp + θ3wind2+ θ4solar2+ θ5wind ∗ temp + θ6temp ∗ solar and test the adequacy of the fitted model. The value of the test statistic Fn is approximately 0.993 (bootstrap p- value=0.382) and SCVn= 8.387 is greater than TSSn= 8.185. Thus, no significant evidence is identified to reject the considered model.

Table 5.1

The PPLISE, the bootstrap standard errors, and the 0.95 bootstrap confidence intervals


temp -1.401 0.1778 (-1.7496,-1.0525) (-1.7453,-1.0689) wind2 -0.497 0.0707 (-0.6360,-0.3588) (-0.6218,-0.3859) solar2 0.528 0.1227 ( 0.2871, 0.7680) ( 0.3252, 0.8155) wind ∗ temp -1.020 0.0910 (-1.1988,-0.8420) (-1.1234,-0.8614) temp ∗ solar -0.356 0.0803 (-0.5136,-0.1987) (-0.5060,-0.2067)

6 Concluding Remarks and Further Extensions

This article presents an appealing estimation procedure for index coefficients and outperforms the existing ones. Compared with the PMLE, an important advantage of the PLISE is that it only requires a lower-order kernel in a one-dimensional bandwidth space. The modified cross- validation scores and residual process are also provided for bandwidth selection and model checking. Because of the poor approximation of the sandwich-type estimator, we employ random weighted bootstrap analogues of the asymptotic variance of the PLISE. The L1- penalty with random weights is further adopted into the PLISE criterion to improve estimation and variable selection simultaneously in sparse high-dimensional models. Under the partial orthogonality condition of Huang, Ma, and Zhang (2008), our PPLISE still enjoys the oracle


property when the number of covariates increases exponentially with the sample size.

In some applications, the predictive abilities of covariates might depend on the values of a response variable. It is more realistic to consider the following varying-index model:

FY(y|x) = G(y, xθ0(y)), (14)

where θ0(y) is a vector of index coefficient functions of y. This modelling approach is especially useful to handle an ordinal response variable and for quantile forecasting. The PSIS in (5) and the PPSIS in (13) can be modified as

SS(θ(y)) = 1 n




(Ni(y) − bG(y, Xiθ(y)))2 and PSS(θ(y)) = SS(θ(y)) + λy





|bθk(y)|. (15) In survival analysis, the response measurement represents the time to a particular event. It is worthy to note that the considered model includes more acceptable proportional hazards and accelerated failure time models. A major challenge in dealing with this issue is that the failure times of some individuals might not be available due to censoring. Our results should be valuable in the development of related inferences.


Proof of Theorem 1.

By assumptions (A1) and (A3), we can derive from (7) that sup


| bG(y, xθ) − G(y, xθ0)| ≤ 1

infXθf (xθ) sup


|N1(y, xθ) − f (xθ)M10(y, xθ)|


θ|N1(y, xθ)|

infXθN0(y, xθ) sup


|N0(y, xθ) − f (xθ)| = op(p

ln n/nh) + O(h2) = op(1). (16)


It follows immediately from (16) that



|SS(θ) − Z


E[(N (y) − M10(y, Xθ))2]dW (y)| = op(1). (17)

Moreover, θ0 can be shown to be a unique minimizer of R

YE[(N (y) − M10(y, Xθ))2]dW (y) through the inequalityR

Y E[(N (y) − M10(y, Xθ))2]dW (y) ≥R

YE[ε2(y; θ0)]dW (y). With (17), the consistency of bθ is ensured via applying Theorem 5.1 of Ichimura (1993).

Along the same lines as the proof in (16), one has



|∂θG(y, xb θ0) − H1(y, xθ0)| = op(p

ln n/nh3) + O(h2). (18)

The score function√

nS(θ0) =√

n∂θSS(θ0) can be further decomposed as

√nS(θ0) =



l1=0 1



√nSl1l20), (19)

where Sl1l20) = −(2/n)Pn i=1


Yε1−li 1(y; θ0)H11−l2(y, X0)(G(y, X0) − bG(y, X0))l1(∂θG(y,b X0) − H1(y, X0))l2dWni(y). It is implied from (16) and (18) that

√nS110) = op(1). (20)

Let A1mi= −M012−m(y, X0)∂x2−mθ G(y, X0)

f (X0) , A2mi = εi(y; θ0)G2−m(y, X0)

f (X0) , m = 1, 2, A2mi= 1

f2(X0)(εi(y; θ0)∂xθ(f (X0)M11(y, X0))G4−m(y, X0) + (m − 4)f (X0)M11(y, X0)

xθG(y, X0)), m = 3, 4, φkmij = Nj2−m(y)Kh(X0− X0) − G2−m(y, X0)f (X0), k, m = 1, 2, and φ2mij = Nj4−m(y)∂θKh(X0− X0) − G2−m(y, X0)∂xθ(f (X0)M01(y, X0)), m = 3, 4.

The terms √

nSl1l20), l1 6= l2, in (19) can be rewritten as

√nSl1l20) = −2









AkmiφkmijdWni(y) + op(1), k = l1+ 2l2. (21)


A little tedious but straightforward algebra yields E[Akmi|x0] = 0 and E[φkmij|xi, yi] = O(h2), k = 1, 2, which imply that

P (sup


√n|Sl1l20)| > ε) ≤ 1 ε2n(n − 1)2








(E[A2kmiφ2kmij] +X



= O(1/n) + O(h4), k = l1 + 2l2, for any ε > 0. (22)

Combining (20)-(22), one has √

nS(θ0) = √

nS000) + op(1). By applying the central limit theorem, it is easily to show that

√nS(θ0)→ N (0, Vd 0). (23)

Similar to the proof of (16) and (18), there exist functions H2(y, xθ) and H2(θ) = E[R

Y(H1⊗2(y Xθ) − ε(y; θ)H2(y, Xθ))dWni(y)] satisfying



|∂θ2G(y, xb θ) − H2(y, xθ)| = op(p

ln n/nh5) + O(h2) and (24)



|I(θ) − H2(θ)| = op(p

ln n/nh5) + O(h2). (25) From (18), (24), and the law of large numbers, a direct calculation yields

I(θ0) = V0 − 2 n






εiy0)H2(y, X0)dWni(y) + op(1) = V0 + op(1). (26) By assumption (A1), there exists an open set Θ0 ⊂ Θ such that supθ∈Θ0kH2(θ) − H20)k

< ε/3 for any ε > 0. We can further derive from (25) that P ( sup


kI(θ) − I(θ0)k > ε) ≤ P ( sup


kI(θ) − H2(θ)k > ε

3) + P (kH20) − I(θ0)k > ε 3).

and, hence, P ( sup


kI(θ) − I(θ0)k > ε) → 0 as n → ∞. (27)


Applying the Taylor expansion to S(bθ) around θ0 to the second order, one has

S(θ0) + I(θ)(bθ − θ0) = 0, (28)

where θ lies on the line segment between bθ and θ0. Finally, the limiting distribution of

√n(bθ − θ0) is obtained by the Slutsky’s theorem, the consistency of bθ, (23), and (27)-(28).

Proof of Theorem 2.

Let Srw(θ) and Irw(θ) denote the first and second derivatives of SSrw(θ). Paralleling the proof steps in (7), we can show that



|∂θl2Nlrw1 (y, xθ) − ∂θl2Nl1(y, xθ)| = op(1/nh2l2+1). (29)

By the Taylor expansion and (29), one has



|SSrw(θ) − SS(θ)| ≤ 1 n





µ − 1) sup


| Z


e2i(y; θ)dWni(y)| + op(1) = op(1). (30)

The same argument for the convergence of bθ to θ implies that (bθrw− bθ) = op(1). From (29) and S(bθ) = 0, the score function Srw(bθ) can be expressed as

Srw(bθ) = Srw00(bθ) + Srw01(bθ) + Srw10(bθ) + op(p

1/n) (31)

with ISrwl1l2(bθ) = −(2/n)Pn


Ye1−li 1(y; θ)(∂θG(y, Xb ibθ))1−l2(∂θGbrw(y, Xibθ)−∂θG(y, Xb ibθ))l1 ( bG(y, Xibθ) − bGrw(y, Xibθ))l2dWni(y). We further confirm from E[Srw00(bθ)] = 0, E[(Srw00(bθ))2]→p ρ−2V0, and the central limit theorem for independent random vectors (Serfling (1980)) that


nSrw00(bθ) ≤ w)→ Πp dl=2Φ(wl). (32)




Related subjects :