2007, Vol. 35, No. 2, 522–542 DOI:10.1214/009053606000001398

© Institute of Mathematical Statistics, 2007

**REDUCING VARIANCE IN UNIVARIATE SMOOTHING**
BY MING-YEN CHENG, LIANG PENG^{1} AND JYH-SHYANG WU

*National Taiwan University, Georgia Institute of Technology*
*and Tamkang University*

A variance reduction technique in nonparametric smoothing is proposed:

at each point of estimation, form a linear combination of a preliminary estimator evaluated at nearby points with the coefficients specified so that the asymptotic bias remains unchanged. The nearby points are chosen to maximize the variance reduction. We study in detail the case of univariate local linear regression. While the new estimator retains many advantages of the local linear estimator, it has appealing asymptotic relative efficiencies. Bandwidth selection rules are available by a simple constant factor adjustment of those for local linear estimation. A simulation study indicates that the finite sample relative efficiency often matches the asymptotic relative efficiency for moderate sample sizes. This technique is very general and has a wide range of applications.

**1. Introduction.** Local linear modeling for nonparametric regression has many advantages and has become very popular. Hastie and Loader [16], Wand and Jones [27], Fan and Gijbels [11] and others have investigated its theoretical and practical properties extensively. To reduce the variance, for each point $x$, we take a special linear combination of the local linear estimators at three equally spaced points around $x$. The linear combination satisfies certain moment conditions so that the asymptotic bias is unchanged. Below are a few specific features of this new estimator. First, both local and global automatic bandwidths can be easily obtained from those for the standard local linear estimator. Second, the asymptotic mean squared error is improved considerably and the amount of reduction is uniform across different locations, regression functions, designs and error distributions. Third, as evidenced by a simulation study, the reduction in asymptotic variance is effectively projected to finite samples. Fourth, the estimators admit simple forms and only slightly increase the computation time. Finally, many advantages of the local linear estimator, for example, design adaptivity and automatic boundary correction, are retained.

Received May 2004; revised May 2006.

1Supported by NSF Grant DMS-04-03443.

*AMS 2000 subject classifications.* Primary 62G08, 62G05; secondary 60G20.

*Key words and phrases.* Bandwidth, coverage probability, kernel, local linear regression, nonparametric smoothing, variance reduction.


In kernel density estimation, Kogure [18] studied a related topic on polynomial interpolation and obtained some preliminary results. The purpose there is to find the optimal allocation of the interpolation points by minimizing the asymptotic mean integrated squared error. Higher-order polynomial interpolation does not change the asymptotic bias but alters the asymptotic variance in a nonhomogeneous way unless the spacings are all large enough, which was assumed in seeking the optimal allocation.

Fan [9] showed that the local linear estimator is minimax optimal among all linear estimators. The proposed estimators are linear and have smaller asymptotic mean squared errors than the local linear estimator. There is no conflict between these two results since the assumptions are different. In [9], the maximum risk of a linear estimator is taken over a class of regression functions that can be approximated well linearly. The asymptotic results for our estimators require the slightly more stringent condition that the regression function has a bounded, continuous second derivative at the point of estimation.

There is an extensive literature on modifications and improvements of kernel and local linear estimators. Many of them are aimed at bias reduction, for example, Abramson [1], Samiuddin and El-Sayyad [22], Jones, Linton and Nielsen [17] and Choi and Hall [7]. Choi and Hall [7] also exploit the idea of taking linear combinations of local linear estimators. The key difference is that we take linear combinations to maintain bias but reduce variance, while Choi and Hall [7] take linear combinations to reduce bias while essentially maintaining variance.

More specifically, (i) they use local linear estimates at points symmetrically located around $x$ and we do not, (ii) they take a convex combination and we employ different constraints on the coefficients, (iii) bandwidth selection is straightforward in our case but much more complicated for theirs, and (iv) our method applies to local constant fitting estimators but theirs does not. There are very few variance reduction techniques available. An error-dependent technique in local linear smoothing was suggested by Cheng and Hall [3]; the amount of variance reduction depends on the error distribution. Cheng and Hall [4] introduced an adaptive line integral of the bivariate kernel density estimate to reduce variance; it does not apply to the univariate case or to regression estimation. These two methods require, explicitly or implicitly, a pilot estimate of the function or its derivatives. Our procedure is based on a completely different idea and effectively reduces variance without pilot estimation.

Section 2 presents the methodology and Section 3 provides results on the asymptotic mean squared error and coverage accuracy. Section 4 discusses some practical issues, including bandwidth selection and implementation. Section 5 contains a numerical study. Possible generalizations and applications of the variance reduction technique are discussed in Section 6. Proofs are given in Section 7.

**2. Methodology.**

*2.1. Local linear regression.* Suppose that independent bivariate observations $(X_1,Y_1),\ldots,(X_n,Y_n)$ are sampled from the regression model

$$Y = m(X) + \sigma(X)\varepsilon,$$

where $\sigma^2(X)$ is the conditional variance and the random error $\varepsilon$, independent of $X$, has zero mean and unit variance. Kernel estimators of $m(x)=E(Y\mid X=x)$ include the Nadaraya–Watson, Gasser–Müller and local polynomial estimators; see [8, 25, 27]. Let $K$ be a kernel function and $h>0$ a bandwidth. The local linear regressor is defined as

$$\hat m(x)=\frac{S_{n,2}(x)T_{n,0}(x)-S_{n,1}(x)T_{n,1}(x)}{S_{n,0}(x)S_{n,2}(x)-S_{n,1}(x)^2},\tag{2.1}$$

where $S_{n,l}(x)=h\sum_{i=1}^{n}(x-X_i)^lK_h(x-X_i)$, $l=0,1,2$, $T_{n,l}(x)=h\sum_{i=1}^{n}(x-X_i)^lK_h(x-X_i)Y_i$, $l=0,1$, and $K_h(t)=K(t/h)/h$. This estimator is obtained by solving a local linear least squares problem. Its appealing theoretical and numerical properties were discussed by Fan [9] and Fan and Gijbels [11], among others. Further, it has become very popular, is widely used in applications and is implemented in statistical software. When the denominator is close to zero, it exhibits rather unstable numerical behavior; this is particularly problematic for small samples or sparse designs. Remedies include modifications proposed by Seifert and Gasser [23], Cheng, Hall and Titterington [5], Hall and Turlach [15] and Seifert and Gasser [24]. Although we focus on the local linear estimator, denoted by $\hat m(x)$, the variance reduction methods given below can be applied to any such modification.
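For concreteness, the ratio (2.1) can be computed directly from the sums $S_{n,l}$ and $T_{n,l}$ (the common factor $h$ cancels between numerator and denominator). The following sketch is ours, not the authors' code; it uses the Epanechnikov kernel and assumes the denominator is bounded away from zero.

```python
import numpy as np

def epanechnikov(u):
    # K(u) = 0.75 (1 - u^2) on [-1, 1]
    return 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def local_linear(x0, X, Y, h, kernel=epanechnikov):
    """Local linear estimator (2.1) at a single point x0."""
    d = x0 - X                      # x - X_i
    w = kernel(d / h) / h           # K_h(x - X_i)
    S = [np.sum(d ** l * w) for l in range(3)]      # S_{n,l} up to the factor h
    T = [np.sum(d ** l * w * Y) for l in range(2)]  # T_{n,l} up to the factor h
    return (S[2] * T[0] - S[1] * T[1]) / (S[0] * S[2] - S[1] ** 2)
```

Since the fit solves a local least squares problem, it reproduces linear regression functions exactly wherever the local design is nonsingular, which is a convenient sanity check.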

*2.2. Variance reduced estimators.* A motivation of our variance reduction strategy is to incorporate more data points in the regression estimation in such a way that the first-order term in the asymptotic bias remains unchanged. To do this, at each $x$ we construct a linear combination of the local linear estimators at some equally spaced points near $x$, with the linear coefficients satisfying certain moment conditions derived from the asymptotic bias expansions. Variance reduction is then achieved since the three preliminary estimators are correlated with correlation coefficients less than 1. In this context, we fix the number of nearby points at three, since that is the minimal number required by the moment conditions and using more than three yields complex solutions. In addition, the grid points are taken so that the final estimator is the simplest and most efficient in the mean squared error sense.

Formally, for any $x$, define three equally spaced points $\alpha_{x,j}=x-(r+1-j)\omega_n$, $j=0,1,2$, where $r\in(-1,1)$ and $\omega_n=\delta h$ for some constant $\delta>0$. The shift parameter $r$ determines the location of the leftmost point $\alpha_{x,0}$ relative to $x$, and $\omega_n$ is the spacing of the grid. Construct the quadratic interpolation of the local linear estimators $\hat m(\alpha_{x,0})$, $\hat m(\alpha_{x,1})$ and $\hat m(\alpha_{x,2})$, and then estimate $m(x)$ by the value of the interpolated curve at $x$:

$$\hat m_q(x)=\sum_{j=0,1,2}A_j(r)\,\hat m\{x-(r+1-j)\omega_n\},$$

where the coefficients depend only on $r$ and are given by

$$A_0(r)=r(r-1)/2,\qquad A_1(r)=1-r^2,\qquad A_2(r)=r(r+1)/2.$$

Then the moment conditions (7.4) are satisfied, so that $\hat m_q(x)$ has the same asymptotic bias as $\hat m(x)$. Theorem 1 and Proposition 1 show that $\hat m_q(x)$ has a smaller asymptotic variance than $\hat m(x)$, and the asymptotic variances differ from each other by a constant factor depending on the bin-width parameter $\delta$, the shift parameter $r$ and the kernel $K$. Given $K$ and $\delta$, the optimal values of $r$ that minimize this constant factor are $r=\pm\sqrt{1/2}$, which give

$$\hat m_\pm(x)=\sum_{j=0,1,2}A_j\bigl(\pm1/\sqrt2\bigr)\,\hat m\bigl\{x-\bigl(\pm1/\sqrt2+1-j\bigr)\omega_n\bigr\}.$$

Since $\hat m_+(x)$ uses more data information on the left-hand side of $x$ than on the right-hand side, the curve estimate $\hat m_+(\cdot)$ tends to shift to the right of $m(\cdot)$. Similarly, $\hat m_-(\cdot)$ tends to shift to the left of $m(\cdot)$. This symmetry suggests taking the average

$$\hat m_a(x)=\{\hat m_+(x)+\hat m_-(x)\}/2$$

as our final estimator. Sections 3 and 5 demonstrate that, compared to $\hat m_\pm(x)$, $\hat m_a(x)$ further improves the asymptotic and finite sample efficiencies.

When $\mathrm{Supp}(m)$ is bounded, say $\mathrm{Supp}(m)=[0,1]$, since each of $\hat m_\pm(x)$ and $\hat m_a(x)$ requires values of $\hat m$ at points around $x$, a local value of $\delta$,

$$\delta(x)=\min\Bigl[\delta,\ x\big/\bigl\{\bigl(\sqrt{1/2}+1\bigr)h\bigr\},\ (1-x)\big/\bigl\{\bigl(\sqrt{1/2}+1\bigr)h\bigr\}\Bigr],\tag{2.2}$$

which depends on the distances from $x$ to the boundary points 0 and 1, is used so that $\hat m_a(x)$ is defined for every $x\in[0,1]$.

Our estimators $\hat m_\pm(x)$ and $\hat m_a(x)$ are simple linear combinations of local linear estimators evaluated at nearby points. No pilot estimation is involved in this variance reduction. Therefore, in finite samples, the asymptotic efficiencies are achieved at relatively small $n$. For the same reasons, they share many advantages enjoyed by the local linear estimator, for example, automatic boundary correction and design adaptivity; see [11]. It is shown in Section 3 that each of $\hat m_\pm(x)$ and $\hat m_a(x)$ reduces the asymptotic variance uniformly across different regression functions, different designs and different error distributions. Hence procedures such as design planning employed in applications of $\hat m$ apply to our estimators in general. Thus the proposed estimators are rather user friendly.
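The construction above can be assembled from any pointwise preliminary smoother. The sketch below is our illustration, with a plain Epanechnikov local linear fit standing in for $\hat m$; it forms $\hat m_\pm$ with the coefficients $A_j(\pm1/\sqrt2)$, applies the boundary adjustment (2.2) on $[0,1]$, and averages the two.

```python
import numpy as np

def _ll(x0, X, Y, h):
    """Preliminary estimator: plain local linear fit, Epanechnikov kernel."""
    d = x0 - X
    w = 0.75 * np.maximum(1.0 - (d / h) ** 2, 0.0) / h
    S = [np.sum(d ** l * w) for l in range(3)]
    T = [np.sum(d ** l * w * Y) for l in range(2)]
    return (S[2] * T[0] - S[1] * T[1]) / (S[0] * S[2] - S[1] ** 2)

def coeffs(r):
    """A_0(r), A_1(r), A_2(r)."""
    return (r * (r - 1) / 2, 1 - r ** 2, r * (r + 1) / 2)

def m_a(x0, X, Y, h, delta=1.0):
    """Averaged variance-reduced estimator m_a(x) = {m_+(x) + m_-(x)} / 2."""
    c = 1 / np.sqrt(2)
    # boundary-adjusted delta(x) of (2.2), for Supp(m) = [0, 1]
    dx = min(delta, x0 / ((c + 1) * h), (1 - x0) / ((c + 1) * h))
    w_n = dx * h
    halves = []
    for r in (c, -c):                       # m_+ then m_-
        halves.append(sum(a * _ll(x0 - (r + 1 - j) * w_n, X, Y, h)
                          for j, a in enumerate(coeffs(r))))
    return 0.5 * (halves[0] + halves[1])
```

Because the quadratic interpolation reproduces straight lines, $\hat m_a$ inherits the local linear estimator's exactness for linear regression functions.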

*2.3. Confidence intervals.* Consider constructing confidence intervals for $m(x)$ based on $\hat m(x)$ and $\hat m_q(x)$. Let $\nu_{ij}=\int s^iK(s)^j\,ds$ and $\tilde\nu_{0l}=\int\{\sum_{i=0,1,2}A_i(r)K(s+i\delta)\}^l\,ds$. Define

$$w_{ijk}(x)=n^{-1}h^{j-i-1}\sum_{l=1}^{n}(x-X_l)^i\{K_h(x-X_l)\}^j\{Y_l-\hat m(x)\}^k$$

and $\hat\sigma^2(x)=n^{-1}\sum_{l=1}^{n}K_h(x-X_l)\{Y_l-\hat m(x)\}^2/w_{010}(x)$. Then both

$$T_n(x)=\sqrt{\frac{nh}{\nu_{02}}}\,\frac{\hat m(x)-m(x)}{\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}},\qquad T_n^{*}(x)=\sqrt{\frac{nh}{\tilde\nu_{02}}}\,\frac{\hat m_q(x)-m(x)}{\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}}$$

are asymptotically $N(0,1)$ distributed when $nh^5\to0$, and we have the following one-sided confidence intervals for $m(x)$:

$$I_\beta=\bigl[\hat m(x)-z_\beta\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}\nu_{02}^{1/2}(nh)^{-1/2},\ \infty\bigr),\tag{2.3}$$

$$\tilde I_\beta=\bigl[\hat m_q(x)-z_\beta\{\hat\sigma^2(x)/w_{010}(x)\}^{1/2}\tilde\nu_{02}^{1/2}(nh)^{-1/2},\ \infty\bigr),\tag{2.4}$$

where $z_\beta$ satisfies $P\{N(0,1)\le z_\beta\}=\beta$.
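Given a preliminary value $\hat m(x)$, the plug-in quantities $w_{010}(x)$ and $\hat\sigma^2(x)$ and the lower endpoint of (2.3) are straightforward to compute. This is our sketch, with the Epanechnikov kernel (for which $\nu_{02}=\int K^2=0.6$) and the 95% normal quantile hard-coded; `mhat_x` is whatever estimate of $m(x)$ one plugs in.

```python
import numpy as np

Z_095 = 1.6448536269514722      # z_beta for beta = 0.95

def ci_lower(x0, X, Y, h, mhat_x, nu02=0.6, z=Z_095):
    """Lower endpoint of the one-sided interval (2.3):
    mhat(x) - z * {sigma2(x)/w_010(x)}^{1/2} * nu02^{1/2} * (nh)^{-1/2}."""
    n = len(X)
    Kh = 0.75 * np.maximum(1.0 - ((x0 - X) / h) ** 2, 0.0) / h
    w010 = np.mean(Kh)                               # n^{-1} sum K_h(x - X_l)
    sigma2 = np.mean(Kh * (Y - mhat_x) ** 2) / w010  # hat sigma^2(x)
    return mhat_x - z * np.sqrt(sigma2 / w010) * np.sqrt(nu02 / (n * h))
```

The interval (2.4) for $\hat m_q(x)$ is obtained by the same call with $\tilde\nu_{02}$ in place of $\nu_{02}$.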

**3. Theoretical results.** The following conditions are needed for asymptotic analysis of the estimators.

ASSUMPTION $(C_K)$. $K$ is a symmetric density function with compact support.

ASSUMPTION $(C_{m,f})$.

1. $m''(\cdot)$, the second derivative of $m(\cdot)$, is bounded and continuous at $x$;

2. the density function $f(\cdot)$ of $X$ satisfies $f(x)>0$ and $|f(x)-f(y)|\le c|x-y|^{\alpha}$ for some $0<\alpha<1$;

3. $\sigma^2(\cdot)$ is bounded and continuous at $x$.

*3.1. Asymptotic mean squared error.* Fan [9] showed that, under the conditions $(C_{m,f})$, $(C_K)$, $h\to0$ and $nh\to\infty$ as $n\to\infty$,

$$\hat m(x)=\frac{S_{n,2}(x)T_{n,0}(x)-S_{n,1}(x)T_{n,1}(x)}{S_{n,0}(x)S_{n,2}(x)-S_{n,1}(x)^2+n^{-2}},\tag{3.1}$$

a modification of (2.1) whose unconditional asymptotic variance is well defined, has

$$E\{\hat m(x)\}-m(x)=\tfrac12 m''(x)\nu_{20}h^2+o\{h^2+(nh)^{-1/2}\},\tag{3.2}$$

$$\mathrm{Var}\{\hat m(x)\}=\frac{\sigma^2(x)}{nhf(x)}\,\nu_{02}+o\{h^4+(nh)^{-1}\}.\tag{3.3}$$

THEOREM 1. *Suppose that $\delta>0$ is a constant. Assume $(C_{m,f})$, $(C_K)$, $h\to0$ and $nh\to\infty$ as $n\to\infty$. Then*

$$E\{\hat m_q(x)\}-m(x)=\tfrac12 m''(x)\nu_{20}h^2+o\{h^2+(nh)^{-1/2}\},\tag{3.4}$$

$$\mathrm{Var}\{\hat m_q(x)\}=\frac{\sigma^2(x)}{nhf(x)}\,\tilde\nu_{02}+o\{h^4+(nh)^{-1}\}.\tag{3.5}$$

Note that $\tilde\nu_{02}=\nu_{02}-r^2(1-r^2)C(\delta)$, where

$$C(\delta)=1.5\,C(0,\delta)-2\,C(0.5,\delta)+0.5\,C(1,\delta)\quad\text{with}\quad C(a,\delta)=\int K(t-a\delta)K(t+a\delta)\,dt.$$

PROPOSITION 1. *The quantity $C(\delta)$ has the following properties:*

(a) *For any symmetric kernel function $K$, $C(\delta)\ge0$ for any $\delta\ge0$.*

(b) *If $K$ has a unique maximum and is concave, then $C(\delta)$ is increasing in $\delta>0$.*
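Both properties are easy to check numerically. The following sketch is ours: it evaluates $C(a,\delta)$ by a simple Riemann sum for the Epanechnikov kernel, for which $\nu_{02}=0.6$ and $C(\delta)$ rises from 0 toward the Remark 4 limit $1.5\nu_{02}=0.9$.

```python
import numpy as np

epan = lambda u: 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def C_pair(a, delta, K=epan, n=40001):
    """C(a, delta) = int K(t - a*delta) K(t + a*delta) dt (Riemann sum)."""
    t = np.linspace(-1.0 - abs(a) * delta, 1.0 + abs(a) * delta, n)
    return float(np.sum(K(t - a * delta) * K(t + a * delta)) * (t[1] - t[0]))

def C(delta, K=epan):
    """C(delta) = 1.5 C(0, delta) - 2 C(0.5, delta) + 0.5 C(1, delta)."""
    return (1.5 * C_pair(0, delta, K)
            - 2.0 * C_pair(0.5, delta, K)
            + 0.5 * C_pair(1, delta, K))
```

On a grid of $\delta$ values this reproduces both parts of the proposition: the values are nonnegative and increase with $\delta$.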

REMARK 1. From (3.2) and (3.4), $\hat m_q(x)$ and $\hat m(x)$ have the same asymptotic bias. From (3.3), (3.5) and Proposition 1, the asymptotic variance of $\hat m_q(x)$ is smaller than that of $\hat m(x)$ by the amount $\{nhf(x)\}^{-1}\sigma^2(x)r^2(1-r^2)C(\delta)$. Note that $0<r^2(1-r^2)\le1/4$ for any $r\in(-1,1)\setminus\{0\}$, and it attains its maximum $1/4$ at $r=\pm\sqrt{1/2}$. Therefore, for any $\delta>0$, the optimal choices of $r$ are $r=\pm\sqrt{1/2}$, which yield $\hat m_\pm(x)$.

REMARK 2. A generalization of $\hat m_q(x)$, based on local linear estimators at $z_0=x-r\delta h$, $z_1=x-(r-k)\delta h$ and $z_2=x-(r-k-1)\delta h$, for some $0<k\ne1$, is $\sum_{j=0,1,2}B_j(r)\hat m(z_j)$, where $B_0(r)=r(r-1)/\{k(k+1)\}$, $B_1(r)=-(r+k)(r-1)/k$ and $B_2(r)=r(r+k)/(k+1)$. It has the same asymptotic bias as $\hat m(x)$ and asymptotic variance $\{nhf(x)\}^{-1}\sigma^2(x)\tau(\delta,r,k)$, where

$$\tau(\delta,r,k)=\nu_{02}\sum_{j=0,1,2}B_j(r)^2+2B_0(r)B_1(r)C(k,\delta/2)+2B_0(r)B_2(r)C(k+1,\delta/2)+2B_1(r)B_2(r)C(1,\delta/2).$$

In $\tau(\delta,r,k)$, $K$ interacts with $r$, $k$ and $\delta$, and there is no explicit value of $r$ minimizing $\tau(\delta,r,k)$ for given $\delta$ and $k$.

COROLLARY 1. *Under the conditions in Theorem 1, as $n\to\infty$,*

$$E\{\hat m_\pm(x)\}-m(x)=\tfrac12 m''(x)\nu_{20}h^2+o\{h^2+(nh)^{-1/2}\},\tag{3.6}$$

$$\mathrm{Var}\{\hat m_\pm(x)\}=\frac{\sigma^2(x)}{nhf(x)}\Bigl\{\nu_{02}-\frac{C(\delta)}{4}\Bigr\}+o\{h^4+(nh)^{-1}\}.\tag{3.7}$$

Theorem 2 can be proved using arguments similar to those in the proof of Theorem 1.

THEOREM 2. *Under the conditions in Theorem 1, as $n\to\infty$,*

$$E\{\hat m_a(x)\}-m(x)=\tfrac12 m''(x)\nu_{20}h^2+o\{h^2+(nh)^{-1/2}\},\tag{3.8}$$

$$\mathrm{Var}\{\hat m_a(x)\}=\frac{\sigma^2(x)}{nhf(x)}\Bigl\{\nu_{02}-\frac{C(\delta)}{4}-\frac{D(\delta)}{2}\Bigr\}+o\{h^4+(nh)^{-1}\},\tag{3.9}$$

*where*

$$\begin{aligned}D(\delta)={}&\nu_{02}-\tfrac14C(\delta)-\tfrac1{16}\bigl[4\bigl(1+\sqrt2\bigr)C\bigl(\sqrt2-1,\delta/2\bigr)+\bigl(3+2\sqrt2\bigr)C\bigl(2-\sqrt2,\delta/2\bigr)\\&+2\,C\bigl(\sqrt2,\delta/2\bigr)+4\bigl(1-\sqrt2\bigr)C\bigl(\sqrt2+1,\delta/2\bigr)+\bigl(3-2\sqrt2\bigr)C\bigl(\sqrt2+2,\delta/2\bigr)\bigr].\end{aligned}$$
PROPOSITION 2. *The quantity $D(\delta)$ in (3.9) is nonnegative for any $\delta\ge0$.*

REMARK 3. If $m^{(4)}(x)$ exists, then the second-order term in the bias of $\hat m_a(x)$ is $O(h^4)$, smaller than those of $\hat m(x)$ and $\hat m_\pm(x)$.

REMARK 4. Suppose that $\mathrm{supp}(K)=[-1,1]$. Then, for $\delta\ge2$, $C(\delta)=(3/2)\nu_{02}$ and

$$\mathrm{Var}\{\hat m_q(x)\}\approx\bigl\{1-\tfrac32 r^2(1-r^2)\bigr\}\mathrm{Var}\{\hat m(x)\},\qquad \mathrm{Var}\{\hat m_\pm(x)\}\approx\tfrac58\,\mathrm{Var}\{\hat m(x)\}.\tag{3.10}$$

For $\delta\ge2/(\sqrt2-1)$, $D(\delta)=(5/8)\nu_{02}$ and

$$\mathrm{Var}\{\hat m_a(x)\}\approx\tfrac5{16}\,\mathrm{Var}\{\hat m(x)\}.\tag{3.11}$$

If $K$ is infinitely supported, then (3.10) and (3.11) hold for sufficiently large $\delta$.
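Under our reading of Theorem 2, these limits can be verified numerically for the Epanechnikov kernel ($\nu_{02}=0.6$, support $[-1,1]$). The sketch below, ours rather than the authors', computes $C(\delta)$ and $D(\delta)$ by Riemann sums and recovers the factors $5/8$ and $5/16$.

```python
import numpy as np

SQ2 = np.sqrt(2.0)
epan = lambda u: 0.75 * np.maximum(1.0 - u ** 2, 0.0)

def C_pair(a, delta, n=40001):
    # int K(t - a*delta) K(t + a*delta) dt, Riemann sum on the joint support
    t = np.linspace(-1.0 - abs(a) * delta, 1.0 + abs(a) * delta, n)
    return float(np.sum(epan(t - a * delta) * epan(t + a * delta)) * (t[1] - t[0]))

NU02 = C_pair(0, 0.0)   # int K^2 = 0.6 for the Epanechnikov kernel

def C(delta):
    return 1.5 * C_pair(0, delta) - 2 * C_pair(0.5, delta) + 0.5 * C_pair(1, delta)

def D(delta):
    bracket = (4 * (1 + SQ2) * C_pair(SQ2 - 1, delta / 2)
               + (3 + 2 * SQ2) * C_pair(2 - SQ2, delta / 2)
               + 2 * C_pair(SQ2, delta / 2)
               + 4 * (1 - SQ2) * C_pair(SQ2 + 1, delta / 2)
               + (3 - 2 * SQ2) * C_pair(SQ2 + 2, delta / 2))
    return NU02 - C(delta) / 4 - bracket / 16
```

For $\delta=2$ the variance factor $\{\nu_{02}-C(\delta)/4\}/\nu_{02}$ for $\hat m_\pm$ equals $5/8$, and once $\delta$ passes $2/(\sqrt2-1)\approx4.83$ the factor $\{\nu_{02}-C(\delta)/4-D(\delta)/2\}/\nu_{02}$ for $\hat m_a$ equals $5/16$.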

REMARK 5. If $\delta=o(1)$, then $\hat m_q(x)$, $\hat m_\pm(x)$ and $\hat m_a(x)$ all have the same asymptotic bias and variance as $\hat m(x)$. If $\delta\to\infty$ with $\delta=o(h^{-1/3})$ and $m''(x)$ exists, then the biases of $\hat m_q(x)$ and $\hat m_\pm(x)$ remain the same as in (3.4) and (3.6) and the variances are as in (3.10). If $\delta\to\infty$ with $\delta=o(h^{-1/2})$ and $m^{(4)}(x)$ exists, then the bias and variance of $\hat m_a(x)$ are as in (3.8) and (3.11).

From Corollary 1, the pointwise (global) asymptotic efficiency, in terms of asymptotically optimal (integrated) MSE, of $\hat m_\pm(x)$ relative to $\hat m(x)$ is

$$\gamma_q(\delta)=\{\nu_{02}-C(\delta)/4\}^{-4/5}\nu_{02}^{4/5}.\tag{3.12}$$

In addition, Theorem 2 implies that the pointwise or global asymptotic relative efficiency of $\hat m_a(x)$ compared to $\hat m(x)$ is

$$\gamma_a(\delta)=\{\nu_{02}-C(\delta)/4-D(\delta)/2\}^{-4/5}\nu_{02}^{4/5}.\tag{3.13}$$

Both $\gamma_q(\delta)$ and $\gamma_a(\delta)$ depend only on $K$ and $\delta$ and do not involve the regression function $m$, the design density $f$ or the error distribution. Thus, asymptotically, the variance reduction methods do not interfere with these factors. Figure 1 plots $\gamma_q(\delta)$ and $\gamma_a(\delta)$ against $\delta$ when $K$ is the Uniform, $K(u)=0.5I(|u|<1)$, Epanechnikov, $K(u)=0.75(1-u^2)I(|u|<1)$, or Normal, $K(u)=(2\pi)^{-1/2}\exp(-u^2/2)$, kernel.

FIG. 1. *AMSE relative efficiency. The left and right panels respectively plot $\gamma_q(\delta)$ and $\gamma_a(\delta)$ against $\delta$ for the Uniform (solid), Epanechnikov (dotted) and Normal (dashed) kernels.*
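For the Epanechnikov kernel the pairwise integrals have a closed form, $\int K(t)K(t+s)\,dt=(3/160)(2-s)^3(s^2+6s+4)$ for $0\le s\le2$ (our derivation, not stated in the paper), so $\gamma_q$ and $\gamma_a$ can be tabulated directly; the value $\gamma_a(1)\approx1.22$ quoted in Section 5 comes out of exactly this computation.

```python
def rho(s):
    """int K(t) K(t+s) dt for the Epanechnikov kernel (closed form)."""
    s = abs(s)
    return (3.0 / 160.0) * (2 - s) ** 3 * (s * s + 6 * s + 4) if s < 2 else 0.0

NU02 = rho(0.0)                    # = 0.6
SQ2 = 2.0 ** 0.5

def C(delta):                      # C(a, delta) = rho(2 a delta)
    return 1.5 * rho(0) - 2 * rho(delta) + 0.5 * rho(2 * delta)

def D(delta):
    br = (4 * (1 + SQ2) * rho((SQ2 - 1) * delta)
          + (3 + 2 * SQ2) * rho((2 - SQ2) * delta)
          + 2 * rho(SQ2 * delta)
          + 4 * (1 - SQ2) * rho((SQ2 + 1) * delta)
          + (3 - 2 * SQ2) * rho((SQ2 + 2) * delta))
    return NU02 - C(delta) / 4 - br / 16

def gamma_q(delta):                # (3.12)
    return (NU02 - C(delta) / 4) ** (-0.8) * NU02 ** 0.8

def gamma_a(delta):                # (3.13)
    return (NU02 - C(delta) / 4 - D(delta) / 2) ** (-0.8) * NU02 ** 0.8
```

In agreement with Remark 4, the large-$\delta$ limits are $\gamma_q(\infty)=(8/5)^{4/5}$ and $\gamma_a(\infty)=(16/5)^{4/5}$.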

*3.2. Coverage probability.* Coverage probabilities of the confidence intervals $I_\beta$ and $\tilde I_\beta$ for $m(x)$, given in (2.3) and (2.4) and constructed from $\hat m(x)$ and $\hat m_q(x)$, are analyzed and compared as follows.

THEOREM 3. *Assume $(C_K)$, $(C_{m,f})$, $h=o(n^{-1/5})$ and $nh/\log n\to\infty$ as $n\to\infty$. Then*

$$\begin{aligned}P\{m(x)\in I_\beta\}={}&\beta+(nh^5)^{1/2}\,4^{-1}\nu_{21}\nu_{02}^{-1/2}\sigma^{-1}(x)f^{1/2}(x)m''(x)(z_\beta^2-3)\phi(z_\beta)\\&-(nh)^{-1/2}\,6^{-1}\nu_{02}^{-3/2}\sigma^{-3}(x)f^{-1/2}(x)V_3(x)\{\nu_{03}(z_\beta^2-1)-3\nu_{02}^2z_\beta^2\}\phi(z_\beta)\\&+O\bigl[\{(nh)^{-1/2}+h^2\}^2\bigr],\end{aligned}\tag{3.14}$$

$$\begin{aligned}P\{m(x)\in\tilde I_\beta\}={}&\beta+(nh^5)^{1/2}\,4^{-1}\nu_{21}\tilde\nu_{02}^{-1/2}\sigma^{-1}(x)f^{1/2}(x)m''(x)(z_\beta^2-3)\phi(z_\beta)\\&-(nh)^{-1/2}\,6^{-1}\tilde\nu_{02}^{-3/2}\sigma^{-3}(x)f^{-1/2}(x)V_3(x)\{\tilde\nu_{03}(z_\beta^2-1)-3\tilde\nu_{02}^2z_\beta^2\}\phi(z_\beta)\\&+O\bigl[\{(nh)^{-1/2}+h^2\}^2\bigr],\end{aligned}\tag{3.15}$$

*where $V_3(x)=E[\{Y-m(x)\}^3\mid X=x]$.*

COROLLARY 2. *Assume the conditions of Theorem 3. If $\{m''(x)(z_\beta^2-3)\}^{-1}\{\nu_{03}(z_\beta^2-1)-3\nu_{02}^2z_\beta^2\}<0$ and $\{m''(x)(z_\beta^2-3)\}^{-1}\{\tilde\nu_{03}(z_\beta^2-1)-3\tilde\nu_{02}^2z_\beta^2\}<0$, then*

$$\Lambda_\beta(\delta,r)\equiv\lim_{n\to\infty}\frac{\min_h|P\{m(x)\in I_\beta\}-\beta|}{\min_h|P\{m(x)\in\tilde I_\beta\}-\beta|}=\left(\frac{\nu_{03}(z_\beta^2-1)-3\nu_{02}^2z_\beta^2}{\tilde\nu_{03}(z_\beta^2-1)-3\tilde\nu_{02}^2z_\beta^2}\right)^{5/6}\left(\frac{\tilde\nu_{02}}{\nu_{02}}\right)^{4/3}.\tag{3.16}$$

Figure 2 plots $\Lambda_{0.95}(\delta,r)$ against $r$ and $\delta$ for the Uniform, Normal and Epanechnikov kernels. Similarly to the mean squared error comparison, $\Lambda_{0.95}(\delta,r)$ is always greater than or equal to 1, nondecreasing in $\delta$ both for $r=\pm1/\sqrt2$ and for the best $r$, and settles at an upper limit. Also, $\Lambda_{0.95}(\delta,\pm1/\sqrt2)$ is very close to the optimal $\max_r\Lambda_{0.95}(\delta,r)$. Results for other $\beta$ values are similar. The limit $\lim_{\delta\to\infty}\max_r\Lambda_\beta(\delta,r)$ is roughly 1.15 if $\beta\ge0.9$, 1.18 if $\beta=0.85$, and 1.24 when $\beta=0.75$ or 0.7. If $\beta=0.75$ or 0.7, $\Lambda_\beta(\delta,r)$ can be less than 1, but such small confidence levels are rarely used in practice. For any fixed $\delta$, taking $r=\pm1/\sqrt2$, as employed by $\hat m_\pm(x)$, the coverage accuracy is always not far from optimal. Table 1 gives $\Lambda_\beta(\delta,\pm1/\sqrt2)$ for different $\beta$ and $\delta$ values. There are significant gains in terms of coverage accuracy for $\delta\ge1$ when using the Epanechnikov kernel.

FIG. 2. *Coverage accuracy relative efficiency. Top row: perspective plots of $\Lambda_{0.95}(\delta,r)$. Bottom row: plotted against $\delta$ are $\arg\max_r\Lambda_{0.95}(\delta,r)$ (dotted), $1/\sqrt2$ (solid), $\max_r\Lambda_{0.95}(\delta,r)$ (dotted-dashed) and $\Lambda_{0.95}(\delta,\pm1/\sqrt2)$ (dashed). From left, the columns correspond to the Uniform, Epanechnikov and Normal kernels.*

TABLE 1
*Coverage accuracy ratio*

| Kernel | $\beta$ | $\delta=$ **0.6** | **0.8** | **1.0** | **1.2** | **1.6** | **2.0** |
|---|---|---|---|---|---|---|---|
| Uniform | 0.95 | 1.035 | 1.047 | 1.060 | 1.080 | 1.116 | 1.139 |
| | 0.90 | 1.031 | 1.042 | 1.054 | 1.074 | 1.115 | 1.152 |
| | 0.85 | 1.027 | 1.037 | 1.047 | 1.067 | 1.113 | 1.167 |
| | 0.80 | 1.022 | 1.031 | 1.039 | 1.060 | 1.112 | 1.184 |
| Epanechnikov | 0.95 | 1.024 | 1.045 | 1.067 | 1.088 | 1.123 | 1.136 |
| | 0.90 | 1.022 | 1.045 | 1.072 | 1.099 | 1.139 | 1.151 |
| | 0.85 | 1.021 | 1.045 | 1.078 | 1.110 | 1.156 | 1.167 |
| | 0.80 | 1.019 | 1.045 | 1.084 | 1.124 | 1.177 | 1.185 |
| Normal | 0.95 | 1.001 | 1.003 | 1.006 | 1.011 | 1.027 | 1.047 |
| | 0.90 | 1.001 | 1.004 | 1.008 | 1.015 | 1.035 | 1.059 |
| | 0.85 | 1.002 | 1.004 | 1.010 | 1.019 | 1.042 | 1.072 |
| | 0.80 | 1.002 | 1.006 | 1.013 | 1.023 | 1.051 | 1.086 |

*$\Lambda_\beta(\delta,\pm1/\sqrt2)$ for the Uniform, Epanechnikov and Normal kernels. Within each kernel the rows correspond, from the top, to $\beta=0.95$, 0.9, 0.85 and 0.8.*

**4. Implementation.**

*4.1. Bandwidth selection.* Bandwidth choice is most crucial in kernel smoothing and dominates the performance. Compared to $\hat m(x)$, $\hat m_\pm(x)$ and $\hat m_a(x)$ do not further complicate the bandwidth selection problem. This property does not necessarily hold for other variance or bias reducing modifications.

First, consider local bandwidths, which are needed when the underlying population has sharp features in the regression, design or error distribution. The optimal local bandwidth that minimizes the asymptotic mean squared error (AMSE) of $\hat m(x)$ is

$$h_0(x)=\{\sigma^2(x)\nu_{02}\}^{1/5}\{nf(x)m''(x)^2\nu_{20}^2\}^{-1/5},$$

which gives the optimal AMSE

$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\nu_{02}^{4/5}n^{-4/5}.\tag{4.1}$$

For either of $\hat m_\pm(x)$, the optimal local bandwidth is

$$h_1(x)=\{\nu_{02}-C(\delta)/4\}^{1/5}\nu_{02}^{-1/5}h_0(x),\tag{4.2}$$

yielding the optimal AMSE

$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\{\nu_{02}-C(\delta)/4\}^{4/5}n^{-4/5}.\tag{4.3}$$

The bandwidth that minimizes the AMSE of $\hat m_a(x)$ is

$$h_a(x)=\{\nu_{02}-C(\delta)/4-D(\delta)/2\}^{1/5}\nu_{02}^{-1/5}h_0(x),\tag{4.4}$$

giving the optimal AMSE

$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\{\nu_{02}-C(\delta)/4-D(\delta)/2\}^{4/5}n^{-4/5}.\tag{4.5}$$

An implication of (4.2) and (4.4) is that, for any given $\delta$ and $K$, after adjusting by an appropriate constant multiplier, any local data-driven bandwidth designed for $\hat m(x)$ is readily applicable to each of $\hat m_\pm(x)$ and $\hat m_a(x)$. In addition, regardless of the regression, design or error distribution, the multiplicative factors in (4.2) and (4.4) remain the same for different $x$ values. Automatic local bandwidth selectors for local linear regression include those of [2, 10, 20].

Alternatively, for a generic kernel estimator $\bar m$ of $m$, consider the optimal global bandwidth that minimizes the asymptotic mean integrated squared error (AMISE), that is, the first-order term in the mean integrated squared error $E\int\{\bar m(x)-m(x)\}^2f(x)\,dx$. The optimal global bandwidths $h_0$, $h_1$ and $h_a$ respectively minimizing the AMISEs of $\hat m$, $\hat m_\pm$ and $\hat m_a$ admit the relations

$$h_1=\{\nu_{02}-C(\delta)/4\}^{1/5}\nu_{02}^{-1/5}h_0,\tag{4.6}$$

$$h_a=\{\nu_{02}-C(\delta)/4-D(\delta)/2\}^{1/5}\nu_{02}^{-1/5}h_0.\tag{4.7}$$

Again, any automatic global bandwidth for the local linear estimator (see, e.g., [13, 21]) can be accordingly adjusted for implementation of $\hat m_\pm$ and $\hat m_a$.

Note that the constant factors in (4.2) and (4.6) are the same, and those in (4.4) and (4.7) are equal. Thus, bandwidth selection for our estimators is very simple. This advantage arises from the fact that $r$ is kept fixed for all $x$, thereby enhancing AMSE performance uniformly across different $x$, regression functions, designs and error distributions.
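To illustrate, the constant multipliers in (4.6) and (4.7) can be computed for the Epanechnikov kernel using the closed-form overlap integral $\int K(t)K(t+s)\,dt=(3/160)(2-s)^3(s^2+6s+4)$, $0\le s\le2$ (our derivation); any bandwidth selected for the local linear estimator is then simply rescaled.

```python
def rho(s):
    # Epanechnikov overlap integral int K(t) K(t+s) dt, closed form for |s| <= 2
    s = abs(s)
    return (3.0 / 160.0) * (2 - s) ** 3 * (s * s + 6 * s + 4) if s < 2 else 0.0

def bandwidth_factors(delta):
    """Multipliers turning h0 (for the local linear fit) into h1 (4.6) and h_a (4.7)."""
    sq2 = 2.0 ** 0.5
    nu02 = rho(0.0)                                  # = 0.6 for Epanechnikov
    C = 1.5 * rho(0) - 2 * rho(delta) + 0.5 * rho(2 * delta)
    br = (4 * (1 + sq2) * rho((sq2 - 1) * delta)
          + (3 + 2 * sq2) * rho((2 - sq2) * delta)
          + 2 * rho(sq2 * delta)
          + 4 * (1 - sq2) * rho((sq2 + 1) * delta)
          + (3 - 2 * sq2) * rho((sq2 + 2) * delta))
    D = nu02 - C / 4 - br / 16
    return ((nu02 - C / 4) / nu02) ** 0.2, ((nu02 - C / 4 - D / 2) / nu02) ** 0.2

h0 = 0.10                      # e.g. an automatic bandwidth chosen for the local linear fit
f1, fa = bandwidth_factors(1.0)
h1, ha = f1 * h0, fa * h0      # bandwidths for m_pm and m_a
```

Both factors are slightly below 1, reflecting that the variance-reduced estimators can afford a marginally smaller bandwidth at the same asymptotic bias.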

*4.2. Bin width.* In the construction of $\hat m_\pm(x)$ and $\hat m_a(x)$, the equally spaced points $\alpha_{x,0}$, $\alpha_{x,1}$ and $\alpha_{x,2}$ have bin width $\delta>0$. If $\delta=o(1)$, then there is no variance reduction; see Remark 5. If $\delta$ diverges with $\delta=o(h^{-1/3})$, for $\hat m_\pm(x)$, or $\delta=o(h^{-1/2})$, for $\hat m_a(x)$, although it is argued in Remark 5 that the new estimators have good asymptotic properties, there can be adverse effects on the finite sample biases. In applications, caution is needed when employing a constant $\delta$. Each of $\hat m_\pm(x)$ and $\hat m_a(x)$ uses observations further away from $x$ when using larger values of $\delta$, and that may substantially increase the finite sample bias. In general, larger values of $\delta$, for example, $\delta\in[1.5,2]$, are recommended only when $n$ is large or when the estimation problem is not difficult, that is, the regression function is smooth and the noise level is low. Otherwise, a smaller $\delta$ is preferred. Furthermore, $\delta=1$ is a good default. Results of a numerical study, reported in Section 5, support these suggestions.

*4.3. Computation.* A naive way to compute the proposed estimators is to calculate the required local linear estimators and then form the linear combinations. The computational effort is then increased by a constant factor. This extra burden can be avoided by careful implementation. Suppose the estimators are computed over an equispaced grid. Letting the spacing of the grid be a multiple of $\omega_n$ reduces the number of kernel evaluations to essentially the same as required by $\hat m$. In addition, fast implementations of kernel estimators, for example, the fast Fourier transform or the binning methods of Fan and Marron [12], can be employed to alleviate the computational effort.

**5. Numerical performance.** A simulation study was carried out to investigate the finite sample performances. Three regression functions (see [24]),

1. bimodal, $m(x)=0.3\exp\{-16(x-0.25)^2\}+0.7\exp\{-64(x-0.75)^2\}$,
2. linear with peak, $m(x)=2-5x+5\exp\{-400(x-0.5)^2\}$,
3. sine, $m(x)=\sin(5\pi x)$,

were considered. The design was Uniform$(0,1)$. The random error $\varepsilon$ was Normal$(0,1)$ distributed and $\sigma(x)=k\sigma_0$ for all $x\in[0,1]$, where $k=0.5$, 1 or 2 and $\sigma_0=0.1$, $\sqrt{0.5}$ and 0.5 for the bimodal, linear with peak and sine regression functions, respectively. The sample size was 25, 50, 100, 250 or 500. To avoid the data sparseness problem, let $\hat m_{\mathrm{HT}}(x)$ be the local linear estimator employing the interpolation method of Hall and Turlach [15] with the parameter $r$ therein taken as 3. Let $\hat m_{\mathrm{CH}}(x)$ denote the bias reduction method of Choi and Hall [7] applied to $\hat m_{\mathrm{HT}}(x)$ rather than the local linear estimator. Our variance reduction methods, using $\delta=0.6$, 0.8, 1, 1.2 or 1.6, were applied to $\hat m_{\mathrm{HT}}(x)$. The Epanechnikov kernel was employed. The bandwidth $h$ ranged over $\{0.008\cdot1.1^k,\ k=0,1,\ldots,40\}$.

In each setting 1000 samples were simulated. The mean integrated squared error (MISE) of an estimator is approximated by the average of the 1000 numerical integrals of the squared errors. The integrated squared bias (ISB) and integrated variance (IV) are approximated likewise.
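A stripped-down version of this experiment can be run in a few lines. The sketch below is ours and deliberately differs from the study in the paper: it uses the plain local linear estimator rather than the Hall–Turlach modification, a single fixed bandwidth, and only compares the Monte Carlo variance of the two estimators at the point $x=0.5$ for the bimodal regression with $k=1$.

```python
import numpy as np

def ll(x0, X, Y, h):
    """Plain local linear fit at x0, Epanechnikov kernel."""
    d = x0 - X
    w = 0.75 * np.maximum(1.0 - (d / h) ** 2, 0.0) / h
    S = [np.sum(d ** l * w) for l in range(3)]
    T = [np.sum(d ** l * w * Y) for l in range(2)]
    return (S[2] * T[0] - S[1] * T[1]) / (S[0] * S[2] - S[1] ** 2)

def m_a(x0, X, Y, h, delta=1.0):
    """Averaged variance-reduced estimator (interior point, no boundary adjustment)."""
    c = 1 / np.sqrt(2)
    w_n = delta * h
    A = lambda r: (r * (r - 1) / 2, 1 - r ** 2, r * (r + 1) / 2)
    out = 0.0
    for r in (c, -c):
        out += 0.5 * sum(a * ll(x0 - (r + 1 - j) * w_n, X, Y, h)
                         for j, a in enumerate(A(r)))
    return out

m = lambda x: 0.3 * np.exp(-16 * (x - 0.25) ** 2) + 0.7 * np.exp(-64 * (x - 0.75) ** 2)

rng = np.random.default_rng(1)
n, h, B, x0 = 100, 0.1, 200, 0.5
est_ll, est_a = [], []
for _ in range(B):
    X = rng.uniform(0.0, 1.0, n)
    Y = m(X) + 0.1 * rng.standard_normal(n)   # k = 1, sigma_0 = 0.1
    est_ll.append(ll(x0, X, Y, h))
    est_a.append(m_a(x0, X, Y, h))
ratio = np.var(est_a) / np.var(est_ll)        # Monte Carlo variance ratio
```

With $\delta=1$ the asymptotic variance ratio is about 0.78 for the Epanechnikov kernel, and the Monte Carlo ratio should fall below 1 already at these modest sample sizes.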

Figure 3 depicts the MISE, ISB and IV curves under one of the configurations. Our estimator $\hat m_a$ significantly improves on $\hat m_{\mathrm{HT}}$ in terms of IV. The second-order bias effect of $\hat m_a$ is more apparent when $h$ is large or $\delta$ is large. The ISB of $\hat m_{\mathrm{CH}}$ is much smaller than that of $\hat m_{\mathrm{HT}}$ except when $h$ is large.

FIG. 3. *MISE, ISB and IV curves. From left to right are MISE, ISB and IV plotted against $h$ when the regression is bimodal, the design is $U(0,1)$, $k=1$ and $n=100$. The line types solid, long-dashed, dotted, short-dashed, dashed, dotted-spaced and dotted-dashed respectively represent $\hat m_{\mathrm{HT}}$, $\hat m_{\mathrm{CH}}$ and $\hat m_a$ with $\delta=0.6$, 0.8, 1, 1.2 and 1.6.*

FIG. 4. *MISE relative efficiency. The top (bottom) row plots against $n$ the MISE efficiencies relative to $\hat m_{\mathrm{HT}}$ when the regression is the bimodal (sine) function and the design is $U(0,1)$. From left, the columns correspond to the noise levels $k=0.5$, 1 and 2. Line types represent the estimators and are as in Figure 3.*

Finite sample efficiency of an estimator relative to $\hat m_{\mathrm{HT}}$ can be measured by the ratio of the minimal (over $h$) MISE values. Figure 4 plots some of the results. Our estimator $\hat m_a$, using any of the $\delta$ values, outperforms $\hat m_{\mathrm{HT}}$ most of the time. The bias reduction estimator $\hat m_{\mathrm{CH}}$ is better than $\hat m_a$ with $\delta\le1$ when the regression function is smooth and when $n\ge100$. Interestingly, even when $n=500$, $\hat m_{\mathrm{CH}}$ has no significant advantage over $\hat m_a$ with $\delta=1.2$ or 1.6. Although we do not report the simulation results for the linear with peak regression function, we observe that, in this case, $\hat m_{\mathrm{CH}}$ performs roughly the same as $\hat m_{\mathrm{HT}}$, but our variance reduction estimator $\hat m_a$ outperforms $\hat m_{\mathrm{HT}}$. These observations coincide with previous studies on bias reduction methods, which usually depend on certain higher-order approximations that take effect only when the curve is smooth and $n$ is large. In the high-noise case ($k=2$), $\hat m_{\mathrm{HT}}$ breaks down and there is not much improvement from employing any modification.

Examining Figure 4 more closely, we draw the following conclusions. First, $\delta=1$ is a reliable choice for general purposes. Except when the noise level is high ($k=2$), the relative efficiency for $\delta=1$ is already above 1.1 from $n=100$ and is not far from the asymptotic value $\gamma_a(1)\approx1.22$ (see Figure 1) for moderate sample sizes. In all of the cases, the curves for $\delta=0.6$ and 0.8 become almost flat starting from $n=100$. Besides $\delta=1$, larger $\delta$ values, for example, $\delta=1.2$, may have potential in practice. In general, for smooth curves like the sine function or low noise levels, gains resulting from employing $\delta>1$ over $\delta\le1$ become noticeable for $n\ge500$.

Although the simulation results for $\hat m_\pm$ are not given here, the performances of $\hat m_+$ and $\hat m_-$ are hard to differentiate from each other, the average version $\hat m_a$ copes with the second-order bias effect better than $\hat m_\pm$, and $\hat m_\pm(x)$ using $\delta\le1.2$ outperform $\hat m_{\mathrm{HT}}$ most of the time. The simulation study has been summarized in terms of the MISE performances. The pointwise mean squared errors, not reported here, lead to similar conclusions. In addition, the MSE relative efficiencies of the proposed estimators are roughly unchanged as $x$ varies. Two truncated Normal designs, $N(0.5,0.5^2)\cap(0,1)$ and $N(0,1)\cap(0,1)$, were experimented with under all the considered configurations as well. In all cases, the relative MISE efficiency of each of our variance reduction estimators behaves consistently across the three different designs.

**6. Generalizations and applications.** Generalizations of the methodology
are multifold and comparatively easy. First, in kernel estimation of curves and
their derivatives using higher-order kernels or by fitting higher-degree local polynomials, a linear combination of some preliminary estimators at more than three
neighboring points is required in order that moment conditions, in the same spirit
as those in (7.4), are satisfied. See also the asymptotic mean expressions (7.3)
and (7.5). Second, extension to estimation of multidimensional surfaces is possible. Cheng and Peng [6] made some progress in this regard: when using product
kernels, form a grid in every direction as in the one-dimensional case, and then
take the coefficient of one preliminary estimator as the product of all the corresponding one-dimensional coefficients. The asymptotic relative efficiency, in terms of
MSE or MISE, compared to the local linear smoother is then $\gamma_q(\delta)^d$, and is $\gamma_a(\delta)^d$ for
the average version.
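The product construction for the multidimensional case can be sketched in a few lines. The following is our own illustrative code (not from [6]), assuming offsets $r_1, \ldots, r_d$ along the $d$ coordinate directions and the one-dimensional coefficients $A_0(r) = r(r-1)/2$, $A_1(r) = 1 - r^2$, $A_2(r) = r(r+1)/2$ solving (7.4):

```python
import numpy as np

def A(r):
    """One-dimensional coefficients A_0, A_1, A_2 solving the moment
    conditions (7.4)."""
    return np.array([r * (r - 1) / 2.0, 1.0 - r ** 2, r * (r + 1) / 2.0])

def product_coefficients(rs):
    """Coefficients on the 3 x ... x 3 grid of preliminary estimators:
    each grid node's coefficient is the product of the corresponding
    one-dimensional coefficients, one factor per direction."""
    coef = np.array(1.0)
    for r in rs:
        coef = np.multiply.outer(coef, A(r))  # add one tensor axis per direction
    return coef  # shape (3, ..., 3), d axes
```

The product coefficients still sum to one and annihilate the first- and second-order terms coordinatewise, which underlies the $\gamma_q(\delta)^d$ relative efficiency.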

The proposed variance reduction strategy is fairly simple, allowing a wide range of potential applications. Variance reduction is particularly useful when very few data points can be used in the estimation, for example, when estimating quantities in conditional distributions. Other useful applications include various local modeling techniques such as local likelihood estimation, varying coefficient models and hazard regression. In situations where the covariance structure of the preliminary estimators is completely analogous to (7.6), direct application is appropriate; examples include kernel density estimation and local likelihood modeling [19, 26]. Notice that applying the technique to density estimation can introduce negativity. In more complicated settings, the covariance structure of the preliminary estimators needs to be investigated first.
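To make the one-dimensional recipe concrete, here is a minimal sketch in Python/NumPy (the function names are ours, and the choice of $r$ and the grid is simplified relative to the paper, where the grid $\{\alpha_{x,i}\}$ is shared across nearby $x$ so pilot estimates are reused): it forms $\hat m_q(x) = \sum_{i=0,1,2} A_i(r)\,\tilde m(\alpha_{x,i})$ from a local linear pilot with the Epanechnikov kernel, using the coefficients $A_0(r) = r(r-1)/2$, $A_1(r) = 1 - r^2$, $A_2(r) = r(r+1)/2$, which one can check solve the moment conditions (7.4).

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def local_linear(z, X, Y, h):
    """Local linear pilot estimate of m(z)."""
    d = X - z
    w = epanechnikov(d / h)
    s0, s1, s2 = w.sum(), (w * d).sum(), (w * d ** 2).sum()
    b0, b1 = (w * Y).sum(), (w * d * Y).sum()
    # Intercept of the weighted least squares line through (X_i, Y_i).
    return (s2 * b0 - s1 * b1) / (s0 * s2 - s1 ** 2)

def A(r):
    """Coefficients solving the moment conditions (7.4)."""
    return np.array([r * (r - 1) / 2.0, 1.0 - r ** 2, r * (r + 1) / 2.0])

def m_q(x, X, Y, h, delta=1.0, r=0.5):
    """Variance-reduced estimate: combine the pilot at three points
    spaced omega_n = delta * h apart, with x = alpha_{x,1} + r * omega_n."""
    omega = delta * h
    alpha1 = x - r * omega
    alphas = (alpha1 - omega, alpha1, alpha1 + omega)
    return sum(a * local_linear(al, X, Y, h) for a, al in zip(A(r), alphas))
```

Because the combination reproduces constants and annihilates the linear and quadratic terms in $\alpha_{x,i} - x$, the estimate $\hat m_q$ inherits the pilot's exactness on linear regression functions.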

**7. Proofs.**

PROOF OF THEOREM 1. Following [9], consider the version of the local linear estimator in (3.1). Denote $Z_n = O_l(a_n)$ if $E|Z_n|^l = O(a_n^l)$ and $Z_n = o_l(a_n)$ if $E|Z_n|^l = o(a_n^l)$. For any fixed $z \in [\alpha_{x,0}, \alpha_{x,2}]$, write $\omega_{i,z} = hK_h(z - X_i)\{S_{n,2}(z) - (z - X_i)S_{n,1}(z)\}$ with $S_{n,j}(z) = h\sum_{i=1}^{n}(z - X_i)^j K_h(z - X_i)$, $j = 0, 1, 2$. Then one can show that
\[
n^2h^4\Bigg\{\sum_{i=1}^{n}\omega_{i,z} + n^{-2}\Bigg\}^{-1} = \{\nu_{20}f^2(z)\}^{-1} + o_4(1),
\tag{7.1}
\]
\[
\sum_{i=1}^{n}\{m(X_i) - m(z)\}\omega_{i,z} = n^2h^6f^2(z)\nu_{20}S_{n,z} + o_4(n^2h^6),
\tag{7.2}
\]
where $S_{n,z} = h^{-2}E[\{m(X) - m(z) - m'(z)(X - z)\}K_h(z - X)]$. From (7.1) and (7.2),
\[
E\{\tilde m(z)\} = m(z) + \tfrac{1}{2}m''(z)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}.
\tag{7.3}
\]

Note that, for $j = 0, 1, 2$,
\[
A_0(r)(-1 - r)^j + A_1(r)(-r)^j + A_2(r)(1 - r)^j = \delta_{0,j},
\tag{7.4}
\]
where $\delta_{0,j} = 1$ if $j = 0$ and $0$ otherwise. Recall that $x = \alpha_{x,1} + r\omega_n \in [\alpha_{x,0}, \alpha_{x,2}]$, $-1 \le r \le 1$, and $\omega_n = \delta h = \alpha_{x,2} - \alpha_{x,1} = \alpha_{x,1} - \alpha_{x,0}$, so that $\alpha_{x,i} - x = (i - 1 - r)\omega_n$. Using (7.3), Taylor expansion and (7.4) we have
\[
\begin{aligned}
E\{\hat m_q(x)\} &= \sum_{i=0,1,2} A_i(r)\,E\{\tilde m(\alpha_{x,i})\}\\
&= \Big\{m(x) + \tfrac{1}{2}m''(x)\nu_{20}h^2\Big\}\sum_{i=0,1,2} A_i(r)
+ m'(x)\sum_{i=0,1,2} A_i(r)(\alpha_{x,i} - x)\\
&\quad + \tfrac{1}{2}m''(x)\sum_{i=0,1,2} A_i(r)(\alpha_{x,i} - x)^2 + o\{h^2 + (nh)^{-1/2}\}\\
&= m(x) + \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\},
\end{aligned}
\tag{7.5}
\]
where the last equality holds because (7.4) with $j = 0$ gives $\sum_i A_i(r) = 1$, and with $j = 1, 2$ it makes the two remaining sums vanish.

Next we compute the variance of $\hat m_q(x)$. For any $u, v \in [\alpha_{x,0}, \alpha_{x,2}]$, from (7.1) and (7.3),
\[
\operatorname{Cov}\{\tilde m(u), \tilde m(v)\}
= \operatorname{Cov}\Bigg(\frac{\sum_{i=1}^{n}\omega_{i,u}\{Y_i - m(u)\}}{\sum_{i=1}^{n}\omega_{i,u} + n^{-2}},\;
\frac{\sum_{i=1}^{n}\omega_{i,v}\{Y_i - m(v)\}}{\sum_{i=1}^{n}\omega_{i,v} + n^{-2}}\Bigg)
+ o\Big(\frac{1}{n^4h^4}\Big).
\]
Taking conditional expectations on $X_1, \ldots, X_n$ and using the mean and variance decomposition yields
\[
\begin{aligned}
\operatorname{Cov}\{\tilde m(u), \tilde m(v)\}
&= E\Bigg[\frac{\sum_{i=1}^{n}\omega_{i,u}\{m(X_i) - m(u)\}\sum_{j=1}^{n}\omega_{j,v}\{m(X_j) - m(v)\}}
{(\sum_{i=1}^{n}\omega_{i,u} + n^{-2})(\sum_{j=1}^{n}\omega_{j,v} + n^{-2})}\Bigg]\\
&\quad - E\Bigg[\frac{\sum_{i=1}^{n}\omega_{i,u}\{m(X_i) - m(u)\}}{\sum_{i=1}^{n}\omega_{i,u} + n^{-2}}\Bigg]
E\Bigg[\frac{\sum_{j=1}^{n}\omega_{j,v}\{m(X_j) - m(v)\}}{\sum_{j=1}^{n}\omega_{j,v} + n^{-2}}\Bigg]\\
&\quad + E\Bigg[\frac{\sum_{i=1}^{n}\omega_{i,u}\omega_{i,v}\sigma^2(X_i)}
{(\sum_{i=1}^{n}\omega_{i,u} + n^{-2})(\sum_{i=1}^{n}\omega_{i,v} + n^{-2})}\Bigg]
+ o\{(nh)^{-4}\}\\
&= \frac{\sigma^2(x)}{nf(x)}\int K_h(u - t)K_h(v - t)\,dt + o\{h^4 + (nh)^{-1}\},
\end{aligned}
\tag{7.6}
\]
where the last equality follows from (7.1), (7.2) and
\[
\sum_{i=1}^{n}\omega_{i,u}\omega_{i,v}\sigma^2(X_i)
= n^3h^8\sigma^2(x)\nu_{20}^2 f^2(x)\int K_h(u - t)K_h(v - t)f(t)\,dt\,
\{1 + O_l(h^{\alpha} + (nh)^{-1/2})\}.
\]

Then (3.5) is valid since
\[
\operatorname{Var}\{\hat m_q(x)\}
= \sum_{i=0,1,2} A_i(r)^2\operatorname{Var}\{\tilde m(\alpha_{x,i})\}
+ \sum_{i=0,1,2}\,\sum_{j \ne i} A_i(r)A_j(r)\operatorname{Cov}\{\tilde m(\alpha_{x,i}), \tilde m(\alpha_{x,j})\}.
\]

PROOF OF PROPOSITION 1. Property (a) follows from $\int K(x - \delta/2)K(x + \delta/2)\,dx = \int K(x)K(x - \delta)\,dx = \int K(x)K(x + \delta)\,dx$ and writing
\[
\begin{aligned}
C(\delta) &= \int\Big\{\tfrac{3}{2}K(x)^2 - K(x)K(x + \delta) - K(x)K(x - \delta)
+ \tfrac{1}{2}K(x - \delta)K(x + \delta)\Big\}\,dx\\
&= \int\Big\{K(x) - \tfrac{1}{2}K(x + \delta) - \tfrac{1}{2}K(x - \delta)\Big\}^2\,dx.
\end{aligned}
\tag{7.7}
\]

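The rearrangement in (7.7) is easy to confirm numerically for a concrete kernel. The following is our own sanity check (not from the paper), taking $K$ to be the Gaussian kernel and integrating both forms with a simple Riemann sum on a wide grid:

```python
import numpy as np

def gauss(x):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

# Wide uniform grid; Gaussian tails are negligible beyond |x| = 12.
x = np.linspace(-12.0, 12.0, 4001)
dx = x[1] - x[0]

def C_expanded(delta):
    """C(delta) via the expanded integrand in the first line of (7.7)."""
    f = (1.5 * gauss(x) ** 2
         - gauss(x) * gauss(x + delta)
         - gauss(x) * gauss(x - delta)
         + 0.5 * gauss(x - delta) * gauss(x + delta))
    return f.sum() * dx

def C_square(delta):
    """C(delta) via the perfect-square form in the second line of (7.7)."""
    f = (gauss(x) - 0.5 * gauss(x + delta) - 0.5 * gauss(x - delta)) ** 2
    return f.sum() * dx
```

The two forms agree up to quadrature error, and the perfect-square form makes $C(\delta) \ge 0$ immediate.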
Property (b) can be shown by differentiating the right-hand side of (7.7).

PROOF OF PROPOSITION 2. From (3.7) and (3.9),

\[
\begin{aligned}
D(\delta) &= \frac{2nhf(x)}{\sigma^2(x)}
\Big[\tfrac{1}{2}\operatorname{Var}\{\hat m_+(x)\} + \tfrac{1}{2}\operatorname{Var}\{\hat m_-(x)\}
- \operatorname{Var}\{\hat m_a(x)\}\Big]\{1 + o(1)\}\\
&= \frac{nhf(x)}{2\sigma^2(x)}\operatorname{Var}\{\hat m_+(x) - \hat m_-(x)\}\{1 + o(1)\},
\end{aligned}
\]
where the second equality uses $\hat m_a = \tfrac{1}{2}\{\hat m_+ + \hat m_-\}$, so that
$\tfrac{1}{2}\operatorname{Var}\{\hat m_+\} + \tfrac{1}{2}\operatorname{Var}\{\hat m_-\} - \operatorname{Var}\{\hat m_a\}
= \tfrac{1}{4}\operatorname{Var}\{\hat m_+ - \hat m_-\}$.