
The Annals of Statistics
2007, Vol. 35, No. 2, 522–542
DOI: 10.1214/009053606000001398
© Institute of Mathematical Statistics, 2007

REDUCING VARIANCE IN UNIVARIATE SMOOTHING

BY MING-YEN CHENG, LIANG PENG¹ AND JYH-SHYANG WU

National Taiwan University, Georgia Institute of Technology and Tamkang University

A variance reduction technique in nonparametric smoothing is proposed: at each point of estimation, form a linear combination of a preliminary estimator evaluated at nearby points, with the coefficients specified so that the asymptotic bias remains unchanged. The nearby points are chosen to maximize the variance reduction. We study in detail the case of univariate local linear regression. While the new estimator retains many advantages of the local linear estimator, it has appealing asymptotic relative efficiencies. Bandwidth selection rules are available by a simple constant factor adjustment of those for local linear estimation. A simulation study indicates that the finite sample relative efficiency often matches the asymptotic relative efficiency for moderate sample sizes. This technique is very general and has a wide range of applications.

1. Introduction. Local linear modeling for nonparametric regression has many advantages and has become very popular. Hastie and Loader [16], Wand and Jones [27], Fan and Gijbels [11] and others have investigated extensively its theoretical and practical properties. To reduce the variance, for each point x, we take a special linear combination of the local linear estimator evaluated at three equally spaced points around x. The linear combination satisfies certain moment conditions so that the asymptotic bias is unchanged. Below are a few specific features of this new estimator. First, both local and global automatic bandwidths can be easily obtained from those for the standard local linear estimator. Second, the asymptotic mean squared error is improved considerably and the amount of reduction is uniform across different locations, regression functions, designs and error distributions. Third, as evidenced by a simulation study, the reduction in asymptotic variance is effectively projected to finite samples. Fourth, the estimators admit simple forms and only slightly increase the amount of computation time. Finally, many advantages of the local linear estimator, for example, design adaptivity and automatic boundary correction, are retained.

Received May 2004; revised May 2006.

¹Supported by NSF Grant DMS-04-03443.

AMS 2000 subject classifications. Primary 62G08, 62G05; secondary 60G20.

Key words and phrases. Bandwidth, coverage probability, kernel, local linear regression, nonparametric smoothing, variance reduction.


In kernel density estimation, Kogure [18] studied a related topic on polynomial interpolation and obtained some preliminary results. The purpose is to find optimal allocation of the interpolation points by minimizing the asymptotic mean integrated squared error. Higher-order polynomial interpolation does not change the asymptotic bias but alters the asymptotic variance in a nonhomogeneous way unless the spacings are all large enough, which was assumed in seeking the optimal allocation.

Fan [9] showed that the local linear estimator is minimax optimal among all linear estimators. The proposed estimators are linear and have smaller asymptotic mean squared errors than the local linear estimator. There is no conflict between these two results since the assumptions are different. In [9], the maximum risk of a linear estimator is taken over a class of regression functions that can be approximated well linearly. The asymptotic results for our estimators require the slightly more stringent condition that the regression function has a bounded, continuous second derivative at the point of estimation.

There is an extensive literature on modifications and improvements of kernel and local linear estimators. Many of them are aimed at bias reduction, for example, Abramson [1], Samiuddin and El-Sayyad [22], Jones, Linton and Nielsen [17] and Choi and Hall [7]. Choi and Hall [7] exploit the idea of taking linear combinations of some local linear estimators as well. The key difference is that we take linear combinations to maintain bias but reduce variance, while Choi and Hall [7] take linear combinations to reduce bias while essentially maintaining variance. More specifically, (i) they use local linear estimates at points symmetrically located around x and we do not, (ii) they take a convex combination and we employ different constraints on the coefficients, (iii) bandwidth selection is straightforward in our case but is much more complicated for theirs, and (iv) our method applies to local constant fitting estimators but theirs does not. There are very few variance reduction techniques available. An error-dependent technique in local linear smoothing was suggested by Cheng and Hall [3]; the amount of variance reduction depends on the error distribution. Cheng and Hall [4] introduced an adaptive line integral of the bivariate kernel density estimate to reduce variance, but it does not apply to the univariate case or to regression estimation. These two methods require, explicitly or implicitly, a pilot estimation of the function or its derivatives. Our procedure is based on a completely different idea and effectively reduces variance without pilot estimation.

Section 2 presents the methodology and Section 3 provides results on the asymptotic mean squared error and coverage accuracy. Section 4 discusses some practical issues, including bandwidth selection and implementation. Section 5 contains a numerical study. Possible generalizations and applications of the variance reduction technique are discussed in Section 6. Proofs are given in Section 7.


2. Methodology.

2.1. Local linear regression. Suppose that independent bivariate observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ are sampled from the regression model
$$Y = m(X) + \sigma(X)\varepsilon,$$
where $\sigma^2(X)$ is the conditional variance and the random error $\varepsilon$, independent of $X$, has zero mean and unit variance. Kernel estimators of $m(x) = E(Y \mid X = x)$ include the Nadaraya–Watson, Gasser–Müller and local polynomial estimators; see [8, 25, 27]. Let $K$ be a kernel function and $h > 0$ be a bandwidth. The local linear regression estimator is defined as
$$\widehat m(x) = \frac{S_{n,2}(x)T_{n,0}(x) - S_{n,1}(x)T_{n,1}(x)}{S_{n,0}(x)S_{n,2}(x) - S_{n,1}(x)S_{n,1}(x)}, \tag{2.1}$$
where $S_{n,l}(x) = h\sum_{i=1}^{n}(x - X_i)^l K_h(x - X_i)$, $l = 0, 1, 2$, $T_{n,l}(x) = h\sum_{i=1}^{n}(x - X_i)^l K_h(x - X_i)Y_i$, $l = 0, 1$, and $K_h(t) = K(t/h)/h$. This estimator is obtained by solving a local linear least squares problem. Its appealing theoretical and numerical properties were discussed by Fan [9] and Fan and Gijbels [11], among others. Further, it has become very popular, widely used in applications and implemented in statistical software. When the denominator is close to zero, it exhibits rather unstable numerical behavior; this is particularly problematic for small samples or sparse designs. Remedies include modifications proposed by Seifert and Gasser [23], Cheng, Hall and Titterington [5], Hall and Turlach [15] and Seifert and Gasser [24]. Although we focus on the local linear estimator, denoted by $\widehat m(x)$, our variance reduction methods given below can be applied to any such modification.
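To fix ideas, here is a minimal numerical sketch of (2.1); the Epanechnikov kernel and the small ridge term in the denominator (in the spirit of the modification (3.1) below) are illustrative choices, not part of the definition.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 (1 - u^2) on [-1, 1]."""
    return 0.75 * np.clip(1.0 - u ** 2, 0.0, None)

def local_linear(x, X, Y, h, kernel=epanechnikov):
    """Local linear estimate of m(x) computed via formula (2.1)."""
    n = len(X)
    d = x - X                          # x - X_i
    Kh = kernel(d / h) / h             # K_h(x - X_i)
    S0, S1, S2 = h * Kh.sum(), h * (d * Kh).sum(), h * (d ** 2 * Kh).sum()
    T0, T1 = h * (Kh * Y).sum(), h * (d * Kh * Y).sum()
    # n^{-2} ridge as in (3.1) guards against a near-singular denominator
    return (S2 * T0 - S1 * T1) / (S0 * S2 - S1 * S1 + n ** -2)
```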

2.2. Variance reduced estimators. A motivation of our variance reduction strategy is to incorporate more data points in the regression estimation in such a way that the first-order term in the asymptotic bias remains unchanged. To do this, at each x we construct a linear combination of the local linear estimators at some equally spaced points near x, with the linear coefficients satisfying certain moment conditions derived from the asymptotic bias expansions. Then variance reduction is achieved since the three preliminary estimators are correlated and the correlation coefficients are less than 1. In this context, we fix the number of nearby points at three since that is the minimal requirement specified by the moment conditions and using more than three yields complex solutions. In addition, the grid points are taken so that the final estimator is the simplest and most efficient in the mean squared error sense.

Formally, for any x, define three equally spaced points $\alpha_{x,j} = x - (r + 1 - j)\omega_n$, $j = 0, 1, 2$, where $r \in (-1, 1)$ and $\omega_n = \delta h$ for some constant $\delta > 0$. The shift parameter $r$ determines the location of the leftmost point $\alpha_{x,0}$ relative to $x$, and $\omega_n$ is the spacing of the grid. Construct a quadratic interpolation of the local linear estimators $\widehat m(\alpha_{x,0})$, $\widehat m(\alpha_{x,1})$ and $\widehat m(\alpha_{x,2})$ and then estimate $m(x)$ by the value of the interpolated curve at $x$:
$$\widehat m_q(x) = \sum_{j=0,1,2} A_j(r)\,\widehat m\bigl\{x - (r + 1 - j)\omega_n\bigr\},$$
where the coefficients depend only on $r$ and are given by
$$A_0(r) = r(r - 1)/2,\qquad A_1(r) = 1 - r^2,\qquad A_2(r) = r(r + 1)/2.$$

Then the moment conditions (7.4) are satisfied so that $\widehat m_q(x)$ has the same asymptotic bias as $\widehat m(x)$. Theorem 1 and Proposition 1 show that $\widehat m_q(x)$ has a smaller asymptotic variance than $\widehat m(x)$ and that the asymptotic variances differ from each other by a constant factor depending on the bin-width parameter $\delta$, the shift parameter $r$ and the kernel $K$. Given $K$ and $\delta$, the optimal values of $r$ that minimize this constant factor are $r = \pm\sqrt{1/2}$, which give
$$\widehat m_\pm(x) = \sum_{j=0,1,2} A_j\bigl(\pm\sqrt{1/2}\bigr)\,\widehat m\bigl\{x - \bigl(\pm\sqrt{1/2} + 1 - j\bigr)\omega_n\bigr\}.$$

Since $\widehat m_+(x)$ uses more data information on the left-hand side of $x$ than on the right-hand side, the curve estimate $\widehat m_+(\cdot)$ tends to shift to the right of $m(\cdot)$. Similarly, $\widehat m_-(\cdot)$ tends to shift to the left of $m(\cdot)$. This symmetry suggests taking the average



$$\widehat m_a(x) = \{\widehat m_+(x) + \widehat m_-(x)\}/2$$
as our final estimator. Sections 3 and 5 demonstrate that, compared to $\widehat m_\pm(x)$, $\widehat m_a(x)$ further improves the asymptotic and finite sample efficiencies.

When $\mathrm{Supp}(m)$ is bounded, say $\mathrm{Supp}(m) = [0, 1]$, since each of $\widehat m_\pm(x)$ and $\widehat m_a(x)$ requires values of $\widehat m$ at points around $x$, a $\delta$ value
$$\delta(x) = \min\Bigl[\delta,\; \frac{x}{(\sqrt{1/2} + 1)h},\; \frac{1 - x}{(\sqrt{1/2} + 1)h}\Bigr], \tag{2.2}$$
which depends on the distances from $x$ to the boundary points 0 and 1, is used so that $\widehat m_a(x)$ is defined for every $x \in [0, 1]$.
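The estimators $\widehat m_\pm(x)$ and $\widehat m_a(x)$ are then just three-term combinations of the preliminary fit. A sketch, building on the `local_linear` routine above and using the boundary-adjusted $\delta(x)$ of (2.2); the helper names are ours.

```python
import numpy as np

SQRT_HALF = np.sqrt(0.5)

def A_coefs(r):
    """Interpolation coefficients A_0(r), A_1(r), A_2(r)."""
    return (r * (r - 1) / 2, 1 - r ** 2, r * (r + 1) / 2)

def delta_x(x, h, delta):
    """Boundary-adjusted bin-width parameter delta(x) of (2.2) for Supp(m) = [0, 1]."""
    c = (SQRT_HALF + 1) * h
    return min(delta, x / c, (1 - x) / c)

def m_pm(x, X, Y, h, delta, sign):
    """Variance-reduced estimator m_+ (sign=+1) or m_- (sign=-1)."""
    r = sign * SQRT_HALF
    omega = delta_x(x, h, delta) * h                    # grid spacing omega_n
    alphas = [x - (r + 1 - j) * omega for j in (0, 1, 2)]
    return sum(a * local_linear(p, X, Y, h) for a, p in zip(A_coefs(r), alphas))

def m_a(x, X, Y, h, delta=1.0):
    """Averaged estimator m_a(x) = {m_+(x) + m_-(x)} / 2."""
    return 0.5 * (m_pm(x, X, Y, h, delta, +1) + m_pm(x, X, Y, h, delta, -1))
```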

Our estimators $\widehat m_\pm(x)$ and $\widehat m_a(x)$ are simple linear combinations of local linear estimators evaluated at nearby points. No pilot estimation is involved in this variance reduction. Therefore, in finite samples, the asymptotic efficiencies are achieved at relatively small $n$. For the same reasons, they share many advantages enjoyed by the local linear estimator, for example automatic boundary correction and design adaptivity; see [11]. It is shown in Section 3 that each of $\widehat m_\pm(x)$ and $\widehat m_a(x)$ reduces the asymptotic variance uniformly across different regression functions, different designs and different error distributions. Then procedures such as design planning employed in applications of $\widehat m$ apply to our estimators in general. Thus the proposed estimators are rather user friendly.


2.3. Confidence intervals. Consider constructing confidence intervals for $m(x)$ based on $\widehat m(x)$ and $\widehat m_q(x)$. Let $\nu_{ij} = \int s^i K(s)^j\,ds$ and $\tilde\nu_{0l} = \int\bigl\{\sum_{i=0,1,2} A_i(r)K(s + i\delta)\bigr\}^l\,ds$. Define $w_{ijk}(x) = n^{-1}h^{j-i-1}\sum_{l=1}^{n}(x - X_l)^i\{K_h(x - X_l)\}^j\{Y_l - \widehat m(x)\}^k$ and $\widehat\sigma^2(x) = n^{-1}\sum_{l=1}^{n} K_h(x - X_l)\{Y_l - \widehat m(x)\}^2 / w_{010}(x)$. Then both
$$T_n(x) = \sqrt{\frac{nh}{\nu_{02}}}\,\frac{\widehat m(x) - m(x)}{\{\widehat\sigma^2(x)/w_{010}(x)\}^{1/2}},\qquad \widetilde T_n(x) = \sqrt{\frac{nh}{\tilde\nu_{02}}}\,\frac{\widehat m_q(x) - m(x)}{\{\widehat\sigma^2(x)/w_{010}(x)\}^{1/2}}$$
are asymptotically $N(0, 1)$ distributed when $nh^5 \to 0$, and we have the following one-sided confidence intervals for $m(x)$:
$$I_\beta = \bigl[\widehat m(x) - z_\beta\{\widehat\sigma^2(x)/w_{010}(x)\}^{1/2}\nu_{02}^{1/2}(nh)^{-1/2},\ \infty\bigr), \tag{2.3}$$
$$\widetilde I_\beta = \bigl[\widehat m_q(x) - z_\beta\{\widehat\sigma^2(x)/w_{010}(x)\}^{1/2}\tilde\nu_{02}^{1/2}(nh)^{-1/2},\ \infty\bigr), \tag{2.4}$$
where $z_\beta$ satisfies $P\{N(0, 1) \le z_\beta\} = \beta$.
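As a sketch, the lower limit of (2.3) can be computed directly from the definitions of $w_{010}(x)$ and $\widehat\sigma^2(x)$ above; it reuses `local_linear` and `epanechnikov` from Section 2.1, and the constant $\nu_{02} = \int K(s)^2\,ds$ (equal to 0.6 for the Epanechnikov kernel) must match the kernel actually used.

```python
import numpy as np
from scipy.stats import norm

def lower_confidence_bound(x, X, Y, h, beta=0.95, nu02=0.6):
    """Lower endpoint of the one-sided interval I_beta in (2.3)."""
    n = len(X)
    mhat = local_linear(x, X, Y, h)
    Kh = epanechnikov((x - X) / h) / h
    w010 = Kh.mean()                                   # n^{-1} sum_l K_h(x - X_l)
    sigma2 = (Kh * (Y - mhat) ** 2).mean() / w010      # hat sigma^2(x)
    z_beta = norm.ppf(beta)
    return mhat - z_beta * np.sqrt(sigma2 / w010) * np.sqrt(nu02) / np.sqrt(n * h)
```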

3. Theoretical results. The following conditions are needed for asymptotic analysis of the estimators.

ASSUMPTION $(C_K)$. $K$ is a symmetric density function with compact support.

ASSUMPTION $(C_{m,f})$.

1. $m''(\cdot)$, the second derivative of $m(\cdot)$, is bounded and continuous at $x$;

2. the density function $f(\cdot)$ of $X$ satisfies $f(x) > 0$ and $|f(x) - f(y)| \le c|x - y|^{\alpha}$ for some $0 < \alpha < 1$;

3. $\sigma^2(\cdot)$ is bounded and continuous at $x$.

3.1. Asymptotic mean squared error. Fan [9] showed that, under the conditions $(C_{m,f})$, $(C_K)$, $h \to 0$ and $nh \to \infty$ as $n \to \infty$,
$$\widehat m(x) = \frac{S_{n,2}(x)T_{n,0}(x) - S_{n,1}(x)T_{n,1}(x)}{S_{n,0}(x)S_{n,2}(x) - S_{n,1}(x)S_{n,1}(x) + n^{-2}}, \tag{3.1}$$
a modification of (2.1) that admits an asymptotic unconditional variance, has
$$E\{\widehat m(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.2}$$
$$\operatorname{Var}\{\widehat m(x)\} = \frac{\sigma^2(x)}{nhf(x)}\nu_{02} + o\{h^4 + (nh)^{-1}\}. \tag{3.3}$$

THEOREM 1. Suppose that $\delta > 0$ is a constant. Assume $(C_{m,f})$, $(C_K)$, $h \to 0$ and $nh \to \infty$ as $n \to \infty$. Then
$$E\{\widehat m_q(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.4}$$
$$\operatorname{Var}\{\widehat m_q(x)\} = \frac{\sigma^2(x)}{nhf(x)}\tilde\nu_{02} + o\{h^4 + (nh)^{-1}\}. \tag{3.5}$$

Note that $\tilde\nu_{02} = \nu_{02} - r^2(1 - r^2)C(\delta)$, where
$$C(\delta) = 1.5\,C(0, \delta) - 2\,C(0.5, \delta) + 0.5\,C(1, \delta)\quad\text{with}\quad C(a, \delta) = \int K(t - a\delta)K(t + a\delta)\,dt.$$
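For a concrete kernel the constants are easy to evaluate numerically. A sketch for the Epanechnikov kernel (an illustrative choice), computing $C(a, \delta)$, $C(\delta)$ and the resulting $\tilde\nu_{02}$ at the optimal $r = \pm\sqrt{1/2}$:

```python
from scipy.integrate import quad

def K(u):
    """Epanechnikov kernel."""
    return 0.75 * max(1.0 - u * u, 0.0)

def C_pair(a, delta):
    """C(a, delta) = int K(t - a*delta) K(t + a*delta) dt."""
    return quad(lambda t: K(t - a * delta) * K(t + a * delta), -4.0, 4.0)[0]

def C(delta):
    """C(delta) = 1.5 C(0, delta) - 2 C(0.5, delta) + 0.5 C(1, delta)."""
    return 1.5 * C_pair(0.0, delta) - 2.0 * C_pair(0.5, delta) + 0.5 * C_pair(1.0, delta)

nu02 = C_pair(0.0, 0.0)               # int K^2 = 0.6 for the Epanechnikov kernel
for d in (0.6, 1.0, 2.0):
    # r^2 (1 - r^2) = 1/4 at r = +-sqrt(1/2)
    print(d, nu02 - 0.25 * C(d))      # tilde nu_02 at the optimal r
```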

PROPOSITION 1. The quantity $C(\delta)$ has the following properties:

(a) For any symmetric kernel function $K$, $C(\delta) \ge 0$ for any $\delta \ge 0$.

(b) If $K$ has a unique maximum and is concave, then $C(\delta)$ is increasing in $\delta > 0$.

REMARK 1. From (3.2) and (3.4), $\widehat m_q(x)$ and $\widehat m(x)$ have the same asymptotic bias. From (3.3), (3.5) and Proposition 1, the asymptotic variance of $\widehat m_q(x)$ is smaller than that of $\widehat m(x)$ by the amount $\{nhf(x)\}^{-1}\sigma^2(x)r^2(1 - r^2)C(\delta)$. Note that $0 < r^2(1 - r^2) \le 1/4$ for any $r \in (-1, 1) \setminus \{0\}$ and the maximum $1/4$ is attained at $r = \pm\sqrt{1/2}$. Therefore, for any $\delta > 0$, the optimal choices of $r$ are $r = \pm\sqrt{1/2}$, which yield $\widehat m_\pm(x)$.

REMARK 2. A generalization of $\widehat m_q(x)$, based on local linear estimators at $z_0 = x - r\delta h$, $z_1 = x - (r - k)\delta h$ and $z_2 = x - (r - k - 1)\delta h$, for some $0 < k \ne 1$, is $\sum_{j=0,1,2} B_j(r)\widehat m(z_j)$, where $B_0(r) = r(r - 1)/\{k(k + 1)\}$, $B_1(r) = -(r + k)(r - 1)/k$ and $B_2(r) = r(r + k)/(k + 1)$. It has the same asymptotic bias as $\widehat m(x)$ and asymptotic variance $\{nhf(x)\}^{-1}\sigma^2(x)\tau(\delta, r, k)$, where
$$\tau(\delta, r, k) = \nu_{02}\sum_{j=0,1,2} B_j(r)^2 + 2B_0(r)B_1(r)C(k, \delta/2) + 2B_0(r)B_2(r)C(k + 1, \delta/2) + 2B_1(r)B_2(r)C(1, \delta/2).$$
In $\tau(\delta, r, k)$, $K$ interacts with $r$, $k$ and $\delta$ and there is no explicit value of $r$ minimizing $\tau(\delta, r, k)$ for given $\delta$ and $k$.

COROLLARY 1. Under the conditions in Theorem 1, as $n \to \infty$,
$$E\{\widehat m_\pm(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.6}$$
$$\operatorname{Var}\{\widehat m_\pm(x)\} = \frac{\sigma^2(x)}{nhf(x)}\Bigl\{\nu_{02} - \frac{C(\delta)}{4}\Bigr\} + o\{h^4 + (nh)^{-1}\}. \tag{3.7}$$

Theorem 2 can be proved using arguments similar to those in the proof of Theorem 1.

THEOREM 2. Under the conditions in Theorem 1, as $n \to \infty$,
$$E\{\widehat m_a(x)\} - m(x) = \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}, \tag{3.8}$$
$$\operatorname{Var}\{\widehat m_a(x)\} = \frac{\sigma^2(x)}{nhf(x)}\Bigl\{\nu_{02} - \frac{C(\delta)}{4} - \frac{D(\delta)}{2}\Bigr\} + o\{h^4 + (nh)^{-1}\}, \tag{3.9}$$
where
$$D(\delta) = \nu_{02} - \frac{1}{4}C(\delta) - \frac{1}{16}\Bigl[4\bigl(1 + \sqrt{2}\bigr)C\bigl(\sqrt{2} - 1, \delta/2\bigr) + \bigl(3 + 2\sqrt{2}\bigr)C\bigl(2 - \sqrt{2}, \delta/2\bigr) + 2C\bigl(\sqrt{2}, \delta/2\bigr) + 4\bigl(1 - \sqrt{2}\bigr)C\bigl(\sqrt{2} + 1, \delta/2\bigr) + \bigl(3 - 2\sqrt{2}\bigr)C\bigl(\sqrt{2} + 2, \delta/2\bigr)\Bigr].$$

PROPOSITION 2. The quantity $D(\delta)$ in (3.9) is nonnegative for any $\delta \ge 0$.

REMARK 3. If $m^{(4)}(x)$ exists, then the second-order term in the bias of $\widehat m_a(x)$ is $O(h^4)$, smaller than those of $\widehat m(x)$ and $\widehat m_\pm(x)$.

REMARK 4. Suppose that $\mathrm{supp}(K) = [-1, 1]$. Then, for $\delta \ge 2$, $C(\delta) = (3/2)\nu_{02}$ and
$$\operatorname{Var}\{\widehat m_q(x)\} \approx \bigl\{1 - \tfrac{3}{2}r^2(1 - r^2)\bigr\}\operatorname{Var}\{\widehat m(x)\}, \tag{3.10}$$
$$\operatorname{Var}\{\widehat m_\pm(x)\} \approx \tfrac{5}{8}\operatorname{Var}\{\widehat m(x)\}.$$
For $\delta \ge 2/(\sqrt{2} - 1)$, $D(\delta) = (5/8)\nu_{02}$ and
$$\operatorname{Var}\{\widehat m_a(x)\} \approx \tfrac{5}{16}\operatorname{Var}\{\widehat m(x)\}. \tag{3.11}$$
If $K$ is infinitely supported, then (3.10) and (3.11) hold for sufficiently large $\delta$.

REMARK 5. If $\delta = o(1)$, then $\widehat m_q(x)$, $\widehat m_\pm(x)$ and $\widehat m_a(x)$ all have the same asymptotic bias and variance as $\widehat m(x)$. If $\delta \to \infty$ with $\delta = o(h^{-1/3})$ and $m'''(x)$ exists, then the biases of $\widehat m_q(x)$ and $\widehat m_\pm(x)$ remain the same as in (3.4) and (3.6) and the variances are as in (3.10). If $\delta \to \infty$ with $\delta = o(h^{-1/2})$ and $m^{(4)}(x)$ exists, then the bias and variance of $\widehat m_a(x)$ are as in (3.8) and (3.11).

From Corollary 1, the pointwise (global) asymptotic efficiency, in terms of asymptotically optimal (integrated) MSE, of $\widehat m_\pm(x)$ relative to $\widehat m(x)$ is
$$\gamma_q(\delta) = \{\nu_{02} - C(\delta)/4\}^{-4/5}\nu_{02}^{4/5}. \tag{3.12}$$
In addition, Theorem 2 implies that the pointwise or global asymptotic relative efficiency of $\widehat m_a(x)$ compared to $\widehat m(x)$ is
$$\gamma_a(\delta) = \{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{-4/5}\nu_{02}^{4/5}. \tag{3.13}$$
Both $\gamma_q(\delta)$ and $\gamma_a(\delta)$ depend only on $K$ and $\delta$ and do not involve the regression function $m$, the design density $f$ or the error distribution. Thus, asymptotically, the variance reduction methods do not interfere with these factors. Figure 1 plots $\gamma_q(\delta)$ and $\gamma_a(\delta)$ against $\delta$ when $K$ is the Uniform, $K(u) = 0.5\,I(|u| < 1)$, Epanechnikov, $K(u) = 0.75(1 - u^2)I(|u| < 1)$, or Normal, $K(u) = (2\pi)^{-1/2}\exp(-u^2/2)$, kernel.

FIG. 1. AMSE relative efficiency. The left and right panels respectively plot $\gamma_q(\delta)$ and $\gamma_a(\delta)$ against $\delta$ for the Uniform (solid), Epanechnikov (dotted) and Normal (dashed) kernels.
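Using the expressions for $C(\delta)$ and $D(\delta)$ stated in Theorems 1 and 2, (3.12) and (3.13) can be evaluated numerically. The following self-contained sketch (Epanechnikov kernel, an illustrative choice) should give, for example, $\gamma_a(1) \approx 1.22$, the value quoted in Section 5, and approach the large-$\delta$ limits $(8/5)^{4/5} \approx 1.46$ and $(16/5)^{4/5} \approx 2.54$ implied by Remark 4.

```python
import numpy as np
from scipy.integrate import quad

SQ2 = np.sqrt(2.0)

def K(u):
    return 0.75 * max(1.0 - u * u, 0.0)          # Epanechnikov kernel

def C_pair(a, delta):
    return quad(lambda t: K(t - a * delta) * K(t + a * delta), -5.0, 5.0)[0]

nu02 = C_pair(0.0, 0.0)                          # int K(s)^2 ds

def C(delta):
    return 1.5 * C_pair(0.0, delta) - 2.0 * C_pair(0.5, delta) + 0.5 * C_pair(1.0, delta)

def D(delta):
    bracket = (4 * (1 + SQ2) * C_pair(SQ2 - 1, delta / 2)
               + (3 + 2 * SQ2) * C_pair(2 - SQ2, delta / 2)
               + 2 * C_pair(SQ2, delta / 2)
               + 4 * (1 - SQ2) * C_pair(SQ2 + 1, delta / 2)
               + (3 - 2 * SQ2) * C_pair(SQ2 + 2, delta / 2))
    return nu02 - C(delta) / 4 - bracket / 16

def gamma_q(delta):
    return (nu02 - C(delta) / 4) ** (-0.8) * nu02 ** 0.8

def gamma_a(delta):
    return (nu02 - C(delta) / 4 - D(delta) / 2) ** (-0.8) * nu02 ** 0.8

print(gamma_q(1.0), gamma_a(1.0))    # approximately 1.20 and 1.22
print(gamma_q(6.0), gamma_a(6.0))    # near the limits 1.46 and 2.54
```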

3.2. Coverage probability. Coverage probabilities of the confidence intervals $I_\beta$ and $\widetilde I_\beta$ for $m(x)$, given in (2.3) and (2.4) and constructed based on $\widehat m(x)$ and $\widehat m_q(x)$, are analyzed and compared as follows.

THEOREM 3. Assume $(C_K)$, $(C_{m,f})$, $h = o(n^{-1/5})$ and $nh/\log n \to \infty$ as $n \to \infty$. Then
$$P\{m(x) \in I_\beta\} = \beta + (nh^5)^{1/2}\,4^{-1}\nu_{21}\nu_{02}^{-1/2}\sigma^{-1}(x)f^{1/2}(x)m''(x)(z_\beta^2 - 3)\phi(z_\beta) - (nh)^{-1/2}\,6^{-1}\nu_{02}^{-3/2}\sigma^{-3}(x)f^{-1/2}(x)V_3(x)\{\nu_{03}(z_\beta^2 - 1) - 3\nu_{02}^2 z_\beta^2\}\phi(z_\beta) + O\{(nh)^{-1} + h^2\}, \tag{3.14}$$
$$P\{m(x) \in \widetilde I_\beta\} = \beta + (nh^5)^{1/2}\,4^{-1}\nu_{21}\tilde\nu_{02}^{-1/2}\sigma^{-1}(x)f^{1/2}(x)m''(x)(z_\beta^2 - 3)\phi(z_\beta) - (nh)^{-1/2}\,6^{-1}\tilde\nu_{02}^{-3/2}\sigma^{-3}(x)f^{-1/2}(x)V_3(x)\{\tilde\nu_{03}(z_\beta^2 - 1) - 3\tilde\nu_{02}^2 z_\beta^2\}\phi(z_\beta) + O\{(nh)^{-1} + h^2\}, \tag{3.15}$$
where $V_3(x) = E[\{Y - m(x)\}^3 \mid X = x]$.

COROLLARY 2. Assume the conditions of Theorem 3. If $\{m''(x)(z_\beta^2 - 3)\}^{-1}\{\tilde\nu_{03}(z_\beta^2 - 1) - 3\tilde\nu_{02}^2 z_\beta^2\} < 0$ and $\{m''(x)(z_\beta^2 - 3)\}^{-1}\{\nu_{03}(z_\beta^2 - 1) - 3\nu_{02}^2 z_\beta^2\} < 0$, then
$$\lambda_\beta(\delta, r) \equiv \lim_{n\to\infty}\frac{\min_h\bigl|P\{m(x) \in I_\beta\} - \beta\bigr|}{\min_h\bigl|P\{m(x) \in \widetilde I_\beta\} - \beta\bigr|} = \left\{\frac{\nu_{03}(z_\beta^2 - 1) - 3\nu_{02}^2 z_\beta^2}{\tilde\nu_{03}(z_\beta^2 - 1) - 3\tilde\nu_{02}^2 z_\beta^2}\right\}^{5/6}\left(\frac{\tilde\nu_{02}}{\nu_{02}}\right)^{4/3}. \tag{3.16}$$

Figure 2 plots $\lambda_{0.95}(\delta, r)$ against $r$ and $\delta$ for the Uniform, Normal and Epanechnikov kernels. Similarly to the mean squared error comparison, $\lambda_{0.95}(\delta, r)$ is always greater than or equal to 1, nondecreasing in $\delta$ both for $r = \pm 1/\sqrt{2}$ and for the best $r$, and settles at an upper limit. Also, $\lambda_{0.95}(\delta, \pm 1/\sqrt{2})$ is very close to the optimal $\max_r \lambda_{0.95}(\delta, r)$. Results for other $\beta$ values are similar. The limit $\lim_{\delta\to\infty}\max_r \lambda_\beta(\delta, r)$ is roughly 1.15 if $\beta \ge 0.9$, 1.18 if $\beta = 0.85$, and 1.24 when $\beta = 0.75$ or 0.7. If $\beta = 0.75$ or 0.7, $\lambda_\beta(\delta, r)$ can be less than 1, but such small confidence levels are rarely used in practice. For any fixed $\delta$, taking $r = \pm 1/\sqrt{2}$, as employed by $\widehat m_\pm(x)$, the coverage accuracy is always not far from optimal. Table 1 gives $\lambda_\beta(\delta, \pm 1/\sqrt{2})$ for different $\beta$ and $\delta$ values. There are significant gains in terms of coverage accuracy for $\delta \ge 1$ when using the Epanechnikov kernel.

FIG. 2. Coverage accuracy relative efficiency. Top row: perspective plots of $\lambda_{0.95}(\delta, r)$. Bottom row: plotted against $\delta$ are $\arg\max_r \lambda_{0.95}(\delta, r)$ (dotted), $1/\sqrt{2}$ (solid), $\max_r \lambda_{0.95}(\delta, r)$ (dotted-dashed) and $\lambda_{0.95}(\delta, \pm 1/\sqrt{2})$ (dashed). From left, the columns correspond to the Uniform, Epanechnikov and Normal kernels.


TABLE 1
Coverage accuracy ratio

                          δ = 0.6   0.8     1.0     1.2     1.6     2.0
Uniform        β = 0.95    1.035   1.047   1.060   1.080   1.116   1.139
               β = 0.90    1.031   1.042   1.054   1.074   1.115   1.152
               β = 0.85    1.027   1.037   1.047   1.067   1.113   1.167
               β = 0.80    1.022   1.031   1.039   1.060   1.112   1.184
Epanechnikov   β = 0.95    1.024   1.045   1.067   1.088   1.123   1.136
               β = 0.90    1.022   1.045   1.072   1.099   1.139   1.151
               β = 0.85    1.021   1.045   1.078   1.110   1.156   1.167
               β = 0.80    1.019   1.045   1.084   1.124   1.177   1.185
Normal         β = 0.95    1.001   1.003   1.006   1.011   1.027   1.047
               β = 0.90    1.001   1.004   1.008   1.015   1.035   1.059
               β = 0.85    1.002   1.004   1.010   1.019   1.042   1.072
               β = 0.80    1.002   1.006   1.013   1.023   1.051   1.086

The entries are $\lambda_\beta(\delta, \pm 1/\sqrt{2})$ for the Uniform, Epanechnikov and Normal kernels; within each kernel the rows correspond to $\beta = 0.95$, 0.9, 0.85 and 0.8.

4. Implementation.

4.1. Bandwidth selection. Bandwidth choice is most crucial in kernel smoothing and dominates the performance. Compared to $\widehat m(x)$, $\widehat m_\pm(x)$ and $\widehat m_a(x)$ do not further complicate the bandwidth selection problem. This property does not necessarily hold for other variance or bias reducing modifications.

First, consider local bandwidths, which are needed when the underlying population has sharp features in the regression, design or error distribution. The optimal local bandwidth that minimizes the asymptotic mean squared error (AMSE) of $\widehat m(x)$ is
$$h_0(x) = \{\sigma^2(x)\nu_{02}\}^{1/5}\{nf(x)m''(x)^2\nu_{20}^2\}^{-1/5},$$
which gives the optimal AMSE
$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\nu_{02}^{4/5}n^{-4/5}. \tag{4.1}$$
For either of $\widehat m_\pm(x)$, the optimal local bandwidth is
$$h_1(x) = \{\nu_{02} - C(\delta)/4\}^{1/5}\nu_{02}^{-1/5}h_0(x), \tag{4.2}$$
yielding the optimal AMSE
$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\{\nu_{02} - C(\delta)/4\}^{4/5}n^{-4/5}. \tag{4.3}$$
The bandwidth that minimizes the AMSE of $\widehat m_a(x)$ is
$$h_a(x) = \{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{1/5}\nu_{02}^{-1/5}h_0(x), \tag{4.4}$$
giving the optimal AMSE
$$1.25\{m''(x)^2\nu_{20}^2\sigma^8(x)/f^4(x)\}^{1/5}\{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{4/5}n^{-4/5}. \tag{4.5}$$

An implication of (4.2) and (4.4) is that, for any given $\delta$ and $K$, after adjusting by an appropriate constant multiplier, any local data-driven bandwidth designed for $\widehat m(x)$ is readily applicable to each of $\widehat m_\pm(x)$ and $\widehat m_a(x)$. In addition, regardless of what the regression, design or error distribution is, the multiplicative factors in (4.2) and (4.4) remain the same for different $x$ values. Automatic local bandwidth selectors for local linear regression include those of [2, 10, 20].

Alternatively, for a generic kernel estimator $\bar m$ of $m$, consider the optimal global bandwidth that minimizes the asymptotic mean integrated squared error (AMISE), that is, the first-order term in the mean integrated squared error $E\int\{\bar m(x) - m(x)\}^2 f(x)\,dx$. The optimal global bandwidths $h_0$, $h_1$ and $h_a$, respectively minimizing the AMISEs of $\widehat m$, $\widehat m_\pm$ and $\widehat m_a$, admit the relations
$$h_1 = \{\nu_{02} - C(\delta)/4\}^{1/5}\nu_{02}^{-1/5}h_0, \tag{4.6}$$
$$h_a = \{\nu_{02} - C(\delta)/4 - D(\delta)/2\}^{1/5}\nu_{02}^{-1/5}h_0. \tag{4.7}$$
Again, any automatic global bandwidth for the local linear estimator (see, e.g., [13, 21]) can be accordingly adjusted for implementation of $\widehat m_\pm$ and $\widehat m_a$.

Note that the constant factors in (4.2) and (4.6) are the same, and those in (4.4) and (4.7) are equal. Thus, bandwidth selection for our estimators is very simple. This advantage arises from the fact that $r$ is kept fixed for all $x$, thereby enhancing AMSE performance uniformly across different $x$, regressions, designs and error distributions.
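In code, the adjustment in (4.6)–(4.7) is a one-line rescaling of whatever bandwidth a local linear selector returns; the sketch below reuses `nu02`, `C` and `D` from the sketch in Section 3, so the numerical values are tied to the Epanechnikov kernel.

```python
def adjusted_bandwidth(h0, delta, average=True):
    """Rescale a local linear bandwidth h0 for m_+/m_- (average=False, eq. (4.6))
    or for the averaged estimator m_a (average=True, eq. (4.7))."""
    factor = nu02 - C(delta) / 4 - (D(delta) / 2 if average else 0.0)
    return (factor / nu02) ** 0.2 * h0
```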

4.2. Bin width. In the construction of $\widehat m_\pm(x)$ and $\widehat m_a(x)$, the equally spaced points $\alpha_{x,0}$, $\alpha_{x,1}$ and $\alpha_{x,2}$ have bin width $\delta > 0$. If $\delta = o(1)$, then there is no variance reduction; see Remark 5. If $\delta$ diverges with $\delta = o(h^{-1/3})$, for $\widehat m_\pm(x)$, or $\delta = o(h^{-1/2})$, for $\widehat m_a(x)$, although it is argued in Remark 5 that the new estimators have good asymptotic properties, there can be adverse effects on the finite sample biases. In applications caution is needed when employing a constant $\delta$. Each of $\widehat m_\pm(x)$ and $\widehat m_a(x)$ uses observations further away from $x$ when larger values of $\delta$ are used, and that may substantially increase the finite sample bias. In general, larger values of $\delta$, for example $\delta \in [1.5, 2]$, are recommended only when $n$ is large or when the estimation problem is not difficult, that is, when the regression function is smooth and the noise level is low. Otherwise, a smaller $\delta$ is preferred. Furthermore, $\delta = 1$ is a good default. Results of a numerical study, reported in Section 5, support these suggestions.

4.3. Computation. A naive way to compute the proposed estimators is to calculate the required local linear estimators and then form the linear combinations. Then the computational effort is increased by some constant factors. This extra burden can be avoided by a careful implementation. Suppose the estimators are computed over an equispaced grid. Letting the spacing of the grid be a multiple of $\omega_n$ reduces the number of kernel evaluations to essentially the same as required by $\widehat m$. In addition, fast implementations of kernel estimators, for example, the fast Fourier transform or binning methods of Fan and Marron [12], can be employed to alleviate the computational effort.

5. Numerical performance. A simulation study was carried out to investigate the finite sample performances. Three regression functions (see [24]) were considered:

1. bimodal, $m(x) = 0.3\exp\{-16(x - 0.25)^2\} + 0.7\exp\{-64(x - 0.75)^2\}$;
2. linear with peak, $m(x) = 2 - 5x + 5\exp\{-400(x - 0.5)^2\}$;
3. sine, $m(x) = \sin(5\pi x)$.

The design was Uniform(0, 1). The random error $\varepsilon$ was Normal(0, 1) distributed and $\sigma(x) = k\sigma_0$ for all $x \in [0, 1]$, where $k = 0.5$, 1 or 2 and $\sigma_0 = 0.1$, $\sqrt{0.5}$ and 0.5 for the bimodal, linear with peak and sine regression functions, respectively. The sample size was 25, 50, 100, 250 or 500. To avoid the data sparseness problem, let $\widehat m_{HT}(x)$ be the local linear estimator employing the interpolation method of Hall and Turlach [15] with the parameter $r$ therein taken as 3. Let $\widehat m_{CH}(x)$ denote the bias reduction method of Choi and Hall [7] applied to $\widehat m_{HT}(x)$ rather than to the local linear estimator. Our variance reduction methods, using $\delta = 0.6$, 0.8, 1, 1.2 or 1.6, were applied to $\widehat m_{HT}(x)$. The Epanechnikov kernel was employed. The bandwidth $h$ ranged over $\{0.008 \cdot 1.1^k,\ k = 0, 1, \ldots, 40\}$. In each setting 1000 samples were simulated. The mean integrated squared error (MISE) of an estimator is approximated by the average of the 1000 numerical integrals of the squared errors. The integrated squared bias (ISB) and integrated variance (IV) are likewise approximated.
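For reference, one configuration of this design can be generated as follows (a sketch; the estimator call assumes the `m_a` routine from Section 2, and the error scale follows the bimodal setting with $k = 1$ and $\sigma_0 = 0.1$):

```python
import numpy as np

def m_bimodal(x):
    return 0.3 * np.exp(-16 * (x - 0.25) ** 2) + 0.7 * np.exp(-64 * (x - 0.75) ** 2)

rng = np.random.default_rng(0)
n, k, sigma0 = 100, 1.0, 0.1
X = rng.uniform(0.0, 1.0, n)
Y = m_bimodal(X) + k * sigma0 * rng.standard_normal(n)

h = 0.008 * 1.1 ** 20                      # one value from the bandwidth grid
grid = np.linspace(0.05, 0.95, 19)
fit = [m_a(x, X, Y, h, delta=1.0) for x in grid]
```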

Figure 3 depicts the MISE, ISB and IV curves under one of the configurations. Our estimator $\widehat m_a$ significantly improves on $\widehat m_{HT}$ in terms of IV. The second-order bias effect of $\widehat m_a$ is more apparent when $h$ is large or $\delta$ is large. The ISB curve of $\widehat m_{CH}$ is much smaller than that of $\widehat m_{HT}$ except when $h$ is large.

FIG. 3. MISE, ISB and IV curves. From left to right are MISE, ISB and IV plotted against $h$ when the regression is bimodal, the design is U(0, 1), $k = 1$ and $n = 100$. The line types solid, long-dashed, dotted, short-dashed, dashed, dotted-spaced and dotted-dashed respectively represent $\widehat m_{HT}$, $\widehat m_{CH}$ and $\widehat m_a$ with $\delta = 0.6$, 0.8, 1, 1.2 and 1.6.

FIG. 4. MISE relative efficiency. The top (bottom) row plots against $n$ the MISE efficiencies relative to $\widehat m_{HT}$ when the regression is the bimodal (sine) function and the design is U(0, 1). From left, the columns correspond to the noise levels $k = 0.5$, 1 and 2. Line types represent the estimators and are as in Figure 3.

Finite sample efficiency of an estimator relative to $\widehat m_{HT}$ can be measured by the ratio of the minimal (over $h$) MISE values. Figure 4 plots some of the results. Our estimator $\widehat m_a$, using any of the $\delta$ values, outperforms $\widehat m_{HT}$ most of the time. The bias reduction estimator $\widehat m_{CH}$ is better than $\widehat m_a$ with $\delta \le 1$ when the regression function is smooth and when $n \ge 100$. Interestingly, even when $n = 500$, $\widehat m_{CH}$ has no significant advantage over $\widehat m_a$ with $\delta = 1.2$ or 1.6. Although we do not report the simulation results for the linear with peak regression function, we observe that, in this case, $\widehat m_{CH}$ performs roughly the same as $\widehat m_{HT}$, but our variance reduction estimator $\widehat m_a$ outperforms $\widehat m_{HT}$. These observations coincide with previous studies on bias reduction methods in that they usually depend on certain higher-order approximations which take effect only when the curve is smooth and $n$ is large. In the high-noise case ($k = 2$), $\widehat m_{HT}$ breaks down and there is not much improvement from employing any modification.

Examining Figure 4 more closely, we draw the following conclusions. First, $\delta = 1$ is a reliable choice for general purposes. Except when the noise level is high ($k = 2$), the relative efficiency for $\delta = 1$ is already above 1.1 from $n = 100$ and is not far from the asymptotic value $\gamma_a(1) \approx 1.22$ (see Figure 1) for moderate sample sizes. In all of the cases, the curves for $\delta = 0.6$ and 0.8 become almost flat starting from $n = 100$. Besides $\delta = 1$, larger $\delta$ values, for example $\delta = 1.2$, may have potential in practice. In general, for smooth curves like the sine function or low noise levels, gains resulting from employing $\delta > 1$ over using $\delta \le 1$ become noticeable for $n \ge 500$.

Although the simulation results for $\widehat m_\pm$ are not given here, the performances of $\widehat m_\pm$ are hard to differentiate from each other, the average version $\widehat m_a$ copes with the second-order bias effect better than $\widehat m_\pm$, and $\widehat m_\pm(x)$ using $\delta \le 1.2$ outperform $\widehat m_{HT}$ most of the time. The simulation study has been summarized in terms of the MISE performances. The pointwise mean squared errors, not reported here, provide similar conclusions. In addition, the MSE relative efficiencies of the proposed estimators are roughly unchanged as $x$ varies. Two truncated Normal designs, $N(0.5, 0.5^2) \cap (0, 1)$ and $N(0, 1) \cap (0, 1)$, were experimented with under all the considered configurations as well. In all cases, the relative MISE efficiency of each of our variance reduction estimators behaves consistently across the three different designs.

6. Generalizations and applications. Generalizations of the methodology are multifold and comparatively easy. First, in kernel estimation of curves and their derivatives using higher-order kernels or by fitting higher-degree local polynomials, a linear combination of some preliminary estimators at more than three neighboring points is required in order that moment conditions, in the same spirit as those in (7.4), are satisfied. See also the asymptotic mean expressions (7.3) and (7.5). Second, extension to estimation of multidimensional surfaces is possible. Cheng and Peng [6] made some progress in this regard: when using product kernels, form a grid in every direction as in the one-dimensional case, and then take the coefficient of one preliminary estimator as the product of all the corresponding one-dimensional coefficients. The asymptotic relative efficiency, in terms of MSE or MISE, compared to the local linear smoother is $\gamma_q(\delta)^d$, and is $\gamma_a(\delta)^d$ for the average version.

The proposed variance reduction strategy is fairly simple, allowing a wide range of potential applications. Variance reduction is particularly useful when very few data points can be used in the estimation, for example, when estimating quantities in conditional distributions. Other useful applications include various local modeling techniques such as local likelihood estimation, varying coefficient models and hazard regression. In situations where the covariance structure of the preliminary estimators is completely analogous to (7.6), direct application is appropriate. Examples include kernel density estimation and are found in local likelihood modeling [19, 26]. Notice that applying the techniques to density estimation can introduce negativity. When applying them to more complicated settings, the covariance structure of the preliminary estimators needs to be investigated.


7. Proofs.

PROOF OF THEOREM 1. Following [9], consider the version of the local linear estimator in (3.1). Denote $Z_n = O_l(a_n)$ if $E|Z_n|^l = O(a_n^l)$ and $Z_n = o_l(a_n)$ if $E|Z_n|^l = o(a_n^l)$. For any fixed $z \in [\alpha_{x,0}, \alpha_{x,2}]$, write $\omega_{i,z} = hK_h(z - X_i)\{S_{n,2}(z) - (z - X_i)S_{n,1}(z)\}$ with $S_{n,j}(z) = h\sum_{i=1}^{n}(z - X_i)^j K_h(z - X_i)$, $j = 0, 1, 2$. Then one can show that
$$n^2h^4\Bigl\{\sum_{i=1}^{n}\omega_{i,z} + n^{-2}\Bigr\}^{-1} = \{\nu_{20}f^2(z)\}^{-1} + o_4(1), \tag{7.1}$$
$$\sum_{i=1}^{n}\{m(X_i) - m(z)\}\omega_{i,z} = n^2h^6 f(z)\nu_{20}S_{n,z} + o_4(n^2h^6), \tag{7.2}$$
where $S_{n,z} = h^{-2}E[\{m(X) - m(z) - m'(z)(X - z)\}K_h(z - X)]$. From (7.1) and (7.2),
$$E\{\widehat m(z)\} = m(z) + \tfrac{1}{2}m''(z)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}. \tag{7.3}$$

Note that, for $j = 0, 1, 2$,
$$A_0(r)(-1 - r)^j + A_1(r)(-r)^j + A_2(r)(1 - r)^j = \delta_{0,j}, \tag{7.4}$$
where $\delta_{0,j} = 1$ if $j = 0$ and $0$ otherwise. Recall that $x = \alpha_{x,1} + r\omega_n \in [\alpha_{x,0}, \alpha_{x,2}]$, $-1 \le r \le 1$, and $\omega_n = \delta h = \alpha_{x,2} - \alpha_{x,1} = \alpha_{x,1} - \alpha_{x,0}$.

Using (7.3), a Taylor expansion and (7.4), we have
$$E\{\widehat m_q(x)\} = \sum_{i=0,1,2} A_i(r)\,E\{\widehat m(\alpha_{x,i})\} = \Bigl\{m(x) + \tfrac{1}{2}m''(x)\nu_{20}h^2\Bigr\}\sum_{i=0,1,2} A_i(r) + m'(x)\sum_{i=0,1,2} A_i(r)(\alpha_{x,i} - x) + \tfrac{1}{2}m''(x)\sum_{i=0,1,2} A_i(r)(\alpha_{x,i} - x)^2 + o\{h^2 + (nh)^{-1/2}\} = m(x) + \tfrac{1}{2}m''(x)\nu_{20}h^2 + o\{h^2 + (nh)^{-1/2}\}. \tag{7.5}$$
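The moment conditions (7.4) underlying this cancellation are straightforward to verify symbolically; a minimal check:

```python
import sympy as sp

r = sp.symbols('r')
A = [r * (r - 1) / 2, 1 - r ** 2, r * (r + 1) / 2]
offsets = [-1 - r, -r, 1 - r]            # (alpha_{x,j} - x) / omega_n, j = 0, 1, 2
for j in range(3):
    print(j, sp.simplify(sum(a * t ** j for a, t in zip(A, offsets))))  # 1, 0, 0
```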

Next we compute the variance of $\widehat m_q(x)$. For any $u, v \in [\alpha_{x,0}, \alpha_{x,2}]$, from (7.1) and (7.3),
$$\operatorname{Cov}\{\widehat m(u), \widehat m(v)\} = \operatorname{Cov}\left\{\frac{\sum_{i=1}^{n}\omega_{i,u}\{Y_i - m(u)\}}{\sum_{i=1}^{n}\omega_{i,u} + n^{-2}},\ \frac{\sum_{i=1}^{n}\omega_{i,v}\{Y_i - m(v)\}}{\sum_{i=1}^{n}\omega_{i,v} + n^{-2}}\right\} + o\Bigl(\frac{1}{n^4h^4}\Bigr).$$

Taking conditional expectations given $X_1, \ldots, X_n$ and using the mean and variance decomposition yield
$$\operatorname{Cov}\{\widehat m(u), \widehat m(v)\} = E\left[\frac{\sum_{i=1}^{n}\omega_{i,u}\{m(X_i) - m(u)\}\sum_{j=1}^{n}\omega_{j,v}\{m(X_j) - m(v)\}}{(\sum_{i=1}^{n}\omega_{i,u} + n^{-2})(\sum_{j=1}^{n}\omega_{j,v} + n^{-2})}\right] - E\left[\frac{\sum_{i=1}^{n}\omega_{i,u}\{m(X_i) - m(u)\}}{\sum_{i=1}^{n}\omega_{i,u} + n^{-2}}\right]E\left[\frac{\sum_{j=1}^{n}\omega_{j,v}\{m(X_j) - m(v)\}}{\sum_{j=1}^{n}\omega_{j,v} + n^{-2}}\right] + E\left[\frac{\sum_{i=1}^{n}\omega_{i,u}\omega_{i,v}\sigma^2(X_i)}{(\sum_{i=1}^{n}\omega_{i,u} + n^{-2})(\sum_{i=1}^{n}\omega_{i,v} + n^{-2})}\right] + o\{(nh)^{-4}\} \tag{7.6}$$
$$= \frac{\sigma^2(x)}{nf(x)}\int K_h(u - t)K_h(v - t)\,dt + o\{h^4 + (nh)^{-1}\},$$
where the last equality follows from (7.1), (7.2) and
$$\sum_{i=1}^{n}\omega_{i,u}\omega_{i,v}\sigma^2(X_i) = n^3h^8\sigma^2(x)\nu_{20}^2 f^2(x)\int K_h(u - t)K_h(v - t)f(t)\,dt\,\bigl\{1 + O_l\bigl(h^{\alpha} + (nh)^{-1/2}\bigr)\bigr\}.$$
Then (3.5) is valid since
$$\operatorname{Var}\{\widehat m_q(x)\} = \sum_{i=0,1,2} A_i(r)^2\operatorname{Var}\{\widehat m(\alpha_{x,i})\} + \sum_{i=0,1,2}\sum_{j \ne i} A_i(r)A_j(r)\operatorname{Cov}\{\widehat m(\alpha_{x,i}), \widehat m(\alpha_{x,j})\}.\qquad\square$$

PROOF OF PROPOSITION 1. Property (a) follows from $\int K(x - \delta/2)K(x + \delta/2)\,dx = \int K(x)K(x - \delta)\,dx = \int K(x)K(x + \delta)\,dx$ and writing
$$C(\delta) = \int\Bigl\{\tfrac{3}{2}K(x)^2 - K(x)K(x + \delta) - K(x)K(x - \delta) + \tfrac{1}{2}K(x - \delta)K(x + \delta)\Bigr\}\,dx = \int\Bigl\{K(x) - \tfrac{1}{2}K(x + \delta) - \tfrac{1}{2}K(x - \delta)\Bigr\}^2\,dx. \tag{7.7}$$
Property (b) can be shown by differentiating the right-hand side of (7.7). $\square$
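The identity in (7.7) and the nonnegativity asserted in Proposition 1(a) can also be checked numerically; a small sketch for the Epanechnikov kernel (illustrative):

```python
from scipy.integrate import quad

def K(u):
    return 0.75 * max(1.0 - u * u, 0.0)      # Epanechnikov kernel

def C_first_form(delta):
    f = lambda x: (1.5 * K(x) ** 2 - K(x) * K(x + delta) - K(x) * K(x - delta)
                   + 0.5 * K(x - delta) * K(x + delta))
    return quad(f, -4.0, 4.0)[0]

def C_second_form(delta):
    f = lambda x: (K(x) - 0.5 * K(x + delta) - 0.5 * K(x - delta)) ** 2
    return quad(f, -4.0, 4.0)[0]

for d in (0.0, 0.5, 1.0, 2.0):
    print(d, C_first_form(d), C_second_form(d))   # the two columns agree and are >= 0
```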

PROOF OF PROPOSITION 2. From (3.7) and (3.9),
$$D(\delta) = \frac{2nhf(x)}{\sigma^2(x)}\Bigl[\tfrac{1}{2}\operatorname{Var}\{\widehat m_+(x)\} + \tfrac{1}{2}\operatorname{Var}\{\widehat m_-(x)\} - \operatorname{Var}\{\widehat m_a(x)\}\Bigr]\{1 + o(1)\} = \frac{nhf(x)}{2\sigma^2(x)}\operatorname{Var}\{\widehat m_+(x) - \widehat m_-(x)\}\{1 + o(1)\}.\qquad\square$$
