Journal of Multivariate Analysis

### Simple and Efficient Improvements of Multivariate Local Linear Regression

Ming-Yen Cheng^{1} and Liang Peng^{2}

Abstract

This paper studies improvements of multivariate local linear regression. Two intuitively appealing variance reduction techniques are proposed. They both yield estimators that retain the same asymptotic conditional bias as the multivariate local linear estimator and have smaller asymptotic conditional variances. The estimators are further examined in the aspects of bandwidth selection, asymptotic relative efficiency and implementation. Their asymptotic relative efficiencies with respect to the multivariate local linear estimator are very attractive and increase exponentially as the number of covariates increases. Data-driven bandwidth selection procedures for the new estimators are straightforward given those for local linear regression. Since the proposed estimators each have a simple form, implementation is easy and requires much less or about the same amount of effort. In addition, boundary corrections are automatic, as in usual multivariate local linear regression.

AMS 1991 subject classification: 62G08; 62G05; 62G20

Keywords. Bandwidth selection, kernel smoothing, local linear regression, multiple regression, nonparametric regression, variance reduction.

1 Department of Mathematics, National Taiwan University, Taipei 106, Taiwan. Email: cheng@math.ntu.edu.tw

2 School of Mathematics, Georgia Institute of Technology, Atlanta GA 30332-0160, USA. Email: peng@math.gatech.edu

### 1 Introduction

Nonparametric regression methods are useful for exploratory data analysis and for representing underlying features that cannot be well described by parametric regression models. In the past two decades, much attention has been paid to local polynomial modeling for nonparametric regression, which was first suggested by Stone (1977) and Cleveland (1979). Fan (1993) and many others investigated its theoretical and numerical properties. Ruppert and Wand (1994) established theoretical results for local polynomial regression with multiple covariates. Wand and Jones (1995), Fan and Gijbels (1996) and Simonoff (1996) provided excellent reviews. We consider reducing variance in multivariate local linear regression. This is of fundamental interest since local linear techniques are very useful and efficient in a wide range of fields, including survival analysis, longitudinal data analysis, and time series modeling.

The nonparametric regression model with multiple covariates is as follows:

$$Y_i = m(X_i) + \nu^{1/2}(X_i)\,\varepsilon_i, \qquad (1.1)$$

where $\{X_i = (X_{i1}, \cdots, X_{id})^T\}_{i=1}^{n}$ are i.i.d. random vectors with density function $f$ and independent of $\varepsilon_1, \cdots, \varepsilon_n$, which are i.i.d. random variables with mean zero and variance one. The local linear estimator of the conditional mean function $m(\cdot)$ at $x = (x_1, \cdots, x_d)^T$ is $\widehat{\alpha}$, the solution for $\alpha$ to the following locally kernel weighted least squares problem

$$\min_{\alpha,\beta} \sum_{i=1}^{n} \bigl\{Y_i - \alpha - \beta^T (X_i - x)\bigr\}^2\, \Pi_{j=1}^{d} K\Bigl(\frac{X_{ij} - x_j}{b_j h}\Bigr), \qquad (1.2)$$

where $K(\cdot)$ is a one-dimensional kernel function, $h > 0$, and $b_i > 0$, $i = 1, \cdots, d$, are constants. Here $b_1, \cdots, b_d$ are tuning parameters which allow us to choose different bandwidths for each direction: in (1.2) the bandwidth for kernel smoothing along the $i$-th covariate is $b_i h$, $i = 1, \cdots, d$. The kernel weight function in (1.2) is taken as a product kernel and the bandwidth matrix $H^{1/2} = \mathrm{diag}\{h b_1, \cdots, h b_d\}$ is diagonal. From standard weighted least squares theory, the local linear estimator is given by

$$\widehat{m}(x) = e^T (X_x^T W_x X_x)^{-1} X_x^T W_x Y, \qquad (1.3)$$

where $e = (1, 0, \cdots, 0)^T$ is a $(d+1)$-vector, $Y = (Y_1, \cdots, Y_n)^T$,

$$W_x = \mathrm{diag}\Bigl\{\Pi_{j=1}^{d} K\Bigl(\frac{X_{1j} - x_j}{b_j h}\Bigr), \cdots, \Pi_{j=1}^{d} K\Bigl(\frac{X_{nj} - x_j}{b_j h}\Bigr)\Bigr\}, \qquad X_x = \begin{pmatrix} 1 & (X_1 - x)^T \\ \vdots & \vdots \\ 1 & (X_n - x)^T \end{pmatrix}.$$
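The weighted least squares solution (1.3) can be computed directly. The following is a minimal sketch, not from the paper; the function name `local_linear` and the Epanechnikov product-kernel choice are our own illustrative assumptions:

```python
import numpy as np

def local_linear(x, X, Y, h, b,
                 kernel=lambda u: np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)):
    """Local linear estimate of m(x) as in (1.3): a product kernel with
    per-coordinate bandwidths b_i * h, then one weighted least squares fit."""
    n, d = X.shape
    # product kernel weights: Pi_j K((X_ij - x_j) / (b_j h))
    w = np.prod(kernel((X - x) / (b * h)), axis=1)
    Xx = np.hstack([np.ones((n, 1)), X - x])      # design matrix X_x
    A = Xx.T @ (w[:, None] * Xx)                  # X_x^T W_x X_x
    beta = np.linalg.solve(A, Xx.T @ (w * Y))     # weighted LS solution
    return beta[0]                                # e^T (...) picks the intercept
```

Because a local linear fit reproduces exactly linear responses regardless of the weights (as long as the local system is nonsingular), the sketch can be sanity-checked on noiseless linear data.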

Define

$$B = \mathrm{diag}\{b_1^2, \cdots, b_d^2\}, \qquad \mu_2(K) = \int s^2 K(s)\, ds, \qquad R(K) = \int K^2(s)\, ds,$$

$$M_2(x) = \begin{pmatrix} \frac{\partial^2}{\partial x_1 \partial x_1} m(x) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_d} m(x) \\ \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial x_d \partial x_1} m(x) & \cdots & \frac{\partial^2}{\partial x_d \partial x_d} m(x) \end{pmatrix}.$$

If $x$ is an interior point, Ruppert and Wand (1994) showed that, under regularity conditions,

$$E\bigl[\widehat{m}(x) - m(x) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{1}{2} h^2 \mu_2(K)\, \mathrm{tr}\bigl\{B M_2(x)\bigr\} + o_p(h^2), \qquad (1.4)$$

$$\mathrm{Var}\bigl[\widehat{m}(x) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{\nu(x)}{n h^d f(x)\, \Pi_{i=1}^{d} b_i}\, R(K)^d \bigl\{1 + o_p(1)\bigr\}. \qquad (1.5)$$

Here, that $x$ is an interior point means that the set $S_{x,K} = \bigl\{(z_1, \cdots, z_d)^T : \Pi_{j=1}^{d} K\bigl((z_j - x_j)/(b_j h)\bigr) > 0\bigr\}$, i.e. the support of the local kernel weight function in the local least squares problem (1.2), is entirely contained in the support of the design density $f$. Expressions (1.4) and (1.5) reveal the behavior of $\widehat{m}(x)$: the conditional variance has a slower rate of convergence as the number of covariates $d$ increases, while the conditional bias is of the same order $h^2$ for any value of $d$. Performance of $\widehat{m}(x)$ can be measured by the asymptotically optimal conditional mean squared error, i.e. the asymptotic conditional mean squared error minimized over all bandwidths. It has the order $n^{-4/(d+4)}$ and deteriorates for larger values of $d$. This is known as the curse of dimensionality. It occurs naturally because, with the same sample size, the design points $X_1, \cdots, X_n$ are much less dense in higher dimensions, so the variance inflates to a slower rate. Therefore, reducing the variance of multivariate local linear regression is very important, and it is investigated in the subsequent sections.

We propose two types of estimators of $m(x)$ that improve the multivariate local linear estimator $\widehat{m}(x)$ by reducing the asymptotic conditional variance while keeping the same asymptotic conditional bias. The first variance reducing estimator is introduced in Section 2.1. It has the very appealing property of achieving variance reduction while requiring much less computational effort, by a factor decreasing exponentially in $d$, than the original local linear estimator. Our second method, proposed in Section 2.2, is even more effective in the sense that its pointwise relative efficiency with respect to $\widehat{m}(x)$ is uniform and equals the best that the first method can achieve at only certain points. The way it is constructed is easily explained via the first method.

Section 2 introduces the variance reducing techniques and investigates the asymptotic conditional biases and variances. Bandwidth selection, the most crucial problem in nonparametric smoothing, is discussed in Section 3. Section 4 studies the asymptotic relative efficiencies and issues such as implementation and boundary corrections. A simulation study and a real application are presented in Section 5. All proofs are given in Section 6.

### 2 Methodology

### 2.1 Method I – Fixed Local Linear Constraints

Let $G = (G_1, \cdots, G_d)^T$ be a vector of odd integers, and for each $i = 1, \cdots, d$ let $\{\alpha_{i,j} : j = 1, \cdots, G_i\}$ be an equally spaced grid of points with bin width

$$\delta_i b_i h = \alpha_{i,j+1} - \alpha_{i,j} \quad \text{for } j = 1, \cdots, G_i - 1,$$

where $\delta_i > 0$, $i = 1, \cdots, d$, are given tuning parameters. In practice, choosing $\delta_i \in [0.5, 1.5]$, $i = 1, \cdots, d$, for moderate sample sizes is preferred. Then $\Lambda = \bigl\{(\alpha_{1,u_1}, \cdots, \alpha_{d,u_d})^T : u_i = 1, \cdots, G_i \text{ for each } i = 1, \cdots, d\bigr\}$ is a collection of grid points in the range $D = [\alpha_{1,1}, \alpha_{1,G_1}] \times \cdots \times [\alpha_{d,1}, \alpha_{d,G_d}] \subset \mathbb{R}^d$. Denote

$$D_v = [\alpha_{1,2v_1}, \alpha_{1,2v_1+2}] \times \cdots \times [\alpha_{d,2v_d}, \alpha_{d,2v_d+2}],$$

where $2v_i \in \{1, 3, \cdots, G_i - 2\}$ for $i = 1, \cdots, d$. Then the $D_v$'s form a partition of $D$, and for any fixed point $x = (x_1, \cdots, x_d)^T \in D$ there exist two vectors $v = (v_1, \cdots, v_d)^T$ and $r = (r_1, \cdots, r_d)^T$, where $r_i \in [-1, 1]$ for $i = 1, \cdots, d$, such that $x$ is expressed as

$$x_i = \alpha_{i,2v_i+1} + r_i \delta_i b_i h \quad \text{for each } i = 1, \cdots, d. \qquad (2.1)$$

So the vector $v$ indicates the subset $D_v$ of $D$ that $x$ belongs to and the vector $r$ marks the location of $x$ relative to the set of grid points that fall within $D_v$, i.e.,

$$\begin{aligned} \Lambda_v &= \Lambda \cap D_v = \Lambda \cap [\alpha_{1,2v_1}, \alpha_{1,2v_1+2}] \times \cdots \times [\alpha_{d,2v_d}, \alpha_{d,2v_d+2}] \\ &= \bigl\{ x^*(k_1, \cdots, k_d) = (\alpha_{1,2v_1+k_1}, \cdots, \alpha_{d,2v_d+k_d})^T : (k_1, \cdots, k_d)^T \in \{0, 1, 2\}^d \bigr\}. \end{aligned} \qquad (2.2)$$

The local linear estimator $\widehat{m}(x)$ involves an inverse operation associated with the local design matrix, in which only a few design points have positive weights; see (1.2) and (1.3). That contributes much instability to $\widehat{m}(x)$. Therefore, our idea of variance reduction in local linear estimation of $m(x)$ at any $x \in D_v \subset D$ is the following. Given $\{\widehat{m}(\alpha) : \alpha \in \Lambda\}$, i.e. the local linear estimates evaluated over $\Lambda$, we form a linear combination of the values $\widehat{m}(\alpha)$, $\alpha \in \Lambda_v \subset \Lambda$, as a new estimate of $m(x)$, instead of recomputing $\widehat{m}(x)$ at $x$ as in (1.3). In this way the resultant estimate is not allowed to differ too much from the values $\widehat{m}(\alpha)$, $\alpha \in \Lambda_v$, where $\Lambda_v$ is a degenerate subset of $D_v$, and its source of variability is restricted to their variances and covariances. In other words, our new estimator at any $x \in D_v$ is constrained by $\widehat{m}(\alpha)$, $\alpha \in \Lambda_v$, and in general will have a smaller variance than $\widehat{m}(x)$. Meanwhile, to keep the asymptotic conditional bias unchanged, the new estimator has to be subject to certain moment conditions. This can be accomplished by forcing the coefficients in the linear combination to fulfill the corresponding requirements.

Formally, put $\delta = (\delta_1, \cdots, \delta_d)^T$ and let

$$A_0(s) = \frac{s(s-1)}{2}, \qquad A_1(s) = 1 - s^2, \qquad A_2(s) = \frac{s(s+1)}{2}. \qquad (2.3)$$
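The weights in (2.3) can be checked numerically: they sum to one, and the first and second moments of the centered node offsets vanish, which is what cancels the extra order-$h$ and order-$h^2$ bias terms discussed below. A small sketch (ours, not from the paper):

```python
import numpy as np

# The three weight functions from (2.3).
def A(k, s):
    return {0: s * (s - 1) / 2, 1: 1 - s**2, 2: s * (s + 1) / 2}[k]

# On a grid of r in [-1, 1], check the properties used for bias
# cancellation: weights sum to one, and the first and second moments
# of the node offsets k - 1 - r vanish identically.
r = np.linspace(-1.0, 1.0, 201)
total = sum(A(k, r) for k in range(3))
m1 = sum((k - 1 - r) * A(k, r) for k in range(3))
m2 = sum((k - 1 - r)**2 * A(k, r) for k in range(3))
assert np.allclose(total, 1.0)
assert np.allclose(m1, 0.0)
assert np.allclose(m2, 0.0)
```

One way to see why these identities hold: $A_0(s)$, $A_1(s)$, $A_2(s)$ are the Lagrange basis polynomials for quadratic interpolation at the nodes $\{0, 1, 2\}$, evaluated at the point $1 + s$.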

Our first variance reduced estimator is defined as

$$\begin{aligned} \widetilde{m}(x; r, \delta) &= \sum_{(k_1,\cdots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\Pi_{i=1}^{d} A_{k_i}(r_i)\Bigr\}\, \widehat{m}\bigl(x^*(k_1, \cdots, k_d)\bigr) \\ &= \sum_{x^*(k_1,\cdots,k_d) \in \Lambda_v} \Bigl\{\Pi_{i=1}^{d} A_{k_i}(r_i)\Bigr\}\, \widehat{m}\bigl(x^*(k_1, \cdots, k_d)\bigr). \end{aligned} \qquad (2.4)$$

That is, $\widetilde{m}(x; r, \delta)$ is a linear combination of $\widehat{m}(\alpha)$, $\alpha \in \Lambda_v \subset \Lambda$, the original local linear estimates at the $3^d$ grid points in $\Lambda_v \subset D_v$, where $x \in D_v$.
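The construction (2.1)–(2.4) can be sketched in one dimension: locate the cell of the grid containing $x$, compute $r$, and combine the three local linear fits at the cell's nodes. This is our own illustrative code (the helper names `loclin` and `method1` are hypothetical), assuming an Epanechnikov kernel:

```python
import numpy as np

def eps_kernel(u):
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def loclin(x0, X, Y, h):
    # one-dimensional local linear fit (1.3) at a single point x0
    w = eps_kernel((X - x0) / h)
    Z = np.column_stack([np.ones_like(X), X - x0])
    return np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y))[0]

def A(k, s):
    return (s * (s - 1) / 2, 1 - s**2, s * (s + 1) / 2)[k]

def method1(x, X, Y, h, delta, a0, G):
    """Variance-reduced estimate (2.4) for d = 1: snap x into its cell of
    the grid alpha_j = a0 + (j - 1) * delta * h, j = 1..G (G odd), and
    combine the three local linear estimates at the cell's nodes."""
    grid = a0 + np.arange(G) * delta * h
    j = int(np.clip((x - a0) // (2 * delta * h), 0, (G - 3) // 2))  # cell index v
    center = grid[2 * j + 1]
    r = (x - center) / (delta * h)               # location r in [-1, 1], as in (2.1)
    nodes = grid[2 * j : 2 * j + 3]              # the 3 grid points of Lambda_v
    return sum(A(k, r) * loclin(nodes[k], X, Y, h) for k in range(3))
```

On noiseless linear data both the local linear fits and the combined estimate are exact, which gives a simple correctness check.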

Since the functions $A_0(s)$, $A_1(s)$ and $A_2(s)$ satisfy $A_0(s) + A_1(s) + A_2(s) = 1$ for all $s \in [-1, 1]$, it is clear from (2.3) and (2.4) that $\widetilde{m}(x; r, \delta) = \widehat{m}(x)$ for all $x \in \Lambda$, and of course $\widetilde{m}(x; r, \delta)$ and $\widehat{m}(x)$ have exactly the same finite and large sample behaviors over $\Lambda$. So the interesting part is $D \setminus \Lambda$, where $\widetilde{m}(x; r, \delta)$ and $\widehat{m}(x)$ are not equal. The fact that $\widetilde{m}(x; r, \delta)$ has a smaller variance than $\widehat{m}(x)$ for all $x \in D \setminus \Lambda$ can be explained in two ways. First, compared to $\widehat{m}(x)$, $\widetilde{m}(x; r, \delta)$ is constructed using more data points, as the collection $\{\widehat{m}(\alpha) : \alpha \in \Lambda_v\}$ is based on observations with $X$-values falling in a larger neighborhood of $x$. Second, $\widetilde{m}(x; r, \delta)$ is constrained by the local linear estimates $\widehat{m}(\alpha)$, $\alpha \in \Lambda_v$, instead of being built from a separate local linear fit or in any modified way, so its source of variation is restricted to the local linear fittings at $\alpha \in \Lambda_v$.

Concerning the bias of $\widetilde{m}(x; r, \delta)$ as an estimator of $m(x)$, the expected values of $\widehat{m}(\alpha)$, $\alpha \in \Lambda_v$, differ from $m(x)$ by more than the usual $h^2$-order bias of $\widehat{m}(x)$. For $\widetilde{m}(x; r, \delta)$ to have the same asymptotic bias as $\widehat{m}(x)$, noting that the points in $\Lambda_v$ are all distant from $x$ at the order $h$, the coefficients in the linear combination defining $\widetilde{m}(x; r, \delta)$ have to sum up to one and cancel the extra order-$h$ and order-$h^2$ biases contributed by $\widehat{m}(\alpha)$, $\alpha \in \Lambda_v$. Since $A_0(s)$, $A_1(s)$ and $A_2(s)$ defined in (2.3) satisfy conditions (6.2), the coefficients $\Pi_{i=1}^{d} A_{k_i}(r_i)$ in the linear combination defining $\widetilde{m}(x; r, \delta)$ fulfill these requirements.

Throughout this paper we assume the following regularity conditions:

(A1) The kernel $K$ is a compactly supported, bounded kernel such that $\mu_2(K) \in (0, \infty)$;

(A2) The point $x$ is in the support of $f$. At $x$, $\nu$ is continuous, $f$ is continuously differentiable and all second-order derivatives of $m$ are continuous. Also, $f(x) > 0$ and $\nu(x) > 0$;

(A3) The bandwidth $h$ satisfies $h = h(n) \to 0$ and $n h^d \to \infty$ as $n \to \infty$.

Our main result is as follows.

Theorem 1. Suppose that $x$ is any point in $D$ with the corresponding vectors $v$ and $r$ as in (2.1). Assume that every element of $\Lambda_v$ is an interior point. Then, under conditions (A1)–(A3), as $n \to \infty$,

$$E\bigl[\widetilde{m}(x; r, \delta) - m(x) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{1}{2} h^2 \mu_2(K)\, \mathrm{tr}\bigl\{B M_2(x)\bigr\} + o_p(h^2), \qquad (2.5)$$

$$\mathrm{Var}\bigl[\widetilde{m}(x; r, \delta) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{\nu(x)}{n h^d f(x)\, \Pi_{i=1}^{d} b_i}\, \Pi_{i=1}^{d} \bigl\{R(K) - r_i^2 (1 - r_i^2) C(\delta_i)\bigr\} + o_p\bigl((n h^d)^{-1}\bigr), \qquad (2.6)$$

where

$$C(s) = \frac{3}{2}\, C(0, s) - 2\, C\Bigl(\frac{1}{2}, s\Bigr) + \frac{1}{2}\, C(1, s), \qquad C(s, t) = \int K(u - st)\, K(u + st)\, du.$$
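The constants $C(\delta_i)$ can be evaluated by numerical quadrature. The sketch below (our own illustrative helpers `C2` and `C`, with the Epanechnikov kernel) checks the properties used later in the paper: $C(0) = 0$, $C(s) > 0$ and increasing for $s > 0$, and, for a kernel supported on $[-1, 1]$, $C(s) = \tfrac{3}{2} R(K)$ once $s \geq 2$, since the covariance terms then vanish:

```python
import numpy as np

def K(u):  # Epanechnikov kernel
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def C2(a, s, ngrid=20001):
    # C(a, s) = \int K(u - a s) K(u + a s) du, by the trapezoidal rule
    u = np.linspace(-3, 3, ngrid)
    return np.trapz(K(u - a * s) * K(u + a * s), u)

def C(s):
    # C(s) = (3/2) C(0, s) - 2 C(1/2, s) + (1/2) C(1, s)
    return 1.5 * C2(0, s) - 2 * C2(0.5, s) + 0.5 * C2(1, s)

R_K = C2(0, 0)                       # R(K) = \int K^2 = 0.6 for Epanechnikov
assert abs(R_K - 0.6) < 1e-5
assert abs(C(0)) < 1e-6              # delta = 0 gives no variance reduction
vals = [C(s) for s in (0.5, 1.0, 1.5, 2.0)]
assert all(v > 0 for v in vals)      # C(s) > 0 for s > 0
assert all(a < b for a, b in zip(vals, vals[1:]))
assert abs(C(2.0) - 1.5 * R_K) < 1e-5   # for s >= 2 the cross terms vanish
```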

Hence, from (1.4), (1.5), (2.5) and (2.6), $\widehat{m}(x)$ and $\widetilde{m}(x; r, \delta)$ have the same asymptotic conditional bias, and their asymptotic conditional variances differ only by the constant factors $\Pi_{i=1}^{d} R(K)$ and $\Pi_{i=1}^{d} \bigl\{R(K) - r_i^2 (1 - r_i^2) C(\delta_i)\bigr\}$. Therefore, the comparison between the asymptotic conditional variances depends on the vector $\delta$ of the bin widths for $\Lambda$ and the vector $r$ indicating the location of $x$ in $D_v \subset D$, but not on the vector $v$. The two quantities $\Pi_{i=1}^{d} R(K)$ and $\Pi_{i=1}^{d} \bigl\{R(K) - r_i^2 (1 - r_i^2) C(\delta_i)\bigr\}$ are equal when $r_i^2 (1 - r_i^2) C(\delta_i) = 0$ for $i = 1, \cdots, d$. Concerning $\delta$, $C(\delta_1) = \cdots = C(\delta_d) = 0$ if and only if $\delta_1 = \cdots = \delta_d = 0$, which corresponds to $\widetilde{m}(x; r, \delta) = \widehat{m}(x)$ for all $x \in D$ and is not meaningful at all. The case that some $\delta_i$ are zero is not of any interest either, since in that case $\Lambda$ is degenerate in the sense that it does not span $D$. So we are only interested in the case where all the bin widths are positive:

$$\delta_i > 0, \quad i = 1, \cdots, d.$$

This condition is assumed throughout this paper. As for $r$, note that $r_i^2 (1 - r_i^2) = 0$ if and only if $r_i \in \{-1, 0, 1\}$. Under the assumption that $\delta_i > 0$, $i = 1, \cdots, d$, the two asymptotic conditional variances are equal if and only if $r_i \in \{-1, 0, 1\}$ for all $i = 1, \cdots, d$, which corresponds to $x \in \Lambda$; this coincides with the fact that $\widetilde{m}(x; r, \delta) = \widehat{m}(x)$ for $x \in \Lambda$. On the other hand, $r_i^2 (1 - r_i^2) > 0$ for all $r_i \in [-1, 1] \setminus \{-1, 0, 1\}$ and, for commonly used kernels, such as the Epanechnikov kernel $K(u) = 0.75 (1 - u^2) I(-1 < u < 1)$ and the Normal kernel $K(u) = \exp(-u^2/2)/\sqrt{2\pi}$, $C(s) > 0$ for all $s > 0$; see Cheng, Peng and Wu (2005). Hence we have, for all $x \in D \setminus \Lambda$,

$$\mathrm{Var}\bigl[\widetilde{m}(x; r, \delta) \,\big|\, X_1, \cdots, X_n\bigr] < \mathrm{Var}\bigl[\widehat{m}(x) \,\big|\, X_1, \cdots, X_n\bigr] \quad \text{asymptotically.}$$

The ratio of the asymptotic conditional variance of $\widetilde{m}(x; r, \delta)$ to that of $\widehat{m}(x)$ is

$$\Pi_{i=1}^{d} \Bigl\{\frac{R(K) - r_i^2 (1 - r_i^2) C(\delta_i)}{R(K)}\Bigr\}.$$

Given $B$ and $\delta$, this ratio, as well as the pointwise asymptotic relative efficiency of $\widetilde{m}(x; r, \delta)$ with respect to $\widehat{m}(x)$, differs as $x$ varies in $D_v$. And, irrespective of the vector $v$, $\widetilde{m}(x; r, \delta)$ attains the most relative variance reduction when $x$ has its associated vector $r = (r_1, \cdots, r_d)^T$ taking the values $r_i = \pm\sqrt{1/2}$ for all $i = 1, \cdots, d$. That is, within each $D_v$, $\widetilde{m}(x; r, \delta)$ is asymptotically most efficient relative to $\widehat{m}(x)$ at the $2^d$ points

$$x = (\alpha_{1,2v_1}, \cdots, \alpha_{d,2v_d})^T + h\bigl((1 + r_1)\delta_1 b_1, \cdots, (1 + r_d)\delta_d b_d\bigr)^T, \quad r \in \{-1/\sqrt{2}, 1/\sqrt{2}\}^d. \qquad (2.7)$$

And the maximum is uniform, hence unique, across all such points and over all subsets $D_v$ of $D$. The asymptotic relative efficiency of $\widetilde{m}(x; r, \delta)$ with respect to $\widehat{m}(x)$ is investigated in further detail in Section 4.1.
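The location of the best points in (2.7) comes from maximizing $r^2(1 - r^2)$ over $[-1, 1]$; a quick numerical confirmation (ours, not from the paper):

```python
import numpy as np

# g(r) = r^2 (1 - r^2) drives the variance reduction factor in (2.6);
# its maximum over [-1, 1] determines the best points (2.7).
r = np.linspace(-1.0, 1.0, 2_000_001)
g = r**2 * (1 - r**2)
assert abs(g.max() - 0.25) < 1e-11            # maximum value is 1/4 ...
r_star = np.abs(r[np.argmax(g)])
assert abs(r_star - np.sqrt(0.5)) < 1e-3      # ... attained near r = +/- 1/sqrt(2)
```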

### 2.2 Method II – Varying Local Linear Constraints

As observed in Section 2.1, our first variance reduced estimator $\widetilde{m}(x; r, \delta)$ improves $\widehat{m}(x)$ in a non-uniform manner as $x$ varies in $D$, and the same best pointwise relative variance reduction occurs at the $2^d$ points given in (2.7) in each subset $D_v$ of $D$. Our second variance reducing estimator is constructed to achieve this best relative efficiency everywhere. The approach is, fixing any vector $r \in \{-1/\sqrt{2}, 1/\sqrt{2}\}^d$ and for each $x$, to evaluate the usual local linear estimates at $3^d$ points surrounding $x$ and then linearly combine these estimates to form a new estimator in the same way as in Section 2.1. But now these $3^d$ neighboring points are determined by $x$ and $r$, and hence differ as $x$ changes.

Suppose that both $B$ and $\delta$ are given positive vectors. Fix any $r = (r_1, \cdots, r_d)^T \in \{-1/\sqrt{2}, 1/\sqrt{2}\}^d$. Then, for every $x$ where $m(x)$ is to be estimated, let

$$\alpha_{x,r} = x - h\bigl((1 + r_1)\delta_1 b_1, \cdots, (1 + r_d)\delta_d b_d\bigr)^T,$$

$$\Lambda_{x,r} = \bigl\{x^*_{k_1,\cdots,k_d}(x; r) = \alpha_{x,r} + h (k_1 \delta_1 b_1, \cdots, k_d \delta_d b_d)^T : (k_1, \cdots, k_d)^T \in \{0, 1, 2\}^d\bigr\}.$$

Define a variance reduced estimator of $m(x)$ as

$$\begin{aligned} \widetilde{m}_r(x; \delta) &= \sum_{(k_1,\cdots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\Pi_{i=1}^{d} A_{k_i}(r_i)\Bigr\}\, \widehat{m}\bigl(x^*_{k_1,\cdots,k_d}(x; r)\bigr) \\ &= \sum_{x^*_{k_1,\cdots,k_d}(x;r) \in \Lambda_{x,r}} \Bigl\{\Pi_{i=1}^{d} A_{k_i}(r_i)\Bigr\}\, \widehat{m}\bigl(x^*_{k_1,\cdots,k_d}(x; r)\bigr). \end{aligned}$$
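In one dimension the construction is particularly simple: the three nodes move with $x$ while $r$ stays fixed. A minimal sketch (our own hypothetical helpers `loclin` and `method2`, Epanechnikov kernel):

```python
import numpy as np

def eps_kernel(u):
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def loclin(x0, X, Y, h):
    # one-dimensional local linear fit at x0
    w = eps_kernel((X - x0) / h)
    Z = np.column_stack([np.ones_like(X), X - x0])
    return np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (w * Y))[0]

def A(k, s):
    return (s * (s - 1) / 2, 1 - s**2, s * (s + 1) / 2)[k]

def method2(x, X, Y, h, delta, r=1 / np.sqrt(2)):
    """Variance-reduced estimate m~_r(x; delta) for d = 1: combine the
    local linear fits at the three moving nodes alpha_{x,r} + k*delta*h."""
    alpha = x - (1 + r) * delta * h          # leftmost node, depends on x
    nodes = alpha + np.arange(3) * delta * h
    return sum(A(k, r) * loclin(nodes[k], X, Y, h) for k in range(3))
```

As with Method I, the combination reproduces linear functions exactly on noiseless data, since $\sum_k k\, A_k(r) = 1 + r$ makes the node average land back at $x$.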

Thus $\widetilde{m}_r(x; \delta)$ is a linear combination of the local linear estimates over $\Lambda_{x,r}$, just as $\widetilde{m}(x; r, \delta)$ is a linear combination of the local linear estimates over $\Lambda_v$, and the coefficients in the two linear combinations are parallel. These facts make clear that $\widetilde{m}_r(x; \delta)$ enjoys the same variance reducing property as $\widetilde{m}(x; r, \delta)$. The main difference between $\widetilde{m}_r(x; \delta)$ and $\widetilde{m}(x; r, \delta)$ is that the set $\Lambda_{x,r}$ in the definition of $\widetilde{m}_r(x; \delta)$ varies as $x$ changes while $r \in \{-1/\sqrt{2}, 1/\sqrt{2}\}^d$ is fixed, whereas the grid $\Lambda_v$ defining $\widetilde{m}(x; r, \delta)$ is fixed for all $x \in D_v = [\alpha_{1,2v_1}, \alpha_{1,2v_1+2}] \times \cdots \times [\alpha_{d,2v_d}, \alpha_{d,2v_d+2}]$ and $r$ depends on $x$; see also (2.7). Again, $\delta_1, \cdots, \delta_d$ are given tuning parameters and, for moderate sample sizes, choosing $\delta_i \in [0.5, 1.5]$, $i = 1, \cdots, d$, is preferred. If the support of $X$ is bounded, say $\mathrm{Supp}(X) = [0, 1]^d$, then to keep $\Lambda_{x,r}$ within $\mathrm{Supp}(X)$, in practice we take

$$\delta_i(x_i) = \min\Bigl\{\delta_i,\; \frac{x_i}{(1 + \sqrt{1/2})\, h},\; \frac{1 - x_i}{(1 + \sqrt{1/2})\, h}\Bigr\}, \quad i = 1, \cdots, d,$$

for given $\delta_1, \cdots, \delta_d$.

The following theorem follows immediately from Theorem 1.

Theorem 2. Suppose that $r$ is any given vector in $\{-1/\sqrt{2}, 1/\sqrt{2}\}^d$ and every element of $\Lambda_{x,r}$ is an interior point. Then, under conditions (A1)–(A3), as $n \to \infty$,

$$E\bigl[\widetilde{m}_r(x; \delta) - m(x) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{1}{2} h^2 \mu_2(K)\, \mathrm{tr}\bigl\{B M_2(x)\bigr\} + o_p(h^2), \qquad (2.8)$$

$$\mathrm{Var}\bigl[\widetilde{m}_r(x; \delta) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{\nu(x)}{n h^d f(x)\, \Pi_{i=1}^{d} b_i}\, \Pi_{i=1}^{d} \bigl\{R(K) - C(\delta_i)/4\bigr\} \bigl\{1 + o_p(1)\bigr\}. \qquad (2.9)$$

Therefore the asymptotic conditional biases of $\widetilde{m}_r(x; \delta)$ and $\widehat{m}(x)$ are again the same, and the ratio of the asymptotic conditional variances is constant over all values of $x$ satisfying the conditions in Theorem 2.

There are $2^d$ such estimators, indexed by $r \in \{-1/\sqrt{2}, 1/\sqrt{2}\}^d$. For a particular value of $r$, since the set $\Lambda_{x,r}$ is skewed around $x$, i.e. the $3^d$ points in $\Lambda_{x,r}$ are asymmetrically distributed about $x$, the finite sample bias of $\widetilde{m}_r(x; \delta)$ may be larger than the asymptotic prediction. A way to avoid potential finite sample biases arising from this skewness is to take the average of all $2^d$ estimates. That is, given $B$ and $\delta$, the averaged variance reduced estimator is defined as

$$\bar{m}(x; \delta) = 2^{-d} \sum_{r \in \{-1/\sqrt{2}, 1/\sqrt{2}\}^d} \widetilde{m}_r(x; \delta),$$

for every $x$. The following theorem is proved in Section 6.

Theorem 3. Suppose that every element of $\Lambda_{x,r}$ is an interior point for every $r \in \{-1/\sqrt{2}, 1/\sqrt{2}\}^d$. Then, under conditions (A1)–(A3), as $n \to \infty$,

$$E\bigl[\bar{m}(x; \delta) - m(x) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{1}{2} h^2 \mu_2(K)\, \mathrm{tr}\bigl\{B M_2(x)\bigr\} + o_p(h^2), \qquad (2.10)$$

$$\mathrm{Var}\bigl[\bar{m}(x; \delta) \,\big|\, X_1, \cdots, X_n\bigr] = \frac{\nu(x)}{n h^d f(x)\, \Pi_{i=1}^{d} b_i}\, \Pi_{i=1}^{d} \Bigl\{R(K) - \frac{C(\delta_i)}{4} - \frac{D(\delta_i)}{2}\Bigr\} + o_p\bigl((n h^d)^{-1}\bigr), \qquad (2.11)$$

where

$$\begin{aligned} D(s) = C(0, s) - \frac{1}{4} C(s) &- \frac{1 + \sqrt{2}}{4}\, C\Bigl(\frac{\sqrt{2} - 1}{2}, s\Bigr) - \frac{3 + 2\sqrt{2}}{16}\, C\Bigl(\frac{2 - \sqrt{2}}{2}, s\Bigr) - \frac{1}{8}\, C\Bigl(\frac{\sqrt{2}}{2}, s\Bigr) \\ &- \frac{1 - \sqrt{2}}{4}\, C\Bigl(\frac{\sqrt{2} + 1}{2}, s\Bigr) - \frac{3 - 2\sqrt{2}}{16}\, C\Bigl(\frac{\sqrt{2} + 2}{2}, s\Bigr). \end{aligned}$$
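$D(s)$ can also be evaluated numerically. The sketch below (our own helpers, Epanechnikov kernel) checks three consequences of the displayed formula: $D(0) = 0$ (the coefficients of the cross terms sum to one), $D(s) \geq 0$ at sampled points, and $D(s) = \tfrac{5}{8} R(K)$ once $s \geq 2/(\sqrt{2} - 1)$, when all cross terms vanish, which matches the $16/5$ efficiency limit quoted in Section 4.1:

```python
import numpy as np

def K(u):
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def C2(a, s, ngrid=24001):
    u = np.linspace(-3, 3, ngrid)
    return np.trapz(K(u - a * s) * K(u + a * s), u)

def C(s):
    return 1.5 * C2(0, s) - 2 * C2(0.5, s) + 0.5 * C2(1, s)

def D(s):
    c = np.sqrt(2)
    return (C2(0, s) - C(s) / 4
            - (1 + c) / 4 * C2((c - 1) / 2, s)
            - (3 + 2 * c) / 16 * C2((2 - c) / 2, s)
            - C2(c / 2, s) / 8
            - (1 - c) / 4 * C2((c + 1) / 2, s)
            - (3 - 2 * c) / 16 * C2((c + 2) / 2, s))

R_K = C2(0, 0)
assert abs(D(0)) < 1e-6                    # no averaging gain when delta = 0
assert all(D(s) > 0 for s in (1.0, 2.0, 3.0))   # D(s) >= 0, as stated below
s_big = 2 / (np.sqrt(2) - 1)               # beyond this, all cross terms vanish
assert abs(D(s_big) - 5 * R_K / 8) < 1e-6
```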

The quantity $D(\delta_i)$ in (2.11) is always nonnegative for $\delta_i \geq 0$; see Cheng, Peng and Wu (2005). Therefore, from (2.8)–(2.11), besides being asymptotically equally biased as $\widetilde{m}_r(x; \delta)$, $\bar{m}(x; \delta)$ has an even smaller asymptotic conditional variance.

### 3 Bandwidth Selection

The asymptotically optimal local bandwidth that minimizes the asymptotic conditional mean squared error of $\widehat{m}(x)$ is

$$h_0(x) = \Bigl[\frac{\nu(x)\, R(K)^d}{n f(x)\, \mu_2(K)^2\, \mathrm{tr}\{B M_2(x)\}^2\, \Pi_{i=1}^{d} b_i}\Bigr]^{1/(d+4)}, \qquad (3.1)$$

and those for $\widetilde{m}(x; r, \delta)$, $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$ are respectively

$$h_1(x) = B_1(x, r; \delta)\, h_0(x), \qquad h_2(x) = B_2(\delta)\, h_0(x), \qquad h_3(x) = B_3(\delta)\, h_0(x), \qquad (3.2)$$

where

$$\begin{aligned} B_1(x, r; \delta) &= \Pi_{i=1}^{d} \Bigl[\bigl\{R(K) - r_i^2 (1 - r_i^2) C(\delta_i)\bigr\} / R(K)\Bigr]^{1/(d+4)}, \\ B_2(\delta) &= \Pi_{i=1}^{d} \Bigl[\bigl\{R(K) - C(\delta_i)/4\bigr\} / R(K)\Bigr]^{1/(d+4)}, \\ B_3(\delta) &= \Pi_{i=1}^{d} \Bigl[\bigl\{R(K) - C(\delta_i)/4 - D(\delta_i)/2\bigr\} / R(K)\Bigr]^{1/(d+4)}. \end{aligned}$$
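Since $B_1$, $B_2$ and $B_3$ are known constants once $K$, $\delta$ and $r$ are fixed, they can be tabulated once and reused. A sketch (our own helper `bandwidth_factors`, Epanechnikov kernel, quadrature-based $C$ and $D$):

```python
import numpy as np

def K(u):
    return np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)

def C2(a, s, n=20001):
    u = np.linspace(-3, 3, n)
    return np.trapz(K(u - a * s) * K(u + a * s), u)

def C(s):
    return 1.5 * C2(0, s) - 2 * C2(0.5, s) + 0.5 * C2(1, s)

def D(s):
    c = np.sqrt(2)
    return (C2(0, s) - C(s) / 4 - (1 + c) / 4 * C2((c - 1) / 2, s)
            - (3 + 2 * c) / 16 * C2((2 - c) / 2, s) - C2(c / 2, s) / 8
            - (1 - c) / 4 * C2((c + 1) / 2, s) - (3 - 2 * c) / 16 * C2((c + 2) / 2, s))

def bandwidth_factors(delta, r):
    """B_1, B_2, B_3 of (3.2): multiply a local bandwidth selector for
    the local linear estimator by these constants."""
    R, d = C2(0, 0), len(delta)
    B1 = np.prod([((R - ri**2 * (1 - ri**2) * C(di)) / R) ** (1 / (d + 4))
                  for ri, di in zip(r, delta)])
    B2 = np.prod([((R - C(di) / 4) / R) ** (1 / (d + 4)) for di in delta])
    B3 = np.prod([((R - C(di) / 4 - D(di) / 2) / R) ** (1 / (d + 4))
                  for di in delta])
    return B1, B2, B3

B1, B2, B3 = bandwidth_factors([1.0, 1.0], [0.3, -0.6])
assert B3 <= B2 <= B1 <= 1.0   # a smaller variance calls for a smaller bandwidth
```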

Many popular and reliable data-driven bandwidth selection rules for kernel smoothing are constructed from the asymptotically optimal bandwidth expressions. Note that the $C(\delta_i)$'s and $D(\delta_i)$'s in (3.2), which relate the asymptotically optimal local bandwidths, are all constants determined by $\delta$ and the kernel $K$. An important implication of (3.2) is therefore that data-based local bandwidth selection for any of the proposed estimators $\widetilde{m}(x; r, \delta)$, $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$ is simply a matter of adjusting any local bandwidth selector for $\widehat{m}(x)$ by multiplying by the appropriate constant factor.

Asymptotically optimal global bandwidths for a kernel estimator $\bar{m}(x; h)$ of $m(x)$ based on bandwidth $h$ are usually derived from global measures of discrepancy such as

$$\mathrm{IAMSE}(\bar{m}; h) = \int_D \mathrm{AMSE}\bigl(\bar{m}(x; h)\bigr)\, f(x)\, w(x)\, dx, \qquad (3.3)$$

where $\mathrm{AMSE}(\bar{m}(x; h))$ is the asymptotic conditional mean squared error of $\bar{m}(x; h)$ and $w(x)$ is a known weight function. Asymptotically optimal global bandwidths for $\widetilde{m}(x; r, \delta)$ and $\widehat{m}(x)$ that minimize this IAMSE measure with respect to $h$ usually do not admit the simple relation their local counterparts have in (3.2). The reason is that the relative variance reduction achieved by $\widetilde{m}(x; r, \delta)$ is non-uniform over every $D_v$. However, suppose that the conditional variance function $\nu(x)$ has low curvature within each $D_v$. Then, since the bin widths $\delta_i b_i h$, $i = 1, \cdots, d$, of $\Lambda$ are all of order $h$, a sensible data-driven global bandwidth for $\widetilde{m}(x; r, \delta)$ can be obtained by multiplying one for $\widehat{m}(x)$ by the constant factor

$$\Bigl[\int_D \Pi_{i=1}^{d} \bigl\{R(K) - r_i^2 (1 - r_i^2) C(\delta_i)\bigr\}\, w(x)\, dx \Big/ \int_D R(K)^d\, w(x)\, dx\Bigr]^{1/(d+4)}; \qquad (3.4)$$

see (1.4), (1.5), (2.5) and (2.6). Hence, the task of automatically choosing a global bandwidth for the proposed estimator $\widetilde{m}(x; r, \delta)$ can be done analogously, or at least with not much more difficulty, as that for $\widehat{m}(x)$.

Denote by $h_0$, $h_2$ and $h_3$ the asymptotically optimal global bandwidths of $\widehat{m}(\cdot)$, $\widetilde{m}_r(\cdot; \delta)$ and $\bar{m}(\cdot; \delta)$. They are defined as the bandwidths minimizing the global measure in (3.3) for the respective estimators, with $D$ being the set of points where the estimators are all defined. Then, from (2.5), (2.6), (2.8)–(2.11),

$$h_2 = B_2(\delta)\, h_0, \qquad h_3 = B_3(\delta)\, h_0, \qquad (3.5)$$

where $B_2(\delta)$ and $B_3(\delta)$ are exactly as given in (3.2). The constants $B_2(\delta)$ and $B_3(\delta)$ depend only on the known kernel function $K$ and the given bin width vector $\delta$. Hence, automatic global bandwidth selection procedures for both $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$ can easily be established by adjusting those for the usual multivariate local linear estimator $\widehat{m}(x)$ by the appropriate constants. For example, Ruppert (1997) and Yang and Tschernig (1999) established automatic bandwidth procedures for multivariate local linear regression. Thus the new estimators $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$ both enjoy the appealing advantage that they achieve a great extent of variance reduction and improvement of efficiency without introducing any further unknown factors into bandwidth selection, the most important issue in nonparametric smoothing.

All the above adjustments are rather easy compared to the bandwidth selection problems created by other modifications of local linear estimation, which usually change both the asymptotic conditional bias and variance. The reason for this major advantage is that the pointwise asymptotic conditional bias is unaffected everywhere, no matter which of the variance reduction methods is applied. An even more appealing feature of both $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$ is that each has its pointwise asymptotic variance equal to the same constant multiple of that of $\widehat{m}(x)$ at all $x$.

### 4 Comparisons

This section examines in detail the impact of the new estimators on several essential issues in nonparametric smoothing, including relative efficiency, implementation and boundary effects.

### 4.1 Relative Efficiencies

The pointwise asymptotically optimal conditional mean squared error of $\widehat{m}(x)$, achieved by $h_0(x)$ in (3.1), is

$$\mathrm{AMSE}\bigl(\widehat{m}(x); h_0(x)\bigr) = \frac{5}{4} \Bigl\{\frac{\nu(x)\, R(K)^d}{n f(x)\, \Pi_{i=1}^{d} b_i}\Bigr\}^{4/(d+4)} \Bigl[\mu_2(K)^2\, \mathrm{tr}\{B M_2(x)\}^2\Bigr]^{d/(d+4)}. \qquad (4.1)$$

The asymptotically optimal local bandwidths in (3.2) respectively yield the pointwise asymptotically optimal mean squared errors of $\widetilde{m}(x; r, \delta)$, $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$ as

$$\begin{aligned} \mathrm{AMSE}\bigl(\widetilde{m}(x; r, \delta); h_1(x)\bigr) &= B_1(x; r, \delta)^4\, \mathrm{AMSE}\bigl(\widehat{m}(x); h_0(x)\bigr), \\ \mathrm{AMSE}\bigl(\widetilde{m}_r(x; \delta); h_2(x)\bigr) &= B_2(\delta)^4\, \mathrm{AMSE}\bigl(\widehat{m}(x); h_0(x)\bigr), \\ \mathrm{AMSE}\bigl(\bar{m}(x; \delta); h_3(x)\bigr) &= B_3(\delta)^4\, \mathrm{AMSE}\bigl(\widehat{m}(x); h_0(x)\bigr). \end{aligned}$$

Thus, given $B$ and $\delta$, the pointwise asymptotic relative efficiencies of the new estimators with respect to $\widehat{m}(x)$ are

$$\mathrm{Eff}\bigl(\widetilde{m}(x; r, \delta), \widehat{m}(x)\bigr) = B_1(x; r, \delta)^{-4}, \qquad (4.2)$$

$$\mathrm{Eff}\bigl(\widetilde{m}_r(x; \delta), \widehat{m}(x)\bigr) = B_2(\delta)^{-4}, \qquad \mathrm{Eff}\bigl(\bar{m}(x; \delta), \widehat{m}(x)\bigr) = B_3(\delta)^{-4}, \qquad (4.3)$$

for every $x$. Similarly, taking the asymptotically optimal global bandwidths in (3.5) yields the global asymptotic relative efficiencies of $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$ with respect to $\widehat{m}(x)$:

$$\mathrm{Eff}\bigl(\widetilde{m}_r(\cdot; \delta), \widehat{m}(\cdot)\bigr) = B_2(\delta)^{-4}, \qquad \mathrm{Eff}\bigl(\bar{m}(\cdot; \delta), \widehat{m}(\cdot)\bigr) = B_3(\delta)^{-4}. \qquad (4.4)$$

Therefore, for each of $\widetilde{m}_r(x; \delta)$ and $\bar{m}(x; \delta)$, the local and global asymptotic relative efficiencies compared to $\widehat{m}(x)$ are the same and depend only on $\delta$ and $K$. The global asymptotic relative efficiency of $\widetilde{m}(x; r, \delta)$ with respect to $\widehat{m}(x)$ has a more complex form since the pointwise relative variance reduction is non-uniform. However, it is well approximated by a simple expression, not elaborated here, arising from the constant adjustment (3.4) to the global bandwidth.

Rewrite the pointwise and global asymptotic relative efficiencies $B_2(\delta)^{-4}$ and $B_3(\delta)^{-4}$, in (4.3) and (4.4), as

$$B_2(\delta)^{-4} = \Pi_{i=1}^{d}\, S(\delta_i)^{-4/(d+4)}, \qquad B_3(\delta)^{-4} = \Pi_{i=1}^{d}\, T(\delta_i)^{-4/(d+4)},$$

where $S(u) = \bigl\{R(K) - C(u)/4\bigr\}/R(K)$ and $T(u) = \bigl\{R(K) - C(u)/4 - D(u)/2\bigr\}/R(K)$.

Consider the simplest case where $\delta_1 = \cdots = \delta_d = \delta_0$. Then

$$\mathrm{Eff}\bigl(\widetilde{m}_r(x; \delta), \widehat{m}(x)\bigr) = \mathrm{Eff}\bigl(\widetilde{m}_r(\cdot; \delta), \widehat{m}(\cdot)\bigr) = S(\delta_0)^{-4d/(d+4)},$$

$$\mathrm{Eff}\bigl(\bar{m}(x; \delta), \widehat{m}(x)\bigr) = \mathrm{Eff}\bigl(\bar{m}(\cdot; \delta), \widehat{m}(\cdot)\bigr) = T(\delta_0)^{-4d/(d+4)}. \qquad (4.5)$$

For commonly used kernels, $S(\delta_0)^{-1}$ is greater than one for all $\delta_0 > 0$, increases in $\delta_0$, and has an upper limit $8/5$. Also, $T(\delta_0)^{-1}$ is greater than one for all $\delta_0 > 0$, increases in $\delta_0$, and has an upper limit $16/5$. If $K$ is supported on $[-1, 1]$, the respective upper limits occur when $\delta_0 = 2$ and $\delta_0 = 2/(\sqrt{2} - 1)$, in which case the variances of the proposed estimators consist only of variances, and no covariances, of the local linear estimates in the linear combinations. Then, from (4.5), an important property of our variance reduction techniques is that the relative efficiency improvements increase as the dimensionality $d$ of the covariates $X = (X_1, \cdots, X_d)^T$ grows. Table 1 contains some values of $S(\delta_0)^{-4d/(d+4)}$ and $T(\delta_0)^{-4d/(d+4)}$ when $K$ is the Epanechnikov kernel.

Table 1: Relative efficiencies $\mathrm{Eff}\bigl(\widetilde{m}_r(\cdot; \delta), \widehat{m}(\cdot)\bigr)$ (upper half) and $\mathrm{Eff}\bigl(\bar{m}(\cdot; \delta), \widehat{m}(\cdot)\bigr)$ (lower half) when $\delta_1 = \cdots = \delta_d = \delta_0$ and when $K$ is the Epanechnikov kernel.

| $\delta_0$ | d=1 | d=2 | d=3 | d=4 | d=5 |
|---|---|---|---|---|---|
| 0.6 | 1.064 | 1.110 | 1.143 | 1.169 | 1.189 |
| 0.8 | 1.088 | 1.151 | 1.198 | 1.235 | 1.264 |
| 1.0 | 1.113 | 1.195 | 1.257 | 1.306 | 1.345 |
| 1.2 | 1.166 | 1.292 | 1.391 | 1.469 | 1.533 |
| 1.6 | 1.293 | 1.535 | 1.735 | 1.902 | 2.043 |
| 2.0 | 1.456 | 1.871 | 2.238 | 2.560 | 2.842 |
| 0.6 | 1.089 | 1.153 | 1.201 | 1.238 | 1.268 |
| 0.8 | 1.121 | 1.210 | 1.278 | 1.332 | 1.375 |
| 1.0 | 1.168 | 1.295 | 1.394 | 1.473 | 1.538 |
| 1.2 | 1.237 | 1.425 | 1.576 | 1.700 | 1.804 |
| 1.6 | 1.393 | 1.737 | 2.033 | 2.288 | 2.509 |
| 2.0 | 1.580 | 2.142 | 2.663 | 3.136 | 3.560 |
| $2/(\sqrt{2}-1)$ | 2.536 | 4.716 | 7.345 | 10.240 | 13.260 |
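The limiting rows of Table 1 follow directly from the stated upper limits $8/5$ and $16/5$ raised to the power $4d/(d+4)$ from (4.5). A quick arithmetic check (ours):

```python
# The delta_0 = 2 row (upper half) should equal (8/5)^{4d/(d+4)}, and
# the last row (lower half) should equal (16/5)^{4d/(d+4)}.
upper = [(8 / 5) ** (4 * d / (d + 4)) for d in range(1, 6)]
lower = [(16 / 5) ** (4 * d / (d + 4)) for d in range(1, 6)]
table_upper = [1.456, 1.871, 2.238, 2.560, 2.842]
table_lower = [2.536, 4.716, 7.345, 10.240, 13.260]
assert all(abs(a - b) < 5e-3 for a, b in zip(upper, table_upper))
assert all(abs(a - b) < 5e-3 for a, b in zip(lower, table_lower))
```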

### 4.2 Implementation

Since there are no parametric structural assumptions on the unknown regression function $m(\cdot)$ in nonparametric smoothing, nonparametric regression estimators are usually evaluated over a range of $x$-values in practice. Consider computing estimators of $m(x)$ over a fine grid $\Lambda_0$ to provide a sensible picture of the regression function $m(\cdot)$ over a range of interest. Then the local linear estimator $\widehat{m}(x)$ requires performing the local least squares fitting (1.2) at every $x \in \Lambda_0$.

To compute our first variance reduced estimator $\widetilde{m}(x; r, \delta)$ for all $x \in \Lambda_0$ one can proceed as follows. Form $D$ and $\Lambda$ as in Section 2.1 such that $D$ covers $\Lambda_0$ and $\Lambda$ is coarser than $\Lambda_0$, and then compute the estimates $\widetilde{m}(x; r, \delta)$, $x \in \Lambda_0$, using $\widehat{m}(\alpha)$, $\alpha \in \Lambda$, as described in Section 2.1. One appealing feature of $\widetilde{m}(x; r, \delta)$ is that it is very easy to implement, since the coefficients in the linear combination are just products of the simple one-dimensional functions $A_0$, $A_1$ and $A_2$. Therefore, implemented in this way, $\widetilde{m}(x; r, \delta)$ amounts to a considerable saving of computational time compared to $\widehat{m}(x)$.

This is a particularly important advantage in the multivariate case. For example, if the number of grid points in each dimension of $\Lambda$ is a fixed proportion of that in $\Lambda_0$, then the number of evaluations of the local linear estimator is reduced exponentially as the dimension $d$ increases. Hence, the better asymptotic conditional mean squared error performance of $\widetilde{m}(x; r, \delta)$ compared to $\widehat{m}(x)$ comes at virtually no cost, but rather with a great saving of computational effort.
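The operation-count argument above can be made concrete. Assuming (hypothetically) that the coarse grid $\Lambda$ has a fixed fraction $1/c$ of the fine grid's points per axis:

```python
# Rough counts of local least squares fits for evaluation over a fine
# grid Lambda_0 with N points per axis, versus reusing fits on a coarser
# grid Lambda with N // c points per axis (c is an illustrative factor).
def n_fits_direct(N, d):
    return N ** d                 # one local linear fit per fine-grid point

def n_fits_method1(N, d, c=4):
    return (N // c) ** d          # fits only at the coarse-grid points

for d in (1, 2, 3, 4):
    # the saving factor grows like c**d with the dimension
    assert n_fits_method1(100, d) * 4 ** d == n_fits_direct(100, d)
```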

The estimators m̃_r(x, δ) and m̄(x, δ) are more computationally involved. For example, in order to evaluate m̃_r(x, δ) at each x, one needs to calculate the 3^d local linear estimates m̂(α), α ∈ Λ^{x,r}. In a naive implementation, that requires 3^d times the effort of m̂(x). Fortunately, there are ways to avoid such an increase of computational effort. Suppose again that m̃_r(x, δ) is to be evaluated over a grid Λ₀. One approach is to take the bin widths of the grid Λ₀ in proportion to the bin widths δ_i b_i h, i = 1, …, d, of Λ^{x,r}, so that m̂(α), α ∈ Λ^{x,r}, can be reused for other values of x ∈ Λ₀. Also, the set of coefficients in the linear combination is the same for all x ∈ Λ₀ and so needs to be evaluated only once. Hence, in this manner, m̃_r(x, δ) is computed with about the same amount of effort as the local linear estimator m̂(x). The estimator m̄(x, δ) can be constructed in a similar way to alleviate the computational effort.
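The accounting behind this reuse argument can be checked with a small counting sketch (ours, not from the paper; `n_cells` coarse cells per dimension, each refined `refine`-fold, is a hypothetical grid layout). When the fine grid is aligned with the node spacing, nodes are shared between neighboring evaluation points and each local least squares fit is performed once:

```python
import numpy as np

def fits_needed(n_cells, refine, d, shared=True):
    """Count the local least squares fits needed to evaluate the estimator
    over a fine grid with n_cells coarse cells per dimension, each split
    into `refine` fine points. With an aligned ('shared') fine grid, each
    node is fitted once and the count does not grow with `refine`; naively,
    every fine grid point pays 3**d fits of its own."""
    if not shared:
        return (n_cells * refine) ** d * 3 ** d
    nodes = set()
    for cell in np.ndindex(*([n_cells] * d)):   # coarse cell of each fine x
        for ks in np.ndindex(*([3] * d)):       # the 3^d nodes that x uses
            nodes.add(tuple(cell[i] - 1 + ks[i] for i in range(d)))
    return len(nodes)
```

For instance, with 10 cells per dimension, a 4-fold refinement and d = 2, sharing needs 144 fits while the naive scheme needs 14400.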

### 4.3 Behaviors at Boundary Regions

One reason that the local linear technique is very popular in practice and in many contexts is that it performs boundary correction automatically. For instance, when x is a boundary point, the conditional bias and variance of m̂(x) are both kept at the same orders as in the interior. Theorem 2.2 of Ruppert and Wand (1994) formally defines a boundary point in multivariate local linear regression and provides asymptotic expressions for the conditional bias and variance. The theorem shows that only the constant factors involving K and x change, and these constant factors depend on how far, relative to the bandwidth matrix, x is away from the boundary.

Clearly, from the form of m̃(x; r, δ), Theorem 1 can be extended to the case where Λ_v is not entirely contained in the interior, so that asymptotic results for the conditional bias and variance of m̃(x; r, δ) are available for every x ∈ D. This is not elaborated here because it is straightforward but notationally much more cumbersome. One can show that the conditional bias and variance are of the same orders at all x ∈ D as long as Λ_v is in the support of f, which is of course true in general. Therefore, m̃(x; r, δ) also achieves automatic boundary correction.

For a boundary point x, comparison between the conditional variances of m̂(x) and m̃(x; r, δ) becomes tedious as the constant coefficients are both very complex. However, we can argue that Var{m̃(x; r, δ) | X₁, …, Xₙ} is again asymptotically smaller than Var{m̂(x) | X₁, …, Xₙ} at boundary regions in the following way. Our estimator m̃(x; r, δ) is a linear combination of m̂(α), α ∈ Λ_v. It is well known that Var{m̂(α) | X₁, …, Xₙ} is much smaller than Var{m̂(x) | X₁, …, Xₙ} for those α ∈ Λ_v that are farther away from the boundary than x. The weight in m̃(x; r, δ) put on any m̂(α) with α closer to the boundary than x becomes close to one only when x is right nearby it. Otherwise, m̃(x; r, δ) spreads its weights on m̂(α) for those α ∈ Λ_v farther away from the boundary than x.

The constant multiplier in the asymptotic expression of the squared conditional bias E²{m̂(x) − m(x) | X₁, …, Xₙ} changes and generally becomes smaller as x moves from the interior to the boundary region. Therefore, m̃(x; r, δ) no longer has the same asymptotic conditional bias as m̂(x) at a boundary point x. However, it is generally true that, as x moves more toward the boundary, the decrease in E²{m̂(x) − m(x) | X₁, …, Xₙ} is much less than the increase in Var{m̂(x) | X₁, …, Xₙ}. Then the difference between E²{m̃(x; r, δ) − m(x) | X₁, …, Xₙ} and E²{m̂(x) − m(x) | X₁, …, Xₙ} is small compared to the variance reduction yielded by m̃(x; r, δ). This implies that the asymptotic conditional mean squared error of m̃(x; r, δ) is again smaller than that of m̂(x) when x is a boundary point.

Consider the behaviors of m̃_r(x; δ) and m̄(x; δ) in boundary regions. Suppose that Λ^{x,r} is entirely contained in the support of the design density f. Then the above arguments can be used to show that m̃_r(x; δ) and m̄(x; δ) also have the automatic boundary correction property, and that they have smaller asymptotic mean squared errors than m̂(x) in boundary regions.

Asymptotic behaviors of the conditional bias and variance of the local linear estimator m̂(x) when x is near the boundary are illustrated and discussed in detail by Fan and Gijbels (1992), Ruppert and Wand (1994) and Fan and Gijbels (1996), among others.
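The boundary variance inflation of the local linear estimator, which drives the argument above, is easy to check numerically. The sketch below (our own illustration; the uniform design, bandwidth and evaluation points are arbitrary choices) computes the exact conditional variance factor σ⁻² Var{m̂(x) | X₁, …, Xₙ} from the weighted least squares form of the fit and compares an interior point with a boundary point:

```python
import numpy as np

def biweight(u):
    # K(u) = (15/16)(1 - u^2)^2 I(|u| <= 1)
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u ** 2) ** 2, 0.0)

def cond_var_factor(x, X, h):
    """sigma^{-2} Var{m-hat(x) | X_1,...,X_n} for the local linear fit,
    i.e. the (0,0) entry of (D'WD)^{-1} (D'W^2 D) (D'WD)^{-1}."""
    Z = X - x                                    # centred design, n x d
    w = np.prod(biweight(Z / h), axis=1)         # product kernel weights
    D = np.hstack([np.ones((len(X), 1)), Z])     # design matrix (1, X - x)
    S = np.linalg.inv(D.T @ (D * w[:, None]))
    M = D.T @ (D * (w ** 2)[:, None])
    return (S @ M @ S)[0, 0]

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))                  # uniform design on [0, 1]^2
h = 0.2
v_interior = cond_var_factor(np.array([0.5, 0.5]), X, h)
v_boundary = cond_var_factor(np.array([0.0, 0.5]), X, h)
```

With a design like this, the boundary factor exceeds the interior one by a sizeable margin, in line with the qualitative discussion; exact numbers depend on the realized design.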

### 5 Simulation study and real application

### 5.1 Simulation Study

Consider model (1.1) with d = 2, ν(x) = 1 and m(x₁, x₂) = sin(2πx₁) + sin(2πx₂). Let X_{i1} and X_{i2} be independent, each uniformly distributed on [0, 1], and let ε_i have a standard normal distribution. We drew samples from the above model with sample sizes n = 400 and n = 1000. In computing m̂(x₁, x₂) and m̄(x₁, x₂; δ), we employed the biweight kernel K(u) = (15/16)(1 − u²)² I(|u| ≤ 1), h = 0.15 and 0.2, b₁ = b₂ = 1, and δ_i = min{1, x_i/[(1 + 1/√2)h], (1 − x_i)/[(1 + 1/√2)h]} for i = 1, 2. In Figures 1-4, the natural logarithm of the ratio of the mean squared error of m̄(x₁, x₂; δ) to that of m̂(x₁, x₂) is plotted against x₁ = 0, 0.05, …, 1 and x₂ = 0, 0.05, …, 1, and also against x₂ = 0, 0.05, …, 1 with x₁ fixed at 0.2, 0.5, 0.8. These figures show that, in all cases considered, m̄(x₁, x₂; δ) has a significantly smaller mean squared error than m̂(x₁, x₂) when (x₁, x₂) is an interior point. Occasionally, the mean squared error of m̄(x₁, x₂; δ) is slightly larger than that of m̂(x₁, x₂) at some boundary points. Boundary behavior is not a main issue in nonparametric multivariate regression since the data contain very little information about the regression surface there.
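The local linear fit underlying this simulation can be sketched as follows (a minimal illustration of ours, not the authors' code, assuming a common bandwidth h in both coordinates and the product biweight kernel above). As a sanity check, local linear regression reproduces an exactly linear surface regardless of the kernel weights:

```python
import numpy as np

def biweight(u):
    # K(u) = (15/16)(1 - u^2)^2 I(|u| <= 1), the kernel used in Section 5.1
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u ** 2) ** 2, 0.0)

def local_linear(x, X, Y, h):
    """Local linear estimate of m at x: weighted least squares fit of Y on
    (1, X - x) with product biweight weights; the intercept estimates m(x)."""
    Z = X - x                                    # centred design, n x d
    w = np.prod(biweight(Z / h), axis=1)         # product kernel weights
    D = np.hstack([np.ones((len(Y), 1)), Z])     # design matrix (1, X - x)
    beta = np.linalg.solve(D.T @ (D * w[:, None]), D.T @ (Y * w))
    return beta[0]                               # intercept = m-hat(x)

# sanity check: noiseless linear data are recovered exactly
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2))
Y = 1.0 + 2.0 * X[:, 0] - X[:, 1]
est = local_linear(np.array([0.5, 0.5]), X, Y, h=0.2)
```

Here `est` equals the true surface value 1.5 up to machine precision, since the weighted least squares fit is exact on data that are linear in the covariates.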

### 5.2 Real application

We applied the local linear estimator m̂(·) and our variance reduced estimator m̄(·; δ) to the Boston housing price data set. This data set consists of the median value of owner-occupied homes in 506 U.S. census tracts in the Boston area in 1970, together with several variables which might explain the variation of housing value; see Harrison and Rubinfeld (1978). Here we fit model (1.1) to the median values of homes with two covariates, x₁ = LSTAT (lower status of the population) and x₂ = PTRATIO (pupil-teacher ratio by town). We computed the estimators m̂(x₁, x₂) and m̄(x₁, x₂; δ) by taking the same set of tuning parameters δ_i, i = 1, 2, and the same kernel as in Section 5.1; h = 1, b₁ = 1.5 and b₂ = 0.5 were employed here. These estimators are plotted in Figure 5. Close to the boundary PTRATIO = 0, mainly PTRATIO less than 1.8, the two estimators behave quite differently since the regression surface changes drastically in that boundary region. The spikes in m̄(x₁, x₂; δ) would disappear if smaller values of δ_i, i = 1, 2, were used. For PTRATIO > 3 or LSTAT > 10, our estimator m̄(x₁, x₂; δ) effectively smooths out the spurious bumps produced by m̂(x₁, x₂).

### 6 Proofs

*Proof of Theorem 1.* Put $\Delta = \bigl((k_1 - 1 - r_1)\delta_1 b_1, \dots, (k_d - 1 - r_d)\delta_d b_d\bigr)^T$. Then $x^*(k_1, \dots, k_d) - x = h\Delta$. Further,

$$m\bigl(x^*(k_1, \dots, k_d)\bigr) - m(x) = h\Delta^T M_1(x) + \tfrac{1}{2} h^2 \Delta^T M_2(x)\Delta + o(h^2), \qquad (6.1)$$

where $M_1(x) = \bigl(\tfrac{\partial}{\partial x_1} m(x), \dots, \tfrac{\partial}{\partial x_d} m(x)\bigr)^T$. Since, for any $s \in [-1, 1]$,

$$\sum_{j \in \{0,1,2\}} A_j(s) = 1\,, \qquad \sum_{j \in \{0,1,2\}} A_j(s)(j - 1 - s)^l = 0\,, \quad l = 1, 2\,, \qquad (6.2)$$
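For fixed $s$, the three conditions in (6.2) form a linear system of Vandermonde type in $A_0(s), A_1(s), A_2(s)$ at the distinct nodes $-1-s, -s, 1-s$, so they determine the weights uniquely. Solving it gives the quadratic interpolation coefficients (this explicit form is our derivation from (6.2), recorded here for reference):

$$A_0(s) = \frac{s(s-1)}{2}\,, \qquad A_1(s) = 1 - s^2\,, \qquad A_2(s) = \frac{s(s+1)}{2}\,,$$

which indeed satisfy $\sum_j A_j(s) = 1$ and $\sum_j A_j(s)(j - 1 - s)^l = 0$ for $l = 1, 2$.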

we have

$$\sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\} = 1,$$

and hence

$$
\begin{aligned}
&E\bigl\{\tilde m(x; r, \delta) - m(x) \mid X_1, \dots, X_n\bigr\} \\
&\quad= \sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\}\, E\bigl\{\hat m\bigl(x^*(k_1, \dots, k_d)\bigr) - m\bigl(x^*(k_1, \dots, k_d)\bigr) \mid X_1, \dots, X_n\bigr\} \\
&\qquad+ \sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\} \bigl\{m\bigl(x^*(k_1, \dots, k_d)\bigr) - m(x)\bigr\} \\
&\quad= \sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\} \Bigl[\tfrac{1}{2}\mu_2(K) h^2\, \mathrm{tr}\bigl\{B M_2(x)\bigr\} + o_p\bigl(h^2\bigr)\Bigr] \\
&\qquad+ \sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\} \Bigl\{h\Delta^T M_1(x) + \tfrac{1}{2} h^2 \Delta^T M_2(x)\Delta + o\bigl(h^2\bigr)\Bigr\} \\
&\quad= \tfrac{1}{2}\mu_2(K)\, h^2\, \mathrm{tr}\bigl\{B M_2(x)\bigr\} + h E_1 + \tfrac{1}{2} h^2 E_2 + o_p\bigl(h^2\bigr),
\end{aligned}
$$

where

$$E_1 = \sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\}\, \Delta^T M_1(x)\,, \qquad
E_2 = \sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\}\, \Delta^T M_2(x)\Delta\,.$$

Then (2.5) follows if $E_1 = E_2 = 0$, which can be validated by showing that

$$
\begin{aligned}
E_1 &= \sum_{(k_1,\dots,k_d)^T \in \{0,1,2\}^d} \Bigl\{\prod_{i=1}^{d} A_{k_i}(r_i)\Bigr\} \sum_{j=1}^{d} (k_j - 1 - r_j)\delta_j b_j \frac{\partial}{\partial x_j} m(x) \\
&= \sum_{(k_1,\dots,k_{d-1})^T \in \{0,1,2\}^{d-1}} \Bigl\{\prod_{i=1}^{d-1} A_{k_i}(r_i)\Bigr\} \Bigl\{\sum_{k_d \in \{0,1,2\}} A_{k_d}(r_d) \sum_{j=1}^{d} (k_j - 1 - r_j)\delta_j b_j \frac{\partial}{\partial x_j} m(x)\Bigr\} \\
&= \sum_{(k_1,\dots,k_{d-1})^T \in \{0,1,2\}^{d-1}} \Bigl\{\prod_{i=1}^{d-1} A_{k_i}(r_i)\Bigr\} \Bigl\{\sum_{j=1}^{d-1} (k_j - 1 - r_j)\delta_j b_j \frac{\partial}{\partial x_j} m(x)\Bigr\},
\end{aligned}
$$