
ERROR-DEPENDENT SMOOTHING RULES IN LOCAL LINEAR REGRESSION

Ming-Yen Cheng$^{1,2}$ and Peter Hall$^{1}$

$^1$Australian National University and $^2$National Taiwan University

Abstract: We suggest an adaptive, error-dependent smoothing method for reducing the variance of local-linear curve estimators. It involves weighting the bandwidth used at the ith datum in proportion to a power of the absolute value of the ith residual. We show that the optimal power is 2/3. Arguing in this way, we prove that asymptotic variance can be reduced by 24% in the case of Normal errors, and by 35% for double-exponential errors. These results might appear to violate Jianqing Fan's bounds on performance of local-linear methods, but note that our approach to smoothing produces nonlinear estimators. In the case of Normal errors, our estimator has slightly better mean squared error performance than that suggested by Fan's minimax bound, calculated by him over all estimators, not just linear ones. However, these improvements are available only for single functions, not uniformly over Fan's function class. Even greater improvements in performance are achievable for error distributions with heavier tails. For symmetric error distributions the method has no first-order effect on bias, and existing bias-reduction techniques may be used in conjunction with error-dependent smoothing. In the case of asymmetric error distributions an overall reduction in mean squared error is achievable, involving a trade-off between bias and variance contributions. However, in this setting, the technique is relatively complex and probably not practically feasible.

Key words and phrases: Bandwidth, kernel method, nonparametric regression, tail weight, variance reduction.

1. Introduction

There is a great variety of bias reduction methods for nonparametric curve estimation, ranging from high-order kernel techniques (e.g., Wand and Jones (1995), Chapters 2 and 5) to local bandwidth adjustments (e.g., Abramson (1982, 1984), Jones (1990)), methods based on varying location and scale (e.g., Samiuddin and el-Sayyad (1990), Jones, McKay and Hu (1994)), empirical transformations (e.g., Ruppert and Cline (1994)), weights (e.g., Jones, Linton and Nielsen (1995)), and skew computation (e.g., Choi and Hall (1998)). All have variants for both nonparametric density estimation and nonparametric regression, although often they are introduced first in the former setting. However, very few methods have been suggested for reducing the impact of variance. Those that do exist involve principally deterministic adjustments to bandwidth, altering the local trade-off between variance and squared bias in the context of mean integrated squared error.

In the present paper, and in the context of nonparametric regression, we suggest a new and entirely different approach to variance reduction. It involves adjusting bandwidth in a stochastic, rather than deterministic, way, with the aim of providing improved performance by giving greater weight to data pairs that correspond to smaller absolute errors. In an extreme case, if an error were exactly zero then we would wish to use the corresponding data pair for interpolation, rather than simply for smoothing. Interpolation would correspond to using a bandwidth of zero, and so our “error dependent” method involves taking the bandwidth to be a function of the error — or more practically, of the residual. The function is non-degenerate, even in the asymptotic limit. For local-linear estimators we suggest that it be taken proportional to the two-thirds power of the absolute value of the error (or residual). Different powers are appropriate for higher-order methods, with the power increasing to 1 as order increases.

The idea of allowing the bandwidth for the $i$th datum to depend nontrivially on the $i$th error is reminiscent of Abramson's (1982) approach to bias reduction in density estimation. There, the bandwidth for smoothing $X_i$ when estimating the density $f$ is ideally taken proportional to a negative power of $f(X_i)$. While this approach has an analogue for nonparametric regression (Hall (1990)), it nevertheless reduces bias rather than variance, and is not closely related to the technique suggested here. In particular, error-dependent smoothing reduces variance by a constant factor, and has no first-order impact on bias (in the case of symmetric error distributions); most bias reduction methods reduce bias by an order of magnitude and inflate variance by a constant factor.

In large samples, our method reduces the variance contribution to mean integrated squared error by 24% in the case of Normal errors, by 35% for double-exponentially distributed errors, and by even greater amounts for very heavy-tailed error distributions. These figures might appear to violate the bounds asserted by Fan (1993) for performance of local-linear estimators. There is in fact no contradiction, however, principally for the following reason. Fan's minimax theory applies to classes of regression functions that have, in effect, two bounded derivatives. By way of contrast, our results require the functions to have two continuous derivatives. Our claim about improved performance does need continuity of the second derivative, and in particular does not hold uniformly across the function class addressed by Fan. Other dissimilarities too should be borne in mind. For example, Fan's (1993) result about local-linear estimators applies only within the class of linear techniques, and (after error-dependent smoothing) our estimators are nonlinear. Furthermore, Fan's other minimax bounds, measuring performance against nonlinear techniques, do not specifically address the range of heavy-tailed error distributions that are considered in the present paper, since his class $\mathcal{C}_2$ of models for the "max" part of "minimax" contains models with Normal errors.

Our methods are relatively simple when the error distribution is symmetric or nearly symmetric, and that is the context in which they have greatest practical significance. In the case of asymmetry, however, error-dependent smoothing can introduce an additional bias term to the estimator. This makes it relatively complex to minimise mean squared error. For the sake of completeness we briefly explore the general case from a theoretical viewpoint but, since the practical attraction of the asymmetric setting is not high, we do not describe its numerical properties.

From a practical viewpoint, empirical bandwidth choice methods that produce overly variable bandwidths can require relatively large samples in order to achieve theoretically optimal levels of performance. The rule proposed in this paper, where the bandwidth is proportional to a power of a residual, can be subject to excessive fluctuation when residuals are either close to 0 or large in absolute value. This suggests that a truncation argument be used to dampen variability and improve performance. The need to choose truncation points adds to the complexity of the method, however, as does the necessity of selecting a pilot bandwidth in order to calculate residuals. The result is that, in small to moderate sized samples, mean squared error reductions offered by a practical version of our method may be some distance from theoretically optimal levels. Nevertheless, our theory and simulation analysis show clearly the potential of the methods.

Provided errors are stochastically independent, conditional on design points, there is no difficulty in combining error-dependent smoothing with a bias-reduction method. The effect is to reduce bias by an order of magnitude, and reduce variance by a constant factor relative to the value it would assume if only bias reduction were employed. Likewise, error-dependent smoothing may be applied in conjunction with a spatially-local bandwidth choice procedure, such as that suggested by Fan and Gijbels (1995): one simply takes the scale factor $h$, in formulae such as $h_i = h\,H(\hat\epsilon_i)$ discussed in Section 2, to depend on spatial location $x$. The method is also applicable to the case of heteroscedasticity, no matter whether the variance function is modelled parametrically or estimated nonparametrically. See Section 2.7 for discussion.

2. Methodology and Main Results

2.1. Model and estimator

Assume that independent and identically distributed data pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ are generated by the model
$$Y_i = g(X_i) + \epsilon_i, \quad (2.1)$$
where $X_i$ is independent of $\epsilon_i$, and $E(\epsilon_i) = 0$. In the "ideal" case, where the errors $\epsilon_i$ are known, we define a bandwidth $h_i$ by $h_i = h\,H(\epsilon_i)$. Here, $h = h(n)$ denotes a sequence of positive constants, and $H$ is a fixed positive function. More realistically, we might approximate $\epsilon_i$ by a residual $\hat\epsilon_i$, and put $h_i = h\,H(\hat\epsilon_i)$. In either case, and for fixed $x$, let $(\hat a, \hat b)$ denote the pair that minimises
$$\sum_{i=1}^{n} \{Y_i - a - b\,(X_i - x)\}^2\, h_i^{-1} K\{(X_i - x)/h_i\},$$
where $K$ is a kernel function, and put $\hat g(x) = \hat a$.

We expect $H(x)$ to be an increasing function of $|x|$, in which case smaller absolute errors (or residuals) produce less smoothing. In the "realistic" case, where residuals are used rather than the errors themselves, particularly large absolute values of residuals may not necessarily reflect similar values of errors. Rather, they might result from inaccuracies in the pilot estimator of $g$ that is used to compute residuals. This means that we should usually threshold the smoothing parameter, to guard against outlying values. Thresholding is also useful if we are to ward off problems with sparse design, which do not make themselves felt through the asymptotic distribution of $\hat g$. These difficulties do not arise in the "ideal" case, which therefore offers more insight into the operation of level-dependent thresholding. We address the "ideal" and "realistic" cases in Sections 2.4 and 2.5, respectively.
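To make the construction concrete, the following minimal sketch (Python with NumPy; the function name `local_linear_ed`, the bandwidth floor and all parameter choices are our own illustrative assumptions, not the paper's) computes $\hat g(x) = \hat a$ via the closed-form ratio $\hat g = (S_2 T_0 - S_1 T_1)/(S_2 S_0 - S_1^2)$ used in the Appendix, with error-dependent bandwidths $h_i = h\,H(\epsilon_i)$ and $H(\epsilon) = |\epsilon|^{2/3}$.

```python
import numpy as np

def local_linear_ed(x, X, Y, h, eps, floor=1e-3):
    """'Ideal' error-dependent local-linear estimate of g(x).

    Bandwidths are h_i = h * |eps_i|**(2/3); `floor` is an illustrative
    lower threshold guarding against h_i = 0 when an error vanishes,
    in the spirit of the truncation discussed in Sections 2.1 and 2.5.
    """
    hi = np.maximum(h * np.abs(eps) ** (2 / 3), floor)
    d = X - x
    u = d / hi
    # biweight kernel, as in the paper's simulations
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0) / hi
    # weighted local-linear fit of a + b(X - x); ghat(x) = a
    S0, S1, S2 = w.sum(), (w * d).sum(), (w * d**2).sum()
    T0, T1 = (w * Y).sum(), (w * Y * d).sum()
    return (S2 * T0 - S1 * T1) / (S2 * S0 - S1**2)
```

In the "realistic" case one would pass residuals $\hat\epsilon_i$ from a pilot fit in place of `eps`.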

2.2. Overview of theory

First we deal with the "ideal" setting. In the case of classical local-linear smoothing, where $H \equiv 1$, it is well known that the local-linear estimator has bias of size $h^2$ and error about the mean of size $(nh)^{-1/2}$. Specifically, under mild regularity conditions (see Fan (1993)),

$$\hat g = g + \tfrac12 h^2 g'' \kappa_2 + (nh)^{-1/2} f^{-1/2} \kappa^{1/2} \sigma\, N_n + o_p(h^2), \quad (2.2)$$
where $\kappa_2 = \int u^2 K(u)\,du$, $\kappa = \int K^2$, $\sigma^2 = \mathrm{var}(\epsilon)$, and $N_n$ denotes a random variable whose distribution is asymptotically Normal $N(0,1)$. The second term on the right-hand side of (2.2) represents systematic error, and the third is stochastic error.

More generally, suppose $H$ is a non-degenerate function. If $E\{H(\epsilon)^2\} < \infty$ then we may standardise $H$ so that $E\{H(\epsilon)^2\} = 1$. We claim that in these circumstances, provided the error distribution is symmetric and $H$ is an even function, the expansion at (2.2) continues to hold, except that the third term is multiplied by the factor $\rho$, where $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-1}\}/\sigma^2$:
$$\hat g = g + \tfrac12 h^2 g'' \kappa_2 + (nh)^{-1/2} f^{-1/2} \kappa^{1/2} \sigma \rho\, N_n + o_p(h^2). \quad (2.3)$$
As in (2.2), $N_n$ denotes a random variable that is asymptotically Normal $N(0,1)$, although it will assume different numerical values in appearances at (2.2) and (2.3).

Minimising $\rho$ subject to $E\{H(\epsilon)^2\} = 1$ is an elementary variational problem. The minimum is achieved when $H(\epsilon)$ equals a constant multiple of $|\epsilon|^{2/3}$, in which case $\rho^2 = \rho_0^2$, where
$$\rho_0^2 \equiv \{E(|\epsilon|^{4/3})\}^{3/2}/\sigma^2 < 1. \quad (2.4)$$
Of course, taking $H(\epsilon)$ proportional to $|\epsilon|^{2/3}$ is all that is needed for optimal reduction of asymptotic mean squared error. The particular constant is absorbed into the non-random multiplier, $h$, and so is immaterial.

For Normal errors, and for our estimator rather than a conventional local-linear estimator, our results are suggestive of a version of Fan's (1993) Theorem 4 in which his constant 0.8962 is replaced by 1.00, this being the value (to two decimal places) of
$$(1.243\, \rho_0^{8/5})^{-1} = \big(1.243 \times [2\, \{\Gamma(7/6)/\pi^{1/2}\}^{3/2}]^{4/5}\big)^{-1} \approx 1.00242. \quad (2.5)$$
The formula here represents the efficiency, defined as the ratio of mean squared errors, of our estimator relative to the "optimal" nonlinear regression function estimator, when estimating a fixed, twice continuously differentiable target $g$. The "optimal" estimator here is the one that gives minimum mean squared error in a minimax sense, uniformly over functions that have two bounded derivatives; it does not make use of continuity of the second derivative. Details of the origin of the formula are given by Donoho and Liu (1991), and in fact the figure 1.243 is taken from Donoho and Liu's Table 1. It was apparently derived by combining, in a conservative way, numerical results obtained by Donoho, Liu and MacGibbon (1990). The exact value of $\rho_0^2$ in the Normal case may be shown to equal $2\{\Gamma(7/6)/\pi^{1/2}\}^{3/2} \approx 0.757$.

Note that our results rely on continuity of the second derivative of $g$, whereas Fan's results apply uniformly over a class of $g$'s that have only a uniformly bounded second derivative. The requirement for continuity is the key to reconciling Fan's results with our own. The fact that the "ideal" estimator is not a true estimator in the usual sense is not an issue in comparing Fan's results with our own. Indeed, we show in Section 2.5 that, for any particular functions $f$ and $g$ with two continuous derivatives, the level of asymptotic performance evinced by the ideal estimator is achievable by a realistic version. In this case a pilot estimator $\tilde g$ is constructed, and used to calculate $\hat\epsilon_i = Y_i - \tilde g(X_i)$. Then, possibly after centering or thresholding these residuals, we compute the empirical bandwidth $\hat h_i = h\,|\hat\epsilon_i|^{2/3}$. We use $\hat h_i$ in place of $h_i$ to construct $\hat g$. Modulo regularity conditions, and truncation to alleviate difficulties caused by too large or too small values of $|\hat\epsilon_i|$, formula (2.3) continues to hold.


Since Fan's minimax function class $\mathcal{C}_2$ contains models where the error distribution is Normal, results such as those discussed above do not relate to improvements in performance that error-dependent smoothing can achieve in the case of heavy-tailed error distributions. Indeed, the value of $\rho_0^2$ tends to be smaller for distributions with heavier tails. Values in the cases of double Exponential, Normal and Uniform errors are respectively 0.650, 0.757 and 0.842. For errors with Student's $t$ distribution on 5, 10 or 20 degrees of freedom, the values are 0.676, 0.726 and 0.743, respectively. Figure 2.1 plots $\rho^2$ as a function of the number of degrees of freedom (interpreted in the continuum) for Student's $t$ errors. For non-Normal errors one can also construct estimators that are more efficient than $\hat g$ by replacing local least-squares by a robust method such as local $M$-estimation.
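The variance-reduction factors quoted above can be checked directly from (2.4). A small sketch (assuming SciPy; `rho0_sq` is our own illustrative name) evaluates $\rho_0^2 = \{E(|\epsilon|^{4/3})\}^{3/2}/\sigma^2$ by numerical integration:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import gamma

def rho0_sq(dist):
    """rho_0^2 = {E|eps|^{4/3}}^{3/2} / sigma^2, as at (2.4)."""
    a, b = dist.support()
    m = quad(lambda u: abs(u) ** (4 / 3) * dist.pdf(u), a, b)[0]
    return m**1.5 / dist.var()

print(rho0_sq(stats.laplace()))                # 0.650 (double Exponential)
print(rho0_sq(stats.norm()))                   # 0.757
print(2 * (gamma(7 / 6) / np.pi**0.5) ** 1.5)  # Normal closed form, also 0.757
print(rho0_sq(stats.uniform(-1, 2)))           # 0.842 (Uniform on (-1, 1))
print(rho0_sq(stats.t(5)))                     # 0.676
```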

Figure 2.1. Values of $\rho^2$ versus number of degrees of freedom for Student's $t$ error distributions.

In the case of asymmetric errors, or when $H$ is not an even function, a formula similar to (2.3) is valid, except that an extra bias term of size $h^2$ is introduced. This quantity is proportional to $E\{\epsilon H(\epsilon)^2\}$, and so is typically small if the error distribution is close to being symmetric. More generally, it is possible either to choose both $h$ and $H$ (in the formula $h_i = h\,H(\epsilon_i)$) so that asymptotic mean squared error is minimised; or to choose $H$ to minimise the asymptotic variance term subject to both $E\{H(\epsilon)^2\} = 1$ and $E\{\epsilon H(\epsilon)^2\} = 0$. The first rule produces a particularly complex function $H$, and is quite unattractive for practical implementation. The second rule gives
$$H(u) = \begin{cases} c_1\, u^{2/3} & \text{if } u > 0, \\ c_2\, |u|^{2/3} & \text{if } u < 0, \end{cases} \quad (2.6)$$
where
$$c_1 = \bigg[E\{\epsilon^{4/3} I(\epsilon > 0)\} + E\{|\epsilon|^{4/3} I(\epsilon < 0)\}\, \frac{E\{\epsilon^{7/3} I(\epsilon > 0)\}}{E\{|\epsilon|^{7/3} I(\epsilon < 0)\}}\bigg]^{-1/2}, \quad c_2/c_1 = \Big[E\{\epsilon^{7/3} I(\epsilon > 0)\}\big/E\{|\epsilon|^{7/3} I(\epsilon < 0)\}\Big]^{1/2}.$$

Thus, using error-dependent smoothing in the case of asymmetric errors requires relatively detailed inference about the error distribution, and is not as attractive as in the case of symmetry.

Note, however, that there is always a strict reduction in the variance contribution to asymptotic mean squared error. For example, if $H$ is given by (2.6), with the constants $c_1$, $c_2$ defined above, then (without requiring the assumption of symmetry) (2.3) holds with $\rho < 1$.

2.3. Generalisations and extensions

The main idea behind these results — that the variance of a nonparametric regression estimator may be reduced by using an error-dependent smoothing parameter — is applicable to a wide range of settings. For example, it is valid for linear, second-order methods such as the Nadaraya-Watson estimator (see e.g., Härdle (1990), p.25; Wand and Jones (1995), p.119), the Gasser-Müller estimator (Wand and Jones (1995), p.131) and the Priestley-Chao estimator (Wand and Jones (1995), p.130). In all these cases the asymptotically optimal form of $H$ is the same as that described above in the local-linear context. Moreover, if the error distribution is symmetric then bias is unaffected, to first order, by error-dependent smoothing.

Error-dependent smoothing is also applicable to general local-polynomial estimators of regression means and their derivatives (see e.g., Ruppert and Wand (1994)). There, if the method is of order $r$ (meaning that, in the case of a non-random bandwidth $h$, bias is of size $h^r$), the natural analogue of the condition $E\{H(\epsilon)^2\} = 1$ is $E\{H(\epsilon)^r\} = 1$. This ensures that, if the error-dependent bandwidth is $h_i = h\,H(\epsilon_i)$, if $H$ is an even function, and if the error distribution is symmetric, the bias term is the same (to first order) as it would be if the bandwidth were simply $h$. If we are estimating the $s$th derivative of $g$, for $s \ge 0$, then the variance contribution to asymptotic mean squared error is also the same to first order, except that it is reduced by the multiplicative factor $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-(2s+1)}\}/\sigma^2$.

The minimum of this quantity, subject to the constraint $E\{H(\epsilon)^r\} = 1$, is achieved when $H(\epsilon)$ is proportional to $|\epsilon|^{2/(r+2s+1)}$. Thus, the general form of $\rho_0^2$ (given at (2.4) in the case $(r, s) = (2, 0)$) is
$$\rho_0^2 = \{E(|\epsilon|^{2r/(r+2s+1)})\}^{(r+2s+1)/r}\big/\sigma^2.$$
In the case of symmetric errors, this represents the greatest amount by which variance may be reduced using error-dependent smoothing, when estimating an $s$th derivative by an $r$th order method.
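Taking the general form of $\rho_0^2$ displayed above at face value, it can be evaluated in the same way as before. In this sketch (assuming SciPy; `rho0_sq_general` is our own name) the case $(r, s) = (2, 0)$ reproduces the value 0.757 obtained from (2.4) for Normal errors.

```python
from scipy import stats
from scipy.integrate import quad

def rho0_sq_general(r, s, dist):
    """{E(|eps|^{2r/(r+2s+1)})}^{(r+2s+1)/r} / sigma^2 for an r-th order
    method estimating the s-th derivative; reduces to (2.4) at (2, 0)."""
    p = 2 * r / (r + 2 * s + 1)   # moment of the optimal H(eps) ~ |eps|^{2/(r+2s+1)}
    a, b = dist.support()
    m = quad(lambda u: abs(u) ** p * dist.pdf(u), a, b)[0]
    return m ** ((r + 2 * s + 1) / r) / dist.var()

print(rho0_sq_general(2, 0, stats.norm()))   # 0.757, matching (2.4)
print(rho0_sq_general(4, 0, stats.norm()))   # a fourth-order method, s = 0
```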

2.4. Theory in “ideal” case

Assume that $h = h(n) \to 0$ and $nh \to \infty$ as $n \to \infty$, and that $H$ is a continuous, positive function satisfying
$$E\{H(\epsilon)^3\} < \infty, \quad E\{H(\epsilon)^{-1}\} < \infty, \quad E[\epsilon^2\{H(\epsilon) + H(\epsilon)^{-1}\}] < \infty. \quad (2.7)$$
Let $K$ be a bounded, compactly supported, symmetric probability density, and let $f$ denote the marginal density of $X_i$, assumed to exist in a neighbourhood of $x$. Assume that $f$ and $g$ have two continuous derivatives in a neighbourhood of $x$, and that $f(x) > 0$.

Theorem 2.1. Under the above conditions,
$$\hat g(x) = g(x) + \tfrac12 h^2 \kappa_2 \big[g''(x)\, E\{H(\epsilon)^2\} + f'(x) f(x)^{-1} E\{\epsilon H(\epsilon)^2\}\big] + \big[(nh)^{-1} \kappa f(x)^{-1} E\{\epsilon^2 H(\epsilon)^{-1}\}\big]^{1/2} N_n(x) + o_p(h^2), \quad (2.8)$$
where $N_n(x)$ denotes a random variable whose distribution is asymptotically Normal $N(0,1)$.

Corollary. Assume the conditions of Theorem 2.1. If $E\{\epsilon H(\epsilon)^2\} = 0$ — in particular, if the distribution of $\epsilon$ is symmetric and $H$ is an even function — and if $E\{H(\epsilon)^2\} = 1$, then (2.3) holds at $x$, with $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-1}\}/\sigma^2$.

Conditions (2.7) hold if, for example, (a) $C_1 |\epsilon|^{1-\delta} \le H(\epsilon) \le C_2(1 + |\epsilon|)$ for constants $C_1, C_2, \delta > 0$, (b) $E(|\epsilon|^3) < \infty$, and (c) the probability density of $\epsilon$ exists in a neighbourhood of the origin and is bounded away from 0 there. If $H(\epsilon)$ has the optimal form $C|\epsilon|^{2/3}$ then (2.7) is valid if (c) holds and $E(|\epsilon|^{8/3}) < \infty$. The theorem may be extended to a variety of other settings, for example to the case where $x$ is a boundary point.

2.5. Theory in “realistic” case

Suppose $f$ and $g$ have two continuous derivatives in a neighbourhood of $x$, and that $f(x) > 0$. Assume too that $K$ is a bounded, compactly supported, symmetric probability density, and $K'$ exists and satisfies a Hölder condition; that $H'$ exists and satisfies a Hölder condition, and $C_1 \le H \le C_2$ for constants $0 < C_1 \le C_2 < \infty$; that $h = h(n) \to 0$ and $nh \to \infty$ as $n \to \infty$; and that $E(\epsilon^2) < \infty$. Call these assumptions (A1). (We say that a function $\psi$ satisfies a Hölder condition, or is Hölder continuous with exponent $\eta$, if there exists a constant $C > 0$ such that $|\psi(x) - \psi(y)| \le C\,|x - y|^{\eta}$ for all $x$ and $y$.)


In view of the continuity of $f'$ and $g''$, the quantity
$$\xi(\delta) \equiv \sup_{u\,:\,|x-u| \le \delta} \{|f'(x) - f'(u)| + |g''(x) - g''(u)|\} \quad (2.9)$$
converges to 0 as $\delta \to 0$. Let $h_0 = h_0(n)$ denote a bandwidth, chosen to converge to 0 sufficiently slowly for $n^{1/5} h_0 \to \infty$, and sufficiently quickly for $n^{1/5} h_0 = O(n^{\eta})$ for all $\eta > 0$ and $n^{1/5} h_0\, \xi(h_0)^{1/2} \to 0$. Call this assumption (A2). For example, if both $f'$ and $g''$ are Hölder continuous then $\xi(\delta) = O(\delta^{\eta})$ for some $\eta > 0$, and so $h_0(n) = n^{-1/5}\log n$ is an adequate choice.

Let $\tilde g$ denote a standard local-linear estimator of $g$, computed using bandwidth $h_0$. Put $\hat\epsilon_i = Y_i - \tilde g(X_i)$ and $\hat h_i = h\,H(\hat\epsilon_i)$, and redefine $\hat g(x)$ as the value $\hat a$ in the pair $(a, b) = (\hat a, \hat b)$ that minimises
$$\sum_{i=1}^{n} \{Y_i - a - b\,(X_i - x)\}^2\, \hat h_i^{-1} K\{(X_i - x)/\hat h_i\}.$$

Theorem 2.2. Under assumptions (A1) and (A2), (2.8) holds for the new estimator $\hat g$, with $N_n$ again denoting a random variable that is asymptotically Normal $N(0,1)$.

Neither Theorem 2.1 nor Theorem 2.2 is available uniformly in a function class such as Fan's $\mathcal{C}_2$. One reason, complementary to that given in Section 2.2, is that the rate at which the "$o_p(h^2)$" terms converge to 0 depends explicitly on the modulus of continuity of both $f'$ and $g''$, and can be arbitrarily slow. In the case of Theorem 2.2 this in turn influences choice of the pilot-estimator bandwidth $h_0$. Taking that quantity to be equal to a fixed constant multiple of $n^{-1/5}$, rather than of larger order than $n^{-1/5}$ (as required by assumption (A2)), results in inflation of variance relative to that achieved by the "ideal" estimator. For sufficiently heavy-tailed error distributions, such an inflation still produces a reduction in variance, relative to that for a classical local-linear estimator. This is readily apparent in both theoretical analysis and numerical simulations. For Normal data, however, slight oversmoothing of the pilot estimator and relatively large sample sizes are necessary in order to achieve obvious improvements.

The conditions imposed on $H$ in Theorem 2.2 preclude the optimal form $H(u) = \gamma|u|^{2/3}$. However, the level of performance in that case may be achieved, up to a constant factor that converges to 1 as $n \to \infty$, by considering successive approximations to the optimal $H$ by functions satisfying the conditions of the theorem. Likewise, conditions on the kernel $K$ prevent it from being the Epanechnikov function that attains asymptotically minimal mean squared error, but we may circumvent this problem by considering a sequence of approximations.

For example, the following is an empirical, thresholded version of the asymptotically optimal "ideal" procedure suggested in Section 2.2. Let $\tilde g$ be defined as before, and assume the conditions imposed in Theorem 2.2 on $K$ and on the bandwidths $h$ and $h_0$. Additionally, suppose $K$ is a compactly supported probability density with two bounded derivatives, that $f$ and $g$ have three bounded derivatives in a neighbourhood of $x$, that $f(x) > 0$, that $E(\epsilon^4) < \infty$, and that $H$ is defined by
$$H(u) = \begin{cases} \gamma\,|u|^{2/3} & \text{if } n^{-\alpha} \le |u| \le n^{\beta}, \\ \gamma\,n^{-2\alpha/3} & \text{if } |u| < n^{-\alpha}, \\ \gamma\,n^{2\beta/3} & \text{if } |u| > n^{\beta}, \end{cases} \quad (2.10)$$
where $\alpha$, $\beta$, $\gamma$ denote positive constants. Then it is possible to choose $\alpha, \beta > 0$ such that, for all $\gamma > 0$, (2.8) holds for the new version of $\hat g$ (constructed using the bandwidths $\hat h_i = h\,H(\hat\epsilon_i)$). The proof is particularly complex, and so will not be given here. The function $H$ at (2.10) achieves the asymptotic performance represented by (2.8) with $H(u) \equiv \gamma\,|u|^{2/3}$.
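In code, the truncated function at (2.10) is simply a clipped power law. A minimal sketch (NumPy; the function name is ours, and `alpha`, `beta`, `gamma` are the constants of (2.10)):

```python
import numpy as np

def H_truncated(u, n, alpha, beta, gamma=1.0):
    """The thresholded H of (2.10): gamma * |u|**(2/3), with |u| clipped
    to the interval [n**-alpha, n**beta] before the power is taken."""
    clipped = np.clip(np.abs(u), n ** (-alpha), n ** beta)
    return gamma * clipped ** (2 / 3)

# empirical bandwidths from pilot residuals res:
# h_i = h * H_truncated(res, n, alpha, beta)
```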

2.6. Bandwidth choice

Note that in view of property (2.3) the optimal bandwidth, $h^{\rm ed}_{\rm opt}$ say, for an error-dependent smoothing rule is $h^{\rm ed}_{\rm opt} = h^{\rm llin}_{\rm opt}\, \rho^{2/5}$, where $h^{\rm llin}_{\rm opt}$ is the optimal bandwidth for the conventional local-linear estimator, and $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-1}\}/\sigma^2$. (It is assumed here that $H$ has been standardised so that $E\{H(\epsilon)^2\} = 1$.) Therefore, to construct an empirical approach to bandwidth choice for an error-dependent smoothing rule we can use our "favourite" technique (such as a plug-in method or cross-validation) to compute an empirical approximation $\hat h^{\rm llin}$ to $h^{\rm llin}_{\rm opt}$, and take $\hat h^{\rm ed} = \hat h^{\rm llin}\, \hat\rho^{2/5}$ to be our empirical approximation to $h^{\rm ed}_{\rm opt}$, where
$$\hat\rho^2 = \Big(n^{-1}\sum_{i=1}^{n} \hat\epsilon_i^2\, H(\hat\epsilon_i)^{-1}\Big) \Big(n^{-1}\sum_{i=1}^{n} H(\hat\epsilon_i)^2\Big)^{1/2} \Big(n^{-1}\sum_{i=1}^{n} \hat\epsilon_i^2\Big)^{-1}.$$
(The second factor here gives an empirical version of the standardisation $E\{H(\epsilon)^2\} = 1$.) It may be proved that if the error distribution is symmetric, and if (in the construction of $\hat g(x)$) we take $\hat h_i = \hat h^{\rm llin}\, \hat\rho^{2/5} H(\hat\epsilon_i)$, and replace $h$ on the right-hand side of (2.8) by $h^{\rm ed}_{\rm opt}$, then (2.3) continues to hold.
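A direct empirical sketch of this rule (NumPy; `res` denotes residuals from the pilot fit and `H` a callable such as the truncated form of Section 2.5; the function name is ours):

```python
import numpy as np

def rho_hat_sq(res, H):
    """Empirical rho^2 of Section 2.6; the middle factor supplies an
    empirical version of the standardisation E{H(eps)^2} = 1."""
    return (np.mean(res**2 / H(res))
            * np.sqrt(np.mean(H(res) ** 2))
            / np.mean(res**2))

# given a favourite local-linear bandwidth h_llin (plug-in, cross-validation, ...):
# h_ed = h_llin * rho_hat_sq(res, H) ** (1 / 5)   # since rho^{2/5} = (rho^2)^{1/5}
```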

There is also a more complex cross-validation approach which adjusts implicitly for error-dependent smoothing, and does not require the multiplicative adjustment by $\hat\rho^{2/5}$. It is very computer intensive, however.

2.7. Heteroscedastic errors

For simplicity and brevity we have presented results only for models with independent and identically distributed errors. However, our methods apply virtually without change in heteroscedastic cases, where the error variance changes with $x$. The simplicity with which this setting can be treated derives from the fact that, provided error variance is a smooth function of $x$, the model is "locally homoscedastic".

Indeed, suppose $Y_i = g(X_i) + \epsilon_i$ where $\epsilon_i = \zeta_i\,\sigma(X_i)$, the variables $X_1, \zeta_1, \ldots, X_n, \zeta_n$ are independent, the $X_i$'s are identically distributed, the $\zeta_i$'s are identically distributed with zero mean and unit variance, and the function $\sigma$ is continuous. Denote these conditions by assumption (A3) (it replaces the overarching assumption made in Section 2.1 that the pairs $(X_i, Y_i)$ are independent and identically distributed, which implies that the $\epsilon_i$'s are independent and identically distributed). If we continue to define $\hat\epsilon_i = Y_i - \tilde g(X_i)$, if we continue to take $\hat h_i = h\,H(\hat\epsilon_i)$, if on the right-hand side of (2.8) we replace $\epsilon$ in all the expectations by $\zeta\,\sigma(x)$, and if we assume (A1)–(A3), then Theorem 2.2 continues to hold. The method of proof is virtually identical.

The only essential difference between homoscedastic and heteroscedastic cases lies in the way bandwidth is computed. We discuss this issue in the setting of local bandwidth choice, which seems more appropriate when error variance changes with location. Assume the distribution of $\zeta$ is symmetric, and for simplicity consider the case where $H(u) = |u|^{2/3}$. (Truncated versions of $H$, such as those considered at the end of Section 2.5, are approximations to this function. There is no need to include a constant multiplier, since it may be incorporated into the bandwidth.) Put $\mu = E(|\zeta|^{4/3})$. Then, noting the result reported in the previous paragraph, we show that
$$\hat g(x) = g(x) + \tfrac12 h^2 \kappa_2\, g''(x)\, \mu\, \sigma(x)^{4/3} + \big[(nh)^{-1} \kappa f(x)^{-1} \mu\, \sigma(x)^{4/3}\big]^{1/2} N_n(x) + o_p(h^2).$$

It may be proved from this formula that the bandwidth $h^{\rm ed}_{\rm opt}(x)$ that minimises asymptotic mean squared error at $x$ may be expressed as $h^{\rm ed}_{\rm opt}(x) = h^{\rm llin}_{\rm opt}(x)\, \{\mu\,\sigma(x)^{4/3}\}^{-1/5}$, where $h^{\rm llin}_{\rm opt}(x) = \{\kappa\,\sigma(x)^2 / [n \kappa_2^2 f(x)\, g''(x)^2]\}^{1/5}$ is the bandwidth that minimises mean squared error of the standard local-linear estimator. Therefore, given an empirical version $\hat h^{\rm llin}_{\rm opt}(x)$ of $h^{\rm llin}_{\rm opt}(x)$ (see e.g., Fan and Gijbels (1995)), and estimators $\hat\mu$ and $\hat\sigma(x)$ of $\mu$ and $\sigma(x)$ respectively, we may calculate an empirical version $\hat h^{\rm ed}_{\rm opt}(x) = \hat h^{\rm llin}_{\rm opt}(x)\,\{\hat\mu\,\hat\sigma(x)^{4/3}\}^{-1/5}$ of $h^{\rm ed}_{\rm opt}(x)$. There are several ways of computing $\hat\mu$ and $\hat\sigma(x)$; we give only two here. In the first, $\sigma(\cdot)$ is modelled parametrically by a smooth function, for example a linear or a quadratic function. Parameters of the model, and hence $\sigma(\cdot)$, can be consistently estimated by treating centered residuals as though they were true values of the $\epsilon_i$'s. In this way, an estimator $\hat\sigma(\cdot)$ of $\sigma(\cdot)$ can be calculated. Residual values of the $\zeta_i$'s can now be computed as $\hat\zeta_i = \hat\epsilon_i/\hat\sigma(X_i)$, and $\mu$ estimated by $\hat\mu = n^{-1}\sum_{i=1}^{n} |\hat\zeta_i|^{4/3}$. In the second, $\sigma(\cdot)$ is estimated nonparametrically, by considering the nonparametric regression problem in which squares of centered residuals are regressed on their expected values. Once $\hat\sigma(\cdot)$ has been calculated in this way, $\hat\mu$ can be computed as before.
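A sketch of the resulting local bandwidth rule (NumPy; the function names are ours, and `sigma_hat` denotes whichever of the two estimates of $\sigma(\cdot)$ described above is used):

```python
import numpy as np

def mu_hat_from_residuals(res, sigma_hat_X):
    """Estimate mu = E(|zeta|^{4/3}) from standardised residuals."""
    zeta = res / sigma_hat_X                 # zeta_i = res_i / sigma_hat(X_i)
    return np.mean(np.abs(zeta) ** (4 / 3))

def h_ed_local(h_llin_x, sigma_hat_x, mu_hat):
    """h_ed(x) = h_llin(x) * {mu * sigma(x)^{4/3}}^{-1/5} (Section 2.7)."""
    return h_llin_x * (mu_hat * sigma_hat_x ** (4 / 3)) ** (-1 / 5)
```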

3. Numerical Properties

In this section we report a simulation study conducted to examine numerical properties of error-dependent smoothing rules. In this work we took $g(x) = 4\sin(2\pi x)$ and used equally-distributed design points on $(0, 1)$. The error distribution was either Normal $N(0, 2.25)$ or Student's $t$ with 5 degrees of freedom. Sample size $n$ was 50, 100, 200 or 500. The biweight kernel $K(u) = (1 - u^2)^2\, I(-1 < u < 1)$ was employed.

When plotting integrated squared biases, variances and mean squared errors against bandwidth, $h$ ranged over 51 logarithmically equispaced values. We do not explore empirical bandwidth choice, since the additional variation that it introduces may confound differences between conventional local-linear techniques and our method. In theory the second bandwidth, $h_0$, used for the pilot estimator has only a second-order effect on the results, although it should be taken larger than the theoretically optimal bandwidth. In all the work reported here we chose $h_0$ to be 25% larger than the standard optimal value.

Three estimators were considered: (i) $\hat g_L$, the standard local-linear estimator; (ii) $\hat g_I$, the "ideal" error-dependent local-linear estimator with $h_i = h\,|\epsilon_i|^{2/3}$; and (iii) $\hat g_R$, the "realistic" error-dependent local-linear estimator using the function $H$ at (2.10) with $(n^{-\alpha}, n^{\beta}) = (0.2^{2/3}, 8^{2/3})$ and residuals obtained from a pilot estimation. For every setting, 1000 random samples were generated. Each estimator was evaluated over an equispaced grid of 400 points, with the interpolation method of Hall and Turlach (1997) used to guard against sparse-design problems caused by too-small choices of bandwidth. The mean integrated squared biases and variances were approximated by averaging over the 1000 realizations.
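The skeleton of a single simulation replication (a sketch reusing the `local_linear_ed` function from the Section 2.1 sketch; the seed, the bandwidth value, the uniform random design and the omission of the Hall and Turlach (1997) interpolation safeguard are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return 4 * np.sin(2 * np.pi * x)      # the paper's sine regression function

n, h = 100, 0.05                          # illustrative sample size and bandwidth
X = np.sort(rng.uniform(0, 1, n))         # design points on (0, 1)
eps = rng.standard_t(5, size=n)           # t_5 errors
Y = g(X) + eps

grid = np.linspace(0.01, 0.99, 400)       # equispaced evaluation grid of 400 points
ghat = np.array([local_linear_ed(x, X, Y, h, eps) for x in grid])  # 'ideal' rule
ise = np.mean((ghat - g(grid)) ** 2)      # integrated squared error, one sample
print(ise)
```

Averaging `ise` over many replications approximates the mean integrated squared errors reported below; the "realistic" estimator $\hat g_R$ would instead pass pilot-fit residuals in place of `eps`.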

Figure 3.1 summarises results when $n = 100$ and the error distribution was Student's $t$ with 5 degrees of freedom. First, we note that error-dependent smoothing rules lead to a significant decrease in the mean integrated variance (see panel (c)) while maintaining almost the same level of mean integrated squared bias (panel (b)). This is as predicted by our asymptotic theory. Furthermore, the minimum mean integrated squared error (MISE) values for $\hat g_L$, $\hat g_I$ and $\hat g_R$ are 0.137, 0.096 and 0.119, respectively. Equivalently, the "ideal" and "realistic" error-dependent smoothing rules reduced MISE by 30% and 13% respectively. Greater reductions occurred for larger values of $n$, until they asymptoted to the large-sample limit predicted in Section 2.


Figure 3.1. Simulation results for the sine regression function, depicted in panel (a), and for sample size $n = 100$ and $t_5$ errors. In panels (b), (c) and (d), respectively, the integrated squared biases, variances and mean squared errors of $\hat g_L$ (solid lines), $\hat g_I$ (dotted lines) and $\hat g_R$ (dashed lines) are plotted against bandwidth on a log-log scale. The vertical lines, with consistent line types, locate the optimal bandwidths that produced minimum mean integrated squared errors.

Figure 3.2 is the analogue of Figure 3.1 except that the error distribution is now Normal $N(0, 2.25)$. The "ideal" error-dependent smoothing rule again produces significant MISE reduction. However, MISE reductions in the "realistic" case only become significant for $n = 500$. In the case of Normal errors, error-dependent smoothing rules tend to slightly inflate the bias while reducing the variance. The increase in bias is of course a second-order effect, since it is not evident in the first-order theoretical analysis in Section 2. More generally, simulations with Student's $t$ errors with a range of degrees of freedom show that the extent of MISE improvement offered by $\hat g_R$ declines, and the extent of bias inflation increases, as the error tail-weight decreases.


Figure 3.2. Simulation results for the sine regression function, n = 100 and Normal (0, 2.25) errors. Functions and line types are the same as in Figure 3.1.

In summary, tail weight of the error distribution, and second-order contributions from bias, can have a significant effect on the performance of error-dependent smoothing rules. In relation to the bias issue we found that, for a given error distribution, performance of $\hat g_R$ relative to $\hat g_L$ could be made better or worse by using target functions that produced lesser or greater amounts of bias, respectively. For example, by taking the sinusoid $g$ to have only half a wavelength over the interval $(0, 1)$ we could enhance performance of $\hat g_R$ relative to $\hat g_L$; and by giving it more than one wavelength we could reduce relative performance.

A reviewer expressed concern that our use of grids for computation implied the method might be excessively computationally expensive. We use grids only to numerically determine mean integrated squared errors. In particular, no grid-search methods are required. All our techniques involve only explicit calculation; nothing is defined implicitly, and no equations have to be solved in order to construct our estimators. In practice one could use binning procedures to accelerate calculation. See for example page 238 of Scott (1992).

Acknowledgements

We are grateful to Iain Johnstone for very helpful discussion, and to three reviewers for helpful comments on the first version of the paper.

Appendix

A.1. Proof of Theorem 2.1.

We may write $\hat g = (S_2 T_0 - S_1 T_1)/(S_2 S_0 - S_1^2)$, where, defining $K_i(x) = K\{(X_i - x)/h_i\}$ and, for a vector $a = a(x) = (a_1, \ldots, a_n)$, setting $U(a) = n^{-1}\sum_i h_i^{-1} a_i K_i$, we put $S_j = U(a)$ for $a_i(x) = (X_i - x)^j$, and $T_j = U(a)$ for $a_i(x) = Y_i (X_i - x)^j$. Let $T_{j1} = U(a)$ with $a_i(x) = g(X_i)(X_i - x)^j$, $T_{j2} = U(a)$ with $a_i(x) = \epsilon_i (X_i - x)^j$, and $R_j = U(a)$ with $a_i = h_i^j$. Then $T_j = T_{j1} + T_{j2}$. (For the sake of simplicity we shall often suppress the argument $x$.)

It may be proved by Taylor expansion that
$$S_2 T_{01} - S_1 T_{11} = g\,(S_2 S_0 - S_1^2) + \tfrac12 g''\,[\kappa_2 h^2 f\, E\{H(\epsilon)^2\}]^2 + o_p(h^4) \quad (A.1)$$
and $S_2 S_0 - S_1^2 = \kappa_2 h^2 f^2 E\{H(\epsilon)^2\} + o_p(h^2)$. Therefore,
$$\hat g_1 \equiv (S_2 T_{01} - S_1 T_{11})/(S_2 S_0 - S_1^2) = g + \tfrac12 g'' \kappa_2 h^2 E\{H(\epsilon)^2\} + o_p(h^2). \quad (A.2)$$
Noting that (2.7) implies $E\{|\epsilon|\, H(\epsilon)^2\} < \infty$, it may be proved that for $j = 0, 1$,
$$E(T_{j2}) = h^j \kappa_j f(x)\, E\{\epsilon H(\epsilon)^j\} + h^{j+1} \kappa_{j+1} f'(x)\, E\{\epsilon H(\epsilon)^{j+1}\} + \tfrac12 h^{j+2} \kappa_{j+2} f''(x)\, E\{\epsilon H(\epsilon)^{j+2}\} + o(h^{j+2}), \quad (A.3)$$
$$\mathrm{var}(T_{j2}) \sim n^{-1} h^{2j-1} E\{\epsilon^2 H(\epsilon)^{2j-1}\}\, f \int y^{2j} K(y)^2\, dy, \quad (A.4)$$
where in (A.3), when $j = 1$, it is understood that we take the expansion only up to a remainder of $o(h^2)$. Lindeberg's condition for the series $T_{02}$, normalised by its standard deviation, holds provided that, for any $C_1, C_2 > 0$,
$$(nh)^{-1} \sum_{i=1}^{n} E[\epsilon_i^2\, H(\epsilon_i)^{-2}\, I(|X_i - x| \le C_1 h_i)\, I\{|\epsilon_i|\, H(\epsilon_i)^{-1} > C_2 (nh)^{1/2}\}] \to 0. \quad (A.5)$$
(See e.g., Chung (1974), p.205, for Lindeberg's condition.) Result (A.5) follows from the fact that $E\{\epsilon^2 H(\epsilon)^{-1}\} < \infty$, and so by Lindeberg's central limit result with (A.3) and (A.4) we deduce that, for fixed values of the argument $x$,
$$\hat g_2 \equiv (S_2 T_{02} - S_1 T_{12})/(S_2 S_0 - S_1^2) = \tfrac12 f' f^{-1} \kappa_2 h^2 E\{\epsilon H(\epsilon)^2\} + \big[(nh)^{-1} f^{-1} \kappa\, E\{\epsilon^2 H(\epsilon)^{-1}\}\big]^{1/2} N_n + o_p(h^2),$$
where $N_n$ is asymptotically Normal $N(0, 1)$. The theorem follows from this expansion and (A.2), on noting that $\hat g = \hat g_1 + \hat g_2$.

A.2. Proof of Theorem 2.2.

Put $\tilde K_i = K\{(X_i - x)/\hat h_i\}$ and $\tilde U(a) = n^{-1}\sum_i \hat h_i^{-1} a_i \tilde K_i$, and let $\tilde S_j$, $\tilde T_{j1}$, $\tilde T_{j2}$, $\tilde T_j$ and $\tilde R_j$ denote the versions of $\tilde U(a)$ that arise with $a_i(x) = (X_i - x)^j$, $g(X_i)(X_i - x)^j$, $\epsilon_i (X_i - x)^j$, $Y_i (X_i - x)^j$ and $\hat h_i^j$, respectively. Then $\tilde T_j = \tilde T_{j1} + \tilde T_{j2}$ and $\hat g = (\tilde S_2 \tilde T_0 - \tilde S_1 \tilde T_1)/(\tilde S_2 \tilde S_0 - \tilde S_1^2)$. Since $H$ is assumed bounded away from zero and infinity, $\tilde R_j = O_p(h^j)$ for each $j$. Therefore, the following analogue of (A.1) holds, derived in a similar manner:
$$\tilde S_2 \tilde T_{01} - \tilde S_1 \tilde T_{11} = g\,(\tilde S_2 \tilde S_0 - \tilde S_1^2) + \tfrac12 g''\,(\tilde S_2^2 - \tilde S_1 \tilde S_3) + o_p(h^4). \quad (A.6)$$
Let $C_1, C_2, \ldots$ denote generic finite, strictly positive constants, let $\eta > 0$ be a Hölder exponent appropriate for both $H'$ and $K'$, and put $\Delta = \tilde g - g$, $H_1 = H'/H$ and $h_i = h\,H(\epsilon_i)$. Since $C_1 \le H \le C_2$, we may choose $C_3, C_4, C_5 > 0$ such that, by Taylor expansion and uniformly in $i$,
$$\bigg|\frac{H(\hat\epsilon_i) - H(\epsilon_i)}{H(\epsilon_i)} + \Delta(X_i)\, H_1(\epsilon_i)\bigg| \le C_3\, |\Delta(X_i)|^{1+\eta}, \quad (A.7)$$
$$\bigg|K\Big(\frac{X_i - x}{\hat h_i}\Big) - K\Big(\frac{X_i - x}{h_i}\Big) + \frac{\hat h_i - h_i}{h_i}\, \frac{X_i - x}{h_i}\, K'\Big(\frac{X_i - x}{h_i}\Big)\bigg| \le C_4\, \bigg|\frac{\hat h_i - h_i}{h_i}\bigg|^{1+\eta} I(|X_i - x| \le C_5 h). \quad (A.8)$$

Combining (A.7) and (A.8) we see that, with $L(u) = u\,K'(u)$,
$$\bigg|K\Big(\frac{X_i - x}{\hat h_i}\Big) - K\Big(\frac{X_i - x}{h_i}\Big) - \Delta(X_i)\, H_1(\epsilon_i)\, L\Big(\frac{X_i - x}{h_i}\Big)\bigg| \le C_6\, |\Delta(X_i)|^{1+\eta}\, I(|X_i - x| \le C_5 h). \quad (A.9)$$
Similarly but more simply, with $H_2 = H'/H^2$ we have $|h(\hat h_i^{-1} - h_i^{-1}) - \Delta(X_i)\, H_2(\epsilon_i)| \le C_7\, |\Delta(X_i)|^{1+\eta}$. Combining this result with (A.9), and defining $M = K + L$, we deduce that for any sequence $A_1, \ldots, A_n$,
$$\bigg|\sum_{i=1}^{n} \hat h_i^{-1} (X_i - x)^j A_i\, K\{(X_i - x)/\hat h_i\} - \sum_{i=1}^{n} h_i^{-1} (X_i - x)^j A_i\, K\{(X_i - x)/h_i\} - h^{-1} \sum_{i=1}^{n} \Delta(X_i)\, (X_i - x)^j A_i\, H_2(\epsilon_i)\, M\{(X_i - x)/h_i\}\bigg| \le C_8\, h^{-1} \sum_{i=1}^{n} |\Delta(X_i)|^{1+\eta}\, |X_i - x|^j\, |A_i|\, I(|X_i - x| \le C_5 h). \quad (A.10)$$

Put $\lambda_n = n^{1/5} h_0$. It is straightforward to prove that $\Delta(X_i) = O_p(n^{-2/5}\lambda_n^2)$ uniformly in values of $i$ such that $|X_i - x| \le C_5 h$. Hence, when $A_i \equiv 1$ we obtain from (A.10) the result
$$\bigg|\sum_{i=1}^{n} \hat h_i^{-1} (X_i - x)^j K\{(X_i - x)/\hat h_i\} - \sum_{i=1}^{n} h_i^{-1} (X_i - x)^j K\{(X_i - x)/h_i\}\bigg| = O_p(n^{3/5} h^j \lambda_n^2).$$
Equivalently, $\tilde S_j - S_j = O_p(n^{-2/5} h^j \lambda_n^2)$, where $S_j$ is as in the proof of Theorem 2.1. From this result and the property $n^{-2/5}\lambda_n^2 = o(1)$ it may be shown that $\tilde S_j = \kappa_j h^j f\, E\{H(\epsilon)^j\} + o_p(h^j)$. The latter relation, and (A.6), imply that
$$\tilde S_2 \tilde T_{01} - \tilde S_1 \tilde T_{11} = g\,(\tilde S_2 \tilde S_0 - \tilde S_1^2) + \tfrac12 g''\,[\kappa_2 h^2 f\, E\{H(\epsilon)^2\}]^2 + o_p(h^4), \quad (A.11)$$
$$\tilde S_2 \tilde S_0 - \tilde S_1^2 = \kappa_2 h^2 f^2 E\{H(\epsilon)^2\} + o_p(h^2). \quad (A.12)$$



S2S0−S12= κ2h2f2E{H()2} + op(h2) . (A.12) The only other sequence A1, . . . , An in which we are interested is Ai ≡ i,

and there we may deduce from (A.10) the bound

 n i=1 ˆ h−1i (Xi−x)jiK{(Xi−x)/ˆhi}− n  i=1 h−1i (Xi− x)jiK{(Xi−x)/hi}−Vj = Op(n(3−2η)/5hjλ2(1+η)n ) , (A.13) where Vj ≡ h−1i ∆(Xi)(Xi−x)jiH2(i)M{(Xi−x)/hi}. By Taylor-expanding

the ratio formula for a local-linear estimator (see e.g., the expression for ˆa given by Fan (1993, p.197) it may be proved that ∆(u) = h2

0ψn(u) + (nh0)−1f (u)−1 

i i

×K{(Xi−u)/h0}+Op(h40), uniformly in values u in a neighbourhood of x, where

ψn denotes a deterministic function that satisfies ψn(u) = 12g(u) κ2 + o(ξn) uniformly in a neighbourhood of x, and ξn= ξ(h0), ξ(·) being as defined at (2.9).

Therefore, Vj = Vj1+ Vj2+ op(n3/5hj), where Vj1= h−1h20 n  i=1 (Xi− x)jψn(Xi) iH2(i) M{(Xi− x)/hi} , Vj2= (nhh0)−1f (x)−1 n  i1=1 n  i2=1 i1i2H2(i2)(Xi2−x)jK{ Xi1−Xi2 h0 }M{ Xi2−x hi2 } .


Noting that $\int u^j M(u)\, du = 0$ for $j = 0, 1$, we may prove that for $j = 0, 1$ and $k = 1, 2$, $E(V_{j1}) = o(n^{3/5} h^j)$, $E(V_{j2}) = o(n^{3/5} h^j)$ and $V_{jk} - E(V_{jk}) = o_p(n^{3/5} h^j)$. When treating $V_{j2}$ we consider the diagonal and off-diagonal terms separately. The absolute value of the sum of the diagonal terms is adequately bounded by the sum of the absolute values of the summands. Since, conditional on the $X_i$'s, the $\epsilon_i$'s are independent, the variance of the sum of the off-diagonal terms in $V_{j2}$ is readily computed under the assumption that $E(\epsilon^2) < \infty$; in particular, we do not need finite fourth moments. However, to prove that $\mathrm{var}(V_{j2}) = o(n^{6/5} h^{2j})$ we do require the assumption that $n^{1/5} h_0 \to \infty$.

Therefore, $V_j = o_p(n^{3/5} h^j)$ for $j = 0, 1$, and so by (A.13),
$$\tilde T_{j2} = T_{j2} + n^{-1}(V_{j1} + V_{j2}) + O_p(n^{-2(1+\eta)/5} h^j \lambda_n^{2(1+\eta)}) = T_{j2} + o_p(n^{-2/5} h^j) \quad (A.14)$$
for $j = 0, 1$, where $T_{j2}$ is as in the proof of Theorem 2.1. We know from that proof that $|T_{02}| + |h^{-1} T_{12}| = O_p\{h^2 + (nh)^{-1/2}\}$. Hence, by (A.12) and (A.14),
$$\tilde S_2 \tilde T_{02} - \tilde S_1 \tilde T_{12} = f^{-1}(\tilde S_2 \tilde S_0 - \tilde S_1^2)\, T_{02} + o_p[h^2\{h^2 + (nh)^{-1/2}\} + n^{-2/5} h^2]. \quad (A.15)$$
Combining (A.11), (A.12) and (A.15) we deduce that

$$\hat g = \{(\tilde S_2 \tilde T_{01} - \tilde S_1 \tilde T_{11}) + (\tilde S_2 \tilde T_{02} - \tilde S_1 \tilde T_{12})\}\, (\tilde S_2 \tilde S_0 - \tilde S_1^2)^{-1} = g + \tfrac12 g'' \kappa_2 h^2 E\{H(\epsilon)^2\} + f^{-1} T_{02} + o_p\{h^2 + (nh)^{-1/2}\}. \quad (A.16)$$
We showed during the proof of Theorem 2.1 that $T_{02} = \tfrac12 f' \kappa_2 h^2 E\{\epsilon H(\epsilon)^2\} + [(nh)^{-1} f \kappa\, E\{\epsilon^2 H(\epsilon)^{-1}\}]^{1/2} N_n + o_p(h^2)$, where $N_n$ converges to Normal $N(0, 1)$. Theorem 2.2 follows from this result and (A.16).

References

Abramson, I. S. (1982). On bandwidth variation in kernel estimates — a square root law. Ann. Statist. 9, 168-176.

Abramson, I. S. (1984). Adaptive density flattening — a metric distortion principle for combating bias in nearest neighbour methods. Ann. Statist. 12, 880-886.

Choi, E. and Hall, P. (1998). On bias reduction in local linear smoothing. Biometrika 85, 333-346.

Chung, K. L. (1974). A Course in Probability Theory. Academic Press, New York.

Donoho, D. L. and Liu, R. C. (1991). Geometrizing rates of convergence, III. Ann. Statist. 19, 668-701.

Donoho, D. L., Liu, R. C. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Statist. 18, 1416-1437.

Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196-216.

Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Ser. B 57, 371-394.

Hall, P. (1990). On the bias of variable bandwidth curve estimators. Biometrika 77, 529-536.


Hall, P. and Turlach, B. A. (1997). Interpolation methods for adapting to sparse design in nonparametric regression. (With discussion and rejoinder.) J. Amer. Statist. Assoc. 92, 466-472.

Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge, U.K.

Jones, M. C. (1990). Variable kernel density estimates and variable kernel density estimates. Austral. J. Statist. 32, 361-371.

Jones, M. C., Linton, O. and Nielsen, J. P. (1995). A simple bias reduction method for density estimation. Biometrika 82, 327-338.

Jones, M. C., McKay, I. J. and Hu, T.-C. (1994). Variable location and scale kernel density estimation. Ann. Inst. Statist. Math. 46, 521-535.

Ruppert, D. and Cline, B. H. (1994). Bias reduction in kernel density estimation by smoothed empirical transformations. Ann. Statist. 22, 185-210.

Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346-1370.

Samiuddin, M. and el-Sayyad, G. M. (1990). On nonparametric kernel density estimates. Biometrika 77, 865-874.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York.

Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.

Department of Mathematics, National Taiwan University, Taipei 106, Taiwan. E-mail: cheng@math.ntu.edu.tw

Centre for Mathematics and Its Applications, Australian National University, Canberra, ACT 0200, Australia.

E-mail: halpstat@pretty.anu.edu.au

