
ERROR-DEPENDENT SMOOTHING RULES IN LOCAL LINEAR REGRESSION

Ming-Yen Cheng$^{1,2}$ and Peter Hall$^{1}$

$^1$Australian National University and $^2$National Taiwan University

Abstract: We suggest an adaptive, error-dependent smoothing method for reducing the variance of local-linear curve estimators. It involves weighting the bandwidth used at the ith datum in proportion to a power of the absolute value of the ith residual. We show that the optimal power is 2/3. Arguing in this way, we prove that asymptotic variance can be reduced by 24% in the case of Normal errors, and by 35% for double-exponential errors. These results might appear to violate Jianqing Fan's bounds on performance of local-linear methods, but note that our approach to smoothing produces nonlinear estimators. In the case of Normal errors, our estimator has slightly better mean squared error performance than that suggested by Fan's minimax bound, calculated by him over all estimators, not just linear ones. However, these improvements are available only for single functions, not uniformly over Fan's function class. Even greater improvements in performance are achievable for error distributions with heavier tails. For symmetric error distributions the method has no first-order effect on bias, and existing bias-reduction techniques may be used in conjunction with error-dependent smoothing. In the case of asymmetric error distributions an overall reduction in mean squared error is achievable, involving a trade-off between bias and variance contributions. However, in this setting, the technique is relatively complex and probably not practically feasible.

Key words and phrases: Bandwidth, kernel method, nonparametric regression, tail weight, variance reduction.

1. Introduction

There is a great variety of bias reduction methods for nonparametric curve estimation, ranging from high-order kernel techniques (e.g., Wand and Jones (1995), Chapters 2 and 5) to local bandwidth adjustments (e.g., Abramson (1982, 1984), Jones (1990)), methods based on varying location and scale (e.g., Samiuddin and el-Sayyad (1990), Jones, McKay and Hu (1994)), empirical transformations (e.g., Ruppert and Cline (1994)), weights (e.g., Jones, Linton and Nielsen (1995)), and skew computation (e.g., Choi and Hall (1998)). All have variants for both nonparametric density estimation and nonparametric regression, although often they are introduced first in the former setting. However, very few methods have been suggested for reducing the impact of variance. Those that do exist involve principally deterministic adjustments to bandwidth, altering the local trade-off between variance and squared bias in the context of mean integrated squared error.

In the present paper, and in the context of nonparametric regression, we suggest a new and entirely different approach to variance reduction. It involves adjusting bandwidth in a stochastic, rather than deterministic, way, with the aim of providing improved performance by giving greater weight to data pairs that correspond to smaller absolute errors. In an extreme case, if an error were exactly zero then we would wish to use the corresponding data pair for interpolation, rather than simply for smoothing. Interpolation would correspond to using a bandwidth of zero, and so our “error dependent” method involves taking the bandwidth to be a function of the error — or more practically, of the residual. The function is non-degenerate, even in the asymptotic limit. For local-linear estimators we suggest that it be taken proportional to the two-thirds power of the absolute value of the error (or residual). Different powers are appropriate for higher-order methods, with the power increasing to 1 as order increases.

The idea of allowing the bandwidth for the $i$th datum to depend nontrivially on the $i$th error is reminiscent of Abramson's (1982) approach to bias reduction in density estimation. There, the bandwidth for smoothing $X_i$ when estimating the density $f$ is ideally taken proportional to a negative power of $f(X_i)$. While this approach has an analogue for nonparametric regression (Hall (1990)), it nevertheless reduces bias rather than variance, and is not closely related to the technique suggested here. In particular, error-dependent smoothing reduces variance by a constant factor, and has no first-order impact on bias (in the case of symmetric error distributions); most bias reduction methods reduce bias by an order of magnitude and inflate variance by a constant factor.

In large samples, our method reduces the variance contribution to mean integrated squared error by 24% in the case of Normal errors, by 35% for double-exponentially distributed errors, and by even greater amounts for very heavy-tailed error distributions. These figures might appear to violate the bounds asserted by Fan (1993) for performance of local-linear estimators. There is in fact no contradiction, however, principally for the following reason. Fan's minimax theory applies to classes of regression functions that have, in effect, two bounded derivatives. By way of contrast, our results require the functions to have two continuous derivatives. Our claim about improved performance does need continuity of the second derivative, and in particular does not hold uniformly across the function class addressed by Fan. Other dissimilarities too should be borne in mind. For example, Fan's (1993) result about local-linear estimators applies only within the class of linear techniques, and (after error-dependent smoothing) our estimators are nonlinear. Furthermore, Fan's other minimax bounds, measuring performance against nonlinear techniques, do not specifically address the range of heavy-tailed error distributions that are considered in the present paper, since his class $\mathcal{C}_2$ of models for the "max" part of "minimax" contains models with Normal errors.

Our methods are relatively simple when the error distribution is symmetric or nearly symmetric, and that is the context in which they have greatest practical significance. In the case of asymmetry, however, error-dependent smoothing can introduce an additional bias term to the estimator. This makes it relatively complex to minimise mean squared error. For the sake of completeness we briefly explore the general case from a theoretical viewpoint but, since the practical attraction of the asymmetric setting is not high, we do not describe its numerical properties.

From a practical viewpoint, empirical bandwidth choice methods that produce overly variable bandwidths can require relatively large samples in order to achieve theoretically optimal levels of performance. The rule proposed in this paper, where the bandwidth is proportional to a power of a residual, can be subject to excessive fluctuation when residuals are either close to 0 or large in absolute value. This suggests that a truncation argument be used to dampen variability and improve performance. The need to choose truncation points adds to the complexity of the method, however, as does the necessity of selecting a pilot bandwidth in order to calculate residuals. The result is that, in small to moderate sized samples, mean squared error reductions offered by a practical version of our method may be some distance from theoretically optimal levels. Nevertheless, our theory and simulation analysis show clearly the potential of the methods.

Provided errors are stochastically independent, conditional on design points, there is no difficulty in combining error-dependent smoothing with a bias-reduction method. The effect is to reduce bias by an order of magnitude, and reduce variance by a constant factor relative to the value it would assume if only bias reduction were employed. Likewise, error-dependent smoothing may be applied in conjunction with a spatially-local bandwidth choice procedure, such as that suggested by Fan and Gijbels (1995): one simply takes the scale factor $h$, in formulae such as $h_i = h\,H(\hat\epsilon_i)$ discussed in Section 2, to depend on spatial location $x$. The method is also applicable to the case of heteroscedasticity, no matter whether the variance function is modelled parametrically or estimated nonparametrically. See Section 2.7 for discussion.

2. Methodology and Main Results

2.1. Model and estimator

Assume that independent and identically distributed data pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$ are generated by the model
$$Y_i = g(X_i) + \epsilon_i, \quad (2.1)$$
where $X_i$ is independent of $\epsilon_i$, and $E(\epsilon_i) = 0$. In the "ideal" case, where the errors $\epsilon_i$ are known, we define a bandwidth $h_i$ by $h_i = h\,H(\epsilon_i)$. Here, $h = h(n)$ denotes a sequence of positive constants, and $H$ is a fixed positive function. More realistically, we might approximate $\epsilon_i$ by a residual $\hat\epsilon_i$, and put $h_i = h\,H(\hat\epsilon_i)$. In either case, and for fixed $x$, let $(\hat a, \hat b)$ denote the pair that minimises
$$\sum_{i=1}^{n} \{Y_i - a - b\,(X_i - x)\}^2\, h_i^{-1} K\{(X_i - x)/h_i\},$$
where $K$ is a kernel function, and put $\hat g(x) = \hat a$.

We expect $H(x)$ to be an increasing function of $|x|$, in which case smaller absolute errors (or residuals) produce less smoothing. In the "realistic" case, where residuals are used rather than the errors themselves, particularly large absolute values of residuals may not necessarily reflect similar values of errors. Rather, they might result from inaccuracies in the pilot estimator of $g$ that is used to compute residuals. This means that we should usually threshold the smoothing parameter, to guard against outlying values. Thresholding is also useful if we are to ward off problems with sparse design, which do not make themselves felt through the asymptotic distribution of $\hat g$. These difficulties do not arise in the "ideal" case, which therefore offers more insight into the operation of level-dependent thresholding. We address the "ideal" and "realistic" cases in Sections 2.4 and 2.5, respectively.
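To make the construction concrete, the following minimal sketch (Python with NumPy; the function name `local_linear_ed`, the bandwidth floor and all parameter choices are our own illustrative assumptions, not the paper's) computes $\hat g(x) = \hat a$ via the closed-form ratio $\hat g = (S_2 T_0 - S_1 T_1)/(S_2 S_0 - S_1^2)$ used in the Appendix, with error-dependent bandwidths $h_i = h\,H(\epsilon_i)$ and $H(\epsilon) = |\epsilon|^{2/3}$.

```python
import numpy as np

def local_linear_ed(x, X, Y, h, eps, floor=1e-3):
    """'Ideal' error-dependent local-linear estimate of g(x).

    Bandwidths are h_i = h * |eps_i|**(2/3); `floor` is an illustrative
    lower threshold guarding against h_i = 0 when an error vanishes,
    in the spirit of the truncation discussed in Sections 2.1 and 2.5.
    """
    hi = np.maximum(h * np.abs(eps) ** (2 / 3), floor)
    d = X - x
    u = d / hi
    # biweight kernel, as in the paper's simulations
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0) / hi
    # weighted local-linear fit of a + b(X - x); ghat(x) = a
    S0, S1, S2 = w.sum(), (w * d).sum(), (w * d**2).sum()
    T0, T1 = (w * Y).sum(), (w * Y * d).sum()
    return (S2 * T0 - S1 * T1) / (S2 * S0 - S1**2)
```

In the "realistic" case one would pass residuals $\hat\epsilon_i$ from a pilot fit in place of `eps`.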

2.2. Overview of theory

First we deal with the "ideal" setting. In the case of classical local-linear smoothing, where $H \equiv 1$, it is well known that the local-linear estimator has bias of size $h^2$ and error about the mean of size $(nh)^{-1/2}$. Specifically, under mild regularity conditions (see Fan (1993)),

$$\hat g = g + \tfrac12 h^2 g'' \kappa_2 + (nh)^{-1/2} f^{-1/2} \kappa^{1/2} \sigma\, N_n + o_p(h^2), \quad (2.2)$$
where $\kappa_2 = \int u^2 K(u)\,du$, $\kappa = \int K^2$, $\sigma^2 = \mathrm{var}(\epsilon)$, and $N_n$ denotes a random variable whose distribution is asymptotically Normal $N(0,1)$. The second term on the right-hand side of (2.2) represents systematic error, and the third is stochastic error.

More generally, suppose $H$ is a non-degenerate function. If $E\{H(\epsilon)^2\} < \infty$ then we may standardise $H$ so that $E\{H(\epsilon)^2\} = 1$. We claim that in these circumstances, provided the error distribution is symmetric and $H$ is an even function, the expansion at (2.2) continues to hold, except that the third term is multiplied by the factor $\rho$, where $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-1}\}/\sigma^2$:
$$\hat g = g + \tfrac12 h^2 g'' \kappa_2 + (nh)^{-1/2} f^{-1/2} \kappa^{1/2} \sigma \rho\, N_n + o_p(h^2). \quad (2.3)$$
As in (2.2), $N_n$ denotes a random variable that is asymptotically Normal $N(0,1)$, although it will assume different numerical values in appearances at (2.2) and (2.3).

Minimising $\rho$ subject to $E\{H(\epsilon)^2\} = 1$ is an elementary variational problem. The minimum is achieved when $H(\epsilon)$ equals a constant multiple of $|\epsilon|^{2/3}$, in which case $\rho^2 = \rho_0^2$, where
$$\rho_0^2 \equiv \{E(|\epsilon|^{4/3})\}^{3/2}/\sigma^2 < 1. \quad (2.4)$$
Of course, taking $H(\epsilon)$ proportional to $|\epsilon|^{2/3}$ is all that is needed for optimal reduction of asymptotic mean squared error. The particular constant is absorbed into the non-random multiplier, $h$, and so is immaterial.

For Normal errors, and for our estimator rather than a conventional local-linear estimator, our results are suggestive of a version of Fan's (1993) Theorem 4 in which his constant 0.8962 is replaced by 1.00, this being the value (to two decimal places) of
$$(1.243\, \rho_0^{8/5})^{-1} = \big(1.243 \times [2\, \{\Gamma(7/6)/\pi^{1/2}\}^{3/2}]^{4/5}\big)^{-1} \approx 1.00242. \quad (2.5)$$
The formula here represents the efficiency, defined as the ratio of mean squared errors, of our estimator relative to the "optimal" nonlinear regression function estimator, when estimating a fixed, twice continuously differentiable target $g$. The "optimal" estimator here is the one that gives minimum mean squared error in a minimax sense, uniformly over functions that have two bounded derivatives; it does not make use of continuity of the second derivative. Details of the origin of the formula are given by Donoho and Liu (1991), and in fact the figure 1.243 is taken from Donoho and Liu's Table 1. It was apparently derived by combining, in a conservative way, numerical results obtained by Donoho, Liu and MacGibbon (1990). The exact value of $\rho_0^2$ in the Normal case may be shown to equal $2\{\Gamma(7/6)/\pi^{1/2}\}^{3/2} \approx 0.757$.

Note that our results rely on continuity of the second derivative of $g$, whereas Fan's results apply uniformly over a class of $g$'s that have only a uniformly bounded second derivative. The requirement for continuity is the key to reconciling Fan's results with our own. The fact that the "ideal" estimator is not a true estimator in the usual sense is not an issue in comparing Fan's results with our own. Indeed, we show in Section 2.5 that, for any particular functions $f$ and $g$ with two continuous derivatives, the level of asymptotic performance evinced by the ideal estimator is achievable by a realistic version. In this case a pilot estimator $\tilde g$ is constructed, and used to calculate $\hat\epsilon_i = Y_i - \tilde g(X_i)$. Then, possibly after centering or thresholding these residuals, we compute the empirical bandwidth $\hat h_i = h\,|\hat\epsilon_i|^{2/3}$. We use $\hat h_i$ in place of $h_i$ to construct $\hat g$. Modulo regularity conditions, and truncation to alleviate difficulties caused by too large or too small values of $|\hat\epsilon_i|$, formula (2.3) continues to hold.


Since Fan's minimax function class $\mathcal{C}_2$ contains models where the error distribution is Normal, results such as those discussed above do not relate to improvements in performance that error-dependent smoothing can achieve in the case of heavy-tailed error distributions. Indeed, the value of $\rho_0^2$ tends to be smaller for distributions with heavier tails. Values in the cases of double Exponential, Normal and Uniform errors are respectively 0.650, 0.757 and 0.842. For errors with Student's $t$ distribution on 5, 10 or 20 degrees of freedom, the values are 0.676, 0.726 and 0.743, respectively. Figure 2.1 plots $\rho^2$ as a function of the number of degrees of freedom (interpreted in the continuum) for Student's $t$ errors. For non-Normal errors one can also construct estimators that are more efficient than $\hat g$ by replacing local least-squares by a robust method such as local $M$-estimation.
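The variance-reduction factors quoted above can be checked directly from (2.4). A small sketch (assuming SciPy; `rho0_sq` is our own illustrative name) evaluates $\rho_0^2 = \{E(|\epsilon|^{4/3})\}^{3/2}/\sigma^2$ by numerical integration:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import gamma

def rho0_sq(dist):
    """rho_0^2 = {E|eps|^{4/3}}^{3/2} / sigma^2, as at (2.4)."""
    a, b = dist.support()
    m = quad(lambda u: abs(u) ** (4 / 3) * dist.pdf(u), a, b)[0]
    return m**1.5 / dist.var()

print(rho0_sq(stats.laplace()))                # 0.650 (double Exponential)
print(rho0_sq(stats.norm()))                   # 0.757
print(2 * (gamma(7 / 6) / np.pi**0.5) ** 1.5)  # Normal closed form, also 0.757
print(rho0_sq(stats.uniform(-1, 2)))           # 0.842 (Uniform on (-1, 1))
print(rho0_sq(stats.t(5)))                     # 0.676
```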

Figure 2.1. Values of $\rho^2$ versus number of degrees of freedom for Student's $t$ error distributions.

In the case of asymmetric errors, or when $H$ is not an even function, a formula similar to (2.3) is valid, except that an extra bias term of size $h^2$ is introduced. This quantity is proportional to $E\{\epsilon H(\epsilon)^2\}$, and so is typically small if the error distribution is close to being symmetric. More generally, it is possible either to choose both $h$ and $H$ (in the formula $h_i = h\,H(\epsilon_i)$) so that asymptotic mean squared error is minimised; or to choose $H$ to minimise the asymptotic variance term subject to both $E\{H(\epsilon)^2\} = 1$ and $E\{\epsilon H(\epsilon)^2\} = 0$. The first rule produces a particularly complex function $H$, and is quite unattractive for practical implementation. The second rule gives
$$H(u) = \begin{cases} c_1\, u^{2/3} & \text{if } u > 0, \\ c_2\, |u|^{2/3} & \text{if } u < 0, \end{cases} \quad (2.6)$$
where
$$c_1 = \bigg[E\{\epsilon^{4/3} I(\epsilon > 0)\} + E\{|\epsilon|^{4/3} I(\epsilon < 0)\}\, \frac{E\{\epsilon^{7/3} I(\epsilon > 0)\}}{E\{|\epsilon|^{7/3} I(\epsilon < 0)\}}\bigg]^{-1/2}, \quad c_2/c_1 = \Big[E\{\epsilon^{7/3} I(\epsilon > 0)\}\big/E\{|\epsilon|^{7/3} I(\epsilon < 0)\}\Big]^{1/2}.$$

Thus, using error-dependent smoothing in the case of asymmetric errors requires relatively detailed inference about the error distribution, and is not as attractive as in the case of symmetry.

Note, however, that there is always a strict reduction in the variance contribution to asymptotic mean squared error. For example, if $H$ is given by (2.6), with the constants $c_1$, $c_2$ defined above, then (without requiring the assumption of symmetry) (2.3) holds with $\rho < 1$.

2.3. Generalisations and extensions

The main idea behind these results — that the variance of a nonparametric regression estimator may be reduced by using an error-dependent smoothing parameter — is applicable to a wide range of settings. For example, it is valid for linear, second-order methods such as the Nadaraya-Watson estimator (see e.g., Härdle (1990), p.25; Wand and Jones (1995), p.119), the Gasser-Müller estimator (Wand and Jones (1995), p.131) and the Priestley-Chao estimator (Wand and Jones (1995), p.130). In all these cases the asymptotically optimal form of $H$ is the same as that described above in the local-linear context. Moreover, if the error distribution is symmetric then bias is unaffected, to first order, by error-dependent smoothing.

Error-dependent smoothing is also applicable to general local-polynomial estimators of regression means and their derivatives (see e.g., Ruppert and Wand (1994)). There, if the method is of order $r$ (meaning that, in the case of a non-random bandwidth $h$, bias is of size $h^r$), the natural analogue of the condition $E\{H(\epsilon)^2\} = 1$ is $E\{H(\epsilon)^r\} = 1$. This ensures that, if the error-dependent bandwidth is $h_i = h\,H(\epsilon_i)$, if $H$ is an even function, and if the error distribution is symmetric, the bias term is the same (to first order) as it would be if the bandwidth were simply $h$. If we are estimating the $s$th derivative of $g$, for $s \ge 0$, then the variance contribution to asymptotic mean squared error is also the same to first order, except that it is reduced by the multiplicative factor $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-(2s+1)}\}/\sigma^2$.

The minimum of this quantity, subject to the constraint $E\{H(\epsilon)^r\} = 1$, is achieved when $H(\epsilon)$ is proportional to $|\epsilon|^{2/(r+2s+1)}$. Thus, the general form of $\rho_0^2$ (given at (2.4) in the case $(r, s) = (2, 0)$) is
$$\rho_0^2 = \{E(|\epsilon|^{2r/(r+2s+1)})\}^{(r+2s+1)/r}\big/\sigma^2.$$
In the case of symmetric errors, this represents the greatest amount by which variance may be reduced using error-dependent smoothing, when estimating an $s$th derivative by an $r$th order method.
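Taking the general form of $\rho_0^2$ displayed above at face value, it can be evaluated in the same way as before. In this sketch (assuming SciPy; `rho0_sq_general` is our own name) the case $(r, s) = (2, 0)$ reproduces the value 0.757 obtained from (2.4) for Normal errors.

```python
from scipy import stats
from scipy.integrate import quad

def rho0_sq_general(r, s, dist):
    """{E(|eps|^{2r/(r+2s+1)})}^{(r+2s+1)/r} / sigma^2 for an r-th order
    method estimating the s-th derivative; reduces to (2.4) at (2, 0)."""
    p = 2 * r / (r + 2 * s + 1)   # moment of the optimal H(eps) ~ |eps|^{2/(r+2s+1)}
    a, b = dist.support()
    m = quad(lambda u: abs(u) ** p * dist.pdf(u), a, b)[0]
    return m ** ((r + 2 * s + 1) / r) / dist.var()

print(rho0_sq_general(2, 0, stats.norm()))   # 0.757, matching (2.4)
print(rho0_sq_general(4, 0, stats.norm()))   # a fourth-order method, s = 0
```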

2.4. Theory in “ideal” case

Assume that $h = h(n) \to 0$ and $nh \to \infty$ as $n \to \infty$, and that $H$ is a continuous, positive function satisfying
$$E\{H(\epsilon)^3\} < \infty, \quad E\{H(\epsilon)^{-1}\} < \infty, \quad E[\epsilon^2\{H(\epsilon) + H(\epsilon)^{-1}\}] < \infty. \quad (2.7)$$
Let $K$ be a bounded, compactly supported, symmetric probability density, and let $f$ denote the marginal density of $X_i$, assumed to exist in a neighbourhood of $x$. Assume that $f$ and $g$ have two continuous derivatives in a neighbourhood of $x$, and that $f(x) > 0$.

Theorem 2.1. Under the above conditions,
$$\hat g(x) = g(x) + \tfrac12 h^2 \kappa_2 \big[g''(x)\, E\{H(\epsilon)^2\} + f'(x) f(x)^{-1} E\{\epsilon H(\epsilon)^2\}\big] + \big[(nh)^{-1} \kappa f(x)^{-1} E\{\epsilon^2 H(\epsilon)^{-1}\}\big]^{1/2} N_n(x) + o_p(h^2), \quad (2.8)$$
where $N_n(x)$ denotes a random variable whose distribution is asymptotically Normal $N(0,1)$.

Corollary. Assume the conditions of Theorem 2.1. If $E\{\epsilon H(\epsilon)^2\} = 0$ — in particular, if the distribution of $\epsilon$ is symmetric and $H$ is an even function — and if $E\{H(\epsilon)^2\} = 1$, then (2.3) holds at $x$, with $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-1}\}/\sigma^2$.

Conditions (2.7) hold if, for example, (a) $C_1 |\epsilon|^{1-\delta} \le H(\epsilon) \le C_2(1 + |\epsilon|)$ for constants $C_1, C_2, \delta > 0$, (b) $E(|\epsilon|^3) < \infty$, and (c) the probability density of $\epsilon$ exists in a neighbourhood of the origin and is bounded away from 0 there. If $H(\epsilon)$ has the optimal form $C|\epsilon|^{2/3}$ then (2.7) is valid if (c) holds and $E(|\epsilon|^{8/3}) < \infty$. The theorem may be extended to a variety of other settings, for example to the case where $x$ is a boundary point.

2.5. Theory in “realistic” case

Suppose $f$ and $g$ have two continuous derivatives in a neighbourhood of $x$, and that $f(x) > 0$. Assume too that $K$ is a bounded, compactly supported, symmetric probability density, and $K'$ exists and satisfies a Hölder condition; that $H'$ exists and satisfies a Hölder condition, and $C_1 \le H \le C_2$ for constants $0 < C_1 \le C_2 < \infty$; that $h = h(n) \to 0$ and $nh \to \infty$ as $n \to \infty$; and that $E(\epsilon^2) < \infty$. Call these assumptions (A1). (We say that a function $\psi$ satisfies a Hölder condition, or is Hölder continuous with exponent $\eta$, if there exists a constant $C > 0$ such that $|\psi(x) - \psi(y)| \le C\,|x - y|^{\eta}$ for all $x$ and $y$.)


In view of the continuity of $f'$ and $g''$, the quantity
$$\xi(\delta) \equiv \sup_{u\,:\,|x-u| \le \delta} \{|f'(x) - f'(u)| + |g''(x) - g''(u)|\} \quad (2.9)$$
converges to 0 as $\delta \to 0$. Let $h_0 = h_0(n)$ denote a bandwidth, chosen to converge to 0 sufficiently slowly for $n^{1/5} h_0 \to \infty$, and sufficiently quickly for $n^{1/5} h_0 = O(n^{\eta})$ for all $\eta > 0$ and $n^{1/5} h_0\, \xi(h_0)^{1/2} \to 0$. Call this assumption (A2). For example, if both $f'$ and $g''$ are Hölder continuous then $\xi(\delta) = O(\delta^{\eta})$ for some $\eta > 0$, and so $h_0(n) = n^{-1/5}\log n$ is an adequate choice.

Let $\tilde g$ denote a standard local-linear estimator of $g$, computed using bandwidth $h_0$. Put $\hat\epsilon_i = Y_i - \tilde g(X_i)$ and $\hat h_i = h\,H(\hat\epsilon_i)$, and redefine $\hat g(x)$ as the value $\hat a$ in the pair $(a, b) = (\hat a, \hat b)$ that minimises
$$\sum_{i=1}^{n} \{Y_i - a - b\,(X_i - x)\}^2\, \hat h_i^{-1} K\{(X_i - x)/\hat h_i\}.$$

Theorem 2.2. Under assumptions (A1) and (A2), (2.8) holds for the new estimator $\hat g$, with $N_n$ again denoting a random variable that is asymptotically Normal $N(0,1)$.

Neither Theorem 2.1 nor Theorem 2.2 is available uniformly in a function class such as Fan's $\mathcal{C}_2$. One reason, complementary to that given in Section 2.2, is that the rate at which the "$o_p(h^2)$" terms converge to 0 depends explicitly on the modulus of continuity of both $f'$ and $g''$, and can be arbitrarily slow. In the case of Theorem 2.2 this in turn influences choice of the pilot-estimator bandwidth $h_0$. Taking that quantity to be equal to a fixed constant multiple of $n^{-1/5}$, rather than of larger order than $n^{-1/5}$ (as required by assumption (A2)), results in inflation of variance relative to that achieved by the "ideal" estimator. For sufficiently heavy-tailed error distributions, such an inflation still produces a reduction in variance, relative to that for a classical local-linear estimator. This is readily apparent in both theoretical analysis and numerical simulations. For Normal data, however, slight oversmoothing of the pilot estimator and relatively large sample sizes are necessary in order to achieve obvious improvements.

The conditions imposed on $H$ in Theorem 2.2 preclude the optimal form $H(u) = \gamma|u|^{2/3}$. However, the level of performance in that case may be achieved, up to a constant factor that converges to 1 as $n \to \infty$, by considering successive approximations to the optimal $H$ by functions satisfying the conditions of the theorem. Likewise, conditions on the kernel $K$ prevent it from being the Epanechnikov function that attains asymptotically minimal mean squared error, but we may circumvent this problem by considering a sequence of approximations.

For example, the following is an empirical, thresholded version of the asymptotically optimal "ideal" procedure suggested in Section 2.2. Let $\tilde g$ be defined as before, and assume the conditions imposed in Theorem 2.2 on $K$ and on the bandwidths $h$ and $h_0$. Additionally, suppose $K$ is a compactly supported probability density with two bounded derivatives, that $f$ and $g$ have three bounded derivatives in a neighbourhood of $x$, that $f(x) > 0$, that $E(\epsilon^4) < \infty$, and that $H$ is defined by
$$H(u) = \begin{cases} \gamma\,|u|^{2/3} & \text{if } n^{-\alpha} \le |u| \le n^{\beta}, \\ \gamma\,n^{-2\alpha/3} & \text{if } |u| < n^{-\alpha}, \\ \gamma\,n^{2\beta/3} & \text{if } |u| > n^{\beta}, \end{cases} \quad (2.10)$$
where $\alpha$, $\beta$, $\gamma$ denote positive constants. Then it is possible to choose $\alpha, \beta > 0$ such that, for all $\gamma > 0$, (2.8) holds for the new version of $\hat g$ (constructed using the bandwidths $\hat h_i = h\,H(\hat\epsilon_i)$). The proof is particularly complex, and so will not be given here. The function $H$ at (2.10) achieves the asymptotic performance represented by (2.8) with $H(u) \equiv \gamma\,|u|^{2/3}$.
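In code, the truncated function at (2.10) is simply a clipped power law. A minimal sketch (NumPy; the function name is ours, and `alpha`, `beta`, `gamma` are the constants of (2.10)):

```python
import numpy as np

def H_truncated(u, n, alpha, beta, gamma=1.0):
    """The thresholded H of (2.10): gamma * |u|**(2/3), with |u| clipped
    to the interval [n**-alpha, n**beta] before the power is taken."""
    clipped = np.clip(np.abs(u), n ** (-alpha), n ** beta)
    return gamma * clipped ** (2 / 3)

# empirical bandwidths from pilot residuals res:
# h_i = h * H_truncated(res, n, alpha, beta)
```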

2.6. Bandwidth choice

Note that in view of property (2.3) the optimal bandwidth, $h^{\rm ed}_{\rm opt}$ say, for an error-dependent smoothing rule is $h^{\rm ed}_{\rm opt} = h^{\rm llin}_{\rm opt}\, \rho^{2/5}$, where $h^{\rm llin}_{\rm opt}$ is the optimal bandwidth for the conventional local-linear estimator, and $\rho^2 = E\{\epsilon^2 H(\epsilon)^{-1}\}/\sigma^2$. (It is assumed here that $H$ has been standardised so that $E\{H(\epsilon)^2\} = 1$.) Therefore, to construct an empirical approach to bandwidth choice for an error-dependent smoothing rule we can use our "favourite" technique (such as a plug-in method or cross-validation) to compute an empirical approximation $\hat h^{\rm llin}$ to $h^{\rm llin}_{\rm opt}$, and take $\hat h^{\rm ed} = \hat h^{\rm llin}\, \hat\rho^{2/5}$ to be our empirical approximation to $h^{\rm ed}_{\rm opt}$, where
$$\hat\rho^2 = \Big(n^{-1}\sum_{i=1}^{n} \hat\epsilon_i^2\, H(\hat\epsilon_i)^{-1}\Big) \Big(n^{-1}\sum_{i=1}^{n} H(\hat\epsilon_i)^2\Big)^{1/2} \Big(n^{-1}\sum_{i=1}^{n} \hat\epsilon_i^2\Big)^{-1}.$$
(The second factor here gives an empirical version of the standardisation $E\{H(\epsilon)^2\} = 1$.) It may be proved that if the error distribution is symmetric, and if (in the construction of $\hat g(x)$) we take $\hat h_i = \hat h^{\rm llin}\, \hat\rho^{2/5} H(\hat\epsilon_i)$, and replace $h$ on the right-hand side of (2.8) by $h^{\rm ed}_{\rm opt}$, then (2.3) continues to hold.
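A direct empirical sketch of this rule (NumPy; `res` denotes residuals from the pilot fit and `H` a callable such as the truncated form of Section 2.5; the function name is ours):

```python
import numpy as np

def rho_hat_sq(res, H):
    """Empirical rho^2 of Section 2.6; the middle factor supplies an
    empirical version of the standardisation E{H(eps)^2} = 1."""
    return (np.mean(res**2 / H(res))
            * np.sqrt(np.mean(H(res) ** 2))
            / np.mean(res**2))

# given a favourite local-linear bandwidth h_llin (plug-in, cross-validation, ...):
# h_ed = h_llin * rho_hat_sq(res, H) ** (1 / 5)   # since rho^{2/5} = (rho^2)^{1/5}
```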

There is also a more complex cross-validation approach which adjusts implicitly for error-dependent smoothing, and does not require the multiplicative adjustment by $\hat\rho^{2/5}$. It is very computer intensive, however.

2.7. Heteroscedastic errors

For simplicity and brevity we have presented results only for models with independent and identically distributed errors. However, our methods apply virtually without change in heteroscedastic cases, where the error variance changes with $x$. The simplicity with which this setting can be treated derives from the fact that, provided error variance is a smooth function of $x$, the model is "locally homoscedastic".

Indeed, suppose $Y_i = g(X_i) + \epsilon_i$ where $\epsilon_i = \zeta_i\,\sigma(X_i)$, the variables $X_1, \zeta_1, \ldots, X_n, \zeta_n$ are independent, the $X_i$'s are identically distributed, the $\zeta_i$'s are identically distributed with zero mean and unit variance, and the function $\sigma$ is continuous. Denote these conditions by assumption (A3) (it replaces the overarching assumption made in Section 2.1 that the pairs $(X_i, Y_i)$ are independent and identically distributed, which implies that the $\epsilon_i$'s are independent and identically distributed). If we continue to define $\hat\epsilon_i = Y_i - \tilde g(X_i)$, if we continue to take $\hat h_i = h\,H(\hat\epsilon_i)$, if on the right-hand side of (2.8) we replace $\epsilon$ in all the expectations by $\zeta\,\sigma(x)$, and if we assume (A1)–(A3), then Theorem 2.2 continues to hold. The method of proof is virtually identical.

The only essential difference between homoscedastic and heteroscedastic cases lies in the way bandwidth is computed. We discuss this issue in the setting of local bandwidth choice, which seems more appropriate when error variance changes with location. Assume the distribution of $\zeta$ is symmetric, and for simplicity consider the case where $H(u) = |u|^{2/3}$. (Truncated versions of $H$, such as those considered at the end of Section 2.5, are approximations to this function. There is no need to include a constant multiplier, since it may be incorporated into the bandwidth.) Put $\mu = E(|\zeta|^{4/3})$. Then, noting the result reported in the previous paragraph, we show that
$$\hat g(x) = g(x) + \tfrac12 h^2 \kappa_2\, g''(x)\, \mu\, \sigma(x)^{4/3} + \big[(nh)^{-1} \kappa f(x)^{-1} \mu\, \sigma(x)^{4/3}\big]^{1/2} N_n(x) + o_p(h^2).$$

It may be proved from this formula that the bandwidth $h^{\rm ed}_{\rm opt}(x)$ that minimises asymptotic mean squared error at $x$ may be expressed as $h^{\rm ed}_{\rm opt}(x) = h^{\rm llin}_{\rm opt}(x)\, \{\mu\,\sigma(x)^{4/3}\}^{-1/5}$, where $h^{\rm llin}_{\rm opt}(x) = \{\kappa\,\sigma(x)^2 / [n \kappa_2^2 f(x)\, g''(x)^2]\}^{1/5}$ is the bandwidth that minimises mean squared error of the standard local-linear estimator. Therefore, given an empirical version $\hat h^{\rm llin}_{\rm opt}(x)$ of $h^{\rm llin}_{\rm opt}(x)$ (see e.g., Fan and Gijbels (1995)), and estimators $\hat\mu$ and $\hat\sigma(x)$ of $\mu$ and $\sigma(x)$ respectively, we may calculate an empirical version $\hat h^{\rm ed}_{\rm opt}(x) = \hat h^{\rm llin}_{\rm opt}(x)\,\{\hat\mu\,\hat\sigma(x)^{4/3}\}^{-1/5}$ of $h^{\rm ed}_{\rm opt}(x)$. There are several ways of computing $\hat\mu$ and $\hat\sigma(x)$; we give only two here. In the first, $\sigma(\cdot)$ is modelled parametrically by a smooth function, for example a linear or a quadratic function. Parameters of the model, and hence $\sigma(\cdot)$, can be consistently estimated by treating centered residuals as though they were true values of the $\epsilon_i$'s. In this way, an estimator $\hat\sigma(\cdot)$ of $\sigma(\cdot)$ can be calculated. Residual values of the $\zeta_i$'s can now be computed as $\hat\zeta_i = \hat\epsilon_i/\hat\sigma(X_i)$, and $\mu$ estimated by $\hat\mu = n^{-1}\sum_{i=1}^{n} |\hat\zeta_i|^{4/3}$. In the second, $\sigma(\cdot)$ is estimated nonparametrically, by considering the nonparametric regression problem in which squares of centered residuals are regressed on their expected values. Once $\hat\sigma(\cdot)$ has been calculated in this way, $\hat\mu$ can be computed as before.
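A sketch of the resulting local bandwidth rule (NumPy; the function names are ours, and `sigma_hat` denotes whichever of the two estimates of $\sigma(\cdot)$ described above is used):

```python
import numpy as np

def mu_hat_from_residuals(res, sigma_hat_X):
    """Estimate mu = E(|zeta|^{4/3}) from standardised residuals."""
    zeta = res / sigma_hat_X                 # zeta_i = res_i / sigma_hat(X_i)
    return np.mean(np.abs(zeta) ** (4 / 3))

def h_ed_local(h_llin_x, sigma_hat_x, mu_hat):
    """h_ed(x) = h_llin(x) * {mu * sigma(x)^{4/3}}^{-1/5} (Section 2.7)."""
    return h_llin_x * (mu_hat * sigma_hat_x ** (4 / 3)) ** (-1 / 5)
```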

3. Numerical Properties

In this section we report a simulation study conducted to examine numerical properties of error-dependent smoothing rules. In this work we took $g(x) = 4\sin(2\pi x)$ and used equally-distributed design points on $(0, 1)$. The error distribution was either Normal $N(0, 2.25)$ or Student's $t$ with 5 degrees of freedom. Sample size $n$ was 50, 100, 200 or 500. The biweight kernel $K(u) = (1 - u^2)^2\, I(-1 < u < 1)$ was employed.

When plotting integrated squared biases, variances and mean squared errors against bandwidth, $h$ ranged over 51 logarithmically equispaced values. We do not explore empirical bandwidth choice, since the additional variation that it introduces may confound differences between conventional local-linear techniques and our method. In theory the second bandwidth, $h_0$, used for the pilot estimator has only a second-order effect on the results, although it should be taken larger than the theoretically optimal bandwidth. In all the work reported here we chose $h_0$ to be 25% larger than the standard optimal value.

Three estimators were considered: (i) $\hat g_L$, the standard local-linear estimator; (ii) $\hat g_I$, the "ideal" error-dependent local-linear estimator with $h_i = h\,|\epsilon_i|^{2/3}$; and (iii) $\hat g_R$, the "realistic" error-dependent local-linear estimator using the function $H$ at (2.10) with $(n^{-\alpha}, n^{\beta}) = (0.2^{2/3}, 8^{2/3})$ and residuals obtained from a pilot estimation. For every setting, 1000 random samples were generated. Each estimator was evaluated over an equispaced grid of 400 points, with the interpolation method of Hall and Turlach (1997) used to guard against sparse-design problems caused by too-small choices of bandwidth. The mean integrated squared biases and variances were approximated by averaging over the 1000 realizations.
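The skeleton of a single simulation replication (a sketch reusing the `local_linear_ed` function from the Section 2.1 sketch; the seed, the bandwidth value, the uniform random design and the omission of the Hall and Turlach (1997) interpolation safeguard are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return 4 * np.sin(2 * np.pi * x)      # the paper's sine regression function

n, h = 100, 0.05                          # illustrative sample size and bandwidth
X = np.sort(rng.uniform(0, 1, n))         # design points on (0, 1)
eps = rng.standard_t(5, size=n)           # t_5 errors
Y = g(X) + eps

grid = np.linspace(0.01, 0.99, 400)       # equispaced evaluation grid of 400 points
ghat = np.array([local_linear_ed(x, X, Y, h, eps) for x in grid])  # 'ideal' rule
ise = np.mean((ghat - g(grid)) ** 2)      # integrated squared error, one sample
print(ise)
```

Averaging `ise` over many replications approximates the mean integrated squared errors reported below; the "realistic" estimator $\hat g_R$ would instead pass pilot-fit residuals in place of `eps`.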

Figure 3.1 summarises results when $n = 100$ and the error distribution was Student's $t$ with 5 degrees of freedom. First, we note that error-dependent smoothing rules lead to a significant decrease in the mean integrated variance (see panel (c)) while maintaining almost the same level of mean integrated squared bias (panel (b)). This is as predicted by our asymptotic theory. Furthermore, the minimum mean integrated squared error (MISE) values for $\hat g_L$, $\hat g_I$ and $\hat g_R$ are 0.137, 0.096 and 0.119, respectively. Equivalently, the "ideal" and "realistic" error-dependent smoothing rules reduced MISE by 30% and 13% respectively. Greater reductions occurred for larger values of $n$, until they asymptoted to the large-sample limit predicted in Section 2.


Figure 3.1. Simulation results for the sine regression function, depicted in panel (a), and for sample size $n = 100$ and $t_5$ errors. In panels (b), (c) and (d), respectively, the integrated squared biases, variances and mean squared errors of $\hat g_L$ (solid lines), $\hat g_I$ (dotted lines) and $\hat g_R$ (dashed lines) are plotted against bandwidth on a log-log scale. The vertical lines, with consistent line types, locate the optimal bandwidths that produced minimum mean integrated squared errors.

Figure 3.2 is the analogue of Figure 3.1 except that the error distribution is now Normal $N(0, 2.25)$. The "ideal" error-dependent smoothing rule again produces significant MISE reduction. However, MISE reductions in the "realistic" case only become significant for $n = 500$. In the case of Normal errors, error-dependent smoothing rules tend to slightly inflate the bias while reducing the variance. The increase in bias is of course a second-order effect, since it is not evident in the first-order theoretical analysis in Section 2. More generally, simulations with Student's $t$ errors with a range of degrees of freedom show that the extent of MISE improvement offered by $\hat g_R$ declines, and the extent of bias inflation increases, as the error tail-weight decreases.


Figure 3.2. Simulation results for the sine regression function, n = 100 and Normal (0, 2.25) errors. Functions and line types are the same as in Figure 3.1.

In summary, tail weight of the error distribution, and second-order contributions from bias, can have a significant effect on the performance of error-dependent smoothing rules. In relation to the bias issue we found that, for a given error distribution, performance of $\hat g_R$ relative to $\hat g_L$ could be made better or worse by using target functions that produced lesser or greater amounts of bias, respectively. For example, by taking the sinusoid $g$ to have only half a wavelength over the interval $(0, 1)$ we could enhance performance of $\hat g_R$ relative to $\hat g_L$; and by giving it more than one wavelength we could reduce relative performance.

A reviewer expressed concern that our use of grids for computation implied the method might be excessively computationally expensive. We use grids only to numerically determine mean integrated squared errors. In particular, no grid-search methods are required. All our techniques involve only explicit calculation; nothing is defined implicitly, and no equations have to be solved in order to construct our estimators. In practice one could use binning procedures to accelerate calculation. See for example page 238 of Scott (1992).

Acknowledgements

We are grateful to Iain Johnstone for very helpful discussion, and to three reviewers for helpful comments on the first version of the paper.

Appendix

A.1. Proof of Theorem 2.1.

We may write $\hat g = (S_2 T_0 - S_1 T_1)/(S_2 S_0 - S_1^2)$, where, defining $K_i(x) = K\{(X_i - x)/h_i\}$ and, for a vector $a = a(x) = (a_1, \ldots, a_n)$, setting $U(a) = n^{-1}\sum_i h_i^{-1} a_i K_i$, we put $S_j = U(a)$ for $a_i(x) = (X_i - x)^j$, and $T_j = U(a)$ for $a_i(x) = Y_i (X_i - x)^j$. Let $T_{j1} = U(a)$ with $a_i(x) = g(X_i)(X_i - x)^j$, $T_{j2} = U(a)$ with $a_i(x) = \epsilon_i (X_i - x)^j$, and $R_j = U(a)$ with $a_i = h_i^j$. Then $T_j = T_{j1} + T_{j2}$. (For the sake of simplicity we shall often suppress the argument $x$.)

It may be proved by Taylor expansion that
$$S_2 T_{01} - S_1 T_{11} = g\,(S_2 S_0 - S_1^2) + \tfrac12 g''\,[\kappa_2 h^2 f\, E\{H(\epsilon)^2\}]^2 + o_p(h^4) \quad (A.1)$$
and $S_2 S_0 - S_1^2 = \kappa_2 h^2 f^2 E\{H(\epsilon)^2\} + o_p(h^2)$. Therefore,
$$\hat g_1 \equiv (S_2 T_{01} - S_1 T_{11})/(S_2 S_0 - S_1^2) = g + \tfrac12 g'' \kappa_2 h^2 E\{H(\epsilon)^2\} + o_p(h^2). \quad (A.2)$$
Noting that (2.7) implies $E\{|\epsilon|\, H(\epsilon)^2\} < \infty$, it may be proved that for $j = 0, 1$,
$$E(T_{j2}) = h^j \kappa_j f(x)\, E\{\epsilon H(\epsilon)^j\} + h^{j+1} \kappa_{j+1} f'(x)\, E\{\epsilon H(\epsilon)^{j+1}\} + \tfrac12 h^{j+2} \kappa_{j+2} f''(x)\, E\{\epsilon H(\epsilon)^{j+2}\} + o(h^{j+2}), \quad (A.3)$$
$$\mathrm{var}(T_{j2}) \sim n^{-1} h^{2j-1} E\{\epsilon^2 H(\epsilon)^{2j-1}\}\, f \int y^{2j} K(y)^2\, dy, \quad (A.4)$$
where in (A.3), when $j = 1$, it is understood that we take the expansion only up to a remainder of $o(h^2)$. Lindeberg's condition for the series $T_{02}$, normalised by its standard deviation, holds provided that, for any $C_1, C_2 > 0$,
$$(nh)^{-1} \sum_{i=1}^{n} E[\epsilon_i^2\, H(\epsilon_i)^{-2}\, I(|X_i - x| \le C_1 h_i)\, I\{|\epsilon_i|\, H(\epsilon_i)^{-1} > C_2 (nh)^{1/2}\}] \to 0. \quad (A.5)$$
(See e.g., Chung (1974), p.205, for Lindeberg's condition.) Result (A.5) follows from the fact that $E\{\epsilon^2 H(\epsilon)^{-1}\} < \infty$, and so by Lindeberg's central limit result with (A.3) and (A.4) we deduce that, for fixed values of the argument $x$,
$$\hat g_2 \equiv (S_2 T_{02} - S_1 T_{12})/(S_2 S_0 - S_1^2) = \tfrac12 f' f^{-1} \kappa_2 h^2 E\{\epsilon H(\epsilon)^2\} + \big[(nh)^{-1} f^{-1} \kappa\, E\{\epsilon^2 H(\epsilon)^{-1}\}\big]^{1/2} N_n + o_p(h^2),$$
where $N_n$ is asymptotically Normal $N(0, 1)$. The theorem follows from this expansion and (A.2), on noting that $\hat g = \hat g_1 + \hat g_2$.

A.2. Proof of Theorem 2.2.

Put $\tilde K_i = K\{(X_i - x)/\hat h_i\}$ and $\tilde U(a) = n^{-1}\sum_i \hat h_i^{-1} a_i \tilde K_i$, and let $\tilde S_j$, $\tilde T_{j1}$, $\tilde T_{j2}$, $\tilde T_j$ and $\tilde R_j$ denote the versions of $\tilde U(a)$ that arise with $a_i(x) = (X_i - x)^j$, $g(X_i)(X_i - x)^j$, $\epsilon_i (X_i - x)^j$, $Y_i (X_i - x)^j$ and $\hat h_i^j$, respectively. Then $\tilde T_j = \tilde T_{j1} + \tilde T_{j2}$ and $\hat g = (\tilde S_2 \tilde T_0 - \tilde S_1 \tilde T_1)/(\tilde S_2 \tilde S_0 - \tilde S_1^2)$. Since $H$ is assumed bounded away from zero and infinity, $\tilde R_j = O_p(h^j)$ for each $j$. Therefore, the following analogue of (A.1) holds, derived in a similar manner:
$$\tilde S_2 \tilde T_{01} - \tilde S_1 \tilde T_{11} = g\,(\tilde S_2 \tilde S_0 - \tilde S_1^2) + \tfrac12 g''\,(\tilde S_2^2 - \tilde S_1 \tilde S_3) + o_p(h^4). \quad (A.6)$$
Let $C_1, C_2, \ldots$ denote generic finite, strictly positive constants, let $\eta > 0$ be a Hölder exponent appropriate for both $H'$ and $K'$, and put $\Delta = \tilde g - g$, $H_1 = H'/H$ and $h_i = h\,H(\epsilon_i)$. Since $C_1 \le H \le C_2$, we may choose $C_3, C_4, C_5 > 0$ such that, by Taylor expansion and uniformly in $i$,
$$\bigg|\frac{H(\hat\epsilon_i) - H(\epsilon_i)}{H(\epsilon_i)} + \Delta(X_i)\, H_1(\epsilon_i)\bigg| \le C_3\, |\Delta(X_i)|^{1+\eta}, \quad (A.7)$$
$$\bigg|K\Big(\frac{X_i - x}{\hat h_i}\Big) - K\Big(\frac{X_i - x}{h_i}\Big) + \frac{\hat h_i - h_i}{h_i}\, \frac{X_i - x}{h_i}\, K'\Big(\frac{X_i - x}{h_i}\Big)\bigg| \le C_4\, \bigg|\frac{\hat h_i - h_i}{h_i}\bigg|^{1+\eta} I(|X_i - x| \le C_5 h). \quad (A.8)$$

Combining (A.7) and (A.8) we see that, with $L(u) = u\,K'(u)$,
$$\bigg|K\Big(\frac{X_i - x}{\hat h_i}\Big) - K\Big(\frac{X_i - x}{h_i}\Big) - \Delta(X_i)\, H_1(\epsilon_i)\, L\Big(\frac{X_i - x}{h_i}\Big)\bigg| \le C_6\, |\Delta(X_i)|^{1+\eta}\, I(|X_i - x| \le C_5 h). \quad (A.9)$$
Similarly but more simply, with $H_2 = H'/H^2$ we have $|h(\hat h_i^{-1} - h_i^{-1}) - \Delta(X_i)\, H_2(\epsilon_i)| \le C_7\, |\Delta(X_i)|^{1+\eta}$. Combining this result with (A.9), and defining $M = K + L$, we deduce that for any sequence $A_1, \ldots, A_n$,
$$\bigg|\sum_{i=1}^{n} \hat h_i^{-1} (X_i - x)^j A_i\, K\{(X_i - x)/\hat h_i\} - \sum_{i=1}^{n} h_i^{-1} (X_i - x)^j A_i\, K\{(X_i - x)/h_i\} - h^{-1} \sum_{i=1}^{n} \Delta(X_i)\, (X_i - x)^j A_i\, H_2(\epsilon_i)\, M\{(X_i - x)/h_i\}\bigg| \le C_8\, h^{-1} \sum_{i=1}^{n} |\Delta(X_i)|^{1+\eta}\, |X_i - x|^j\, |A_i|\, I(|X_i - x| \le C_5 h). \quad (A.10)$$

Put $\lambda_n = n^{1/5} h_0$. It is straightforward to prove that $\Delta(X_i) = O_p(n^{-2/5}\lambda_n^2)$ uniformly in values of $i$ such that $|X_i - x| \le C_5 h$. Hence, when $A_i \equiv 1$ we obtain from (A.10) the result
$$\bigg|\sum_{i=1}^{n} \hat h_i^{-1} (X_i - x)^j K\{(X_i - x)/\hat h_i\} - \sum_{i=1}^{n} h_i^{-1} (X_i - x)^j K\{(X_i - x)/h_i\}\bigg| = O_p(n^{3/5} h^j \lambda_n^2).$$
Equivalently, $\tilde S_j - S_j = O_p(n^{-2/5} h^j \lambda_n^2)$, where $S_j$ is as in the proof of Theorem 2.1. From this result and the property $n^{-2/5}\lambda_n^2 = o(1)$ it may be shown that $\tilde S_j = \kappa_j h^j f\, E\{H(\epsilon)^j\} + o_p(h^j)$. The latter relation, and (A.6), imply that
$$\tilde S_2 \tilde T_{01} - \tilde S_1 \tilde T_{11} = g\,(\tilde S_2 \tilde S_0 - \tilde S_1^2) + \tfrac12 g''\,[\kappa_2 h^2 f\, E\{H(\epsilon)^2\}]^2 + o_p(h^4), \quad (A.11)$$
$$\tilde S_2 \tilde S_0 - \tilde S_1^2 = \kappa_2 h^2 f^2 E\{H(\epsilon)^2\} + o_p(h^2). \quad (A.12)$$



S2S0−S12= κ2h2f2E{H()2} + op(h2) . (A.12) The only other sequence A1, . . . , An in which we are interested is Ai ≡ i,

and there we may deduce from (A.10) the bound

 n i=1 ˆ h−1i (Xi−x)jiK{(Xi−x)/ˆhi}− n  i=1 h−1i (Xi− x)jiK{(Xi−x)/hi}−Vj = Op(n(3−2η)/5hjλ2(1+η)n ) , (A.13) where Vj ≡ h−1i ∆(Xi)(Xi−x)jiH2(i)M{(Xi−x)/hi}. By Taylor-expanding

the ratio formula for a local-linear estimator (see e.g., the expression for ˆa given by Fan (1993, p.197) it may be proved that ∆(u) = h2

0ψn(u) + (nh0)−1f (u)−1 

i i

×K{(Xi−u)/h0}+Op(h40), uniformly in values u in a neighbourhood of x, where

ψn denotes a deterministic function that satisfies ψn(u) = 12g(u) κ2 + o(ξn) uniformly in a neighbourhood of x, and ξn= ξ(h0), ξ(·) being as defined at (2.9).

Therefore, Vj = Vj1+ Vj2+ op(n3/5hj), where Vj1= h−1h20 n  i=1 (Xi− x)jψn(Xi) iH2(i) M{(Xi− x)/hi} , Vj2= (nhh0)−1f (x)−1 n  i1=1 n  i2=1 i1i2H2(i2)(Xi2−x)jK{ Xi1−Xi2 h0 }M{ Xi2−x hi2 } .


Noting that $\int u^j M(u)\, du = 0$ for $j = 0, 1$, we may prove that for $j = 0, 1$ and $k = 1, 2$, $E(V_{j1}) = o(n^{3/5} h^j)$, $E(V_{j2}) = o(n^{3/5} h^j)$ and $V_{jk} - E(V_{jk}) = o_p(n^{3/5} h^j)$. When treating $V_{j2}$ we consider the diagonal and off-diagonal terms separately. The absolute value of the sum of the diagonal terms is adequately bounded by the sum of the absolute values of the summands. Since, conditional on the $X_i$'s, the $\epsilon_i$'s are independent, the variance of the sum of the off-diagonal terms in $V_{j2}$ is readily computed under the assumption that $E(\epsilon^2) < \infty$; in particular, we do not need finite fourth moments. However, to prove that $\mathrm{var}(V_{j2}) = o(n^{6/5} h^{2j})$ we do require the assumption that $n^{1/5} h_0 \to \infty$.

Therefore, $V_j = o_p(n^{3/5} h^j)$ for $j = 0, 1$, and so by (A.13),
$$\tilde T_{j2} = T_{j2} + n^{-1}(V_{j1} + V_{j2}) + O_p(n^{-2(1+\eta)/5} h^j \lambda_n^{2(1+\eta)}) = T_{j2} + o_p(n^{-2/5} h^j) \quad (A.14)$$
for $j = 0, 1$, where $T_{j2}$ is as in the proof of Theorem 2.1. We know from that proof that $|T_{02}| + |h^{-1} T_{12}| = O_p\{h^2 + (nh)^{-1/2}\}$. Hence, by (A.12) and (A.14),
$$\tilde S_2 \tilde T_{02} - \tilde S_1 \tilde T_{12} = f^{-1}(\tilde S_2 \tilde S_0 - \tilde S_1^2)\, T_{02} + o_p[h^2\{h^2 + (nh)^{-1/2}\} + n^{-2/5} h^2]. \quad (A.15)$$
Combining (A.11), (A.12) and (A.15) we deduce that

$$\hat g = \{(\tilde S_2 \tilde T_{01} - \tilde S_1 \tilde T_{11}) + (\tilde S_2 \tilde T_{02} - \tilde S_1 \tilde T_{12})\}\, (\tilde S_2 \tilde S_0 - \tilde S_1^2)^{-1} = g + \tfrac12 g'' \kappa_2 h^2 E\{H(\epsilon)^2\} + f^{-1} T_{02} + o_p\{h^2 + (nh)^{-1/2}\}. \quad (A.16)$$
We showed during the proof of Theorem 2.1 that $T_{02} = \tfrac12 f' \kappa_2 h^2 E\{\epsilon H(\epsilon)^2\} + [(nh)^{-1} f \kappa\, E\{\epsilon^2 H(\epsilon)^{-1}\}]^{1/2} N_n + o_p(h^2)$, where $N_n$ converges to Normal $N(0, 1)$. Theorem 2.2 follows from this result and (A.16).

References

Abramson, I. S. (1982). On bandwidth variation in kernel estimates — a square root law. Ann. Statist. 9, 168-176.

Abramson, I. S. (1984). Adaptive density flattening — a metric distortion principle for combating bias in nearest neighbour methods. Ann. Statist. 12, 880-886.

Choi, E. and Hall, P. (1998). On bias reduction in local linear smoothing. Biometrika 85, 333-346.

Chung, K. L. (1974). A Course in Probability Theory. Academic Press, New York.

Donoho, D. L. and Liu, R. C. (1991). Geometrizing rates of convergence, III. Ann. Statist. 19, 668-701.

Donoho, D. L., Liu, R. C. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Statist. 18, 1416-1437.

Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196-216.

Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Ser. B 57, 371-394.

Hall, P. (1990). On the bias of variable bandwidth curve estimators. Biometrika 77, 529-536.


Hall, P. and Turlach, B. A. (1997). Interpolation methods for adapting to sparse design in nonparametric regression. (With discussion and rejoinder.) J. Amer. Statist. Assoc. 92, 466-472.

Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge, U.K.

Jones, M. C. (1990). Variable kernel density estimates and variable kernel density estimates. Austral. J. Statist. 32, 361-371.

Jones, M. C., Linton, O. and Nielsen, J. P. (1995). A simple bias reduction method for density estimation. Biometrika 82, 327-338.

Jones, M. C., McKay, I. J. and Hu, T.-C. (1994). Variable location and scale kernel density estimation. Ann. Inst. Statist. Math. 46, 521-535.

Ruppert, D. and Cline, B. H. (1994). Bias reduction in kernel density estimation by smoothed empirical transformations. Ann. Statist. 22, 185-210.

Ruppert, D. and Wand, M. P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346-1370.

Samiuddin, M. and el-Sayyad, G. M. (1990). On nonparametric kernel density estimates. Biometrika 77, 865-874.

Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice and Visualization. Wiley, New York.

Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman and Hall, London.

Department of Mathematics, National Taiwan University, Taipei 106, Taiwan. E-mail: cheng@math.ntu.edu.tw

Centre for Mathematics and Its Applications, Australian National University, Canberra, ACT 0200, Australia.

E-mail: halpstat@pretty.anu.edu.au

