OPTIMAL DESIGN FOR CURVE ESTIMATION BY LOCAL LINEAR SMOOTHING

Ming-Yen Cheng^{1,2}, Peter Hall^{1}, D.M. Titterington^{1,3}

21 January 1997

ABSTRACT. The integral of the mean squared error of an estimator of a regression function is used as a criterion for defining an optimal design measure in the context of local linear regression, when the bandwidth is chosen in a locally optimal manner. An algorithm is proposed that constructs a sequence of piecewise-uniform designs with the help of current estimates of the integral of mean squared error. These estimates do not require direct estimation of the second derivative of the regression function. Asymptotic properties of the algorithm are established, and numerical results illustrate the gains that can be made, relative to a uniform design, by using the optimal design or sub-optimal, piecewise-uniform designs. The behaviour of the algorithm in practice is also illustrated.

KEYWORDS. Bandwidth choice, local linear regression, mean squared error, nonlinear regression, optimal design, sequential design.

SHORT TITLE. Optimal design for local linear smoothing.

AMS SUBJECT CLASSIFICATION. Primary 62G07, Secondary 62G20.

1 Centre for Mathematics and its Applications, Australian National University, Canberra, ACT 0200, Australia.

2 Institute of Mathematical Statistics, National Chung Cheng University, Minghsiung, Chiayi, Taiwan ROC.

3 Department of Statistics, University of Glasgow, Glasgow G12 8QW, UK.


1. Introduction. The theory of optimal experimental design for linear models began with seminal work of Kiefer (1959), leading to research summarized in books by Fedorov (1972), Silvey (1980) and Pukelsheim (1993). In contrast to the linear case, in nonlinear problems the true values of parameters can strongly influence optimal designs; see, for instance, Ford et al. (1989) and Chaloner and Verdinelli (1995) for reviews of, respectively, non-Bayesian and Bayesian approaches. Recently, optimal design ideas have been applied to particular nonlinear models in the neural computation literature, where the concept of optimally-slanted sequential design is described as "active learning". See for example Fedorov (1972), MacKay (1992), Cohn (1994) and the statistical introduction by Cheng and Titterington (1994).

Cohn (1994) mentions the possibility of extending these ideas to environments such as locally weighted regression. It is this direction that the present paper takes, by considering the application of optimal design ideas to so-called "nonparametric" regression. We shall modify the usual criterion on which optimality is based, since in nonparametric regression it is inherent that the "model" represented by the fitted curve is incorrect. Indeed, optimality is achieved by trading off model inaccuracy, expressed through bias, against model variability, represented by purely stochastic error. As our optimality criterion we shall use the integral of mean squared error over a compact design space, although in principle, versions that reflect differential weighting across the design space could be employed instead.

Using this viewpoint of optimality, Section 2 will propose an empirical, asymptotically optimal sequential rule for selecting both the bandwidth and the design density when the estimator is based on local linear regression. For purposes of illustration we shall adopt as our goal the modification of the uniform design obtained by putting less weight in regions of low curvature. Thus, the task becomes one of determining how to estimate the curve accurately in places of high curvature, subject to a constant weight in the definition of mean squared error. Other options will be discussed in Section 2.3. Numerical properties of our procedure will be reported in Section 3, in terms of a simulation study. Section 4 will outline technical arguments behind the results in Section 2.

It is assumed that observations are generated by the model

$$Y = g(x) + \epsilon, \qquad (1.1)$$

where $g$ is the function to be estimated, $\epsilon$ has zero mean and variance $\sigma^2$ (and a distribution not depending on $x$), and, conditional on design points $x = X_i$, the ordinates $Y = Y_i$ are independent with a distribution determined by the model at (1.1). The algorithm for selecting design points is at the disposal of the experimenter, and should be chosen to optimize performance. We shall suppose that the design is restricted to the interval $I = [0, 1]$, although clearly other possibilities may be treated in a similar way.

2. An algorithm and its properties.

2.1. Algorithm for computing design and bandwidth. The sequential rule that we propose is based on updating in a geometric sequence of steps. This represents a compromise between the fully-sequential, Anscombe-type algorithm, which involves adjusting the algorithm for datum-by-datum increments in the sample size $n$, and is not really appropriate in nonparametric regression; and the double-sampling, Stein-type approach, which "guesses" the final order of magnitude of the desired sample size, and uses a single but sizeable subsample to refine the initial guess. A wide variety of approaches, involving for example polynomial increases in $n$ instead of fully sequential methods, can be effective, depending on the "cost" of each update. In the context to which we shall apply our techniques, cost would depend largely on computational complexity.

Given $r > 1$, let $n_k$ denote the integer part of $r^k$. Estimation of $g$ is conducted iteratively, with step $k$ employing information derived from the previous $n_{k-1}$ data pairs. Step $k$ may be conveniently broken into two parts: (a) determining a design density $\hat f_k$ from which to draw the design points for the next $N_k = n_k - n_{k-1}$ pairs; and (b) drawing these new data, adjoining them to the earlier $n_{k-1}$ pairs to produce a new set $\mathcal X_k = \{(X_i, Y_i),\ 1 \le i \le n_k\}$, and using $\mathcal X_k$ to construct estimators $\hat g_k$ of $g$ and $\hat\sigma_k^2$ of $\sigma^2$. We compute $\hat f_k$ as a histogram, define $\hat g_k$ using local linear smoothing, and construct $\hat\sigma_k$ using first-order differences.
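To make the batching scheme concrete, the following minimal Python sketch (the function name and the choice $r = 1.5$ are ours, purely for illustration) lists the sample sizes $n_k = \lfloor r^k \rfloor$ and the batch sizes $N_k$:

```python
import math

def batch_sizes(r, k_max):
    """Sample sizes n_k = floor(r**k) and batch sizes N_k = n_k - n_{k-1}
    for the geometric updating scheme."""
    out, n_prev = [], 0
    for k in range(1, k_max + 1):
        n_k = math.floor(r ** k)
        if n_k > n_prev:
            out.append((n_k, n_k - n_prev))
            n_prev = n_k
    return out

# Example with r = 1.5: n_k runs 1, 2, 3, 5, 7, 11, 17, 25, 38, 57, ...
print(batch_sizes(1.5, 10))
```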

Algorithm for completing part (a). (i) Definition of histogram. Given an integer $m_k \le n_k$, and positive constants $a_1, \dots, a_{m_k}$ satisfying $\sum a_i = m_k$, let $f = f(\cdot\,|a_1, \dots, a_{m_k})$ denote the density on $I$ that equals $a_i$ on $((i-1)/m_k,\, i/m_k]$ for $1 \le i \le m_k$. (ii) Estimation of mean squared error. Write $\hat g_{k-1}$ and $\hat\sigma_{k-1}^2$ for the estimators of $g$ and $\sigma^2$ computed from the set $\mathcal X_{k-1}$ of the first $n_{k-1}$ data pairs. Let $\nu_k \le n_k$ be an integer, let $K$ be the kernel that we shall employ to construct $\hat g_k$, and define $\kappa_1 = \int K^2$ and $h = h(\cdot\,|a_1, \dots, a_{m_k}, b) = b\, \nu_k^{-1/5} f^{-2}$, where $b > 0$. (Thus, in contradistinction to near-neighbour methods, which effectively take $h$ inversely proportional to $f$, we ask that $h$ be inversely proportional to the square of $f$.) Our estimator of mean integrated squared error is

$$\Delta(a_1, \dots, a_{m_k}, b) = \kappa_1 \hat\sigma_{k-1}^2 \nu_k^{-4/5} b^{-1} + \Delta_1(a_1, \dots, a_{m_k}, b),$$

where

$$\Delta_1(a_1, \dots, a_{m_k}, b) = \int_J \Big[ \int \big\{ \hat g_{k-1}(x - h(x)y) - \hat g_{k-1}(x) \big\}\, K(y)\, dy \Big]^2 dx,$$

$J = (C\nu_k^{-1/5},\, 1 - C\nu_k^{-1/5})$, and $C > 0$ is a constant such that the restriction $0 < h \le C\nu_k^{-1/5}$ is imposed by constraints on the choice of $(a_1, \dots, a_{m_k}, b)$. (iii) Definition of $\hat f_k$. Estimate $a_1, \dots, a_{m_k}, b$ as the values of those quantities that minimize $\Delta$, subject to a restriction which implies that $0 < h \le C\nu_k^{-1/5}$. (There are many examples of such restrictions. Simple ones will be considered in Section 2.2.) Note that minimizing $\Delta$ over $a_1, \dots, a_{m_k}$, for fixed $b$, is equivalent to minimizing $\Delta_1$ over $a_1, \dots, a_{m_k}$. Let $\hat f_k$ be the version of $f$ obtained on substituting the estimators for the values of $a_1, \dots, a_{m_k}, b$. Write $\hat b_k$ for the estimator of $b$.

A key feature of our approach is that the term $\Delta_1$, which measures the contribution due to bias, avoids troublesome direct estimation of the second derivative of the curve.
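The criterion $\Delta$ is straightforward to evaluate numerically. The sketch below is our own illustration, not the authors' code: `g_hat` stands for a vectorised pilot estimate of $g$, the Epanechnikov kernel, the grid sizes and the constant $C = 1$ defining $J$ are assumed choices made purely for concreteness.

```python
import numpy as np

def trap(y, x):
    """Trapezoidal rule along the last axis (version-proof helper)."""
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x) / 2.0, axis=-1)

def delta_criterion(a, b, nu, g_hat, sigma2_hat, n_grid=201, n_y=41):
    """Our sketch of Delta(a_1,...,a_m, b) from part (a) of the algorithm.

    a          : array of m histogram heights on (0,1], normalised so sum(a) = m
    b          : candidate bandwidth constant; h(x) = b * nu**(-1/5) * f(x)**(-2)
    nu         : the integer nu_k of step (ii)
    g_hat      : vectorised pilot estimate of g from the previous step
    sigma2_hat : current variance estimate
    """
    a = np.asarray(a, dtype=float)
    m = len(a)
    kappa1 = 0.6                            # int K^2 for the Epanechnikov kernel

    def f(x):                               # piecewise-uniform design density
        return a[np.minimum((x * m).astype(int), m - 1)]

    def K(y):                               # Epanechnikov kernel on (-1, 1)
        return 0.75 * np.maximum(1.0 - y**2, 0.0)

    C = 1.0                                 # stands in for the constant defining J
    lo, hi = C * nu**-0.2, 1.0 - C * nu**-0.2
    x = np.linspace(lo, hi, n_grid)
    y = np.linspace(-1.0, 1.0, n_y)
    h = b * nu**-0.2 * f(x)**-2

    # Inner integral of Delta_1: int { g_hat(x - h(x) y) - g_hat(x) } K(y) dy.
    shifted = g_hat(x[:, None] - h[:, None] * y[None, :])
    inner = trap((shifted - g_hat(x)[:, None]) * K(y), y)
    delta1 = trap(inner**2, x)              # outer integral over J

    # First (variance) term plus the bias term Delta_1:
    return kappa1 * sigma2_hat * nu**-0.8 / b + delta1
```

Minimizing this function over $(a_1, \dots, a_{m_k}, b)$, for example on a grid as in Section 3.2, yields $\hat f_k$ and $\hat b_k$.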


Algorithm for completing part (b). Conditional on $\mathcal X_{k-1}$, draw $N_k$ independent and identically distributed random pairs $\mathcal Y_k = \{(X_{ki}, Y_{ki}),\ 1 \le i \le N_k\}$, generated by the model, with the $X_{ki}$'s having density $\hat f_k$. Compute the local linear estimator $\hat g_k$, based on the data $\mathcal X_k = \mathcal X_{k-1} \cup \mathcal Y_k$ and using the locally adaptive bandwidth $h = \hat b_k n_k^{-1/5} \hat f_k^{-2}$ and kernel $K$. Define

$$\tilde\sigma_k^2 = \{2(N_k - 1)\}^{-1} \sum_{i=1}^{N_k - 1} \big( Y'_{k,i+1} - Y'_{ki} \big)^2,$$

where $\{(X'_{ki}, Y'_{ki}),\ 1 \le i \le N_k\}$ denotes an ordering of the pairs in $\mathcal Y_k$ such that $X'_{k1} < \dots < X'_{kN_k}$. Define $\hat\sigma_k^2$ by either $\hat\sigma_k^2 = \tilde\sigma_k^2$ or

$$\hat\sigma_k^2 = n_{k-1}(n_{k-1} + N_k)^{-1} \hat\sigma_{k-1}^2 + N_k(n_{k-1} + N_k)^{-1} \tilde\sigma_k^2.$$
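The difference-based variance estimator is a direct transcription of the display above, and the pooled update is the convex combination just given; function names are ours:

```python
import numpy as np

def diff_variance(x, y):
    """First-order difference estimator from part (b):
    sigma_tilde^2 = {2(N-1)}^{-1} sum_{i=1}^{N-1} (Y'_{i+1} - Y'_i)^2,
    with the pairs ordered so that the x's increase."""
    y_sorted = np.asarray(y)[np.argsort(x)]
    return np.sum(np.diff(y_sorted)**2) / (2.0 * (len(y_sorted) - 1))

def pooled_variance(s2_prev, n_prev, s2_new, n_new):
    """Convex-combination update of the variance estimate across steps."""
    return (n_prev * s2_prev + n_new * s2_new) / (n_prev + n_new)
```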

2.2. Motivation for the algorithm. We shall motivate the algorithm in terms of its ability to minimize integrated squared error. This is often appropriate when the curve is to be used for calibration or prediction, for example. In other cases, where applications will be more interpretative, perhaps through analysis of unusual features or turning points, a different criterion would be employed. The difference in the criterion may be as simple as employing a weight function when defining mean squared error, perhaps with the weight chosen adaptively so as to select features of interest. Alternatively, a measure of risk allied to a geometric description of distance, such as the Hausdorff metric, might be employed instead of squared-error loss. The broad approach to constructing the algorithm would be similar in such cases, with the aim still being to optimise an empirical measure of loss. But details will of course differ.

We shall describe the motivation in a heuristic fashion, but it may be rigorously justified under the following conditions: (a) the target function $g$ has two continuous derivatives on $I$, and $g''$ vanishes only at a finite number of points; (b) the error distribution has all moments finite, zero mean and variance $\sigma^2$; and (c) the symmetric, nonnegative kernel $K$ is Hölder continuous and supported on a bounded interval, say $(-c, c)$. With these assumptions, Section 2.3 will state a theorem addressing the performance of the algorithm.

Suppose $n$ independent observations are made of a pair $(X, Y)$ generated as $Y = g(X) + \epsilon$, in which the design variable $X$ is distributed with a continuous density $f$, and the distribution of the error $\epsilon$ has zero mean and variance $\sigma^2$. An estimator of $g$ based on local linear smoothing, using kernel $K$ and bandwidth $h$, has asymptotic mean squared error at $x \in I$ given by

$$H_n(x, h|f) = (nh)^{-1} \kappa_1 \sigma^2 f(x)^{-1} + \tfrac14 h^4 \kappa_2^2\, g''(x)^2, \qquad (2.1)$$

where $\kappa_1$ is as in Section 2.1 and $\kappa_2 = \int y^2 K(y)\, dy$. See for example Fan (1993) and Hastie and Loader (1993). For fixed $x$, which we now suppress, the quantity at (2.1) is minimized by taking $h = h_0 = (n\, \kappa_3 f\, g''^2)^{-1/5}$, where $\kappa_3 = \kappa_2^2/(\kappa_1 \sigma^2)$. Substituting back into (2.1) we deduce that, with an optimal local choice of bandwidth, mean squared error is given asymptotically by $n^{-4/5} \kappa_4\, (g''^2/f^4)^{1/5}$, where the constant $\kappa_4$ depends only on $K$ and $\sigma^2$. The minimum mean integrated squared error is obtained by integrating this quantity over $I$, producing a functional proportional to $A(f) = \int_I (g''^2/f^4)^{1/5}$.
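For readers wishing to experiment, a minimal local linear smoother in the standard weighted-least-squares form can be written as follows (our sketch; any resemblance to the authors' implementation is incidental):

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate of g at points x0 from data (x, y), with
    bandwidth h (a scalar, or an array matching x0 for a locally adaptive
    choice) and the Epanechnikov kernel."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    h = np.broadcast_to(np.asarray(h, dtype=float), x0.shape)
    out = np.empty_like(x0)
    for j in range(len(x0)):
        d = x - x0[j]
        u = d / h[j]
        w = 0.75 * np.maximum(1.0 - u**2, 0.0)        # kernel weights
        s0, s1, s2 = w.sum(), (w*d).sum(), (w*d*d).sum()
        t0, t1 = (w*y).sum(), (w*d*y).sum()
        # Intercept of the locally weighted least-squares line at x0[j]:
        out[j] = (s2*t0 - s1*t1) / (s2*s0 - s1*s1)
    return out
```

The denominator $s_2 s_0 - s_1^2$ is exactly the quantity $W_k$ that reappears in the proof outline of Section 4.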

The optimal design density $f$ is the function which minimizes $A(f)$ subject to $\int_I f = 1$ and $f \ge 0$. A simple calculus-of-variations argument shows that this is given by $f_0 = c_0 |g''|^{2/9}$, where the constant $c_0$ is chosen to ensure that $\int_I f_0 = 1$. Note particularly that for this choice the optimal bandwidth is inversely proportional to the square of $f$: $h_0 = c_1 f^{-2}$, where $c_1$ is a constant. This explains why, in the algorithm suggested in Section 2.1, we took the bandwidth for computing $\hat g_k$ to vary in proportion to $\hat f_k^{-2}$.
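A quick numerical check of the normalisation $f_0 = c_0 |g''|^{2/9}$, using $g(x) = \sin(2\pi x)$ as an assumed test curve (our choice, not one from the paper):

```python
import numpy as np

def trap(y, x):                                  # trapezoidal rule
    return np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0)

x = np.linspace(0.0, 1.0, 2001)
g2 = -(2*np.pi)**2 * np.sin(2*np.pi*x)           # g'' for g(x) = sin(2*pi*x)
w = np.abs(g2)**(2.0/9.0)                        # unnormalised |g''|^{2/9}
f0 = w / trap(w, x)                              # optimal design density
print(trap(f0, x))                               # ~1.0, as required
```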

In (2.1), let the bandwidth $h$ be $h_1 = b n^{-1/5} f^{-2}$, where $b$ is an arbitrary positive constant and $f$ is an arbitrary continuous density, bounded away from zero on $I$. On integrating $H_n(\cdot, h_1|f)$ over $I$, the contribution from the first term may be seen to equal

$$\int_I (nh_1)^{-1} \kappa_1 \sigma^2 f^{-1} = \kappa_1 \sigma^2 n^{-4/5} b^{-1}.$$

Note particularly that the effect of $f$ has disappeared. This explains the origin of the first term in the formula for $\Delta$, and why it depends only on $b$, not on $a_1, \dots, a_{m_k}$. The second term is an estimate of the integral of the second term of (2.1), again with $h_1$ substituted for $h$.

If the function $g$ is at all "interesting" then it will enjoy one or more points of inflection, where $g'' = 0$. There, the optimal design density determined by the argument two paragraphs above equals zero, and in such cases the approximate formula at (2.1) is not valid. We overcome this problem by introducing a threshold, $\eta$, below which the value of $\hat f_k$ is not allowed to fall. When discussing theoretical properties of our algorithm it is convenient to also impose a ceiling on $f$. For economy of notation we take this to be $\eta^{-1}$, although we could have used any large positive number. Given $\eta \in (0, 1)$, define

$$f_\eta = \begin{cases} c_\eta |g''|^{2/9} & \text{if } c_\eta |g''|^{2/9} \in (\eta, \eta^{-1}), \\ \eta & \text{if } c_\eta |g''|^{2/9} \le \eta, \\ \eta^{-1} & \text{if } c_\eta |g''|^{2/9} \ge \eta^{-1}, \end{cases} \qquad (2.2)$$

where the positive constant $c_\eta$ is uniquely defined by the requirement that $\int f_\eta = 1$, and where $f_\eta \to f_0$ and $c_\eta \to c_0$ as $\eta \to 0$. Let $b_\eta$ denote the value of the constant $b$ which minimizes

$$b^{-1} \kappa_1 \sigma^2 + \tfrac14 b^4 \kappa_2^2 \int_I g''^2 f_\eta^{-8},$$

and let $\eta_1 \in (0, 1)$ be so small that $b_\eta \in (\eta_1, \eta_1^{-1})$.

If in part (a) of the algorithm we restrict attention to $(a_1, \dots, a_{m_k}, b)$ such that

$$\eta \le a_i \le \eta^{-1} \ \text{ for } 1 \le i \le m_k, \quad \text{and} \quad \eta_1 \le b \le \eta_1^{-1}, \qquad (2.3)$$

where $\eta \in (0, 1)$ and $\eta_1$ is as defined in the previous paragraph, then the histogram $\hat f_k$ is an estimator of $f_\eta$, $\hat b_k$ is an estimator of $b_\eta$, and the constant $C$ in the definition of $J$ may be taken equal to $\eta^{-2} \eta_1^{-1}$. The consistency of $\hat f_k$ and $\hat b_k$ for $f_\eta$ and $b_\eta$, respectively, will be shown during the proof of Theorem 2.1.
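The following sketch constructs $f_\eta$ on a grid by bisecting for $c_\eta$, and computes $b_\eta$ from its closed-form minimiser $b_\eta = \{\kappa_1 \sigma^2 / (\kappa_2^2 \psi(f_\eta))\}^{1/5}$; all names, the grid representation, and the Epanechnikov default constants are our own assumptions:

```python
import numpy as np

def trap(y, x):                                  # trapezoidal rule
    return np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0)

def f_eta(x, g2, eta):
    """Thresholded design density of (2.2) on a grid: clip c|g''|^{2/9} to
    [eta, 1/eta], choosing c by bisection so the result integrates to 1."""
    w = np.abs(g2)**(2.0/9.0)
    lo, hi = 0.0, 1e6
    for _ in range(200):                         # bisection for c_eta
        c = 0.5 * (lo + hi)
        if trap(np.clip(c * w, eta, 1.0/eta), x) < 1.0:
            lo = c
        else:
            hi = c
    return np.clip(c * w, eta, 1.0/eta)

def b_eta(x, g2, dens, sigma2, kappa1=0.6, kappa2=0.2):
    """Minimiser of b^{-1} kappa1 sigma^2 + (1/4) b^4 kappa2^2 psi(f):
    b = {kappa1 sigma^2 / (kappa2^2 psi)}^{1/5}. Defaults are the
    Epanechnikov values kappa1 = 3/5 and kappa2 = 1/5."""
    psi = trap(g2**2 * dens**-8.0, x)
    return (kappa1 * sigma2 / (kappa2**2 * psi))**0.2
```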

In problems involving higher orders of estimation the optimal design density is broadly similar to that at (2.2). For example, if we are locally fitting a polynomial of degree $2\nu - 3$, where $\nu \ge 2$ is an integer, then the asymptotically optimal design density is proportional to $|g^{(\nu)}|^{2/(4\nu+1)}$. This generalizes the case $\nu = 2$ treated just above. (The case of fitting polynomials of even order is a little different; note for example the results of Ruppert and Wand (1994).)

2.3. Properties of the algorithm. Let (C) denote conditions (a)–(c) introduced in the second paragraph of Section 2.2, as well as condition (2.3) for sufficiently small $\eta_1$, and the assumption that for some $\delta \in (0, 1)$ we have $\nu_k = O(n_k^{1-\delta})$, $m_k = O(\nu_k^{1-\delta})$ and $m_k \to \infty$. Let $f_\eta$ have the definition given in Section 2.2, with $\eta$ as in (2.3).

The minimum mean squared error obtained by optimizing over all design densities subject to the constraint at (2.3) is $H_n(x, h|f_\eta)$. Our main theorem states that $\hat g_k$ achieves this optimum.

Theorem 2.1. Assume conditions (C), with $\eta_1$ in (2.3) chosen as suggested in Section 2.2. Then

$$\Big\{ \int_I (\hat g_k - g)^2 \Big\} \Big/ \Big\{ \int_I \inf_h H_n(\cdot, h|f_\eta) \Big\} \to 1 \qquad (2.4)$$

with probability 1 as $k \to \infty$.

Remark 2.1. Use of ridge methods. The local linear estimator $\hat g_k$ is defined without recourse to adjustments, such as ridging, which are designed to induce more stable numerical properties. For a discussion of stabilization methods in nonsequential contexts, see for example Seifert and Gasser (1995). There is no difficulty in developing an analogue of Theorem 2.1 for the case where $\hat g_k$ is defined by ridged or interpolated local linear smoothing.

Remark 2.2. Efficiency. The efficiency of employing an optimal design density may be expressed as the ratio of two mean integrated squared errors, the first using the optimal density and the second using another density, $f$, say. In view of Theorem 2.1, efficiency may also be expressed in purely empirical terms. Indeed, the ratio of $\int_I (\hat g_k - g)^2$ to $\int_I \inf_{f,h} E_f(\hat g - g)^2$, where $\hat g$ is a standard local linear estimator using kernel $K$ (with ridging employed to ensure that mean squared error is well-defined), and where the infimum is taken over all choices of $f$ and of a nonstochastic, locally adaptive bandwidth $h$, converges (as first $k \to \infty$ and then $\eta \to 0$) to 1. In this sense, the sequential estimator $\hat g_k$ is fully efficient.

Remark 2.3. Alternative regression estimators. It is of interest to consider alternative approaches to regression, not least because they can produce results of a different character from those described earlier. We shall treat two: the classical Nadaraya-Watson method (see e.g. Härdle (1990), Section 3.1, and Wand and Jones (1995), p. 119) and a "transformation approach" (Hart (1991)). By judicious choice of the design density, depending on the target function $g$, both these techniques permit the asymptotic bias of a second-order kernel estimator $\hat g$ to vanish, and hence the mean squared error to be an order of magnitude smaller than in the case of local linear smoothing. This result may seem to contradict the known minimax optimality of local linear smoothing (Fan 1993), but it should be noted that such optimality results pertain only to the case of a fixed design density that is bounded away from zero and infinity, and hence do not apply to sequentially chosen designs. While reduced mean squared error is an attractive feature, it is achieved only at the expense of significant practical difficulties in constructing the sequential version of the estimator. Therefore, we do not develop the methods here beyond outlining theoretical properties and discussing their implications.

For a Nadaraya-Watson kernel estimator the asymptotic mean squared error formula, the analogue of (2.1), may be written as

$$(nh)^{-1} \kappa_1 \sigma^2 f(x)^{-1} + \kappa h^4 \big\{ g''(x) + 2\, g'(x)\, f'(x)\, f(x)^{-1} \big\}^2, \qquad (2.5)$$

where the constant $\kappa$ depends only on the kernel. The second term here represents the squared-bias contribution to mean squared error, and may be rendered equal to zero (except at zeros of $g'$) by defining $f = f_0 = c_0 |g'|^{-1/2}$, where $c_0^{-1} = \int |g'|^{-1/2}$. Hart's transformation estimator has a mean squared error which enjoys a similar expansion, this time with the second term in (2.5) replaced by

$$\kappa h^4 \big\{ g''(x)\, f(x)^{-2} - g'(x)\, f'(x)\, f(x)^{-3} \big\}^2.$$

That quantity is rendered equal to zero by using the design density $f_0 = c_0 |g'|$, with $c_0^{-1} = \int |g'|$.
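Both cancellations are easy to verify symbolically. The check below (a sympy sketch of ours, assuming $g' > 0$ locally so that $|g'|$ may be replaced by $g'$) confirms that each bias factor vanishes identically:

```python
import sympy as sp

x = sp.symbols('x')
g = sp.Function('g')
gp = sp.diff(g(x), x)          # g'
g2 = sp.diff(g(x), x, 2)       # g''

# Nadaraya-Watson: with f proportional to g'^{-1/2} (assuming g' > 0 locally),
# the bias factor g'' + 2 g' f'/f vanishes identically.
f_nw = gp**sp.Rational(-1, 2)
print(sp.simplify(g2 + 2*gp*sp.diff(f_nw, x)/f_nw))           # -> 0

# Transformation estimator: with f proportional to g', the factor
# g'' f^{-2} - g' f' f^{-3} also vanishes identically.
f_tr = gp
print(sp.simplify(g2/f_tr**2 - gp*sp.diff(f_tr, x)/f_tr**3))  # -> 0
```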

For both the Nadaraya-Watson and transformation definitions of $\hat g$ we may estimate $|g'|$ either explicitly or implicitly, and employ the estimator in the obvious way to construct an estimator of $f_0$. In principle, if $g$ has four bounded derivatives then this technique can produce an order of squared bias equal to $h^8$, and hence an order of mean squared error equal to $n^{-8/9}$ (rather than the $n^{-4/5}$ typically associated with second-order methods), away from points where $g' = 0$. In practice, however, there are significant obstacles to achieving such performance. The first problem is that these methods demand an estimator of $g'$ whose high-order derivatives accurately estimate those of $g'$. While this is feasible for large samples, it is not really practicable with smaller data sets. Since a sequential procedure would usually start with relatively small samples, this is a drawback. The inherent numerical instability of estimators of ratios of functions, such as appear in formulae for the bias terms of Nadaraya-Watson or transformation estimators, makes it even more difficult to ensure good performance in small samples.

Secondly, attaining optimal performance at the level $n^{-8/9}$ demands a particularly complex bandwidth formula, depending on the fourth derivative of $g$. The theoretical difficulties are easily overcome, but an attractive numerical algorithm seems to be out of reach. In particular, there does not appear to exist a high-order analogue of the simple algorithm suggested in Section 2.1, which simultaneously produces an estimator of the optimal bandwidth and optimal design density. Therefore, selection of the appropriate bandwidth in a high-order sequential procedure is a particularly awkward problem. Thirdly, this high-order performance seems to be achievable only away from points where $g'$ vanishes; at those points the optimal design density is either infinite (in the case of the Nadaraya-Watson estimator) or zero (for the transformation estimator). Different bandwidth selection procedures seem to be necessary at those points, producing different convergence rates. This makes for a cumbersome approach to inference.

3. Numerical results.

3.1. Efficiencies of some suboptimal designs. Recall, from Section 2.2, that the mean integrated squared error associated with a design $f$ and an optimally chosen bandwidth is proportional to $A(f) = \int_I (g''^2/f^4)^{1/5}$, and that the optimal design is given by $f_0 = c_0 |g''|^{2/9}$. Thus, $A(f_0) = (\int_I |g''|^{2/9})^{9/5}$ is the minimum achievable value of $A(f)$, and it is natural to define the efficiency of a design $f$, relative to the optimal design, by $A(f_0)/A(f)$. We evaluated this in the context of fifteen true curves given by the class of Gaussian mixtures proposed by Marron and Wand (1992). These vary from the standard Gaussian density function through a range of Gaussian mixtures of varying complexity. The design space is the interval $(-3, 3)$.

We describe our results here only in words. A more detailed account, including tabular results, is available in a technical report (Cheng, Hall and Titterington 1995).

We computed Monte Carlo approximations to efficiencies of designs that are themselves mixtures of the optimal design, $f_0$, and the uniform design on $(-3, 3)$, with mixing weights $\theta$ and $1 - \theta$, respectively. The weight $\theta$ ranged over $0.0, 0.1, \dots, 1.0$. For many of the fifteen curves there was not much to be gained from using the optimal design rather than the uniform, but in some cases the gains were considerable. Particular instances were curves 3–5 and 10–14, in the nomenclature of Marron and Wand (1992).

Bearing the above robustness in mind, as well as the fact that our algorithm for sequential design involves a sequence of piecewise-uniform designs, we computed the efficiencies of such designs. We addressed cases corresponding to $m = 2^r$, for $r = 0, 1, \dots, 4$, and with the $a_i$'s chosen at their optimal values. If such a design is denoted by $f_{m0}$ then it is straightforward to show that

$$A(f_{m0}) = (L/m)^{4/5} \Big( \sum_i A_i^{5/9} \Big)^{9/5},$$

in which $L$ denotes the length of the design space (an interval), and $A_i = \int |g''|^{2/5}$, where the range of integration is the $i$th cell of the design, of length $L/m$, for $1 \le i \le m$. We obtained good gains in efficiency by $m = 8$. For example, in the case of curves 3–5 (in the order of Marron and Wand (1992)), efficiencies were respectively 69%, 74% and 52% for $m = 1$, but had increased to 93%, 90% and 82% by $m = 8$.

Since zeros of the optimal design density correspond to points of inflexion of $g$ (see (2.2)), the departure of the optimal design from uniformity will tend to be in proportion to the "wiggliness" of $g$. The numerical analyses described above confirm this informal theoretical conclusion. For example, functions 10–15 of Marron and Wand (1992) exhibit the greatest number of points of inflexion, and include a class of functions for which we observed substantial gains in performance when attempting to optimise the design.

3.2. An illustration of the algorithm. In the small study reported here we concentrated on the mutual similarity of the formula for the asymptotic integrated mean squared error and the corresponding estimator thereof, defined by $\Delta$, and on how closely the design created after one iteration of the algorithm approximates the optimal design. Although our theory is asymptotic in $k$, in practice only a small number of iterations of the algorithm would be carried out, bearing in mind the relationships between successive sample sizes: the design procedure is best described as batch-sequential, with large batch sizes, and anything approaching genuinely sequential design, in which the design is updated one observation at a time, does not make sense in the context of nonparametric regression. To help make the conclusions clear, the curve chosen as exemplar was the continuous, but not continuously differentiable, piecewise quadratic curve given by

$$g(x) = \begin{cases} x(1 - 2x)/4 & \text{if } 0 \le x \le 1/2, \\ 2(1 - 2x)(1 - x) & \text{if } 1/2 \le x \le 1. \end{cases}$$

For this curve, therefore, the optimal design is piecewise uniform, corresponding to $m = 2$, with $a_1 = 2/(1 + 64^{1/9}) = 0.773$ and $a_2 = 2 - a_1$. The Epanechnikov kernel was used in the local linear fitting. Thus $\kappa_1 = 3/5$ and $\kappa_2^2 = 1/25$, so that the optimal value of $b$ was $(2^9 \cdot 15 \cdot \sigma^2)^{1/5}/(1 + 64^{1/9})^{9/5}$. Therefore, the experiment involved choosing an initial sample of size $n_0$, taking $m_1 = 2$, and investigating the fruits of the algorithm in terms of the resulting estimates of $a_1$ and $a_2$. A variety of values were chosen for $n_0$ and $\nu_1$, and care was taken to choose the range of integration $J$ suitably when calculating $\Delta_1$. In fact the range of integration included a small gap near the discontinuity in $g'(x)$ at $x = 1/2$, where the calculations were unstable.
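The quoted constants are easy to reproduce (a direct numerical transcription of the formulas above):

```python
# Constants of Section 3.2: |g''| = 1 on (0, 1/2) and 8 on (1/2, 1), so the
# optimal two-cell design has heights in the ratio 1 : 8^{2/9} = 1 : 64^{1/9}.
a1 = 2.0 / (1.0 + 64.0**(1.0/9.0))
a2 = 2.0 - a1
print(round(a1, 3), round(a2, 3))                 # 0.773 1.227

for sigma in (0.01, 0.05):
    b_opt = (2.0**9 * 15.0 * sigma**2)**0.2 / (1.0 + 64.0**(1.0/9.0))**1.8
    print(sigma, round(b_opt, 3))                 # 0.171 and 0.326
```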

The bandwidth used for calculating the local linear curve estimates $\hat g_0$ was the asymptotically optimal value, constant in $x$, based on the estimate of $\sigma$ calculated from first-order differences, and on an assumption of constant curvature for $g$ of magnitude 4. (Note that the true curvature has magnitude 1 on the first half of the interval, and 8 thereafter.)

We report here on 10 replicates of each of the cases $\sigma = 0.01, 0.05$ and $n_0 = 400, 1000$. Throughout, $\nu_1 = 200$. For each replicate, $\Delta$ was evaluated on a grid of values of $(b, a_1)$, and the optimal combination of values was found, correct to two decimal places in each of the two variables. This would seem to be adequate from a practical point of view. The optimal values for $b$ were 0.171 and 0.326, for $\sigma = 0.01$ and 0.05, respectively, and the algorithm achieved these values closely, especially for $\sigma = 0.01$. In that case, the optimal value of $a_1$ (0.773) was slightly over-estimated, because of the smoothing involved in calculating the bias term $\Delta_1$, but the errors were not great. The $\Delta$ surface always had a single well-defined minimum in $(b, a_1)$, at least in the region investigated in the experiment. The values of $\Delta$ were typically within a few percent of those from the asymptotic formula. Occasionally the difference went into double figures, in percentage terms, but was fairly constant as $(b, a_1)$ varied, so that, as reported above, the positions of the minima on the two surfaces were very similar.

4. Outline proof of Theorem 2.1. The proof is given only in barest outline; details are given in the technical report by Cheng, Hall and Titterington (1995).


The first step is to establish a relatively crude upper bound to $|\hat g_k - g|$:

$$\sup_I |\hat g_k - g| = O\big( n_k^{-(2/5)+\eta} \big) \qquad (4.1)$$

with probability 1, as $k \to \infty$, for all $\eta > 0$. As a prelude to establishing (4.1), define $\mathcal N(l) = \{1, \dots, N_l\}$, let $\{X_{li},\ i \in \mathcal N(l)\}$ denote the set of design points constructed in step $l$ of the algorithm, write $Y_{li} = g(X_{li}) + \epsilon_{li}$ for the associated measurements of $Y$, and let $h = \hat b_k n_k^{-1/5} \hat f_k^{-2}$. Let $s_{kj}(x)$ denote the sum of $(x - X_{li})^j K\{(x - X_{li})/h\}$ over $i \in \mathcal N(l)$ and $1 \le l \le k$, put $w_{li}(x) = \{ s_{k2}(x) - s_{k1}(x)(x - X_{li}) \}\, K\{(x - X_{li})/h\}$ for $i \in \mathcal N(l)$, and define $W_k$ to equal the sum of $w_{li}$ over $i \in \mathcal N(l)$ and $1 \le l \le k$.

In this notation,

$$\hat g_k = g + (A_k + B_k) W_k^{-1}, \qquad (4.2)$$

where

$$A_k = \sum_{l=1}^k A(l), \quad A(l) = \sum_{i \in \mathcal N(l)} w_{li}\, \epsilon_{li}; \qquad B_k = \sum_{l=1}^k B(l), \quad B(l) = \sum_{i \in \mathcal N(l)} w_{li} \{ g(X_{li}) - g \}.$$

Defining $K_1 = K$, $K_2(y) = y K(y)$ and

$$V_{lj}(x) = \sum_{i \in \mathcal N(l)} K_j\{(x - X_{li})/h\}\, \epsilon_{li},$$

we have

$$|A_k| \le \sum_{l=1}^k |A(l)| \le \sum_{l=1}^k \sum_{j=1}^2 |s_{k,3-j}|\, h^{j-1} |V_{lj}| \le (s_{k2} + |s_{k1}|\, h) \sum_{l=1}^k \big( |V_{l1}| + |V_{l2}| \big).$$

Also, $|s_{kj}| \le (ch)^j U_k$, where $c > 0$ is chosen so that $(-c, c)$ contains the support of $K$, and

$$U_k = \sum_{l=1}^k \sum_{i \in \mathcal N(l)} K\{(x - X_{li})/h\}.$$


Therefore,

$$|A_k| \le C_1 h^2 U_k \sum_{l=1}^k \big( |V_{l1}| + |V_{l2}| \big),$$

where, here and below, $C_1$ and $C_2$ are positive constants. Similarly, an upper bound may be established for $B_k$. In consequence it may be shown from (4.2) that

$$|\hat g_k - g| \le C_2 W_k^{-1} \Big\{ h^2 U_k \sum_{l=1}^k \big( |V_{l1}| + |V_{l2}| \big) + h^4 U_k^2 \Big\}. \qquad (4.3)$$

Computations based on large-deviation bounds for the variables $V_{l1}$ and $V_{l2}$ show that $\sup_I |V_{lj}| = O_p\big( n_k^{(2/5)+\eta} \big)$ for all $\eta > 0$. Hence, by (4.3),

$$\sup_I |\hat g_k - g| = O\Big\{ W_{k0}^{-1} \big( h_0^2\, n_k^{(2/5)+\eta}\, U_{k0} + h_0^4 U_{k0}^2 \big) \Big\}, \qquad (4.4)$$

with probability 1, where $U_{k0} = \sup_I U_k$, $W_{k0} = \inf_I W_k$ and $h_0 = \sup_I h$. It may be shown that $U_{k0} = O(n_k^{4/5})$ and $W_{k0}^{-1} = O(n_k^{-6/5})$; and, by definition of $h$, that $h_0 = O(n_k^{-1/5})$, all results holding with probability 1. The desired result (4.1) follows from these bounds and (4.4).

The next step is to show that

$$\sup_I |\hat f_k - f_\eta| \to 0, \qquad \hat b_k \to b_\eta, \qquad (4.5)$$

with probability 1. First, using (4.1) we may prove that

$$\Delta_1(a_1, \dots, a_{m_k}, b) = \tfrac14 \nu_k^{-4/5} b^4 \kappa_2^2\, \psi(f) + o\big( \nu_k^{-4/5} \big), \qquad \Delta(a_1, \dots, a_{m_k}, b) = \nu_k^{-4/5} \phi(f, b) + o\big( \nu_k^{-4/5} \big), \qquad (4.6)$$

uniformly in $a_1, \dots, a_{m_k}, b$ satisfying (2.3), where $\psi(f) = \int_I (g''/f^4)^2$ and $\phi(f, b) = \kappa_1 \sigma^2 b^{-1} + \tfrac14 b^4 \kappa_2^2\, \psi(f)$. Results (4.5) may be derived from (4.6).

Using (4.5) it may be shown that

$$s_{kj} = n_k^{(4-j)/5}\, t^{\,j+1} \rho_j f_\eta + o\big( n_k^{(4-j)/5} \big) \qquad (4.7)$$

uniformly on $I$, with probability 1, where $\rho_j = \int y^j K(y)\, dy$ and $t$ is defined by $h = n_k^{-1/5} t$. Now, $\rho_0 = 1$, $\rho_1 = 0$, $\rho_2 = \kappa_2$ and $W_k(x) = s_{k2} s_{k0} - s_{k1}^2$, and so by (4.7),

$$\sup_I \Big| W_k \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-1} - 1 \Big| \to 0, \qquad (4.8)$$


where $h_\eta = n_k^{-1/5} f_\eta^{-2} b_\eta$. Similarly it may be shown that

$$\sup_I \Big| B_k \big( \tfrac12 n_k^2 h_\eta^6 \kappa_2^2 f_\eta^2 \big)^{-1} - g'' \Big| \to 0. \qquad (4.9)$$

Combining (4.8) and (4.9) we conclude that

$$\sup_I \Big| B_k W_k^{-1} - \tfrac12 h_\eta^2 \kappa_2\, g'' \Big| = o\big( n_k^{-2/5} \big). \qquad (4.10)$$

Combining (4.2), (4.8) and (4.10) we deduce that

$$\int_I (\hat g_k - g)^2 = (1 + \xi_1) \int_I \Big\{ A_k \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-1} + (1 + \delta)\, \tfrac12 h_\eta^2 \kappa_2\, g'' \Big\}^2 = (1 + \xi_2) \int_I \Big\{ A_k^2 \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-2} + \tfrac14 h_\eta^4 \kappa_2^2\, g''^2 \Big\} + 2 \int_I A_k \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-1}\, \tfrac12 h_\eta^2 \kappa_2\, g'', \qquad (4.11)$$

where the function $\delta$ satisfies $\sup_I |\delta| \to 0$ with probability 1, and the random variables $\xi_j$ converge to 0 with probability 1. After some simplification of the right-hand side of (4.11) we obtain (2.4).

Acknowledgement. The authors are grateful to David Cohn for access to unpublished work which stimulated the work in this paper. The helpful comments of a referee and editor led to this shortened version of the original manuscript.

REFERENCES

CHALONER, K. AND VERDINELLI, I. (1995). Bayesian experimental design: a review. Preprint.

CHENG, B. AND TITTERINGTON, D.M. (1994). Neural networks: a review from a statistical perspective (with discussion). Statistical Science 9, 2–54.

CHENG, M.-Y., HALL, P. AND TITTERINGTON, D.M. (1995). Optimal design for curve estimation by local linear smoothing. Research Report No. SRR046-95, Centre for Mathematics and its Applications, Australian National University.

CHU, C.-K. AND MARRON, J.S. (1991). Choosing a kernel regression estimator (with discussion). Statistical Science 6, 404–436.

COHN, D.A. (1994). Neural network exploration using optimal experimental design. MIT AI Lab Memo No. 1491.


FAN, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196–216.

FEDOROV, V.V. (1972). Theory of Optimal Experiments. Academic Press, New York.

FORD, I., KITSOS, C.P. AND TITTERINGTON, D.M. (1989). Recent advances in nonlinear experimental designs. Technometrics 31, 49–60.

HÄRDLE, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge, U.K.

HART, J.D. (1991). Contribution to discussion of Chu and Marron (1991). Statistical Science 6, 425–427.

HASTIE, T. AND LOADER, C. (1993). Local regression: automatic kernel carpentry. Statistical Science 8, 120–143.

KIEFER, J. (1959). Optimal experimental designs (with discussion). J. R. Statist. Soc. B 21, 272–319.

MACKAY, D.J.C. (1992). Information-based objective functions for active data selection. Neural Computation 4, 590–604.

MARRON, J.S. AND WAND, M.P. (1992). Exact mean integrated squared error. Ann. Statist. 20, 712–736.

PUKELSHEIM, F. (1993). Optimal Design of Experiments. Wiley, New York.

RUPPERT, D. AND WAND, M.P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346–1370.

SEIFERT, B. AND GASSER, T. (1995). Finite-sample variance of local polynomials: analysis and solutions. J. Amer. Statist. Assoc., to appear.

SILVEY, S.D. (1980). Optimal Design. Chapman and Hall, London.

WAND, M.P. AND JONES, M.C. (1995). Kernel Smoothing. Chapman & Hall, London.
