OPTIMAL DESIGN FOR CURVE ESTIMATION BY LOCAL LINEAR SMOOTHING

Ming-Yen Cheng^{1,2}, Peter Hall^{1}, D.M. Titterington^{1,3}

21 January 1997

ABSTRACT. The integral of the mean squared error of an estimator of a regression function is used as a criterion for defining an optimal design measure in the context of local linear regression, when the bandwidth is chosen in a locally optimal manner. An algorithm is proposed that constructs a sequence of piecewise-uniform designs with the help of current estimates of the integral of mean squared error. These estimates do not require direct estimation of the second derivative of the regression function. Asymptotic properties of the algorithm are established, and numerical results illustrate the gains that can be made, relative to a uniform design, by using the optimal design or sub-optimal, piecewise-uniform designs. The behaviour of the algorithm in practice is also illustrated.

KEYWORDS. Bandwidth choice, local linear regression, mean squared error, nonlinear regression, optimal design, sequential design.

SHORT TITLE. Optimal design for local linear smoothing.

AMS SUBJECT CLASSIFICATION. Primary 62G07, Secondary 62G20.

1 Centre for Mathematics and its Applications, Australian National University, Canberra, ACT 0200, Australia.

2 Institute of Mathematical Statistics, National Chung Cheng University, Minghsiung, Chiayi, Taiwan ROC.

3 Department of Statistics, University of Glasgow, Glasgow G12 8QW, UK.


1. Introduction. The theory of optimal experimental design for linear models began with seminal work of Kiefer (1959), leading to research summarized in books by Fedorov (1972), Silvey (1980) and Pukelsheim (1993). In contrast to the linear case, in nonlinear problems the true values of parameters can strongly influence optimal designs; see, for instance, Ford et al. (1989) and Chaloner and Verdinelli (1995) for reviews of, respectively, non-Bayesian and Bayesian approaches. Recently, optimal design ideas have been applied to particular nonlinear models in the neural computation literature, where the concept of optimally-slanted sequential design is described as "active learning". See for example Fedorov (1972), MacKay (1992), Cohn (1994) and the statistical introduction by Cheng and Titterington (1994).

Cohn (1994) mentions the possibility of extending these ideas to environments such as locally weighted regression. It is this direction that the present paper takes, by considering the application of optimal design ideas to so-called "nonparametric" regression. We shall modify the usual criterion on which optimality is based, since in nonparametric regression it is inherent that the "model" represented by the fitted curve is incorrect. Indeed, optimality is achieved by trading off model inaccuracy, expressed through bias, against model variability, represented by purely stochastic error. As our optimality criterion we shall use the integral of mean squared error over a compact design space, although in principle, versions that reflect differential weighting across the design space could be employed instead.

Using this viewpoint of optimality, Section 2 will propose an empirical, asymptotically optimal sequential rule for selecting both the bandwidth and the design density when the estimator is based on local linear regression. For purposes of illustration we shall adopt as our goal the modification of the uniform design obtained by putting less weight in regions of low curvature. Thus, the task becomes one of determining how to estimate the curve accurately in places of high curvature, subject to a constant weight in the definition of mean squared error. Other options will be discussed in Section 2.3. Numerical properties of our procedure will be reported in Section 3, in terms of a simulation study. Section 4 will outline technical arguments behind the results in Section 2.

It is assumed that observations are generated by the model

$$Y = g(x) + \epsilon, \qquad (1.1)$$

where $g$ is the function to be estimated, $\epsilon$ has zero mean and variance $\sigma^2$ (and a distribution not depending on $x$), and, conditional on design points $x = X_i$, the ordinates $Y = Y_i$ are independent with a distribution determined by the model at (1.1). The algorithm for selecting design points is at the disposal of the experimenter, and should be chosen to optimize performance. We shall suppose that the design is restricted to the interval $I = [0, 1]$, although clearly other possibilities may be treated in a similar way.

2. An algorithm and its properties.

2.1. Algorithm for computing design and bandwidth. The sequential rule that we propose is based on updating in a geometric sequence of steps. This represents a compromise between the fully-sequential, Anscombe-type algorithm, which involves adjusting the algorithm for datum-by-datum increments in the sample size $n$, and is not really appropriate in nonparametric regression; and the double-sampling, Stein-type approach, which "guesses" the final order of magnitude of the desired sample size, and uses a single but sizeable subsample to refine the initial guess. A wide variety of approaches, involving for example polynomial increases in $n$ instead of fully sequential methods, can be effective, depending on the "cost" of each update. In the context to which we shall apply our techniques, cost would depend largely on computational complexity.

Given $r > 1$, let $n_k$ denote the integer part of $r^k$. Estimation of $g$ is conducted iteratively, with step $k$ employing information derived from the previous $n_{k-1}$ data pairs. Step $k$ may be conveniently broken into two parts: (a) determining a design density $\hat f_k$ from which to draw the design points for the next $N_k = n_k - n_{k-1}$ pairs; and (b) drawing these new data, adjoining them to the earlier $n_{k-1}$ pairs to produce a new set $\mathcal X_k = \{(X_i, Y_i),\ 1 \le i \le n_k\}$, and using $\mathcal X_k$ to construct estimators $\hat g_k$ of $g$ and $\hat\sigma_k^2$ of $\sigma^2$. We compute $\hat f_k$ as a histogram, define $\hat g_k$ using local linear smoothing, and construct $\hat\sigma_k$ using first-order differences.
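To make the batching scheme concrete, the following minimal Python sketch (the function name and the choice $r = 1.5$ are ours, purely for illustration) lists the sample sizes $n_k = \lfloor r^k \rfloor$ and the batch sizes $N_k$:

```python
import math

def batch_sizes(r, k_max):
    """Sample sizes n_k = floor(r**k) and batch sizes N_k = n_k - n_{k-1}
    for the geometric updating scheme."""
    out, n_prev = [], 0
    for k in range(1, k_max + 1):
        n_k = math.floor(r ** k)
        if n_k > n_prev:
            out.append((n_k, n_k - n_prev))
            n_prev = n_k
    return out

# Example with r = 1.5: n_k runs 1, 2, 3, 5, 7, 11, 17, 25, 38, 57, ...
print(batch_sizes(1.5, 10))
```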

Algorithm for completing part (a). (i) Definition of histogram. Given an integer $m_k \le n_k$, and positive constants $a_1, \dots, a_{m_k}$ satisfying $\sum a_i = m_k$, let $f = f(\cdot\,|a_1, \dots, a_{m_k})$ denote the density on $I$ that equals $a_i$ on $((i-1)/m_k,\, i/m_k]$ for $1 \le i \le m_k$. (ii) Estimation of mean squared error. Write $\hat g_{k-1}$ and $\hat\sigma_{k-1}^2$ for the estimators of $g$ and $\sigma^2$ computed from the set $\mathcal X_{k-1}$ of the first $n_{k-1}$ data pairs. Let $\nu_k \le n_k$ be an integer, let $K$ be the kernel that we shall employ to construct $\hat g_k$, and define $\kappa_1 = \int K^2$ and $h = h(\cdot\,|a_1, \dots, a_{m_k}, b) = b\, \nu_k^{-1/5} f^{-2}$, where $b > 0$. (Thus, in contradistinction to near-neighbour methods, which effectively take $h$ inversely proportional to $f$, we ask that $h$ be inversely proportional to the square of $f$.) Our estimator of mean integrated squared error is

$$\Delta(a_1, \dots, a_{m_k}, b) = \kappa_1 \hat\sigma_{k-1}^2 \nu_k^{-4/5} b^{-1} + \Delta_1(a_1, \dots, a_{m_k}, b),$$

where

$$\Delta_1(a_1, \dots, a_{m_k}, b) = \int_J \Big[ \int \big\{ \hat g_{k-1}(x - h(x)y) - \hat g_{k-1}(x) \big\}\, K(y)\, dy \Big]^2 dx,$$

$J = (C\nu_k^{-1/5},\, 1 - C\nu_k^{-1/5})$, and $C > 0$ is a constant such that the restriction $0 < h \le C\nu_k^{-1/5}$ is imposed by constraints on the choice of $(a_1, \dots, a_{m_k}, b)$. (iii) Definition of $\hat f_k$. Estimate $a_1, \dots, a_{m_k}, b$ as the values of those quantities that minimize $\Delta$, subject to a restriction which implies that $0 < h \le C\nu_k^{-1/5}$. (There are many examples of such restrictions. Simple ones will be considered in Section 2.2.) Note that minimizing $\Delta$ over $a_1, \dots, a_{m_k}$, for fixed $b$, is equivalent to minimizing $\Delta_1$ over $a_1, \dots, a_{m_k}$. Let $\hat f_k$ be the version of $f$ obtained on substituting the estimators for the values of $a_1, \dots, a_{m_k}, b$. Write $\hat b_k$ for the estimator of $b$.

A key feature of our approach is that the term $\Delta_1$, which measures the contribution due to bias, avoids troublesome direct estimation of the second derivative of the curve.
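The criterion $\Delta$ is straightforward to evaluate numerically. The sketch below is our own illustration, not the authors' code: `g_hat` stands for a vectorised pilot estimate of $g$, the Epanechnikov kernel, the grid sizes and the constant $C = 1$ defining $J$ are assumed choices made purely for concreteness.

```python
import numpy as np

def trap(y, x):
    """Trapezoidal rule along the last axis (version-proof helper)."""
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x) / 2.0, axis=-1)

def delta_criterion(a, b, nu, g_hat, sigma2_hat, n_grid=201, n_y=41):
    """Our sketch of Delta(a_1,...,a_m, b) from part (a) of the algorithm.

    a          : array of m histogram heights on (0,1], normalised so sum(a) = m
    b          : candidate bandwidth constant; h(x) = b * nu**(-1/5) * f(x)**(-2)
    nu         : the integer nu_k of step (ii)
    g_hat      : vectorised pilot estimate of g from the previous step
    sigma2_hat : current variance estimate
    """
    a = np.asarray(a, dtype=float)
    m = len(a)
    kappa1 = 0.6                            # int K^2 for the Epanechnikov kernel

    def f(x):                               # piecewise-uniform design density
        return a[np.minimum((x * m).astype(int), m - 1)]

    def K(y):                               # Epanechnikov kernel on (-1, 1)
        return 0.75 * np.maximum(1.0 - y**2, 0.0)

    C = 1.0                                 # stands in for the constant defining J
    lo, hi = C * nu**-0.2, 1.0 - C * nu**-0.2
    x = np.linspace(lo, hi, n_grid)
    y = np.linspace(-1.0, 1.0, n_y)
    h = b * nu**-0.2 * f(x)**-2

    # Inner integral of Delta_1: int { g_hat(x - h(x) y) - g_hat(x) } K(y) dy.
    shifted = g_hat(x[:, None] - h[:, None] * y[None, :])
    inner = trap((shifted - g_hat(x)[:, None]) * K(y), y)
    delta1 = trap(inner**2, x)              # outer integral over J

    # First (variance) term plus the bias term Delta_1:
    return kappa1 * sigma2_hat * nu**-0.8 / b + delta1
```

Minimizing this function over $(a_1, \dots, a_{m_k}, b)$, for example on a grid as in Section 3.2, yields $\hat f_k$ and $\hat b_k$.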


Algorithm for completing part (b). Conditional on $\mathcal X_{k-1}$, draw $N_k$ independent and identically distributed random pairs $\mathcal Y_k = \{(X_{ki}, Y_{ki}),\ 1 \le i \le N_k\}$, generated by the model, with the $X_{ki}$'s having density $\hat f_k$. Compute the local linear estimator $\hat g_k$, based on the data $\mathcal X_k = \mathcal X_{k-1} \cup \mathcal Y_k$ and using the locally adaptive bandwidth $h = \hat b_k n_k^{-1/5} \hat f_k^{-2}$ and kernel $K$. Define

$$\tilde\sigma_k^2 = \{2(N_k - 1)\}^{-1} \sum_{i=1}^{N_k - 1} \big( Y'_{k,i+1} - Y'_{ki} \big)^2,$$

where $\{(X'_{ki}, Y'_{ki}),\ 1 \le i \le N_k\}$ denotes an ordering of the pairs in $\mathcal Y_k$ such that $X'_{k1} < \dots < X'_{kN_k}$. Define $\hat\sigma_k^2$ by either $\hat\sigma_k^2 = \tilde\sigma_k^2$ or

$$\hat\sigma_k^2 = n_{k-1}(n_{k-1} + N_k)^{-1} \hat\sigma_{k-1}^2 + N_k(n_{k-1} + N_k)^{-1} \tilde\sigma_k^2.$$
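The difference-based variance estimator is a direct transcription of the display above, and the pooled update is the convex combination just given; function names are ours:

```python
import numpy as np

def diff_variance(x, y):
    """First-order difference estimator from part (b):
    sigma_tilde^2 = {2(N-1)}^{-1} sum_{i=1}^{N-1} (Y'_{i+1} - Y'_i)^2,
    with the pairs ordered so that the x's increase."""
    y_sorted = np.asarray(y)[np.argsort(x)]
    return np.sum(np.diff(y_sorted)**2) / (2.0 * (len(y_sorted) - 1))

def pooled_variance(s2_prev, n_prev, s2_new, n_new):
    """Convex-combination update of the variance estimate across steps."""
    return (n_prev * s2_prev + n_new * s2_new) / (n_prev + n_new)
```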

2.2. Motivation for the algorithm. We shall motivate the algorithm in terms of its ability to minimize integrated squared error. This is often appropriate when the curve is to be used for calibration or prediction, for example. In other cases, where applications will be more interpretative, perhaps through analysis of unusual features or turning points, a different criterion would be employed. The difference in the criterion may be as simple as employing a weight function when defining mean squared error, perhaps with the weight chosen adaptively so as to select features of interest. Alternatively, a measure of risk allied to a geometric description of distance, such as the Hausdorff metric, might be employed instead of squared-error loss. The broad approach to constructing the algorithm would be similar in such cases, with the aim still being to optimise an empirical measure of loss. But details will of course differ.

We shall describe the motivation in a heuristic fashion, but it may be rigorously justified under the following conditions: (a) the target function $g$ has two continuous derivatives on $I$, and $g''$ vanishes only at a finite number of points; (b) the error distribution has all moments finite, zero mean and variance $\sigma^2$; and (c) the symmetric, nonnegative kernel $K$ is Hölder continuous and supported on a bounded interval, say $(-c, c)$. With these assumptions, Section 2.3 will state a theorem addressing the performance of the algorithm.

Suppose $n$ independent observations are made of a pair $(X, Y)$ generated as $Y = g(X) + \epsilon$, in which the design variable $X$ is distributed with a continuous density $f$, and the distribution of the error $\epsilon$ has zero mean and variance $\sigma^2$. An estimator of $g$ based on local linear smoothing, using kernel $K$ and bandwidth $h$, has asymptotic mean squared error at $x \in I$ given by

$$H_n(x, h|f) = (nh)^{-1} \kappa_1 \sigma^2 f(x)^{-1} + \tfrac14 h^4 \kappa_2^2\, g''(x)^2, \qquad (2.1)$$

where $\kappa_1$ is as in Section 2.1 and $\kappa_2 = \int y^2 K(y)\, dy$. See for example Fan (1993) and Hastie and Loader (1993). For fixed $x$, which we now suppress, the quantity at (2.1) is minimized by taking $h = h_0 = (n\, \kappa_3 f\, g''^2)^{-1/5}$, where $\kappa_3 = \kappa_2^2/(\kappa_1 \sigma^2)$. Substituting back into (2.1) we deduce that, with an optimal local choice of bandwidth, mean squared error is given asymptotically by $n^{-4/5} \kappa_4\, (g''^2/f^4)^{1/5}$, where the constant $\kappa_4$ depends only on $K$ and $\sigma^2$. The minimum mean integrated squared error is obtained by integrating this quantity over $I$, producing a functional proportional to $A(f) = \int_I (g''^2/f^4)^{1/5}$.
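For readers wishing to experiment, a minimal local linear smoother in the standard weighted-least-squares form can be written as follows (our sketch; any resemblance to the authors' implementation is incidental):

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate of g at points x0 from data (x, y), with
    bandwidth h (a scalar, or an array matching x0 for a locally adaptive
    choice) and the Epanechnikov kernel."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    h = np.broadcast_to(np.asarray(h, dtype=float), x0.shape)
    out = np.empty_like(x0)
    for j in range(len(x0)):
        d = x - x0[j]
        u = d / h[j]
        w = 0.75 * np.maximum(1.0 - u**2, 0.0)        # kernel weights
        s0, s1, s2 = w.sum(), (w*d).sum(), (w*d*d).sum()
        t0, t1 = (w*y).sum(), (w*d*y).sum()
        # Intercept of the locally weighted least-squares line at x0[j]:
        out[j] = (s2*t0 - s1*t1) / (s2*s0 - s1*s1)
    return out
```

The denominator $s_2 s_0 - s_1^2$ is exactly the quantity $W_k$ that reappears in the proof outline of Section 4.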

The optimal design density $f$ is the function which minimizes $A(f)$ subject to $\int_I f = 1$ and $f \ge 0$. A simple calculus-of-variations argument shows that this is given by $f_0 = c_0 |g''|^{2/9}$, where the constant $c_0$ is chosen to ensure that $\int_I f_0 = 1$. Note particularly that for this choice the optimal bandwidth is inversely proportional to the square of $f$: $h_0 = c_1 f^{-2}$, where $c_1$ is a constant. This explains why, in the algorithm suggested in Section 2.1, we took the bandwidth for computing $\hat g_k$ to vary in proportion to $\hat f_k^{-2}$.
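A quick numerical check of the normalisation $f_0 = c_0 |g''|^{2/9}$, using $g(x) = \sin(2\pi x)$ as an assumed test curve (our choice, not one from the paper):

```python
import numpy as np

def trap(y, x):                                  # trapezoidal rule
    return np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0)

x = np.linspace(0.0, 1.0, 2001)
g2 = -(2*np.pi)**2 * np.sin(2*np.pi*x)           # g'' for g(x) = sin(2*pi*x)
w = np.abs(g2)**(2.0/9.0)                        # unnormalised |g''|^{2/9}
f0 = w / trap(w, x)                              # optimal design density
print(trap(f0, x))                               # ~1.0, as required
```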

In (2.1), let the bandwidth $h$ be $h_1 = b n^{-1/5} f^{-2}$, where $b$ is an arbitrary positive constant and $f$ is an arbitrary continuous density, bounded away from zero on $I$. On integrating $H_n(\cdot, h_1|f)$ over $I$, the contribution from the first term may be seen to equal

$$\int_I (nh_1)^{-1} \kappa_1 \sigma^2 f^{-1} = \kappa_1 \sigma^2 n^{-4/5} b^{-1}.$$

Note particularly that the effect of $f$ has disappeared. This explains the origin of the first term in the formula for $\Delta$, and why it depends only on $b$, not on $a_1, \dots, a_{m_k}$. The second term is an estimate of the integral of the second term of (2.1), again with $h_1$ substituted for $h$.

If the function $g$ is at all "interesting" then it will enjoy one or more points of inflection, where $g'' = 0$. There, the optimal design density determined by the argument two paragraphs above equals zero, and in such cases the approximate formula at (2.1) is not valid. We overcome this problem by introducing a threshold, $\eta$, below which the value of $\hat f_k$ is not allowed to fall. When discussing theoretical properties of our algorithm it is convenient to also impose a ceiling on $f$. For economy of notation we take this to be $\eta^{-1}$, although we could have used any large positive number. Given $\eta \in (0, 1)$, define

$$f_\eta = \begin{cases} c_\eta |g''|^{2/9} & \text{if } c_\eta |g''|^{2/9} \in (\eta, \eta^{-1}), \\ \eta & \text{if } c_\eta |g''|^{2/9} \le \eta, \\ \eta^{-1} & \text{if } c_\eta |g''|^{2/9} \ge \eta^{-1}, \end{cases} \qquad (2.2)$$

where the positive constant $c_\eta$ is uniquely defined by the requirement that $\int f_\eta = 1$, and where $f_\eta \to f_0$ and $c_\eta \to c_0$ as $\eta \to 0$. Let $b_\eta$ denote the value of the constant $b$ which minimizes

$$b^{-1} \kappa_1 \sigma^2 + \tfrac14 b^4 \kappa_2^2 \int_I g''^2 f_\eta^{-8},$$

and let $\eta_1 \in (0, 1)$ be so small that $b_\eta \in (\eta_1, \eta_1^{-1})$.

If in part (a) of the algorithm we restrict attention to $(a_1, \dots, a_{m_k}, b)$ such that

$$\eta \le a_i \le \eta^{-1} \ \text{ for } 1 \le i \le m_k, \quad \text{and} \quad \eta_1 \le b \le \eta_1^{-1}, \qquad (2.3)$$

where $\eta \in (0, 1)$ and $\eta_1$ is as defined in the previous paragraph, then the histogram $\hat f_k$ is an estimator of $f_\eta$, $\hat b_k$ is an estimator of $b_\eta$, and the constant $C$ in the definition of $J$ may be taken equal to $\eta^{-2} \eta_1^{-1}$. The consistency of $\hat f_k$ and $\hat b_k$ for $f_\eta$ and $b_\eta$, respectively, will be shown during the proof of Theorem 2.1.
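The following sketch constructs $f_\eta$ on a grid by bisecting for $c_\eta$, and computes $b_\eta$ from its closed-form minimiser $b_\eta = \{\kappa_1 \sigma^2 / (\kappa_2^2 \psi(f_\eta))\}^{1/5}$; all names, the grid representation, and the Epanechnikov default constants are our own assumptions:

```python
import numpy as np

def trap(y, x):                                  # trapezoidal rule
    return np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0)

def f_eta(x, g2, eta):
    """Thresholded design density of (2.2) on a grid: clip c|g''|^{2/9} to
    [eta, 1/eta], choosing c by bisection so the result integrates to 1."""
    w = np.abs(g2)**(2.0/9.0)
    lo, hi = 0.0, 1e6
    for _ in range(200):                         # bisection for c_eta
        c = 0.5 * (lo + hi)
        if trap(np.clip(c * w, eta, 1.0/eta), x) < 1.0:
            lo = c
        else:
            hi = c
    return np.clip(c * w, eta, 1.0/eta)

def b_eta(x, g2, dens, sigma2, kappa1=0.6, kappa2=0.2):
    """Minimiser of b^{-1} kappa1 sigma^2 + (1/4) b^4 kappa2^2 psi(f):
    b = {kappa1 sigma^2 / (kappa2^2 psi)}^{1/5}. Defaults are the
    Epanechnikov values kappa1 = 3/5 and kappa2 = 1/5."""
    psi = trap(g2**2 * dens**-8.0, x)
    return (kappa1 * sigma2 / (kappa2**2 * psi))**0.2
```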

In problems involving higher orders of estimation the optimal design density is broadly similar to that at (2.2). For example, if we are locally fitting a polynomial of degree $2\nu - 3$, where $\nu \ge 2$ is an integer, then the asymptotically optimal design density is proportional to $|g^{(\nu)}|^{2/(4\nu+1)}$. This generalizes the case $\nu = 2$ treated just above. (The case of fitting polynomials of even order is a little different; note for example the results of Ruppert and Wand (1994).)

2.3. Properties of the algorithm. Let (C) denote conditions (a)–(c) introduced in the second paragraph of Section 2.2, as well as condition (2.3) for sufficiently small $\eta_1$, and the assumption that for some $\delta \in (0, 1)$ we have $\nu_k = O(n_k^{1-\delta})$, $m_k = O(\nu_k^{1-\delta})$ and $m_k \to \infty$. Let $f_\eta$ have the definition given in Section 2.2, with $\eta$ as in (2.3).

The minimum mean squared error obtained by optimizing over all design densities subject to the constraint at (2.3) is $H_n(x, h|f_\eta)$. Our main theorem states that $\hat g_k$ achieves this optimum.

Theorem 2.1. Assume conditions (C), with $\eta_1$ in (2.3) chosen as suggested in Section 2.2. Then

$$\Big\{ \int_I (\hat g_k - g)^2 \Big\} \Big/ \Big\{ \int_I \inf_h H_n(\cdot, h|f_\eta) \Big\} \to 1 \qquad (2.4)$$

with probability 1 as $k \to \infty$.

Remark 2.1. Use of ridge methods. The local linear estimator $\hat g_k$ is defined without recourse to adjustments, such as ridging, which are designed to induce more stable numerical properties. For a discussion of stabilization methods in nonsequential contexts, see for example Seifert and Gasser (1995). There is no difficulty in developing an analogue of Theorem 2.1 for the case where $\hat g_k$ is defined by ridged or interpolated local linear smoothing.

Remark 2.2. Efficiency. The efficiency of employing an optimal design density may be expressed as the ratio of two mean integrated squared errors, the first using the optimal density and the second using another density, $f$, say. In view of Theorem 2.1, efficiency may also be expressed in purely empirical terms. Indeed, the ratio of $\int_I (\hat g_k - g)^2$ to $\int_I \inf_{f,h} E_f(\hat g - g)^2$, where $\hat g$ is a standard local linear estimator using kernel $K$ (with ridging employed to ensure that mean squared error is well-defined), and where the infimum is taken over all choices of $f$ and of a nonstochastic, locally adaptive bandwidth $h$, converges (as first $k \to \infty$ and then $\eta \to 0$) to 1. In this sense, the sequential estimator $\hat g_k$ is fully efficient.

Remark 2.3. Alternative regression estimators. It is of interest to consider alternative approaches to regression, not least because they can produce results of a different character from those described earlier. We shall treat two: the classical Nadaraya-Watson method (see e.g. Härdle (1990), Section 3.1, and Wand and Jones (1995), p. 119) and a "transformation approach" (Hart (1991)). By judicious choice of the design density, depending on the target function $g$, both these techniques permit the asymptotic bias of a second-order kernel estimator $\hat g$ to vanish, and hence the mean squared error to be an order of magnitude smaller than in the case of local linear smoothing. This result may seem to contradict the known minimax optimality of local linear smoothing (Fan 1993), but it should be noted that such optimality results pertain only to the case of a fixed design density that is bounded away from zero and infinity, and hence do not apply to sequentially chosen designs. While reduced mean squared error is an attractive feature, it is achieved only at the expense of significant practical difficulties in constructing the sequential version of the estimator. Therefore, we do not develop the methods here beyond outlining theoretical properties and discussing their implications.

For a Nadaraya-Watson kernel estimator the asymptotic mean squared error formula, the analogue of (2.1), may be written as

$$(nh)^{-1} \kappa_1 \sigma^2 f(x)^{-1} + \kappa h^4 \big\{ g''(x) + 2\, g'(x)\, f'(x)\, f(x)^{-1} \big\}^2, \qquad (2.5)$$

where the constant $\kappa$ depends only on the kernel. The second term here represents the squared-bias contribution to mean squared error, and may be rendered equal to zero (except at zeros of $g'$) by defining $f = f_0 = c_0 |g'|^{-1/2}$, where $c_0^{-1} = \int |g'|^{-1/2}$. Hart's transformation estimator has a mean squared error which enjoys a similar expansion, this time with the second term in (2.5) replaced by

$$\kappa h^4 \big\{ g''(x)\, f(x)^{-2} - g'(x)\, f'(x)\, f(x)^{-3} \big\}^2.$$

That quantity is rendered equal to zero by using the design density $f_0 = c_0 |g'|$, with $c_0^{-1} = \int |g'|$.
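Both cancellations are easy to verify symbolically. The check below (a sympy sketch of ours, assuming $g' > 0$ locally so that $|g'|$ may be replaced by $g'$) confirms that each bias factor vanishes identically:

```python
import sympy as sp

x = sp.symbols('x')
g = sp.Function('g')
gp = sp.diff(g(x), x)          # g'
g2 = sp.diff(g(x), x, 2)       # g''

# Nadaraya-Watson: with f proportional to g'^{-1/2} (assuming g' > 0 locally),
# the bias factor g'' + 2 g' f'/f vanishes identically.
f_nw = gp**sp.Rational(-1, 2)
print(sp.simplify(g2 + 2*gp*sp.diff(f_nw, x)/f_nw))           # -> 0

# Transformation estimator: with f proportional to g', the factor
# g'' f^{-2} - g' f' f^{-3} also vanishes identically.
f_tr = gp
print(sp.simplify(g2/f_tr**2 - gp*sp.diff(f_tr, x)/f_tr**3))  # -> 0
```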

For both the Nadaraya-Watson and transformation definitions of $\hat g$ we may estimate $|g'|$ either explicitly or implicitly, and employ the estimator in the obvious way to construct an estimator of $f_0$. In principle, if $g$ has four bounded derivatives then this technique can produce an order of squared bias equal to $h^8$, and hence an order of mean squared error equal to $n^{-8/9}$ (rather than the $n^{-4/5}$ typically associated with second-order methods), away from points where $g' = 0$. In practice, however, there are significant obstacles to achieving such performance. The first problem is that these methods demand an estimator of $g'$ whose high-order derivatives accurately estimate those of $g'$. While this is feasible for large samples, it is not really practicable with smaller data sets. Since a sequential procedure would usually start with relatively small samples, this is a drawback. The inherent numerical instability of estimators of ratios of functions, such as appear in formulae for the bias terms of Nadaraya-Watson or transformation estimators, makes it even more difficult to ensure good performance in small samples.

Secondly, attaining optimal performance at the level $n^{-8/9}$ demands a particularly complex bandwidth formula, depending on the fourth derivative of $g$. The theoretical difficulties are easily overcome, but an attractive numerical algorithm seems to be out of reach. In particular, there does not appear to exist a high-order analogue of the simple algorithm suggested in Section 2.1, which simultaneously produces an estimator of the optimal bandwidth and optimal design density. Therefore, selection of the appropriate bandwidth in a high-order sequential procedure is a particularly awkward problem. Thirdly, this high-order performance seems to be achievable only away from points where $g'$ vanishes; at those points the optimal design density is either infinite (in the case of the Nadaraya-Watson estimator) or zero (for the transformation estimator). Different bandwidth selection procedures seem to be necessary at those points, producing different convergence rates. This makes for a cumbersome approach to inference.

3. Numerical results.

3.1. Efficiencies of some suboptimal designs. Recall, from Section 2.2, that the mean integrated squared error associated with a design $f$ and an optimally chosen bandwidth is proportional to $A(f) = \int_I (g''^2/f^4)^{1/5}$, and that the optimal design is given by $f_0 = c_0 |g''|^{2/9}$. Thus, $A(f_0) = (\int_I |g''|^{2/9})^{9/5}$ is the minimum achievable value of $A(f)$, and it is natural to define the efficiency of a design $f$, relative to the optimal design, by $A(f_0)/A(f)$. We evaluated this in the context of fifteen true curves given by the class of Gaussian mixtures proposed by Marron and Wand (1992). These vary from the standard Gaussian density function through a range of Gaussian mixtures of varying complexity. The design space is the interval $(-3, 3)$.

We describe our results here only in words. A more detailed account, including tabular results, is available in a technical report (Cheng, Hall and Titterington 1995).

We computed Monte Carlo approximations to efficiencies of designs that are themselves mixtures of the optimal design, $f_0$, and the uniform design on $(-3, 3)$, with mixing weights $\theta$ and $1 - \theta$, respectively. The weight $\theta$ ranged over $0.0, 0.1, \dots, 1.0$. For many of the fifteen curves there was not much to be gained from using the optimal design rather than the uniform, but in some cases the gains were considerable. Particular instances were curves 3–5 and 10–14, in the nomenclature of Marron and Wand (1992).

Bearing the above robustness in mind, as well as the fact that our algorithm for sequential design involves a sequence of piecewise-uniform designs, we computed the efficiencies of such designs. We addressed cases corresponding to $m = 2^r$, for $r = 0, 1, \dots, 4$, and with the $a_i$'s chosen at their optimal values. If such a design is denoted by $f_{m0}$ then it is straightforward to show that

$$A(f_{m0}) = (L/m)^{4/5} \Big( \sum_i A_i^{5/9} \Big)^{9/5},$$

in which $L$ denotes the length of the design space (an interval), and $A_i = \int |g''|^{2/5}$, where the range of integration is the $i$th cell of the design, of length $L/m$, for $1 \le i \le m$. We obtained good gains in efficiency by $m = 8$. For example, in the case of curves 3–5 (in the order of Marron and Wand (1992)), efficiencies were respectively 69%, 74% and 52% for $m = 1$, but had increased to 93%, 90% and 82% by $m = 8$.

Since zeros of the optimal design density correspond to points of inflexion of $g$ (see (2.2)), the departure of the optimal design from uniformity will tend to be in proportion to the "wiggliness" of $g$. The numerical analyses described above confirm this informal theoretical conclusion. For example, functions 10–15 of Marron and Wand (1992) exhibit the greatest number of points of inflexion, and include a class of functions for which we observed substantial gains in performance when attempting to optimise the design.

3.2. An illustration of the algorithm. In the small study reported here we concentrated on the mutual similarity of the formula for the asymptotic integrated mean squared error and the corresponding estimator thereof, defined by $\Delta$, and on how closely the design created after one iteration of the algorithm approximates the optimal design. Although our theory is asymptotic in $k$, in practice only a small number of iterations of the algorithm would be carried out, bearing in mind the relationships between successive sample sizes: the design procedure is best described as batch-sequential, with large batch sizes, and anything approaching genuinely sequential design, in which the design is updated one observation at a time, does not make sense in the context of nonparametric regression. To help make the conclusions clear, the curve chosen as exemplar was the continuous, but not continuously differentiable, piecewise quadratic curve given by

$$g(x) = \begin{cases} x(1 - 2x)/4 & \text{if } 0 \le x \le 1/2, \\ 2(1 - 2x)(1 - x) & \text{if } 1/2 \le x \le 1. \end{cases}$$

For this curve, therefore, the optimal design is piecewise uniform, corresponding to $m = 2$, with $a_1 = 2/(1 + 64^{1/9}) = 0.773$ and $a_2 = 2 - a_1$. The Epanechnikov kernel was used in the local linear fitting. Thus $\kappa_1 = 3/5$ and $\kappa_2^2 = 1/25$, so that the optimal value of $b$ was $(2^9 \cdot 15 \cdot \sigma^2)^{1/5}/(1 + 64^{1/9})^{9/5}$. Therefore, the experiment involved choosing an initial sample of size $n_0$, taking $m_1 = 2$, and investigating the fruits of the algorithm in terms of the resulting estimates of $a_1$ and $a_2$. A variety of values were chosen for $n_0$ and $\nu_1$, and care was taken to choose the range of integration $J$ suitably when calculating $\Delta_1$. In fact the range of integration included a small gap near the discontinuity in $g'(x)$ at $x = 1/2$, where the calculations were unstable.
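The quoted constants are easy to reproduce (a direct numerical transcription of the formulas above):

```python
# Constants of Section 3.2: |g''| = 1 on (0, 1/2) and 8 on (1/2, 1), so the
# optimal two-cell design has heights in the ratio 1 : 8^{2/9} = 1 : 64^{1/9}.
a1 = 2.0 / (1.0 + 64.0**(1.0/9.0))
a2 = 2.0 - a1
print(round(a1, 3), round(a2, 3))                 # 0.773 1.227

for sigma in (0.01, 0.05):
    b_opt = (2.0**9 * 15.0 * sigma**2)**0.2 / (1.0 + 64.0**(1.0/9.0))**1.8
    print(sigma, round(b_opt, 3))                 # 0.171 and 0.326
```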

The bandwidth used for calculating the local linear curve estimates $\hat g_0$ was the asymptotically optimal value, constant in $x$, based on the estimate of $\sigma$ calculated from first-order differences, and on an assumption of constant curvature for $g$ of magnitude 4. (Note that the true curvature has magnitude 1 on the first half of the interval, and 8 thereafter.)

We report here on 10 replicates of each of the cases $\sigma = 0.01, 0.05$ and $n_0 = 400, 1000$. Throughout, $\nu_1 = 200$. For each replicate, $\Delta$ was evaluated on a grid of values of $(b, a_1)$, and the optimal combination of values was found, correct to two decimal places in each of the two variables. This would seem to be adequate from a practical point of view. The optimal values for $b$ were 0.171 and 0.326, for $\sigma = 0.01$ and 0.05, respectively, and the algorithm achieved these values closely, especially for $\sigma = 0.01$. In that case, the optimal value of $a_1$ (0.773) was slightly over-estimated, because of the smoothing involved in calculating the bias term $\Delta_1$, but the errors were not great. The $\Delta$ surface always had a single well-defined minimum in $(b, a_1)$, at least in the region investigated in the experiment. The values of $\Delta$ were typically within a few percent of those from the asymptotic formula. Occasionally the difference went into double figures, in percentage terms, but was fairly constant as $(b, a_1)$ varied, so that, as reported above, the positions of the minima on the two surfaces were very similar.

4. Outline proof of Theorem 2.1. The proof is given only in barest outline; details are given in the technical report by Cheng, Hall and Titterington (1995).


The first step is to establish a relatively crude upper bound to $|\hat g_k - g|$:

$$\sup_I |\hat g_k - g| = O\big( n_k^{-(2/5)+\eta} \big) \qquad (4.1)$$

with probability 1, as $k \to \infty$, for all $\eta > 0$. As a prelude to establishing (4.1), define $\mathcal N(l) = \{1, \dots, N_l\}$, let $\{X_{li},\ i \in \mathcal N(l)\}$ denote the set of design points constructed in step $l$ of the algorithm, write $Y_{li} = g(X_{li}) + \epsilon_{li}$ for the associated measurements of $Y$, and let $h = \hat b_k n_k^{-1/5} \hat f_k^{-2}$. Let $s_{kj}(x)$ denote the sum of $(x - X_{li})^j K\{(x - X_{li})/h\}$ over $i \in \mathcal N(l)$ and $1 \le l \le k$, put $w_{li}(x) = \{ s_{k2}(x) - s_{k1}(x)(x - X_{li}) \}\, K\{(x - X_{li})/h\}$ for $i \in \mathcal N(l)$, and define $W_k$ to equal the sum of $w_{li}$ over $i \in \mathcal N(l)$ and $1 \le l \le k$.

In this notation,

$$\hat g_k = g + (A_k + B_k) W_k^{-1}, \qquad (4.2)$$

where

$$A_k = \sum_{l=1}^k A(l), \quad A(l) = \sum_{i \in \mathcal N(l)} w_{li}\, \epsilon_{li}; \qquad B_k = \sum_{l=1}^k B(l), \quad B(l) = \sum_{i \in \mathcal N(l)} w_{li} \{ g(X_{li}) - g \}.$$

Defining $K_1 = K$, $K_2(y) = y K(y)$ and

$$V_{lj}(x) = \sum_{i \in \mathcal N(l)} K_j\{(x - X_{li})/h\}\, \epsilon_{li},$$

we have

$$|A_k| \le \sum_{l=1}^k |A(l)| \le \sum_{l=1}^k \sum_{j=1}^2 |s_{k,3-j}|\, h^{j-1} |V_{lj}| \le (s_{k2} + |s_{k1}|\, h) \sum_{l=1}^k \big( |V_{l1}| + |V_{l2}| \big).$$

Also, $|s_{kj}| \le (ch)^j U_k$, where $c > 0$ is chosen so that $(-c, c)$ contains the support of $K$, and

$$U_k = \sum_{l=1}^k \sum_{i \in \mathcal N(l)} K\{(x - X_{li})/h\}.$$


Therefore,

$$|A_k| \le C_1 h^2 U_k \sum_{l=1}^k \big( |V_{l1}| + |V_{l2}| \big),$$

where, here and below, $C_1$ and $C_2$ are positive constants. Similarly, an upper bound may be established for $B_k$. In consequence it may be shown from (4.2) that

$$|\hat g_k - g| \le C_2 W_k^{-1} \Big\{ h^2 U_k \sum_{l=1}^k \big( |V_{l1}| + |V_{l2}| \big) + h^4 U_k^2 \Big\}. \qquad (4.3)$$

Computations based on large-deviation bounds for the variables $V_{l1}$ and $V_{l2}$ show that $\sup_I |V_{lj}| = O_p\big( n_k^{(2/5)+\eta} \big)$ for all $\eta > 0$. Hence, by (4.3),

$$\sup_I |\hat g_k - g| = O\Big\{ W_{k0}^{-1} \big( h_0^2\, n_k^{(2/5)+\eta}\, U_{k0} + h_0^4 U_{k0}^2 \big) \Big\}, \qquad (4.4)$$

with probability 1, where $U_{k0} = \sup_I U_k$, $W_{k0} = \inf_I W_k$ and $h_0 = \sup_I h$. It may be shown that $U_{k0} = O(n_k^{4/5})$ and $W_{k0}^{-1} = O(n_k^{-6/5})$; and, by definition of $h$, that $h_0 = O(n_k^{-1/5})$, all results holding with probability 1. The desired result (4.1) follows from these bounds and (4.4).

The next step is to show that

$$\sup_I |\hat f_k - f_\eta| \to 0, \qquad \hat b_k \to b_\eta, \qquad (4.5)$$

with probability 1. First, using (4.1) we may prove that

$$\Delta_1(a_1, \dots, a_{m_k}, b) = \tfrac14 \nu_k^{-4/5} b^4 \kappa_2^2\, \psi(f) + o\big( \nu_k^{-4/5} \big), \qquad \Delta(a_1, \dots, a_{m_k}, b) = \nu_k^{-4/5} \phi(f, b) + o\big( \nu_k^{-4/5} \big), \qquad (4.6)$$

uniformly in $a_1, \dots, a_{m_k}, b$ satisfying (2.3), where $\psi(f) = \int_I (g''/f^4)^2$ and $\phi(f, b) = \kappa_1 \sigma^2 b^{-1} + \tfrac14 b^4 \kappa_2^2\, \psi(f)$. Results (4.5) may be derived from (4.6).

Using (4.5) it may be shown that

$$s_{kj} = n_k^{(4-j)/5}\, t^{\,j+1} \rho_j f_\eta + o\big( n_k^{(4-j)/5} \big) \qquad (4.7)$$

uniformly on $I$, with probability 1, where $\rho_j = \int y^j K(y)\, dy$ and $t$ is defined by $h = n_k^{-1/5} t$. Now, $\rho_0 = 1$, $\rho_1 = 0$, $\rho_2 = \kappa_2$ and $W_k(x) = s_{k2} s_{k0} - s_{k1}^2$, and so by (4.7),

$$\sup_I \Big| W_k \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-1} - 1 \Big| \to 0, \qquad (4.8)$$


where $h_\eta = n_k^{-1/5} f_\eta^{-2} b_\eta$. Similarly it may be shown that

$$\sup_I \Big| B_k \big( \tfrac12 n_k^2 h_\eta^6 \kappa_2^2 f_\eta^2 \big)^{-1} - g'' \Big| \to 0. \qquad (4.9)$$

Combining (4.8) and (4.9) we conclude that

$$\sup_I \Big| B_k W_k^{-1} - \tfrac12 h_\eta^2 \kappa_2\, g'' \Big| = o\big( n_k^{-2/5} \big). \qquad (4.10)$$

Combining (4.2), (4.8) and (4.10) we deduce that

$$\int_I (\hat g_k - g)^2 = (1 + \xi_1) \int_I \Big\{ A_k \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-1} + (1 + \delta)\, \tfrac12 h_\eta^2 \kappa_2\, g'' \Big\}^2 = (1 + \xi_2) \int_I \Big\{ A_k^2 \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-2} + \tfrac14 h_\eta^4 \kappa_2^2\, g''^2 \Big\} + 2 \int_I A_k \big( n_k^2 h_\eta^4 \kappa_2 f_\eta^2 \big)^{-1}\, \tfrac12 h_\eta^2 \kappa_2\, g'', \qquad (4.11)$$

where the function $\delta$ satisfies $\sup_I |\delta| \to 0$ with probability 1, and the random variables $\xi_j$ converge to 0 with probability 1. After some simplification of the right-hand side of (4.11) we obtain (2.4).

Acknowledgement. The authors are grateful to David Cohn for access to unpublished work which stimulated the work in this paper. The helpful comments of a referee and editor led to this shortened version of the original manuscript.

REFERENCES

CHALONER, K. AND VERDINELLI, I. (1995). Bayesian experimental design: a review. Preprint.

CHENG, B. AND TITTERINGTON, D.M. (1994). Neural networks: a review from a statistical perspective (with discussion). Statistical Science 9, 2–54.

CHENG, M.-Y., HALL, P. AND TITTERINGTON, D.M. (1995). Optimal design for curve estimation by local linear smoothing. Research Report No. SRR046-95, Centre for Mathematics and its Applications, Australian National University.

CHU, C.-K. AND MARRON, J.S. (1991). Choosing a kernel regression estimator (with discussion). Statistical Science 6, 404–436.

COHN, D.A. (1994). Neural network exploration using optimal experimental design. MIT AI Lab Memo No. 1491.


FAN, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. Statist. 21, 196–216.

FEDOROV, V.V. (1972). Theory of Optimal Experiments. Academic Press, New York.

FORD, I., KITSOS, C.P. AND TITTERINGTON, D.M. (1989). Recent advances in nonlinear experimental designs. Technometrics 31, 49–60.

HÄRDLE, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge, U.K.

HART, J.D. (1991). Contribution to discussion of Chu and Marron (1991). Statistical Science 6, 425–427.

HASTIE, T. AND LOADER, C. (1993). Local regression: automatic kernel carpentry. Statistical Science 8, 120–143.

KIEFER, J. (1959). Optimal experimental designs (with discussion). J. R. Statist. Soc. B 21, 272–319.

MACKAY, D.J.C. (1992). Information-based objective functions for active data selection. Neural Computation 4, 590–604.

MARRON, J.S. AND WAND, M.P. (1992). Exact mean integrated squared error. Ann. Statist. 20, 712–736.

PUKELSHEIM, F. (1993). Optimal Design of Experiments. Wiley, New York.

RUPPERT, D. AND WAND, M.P. (1994). Multivariate locally weighted least squares regression. Ann. Statist. 22, 1346–1370.

SEIFERT, B. AND GASSER, T. (1995). Finite-sample variance of local polynomials: analysis and solutions. J. Amer. Statist. Assoc., to appear.

SILVEY, S.D. (1980). Optimal Design. Chapman and Hall, London.

WAND, M.P. AND JONES, M.C. (1995). Kernel Smoothing. Chapman & Hall, London.
