Reducing variance in nonparametric surface estimation


Journal of Multivariate Analysis 86 (2003) 375–397


Ming-Yen Cheng^{a,b} and Peter Hall^{a,*}

^a Centre for Mathematics and its Applications, Australian National University, Canberra, ACT 0200, Australia

^b Department of Mathematics, National Taiwan University, Taipei 106, Taiwan

Received 12 February 2001

Abstract

We suggest a method for reducing variance in nonparametric surface estimation. The technique is applicable to a wide range of inferential problems, including both density estimation and regression, and to a wide variety of estimator types. It is based on estimating the contours of a surface by minimising deviations of elementary surface estimates along a quadratic curve. Once a contour estimate has been obtained, the ﬁnal surface estimate is computed by averaging conventional surface estimates along a portion of the contour. Theoretical and numerical properties of the technique are discussed.

AMS 1991 subject classifications: primary 62G07; secondary 62H12

Keywords: Bandwidth; Boundary effect; Kernel method; Nonparametric density estimation; Nonparametric regression; Variance reduction

1. Introduction

We suggest a variance reduction method for nonparametric surface estimators, based on approximating the projection of a contour into the design plane at the point $x$ where we wish to construct the estimate. The contour estimator is then used as an axis along which a continuum of conventional surface estimates is averaged in order to achieve a final estimate at $x$. Since our technique does not alter asymptotic bias, the reduction in variance that it offers leads directly to a reduction in asymptotic mean squared error.

*Corresponding author.

E-mail address: halpstat@pretty.anu.edu.au (P. Hall).

This method has several novel features. Firstly, it exploits the extra degree of freedom that is available in the problem of surface estimation. Secondly, it provides a new technique for estimating gradients and curvatures of contour lines, without passing explicitly to derivatives of surface estimates. Thirdly, when applied to a surface estimate that is always positive, in either density estimation or regression, our method produces a boundary-corrected estimate that is always positive. Our approach to estimating contours involves choosing either a line segment or a quadratic along which a conventional surface estimator is least variable, in the neighbourhood of the point x at which we wish to estimate the surface.

The technique is applicable to nonparametric methods in both density estimation and regression. Indeed, it is not tied to a particular estimator type in either of these settings; for example, in nonparametric regression it can be used in conjunction with spline, local linear or Nadaraya–Watson methods. In the case of density estimation, when a conventional kernel estimator is used as its basis, the technique can be viewed as a device for re-computing kernel shape.

As implied two paragraphs above, the technique also has potential application for overcoming edge effects. Modiﬁed boundary kernel methods have been proposed for addressing this problem (see e.g. [14,19,20]), but like their univariate counterparts they can produce negative estimates at boundaries. Local polynomial and local parametric methods are more successful in this regard, although the increase in variance of such techniques near the boundary means that good asymptotic performance is often not visible unless sample size is particularly large. Scott ([18], pp. 82–85) gives a particularly illuminating discussion of issues such as these.

Multivariate generalisations are of course possible. However, since the multivariate analogue of a contour is not so familiar, not as readily depicted, and not as easy to compute as in the bivariate case, high-dimensional generalisations do not offer as convenient a vehicle for illustrating the potential of the method. If the distribution is $d$-variate then the contour corresponding to "height" $H$ is the set of points $y$ such that $g(y) = H$, and is a region with $d - 1$ degrees of freedom.

Our variance reduction method is related to the so-called balloon kernel techniques for density estimation. See [9,18, p. 149ff]. There is an extensive literature on approaches for remedying boundary effects in density estimation and regression, mainly in univariate cases. It includes methods based on special "boundary kernels", for example those considered by Gasser and Müller [6], Gasser et al. [7], Granovsky and Müller [8] and Müller [13]. Rice [15] suggested a dual-bandwidth approach. So-called "reflection methods" include those of [1a,10,17]. The projection method of Djojosugito and Speckman [2] is in the same spirit. Eubank and Speckman [3] proposed a method that involves combining a conventional curve estimator with a substantially undersmoothed estimator. Cheng, Fan and Marron [1] suggested methods that have optimal asymptotic performance at boundaries. The natural boundary-respecting properties of local polynomial methods have been discussed by Fan [4], Hastie and Loader [11], Ruppert and Wand [16] and Fan and Gijbels [5], for example. See also [12].

Section 2 will introduce our method and discuss, in an heuristic and nontechnical way, its variance-reduction properties. Theoretical results, underpinning the informal arguments in Section 2, will be given in Section 3, and rigorous technical details will be outlined in Section 5. Section 4 will summarise a simulation study that complements the theory.

2. Methodology

2.1. The method

Let g denote a univariate function of a 2-vector; for example, g might be the density of a bivariate distribution, or the mean in a regression problem where the explanatory variable is bivariate and the response variable is scalar. We wish to estimate g nonparametrically, making only smoothness assumptions and exploiting the extra degree of freedom that is available in the context of surface estimation, relative to the conventional case where the argument of g is univariate.

To this end we first construct an elementary nonparametric estimator $\hat g$ of $g$. For example, when $g$ is a probability density we might take

$$\hat g(x) = (nh^2)^{-1} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \qquad (2.1)$$

where $K$ is a radially symmetric bivariate kernel, $h$ is a bandwidth, and $X_1, \ldots, X_n$ are independent and identically distributed random variables with density $g$.
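As a concrete illustration of (2.1), the following is a minimal sketch in Python using only the standard library. The function names `biweight` and `kde`, the seed, and the simulated sample are assumptions made for this example; the biweight kernel is scaled here by $3/\pi$ so that it integrates to one over the plane, which is a choice of this sketch rather than something fixed by (2.1).

```python
import math
import random


def biweight(u1, u2):
    """Spherical biweight kernel on the unit disc.

    Scaled by 3/pi so that it integrates to one over the plane
    (normalisation chosen for this sketch).
    """
    r2 = u1 * u1 + u2 * u2
    return (3.0 / math.pi) * (1.0 - r2) ** 2 if r2 < 1.0 else 0.0


def kde(x, data, h, kernel=biweight):
    """Elementary estimator (2.1): (n h^2)^{-1} sum_i K((x - X_i) / h)."""
    n = len(data)
    total = sum(kernel((x[0] - xi[0]) / h, (x[1] - xi[1]) / h) for xi in data)
    return total / (n * h * h)


# Hypothetical data: 500 draws from the bivariate N(0, I) distribution.
rng = random.Random(0)
sample = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
estimate = kde((0.85, 1.04), sample, h=0.8)
```

Any other radially symmetric kernel with support in the unit disc could be passed as `kernel` without changing `kde`.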

Next we describe construction of a local quadratic estimator of the level set, or contour, of $g$ in the neighbourhood of $x$; local linear estimators will be treated in Section 2.2. Let $C(x|\theta, c)$ denote the parabola passing through $x = (x^{(1)}, x^{(2)})$, with its vertex at $x$ and its tangent there in the direction of the unit vector $(\cos\theta, \sin\theta)$, and with curvature $2c$ at $x$. Thus, as a curve in the $(z^{(1)}, z^{(2)})$-plane, $C(x|\theta, c)$ has equation

$$(z^{(2)} - x^{(2)})\cos\theta - (z^{(1)} - x^{(1)})\sin\theta = c\,\{(z^{(2)} - x^{(2)})\sin\theta + (z^{(1)} - x^{(1)})\cos\theta\}^2.$$

We shall constrain $\theta$ and $c$ by $-\pi/2 < \theta \le \pi/2$ and $-\infty < c < \infty$, which ensures that each nondegenerate parabola in the plane is representable by $C(x|\theta, c)$ for a unique triple $(x, \theta, c)$.
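The parabola can also be written parametrically, which is convenient for computation: a point at signed tangential offset $t$ from the vertex lies at $x + t(\cos\theta, \sin\theta) + ct^2(-\sin\theta, \cos\theta)$. The sketch below (function names are illustrative, not from the paper) generates such points and checks them against the implicit equation above.

```python
import math


def parabola_point(x, theta, c, t):
    """Point on C(x | theta, c) at signed tangential offset t from the vertex x.

    The tangent direction is (cos, sin); the normal direction is (-sin, cos);
    the point is displaced by c * t^2 along the normal.
    """
    ct, st = math.cos(theta), math.sin(theta)
    return (x[0] + t * ct - c * t * t * st,
            x[1] + t * st + c * t * t * ct)


def parabola_residual(z, x, theta, c):
    """Left side minus right side of the implicit equation; zero on the curve."""
    d1, d2 = z[0] - x[0], z[1] - x[1]
    ct, st = math.cos(theta), math.sin(theta)
    return (d2 * ct - d1 * st) - c * (d2 * st + d1 * ct) ** 2


vertex = (0.85, 1.04)
residuals = [
    parabola_residual(parabola_point(vertex, 0.3, -0.4, t), vertex, 0.3, -0.4)
    for t in (-1.0, -0.25, 0.0, 0.5, 1.0)
]
```

Substituting the parametric form into the implicit equation gives $ct^2$ on both sides, so every entry of `residuals` is zero up to rounding.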

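Given $\lambda > 0$, let $C(x|\theta, c, \lambda)$ denote the set of points $z \in C(x|\theta, c)$ that satisfy $\|z - x\| \le \lambda h$, where $\|\cdot\|$ denotes standard Euclidean distance. Let $|C|$ denote the length of a finite segment $C$ of a rectifiable curve, and put $\xi(c, \lambda) = |C(x|\theta, c, \lambda)|$,

$$\check g(x|\theta, c, \lambda) = \xi(c, \lambda)^{-1} \int_{C(x|\theta, c, \lambda)} \hat g(z)\, ds,$$

$$S(x|\theta, c, \lambda) = \xi(c, \lambda)^{-1} \int_{C(x|\theta, c, \lambda)} \{\hat g(z) - \check g(x|\theta, c, \lambda)\}^2\, ds, \qquad (2.2)$$

$$(\hat\theta_x, \hat c_x) = \arg\min_{(\theta, c)} S(x|\theta, c, \lambda), \qquad (2.3)$$

where $ds$ is an infinitesimal element of arc length along $C = C(x|\theta, c, \lambda)$ at the point on $C$ with coordinates $z$. Panel (a) of Fig. 1 depicts an example of the contour estimator $C(x|\hat\theta_x, \hat c_x, \lambda)$. Our final estimator of $g(x)$ is

$$\tilde g(x|\lambda) = \check g(x|\hat\theta_x, \hat c_x, \lambda). \qquad (2.4)$$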
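Steps (2.2)-(2.4) can be sketched numerically as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the arc-length integrals are approximated by a trapezoid rule in the tangential offset $t$, the minimisation at (2.3) is a plain grid search over user-supplied candidate values, and all function names are hypothetical. To keep the example deterministic, the known $N(0, I)$ density stands in for $\hat g$.

```python
import math


def segment_points(x, theta, c, radius, m=41):
    """Points on the parabola within distance `radius` of x, with arc-length weights."""
    ct, st = math.cos(theta), math.sin(theta)
    # ||z - x||^2 = t^2 + c^2 t^4 <= radius^2; solve for the largest |t|.
    if c == 0.0:
        tmax = radius
    else:
        tmax = math.sqrt((math.sqrt(1 + 4 * c * c * radius * radius) - 1) / (2 * c * c))
    dt = 2 * tmax / (m - 1)
    pts = []
    for k in range(m):
        t = -tmax + k * dt
        z = (x[0] + t * ct - c * t * t * st, x[1] + t * st + c * t * t * ct)
        w = math.sqrt(1 + 4 * c * c * t * t) * dt      # ds = sqrt(1 + (2ct)^2) dt
        pts.append((z, 0.5 * w if k in (0, m - 1) else w))
    return pts


def contour_stats(g_hat, x, theta, c, radius):
    """Return (g-check, S): the average of g_hat along the segment and its spread, as at (2.2)."""
    pts = segment_points(x, theta, c, radius)
    length = sum(w for _, w in pts)
    vals = [(g_hat(z), w) for z, w in pts]
    mean = sum(v * w for v, w in vals) / length
    spread = sum((v - mean) ** 2 * w for v, w in vals) / length
    return mean, spread


def fit_contour(g_hat, x, radius, thetas, cs):
    """Grid-search version of (2.3)."""
    best = min((contour_stats(g_hat, x, th, c, radius)[1], th, c)
               for th in thetas for c in cs)
    return best[1], best[2]


def contour_estimate(g_hat, x, lam_fit, lam_avg, h, thetas, cs):
    """Final estimator (2.4), allowing different lambdas for fitting and averaging."""
    theta, c = fit_contour(g_hat, x, lam_fit * h, thetas, cs)
    return contour_stats(g_hat, x, theta, c, lam_avg * h)[0], theta, c


def std_normal_density(z):
    """Known surface used in place of g-hat for this illustration."""
    return math.exp(-0.5 * (z[0] ** 2 + z[1] ** 2)) / (2 * math.pi)


thetas = [math.radians(-90 + 2 * k) for k in range(1, 91)]   # grid on (-pi/2, pi/2]
cs = [round(-1.0 + 0.1 * k, 1) for k in range(21)]           # grid on [-1, 1]
x0 = (0.85, 1.04)
estimate, theta_hat, c_hat = contour_estimate(
    std_normal_density, x0, 1.25, 1.25, 0.8, thetas, cs)
```

Because the contours of the $N(0, I)$ density are circles centred at the origin, the fitted tangent should be roughly perpendicular to $x$ and the fitted $c$ negative (the contour bends back towards the origin). Passing `cs=[0.0]` restricts the search to straight segments.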

In practice, one would not necessarily use the same value of $\lambda$ when computing $(\hat\theta_x, \hat c_x)$ and when calculating $\tilde g$. That is, the $\lambda$'s at (2.2) and (2.4) would not necessarily be identical. We shall argue in Section 3 that a relatively large value of $\lambda$ (asymptotically, $\lambda \to \infty$) should be used to give accurate estimation of the "true" quadratic approximation $C(x|\theta_x, c_x)$ to the contour line at $x$. On the other hand, a relatively small value of $\lambda$ may be adequate for reducing variance and removing edge effects in the estimator $\tilde g$.

To give an intuitive explanation of this point, note that estimation of $\theta$ and $c$ is closely related to estimation of second derivatives of $g$, for which a larger bandwidth is needed than when simply estimating $g$ itself. This explains why $\lambda h$, which is effectively a bandwidth for computation of $\hat\theta_x$ and $\hat c_x$, should be relatively large. However, there is not the same pressing need for choosing $\lambda h$ large when estimating $g$ itself.

Fig. 1. Sausage-shaped kernel. In the context of density estimation, panel (a) depicts a portion of a point cloud, and the true contour line (solid line) that passes through $x = (0.85, 1.04)$ (cross sign), when data are from the bivariate normal $N(0, I)$ distribution and $n = 500$. The dotted line is the contour line estimate $C(x|\hat\theta_x, \hat c_x, \lambda)$, calculated at that point based on the spherical biweight kernel, $h = 0.8$, and $\lambda = 1.25$. Panel (b) shows a perspective plot of the corresponding "sausage-shaped" kernel $K_x$, defined at (2.5).

2.2. Choice of contour estimator

To appreciate why minimising $S(x|\theta, c, \lambda)$ produces a parabola that approximates the contour $D(x)$, note that we are in effect finding that choice of $(\theta, c)$ which renders $\hat g(z)$ least variable as we move $z$ along the curve $C(x|\theta, c)$. Indeed, if we were to replace $\hat g(z)$ by its true value, $g(z)$, when defining $\check g(x|\theta, c, \lambda)$ and $S(x|\theta, c, \lambda)$, then the curve $C(x|\hat\theta_x, \hat c_x)$ produced by minimising $S$ would, if not constrained to have a quadratic equation, be exactly $D(x)$. The curve $C(x|\hat\theta_x, \hat c_x)$ represents an empirical, quadratic approximation to this contour.

An alternative technique is to take $C$ to be a line segment, rather than a piece of a quadratic. The mechanics of implementing the approximation are virtually identical in this setting: we replace $C(x|\theta, c, \lambda)$ by $C_{\rm lin}(x|\theta, \lambda)$, denoting the line segment of length $2\lambda$ centred at $x$ and inclined at angle $\theta$; we replace $\xi(c, \lambda)$ at (2.2) by $2\lambda$, and call the resulting integral $S(x|\theta, \lambda)$ instead of $S(x|\theta, c, \lambda)$; and we choose $\theta = \hat\theta_x$ to minimise $S(x|\theta, \lambda)$. This approach is adequate for the results described in Sections 3.1–3.3, but for the higher-order analogues described in Section 3.4 a local quadratic method, or something similar such as fitting local ellipses, is required.

A very different approach in estimating contour lines is to construct an appropriately oversmoothed estimator of the function g; and compute its contours. Oversmoothing is necessary in order to obtain sufﬁciently accurate estimates of derivatives of the surface; these are used explicitly or implicitly in constructing an estimate of the contour. We argue, however, that such a method is in general not as attractive as that proposed here, owing to the relative difﬁculty of drawing contours from differential-geometric properties of a surface.

Nevertheless, oversmoothing $\hat g$ is beneficial when it is necessary to construct $\tilde g$ at a place where the tangent plane to the surface is virtually horizontal. Minimising the function $S(x|\theta, c, \lambda)$ with respect to $(\theta, c)$ relies on detecting off-contour differences in $g$ through variation of $g$; if the gradient of $g$ is low then so too will be the variation. In such cases we rely on higher-order derivatives to provide "leverage" for detecting the contour; hence the need for more dramatic smoothing.

2.3. Removing edge effects

Let $R$ denote the support of the distribution of the points $X_i$ on which the estimator $\hat g$ is based. In the context of density estimation $R$ would be the support of $g$, and in regression $R$ would be the support of the density of $X_i$ in the regression problem $Y_i = g(X_i) + \text{error}$. The basic estimator $\hat g(x)$ potentially suffers from edge effects whenever the support of the function $k_x(z) \equiv K\{(x - z)/h\}$ protrudes outside $R$. However, assuming $K$ is radially symmetric and vanishes outside a disc of unit radius, this problem is solved by the following trivial modification of the estimator suggested in Section 2.1: re-define the parabola segment $C(x|\theta, c, \lambda)$ to be the largest connected subset of $C(x|\theta, c)$ inside the disc $\{z : \|z - x\| \le \lambda h\}$, subject to the set $\mathcal{S}\{\theta, c, C(x|\theta, c, \lambda)\}$ being wholly contained within $R$, where

$$\mathcal{S}(\theta, c, T) \equiv \{z_1 : \|z_1 - z_2\| \le h \text{ for some } z_2 \in T\}.$$

Fig. 2 illustrates the removal of edge effects in this context. Theoretical results, and their derivations, in the presence of edge effects are entirely analogous to their counterparts in the absence of those effects.
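The truncation rule can be sketched as follows, under simplifying assumptions that are not part of the paper: the support $R$ is described by a user-supplied function `clearance(z)` returning the distance from $z$ to the boundary of $R$ (negative or zero outside), so that the $h$-neighbourhood of a curve point lies inside $R$ exactly when its clearance is at least $h$; and the parabola is scanned on a finite grid of tangential offsets. The function name and grid size are hypothetical.

```python
import math


def truncated_range(x, theta, c, lam, h, clearance, steps=401):
    """Longest run of tangential offsets t whose parabola points are admissible.

    A point z(t) is admissible when ||z - x|| <= lam * h and the disc of
    radius h about z stays inside R, i.e. clearance(z) >= h.
    Returns (t_lo, t_hi), or None when no grid point is admissible.
    """
    ct, st = math.cos(theta), math.sin(theta)
    radius = lam * h
    ts = [-radius + 2 * radius * k / (steps - 1) for k in range(steps)]
    flags = []
    for t in ts:
        z = (x[0] + t * ct - c * t * t * st, x[1] + t * st + c * t * t * ct)
        in_disc = math.hypot(z[0] - x[0], z[1] - x[1]) <= radius + 1e-12
        flags.append(in_disc and clearance(z) >= h)

    best, run_start = None, None
    for k, flag in enumerate(flags + [False]):     # trailing False closes the last run
        if flag and run_start is None:
            run_start = k
        elif not flag and run_start is not None:
            if best is None or (k - run_start) > (best[1] - best[0] + 1):
                best = (run_start, k - 1)
            run_start = None
    return None if best is None else (ts[best[0]], ts[best[1]])


# Hypothetical support: the half-plane to the right of the vertical line z1 = 0.
def half_plane_clearance(z):
    return z[0]


interior = truncated_range((1.0, 0.0), 0.0, 0.0, 1.5, 0.5, half_plane_clearance)
near_edge = truncated_range((0.2, 0.0), math.pi / 2, 0.0, 1.5, 0.5, half_plane_clearance)
```

In the first call the segment is cut short on the side facing the boundary; in the second, a straight segment running parallel to the boundary at distance 0.2 < h has no admissible portion, which is the situation where curvature away from the boundary matters.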

2.4. Why the estimator $\tilde g$ has advantages

The advantages stem from the property, established in Section 3, that $\tilde g$ is a good approximation to the average of $\hat g$ over a portion of the true contour of the surface represented by $y = g(x)$. Specifically, let $D(x)$ denote the contour line that passes through $x$, and let $D(x|\lambda)$ equal the largest connected subset of $D(x)$ inside the disc $\{z : \|z - x\| \le \lambda h\}$, subject to $\mathcal{S}\{\theta, c, D(x|\lambda)\}$ being wholly contained within $R$. Write $\check g_{\rm cont}(x|\lambda)$ for the integral average of $\hat g(z)$ over $z \in D(x|\lambda)$. Then, as we shall show in Section 3.2, the difference between $\tilde g(x|\lambda)$ and $\check g_{\rm cont}(x|\lambda)$ is of smaller order than the difference between the latter function and the true value of $g(x)$.

It is easy to see why $\check g_{\rm cont}(x|\lambda)$ is likely to perform better than the conventional estimator $\hat g(x)$. Indeed, the averaging that is explicit in the definition of $\check g_{\rm cont}$ will clearly tend to reduce variance, by an order of magnitude if $\lambda$ is allowed to diverge with $n$. And the bias of $\check g_{\rm cont}(x|\lambda)$ will equal the average value of the bias of $\hat g(z)$ over values of $z$ for which $g(z) = g(x)$. Replacing bias by an average value is generally not deleterious, and in fact the asymptotic bias of $\check g_{\rm cont}(x|\lambda)$ is identical to that of $\hat g(x)$.

Fig. 2. Removing edge effects. In the presence of edge effects the subset of $C(x|\theta, c)$ (dotted curve) that comprises $C(x|\theta, c, \lambda)$ (solid curve) is reduced, to ensure that the resulting region $\mathcal{S}\{\theta, c, C(x|\theta, c, \lambda)\}$, from which the estimator $\tilde g(\cdot|\lambda)$ is computed, lies wholly within the support $R$ (right-hand side of the vertical line) of the design distribution. The point $x$ is marked by a cross. Panels (a) and (b) illustrate cases where the contour is convex and concave, respectively, with respect to the boundary.

2.5. Particular cases of $\tilde g$

In the case of density estimation the estimator $\tilde g$ may be thought of as having been computed using a kernel whose shape is symmetric about the parabolic axis represented by $C(x|\hat\theta_x, \hat c_x)$. If $\hat g$ is given by (2.1) then this kernel is $K_x$, say, defined by

$$K_x(v) \equiv |C(0|\hat\theta_x, \hat c_x h, \lambda/h)|^{-1} \int_{C(0|\hat\theta_x, \hat c_x h, \lambda/h)} K(z + v)\, ds. \qquad (2.5)$$

In this notation the estimator $\tilde g$ has the standard form at (2.1):

$$\tilde g(x|\lambda) = (nh^2)^{-1} \sum_{i=1}^{n} K_x\left(\frac{x - X_i}{h}\right),$$

where the support of $K_x$ is sausage-shaped with its axis represented by the quadratic $C(0|\hat\theta_x, \hat c_x h)$.

Fig. 1 illustrates a typical local quadratic contour estimate, and the associated sausage-shaped kernel, in the case of nonparametric density estimation. There is an obvious analogue of the figure in the case of a local linear approximation to the contour.

In the context of kernel-based regression the estimator $\tilde g$ cannot be expressed simply as the result of replacing $K$ in the definition of $\hat g(x)$ by $K_x$. An approach like this is still feasible, but it would generally involve at least two kernels like $K_x$: one ($K_{x,1}$ say) designed for estimating contours of $fg$, where $f$ is the design density, and the other ($K_{x,2}$) designed to estimate contours of $f$. For example, in the case of local-constant or Nadaraya–Watson estimation of $g$ one would use $K_{x,1}$ and $K_{x,2}$ in the numerator and denominator, respectively, of the estimator. The computational complexity of such an approach makes it unattractive, however.

3. Theoretical properties

3.1. Contour approximation

Our aim in this section is to describe the accuracy with which the empirical contour line $C(x|\hat\theta_x, \hat c_x)$ estimates a nonrandom, quadratic approximation $C(x|\theta_x, c_x)$ to $D(x)$. For brevity we confine our detailed treatment to the case of nonparametric density estimation, noting in Section 3.6 the similarities to nonparametric regression. We deal initially only with situations where edge effects do not arise; Section 3.5 discusses how our results change in the presence of edge effects.

Let S denote a bounded, open set in the plane. We assume of the kernel that K is a compactly supported; radially symmetric; probability


of $h$ and $\lambda$ that

$$h \asymp n^{-1/6}, \quad \lambda^2 h/(\log n)^{5/4} \to \infty, \quad \text{and } \lambda h = O(n^{-\varepsilon}) \text{ for some } \varepsilon > 0, \text{ as } n \to \infty; \qquad (C_{h,\lambda})$$

and of the density $g$ that it is differentiable on $S$ and satisfies

the gradient of the steepest vector in the tangent plane at $x$ to the surface represented by $y = g(x)$ does not vanish for $x \in S$ $\qquad (C1_g)$

and

$g$ has two Hölder-continuous derivatives, of all types, in $S$. $\qquad (C2_g)$

In respect of $(C_{h,\lambda})$, note that $h \asymp n^{-1/6}$ is the optimal size of bandwidth for estimating a density $g$ with two derivatives.

Conditions $(C1_g)$ and $(C2_g)$ imply that, for each $x \in S$, the contour line $D(x)$ that passes through $x$ may be represented locally as a quadratic, in the sense that there exist a real number $c_x$, and $\theta_x \in (-\pi/2, \pi/2]$, both uniquely determined, such that the distance from any given point $z$ on $D(x)$ to the nearest point on $C(x|\theta_x, c_x)$ converges to 0 at rate $o(r^2)$, uniformly in $z$ satisfying $\|z - x\| \le r$, as $r \to 0$.

From a sample $X_1, \ldots, X_n$ of independent and identically distributed random variables drawn from the distribution with density $g$, compute first the density estimator $\hat g$ given at (2.1), and then $(\hat\theta_x, \hat c_x)$ defined at (2.3). Our first result describes rates of convergence of the estimators $\hat\theta_x$ and $\hat c_x$ to $\theta_x$ and $c_x$, respectively. Immediately below the theorem we discuss its analogue when contours are estimated using local linear methods.

Given $\varepsilon > 0$ let $S_\varepsilon \subseteq S$ equal the set of all points $x \in S$ such that the closed disc of radius $\varepsilon$, centred at $x$, is contained in $S$. Let $\langle\theta_1 - \theta_2\rangle$ denote the distance between arbitrary real numbers $\theta_1$ and $\theta_2$, modulo $\pi$.

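Theorem 3.1. Assume conditions $(C_K)$, $(C_{h,\lambda})$, $(C1_g)$ and $(C2_g)$. Constrain $c$ to satisfy $|c| \le C/(\lambda h)$, where $C > 0$ is fixed, when choosing $(\theta, c)$ to minimise $S(x|\theta, c, \lambda)$, defined at (2.2). Then for each $\varepsilon > 0$, and with probability 1,

$$(\log n)^{1/2} \sup_{x \in S_\varepsilon} \left(\langle\hat\theta_x - \theta_x\rangle + \lambda h\,|\hat c_x - c_x|\right) \to 0. \qquad (3.1)$$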

The theorem holds with only minor modifications if we use local linear, rather than local quadratic, approximations to contour lines. Indeed, consistent estimation of $c_x$ is not required for our method to produce asymptotic improvements on the conventional estimator $\hat g$. If we take $S(x|\theta, \lambda)$ to be the "linear" analogue of $S(x|\theta, c, \lambda)$ defined in Section 2.2, and $\hat\theta_x$ to be its minimiser; and if we assume $(C_K)$, $(C_{h,\lambda})$, $(C1_g)$ and $(C2_g)$; then (3.1) continues to hold in the sense that with probability 1,

$$(\log n)^{1/2} \sup_{x \in S_\varepsilon} \langle\hat\theta_x - \theta_x\rangle \to 0. \qquad (3.2)$$

In practical terms, the assumption "$\lambda^2 h/(\log n)^{5/4} \to \infty$" in $(C_{h,\lambda})$ asks that the square of the radius, $\lambda h$, of the parabola fragment $C(x|\theta, c, \lambda)$ be of larger order than the bandwidth, $h$.

3.2. Density estimation

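In this section we show that any sufficiently accurate empirical, quadratic approximation $C(x|\tilde\theta_x, \tilde c_x)$ to $C(x|\theta_x, c_x)$ leads to an estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda)$ that is a uniformly good approximation to $\check g_{\rm cont}(x|\lambda)$.

Let $\tilde\theta_x, \tilde c_x$ denote general estimators of $\theta_x, c_x$ respectively. Write $\lambda_0$ for a new version of $\lambda$, which for the sake of simplicity we shall keep fixed. Our next result describes properties of the estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$. The version of (3.1) for $\tilde\theta_x$ and $\tilde c_x$, and fixed $\lambda$, is: with probability 1,

$$(\log n)^{1/2} \sup_{x \in S_\varepsilon} \left(\langle\tilde\theta_x - \theta_x\rangle + h\,|\tilde c_x - c_x|\right) \to 0. \qquad (3.3)$$

Recall the definition of

$$\check g_{\rm cont}(x|\lambda) = |D(x|\lambda)|^{-1} \int_{D(x|\lambda)} \hat g(z)\, ds.$$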

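Theorem 3.2. Assume conditions $(C_K)$, $(C2_g)$ and (3.3). Suppose too that $h \asymp n^{-1/6}$ and $\lambda_0 > 0$ is fixed. Then with probability 1,

$$\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0) = \check g_{\rm cont}(x|\lambda_0) + o_p(h^2) \qquad (3.4)$$

uniformly in $x \in S_\varepsilon$, for each $\varepsilon > 0$.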

The estimators $\hat\theta_x, \hat c_x$ described in Theorem 3.1 are examples of $\tilde\theta_x, \tilde c_x$, and then (3.1) immediately implies (3.3). However, taking $\hat\theta_x$ to be a local linear estimator is also adequate; there we should take $\hat c_x = 0$, and (3.3) follows from (3.2).

We should stress that in Theorem 3.2 the value $\lambda_0$ of $\lambda$ is taken fixed, while in Theorem 3.1 it diverges slowly with $n$. The latter requirement is symptomatic of the degree of oversmoothing that is necessary when estimating quantities that are linked to density derivatives, such as the tangent angle $\theta_x$ or the curvature $c_x$, rather than the density itself.

3.3. Performance advantages

To appreciate the variance reduction properties of the estimator $\check g_{\rm cont}$ (and hence of $\tilde g$), relative to its standard kernel counterpart $\hat g$, let $L(v)$ denote the integral average of $K(v + z)$ over $z \in \mathcal{L}$, where $\mathcal{L}$ is any line segment of length $2\lambda_0$, and put

$\check g_{\rm cont}(x|\lambda_0)$ are asymptotic to $(nh^2)^{-1} g(x)\,\kappa_M$ as $n \to \infty$, where $M = K$ and $L$ in the respective cases. Moreover, $\kappa_L < \kappa_K$, and so our method reduces variance; and also, $\kappa_L/\kappa_K \sim C\lambda_0^{-1}$, for a constant $C > 0$, as $\lambda_0 \to \infty$. The latter result shows that as the fixed value of $\lambda_0$ becomes larger, the extent of variance reduction increases in proportion to $\lambda_0$. (Note that $\kappa_L$ does not depend on the particular choice of $\mathcal{L}$.)
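The inequality $\kappa_L < \kappa_K$ can be checked numerically. The sketch below assumes that $\kappa_M$ denotes $\int M(v)^2\, dv$, the usual constant multiplying $(nh^2)^{-1} g(x)$ in the variance of a kernel density estimator (the defining display is not legible in the text above). It uses a biweight kernel scaled by $3/\pi$ to integrate to one, a horizontal segment, and midpoint-rule quadrature; the function names and grid sizes are choices of this example.

```python
import math


def biweight(a, b):
    """Spherical biweight kernel, scaled to integrate to one over the plane."""
    r2 = a * a + b * b
    return (3.0 / math.pi) * (1.0 - r2) ** 2 if r2 < 1.0 else 0.0


def line_averaged(kernel, lam0, m=40):
    """L(v): average of kernel(v + z) over z on a horizontal segment of length 2 * lam0."""
    ds = 2.0 * lam0 / m
    offsets = [-lam0 + (k + 0.5) * ds for k in range(m)]

    def averaged(a, b):
        return sum(kernel(a + s, b) for s in offsets) / m

    return averaged


def roughness(func, half_width, half_height, n=80):
    """Midpoint-rule approximation to the integral of func(v)^2 over a rectangle."""
    da, db = 2.0 * half_width / n, 2.0 * half_height / n
    total = 0.0
    for i in range(n):
        a = -half_width + (i + 0.5) * da
        for j in range(n):
            b = -half_height + (j + 0.5) * db
            total += func(a, b) ** 2
    return total * da * db


kappa_K = roughness(biweight, 1.0, 1.0)
kappa_L1 = roughness(line_averaged(biweight, 1.0), 2.0, 1.0)   # lambda_0 = 1
kappa_L2 = roughness(line_averaged(biweight, 2.0), 3.0, 1.0)   # lambda_0 = 2
```

For this kernel $\kappa_K$ has the closed form $1.8/\pi$, which gives a check on the quadrature; the two line-averaged values should decrease as $\lambda_0$ grows.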

The asymptotic bias of $\check g_{\rm cont}(x|\lambda_0)$ is readily seen to be identical to that of $\hat g(x)$, and in fact the expected value of either estimator equals $g(x) + \frac12 h^2 \kappa_2 \nabla^2 g(x) + o(h^2)$, where $\kappa_2 = \int (v^{(1)})^2 K(v)\, dv$, $v^{(1)}$ denotes the first component of the vector $v$, and $\nabla^2 g$ equals the Laplacian. Combining this result with that in the previous paragraph, and with (3.4), we see that the empirical contour estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ has the same asymptotic bias as the conventional kernel estimator $\hat g(x)$, but has reduced asymptotic variance.

Therefore the estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ has smaller minimum asymptotic mean squared error (AMSE) than $\hat g(x)$. In particular, if $h = Hn^{-1/6}$ then the AMSE equals $n^{-2/3} A_L(H)$, where

$$A_L(H) = \tfrac14 H^4 \kappa_2^2 \{\nabla^2 g(x)\}^2 + H^{-2} g(x)\,\kappa_L.$$

Through the fact that $\kappa_L < \kappa_K$ this is always (unless $g(x) = 0$) strictly less than the AMSE of $\hat g(x)$; in the obvious notation the AMSE of $\hat g(x)$ equals $n^{-2/3} A_K(H)$. Likewise the asymptotic mean integrated squared error of $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$, computed for example over $x \in S_\varepsilon$, is less than that for $\hat g(x)$.
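To make the comparison concrete, the constant $A_M(H) = \frac14 H^4 \kappa_2^2 \{\nabla^2 g(x)\}^2 + H^{-2} g(x)\kappa_M$ can be minimised in closed form: setting the derivative in $H$ to zero gives $H^6 = 2 g(x)\kappa_M / [\kappa_2^2 \{\nabla^2 g(x)\}^2]$. The sketch below uses hypothetical function names and purely illustrative input values (in particular, the value taken for $\kappa_L$ is a placeholder, not a result from the paper).

```python
import math


def amse_constant(H, kappa2, kappa_M, g, lap_g):
    """A_M(H) = (1/4) H^4 kappa2^2 lap_g^2 + g kappa_M / H^2."""
    return 0.25 * H ** 4 * kappa2 ** 2 * lap_g ** 2 + g * kappa_M / H ** 2


def optimal_H(kappa2, kappa_M, g, lap_g):
    """Minimiser of A_M: solves H^3 kappa2^2 lap_g^2 = 2 g kappa_M / H^3."""
    return (2.0 * g * kappa_M / (kappa2 ** 2 * lap_g ** 2)) ** (1.0 / 6.0)


# Illustrative inputs only (assumed for this example).
kappa2 = 0.125                 # second moment of the unit-mass biweight kernel
kappa_K = 1.8 / math.pi        # integral of K^2 for that kernel
kappa_L = 0.5 * kappa_K        # placeholder for a line-averaged kernel
g_x, lap_x = 0.0646, -0.0127   # stand-ins for g(x) and its Laplacian

H_K = optimal_H(kappa2, kappa_K, g_x, lap_x)
H_L = optimal_H(kappa2, kappa_L, g_x, lap_x)
best_K = amse_constant(H_K, kappa2, kappa_K, g_x, lap_x)
best_L = amse_constant(H_L, kappa2, kappa_L, g_x, lap_x)
ratio = best_L / best_K
```

Substituting the optimal $H$ back shows that the minimised constant is proportional to $\kappa_M^{2/3}$, so `ratio` equals $(\kappa_L/\kappa_K)^{2/3}$ whatever values are used for $g(x)$, $\nabla^2 g(x)$ and $\kappa_2$.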

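The estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ is also asymptotically normally distributed, in the sense that

$$\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0) = g(x) + \tfrac12 h^2 \kappa_2 \nabla^2 g(x) + (nh^2)^{-1/2} \{g(x)\,\kappa_L\}^{1/2} N_n + o_p(n^{-1/3}), \qquad (3.5)$$

where $N_n$ is asymptotically distributed as normal $N(0, 1)$.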

To rigorously establish the claims made above, note that $\check g_{\rm cont}(x|\lambda_0)$ may be written in a form similar to that at (2.1):

$$\check g_{\rm cont}(x|\lambda_0) = (nh^2)^{-1} \sum_{i=1}^{n} K_{{\rm cont},x}\left(\frac{x - X_i}{h}\right), \qquad (3.6)$$

where

$$K_{{\rm cont},x}(v) \equiv |D_0|^{-1} \int_{D_0} K(v + z)\, ds, \qquad (3.7)$$

with $D_0$ denoting the image of the contour line segment $D(x)$ after rescaling by $h^{-1}$ in each coordinate and translating $x$ to the origin. As $h \to 0$ the kernel $K_{{\rm cont},x}$ converges to $L$, if the line segment $\mathcal{L}$ is chosen to have its centre at the origin and to be parallel to the contour tangent at $x$. (This does not affect the value of $\kappa_L$;

asymptotic to $(nh^2)^{-1} g(x)\,\kappa_M$ follows from standard arguments for the variance of a kernel density estimator; see for example [22, pp. 19–23]. The result $\kappa_L < \kappa_K$ follows from the Cauchy–Schwarz inequality; equality cannot arise unless $\lambda_0 = 0$. It may be shown too that

$$\lambda_0 \kappa_L/\kappa_K \to C \equiv \int_{-\infty}^{\infty} ds \int K(v)\, K\{v + (s, 0)^{\rm T}\}\, dv,$$

whence it follows that $\kappa_L \sim C\kappa_K/\lambda_0$. Result (3.5) follows from (3.4), (3.6) and the bias properties of $\check g_{\rm cont}(x|\lambda_0)$ noted in the paragraph containing (3.5). Asymptotic normality of the variable $N_n$ in (3.5) is an immediate consequence of the fact that $\check g_{\rm cont}(x|\lambda_0)$ is a sum of $n$ independent and identically distributed random variables.

3.4. High-order generalisations, and optimality

In Section 3.2 we simplified our theory by considering only the case where $h$ is of the size appropriate for optimal construction of $\hat g$, and $\lambda_0$ is fixed. In the present section we discuss improvements in the overall convergence rate that are available using other choices of bandwidth, and taking $\lambda_0$ to diverge with $n$. Our first result is a version of Theorems 3.1 and 3.2 in this setting.

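Theorem 3.3. Assume $(C_K)$, $h = c_1 n^{-2/11}$ and $\lambda_0 = c_2 n^{1/11}$ where $c_1, c_2 > 0$ are fixed, and that

$g$ has two Hölder-continuous derivatives, where the Hölder coefficient exceeds $\frac23$. $\qquad (3.8)$

Then estimators $\tilde\theta_x$ and $\tilde c_x$ of $\theta_x$ and $c_x$, respectively, can be constructed such that with probability 1,

$$h^{-1} (\log n)^{1/2} \sup_{x \in S_\varepsilon} \left(\langle\tilde\theta_x - \theta_x\rangle + \lambda_0 h\,|\tilde c_x - c_x|\right) \to 0. \qquad (3.9)$$

Furthermore, if $(\tilde\theta_x, \tilde c_x)$ satisfies (3.9), then (3.4) continues to hold with probability 1, uniformly in $x \in S_\varepsilon$ for each $\varepsilon > 0$.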

If the Hölder coefficient mentioned in (3.8) equals $1 - \xi_1$, where $\xi_1 \in (0, \frac13)$; if, when constructing $\hat g$ for use with the local quadratic contour estimation method outlined in Section 2.1, we take $h = n^{-(3 - \xi_2)/22}$ where $0 < \xi_2 < 9\xi_1/(2 + 3\xi_1)$; and if we take $(\tilde\theta_x, \tilde c_x)$ to be $(\hat\theta_x, \hat c_x)$, defined in Section 2.1; then (3.9) holds. (An outline proof will be given in Section 5.3.) Thus, as noted in the last paragraph of Section 3.2, it is necessary to use a larger order of bandwidth when estimating quantities such as $\theta_x$ and $c_x$, which depend on derivatives of $g$, than when estimating $g$ itself.

The estimator $\check g_{\rm cont}$ in (3.4) again admits representation (3.6), with kernel $K_{{\rm cont},x}$

deviation $O\{(nh^2\lambda_0)^{-1/2}\}$ and bias $O(h^2)$. Both are of order $n^{-4/11}$, and so $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0) = g(x) + O_p(n^{-4/11})$. This represents an improvement by an order of magnitude on the rate of convergence, $O_p(n^{-1/3})$, of the estimator discussed in Theorem 3.2. Faster rates of convergence, up to $O_p(n^{-(1/2)+\xi})$ for any given $\xi > 0$, can be obtained for sufficiently smooth densities by using more accurate contour estimators.
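The rate arithmetic above is easy to verify with exact fractions. In the sketch below (the function name is illustrative), $h = n^{a}$ and $\lambda_0 = n^{b}$, so that the bias $O(h^2)$ has exponent $2a$ and the standard deviation $O\{(nh^2\lambda_0)^{-1/2}\}$ has exponent $-(1 + 2a + b)/2$.

```python
from fractions import Fraction


def rate_exponents(h_exp, lam_exp):
    """Exponents of n in the bias and standard deviation.

    Assumes h = n^{h_exp} and lambda_0 = n^{lam_exp}; bias is O(h^2) and the
    standard deviation is O((n h^2 lambda_0)^{-1/2}).
    """
    bias = 2 * h_exp
    sd = -(1 + 2 * h_exp + lam_exp) / 2
    return bias, sd


# Setting of Theorem 3.3: h ~ n^{-2/11}, lambda_0 ~ n^{1/11}.
bias_33, sd_33 = rate_exponents(Fraction(-2, 11), Fraction(1, 11))

# Setting of Theorem 3.2: h ~ n^{-1/6}, lambda_0 fixed.
bias_32, sd_32 = rate_exponents(Fraction(-1, 6), Fraction(0))
```

Both exponents equal $-4/11$ in the first case and $-1/3$ in the second, so the two error terms are balanced in each setting.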

As is well known (see e.g. [21]), the optimal rate of convergence of estimators of bivariate densities with two bounded derivatives equals $O(n^{-1/3})$. The results discussed in Section 3.3 might seem to contradict this result, since they signal the possibility of achieving a convergence rate of $o(n^{-1/3})$ by choosing $n^{1/6}h$ to decrease to zero sufficiently slowly, and $\lambda_0$ to diverge to infinity sufficiently slowly, as $n \to \infty$. However, there is in fact no violation, since we need a little more than just two bounded derivatives, specifically the Hölder continuity assumption in condition $(C2_g)$, in order to achieve the rate. Likewise, the assumption in (3.8) that the Hölder coefficient exceeds (rather than equals) $\frac23$ is slightly more than necessary for optimal performance under minimal conditions. In each case, however, a by-product of the additional assumption is the uniform strong approximation of the empirical contour-based estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ by its generalised kernel form $\check g_{\rm cont}(x|\lambda_0)$, as evidenced by (3.4).

3.5. The case of edge effects

In Section 2.3 we showed that, in the context of density estimation, there is a natural version of $\tilde g$ that addresses edge effects. As a prelude to stating the results of Section 3.2 for estimators of this type, redefine the contour segment $D_0$ at (3.7) by taking it to be that subset of the original $D_0$ which is as large as possible subject to the support of $\int_{D_0} K\{(\cdot + z)/h\}\, ds$ not protruding outside the support set $R$ (introduced in Section 2.3). With this modification, continue to define $K_{{\rm cont},x}$ by (3.7) and $\check g_{\rm cont}$ by (3.6).

Take $R$ to be a compact set whose boundary has two Hölder-continuous derivatives and is such that at no point on the boundary is the tangent to the boundary equal to the corresponding contour line; and assume the conditions of Theorems 3.1 and 3.2 on $R$ rather than $S$. Then (3.1) and (3.4) hold uniformly in $x \in R$ (rather than $x \in S_\varepsilon$). Moreover, an argument identical to that in Section 3.2 shows that the asymptotic variance of $\check g_{\rm cont}(x|\lambda_0)$, and hence of $\tilde g(x|\lambda_0)$, decreases to 0 at rate $\lambda_0^{-1}$ as $\lambda_0 \to \infty$.

3.6. Nonparametric regression

A model for nonparametric regression is that where data pairs $(X_i, Y_i)$ are generated by the formula $Y_i = g(X_i) + e_i$, the errors $e_i$ having zero mean. Theory in this case is similar to that for density estimation, although regularity conditions are required on the design variables $X_i$ and the error distribution. For the latter it is sufficient to suppose that the $e_i$'s are independent and identically distributed with all moments finite and zero mean. In this case, terms in $\log n$ in (3.1) and (3.3) should be replaced by terms in $n^{\delta}$, for $\delta > 0$ fixed but arbitrarily small. In this vein, the assumption "$\lambda^2 h/(\log n)^{5/4} \to \infty$" in $(C_{h,\lambda})$ should be replaced by "$\lambda^2 h/n^{\delta} \to \infty$ for some $\delta > 0$". Let $(C'_{h,\lambda})$ denote the corresponding version of $(C_{h,\lambda})$.

Of the design variables it is adequate to suppose that they are independent and identically distributed with density $f$, which is bounded away from 0 on $S$ and has two Hölder-continuous derivatives there. With this assumption, $(C_K)$, $(C'_{h,\lambda})$, $(C1_g)$ and $(C2_g)$; using $\lambda$ when estimating $(\hat\theta_x, \hat c_x)$, and employing a fixed $\lambda_0$ when constructing $\tilde g$; and taking the basic estimator $\hat g$ to be of either Nadaraya–Watson or local-linear type; results described in Sections 3.2 and 3.3 hold in the case of nonparametric regression.

4. Numerical examples

Three estimation methods, local quadratic approximation to contour lines (giving $\tilde g_B(x|\lambda)$), local linear approximation (giving $\tilde g_L(x|\lambda)$), and the standard kernel estimator $\hat g(x)$, were used to estimate the probability density functions of two distributions. We generated 200 random samples of size $n = 500$ from each. For each sample, integrated squared error (ISE) values for the three estimators were approximated by numerical integration. Values of MISE were approximated by averaging 200 of the respective ISE values. The spherical biweight kernel, $K(z) = \frac{15}{8\pi}(1 - \|z\|^2)^2_+$, was employed throughout.
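The ISE and MISE approximations described above might be computed along the following lines. This is a sketch only: the paper does not state which quadrature rule or integration region was used, so the midpoint rule on a square, the grid size, and the function names are all assumptions of this example.

```python
def ise(estimate, truth, lo, hi, n=100):
    """Midpoint-rule approximation to the integral of (estimate - truth)^2 over [lo, hi]^2.

    `estimate` and `truth` are callables taking a point (a, b).
    """
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        a = lo + (i + 0.5) * step
        for j in range(n):
            b = lo + (j + 0.5) * step
            total += (estimate((a, b)) - truth((a, b))) ** 2
    return total * step * step


def mise(ise_values):
    """Monte Carlo approximation to MISE: the average of per-sample ISE values."""
    return sum(ise_values) / len(ise_values)


# Sanity check with a constant offset of 0.1 on the square [-1, 1]^2.
offset_ise = ise(lambda z: 0.1, lambda z: 0.0, -1.0, 1.0)
```

With a constant offset of 0.1 over a square of area 4 the integral is 0.04, which gives a quick check that the quadrature weights are right.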

Our first example is the unimodal bivariate normal $N(0, I)$ distribution. We took the bandwidth to equal 0.8. To construct $\tilde g_B(x|\lambda)$ and $\tilde g_L(x|\lambda)$, $\lambda$ in (2.2) was taken as $\min\{0.1 + \frac23 d(x), 1.1\}$, where $d(x)$ was the distance from $x$ to the location of the mode of $\hat g$. Three-quarters of this value was used for $\lambda$ in (2.4). See the second-last paragraph of Section 2.1 and the last paragraph of Section 3.2. Notice that "radii" of contour lines of the density surface degenerate near the mode, and that linearly increasing the value of $\lambda$ ensures appropriate approximation of the contour lines.
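The rule for choosing $\lambda$ in this example is simple enough to state as code. The function names below are illustrative; the constants are those quoted in the text.

```python
def lam_contour(d):
    """Lambda used at (2.2) for contour fitting: min(0.1 + (2/3) d, 1.1).

    `d` is the distance from x to the location of the mode of g-hat.
    """
    return min(0.1 + (2.0 / 3.0) * d, 1.1)


def lam_average(d):
    """Lambda used at (2.4) for averaging: three-quarters of the fitting value."""
    return 0.75 * lam_contour(d)


at_mode = lam_contour(0.0)      # smallest value, used at the mode itself
far_away = lam_contour(3.0)     # capped value, used far from the mode
```

The cap of 1.1 is reached once $d(x)$ is at least 1.5, so beyond that distance both values of $\lambda$ stay constant.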

Among the 200 random samples, the three samples that give median ISE values for the three estimators are plotted in Fig. 3, which also shows the corresponding values of $\hat g(x)$, $\tilde g_L(x|\lambda)$ and $\tilde g_B(x|\lambda)$. In multivariate cases, a density surface estimate often fluctuates significantly because of data sparseness. The averaging step of our contour approximation methods remedies this problem. This effect is clearly demonstrated by the middle and bottom rows of Fig. 3. There, for each of the three samples, the surfaces corresponding to $\tilde g_L(x|\lambda)$ and $\tilde g_B(x|\lambda)$ have less wiggly contour lines than $\hat g$ at places away from the mode.

Table 1 gives ISE values for the nine estimates. Table 2 provides average ISE values, these being approximations to MISEs, for the three estimators. These results demonstrate clear gains of $\tilde g_L(x|\lambda)$ over $\hat g$.

Fig. 3. Unimodal density estimates. The top row depicts three samples of size 500 drawn from the bivariate normal $N(0, I)$ distribution. Middle and bottom rows show contour lines of the true density surface (dashed lines), $\hat g$ (dotted lines) and contour approximation estimators, for the three respective samples. The middle row compares the local linear contour approximation method (solid lines) with $\hat g$. The bottom row compares the local quadratic contour approximation method (solid lines) with $\hat g$.

Table 1

ISE values of the density estimates shown in Figs. 3 and 4

Estimator                 Sample 1    Sample 2    Sample 3

Unimodal normal
$\hat g(x)$               0.001550    0.001660    0.001602
$\tilde g_L(x \mid l)$    0.001398    0.001332    0.001450
$\tilde g_B(x \mid l)$    0.001437    0.001635    0.001520

Bimodal normal mixture
$\hat g(x)$               0.003616    0.003717    0.003610
$\tilde g_L(x \mid l)$    0.003311    0.003261    0.003310
$\tilde g_B(x \mid l)$    0.003548    0.003661    0.003553


Our next example illustrates performance of our estimators in a more complex, bimodal setting, a mixture of two bivariate normal distributions:

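$$0.7\,N\!\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right) + 0.3\,N\!\left(\begin{pmatrix} 1.5 \\ 1 \end{pmatrix}, \begin{pmatrix} 0.26 & 0.1 \\ 0.1 & 0.26 \end{pmatrix}\right). \tag{4.1}$$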

Bandwidth was $h = 0.6$. To construct $\tilde g(x \mid l)$ we took $l$ in (2.2) to be $\min\{0.1 + \tfrac{2}{3} d(x),\, 1.1\}$, and three-quarters of its value to be $l$ in (2.4), where $d(x)$ was the distance from $x$ to the location of the mode of $\hat g$ nearest to $x$. This prevents our using too-large values of $l$ at places between the modes, where the contour lines are curved, and hence helps preserve the bimodal feature of the density surface estimate. (In practice, there may not be prior information about the number of modes of the true distribution. In this case one can make a judgment from plots of preliminary estimates, such as $\hat g$.) For this example our approach again reduces fluctuations in the density surface estimates caused by stochastic variability, particularly in regions away from either of the modes; see the panels in the middle and bottom rows of Fig. 4. The ISE and average ISE values are given in Tables 1 and 2.
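The rule for choosing $l$ in the bimodal example is simple enough to state as code. The sketch below is a direct transcription of the formula quoted above; the function name `span_parameters` and the mode locations passed to it are hypothetical, since in practice the modes would be located from the preliminary estimate $\hat g$.

```python
import numpy as np

def span_parameters(x, modes):
    """Return (l for (2.2), l for (2.4)) at x, given the mode locations of g-hat."""
    x = np.asarray(x, dtype=float)
    modes = np.atleast_2d(np.asarray(modes, dtype=float))
    d = np.min(np.linalg.norm(modes - x, axis=1))  # distance to the nearest mode
    l_22 = min(0.1 + (2.0 / 3.0) * d, 1.1)         # rule quoted in the text
    return l_22, 0.75 * l_22                       # three-quarters of it for (2.4)

# Hypothetical mode locations, used purely for illustration.
illustrative_modes = [[0.0, 0.0], [2.0, 0.0]]
at_mode = span_parameters([0.0, 0.0], illustrative_modes)
between_modes = span_parameters([1.0, 0.0], illustrative_modes)
far_away = span_parameters([0.0, -3.0], illustrative_modes)
```

The cap at 1.1 is reached once $d(x) \ge 1.5$, so points between two nearby modes always receive a smaller value of $l$ than points far out in the tails.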

In summary, our simulation results demonstrate advantages of the contour approximation methods: the density surface estimates are more regularly shaped and the MISE values are reduced, compared to the usual kernel density estimate. Notably, the local linear contour approximation estimator enjoys good numerical performance. The local quadratic approximation method performs less well; it involves fitting two parameters rather than one, and thus will outperform $\hat g$ in MISE terms only when sample size is relatively large.
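The size of the gains can be read off Table 2 with a little arithmetic. The following Python snippet uses only the tabulated average ISE values; the dictionary keys and the helper `relative_reduction` are names introduced here for illustration.

```python
# Average ISE values copied from Table 2.
average_ise = {
    "unimodal": {"g_hat": 0.001650, "g_tilde_L": 0.001413, "g_tilde_B": 0.001619},
    "bimodal": {"g_hat": 0.003724, "g_tilde_L": 0.003377, "g_tilde_B": 0.003678},
}

def relative_reduction(baseline, value):
    """Fractional reduction in average ISE relative to the baseline estimator."""
    return (baseline - value) / baseline

reductions = {
    setting: {
        name: relative_reduction(vals["g_hat"], vals[name])
        for name in ("g_tilde_L", "g_tilde_B")
    }
    for setting, vals in average_ise.items()
}

for setting, vals in reductions.items():
    for name, r in vals.items():
        print(f"{setting:9s} {name}: {100 * r:.1f}% lower average ISE")
```

On these figures the local linear method lowers average ISE by roughly 14% in the unimodal case and 9% in the bimodal case, against roughly 2% and 1% for the local quadratic method.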

Fig. 4. Bimodal density estimates. Same as Fig. 3, except that the samples are from the bivariate normal mixture distribution at (4.1).

Table 2

Average ISE values of $\hat g(x)$, $\tilde g_L(x \mid l)$ and $\tilde g_B(x \mid l)$ when applied to 200 random samples of size 500, drawn from the unimodal $N(0, I)$ or the bimodal normal mixture distribution at (4.1)

Estimator                 Unimodal normal    Bimodal normal mixture
$\hat g(x)$               0.001650           0.003724
$\tilde g_L(x \mid l)$    0.001413           0.003377
$\tilde g_B(x \mid l)$    0.001619           0.003678

5. Proofs

5.1. Proof of Theorem 3.1

Put $\gamma = E(\hat g)$, $\bar\gamma = E(\check g)$, $\Delta = \hat g - \gamma$ and $\bar\Delta = \check g - \bar\gamma$. Define $A_1(x \mid \theta, c, l)$, $A_2(x \mid \theta, c, l)$ and $A_3(x \mid \theta, c, l)$ to equal the integrals of $\{\gamma(z) - \bar\gamma(x \mid \theta, c)\}^2$, $\{\Delta(z) - \bar\Delta(x \mid \theta, c)\}^2$ and $\{\gamma(z) - \bar\gamma(x \mid \theta, c)\}\{\Delta(z) - \bar\Delta(x \mid \theta, c)\}$, respectively, over $z \in \mathcal{C}(x \mid \theta, c, l)$. Then,

$$A_2(x \mid \theta, c, l) = \int_{\mathcal{C}(x \mid \theta, c, l)} \Delta(z)^2 \, ds - \xi(c, l)\, \bar\Delta(x \mid \theta, c)^2, \tag{5.1}$$

$$A_3(x \mid \theta, c, l) = \int_{\mathcal{C}(x \mid \theta, c, l)} \gamma(z) \Delta(z) \, ds - \xi(c, l)\, \bar\gamma(x \mid \theta, c)\, \bar\Delta(x \mid \theta, c), \tag{5.2}$$

$$\xi S = A_1 + A_2 + 2 A_3. \tag{5.3}$$

Without loss of generality, $l \ge 1$ and $lh \le 1$. Let $\psi$ denote a differentiable function defined in the plane, write $|D\psi|(z)$ for the supremum of the absolute value of the directional derivative of $\psi$ (at $z$) over all directions, let $C > 0$, and put $\eta = \langle \theta - \theta_x \rangle$ and $\zeta = |c - c_x|$. There exists $C_1 > 0$ with the property that

$$\left| \left( \int_{\mathcal{C}(x \mid \theta, c, l)} - \int_{\mathcal{C}(x \mid \theta_x, c_x, l)} \right) \psi(z) \, ds \right| \le C_1 (\eta + lh\zeta)(lh)^2 \sup_{z :\, \|z - x\| \le lh} \{ |D\psi|(z) + |\psi(z)| \} \tag{5.4}$$

uniformly in $(\theta, c)$ such that $|\theta|, |\theta_x| \le \pi$, $|c|, |c_x| \le C/(lh)$ and $\eta, lh\zeta \le C$. (Below we shall refer to this uniform sense as "uniform*". At (5.4) and below the constants $C_1, \ldots, C_4$ depend only on $C$.) To derive (5.4), note that the distance between a given point on $\mathcal{C}(x \mid \theta, c)$ and its counterpart on $\mathcal{C}(x \mid \theta_x, c_x, l)$, to which the former may be rotated about $x$, is dominated by a constant multiple of $(\eta + lh\zeta)\, lh$. Therefore, the difference of function values at the two points is dominated by a constant multiple of $(\eta + lh\zeta)\, lh$ times $\sup |D\psi|(z)$. To obtain the bound at (5.4) this should be multiplied by a constant times the lengths of the curves, i.e. by a constant times $lh$. There is an additional contribution to the right-hand side, coming from the difference between 1 and the Jacobian of the transformation, based on a rotation, which takes $\mathcal{C}(x \mid \theta, c)$ to $\mathcal{C}(x \mid \theta_x, c_x, l)$, but it too is dominated by a constant multiple of the right-hand side of (5.4).

The quantity $\xi(c, l)$, being the length of $\mathcal{C}(x \mid \theta, c, l)$, is asymptotic to $2lh$ uniformly in $|c| \le C/(lh)$, and $|\xi(c, l) - \xi(c_x, l)| \le C_2 (lh)^3 \zeta$ uniformly in $|c| \le C$ such that $\zeta \le C/(lh)$. Therefore,

$$|\xi(c, l)^{-1} - \xi(c_x, l)^{-1}| \le C_3\, lh\, \zeta, \tag{5.5}$$

in the same uniform sense. Combining (5.1) with the results in this paragraph, and defining $B_j(x \mid \theta, c, l) = \xi(c, l)^{-1} A_j(x \mid \theta, c, l)$, we conclude that in the uniform* sense,

$$|B_2(x \mid \theta, c, l) - B_2(x \mid \theta_x, c_x, l)| \le C_4 (\eta + lh\zeta)\, lh \sup_{z :\, \|z - x\| \le lh} \{ |\Delta(z)|\, |D\Delta|(z) + \Delta(z)^2 \}. \tag{5.6}$$

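Given $j = 0, 1$, let $A_{4j}(x \mid \theta, c, l)$ denote the integral of $\gamma(z)^j \Delta(z)$ over $z \in \mathcal{C}(x \mid \theta, c, l)$. An argument similar to that leading to (5.6) implies that in the uniform* sense,

$$|\bar\gamma(\theta, c) - \bar\gamma(\theta_x, c_x)| \le C_5 (\eta + lh\zeta)\, lh, \tag{5.7}$$

where the constants $C_5, C_6, C_7$ here and below depend only on $C$, $g$ and $K$. From (5.2), (5.7) and the properties of $\xi(c, l)$ discussed in the previous paragraph, we deduce that in the uniform* sense,

$$|B_3(x \mid \theta, c, l) - B_3(x \mid \theta_x, c_x, l)| \le C_6 \Big[ (lh)^{-1} \max_{j = 1, 2} |A_{4j}(x \mid \theta, c, l) - A_{4j}(x \mid \theta_x, c_x, l)| + (\eta + lh\zeta)\, lh \max\{ |A_{40}(x \mid \theta, c, l)|, |A_{40}(x \mid \theta_x, c_x, l)| \} \Big]. \tag{5.8}$$

Combining (5.3), (5.6) and (5.8) we conclude that in the uniform* sense,

$$\big| S(x \mid \theta, c, l) - S(x \mid \theta_x, c_x, l) - \{ B_1(x \mid \theta, c, l) - B_1(x \mid \theta_x, c_x, l) \} \big| \le C_7 \Big( (lh)^{-1} \max_{j = 1, 2} |A_{4j}(x \mid \theta, c, l) - A_{4j}(x \mid \theta_x, c_x, l)| + (\eta + lh\zeta)\, lh \Big[ \sup_{z :\, \|z - x\| \le lh} \{ |\Delta(z)|\, |D\Delta|(z) + \Delta(z)^2 \} + \max\{ |A_{40}(x \mid \theta, c, l)|, |A_{40}(x \mid \theta_x, c_x, l)| \} \Big] \Big). \tag{5.9}$$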

The quantities $T_1 \equiv \Delta(z)$, $T_2 \equiv A_{40}(x \mid \theta, c, l)$ and $T_3 \equiv A_{4j}(x \mid \theta, c, l) - A_{4j}(x \mid \theta_x, c_x, l)$ all have zero mean, and have variances equal to $O(\sigma_1^2)$, $O(\sigma_2^2)$ and $O(\sigma_3^2)$, respectively, where $\sigma_1^2 = h^4$, $\sigma_2^2 = l^2 h^6$ and $\sigma_3^2 = (\eta + lh\zeta)(lh^2)^2$. Also, $T_4 \equiv |D\Delta|(z)$ has mean square equal to $O(\sigma_4^2)$, where $\sigma_4^2 = h^2$. For example, to obtain the order of the variance of $T_3$, note that the area between the curves $\mathcal{C}(x \mid \theta, c, l)$ and $\mathcal{C}(x \mid \theta_x, c_x, l)$ equals $O(a)$, where $a = (\eta + lh\zeta)(lh)^2$. The variance of $nh^2 T_3$ is essentially the variance of a Poisson variable with mean $O(na)$, and so the variance of $T_3$ equals $O\{(nh^2)^{-2} na\}$, which, since $h \asymp n^{-1/6}$, equals $O(ah^2) = O(\sigma_3^2)$.
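The order calculation for $T_3$ can be written out step by step. The display below only restates the argument in the text, using the fact that $h \asymp n^{-1/6}$ is equivalent to $n^{-1} \asymp h^6$:

```latex
\operatorname{var}(T_3)
  = O\{(nh^2)^{-2}\, n a\}
  = O(a\, n^{-1} h^{-4}),
\qquad
n^{-1} \asymp h^{6}
\;\Longrightarrow\;
a\, n^{-1} h^{-4} \asymp a h^{2}
  = (\eta + lh\zeta)(lh)^2 h^2
  = (\eta + lh\zeta)(lh^2)^2
  = \sigma_3^2 .
```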

Using Bennett's inequality we may prove that, provided

$$n^{1 - \varepsilon} h^2 \sigma_i \to \infty \quad \text{for some } \varepsilon > 0 \text{ and } i = 1, \ldots, 4, \tag{5.10}$$

the probability that $U_1 \equiv |T_1 T_4|$, $U_2 \equiv |T_2|$ or $U_3 \equiv |T_3|$ exceeds $u_1 \equiv C_8 \sigma_1 \sigma_4 \log n$, $u_2 \equiv C_8 \sigma_2 (\log n)^{1/2}$ or $u_3 \equiv C_8 \sigma_3 (\log n)^{1/2}$, respectively, equals $O(n^{-C_9})$ in each case, where $C_9$ may be made arbitrarily large by choosing $C_8$ sufficiently large; and these probabilities are of the stated orders uniformly in $x, z \in \mathcal{S}_\varepsilon$, and in $c, c_x, \theta, \theta_x$ complying with the "uniform*" sense. From this result, using standard methods of approximation (see below), we may deduce that with probability 1 the right-hand side of (5.9), denoted below by RHS, satisfies

$$\mathrm{RHS} = O(\delta_n), \quad \text{where } \delta_n = (\eta + lh\zeta)^{1/2} h (\log n)^{1/2} + (\eta + lh\zeta)\, l^2 h^4 (\log n)^{1/2}, \tag{5.11}$$

the former identity holding uniformly in $x \in \mathcal{S}_\varepsilon$ and in $c, c_x, \theta, \theta_x$ complying with the "uniform*" sense. (Below we shall refer to this alternative uniform sense as "uniform†".)

The "standard methods of approximation" alluded to above may be summarised as follows. Since $\mathcal{S}$ is bounded then, for any $c > 0$, a square lattice with edge width $n^{-c}$ has only $O(n^{2c})$ of its vertices in $\mathcal{S}$. Since the derivatives of $K$ are Hölder continuous then we may choose $c$ so large that the difference between the value of $U_j$ at a general point $u$ (say) within $\mathcal{S}_\varepsilon$, and the value of $U_j$ at the point of the lattice (within $\mathcal{S}$) that is nearest to $u$, equals $O(n^{-1})$ uniformly in $u$ and in $j = 1, 2, 3$, with probability 1. Call this result (R1). By choosing $C_8$ (introduced in the previous paragraph) so large that we may take $C_9 \ge 2c + 2$, and applying the Borel–Cantelli lemma, we may show that the supremum of $U_j(u)$, over all $u$ in the lattice, equals $O(t_j)$ for each $j$, with probability 1. Call this result (R2). Since $n^{-1} = O(t_j)$ then, combining (R1) and (R2), we have shown that the supremum of $U_j(u)$, over all $u \in \mathcal{S}_\varepsilon$, equals $O(t_j)$ for each $j$. This implies (5.11).

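Define $\bar g(x \mid \theta, c, l)$ to equal the integral of $\xi(c, l)^{-1} g(z)$ over $z \in \mathcal{C}(x \mid \theta, c, l)$. Given two bounded functions $a$ and $b$ defined in the plane, and a smooth, rectifiable, planar curve $\mathcal{C}$ of finite length $|\mathcal{C}|$, put

$$\|a - b\|_{\mathcal{C}} = \left[ |\mathcal{C}|^{-1} \int_{\mathcal{C}} \{ a(z) - b(z) \}^2 \, ds \right]^{1/2}.$$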
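The norm just defined is a root mean square difference along the curve, and it is easy to approximate numerically. The following Python sketch does so for a curve supplied as a polyline; the function name `curve_norm`, the polyline representation and the midpoint rule are assumptions of this illustration.

```python
import numpy as np

def curve_norm(a, b, curve):
    """Approximate ||a - b||_C for a polyline `curve` given as an (m, 2) array."""
    seg = np.diff(curve, axis=0)
    ds = np.linalg.norm(seg, axis=1)       # arc-length elements
    mid = 0.5 * (curve[:-1] + curve[1:])   # midpoint rule along the curve
    sq = (a(mid) - b(mid)) ** 2
    return float(np.sqrt(np.sum(sq * ds) / np.sum(ds)))

# Check on the segment from (-1, 0) to (1, 0) with a(z) = first coordinate, b = 0,
# where the exact value is {(1/2) * integral of t^2 over [-1, 1]}^(1/2) = 3^(-1/2).
t = np.linspace(-1.0, 1.0, 2001)
segment = np.column_stack([t, np.zeros_like(t)])
value = curve_norm(lambda z: z[:, 0], lambda z: np.zeros(len(z)), segment)
```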

The conditions assumed of $g$ imply that $\gamma = g + O(h^2)$, whence it follows that

$$B_1(x \mid \theta, c, l)^{1/2} = \|g - \bar g(x \mid \theta, c, l)\|_{\mathcal{C}(x \mid \theta, c, l)} + O(h^2), \tag{5.12}$$

in the uniform† sense. Moreover, writing $b_n = b_n(x)$ for a sequence of positive functions satisfying $b_n(x) \asymp 1$ uniformly in $x \in \mathcal{S}_\varepsilon$, we claim that

$$\|g - \bar g(x \mid \theta, c, l)\|_{\mathcal{C}(x \mid \theta, c, l)} = b_n^{1/2} (\eta + lh\zeta)\, lh \tag{5.13}$$

in the uniform† sense.

To derive (5.13), note that each point on the curve segment $\mathcal{C}(x \mid \theta, c, l)$ (the length of which is asymptotic to $2lh$) is distant $O\{(\eta + lh\zeta)\, lh\}$ from the nearest point on the true contour line that passes through $x$. Moreover, along a portion of the curve segment, the portion having length equal to at least a constant multiple of $lh$ for all sufficiently large $n$, the nearest distance is at least a constant multiple of $(\eta + lh\zeta)\, lh$. Let $\bar g_{\mathrm{cont}}(x \mid l)$ denote the average of $g(z)$ for $z$ in the contour segment $\mathcal{D}(x \mid l)$. In view of (C1g) and the results just noted,

$$\|g - \bar g(x \mid \theta, c, l)\|_{\mathcal{C}(x \mid \theta, c, l)} - \|g - \bar g_{\mathrm{cont}}(x \mid l)\|_{\mathcal{D}(x \mid l)} = b_n^{1/2} (\eta + lh\zeta)\, lh, \tag{5.14}$$

where $b_n$ has the properties claimed of the quantity at (5.13). A similar argument shows that, since $g$ has two Hölder-continuous derivatives in $\mathcal{S}$,

$$\|g - \bar g(x \mid \theta_x, c_x, l)\|_{\mathcal{C}(x \mid \theta_x, c_x, l)} - \|g - \bar g_{\mathrm{cont}}(x \mid l)\|_{\mathcal{D}(x \mid l)} = O\{(lh)^{2 + \tau}\}, \tag{5.15}$$

where $\tau > 0$ depends on the Hölder exponent. But by definition of the contour line $\mathcal{D}(x)$, $\bar g_{\mathrm{cont}}(x \mid l) = g(x)$ and $g(z) = g(x)$ for all $z \in \mathcal{D}(x)$, and so (5.14) and (5.15) are respectively identical to (5.13) and

$$\|g - \bar g(x \mid \theta_x, c_x, l)\|_{\mathcal{C}(x \mid \theta_x, c_x, l)} = O\{(lh)^{2 + \tau}\}. \tag{5.16}$$

Combining (5.12) and (5.13) we deduce that

$$B_1(x \mid \theta, c, l) = b_n \{(\eta + lh\zeta)\, lh\}^2 + O\{(\eta + lh\zeta)\, lh^3 + h^4\}. \tag{5.17}$$

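Likewise, (5.16) and the version of (5.12) for $(\theta, c) = (\theta_x, c_x)$ give

$$B_1(x \mid \theta_x, c_x, l) = O\{(lh)^{4 + 2\tau} + h^4\}. \tag{5.18}$$

Combining (5.9), (5.11), (5.17) and (5.18), and noting that the quantity $(\eta + lh\zeta)\, lh^3$ at (5.17) is of smaller order than the term $(\eta + lh\zeta)\, lh^3 (\log n)^{1/2}$ at (5.11), we see that with probability 1, uniformly in $x \in \mathcal{S}_\varepsilon$ for each $\varepsilon > 0$,

$$S(x \mid \theta, c, l) - S(x \mid \theta_x, c_x, l) = b_n \{(\eta + lh\zeta)\, lh\}^2 + O\{\delta_n + (lh)^{4 + 2\tau} + h^4\}. \tag{5.19}$$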

Therefore, in the operation of minimising $S(x \mid \theta, c, l)$ over $\theta$ and $c$, $\varepsilon_n \equiv \eta + lh\zeta$ can be made as small as a sufficiently large constant multiple of $\varepsilon'_n \equiv (lh)^{-1} \{\delta_n^{1/2} + (lh)^{2 + \tau} + h^2\}$. Now, the relation $\varepsilon_n \asymp \varepsilon'_n$ is equivalent to

$$\varepsilon_n \asymp (l^2 h)^{-2/3} (\log n)^{1/3} + (lh)^{1 + \tau}. \tag{5.20}$$

Note too that the property $(l^2 h)^{-2/3} (\log n)^{1/3} + (lh)^{1 + \tau} = O(\varepsilon_n)$ implies (5.10). Therefore, with probability 1,

$$\langle \hat\theta_x - \theta_x \rangle + lh\, |\hat c_x - c_x| = O\{(l^2 h)^{-2/3} (\log n)^{1/3} + (lh)^{1 + \tau}\}.$$

The theorem follows directly from this result.

5.2. Proof of Theorem 3.2

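In a slight abuse of notation, write $\check g$, $\bar\gamma$, $\mathcal{C}$ and $\xi$ for $\check g(x \mid \tilde\theta_x, \tilde c_x, l_0)$, $\bar\gamma(x \mid \tilde\theta_x, \tilde c_x, l_0)$, $\mathcal{C}(x \mid \tilde\theta_x, \tilde c_x, l_0)$ and $\xi(\tilde c_x, l_0)$, and let $\check g_0$, $\bar\gamma_0$, $\mathcal{C}_0$ and $\xi_0$ denote the respective versions of those quantities when $(\tilde\theta_x, \tilde c_x)$ is replaced by $(\theta_x, c_x)$. In a slight change of notation from the previous proof, put $\eta = \eta(x) = \langle \tilde\theta_x - \theta_x \rangle$ and $\zeta = \zeta(x) = |\tilde c_x - c_x|$.

Standard methods of strong approximation, similar to those used to derive (5.11), may be used to show that under the conditions of the theorem, $|\hat g(z) - \gamma(z)| = O\{h^2 (\log n)^{1/2}\}$ and $|D(\hat g - \gamma)|(z) = O\{h (\log n)^{1/2}\}$ uniformly in $z \in \mathcal{S}_\varepsilon$, for each $\varepsilon > 0$, with probability 1. Using this result, (5.4), (5.5) and the representations

$$\check g - \bar\gamma = \xi^{-1} \int_{\mathcal{C}} (\hat g - \gamma), \qquad \check g_0 - \bar\gamma_0 = \xi_0^{-1} \int_{\mathcal{C}_0} (\hat g - \gamma),$$

we may prove that with probability 1,

$$\big| \check g(x \mid \tilde\theta_x, \tilde c_x, l_0) - \bar\gamma(x \mid \tilde\theta_x, \tilde c_x, l_0) - \{ \check g(x \mid \theta_x, c_x, l_0) - \bar\gamma(x \mid \theta_x, c_x, l_0) \} \big| = O\{(\eta + h\zeta)\, h^2 (\log n)^{1/2}\}. \tag{5.21}$$

Similarly, using the fact that $|D(\gamma - g)|(z) = O(h)$ uniformly in $z \in \mathcal{S}_\varepsilon$, and applying (5.4), (5.5) and the relation

$$\bar\gamma(x \mid \tilde\theta_x, \tilde c_x, l_0) - \bar\gamma(x \mid \theta_x, c_x, l_0) = \xi^{-1} \int_{\mathcal{C}} (\gamma - g) - \xi_0^{-1} \int_{\mathcal{C}_0} (\gamma - g),$$

we may prove that

$$|\bar\gamma(x \mid \tilde\theta_x, \tilde c_x, l_0) - \bar\gamma(x \mid \theta_x, c_x, l_0)| = O\{(\eta + h\zeta)\, h^2\}. \tag{5.22}$$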

Likewise, recalling that $\check g_{\mathrm{cont}}(x \mid l_0)$ is the average value of $\hat g$ along the contour segment $\mathcal{D}(x \mid l_0)$, and noting that, in view of the Hölder continuity of second derivatives of $g$, $\mathcal{D}(x \mid l_0)$ and the parabola segment $\mathcal{C}(x \mid \theta_x, c_x, l_0)$ are uniformly distant $h^2 n^{-\varepsilon}$ apart, for some $\varepsilon > 0$, we may show that with probability 1,

$$|\check g(x \mid \theta_x, c_x, l_0) - \check g_{\mathrm{cont}}(x \mid l_0)| = o(h^2). \tag{5.23}$$

Combining (5.21)–(5.23) we deduce that with probability 1, and uniformly in $x \in \mathcal{S}_\varepsilon$,

$$\check g(x \mid \tilde\theta_x, \tilde c_x, l_0) - \check g_{\mathrm{cont}}(x \mid l_0) = O\{(\eta + h\zeta)\, h^2 (\log n)^{1/2}\} + o(h^2). \tag{5.24}$$

The theorem follows from this property and (3.4).

5.3. Proof of Theorem 3.3

Here we show that (CK), (3.8) and (3.9) are sufficient for (3.4) when $h = c_1 n^{-2/11}$ and $l_0 = c_2 n^{1/11}$, and that estimators $(\tilde\theta_x, \tilde c_x)$ satisfying (3.9) are readily constructed when (3.8) holds.

The arguments leading to (5.21) and (5.22) apply as before, although the terms $\zeta h$ and $h^2$ on the right-hand sides of those formulae should be replaced by $\zeta l_0 h$ and $(l_0 h)^2$, respectively. Therefore, in view of (3.9), for the present choices of $h$ and $l_0$, the right-hand sides of (5.21) and (5.22) equal $o(h^2)$ with probability 1, uniformly in $x \in \mathcal{S}_\varepsilon$.

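For some $\xi > 0$,

$$|E \hat g(y_1) - g(y_1) - \{E \hat g(y_2) - g(y_2)\}| = O(h^2 |y_1 - y_2|^{\xi}),$$

$$|\hat g(y_1) - E \hat g(y_1) - \{\hat g(y_2) - E \hat g(y_2)\}| = O\{(nh^2 l_0)^{-1/2} (\|y_1 - y_2\| / h)^{\xi} (\log n)^{1/2}\}$$

uniformly in points $y_1 \in \mathcal{D}(x \mid l_0)$ and $y_2 \in \mathcal{C}(x \mid \theta_x, c_x, l_0)$ that are both distant $s$ from $x$ and are on the same side of $x$, and uniformly in $x \in \mathcal{S}_\varepsilon$. (In the case of the second identity the result holds with probability 1.) For some $\eta > 0$, $\|y_2 - y_1\| = O\{(l_0 h)^{2 + 2\eta}\} = O(h^{1 + \eta})$, uniformly in pairs $(y_1, y_2)$. Therefore, with probability 1 the difference between the integral averages of $\hat g - g$ over $\mathcal{D}(x \mid l_0)$ and $\mathcal{C}(x \mid \theta_x, c_x, l_0)$ equals $o(h^2)$, uniformly in $x \in \mathcal{S}_\varepsilon$.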

Given $y_1$ and $y_2$ as before,

$$g(y_2) = g(y_1) + \sum_{i=1}^{2} (y_2 - y_1)^{(i)} g_i(y_1) + O(\|y_2 - y_1\|^2),$$

where $g_1$ and $g_2$ represent first partial derivatives, and bracketed superscripts denote vector components. Recall that $\|y_2 - y_1\| = o(h)$, uniformly in $(y_1, y_2)$, and observe that the integral average of $(y_2 - y_1)^{(i)} g_i(y_1)$ over $y_2 \in \mathcal{C}(x \mid \theta_x, c_x, l_0)$ is bounded by a constant multiple of the integral average of $\|y_2 - y_1\|^2$, and hence equals $o(h^2)$. From this property and the fact that $g(y_1) = g(x)$ for each $y_1$ we deduce that the integral average of $g(y_2)$ over $\mathcal{C}(x \mid \theta_x, c_x, l_0)$ equals the integral average of $g(y_1)$ over $\mathcal{D}(x \mid l_0)$, plus a term equal to $o(h^2)$.

Combining these results we see that the difference between the integral averages of $\hat g$ over $\mathcal{D}(x \mid l_0)$ and $\mathcal{C}(x \mid \theta_x, c_x, l_0)$ equals $o(h^2)$, uniformly in $x \in \mathcal{S}_\varepsilon$. This is the analogue of (5.23) in the present setting. Combining this property and the versions of (5.21) and (5.22) we obtain the following version of (5.24): $\check g(x \mid \tilde\theta_x, \tilde c_x, l_0) - \check g_{\mathrm{cont}}(x \mid l_0) = o(h^2)$ uniformly in $x \in \mathcal{S}_\varepsilon$, with probability 1. This is equivalent to (3.4).

Next we show that, if (3.8) holds, estimators $\tilde\theta_x$ and $\tilde c_x$ can be constructed such that (3.9) is true. Note that, in view of the present choice of $h$ and $l_0$, (3.9) is equivalent to

$$n^{2/11} (\log n)^{1/2} \sup_{x \in \mathcal{S}_\varepsilon} \langle \tilde\theta_x - \theta_x \rangle \to 0, \qquad n^{1/11} (\log n)^{1/2} \sup_{x \in \mathcal{S}_\varepsilon} |\tilde c_x - c_x| \to 0 \tag{5.25}$$

with probability 1. Now, (3.8) implies that, simply by forming the respective derivatives of $\hat g$, one may estimate first and second derivatives of $g$ with respective rates $n^{-(5/22) - \eta}$ and $n^{-(1/11) - \eta}$, for uniform convergence in $\mathcal{S}_\varepsilon$ with probability 1. Therefore we may estimate contour tangent angle and contour curvature with the same respective rates. (In fact we may achieve this end by fitting a local quadratic to contours, as suggested in Section 2.1.) Result (5.25) follows from this property.

If, when using the local quadratic contour estimation method outlined in Section 2.1, we choose the bandwidth for $\hat g$ to be $h = n^{-(3 - \xi_2)/22}$ where $0 < \xi_2 < 9\xi_1 / (2 + 3\xi_1)$ and $\xi_1 \in (0, \tfrac{1}{3})$ is the Hölder coefficient mentioned in (3.8), then for some $\eta > 0$ the convergence rates $n^{-(5/22) - \eta}$ and $n^{-(1/11) - \eta}$ (for, respectively, first and second derivatives of $g$) mentioned in the previous paragraph are obtained. It follows that the local quadratic contour estimators also enjoy these rates.

Acknowledgments

Helpful comments of an editor and two reviewers have helped improve the paper.

References

[1] M.-Y. Cheng, J. Fan, J.S. Marron, On automatic boundary corrections, Ann. Statist. 25 (1997) 1691–1708.

[1a] A. Cowling, P. Hall, On pseudodata methods for removing boundary effects in kernel density estimation, J. Roy. Statist. Soc. Ser. B 58 (1996) 551–563.

[2] R.A. Djojosugito, P.L. Speckman, Boundary bias correction in nonparametric density estimation, Comm. Statist. Theory Methods 21 (1992) 69–88.

[3] R.L. Eubank, P. Speckman, A bias reduction theorem with applications in nonparametric regression, Scand. J. Statist. 18 (1991) 211–222.

[4] J. Fan, Local linear regression smoothers and their minimax efﬁciencies, Ann. Statist. 21 (1993) 196–216.

(23)

[6] T. Gasser, H.G. Mu¨ller, Kernel estimation of regression functions, in: T. Gasser, M. Rosenblatt (Eds.), Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, Vol. 757, Springer, New York, 1979, pp. 23–68.

[7] T. Gasser, H.G. Mu¨ller, V. Mammitzsch, Kernels for nonparametric curve estimation, J. Roy. Statist. Soc. Ser. B 47 (1985) 238–252.

[8] B.L. Granovsky, H.G. Mu¨ller, Optimizing kernel methods: a unifying variational principle, Internat. Statist. Rev. 59 (1991) 373–388.

[9] P. Hall, C. Huber, A. Owen, A. Coventry, Asymptotically optimal balloon density estimates, J. Multivariate Anal. 51 (1994) 352–371.

[10] P. Hall, T.E. Wehrly, A geometrical method for removing edge effects from kernel-type nonparametric regression estimators, J. Amer. Statist. Assoc. 86 (1991) 665–672.

[11] T. Hastie, C. Loader, Local regression: automatic kernel carpentry, Statist. Sci. 8 (1993) 120–143. [12] M.C. Jones, Simple boundary correction for kernel density estimation, Statist. Comput. 3 (1993)

135–146.

[13] H.G. Mu¨ller, Smooth optimal kernel estimators near endpoints, Biometrika 78 (1991) 521–530. [14] H.G. Mu¨ller, U. Stadtmu¨ller, Multivariate boundary kernels and a continuous least squares principle,

J. Roy. Statist. Soc. Ser. B 61 (1999) 439–458.

[15] J. Rice, Boundary modiﬁcation for kernel estimators, Commun. Statist. Theory Methods 13 (1984) 893–900.

[16] D. Ruppert, M.P. Wand, Multivariate locally weighted least squares regression, Ann. Statist. 22 (1994) 1346–1370.

[17] E.F. Schuster, Incorporating support constraints into nonparametric estimators of densities, Commun. Statist. Theory Methods 14 (1985) 1123–1136.

[18] D.W. Scott, Multivariate Density Estimation—Theory, Practice and Visualization, Wiley, New York, 1992.

[19] J.G. Staniswalis, K. Messer, D.R. Finston, Kernel estimators for multivariate regression, J. Nonparamet. Statist. 3 (1993) 103–121.

[20] J.G. Staniswalis, K. Messer, Addendum to Staniswalis, Messer and Finston (1993), J. Nonparamet. Statist. 7 (1996) 67–68.

[21] C.J. Stone, Optimal rates of convergence for nonparametric estimators, Ann. Statist. 8 (1980) 1348–1360.