Journal of Multivariate Analysis 86 (2003) 375–397
Reducing variance in nonparametric surface estimation
Ming-Yen Cheng a,b and Peter Hall a,*
a Centre for Mathematics and its Applications, Australian National University, Canberra, ACT 0200, Australia
b Department of Mathematics, National Taiwan University, Taipei 106, Taiwan
Received 12 February 2001
Abstract
We suggest a method for reducing variance in nonparametric surface estimation. The technique is applicable to a wide range of inferential problems, including both density estimation and regression, and to a wide variety of estimator types. It is based on estimating the contours of a surface by minimising deviations of elementary surface estimates along a quadratic curve. Once a contour estimate has been obtained, the final surface estimate is computed by averaging conventional surface estimates along a portion of the contour. Theoretical and numerical properties of the technique are discussed.
© 2003 Elsevier Science (USA). All rights reserved.
AMS 1991 subject classifications: primary 62G07; secondary 62H12
Keywords: Bandwidth; Boundary effect; Kernel method; Nonparametric density estimation; Nonparametric regression; Variance reduction
1. Introduction
We suggest a variance reduction method for nonparametric surface estimators, based on approximating the projection of a contour into the design plane at the point $x$ where we wish to construct the estimate. The contour estimator is then used as an axis along which a continuum of conventional surface estimates is averaged in order to achieve a final estimate at $x$. Since our technique does not alter asymptotic bias, the reduction in variance that it offers leads directly to a reduction in asymptotic mean squared error.
*Corresponding author.
E-mail address: halpstat@pretty.anu.edu.au (P. Hall).
0047-259X/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved. doi:10.1016/S0047-259X(02)00032-5
This method has several novel features. Firstly, it exploits the extra degree of freedom that is available in the problem of surface estimation. Secondly, it provides a new technique for estimating gradients and curvatures of contour lines, without passing explicitly to derivatives of surface estimates. Thirdly, when applied to a surface estimate that is always positive, in either density estimation or regression, our method produces a boundary-corrected estimate that is always positive. Our approach to estimating contours involves choosing either a line segment or a quadratic along which a conventional surface estimator is least variable, in the neighbourhood of the point x at which we wish to estimate the surface.
The technique is applicable to nonparametric methods in both density estimation and regression. Indeed, it is not tied to a particular estimator type in either of these settings; for example, in nonparametric regression it can be used in conjunction with spline, local linear or Nadaraya–Watson methods. In the case of density estimation, when a conventional kernel estimator is used as its basis, the technique can be viewed as a device for re-computing kernel shape.
As implied two paragraphs above, the technique also has potential application for overcoming edge effects. Modified boundary kernel methods have been proposed for addressing this problem (see e.g. [14,19,20]), but like their univariate counterparts they can produce negative estimates at boundaries. Local polynomial and local parametric methods are more successful in this regard, although the increase in variance of such techniques near the boundary means that good asymptotic performance is often not visible unless sample size is particularly large. Scott ([18], pp. 82–85) gives a particularly illuminating discussion of issues such as these.
Multivariate generalisations are of course possible. However, since the multivariate analogue of a contour is not so familiar, not as readily depicted, and not as easy to compute as in the bivariate case, high-dimensional generalisations do not offer as convenient a vehicle for illustrating the potential of the method. If the distribution is $d$-variate then the contour corresponding to "height" $H$ is the set of points $y$ such that $g(y) = H$, and is a region with $d - 1$ degrees of freedom.
Our variance reduction method is related to the so-called balloon kernel techniques for density estimation. See [9,18, p. 149ff]. There is an extensive literature on approaches for remedying boundary effects in density estimation and regression, mainly in univariate cases. It includes methods based on special "boundary kernels", for example those considered by Gasser and Müller [6], Gasser et al. [7], Granovsky and Müller [8] and Müller [13]. Rice [15] suggested a dual-bandwidth approach. So-called "reflection methods" include those of [1a,10,17]. The projection method of Djojosugito and Speckman [2] is in the same spirit. Eubank and Speckman [3] proposed a method that involves combining a conventional curve estimator with a substantially undersmoothed estimator. Cheng, Fan and Marron [1] suggested methods that have optimal asymptotic performance at boundaries. The natural boundary-respecting properties of local polynomial methods have been discussed by Fan [4], Hastie and Loader [11], Ruppert and Wand [16] and Fan and Gijbels [5], for example. See also [12].
Section 2 will introduce our method and discuss, in an heuristic and nontechnical way, its variance-reduction properties. Theoretical results, underpinning the informal arguments in Section 2, will be given in Section 3, and rigorous technical details will be outlined in Section 5. Section 4 will summarise a simulation study that complements the theory.
2. Methodology
2.1. The method
Let g denote a univariate function of a 2-vector; for example, g might be the density of a bivariate distribution, or the mean in a regression problem where the explanatory variable is bivariate and the response variable is scalar. We wish to estimate g nonparametrically, making only smoothness assumptions and exploiting the extra degree of freedom that is available in the context of surface estimation, relative to the conventional case where the argument of g is univariate.
To this end we first construct an elementary nonparametric estimator $\hat g$ of $g$. For example, when $g$ is a probability density we might take
$$\hat g(x) = (nh^2)^{-1} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right), \tag{2.1}$$
where $K$ is a radially symmetric bivariate kernel, $h$ is a bandwidth, and $X_1, \ldots, X_n$ are independent and identically distributed random variables with density $g$.
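As a minimal sketch of (2.1), assuming a biweight kernel normalised by $3/\pi$ so that it integrates to one over the plane, the elementary estimator might be coded as follows. The function names `biweight` and `kde`, the seed and the simulated sample are illustrative choices, not part of the method.

```python
import math
import random

def biweight(u1, u2):
    """Radially symmetric biweight kernel supported on the unit disc.

    The constant 3/pi is an assumption of this sketch; it makes the
    kernel integrate to one over the plane.
    """
    r2 = u1 * u1 + u2 * u2
    return (3.0 / math.pi) * (1.0 - r2) ** 2 if r2 < 1.0 else 0.0

def kde(x, data, h, kernel=biweight):
    """Elementary estimator (2.1): (n h^2)^{-1} * sum_i K((x - X_i) / h)."""
    total = sum(kernel((x[0] - a) / h, (x[1] - b) / h) for a, b in data)
    return total / (len(data) * h * h)

# Illustrative data: 500 draws from the bivariate normal N(0, I).
random.seed(0)
sample = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(500)]
estimate_at_origin = kde((0.0, 0.0), sample, h=0.8)
```

The value `estimate_at_origin` can be compared with the true density $1/(2\pi)$ at the origin; any other compactly supported, radially symmetric kernel could be passed through the `kernel` argument.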
Next we describe construction of a local quadratic estimator of the level set, or contour, of $g$ in the neighbourhood of $x$; local linear estimators will be treated in Section 2.2. Let $C(x|\theta, c)$ denote the parabola passing through $x = (x^{(1)}, x^{(2)})$, with its vertex at $x$ and its tangent there in the direction of the unit vector $(\cos\theta, \sin\theta)$, and with curvature $2c$ at $x$. Thus, as a curve in the $(z^{(1)}, z^{(2)})$-plane, $C(x|\theta, c)$ has equation
$$(z^{(2)} - x^{(2)})\cos\theta - (z^{(1)} - x^{(1)})\sin\theta = c\,\{(z^{(2)} - x^{(2)})\sin\theta + (z^{(1)} - x^{(1)})\cos\theta\}^2.$$
We shall constrain $\theta$ and $c$ by $-\pi/2 < \theta \le \pi/2$ and $-\infty < c < \infty$, which ensures that each nondegenerate parabola in the plane is representable by $C(x|\theta, c)$ for a unique triple $(x, \theta, c)$.
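The parametrisation can be checked numerically. The sketch below, in which the helper names `parabola_point` and `on_parabola` are hypothetical, writes the parabola in coordinates rotated by $\theta$, where it reads $v = c u^2$ with $u$ measured along the tangent and $v$ along the normal, and verifies the implicit equation given above.

```python
import math

def parabola_point(x, theta, c, u):
    """Point on C(x | theta, c) at tangential offset u from the vertex x.

    In coordinates rotated by theta the curve is v = c * u**2, where u runs
    along the tangent (cos theta, sin theta) and v along the normal
    (-sin theta, cos theta).
    """
    v = c * u * u
    z1 = x[0] + u * math.cos(theta) - v * math.sin(theta)
    z2 = x[1] + u * math.sin(theta) + v * math.cos(theta)
    return (z1, z2)

def on_parabola(z, x, theta, c, tol=1e-9):
    """Check the implicit equation of C(x | theta, c) stated in the text."""
    d1, d2 = z[0] - x[0], z[1] - x[1]
    lhs = d2 * math.cos(theta) - d1 * math.sin(theta)
    rhs = c * (d2 * math.sin(theta) + d1 * math.cos(theta)) ** 2
    return abs(lhs - rhs) <= tol
```

For instance, `on_parabola(parabola_point((0.85, 1.04), 0.4, -0.7, 0.3), (0.85, 1.04), 0.4, -0.7)` should hold for any offset `u`.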
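Given $\lambda > 0$, let $C(x|\theta, c, \lambda)$ denote the set of points $z \in C(x|\theta, c)$ that satisfy $\|z - x\| \le \lambda h$, where $\|\cdot\|$ denotes standard Euclidean distance. Let $|C|$ denote the length of a finite segment $C$ of a rectifiable curve, and put
$$\xi(c, \lambda) = |C(x|\theta, c, \lambda)|, \qquad \check g(x|\theta, c, \lambda) = \xi(c, \lambda)^{-1} \int_{C(x|\theta, c, \lambda)} \hat g(z)\,ds,$$
$$S(x|\theta, c, \lambda) = \xi(c, \lambda)^{-1} \int_{C(x|\theta, c, \lambda)} \{\hat g(z) - \check g(x|\theta, c, \lambda)\}^2\,ds, \tag{2.2}$$
$$(\hat\theta_x, \hat c_x) = \arg\min_{(\theta, c)} S(x|\theta, c, \lambda), \tag{2.3}$$
where $ds$ is an infinitesimal element of arc length along $C = C(x|\theta, c, \lambda)$ at the point on $C$ with coordinates $z$. Panel (a) of Fig. 1 depicts an example of the contour estimator $C(x|\hat\theta_x, \hat c_x, \lambda)$. Our final estimator of $g(x)$ is
$$\tilde g(x|\lambda) = \check g(x|\hat\theta_x, \hat c_x, \lambda). \tag{2.4}$$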
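Steps (2.2)–(2.4) might be sketched as below. This is an illustration under stated assumptions rather than a prescribed implementation: the integrals are approximated by a midpoint rule in the tangential coordinate, the minimisation at (2.3) is a plain grid search (no optimiser is specified in the text), the same radius $\lambda h$ is used for fitting and averaging, and a smooth known surface stands in for $\hat g$ in the example. All function names are hypothetical.

```python
import math

def segment_nodes(x, theta, c, radius, m=41):
    """Quadrature nodes z and arc-length weights on C(x | theta, c, lambda)."""
    if abs(c) < 1e-12:
        u_max = radius
    else:  # solve u^2 + c^2 u^4 = radius^2 for the tangential half-width
        u_max = math.sqrt((math.sqrt(1 + 4 * c * c * radius * radius) - 1) / (2 * c * c))
    du = 2 * u_max / m
    nodes = []
    for k in range(m):
        u = -u_max + (k + 0.5) * du
        v = c * u * u
        z = (x[0] + u * math.cos(theta) - v * math.sin(theta),
             x[1] + u * math.sin(theta) + v * math.cos(theta))
        nodes.append((z, math.sqrt(1 + 4 * c * c * u * u) * du))  # ds = sqrt(1 + 4c^2u^2) du
    return nodes

def curve_mean_and_spread(g_hat, x, theta, c, radius):
    """Approximate the curve average and the criterion S of (2.2)."""
    nodes = segment_nodes(x, theta, c, radius)
    length = sum(w for _, w in nodes)
    vals = [g_hat(z) for z, _ in nodes]
    mean = sum(v * w for v, (_, w) in zip(vals, nodes)) / length
    spread = sum((v - mean) ** 2 * w for v, (_, w) in zip(vals, nodes)) / length
    return mean, spread

def contour_estimate(g_hat, x, radius, c_max, n_theta=36, n_c=9):
    """Grid search for (2.3); returns (theta_hat, c_hat, g_tilde) as in (2.4)."""
    best = None
    for i in range(n_theta):
        theta = -math.pi / 2 + math.pi * (i + 1) / n_theta  # theta in (-pi/2, pi/2]
        for j in range(n_c):
            c = -c_max + 2 * c_max * j / (n_c - 1)
            mean, spread = curve_mean_and_spread(g_hat, x, theta, c, radius)
            if best is None or spread < best[0]:
                best = (spread, theta, c, mean)
    return best[1], best[2], best[3]

# Check on the N(0, I) density itself: the contour through (1, 0) is the unit
# circle, with vertical tangent (theta = pi/2) and curvature 1 (so c = 1/2).
g_true = lambda z: math.exp(-0.5 * (z[0] ** 2 + z[1] ** 2)) / (2 * math.pi)
theta_hat, c_hat, g_tilde = contour_estimate(g_true, (1.0, 0.0), radius=0.5, c_max=1.0)
```

In practice `g_hat` would be an estimator such as (2.1), and a finer grid or a numerical optimiser could replace the coarse search.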
In practice, one would not necessarily use the same value of $\lambda$ when computing $(\hat\theta_x, \hat c_x)$ and when calculating $\tilde g$. That is, the $\lambda$'s at (2.2) and (2.4) would not necessarily be identical. We shall argue in Section 3 that a relatively large value of $\lambda$ (asymptotically, $\lambda \to \infty$) should be used to give accurate estimation of the "true" quadratic approximation $C(x|\theta_x, c_x)$ to the contour line at $x$. On the other hand, a relatively small value of $\lambda$ may be adequate for reducing variance and removing edge effects in the estimator $\tilde g$.
To give an intuitive explanation of this point, note that estimation of $\theta$ and $c$ is closely related to estimation of second derivatives of $g$, for which a larger bandwidth is needed than when simply estimating $g$ itself. This explains why $\lambda h$, which is effectively a bandwidth for computation of $\hat\theta_x$ and $\hat c_x$, should be relatively large. However, there is not the same pressing need for choosing $\lambda h$ large when estimating $g$ itself.
Fig. 1. Sausage-shaped kernel. In the context of density estimation, panel (a) depicts a portion of a point cloud, and the true contour line (solid line) that passes through $x = (0.85, 1.04)$ (cross sign), when data are from the bivariate normal $N(0, I)$ distribution and $n = 500$. The dotted line is the contour line estimate $C(x|\hat\theta_x, \hat c_x, \lambda)$, calculated at that point based on the spherical biweight kernel, $h = 0.8$, and $\lambda = 1.25$. Panel (b) shows a perspective plot of the corresponding "sausage-shaped" kernel $K_x$, defined at (2.5).
2.2. Choice of contour estimator
To appreciate why minimising $S(x|\theta, c, \lambda)$ produces a parabola that approximates the contour $D(x)$ of $g$ through $x$, note that we are in effect finding that choice of $(\theta, c)$ which renders $\hat g(z)$ least variable as we move $z$ along the curve $C(x|\theta, c)$. Indeed, if we were to replace $\hat g(z)$ by its true value, $g(z)$, when defining $\check g(x|\theta, c, \lambda)$ and $S(x|\theta, c, \lambda)$, then the curve $C(x|\hat\theta_x, \hat c_x)$ produced by minimising $S$ would, if not constrained to have a quadratic equation, be exactly $D(x)$. The curve $C(x|\hat\theta_x, \hat c_x)$ represents an empirical, quadratic approximation to this contour.
An alternative technique is to take $C$ to be a line segment, rather than a piece of a quadratic. The mechanics of implementing the approximation are virtually identical in this setting: we replace $C(x|\theta, c, \lambda)$ by $C_{\rm lin}(x|\theta, \lambda)$, denoting the line segment of length $2\lambda$ centred at $x$ and inclined at angle $\theta$; we replace $\xi(c, \lambda)$ at (2.2) by $2\lambda$, and call the resulting integral $S(x|\theta, \lambda)$ instead of $S(x|\theta, c, \lambda)$; and we choose $\theta = \hat\theta_x$ to minimise $S(x|\theta, \lambda)$. This approach is adequate for the results described in Sections 3.1–3.3, but for the higher-order analogues described in Section 3.4 a local quadratic method, or something similar such as fitting local ellipses, is required.
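The local linear variant reduces to a one-dimensional search over $\theta$. The following sketch, with hypothetical names `line_spread` and `best_angle` and a grid search assumed in place of any particular optimiser, checks the idea on a plane, whose contours are straight lines of known direction.

```python
import math

def line_spread(g_hat, x, theta, half_length, m=41):
    """Mean and spread of g_hat along the segment centred at x with angle theta."""
    du = 2.0 * half_length / m
    vals = []
    for k in range(m):
        u = -half_length + (k + 0.5) * du  # midpoint nodes, equal arc-length weights
        vals.append(g_hat((x[0] + u * math.cos(theta), x[1] + u * math.sin(theta))))
    mean = sum(vals) / m
    return mean, sum((v - mean) ** 2 for v in vals) / m

def best_angle(g_hat, x, half_length, n_theta=180):
    """Grid search for theta in (-pi/2, pi/2] minimising the linear criterion."""
    grid = [-math.pi / 2 + math.pi * (i + 1) / n_theta for i in range(n_theta)]
    return min(grid, key=lambda t: line_spread(g_hat, x, t, half_length)[1])

# Check on the plane g(z) = z1 + 2 z2, whose contours have direction (2, -1).
plane = lambda z: z[0] + 2.0 * z[1]
theta_hat = best_angle(plane, (0.3, -0.2), half_length=0.5)
```

Here `theta_hat` should lie within one grid step of `math.atan2(-1.0, 2.0)`.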
A very different approach to estimating contour lines is to construct an appropriately oversmoothed estimator of the function $g$, and compute its contours. Oversmoothing is necessary in order to obtain sufficiently accurate estimates of derivatives of the surface; these are used explicitly or implicitly in constructing an estimate of the contour. We argue, however, that such a method is in general not as attractive as that proposed here, owing to the relative difficulty of drawing contours from differential-geometric properties of a surface.
Nevertheless, oversmoothing $\hat g$ is beneficial when it is necessary to construct $\tilde g$ at a place where the tangent plane to the surface is virtually horizontal. Minimising the function $S(x|\theta, c, \lambda)$ with respect to $(\theta, c)$ relies on detecting off-contour differences in $g$ through variation of $g$; if the gradient of $g$ is low then so too will be the variation. In such cases we rely on higher-order derivatives to provide "leverage" for detecting the contour; hence the need for more dramatic smoothing.
2.3. Removing edge effects
Let $\mathcal{R}$ denote the support of the distribution of the points $X_i$ on which the estimator $\hat g$ is based. In the context of density estimation $\mathcal{R}$ would be the support of $g$, and in regression $\mathcal{R}$ would be the support of the density of $X_i$ in the regression problem $Y_i = g(X_i) + \text{error}$. The basic estimator $\hat g(x)$ potentially suffers from edge effects whenever the support of the function $k_x(z) \equiv K\{(x - z)/h\}$ protrudes outside $\mathcal{R}$. However, assuming $K$ is radially symmetric and vanishes outside a disc of unit radius, this problem is solved by the following trivial modification of the estimator suggested in Section 2.1: Re-define the parabola segment $C(x|\theta, c, \lambda)$ to be the largest connected subset of $C(x|\theta, c)$ inside the disc $\{z : \|z - x\| \le \lambda h\}$, subject to the set $\mathcal{S}\{\theta, c, C(x|\theta, c, \lambda)\}$ being wholly contained within $\mathcal{R}$, where
$$\mathcal{S}(\theta, c, \mathcal{T}) \equiv \{z_1 : \|z_1 - z_2\| \le h \text{ for some } z_2 \in \mathcal{T}\}.$$
Fig. 2 illustrates the removal of edge effects in this context. Theoretical results, and their derivations, in the presence of edge effects are entirely analogous to their counterparts in the absence of those effects.
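A discretised version of this truncation might look as follows. The sketch assumes a user-supplied predicate, here called `disc_inside_support`, that reports whether the disc of radius $h$ about a point lies inside $\mathcal{R}$; the half-plane support in the example and all function names are hypothetical. The "largest connected subset" is approximated by the longest run of consecutive admissible nodes.

```python
import math

def parabola_nodes(x, theta, c, radius, m=201):
    """Points of C(x | theta, c), equally spaced in u, with ||z - x|| <= radius."""
    if abs(c) < 1e-12:
        u_max = radius
    else:
        u_max = math.sqrt((math.sqrt(1.0 + 4.0 * c * c * radius * radius) - 1.0)
                          / (2.0 * c * c))
    nodes = []
    for k in range(m):
        u = -u_max + 2.0 * u_max * k / (m - 1)
        v = c * u * u
        nodes.append((x[0] + u * math.cos(theta) - v * math.sin(theta),
                      x[1] + u * math.sin(theta) + v * math.cos(theta)))
    return nodes

def longest_run(flags):
    """Index range (start, stop) of the longest run of True values."""
    best_start, best_stop, start = 0, 0, None
    for i, ok in enumerate(list(flags) + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_stop - best_start:
                best_start, best_stop = start, i
            start = None
    return best_start, best_stop

def truncated_segment(x, theta, c, radius, h, disc_inside_support):
    """Nodes of the longest connected piece whose h-neighbourhood stays in the support."""
    nodes = parabola_nodes(x, theta, c, radius)
    start, stop = longest_run([disc_inside_support(z, h) for z in nodes])
    return nodes[start:stop]

# Hypothetical support: the half-plane z1 >= 0, so the disc of radius h about z
# lies inside the support exactly when z1 >= h.
half_plane = lambda z, h: z[0] >= h
segment = truncated_segment((0.5, 0.0), 0.0, 0.0, radius=1.0, h=0.3,
                            disc_inside_support=half_plane)
```

Averaging $\hat g$ over the returned nodes then uses only points at which the kernel does not protrude beyond the boundary.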
2.4. Why the estimator $\tilde g$ has advantages
The advantages stem from the property, established in Section 3, that $\tilde g$ is a good approximation to the average of $\hat g$ over a portion of the true contour of the surface represented by $y = g(x)$. Specifically, let $D(x)$ denote the contour line that passes through $x$, and let $D(x|\lambda)$ equal the largest connected subset of $D(x)$ inside the disc $\{z : \|z - x\| \le \lambda h\}$, subject to $\mathcal{S}\{\theta, c, D(x|\lambda)\}$ being wholly contained within $\mathcal{R}$. Write $\check g_{\rm cont}(x|\lambda)$ for the integral average of $\hat g(z)$ over $z \in D(x|\lambda)$. Then, as we shall show in Section 3.2, the difference between $\tilde g(x|\lambda)$ and $\check g_{\rm cont}(x|\lambda)$ is of smaller order than the difference between the latter function and the true value of $g(x)$.
It is easy to see why $\check g_{\rm cont}(x|\lambda)$ is likely to perform better than the conventional estimator $\hat g(x)$. Indeed, the averaging that is explicit in the definition of $\check g_{\rm cont}$ will clearly tend to reduce variance, by an order of magnitude if $\lambda$ is allowed to diverge with $n$. And the bias of $\check g_{\rm cont}(x|\lambda)$ will equal the average value of the bias of $\hat g(z)$ over values of $z$ for which $g(z) = g(x)$. Replacing bias by an average value is generally not deleterious, and in fact the asymptotic bias of $\check g_{\rm cont}(x|\lambda)$ is identical to that of $\hat g(x)$.
Fig. 2. Removing edge effects. In the presence of edge effects the subset of $C(x|\theta, c)$ (dotted curve) that comprises $C(x|\theta, c, \lambda)$ (solid curve) is reduced, to ensure that the resulting region $\mathcal{S}\{\theta, c, C(x|\theta, c, \lambda)\}$, from which the estimator $\tilde g(\cdot|\lambda)$ is computed, lies wholly within the support $\mathcal{R}$ (right-hand side of the vertical line) of the design distribution. The point $x$ is marked by a cross. Panels (a) and (b) illustrate cases where the contour is convex and concave, respectively, with respect to the boundary.
2.5. Particular cases of $\tilde g$
In the case of density estimation the estimator $\tilde g$ may be thought of as having been computed using a kernel whose shape is symmetric about the parabolic axis represented by $C(x|\hat\theta_x, \hat c_x)$. If $\hat g$ is given by (2.1) then this kernel is $K_x$, say, defined by
$$K_x(v) \equiv |C(0|\hat\theta_x, \hat c_x h, \lambda/h)|^{-1} \int_{C(0|\hat\theta_x, \hat c_x h, \lambda/h)} K(z + v)\,ds. \tag{2.5}$$
In this notation the estimator $\tilde g$ has the standard form at (2.1):
$$\tilde g(x|\lambda) = (nh^2)^{-1} \sum_{i=1}^{n} K_x\!\left(\frac{x - X_i}{h}\right),$$
where the support of $K_x$ is sausage-shaped with its axis represented by the quadratic $C(0|\hat\theta_x, \hat c_x h)$.
Fig. 1 illustrates a typical local quadratic contour estimate, and the associated sausage-shaped kernel, in the case of nonparametric density estimation. There is an obvious analogue of the figure in the case of a local linear approximation to the contour.
In the context of kernel-based regression the estimator $\tilde g$ cannot be expressed simply as the result of replacing $K$ in the definition of $\hat g(x)$ by $K_x$. An approach like this is still feasible, but it would generally involve at least two kernels like $K_x$, one ($K_{x,1}$ say) designed for estimating contours of $fg$, where $f$ is the design density, and the other ($K_{x,2}$) designed to estimate contours of $f$. For example, in the case of local-constant or Nadaraya–Watson estimation of $g$ one would use $K_{x,1}$ and $K_{x,2}$ in the numerator and denominator, respectively, of the estimator. The computational complexity of such an approach makes it unattractive, however.
3. Theoretical properties
3.1. Contour approximation
Our aim in this section is to describe the accuracy with which the empirical contour line $C(x|\hat\theta_x, \hat c_x)$ estimates a nonrandom, quadratic approximation $C(x|\theta_x, c_x)$ to $D(x)$. For brevity we confine our detailed treatment to the case of nonparametric density estimation, noting in Section 3.6 the similarities to nonparametric regression. We deal initially only with situations where edge effects do not arise; Section 3.5 discusses how our results change in the presence of edge effects.
Let $\mathcal{S}$ denote a bounded, open set in the plane. We assume of the kernel that
$K$ is a compactly supported, radially symmetric probability … $(C_K)$
of $h$ and $\lambda$ that
$h \asymp n^{-1/6}$, $\lambda^2 h/(\log n)^{5/4} \to \infty$, and $\lambda h = O(n^{-\epsilon})$ for some $\epsilon > 0$, as $n \to \infty$, $(C_{h,\lambda})$
and of the density $g$ that it is differentiable on $\mathcal{S}$ and satisfies
the gradient of the steepest vector in the tangent plane at $x$ to the surface represented by $y = g(x)$ does not vanish for $x \in \mathcal{S}$ $(C_{1g})$
and
$g$ has two Hölder-continuous derivatives, of all types, in $\mathcal{S}$. $(C_{2g})$
In respect of $(C_{h,\lambda})$, note that $h \asymp n^{-1/6}$ is the optimal size of bandwidth for estimating a density $g$ with two derivatives.
Conditions $(C_{1g})$ and $(C_{2g})$ imply that, for each $x \in \mathcal{S}$, the contour line $D(x)$ that passes through $x$ may be represented locally as a quadratic, in the sense that there exist a real number $c_x$, and $\theta_x \in (-\pi/2, \pi/2]$, both uniquely determined, such that the distance from any given point $z$ on $D(x)$ to the nearest point on $C(x|\theta_x, c_x)$ converges to 0 at rate $o(r^2)$, uniformly in $z$ satisfying $\|z - x\| \le r$, as $r \to 0$.
From a sample $X_1, \ldots, X_n$ of independent and identically distributed random variables drawn from the distribution with density $g$, compute first the density estimator $\hat g$ given at (2.1), and then $(\hat\theta_x, \hat c_x)$ defined at (2.3). Our first result describes rates of convergence of the estimators $\hat\theta_x$ and $\hat c_x$ to $\theta_x$ and $c_x$, respectively. Immediately below the theorem we discuss its analogue when contours are estimated using local linear methods.
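Given $\epsilon > 0$ let $\mathcal{S}_\epsilon \subseteq \mathcal{S}$ equal the set of all points $x \in \mathcal{S}$ such that the closed disc of radius $\epsilon$, centred at $x$, is contained in $\mathcal{S}$. Let $\langle\theta_1 - \theta_2\rangle$ denote the distance between arbitrary real numbers $\theta_1$ and $\theta_2$, modulo $\pi$.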
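Theorem 3.1. Assume conditions $(C_K)$, $(C_{h,\lambda})$, $(C_{1g})$ and $(C_{2g})$. Constrain $c$ to satisfy $|c| \le C/(\lambda h)$, where $C > 0$ is fixed, when choosing $(\theta, c)$ to minimise $S(x|\theta, c, \lambda)$, defined at (2.2). Then for each $\epsilon > 0$, and with probability 1,
$$(\log n)^{1/2} \sup_{x \in \mathcal{S}_\epsilon} \left(\langle\hat\theta_x - \theta_x\rangle + \lambda h\,|\hat c_x - c_x|\right) \to 0. \tag{3.1}$$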
The theorem holds with only minor modifications if we use local linear, rather than local quadratic, approximations to contour lines. Indeed, consistent estimation of $c_x$ is not required for our method to produce asymptotic improvements on the conventional estimator $\hat g$. If we take $S(x|\theta, \lambda)$ to be the "linear" analogue of $S(x|\theta, c, \lambda)$ defined in Section 2.2, and $\hat\theta_x$ to be its minimiser; and if we assume $(C_K)$, $(C_{h,\lambda})$, $(C_{1g})$ and $(C_{2g})$; then (3.1) continues to hold in the sense that with probability 1,
$$(\log n)^{1/2} \sup_{x \in \mathcal{S}_\epsilon} \langle\hat\theta_x - \theta_x\rangle \to 0. \tag{3.2}$$
In practical terms, the assumption "$\lambda^2 h/(\log n)^{5/4} \to \infty$" in $(C_{h,\lambda})$ asks that the square of the radius, $\lambda h$, of the parabola fragment $C(x|\theta, c, \lambda)$ be of larger order than the bandwidth, $h$.
3.2. Density estimation
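In this section we show that any sufficiently accurate empirical, quadratic approximation $C(x|\tilde\theta_x, \tilde c_x)$ to $C(x|\theta_x, c_x)$ leads to an estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda)$ that is a uniformly good approximation to $\check g_{\rm cont}(x|\lambda)$.
Let $\tilde\theta_x$, $\tilde c_x$ denote general estimators of $\theta_x$, $c_x$ respectively. Write $\lambda_0$ for a new version of $\lambda$, which for the sake of simplicity we shall keep fixed. Our next result describes properties of the estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$. The version of (3.1) for $\tilde\theta_x$ and $\tilde c_x$, and fixed $\lambda$, is: with probability 1,
$$(\log n)^{1/2} \sup_{x \in \mathcal{S}_\epsilon} \left(\langle\tilde\theta_x - \theta_x\rangle + h\,|\tilde c_x - c_x|\right) \to 0. \tag{3.3}$$
Recall the definition
$$\check g_{\rm cont}(x|\lambda) = |D(x|\lambda)|^{-1} \int_{D(x|\lambda)} \hat g(z)\,ds.$$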
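Theorem 3.2. Assume conditions $(C_K)$, $(C_{2g})$ and (3.3). Suppose too that $h \asymp n^{-1/6}$ and $\lambda_0 > 0$ is fixed. Then with probability 1,
$$\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0) = \check g_{\rm cont}(x|\lambda_0) + o_p(h^2) \tag{3.4}$$
uniformly in $x \in \mathcal{S}_\epsilon$, for each $\epsilon > 0$.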
The estimators $\hat\theta_x$, $\hat c_x$ described in Theorem 3.1 are examples of $\tilde\theta_x$, $\tilde c_x$, and then (3.1) immediately implies (3.3). However, taking $\hat\theta_x$ to be a local linear estimator is also adequate; there we should take $\hat c_x = 0$, and (3.3) follows from (3.2).
We should stress that in Theorem 3.2 the value $\lambda_0$ of $\lambda$ is taken fixed, while in Theorem 3.1 it diverges slowly with $n$. The latter requirement is symptomatic of the degree of oversmoothing that is necessary when estimating quantities that are linked to density derivatives, such as the tangent angle $\theta_x$ or the curvature $c_x$, rather than the density itself.
3.3. Performance advantages
To appreciate the variance reduction properties of the estimator $\check g_{\rm cont}$ (and hence of $\tilde g$), relative to its standard kernel counterpart $\hat g$, let $L(v)$ denote the integral average of $K(v + z)$ over $z \in \mathcal{L}$, where $\mathcal{L}$ is any line segment of length $2\lambda_0$, and put $\kappa_M = \int M(v)^2\,dv$ for $M = K$ and $M = L$. The variances of $\hat g(x)$ and $\check g_{\rm cont}(x|\lambda_0)$ are asymptotic to $(nh^2)^{-1} g(x)\,\kappa_M$ as $n \to \infty$, where $M = K$ and $L$ in the respective cases. Moreover, $\kappa_L < \kappa_K$, and so our method reduces variance; and also, $\kappa_L/\kappa_K \sim C\lambda_0^{-1}$, for a constant $C > 0$, as $\lambda_0 \to \infty$. The latter result shows that as the fixed value of $\lambda_0$ becomes larger, the extent of variance reduction increases in proportion to $\lambda_0$. (Note that $\kappa_L$ does not depend on the particular choice of $\mathcal{L}$.)
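The ordering of the variance constants can be illustrated numerically. The sketch below assumes the biweight kernel with normalising constant $3/\pi$, takes the segment to lie along the first coordinate axis and centred at the origin, and approximates $\int M(v)^2\,dv$ by a midpoint rule; the names `K`, `L` and `kappa` and the grid sizes are illustrative.

```python
import math

def K(v1, v2):
    """Biweight kernel on the unit disc (3/pi normalisation assumed)."""
    r2 = v1 * v1 + v2 * v2
    return (3.0 / math.pi) * (1.0 - r2) ** 2 if r2 < 1.0 else 0.0

def L(v1, v2, lam0, m=40):
    """Average of K(v + z) over z on the segment {(s, 0): |s| <= lam0}."""
    ds = 2.0 * lam0 / m
    return sum(K(v1 - lam0 + (k + 0.5) * ds, v2) for k in range(m)) / m

def kappa(fn, half_width, step=0.05):
    """Midpoint-rule approximation to the integral of fn(v)^2 over the plane.

    fn is assumed to vanish outside [-half_width, half_width] x [-1, 1].
    """
    n1 = int(round(2.0 * half_width / step))
    n2 = int(round(2.0 / step))
    total = 0.0
    for i in range(n1):
        v1 = -half_width + (i + 0.5) * step
        for j in range(n2):
            v2 = -1.0 + (j + 0.5) * step
            total += fn(v1, v2) ** 2
    return total * step * step

kappa_K = kappa(K, 1.0)
kappa_L1 = kappa(lambda a, b: L(a, b, 1.0), 2.0)   # lambda_0 = 1
kappa_L2 = kappa(lambda a, b: L(a, b, 2.0), 3.0)   # lambda_0 = 2
```

For this kernel $\kappa_K = 1.8/\pi$ exactly, and the computed values should satisfy `kappa_L2 < kappa_L1 < kappa_K`.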
The asymptotic bias of $\check g_{\rm cont}(x|\lambda_0)$ is readily seen to be identical to that of $\hat g(x)$, and in fact the expected value of either estimator equals $g(x) + \tfrac12 h^2 \kappa_2 \nabla^2 g(x) + o(h^2)$, where $\kappa_2 = \int (v^{(1)})^2 K(v)\,dv$, $v^{(1)}$ denotes the first component of the vector $v$, and $\nabla^2 g$ equals the Laplacian. Combining this result with that in the previous paragraph, and with (3.4), we see that the empirical contour estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ has the same asymptotic bias as the conventional kernel estimator $\hat g(x)$, but has reduced asymptotic variance.
Therefore the estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ has smaller minimum asymptotic mean squared error (AMSE) than $\hat g(x)$. In particular, if $h = Hn^{-1/6}$ then the AMSE equals $n^{-2/3} A_L(H)$, where
$$A_L(H) = \tfrac14 H^4 \kappa_2^2 \{\nabla^2 g(x)\}^2 + H^{-2} g(x)\,\kappa_L.$$
Through the fact that $\kappa_L < \kappa_K$ this is always (unless $g(x) = 0$) strictly less than the AMSE of $\hat g(x)$; in the obvious notation the AMSE of $\hat g(x)$ equals $n^{-2/3} A_K(H)$. Likewise the asymptotic mean integrated squared error of $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$, computed for example over $x \in \mathcal{S}_\epsilon$, is less than that for $\hat g(x)$.
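The size of the gain at the optimal bandwidth follows from elementary calculus: setting the derivative of $A_M(H)$ to zero gives $H^6 = 2 g(x) \kappa_M / [\kappa_2 \nabla^2 g(x)]^2$, and substituting back shows that the minimised constants are in the ratio $(\kappa_L/\kappa_K)^{2/3}$. The sketch below checks this numerically; the value of `kappa_L` is hypothetical, and the other inputs assume the $N(0, I)$ density at $x = (1, 0)$ and a biweight kernel normalised by $3/\pi$.

```python
import math

def amse_constant(H, kappa, kappa2, g, laplacian):
    """A(H) = (1/4) H^4 kappa2^2 (lap g)^2 + H^{-2} g kappa."""
    return 0.25 * H ** 4 * kappa2 ** 2 * laplacian ** 2 + g * kappa / H ** 2

def optimal_H(kappa, kappa2, g, laplacian):
    """Stationary point of A(H): H^6 = 2 g kappa / (kappa2 * lap g)^2."""
    return (2.0 * g * kappa / (kappa2 * laplacian) ** 2) ** (1.0 / 6.0)

# Illustrative inputs: N(0, I) density at x = (1, 0); biweight constants
# kappa_2 = 1/8 and kappa_K = 1.8/pi; kappa_L = 0.3 is a hypothetical value.
g = math.exp(-0.5) / (2.0 * math.pi)
laplacian = -g                       # (||x||^2 - 2) g(x) at ||x|| = 1
kappa2, kappa_K, kappa_L = 0.125, 1.8 / math.pi, 0.3
H_K = optimal_H(kappa_K, kappa2, g, laplacian)
H_L = optimal_H(kappa_L, kappa2, g, laplacian)
ratio = (amse_constant(H_L, kappa_L, kappa2, g, laplacian)
         / amse_constant(H_K, kappa_K, kappa2, g, laplacian))
```

Here `ratio` should agree with `(kappa_L / kappa_K) ** (2 / 3)` up to rounding error.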
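The estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ is also asymptotically normally distributed, in the sense that
$$\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0) = g(x) + \tfrac12 h^2 \kappa_2 \nabla^2 g(x) + (nh^2)^{-1/2} \{g(x)\,\kappa_L\}^{1/2} N_n + o_p(n^{-1/3}), \tag{3.5}$$
where $N_n$ is asymptotically distributed as normal $N(0, 1)$.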
To rigorously establish the claims made above, note that $\check g_{\rm cont}(x|\lambda_0)$ may be written in a form similar to that at (2.1):
$$\check g_{\rm cont}(x|\lambda_0) = (nh^2)^{-1} \sum_{i=1}^{n} K_{{\rm cont},x}\!\left(\frac{x - X_i}{h}\right), \tag{3.6}$$
where
$$K_{{\rm cont},x}(v) \equiv |D_0|^{-1} \int_{D_0} K(v + z)\,ds, \tag{3.7}$$
with $D_0$ denoting the image of the contour line segment $D(x)$ after rescaling by $h^{-1}$ in each coordinate and translating $x$ to the origin. As $h \to 0$ the kernel $K_{{\rm cont},x}$ converges to $L$, if the line segment $\mathcal{L}$ is chosen to have its centre at the origin and to be parallel to the contour tangent at $x$. (This does not affect the value of $\kappa_L$.) That the variances are asymptotic to $(nh^2)^{-1} g(x)\,\kappa_M$ follows from standard arguments for the variance of a kernel density estimator; see for example [22, pp. 19–23]. The result $\kappa_L < \kappa_K$ follows from the Cauchy–Schwarz inequality; equality cannot arise unless $\lambda_0 = 0$. It may be shown too that
$$\lambda_0 \kappa_L/\kappa_K \to C \equiv \int_{-\infty}^{\infty} ds \int K(v)\,K\{v + (s, 0)^{\rm T}\}\,dv,$$
whence it follows that $\kappa_L \sim C\kappa_K/\lambda_0$. Result (3.5) follows from (3.4), (3.6) and the bias properties of $\check g_{\rm cont}(x|\lambda_0)$ noted in the paragraph containing (3.5). Asymptotic normality of the variable $N_n$ in (3.5) is an immediate consequence of the fact that $\check g_{\rm cont}(x|\lambda_0)$ is a sum of $n$ independent and identically distributed random variables.
3.4. High-order generalisations, and optimality
In Section 3.2 we simplified our theory by considering only the case where $h$ is of the size appropriate for optimal construction of $\hat g$, and $\lambda_0$ is fixed. In the present section we discuss improvements in the overall convergence rate that are available using other choices of bandwidth, and taking $\lambda_0$ to diverge with $n$. Our first result is a version of Theorems 3.1 and 3.2 in this setting.
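Theorem 3.3. Assume $(C_K)$, $h = c_1 n^{-2/11}$ and $\lambda_0 = c_2 n^{1/11}$ where $c_1, c_2 > 0$ are fixed, and that
$g$ has two Hölder-continuous derivatives, where the Hölder coefficient exceeds $\tfrac23$. (3.8)
Then estimators $\tilde\theta_x$ and $\tilde c_x$ of $\theta_x$ and $c_x$, respectively, can be constructed such that with probability 1,
$$h^{-1} (\log n)^{1/2} \sup_{x \in \mathcal{S}_\epsilon} \left(\langle\tilde\theta_x - \theta_x\rangle + \lambda_0 h\,|\tilde c_x - c_x|\right) \to 0. \tag{3.9}$$
Furthermore, if $(\tilde\theta_x, \tilde c_x)$ satisfies (3.9), then (3.4) continues to hold with probability 1, uniformly in $x \in \mathcal{S}_\epsilon$ for each $\epsilon > 0$.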
If the Hölder coefficient mentioned in (3.8) equals $1 - \xi_1$ with $\xi_1 \in (0, \tfrac13)$; if, when constructing $\hat g$ for use with the local quadratic contour estimation method outlined in Section 2.1, we take $h = n^{-(3 - \xi_2)/22}$ where $0 < \xi_2 < 9\xi_1/(2 + 3\xi_1)$; and if we take $(\tilde\theta_x, \tilde c_x)$ to be $(\hat\theta_x, \hat c_x)$, defined in Section 2.1; then (3.9) holds. (An outline proof will be given in Section 5.3.) Thus, as noted in the last paragraph of Section 3.2, it is necessary to use a larger order of bandwidth when estimating quantities such as $\theta_x$ and $c_x$, which depend on derivatives of $g$, than when estimating $g$ itself.
The estimator $\check g_{\rm cont}$ in (3.4) again admits representation (3.6), with kernel $K_{{\rm cont},x}$, and has standard deviation $O\{(nh^2\lambda_0)^{-1/2}\}$ and bias $O(h^2)$. Both are of order $n^{-4/11}$, and so $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0) = g(x) + O_p(n^{-4/11})$. This represents an improvement by an order of magnitude on the rate of convergence, $O_p(n^{-1/3})$, of the estimator discussed in Theorem 3.2. Faster rates of convergence, up to $O_p(n^{-(1/2)+\xi})$ for any given $\xi > 0$, can be obtained for sufficiently smooth densities by using more accurate contour estimators.
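The rate arithmetic can be verified with exact fractions. In the sketch below the helper `rate_exponents` is a hypothetical name; it takes the exponents of $n$ in $h$ and $\lambda_0$ and returns the exponents of $n$ in the standard deviation $(nh^2\lambda_0)^{-1/2}$ and in the bias $h^2$.

```python
from fractions import Fraction

def rate_exponents(h_exp, lam_exp):
    """Exponents of n in the standard deviation and in the bias.

    With h = n**h_exp and lambda_0 = n**lam_exp, the standard deviation is of
    order (n h^2 lambda_0)^(-1/2) and the bias is of order h^2.
    """
    sd_exp = -Fraction(1, 2) * (1 + 2 * h_exp + lam_exp)
    bias_exp = 2 * h_exp
    return sd_exp, bias_exp

# Theorem 3.3: h = n^(-2/11), lambda_0 = n^(1/11).
sd_exp, bias_exp = rate_exponents(Fraction(-2, 11), Fraction(1, 11))
# Theorem 3.2: h = n^(-1/6), lambda_0 fixed.
sd_fixed, bias_fixed = rate_exponents(Fraction(-1, 6), Fraction(0))
```

Both exponents equal $-4/11$ in the first case and $-1/3$ in the second, so the two error sources are balanced in each setting.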
As is well known (see e.g. [21]), the optimal rate of convergence of estimators of bivariate densities with two bounded derivatives equals $O(n^{-1/3})$. The results discussed in Section 3.3 might seem to contradict this result, since they signal the possibility of achieving a convergence rate of $o(n^{-1/3})$ by choosing $n^{1/6}h$ to decrease to zero sufficiently slowly, and $\lambda_0$ to diverge to infinity sufficiently slowly, as $n \to \infty$. However, there is in fact no violation, since we need a little more than just two bounded derivatives, specifically the Hölder continuity assumption in condition $(C_{2g})$, in order to achieve the rate. Likewise, the assumption in (3.8) that the Hölder coefficient exceeds (rather than equals) $\tfrac23$ is slightly more than necessary for optimal performance under minimal conditions. In each case, however, a byproduct of the additional assumption is the uniform strong approximation of the empirical contour-based estimator $\check g(x|\tilde\theta_x, \tilde c_x, \lambda_0)$ by its generalised kernel form $\check g_{\rm cont}(x|\lambda_0)$, as evidenced by (3.4).
3.5. The case of edge effects
In Section 2.3 we showed that, in the context of density estimation, there is a natural version of $\tilde g$ that addresses edge effects. As a prelude to stating the results of Section 3.2 for estimators of this type, redefine the contour segment $D_0$ at (3.7) by taking it to be that subset of the original $D_0$ which is as large as possible subject to the support of $\int_{D_0} K\{(\cdot + z)/h\}\,ds$ not protruding outside the support set $\mathcal{R}$ (introduced in Section 2.3). With this modification, continue to define $K_{{\rm cont},x}$ by (3.7) and $\check g_{\rm cont}$ by (3.6).
Take $\mathcal{R}$ to be a compact set whose boundary has two Hölder-continuous derivatives and is such that at no point on the boundary is the tangent to the boundary equal to the corresponding contour line; and assume the conditions of Theorems 3.1 and 3.2 on $\mathcal{R}$ rather than $\mathcal{S}$. Then (3.1) and (3.4) hold uniformly in $x \in \mathcal{R}$ (rather than $x \in \mathcal{S}_\epsilon$). Moreover, an argument identical to that in Section 3.2 shows that the asymptotic variance of $\check g_{\rm cont}(x|\lambda_0)$, and hence of $\tilde g(x|\lambda_0)$, decreases to 0 at rate $\lambda_0^{-1}$ as $\lambda_0 \to \infty$.
3.6. Nonparametric regression
A model for nonparametric regression is that where data pairs $(X_i, Y_i)$ are generated by the formula $Y_i = g(X_i) + \epsilon_i$, the errors $\epsilon_i$ having zero mean. Theory in this case is similar to that for density estimation, although regularity conditions are required on the design variables $X_i$ and the error distribution. For the latter it is sufficient to suppose that the $\epsilon_i$'s are independent and identically distributed with all moments finite and zero mean. In this case, terms in $\log n$ in (3.1) and (3.3) should be replaced by terms in $n^\delta$, for $\delta > 0$ fixed but arbitrarily small. In this vein, the assumption "$\lambda^2 h/(\log n)^{5/4} \to \infty$" in $(C_{h,\lambda})$ should be replaced by "$\lambda^2 h/n^\delta \to \infty$ for some $\delta > 0$". Let $(C'_{h,\lambda})$ denote the corresponding version of $(C_{h,\lambda})$.
Of the design variables it is adequate to suppose that they are independent and identically distributed with density $f$, which is bounded away from 0 on $\mathcal{S}$ and has two Hölder-continuous derivatives there. With this assumption, $(C_K)$, $(C'_{h,\lambda})$, $(C_{1g})$ and $(C_{2g})$; using $\lambda$ when estimating $(\hat\theta_x, \hat c_x)$, and employing a fixed $\lambda_0$ when constructing $\tilde g$; and taking the basic estimator $\hat g$ to be of either Nadaraya–Watson or local-linear type; results described in Sections 3.2 and 3.3 hold in the case of nonparametric regression.
4. Numerical examples
Three estimation methods, local quadratic approximation to contour lines (giving $\tilde g_B(x|\lambda)$), local linear approximation (giving $\tilde g_L(x|\lambda)$), and the standard kernel estimator $\hat g(x)$, were used to estimate the probability density functions of two distributions. We generated 200 random samples of size $n = 500$ from each. For each sample, integrated squared error (ISE) values for the three estimators were approximated by numerical integration. Values of MISE were approximated by averaging the 200 respective ISE values. The spherical biweight kernel, $K(z) = \frac{15}{8\pi}(1 - \|z\|^2)_+^2$, was employed throughout.
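The ISE approximation might be coded along the following lines. This is a sketch only: the integration grid, the reduced sample size of 200 (chosen to keep the pure-Python loop quick) and the $3/\pi$ kernel normalisation are assumptions of the example, and the names `kde` and `ise` are hypothetical.

```python
import math
import random

def biweight(u1, u2):
    """Biweight kernel on the unit disc (3/pi normalisation assumed)."""
    r2 = u1 * u1 + u2 * u2
    return (3.0 / math.pi) * (1.0 - r2) ** 2 if r2 < 1.0 else 0.0

def kde(x, data, h):
    """Standard kernel density estimator, as at (2.1)."""
    total = sum(biweight((x[0] - a) / h, (x[1] - b) / h) for a, b in data)
    return total / (len(data) * h * h)

def ise(estimator, truth, lo=-4.0, hi=4.0, step=0.2):
    """Midpoint-rule approximation to the integrated squared error."""
    n = int(round((hi - lo) / step))
    total = 0.0
    for i in range(n):
        a = lo + (i + 0.5) * step
        for j in range(n):
            b = lo + (j + 0.5) * step
            total += (estimator((a, b)) - truth((a, b))) ** 2
    return total * step * step

normal_density = lambda z: math.exp(-0.5 * (z[0] ** 2 + z[1] ** 2)) / (2.0 * math.pi)
random.seed(1)
sample = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(200)]
ise_value = ise(lambda z: kde(z, sample, 0.8), normal_density)
```

Averaging `ise_value` over repeated samples would give the corresponding MISE approximation.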
Our first example is the unimodal bivariate normal $N(0, I)$ distribution. We took the bandwidth to equal 0.8. To construct $\tilde g_B(x|\lambda)$ and $\tilde g_L(x|\lambda)$, $\lambda$ in (2.2) was taken as $\min(0.1 + \tfrac23 d(x), 1.1)$, where $d(x)$ was the distance from $x$ to the location of the mode of $\hat g$. Three-quarters of this value was used for $\lambda$ in (2.4). See the second-last paragraph of Section 2.1 and the last paragraph of Section 3.2. Notice that "radii" of contour lines of the density surface degenerate near the mode, and that linearly increasing the value of $\lambda$ ensures appropriate approximation of the contour lines.
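The rule for $\lambda$ is simple enough to state in code. In the sketch below the function names are hypothetical and the mode location is assumed to have been found beforehand, for example from a preliminary plot of $\hat g$.

```python
import math

def lambda_for_contour(x, mode, slope=2.0 / 3.0, floor=0.1, cap=1.1):
    """lambda used in (2.2): min(0.1 + (2/3) d(x), 1.1), d(x) = distance to the mode."""
    d = math.hypot(x[0] - mode[0], x[1] - mode[1])
    return min(floor + slope * d, cap)

def lambda_for_averaging(x, mode):
    """lambda used in (2.4): three-quarters of the contour-fitting value."""
    return 0.75 * lambda_for_contour(x, mode)
```

At the mode itself this gives $\lambda = 0.1$ for contour fitting, and the cap of 1.1 is reached once $d(x) \ge 1.5$.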
Among the 200 random samples, the three samples that give median ISE values for the three estimators are plotted in Fig. 3, which also shows the corresponding values of $\hat g(x)$, $\tilde g_L(x|\lambda)$ and $\tilde g_B(x|\lambda)$. In multivariate cases, often a density surface estimate fluctuates significantly due to data sparseness difficulties. The averaging step of our contour approximation methods remedies this problem. This effect is clearly demonstrated by the middle and bottom rows of Fig. 3. There, for each of the three samples, the surfaces corresponding to $\tilde g_L(x|\lambda)$ and $\tilde g_B(x|\lambda)$ have less wiggly contour lines than $\hat g$ at places away from the mode.
Table 1 gives ISE values for the nine estimates. Table 2 provides average ISE values, these being approximations to MISEs, for the three estimators. These results demonstrate clear gains of $\tilde g_L(x|\lambda)$ over $\hat g$.
Fig. 3. Unimodal density estimates. The top row depicts three samples of size 500 drawn from the bivariate normal $N(0, I)$ distribution. Middle and bottom rows show contour lines of the true density surface (dashed lines), $\hat g$ (dotted lines) and contour approximation estimators, for the three respective samples. The middle row compares the local linear contour approximation method (solid lines) with $\hat g$. The bottom row compares the local quadratic contour approximation method (solid lines) with $\hat g$.
Table 1
ISE values of the density estimates shown in Figs. 3 and 4

Distribution              Estimator                   ISE (one value per plotted sample)
Unimodal normal           $\hat g(x)$                 0.001550   0.001660   0.001602
                          $\tilde g_L(x\mid\lambda)$  0.001398   0.001332   0.001450
                          $\tilde g_B(x\mid\lambda)$  0.001437   0.001635   0.001520
Bimodal normal mixture    $\hat g(x)$                 0.003616   0.003717   0.003610
                          $\tilde g_L(x\mid\lambda)$  0.003311   0.003261   0.003310
                          $\tilde g_B(x\mid\lambda)$  0.003548   0.003661   0.003553
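The ISE figures reported in the tables can be illustrated with a small computation. The following is only a sketch under stated assumptions: it uses a product-Gaussian kernel, a bandwidth of 0.5 and an 8 x 8 integration window, none of which are taken from the paper, and it covers only the conventional estimator, not the contour approximation estimators. The function names are hypothetical.

```python
import math
import random


def gaussian_kde_2d(sample, h):
    """Return a bivariate Gaussian kernel density estimate with bandwidth h."""
    n = len(sample)
    norm = 1.0 / (n * 2.0 * math.pi * h * h)

    def g_hat(x1, x2):
        total = 0.0
        for s1, s2 in sample:
            total += math.exp(-((x1 - s1) ** 2 + (x2 - s2) ** 2) / (2.0 * h * h))
        return norm * total

    return g_hat


def true_density(x1, x2):
    """Density of the bivariate standard normal N(0, I)."""
    return math.exp(-(x1 * x1 + x2 * x2) / 2.0) / (2.0 * math.pi)


def ise(estimate, truth, lo=-4.0, hi=4.0, m=33):
    """Approximate the integrated squared error by a midpoint Riemann sum on an m x m grid."""
    step = (hi - lo) / m
    total = 0.0
    for i in range(m):
        x1 = lo + (i + 0.5) * step
        for j in range(m):
            x2 = lo + (j + 0.5) * step
            total += (estimate(x1, x2) - truth(x1, x2)) ** 2
    return total * step * step


# One sample of size 500 from N(0, I); the seed and bandwidth are illustrative choices.
rng = random.Random(0)
sample = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
ise_value = ise(gaussian_kde_2d(sample, h=0.5), true_density)
print(f"ISE of the kernel estimate: {ise_value:.6f}")
```

Averaging such ISE values over many independent samples gives the kind of MISE approximation reported in Table 2.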
Our next example illustrates performance of our estimators in a more complex, bimodal setting, a mixture of two bivariate normal distributions:
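$$0.7\,N\!\left(\begin{pmatrix}0\\0\end{pmatrix},\ \begin{pmatrix}1&0\\0&1\end{pmatrix}\right) + 0.3\,N\!\left(\begin{pmatrix}1.5\\1\end{pmatrix},\ \begin{pmatrix}0.26&0.1\\0.1&0.26\end{pmatrix}\right). \tag{4.1}$$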
The bandwidth was $h = 0.6$. To construct $\tilde g(x\mid\lambda)$ we took $\lambda$ in (2.2) to be $\min\{0.1 + \tfrac{2}{3}\,d(x),\ 1.1\}$, and three-quarters of its value to be $\lambda$ in (2.4), where $d(x)$ was the distance from $x$ to the location of the mode of $\hat g$ nearest to $x$. This prevents our using too-large values of $\lambda$ at places between the modes, where the contour lines are curved, and hence helps preserve the bimodal feature of the density surface estimate. (In practice, there may not be prior information about the number of modes of the true distribution. In this case one can make a judgment from plots of preliminary estimates, such as $\hat g$.) For this example our approach again reduces fluctuations in the density surface estimates caused by stochastic variability, particularly in regions away from either of the modes; see the panels in the middle and bottom rows of Fig. 4. The ISE and average ISE values are given in Tables 1 and 2.
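The rule for choosing the contour length parameter in this example is simple enough to state as code. The sketch below assumes that the mode locations of the preliminary estimate have already been found; the function name, its arguments and the mode coordinates used in the demonstration are hypothetical.

```python
import math


def contour_length_parameter(x, modes, three_quarters=False):
    """Return min(0.1 + (2/3) d(x), 1.1), where d(x) is the distance to the nearest mode.

    With three_quarters=True the value is multiplied by 3/4, as described
    in the text for the parameter used in (2.4).
    """
    d = min(math.dist(x, m) for m in modes)   # distance from x to the nearest mode
    lam = min(0.1 + (2.0 / 3.0) * d, 1.1)     # linear growth in d(x), capped at 1.1
    return 0.75 * lam if three_quarters else lam


# Hypothetical mode locations of a preliminary estimate, for illustration only.
modes = [(0.0, 0.0), (1.5, 1.0)]
for point in [(0.0, 0.0), (0.75, 0.5), (-3.0, -3.0)]:
    print(point,
          round(contour_length_parameter(point, modes), 4),
          round(contour_length_parameter(point, modes, three_quarters=True), 4))
```

The cap at 1.1 is what limits the averaging length far from both modes, while the linear term keeps the parameter small near a mode, where contour "radii" degenerate.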
In summary, our simulation results demonstrate advantages of the contour approximation methods: the density surface estimates are more regularly shaped and the MISE values are reduced, compared to the usual kernel density estimate. Notably, the local linear contour approximation estimator enjoys good numerical performance. The local quadratic approximation method performs less well; it involves fitting two parameters rather than one, and thus will outperform $\hat g$ in MISE terms only when sample size is relatively large.
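Table 2
Average ISE values of $\hat g(x)$, $\tilde g_L(x\mid\lambda)$ and $\tilde g_B(x\mid\lambda)$ when applied to 200 random samples of size 500, drawn from the unimodal $N(0, I)$ or the bimodal normal mixture distribution at (4.1)

Estimator                   Unimodal normal   Bimodal normal mixture
$\hat g(x)$                 0.001650          0.003724
$\tilde g_L(x\mid\lambda)$  0.001413          0.003377
$\tilde g_B(x\mid\lambda)$  0.001619          0.003678

5. Proofs
5.1. Proof of Theorem 3.1
Put $\gamma = E(\hat g)$, $\bar\gamma = E(\check g)$, $\Delta = \hat g - \gamma$ and $\bar\Delta = \check g - \bar\gamma$. Define $A_1(x\mid\theta,c,\lambda)$, $A_2(x\mid\theta,c,\lambda)$ and $A_3(x\mid\theta,c,\lambda)$ to equal the integrals of $\{\gamma(z) - \bar\gamma(x\mid\theta,c)\}^2$, $\{\Delta(z) - \bar\Delta(x\mid\theta,c)\}^2$ and $\{\gamma(z) - \bar\gamma(x\mid\theta,c)\}\{\Delta(z) - \bar\Delta(x\mid\theta,c)\}$, respectively, over $z \in \mathcal{C}(x\mid\theta,c,\lambda)$. Then,
$$A_2(x\mid\theta,c,\lambda) = \int_{\mathcal{C}(x\mid\theta,c,\lambda)} \Delta(z)^2\,ds - \xi(c,\lambda)\,\bar\Delta(x\mid\theta,c)^2, \tag{5.1}$$
$$A_3(x\mid\theta,c,\lambda) = \int_{\mathcal{C}(x\mid\theta,c,\lambda)} \gamma(z)\,\Delta(z)\,ds - \xi(c,\lambda)\,\bar\gamma(x\mid\theta,c)\,\bar\Delta(x\mid\theta,c), \tag{5.2}$$
$$\xi S = A_1 + A_2 + 2A_3. \tag{5.3}$$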
Without loss of generality, $\lambda \ge 1$ and $\lambda h \le 1$. Let $\psi$ denote a differentiable function defined in the plane, write $|D\psi|(z)$ for the supremum of the absolute value of the directional derivative of $\psi$ (at $z$) over all directions, let $C > 0$, and put $\eta = \langle\theta - \theta_x\rangle$ and $\zeta = |c - c_x|$. There exists $C_1 > 0$ with the property that
$$\left|\left(\int_{\mathcal{C}(x\mid\theta,c,\lambda)} - \int_{\mathcal{C}(x\mid\theta_x,c_x,\lambda)}\right)\psi(z)\,ds\right| \le C_1\,(\eta + \lambda h\zeta)\,(\lambda h)^2 \sup_{z:\,\|z-x\|\le\lambda h}\{|D\psi|(z) + |\psi(z)|\} \tag{5.4}$$
uniformly in $(\theta, c)$ such that $|\theta|, |\theta_x| \le \pi$, $|c|, |c_x| \le C/(\lambda h)$ and $\eta, \lambda h\zeta \le C$. (Below we shall refer to this uniform sense as "uniform$^*$". At (5.4) and below the constants $C_1, \ldots, C_4$ depend only on $C$.) To derive (5.4), note that the distance between a given point on $\mathcal{C}(x\mid\theta,c,\lambda)$ and its counterpart on $\mathcal{C}(x\mid\theta_x,c_x,\lambda)$, to which the former may be rotated about $x$, is dominated by a constant multiple of $(\eta + \lambda h\zeta)\,\lambda h$. Therefore, the difference of function values at the two points is dominated by a constant multiple of $(\eta + \lambda h\zeta)\,\lambda h$ times $\sup|D\psi|(z)$. To obtain the bound at (5.4) this should be multiplied by a constant times the lengths of the curves, i.e. by a constant times $\lambda h$. There is an additional contribution to the right-hand side, coming from the difference between 1 and the Jacobian of the transformation, based on a rotation, which takes $\mathcal{C}(x\mid\theta,c,\lambda)$ to $\mathcal{C}(x\mid\theta_x,c_x,\lambda)$; but it too is dominated by a constant multiple of the right-hand side of (5.4).

Fig. 4. Bimodal density estimates. Same as Fig. 3, except that the samples are from the bivariate normal mixture distribution at (4.1).
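The quantity $\xi(c,\lambda)$, being the length of $\mathcal{C}(x\mid\theta,c,\lambda)$, is asymptotic to $2\lambda h$ uniformly in $|c| \le C/(\lambda h)$, and $|\xi(c,\lambda) - \xi(c_x,\lambda)| \le C_2(\lambda h)^3\zeta$ uniformly in $|c| \le C$ such that $\zeta \le C/(\lambda h)$. Therefore,
$$|\xi(c,\lambda)^{-1} - \xi(c_x,\lambda)^{-1}| \le C_3\,\lambda h\zeta, \tag{5.5}$$
in the same uniform sense. Combining (5.1) with the results in this paragraph, and defining $B_j(x\mid\theta,c,\lambda) = \xi(c,\lambda)^{-1}A_j(x\mid\theta,c,\lambda)$, we conclude that in the uniform$^*$ sense,
$$|B_2(x\mid\theta,c,\lambda) - B_2(x\mid\theta_x,c_x,\lambda)| \le C_4\,(\eta + \lambda h\zeta)\,\lambda h \sup_{z:\,\|z-x\|\le\lambda h}\{|\Delta(z)|\,|D\Delta|(z) + \Delta(z)^2\}. \tag{5.6}$$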
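Given $j = 0, 1$, let $A_{4j}(x\mid\theta,c,\lambda)$ denote the integral of $\gamma(z)^j\Delta(z)$ over $z \in \mathcal{C}(x\mid\theta,c,\lambda)$. An argument similar to that leading to (5.6) implies that in the uniform$^*$ sense,
$$|\bar\gamma(\theta,c) - \bar\gamma(\theta_x,c_x)| \le C_5\,(\eta + \lambda h\zeta)\,\lambda h, \tag{5.7}$$
where the constants $C_5, C_6, C_7$ here and below depend only on $C$, $g$ and $K$. From (5.2), (5.7) and the properties of $\xi(c,\lambda)$ discussed in the previous paragraph, we deduce that in the uniform$^*$ sense,
$$|B_3(x\mid\theta,c,\lambda) - B_3(x\mid\theta_x,c_x,\lambda)| \le C_6\Bigl[(\lambda h)^{-1}\max_{j=1,2}|A_{4j}(x\mid\theta,c,\lambda) - A_{4j}(x\mid\theta_x,c_x,\lambda)| + (\eta + \lambda h\zeta)\,\lambda h\,\max\{|A_{40}(x\mid\theta,c,\lambda)|,\ |A_{40}(x\mid\theta_x,c_x,\lambda)|\}\Bigr]. \tag{5.8}$$
Combining (5.3), (5.6) and (5.8) we conclude that in the uniform$^*$ sense,
$$\bigl|S(x\mid\theta,c,\lambda) - S(x\mid\theta_x,c_x,\lambda) - \{B_1(x\mid\theta,c,\lambda) - B_1(x\mid\theta_x,c_x,\lambda)\}\bigr|$$
$$\le C_7\Bigl((\lambda h)^{-1}\max_{j=1,2}|A_{4j}(x\mid\theta,c,\lambda) - A_{4j}(x\mid\theta_x,c_x,\lambda)| + (\eta + \lambda h\zeta)\,\lambda h\Bigl[\sup_{z:\,\|z-x\|\le\lambda h}\{|\Delta(z)|\,|D\Delta|(z) + \Delta(z)^2\} + \max\{|A_{40}(x\mid\theta,c,\lambda)|,\ |A_{40}(x\mid\theta_x,c_x,\lambda)|\}\Bigr]\Bigr). \tag{5.9}$$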
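The quantities $T_1 \equiv \Delta(z)$, $T_2 \equiv A_{40}(x\mid\theta,c,\lambda)$ and $T_3 \equiv A_{4j}(x\mid\theta,c,\lambda) - A_{4j}(x\mid\theta_x,c_x,\lambda)$ all have zero mean, and have variances equal to $O(\sigma_1^2)$, $O(\sigma_2^2)$ and $O(\sigma_3^2)$, respectively, where $\sigma_1^2 = h^4$, $\sigma_2^2 = \lambda^2h^6$ and $\sigma_3^2 = (\eta + \lambda h\zeta)(\lambda h^2)^2$. Also, $T_4 \equiv |D\Delta|(z)$ has mean square equal to $O(\sigma_4^2)$, where $\sigma_4^2 = h^2$. For example, to obtain the order of the variance of $T_3$, note that the area between the curves $\mathcal{C}(x\mid\theta,c,\lambda)$ and $\mathcal{C}(x\mid\theta_x,c_x,\lambda)$ equals $O(a)$, where $a = (\eta + \lambda h\zeta)(\lambda h)^2$. The variance of $nh^2T_3$ is essentially the variance of a Poisson variable with mean $O(na)$, and so the variance of $T_3$ equals $O\{(nh^2)^{-2}na\}$, which, since $h \asymp n^{-1/6}$, equals $O(ah^2) = O(\sigma_3^2)$.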
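The order claimed for the variance of $T_3$ can be checked by direct substitution. The display below is a sketch of that arithmetic; it treats the relation between $h$ and $n^{-1/6}$ as an exact power law and ignores constants.

```latex
\begin{align*}
(nh^2)^{-2}\, n a &= \frac{a}{n h^{4}}, \\
h \asymp n^{-1/6} \;\Longrightarrow\; n \asymp h^{-6}
  \;\Longrightarrow\; \frac{a}{n h^{4}} &\asymp a\, h^{6}\, h^{-4} = a h^{2}, \\
a h^{2} = (\eta + \lambda h \zeta)(\lambda h)^{2} h^{2}
  &= (\eta + \lambda h \zeta)(\lambda h^{2})^{2} = \sigma_3^{2}.
\end{align*}
```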
Using Bennett's inequality we may prove that, provided
$$n^{1-\varepsilon}h^2\sigma_i \to \infty \quad \text{for some } \varepsilon > 0 \text{ and } i = 1, \ldots, 4, \tag{5.10}$$
the probability that $U_1 \equiv |T_1T_4|$, $U_2 \equiv |T_2|$ or $U_3 \equiv |T_3|$ exceeds $u_1 \equiv C_8\sigma_1\sigma_4\log n$, $u_2 \equiv C_8\sigma_2(\log n)^{1/2}$ or $u_3 \equiv C_8\sigma_3(\log n)^{1/2}$, respectively, equals $O(n^{-C_9})$ in each case, where $C_9$ may be made arbitrarily large by choosing $C_8$ sufficiently large; and these probabilities are of the stated orders uniformly in $x, z \in \mathcal{S}_\varepsilon$, and in $c, c_x, \theta, \theta_x$ complying with the "uniform$^*$" sense. From this result, using standard methods of approximation (see below), we may deduce that with probability 1 the right-hand side of (5.9), denoted below by RHS, satisfies
$$\mathrm{RHS} = O(\delta_n), \quad \text{where } \delta_n = (\eta + \lambda h\zeta)^{1/2}\,h\,(\log n)^{1/2} + (\eta + \lambda h\zeta)\,\lambda^2h^4(\log n)^{1/2}, \tag{5.11}$$
the former identity holding uniformly in $x \in \mathcal{S}_\varepsilon$ and in $c, c_x, \theta, \theta_x$ complying with the "uniform$^*$" sense. (Below we shall refer to this alternative uniform sense as "uniform$^\dagger$".)
The "standard methods of approximation" alluded to above may be summarised as follows. Since $\mathcal{S}$ is bounded, for any $c > 0$ a square lattice with edge width $n^{-c}$ has only $O(n^{2c})$ of its vertices in $\mathcal{S}$. Since the derivatives of $K$ are Hölder continuous, we may choose $c$ so large that the difference between the value of $U_j$ at a general point $u$ (say) within $\mathcal{S}_\varepsilon$, and the value of $U_j$ at the point of the lattice (within $\mathcal{S}$) that is nearest to $u$, equals $O(n^{-1})$ uniformly in $u$ and in $j = 1, 2, 3$, with probability 1. Call this result $(R_1)$. By choosing $C_8$ (introduced in the previous paragraph) so large that we may take $C_9 \ge 2c + 2$, and applying the Borel–Cantelli lemma, we may show that the supremum of $U_j(u)$, over all $u$ in the lattice, equals $O(t_j)$ for each $j$, with probability 1. Call this result $(R_2)$. Since $n^{-1} = O(t_j)$, combining $(R_1)$ and $(R_2)$ we have shown that the supremum of $U_j(u)$, over all $u \in \mathcal{S}_\varepsilon$, equals $O(t_j)$ for each $j$. This implies (5.11).
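The role of the condition $C_9 \ge 2c + 2$ in the Borel–Cantelli step can be spelled out with a union bound over the lattice vertices. The following display is a sketch of that step, written in terms of the thresholds $u_j$ introduced with (5.10); the implied constants are not tracked.

```latex
\begin{align*}
P\Bigl(\max_{u \,\in\, \text{lattice}} U_j(u) > u_j\Bigr)
  &\le \sum_{u \,\in\, \text{lattice}} P\{U_j(u) > u_j\}
   = O\bigl(n^{2c}\bigr)\, O\bigl(n^{-C_9}\bigr)
   = O\bigl(n^{2c - C_9}\bigr), \\
C_9 \ge 2c + 2 \;\Longrightarrow\;
\sum_{n \ge 1} n^{2c - C_9} &\le \sum_{n \ge 1} n^{-2} < \infty .
\end{align*}
```

Summability of these probabilities is exactly what the Borel–Cantelli lemma needs in order to conclude that the exceedance occurs only finitely often, with probability 1.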
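Define $\bar g(x\mid\theta,c,\lambda)$ to equal the integral of $\xi(c,\lambda)^{-1}g(z)$ over $z \in \mathcal{C}(x\mid\theta,c,\lambda)$. Given two bounded functions $a$ and $b$ defined in the plane, and a smooth, rectifiable, planar curve $\mathcal{C}$ of finite length $|\mathcal{C}|$, put
$$\|a - b\|_{\mathcal{C}} = \left[|\mathcal{C}|^{-1}\int_{\mathcal{C}}\{a(z) - b(z)\}^2\,ds\right]^{1/2}.$$
The conditions assumed of $g$ imply that $\gamma = g + O(h^2)$, whence it follows that
$$B_1(x\mid\theta,c,\lambda)^{1/2} = \|g - \bar g(x\mid\theta,c,\lambda)\|_{\mathcal{C}(x\mid\theta,c,\lambda)} + O(h^2), \tag{5.12}$$
in the uniform$^\dagger$ sense. Moreover, writing $b_n = b_n(x)$ for a sequence of positive functions satisfying $b_n(x) \asymp 1$ uniformly in $x \in \mathcal{S}_\varepsilon$, we claim that
$$\|g - \bar g(x\mid\theta,c,\lambda)\|_{\mathcal{C}(x\mid\theta,c,\lambda)} = b_n^{1/2}\,(\eta + \lambda h\zeta)\,\lambda h \tag{5.13}$$
in the uniform$^\dagger$ sense.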
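The curve norm used in (5.12) and (5.13) is a root-mean-square difference along a curve, normalised by the curve's length. A numerical sketch is given below, assuming the curve is supplied as a polyline and using the midpoint rule on each segment; the function name, the example parabola and the example surface are hypothetical and serve only to illustrate the definition.

```python
import math


def curve_norm(a, b, points):
    """Approximate ||a - b||_C for a curve C given as a polyline through `points`.

    Implements [ |C|^{-1} * integral over C of {a(z) - b(z)}^2 ds ]^{1/2}
    with the midpoint rule on each straight segment of the polyline.
    """
    length = 0.0
    integral = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        ds = math.hypot(x1 - x0, y1 - y0)           # segment length
        mid = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)    # segment midpoint
        integral += (a(mid) - b(mid)) ** 2 * ds
        length += ds
    return math.sqrt(integral / length)


def surface(z):
    """Illustrative surface g(z) = z_2 (an assumption, not from the paper)."""
    return z[1]


def zero(z):
    return 0.0


# Illustrative parabola segment through the origin, sampled at 401 points.
ts = [k / 200.0 - 1.0 for k in range(401)]
points = [(t, 0.25 * t * t) for t in ts]
print(curve_norm(surface, zero, points))
```

When the two functions differ by a constant the norm returns the absolute value of that constant, whatever the curve, which is a convenient sanity check on any discretisation.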
To derive (5.13), note that each point on the curve segment $\mathcal{C}(x\mid\theta,c,\lambda)$ (the length of which is asymptotic to $2\lambda h$) is distant $O\{(\eta + \lambda h\zeta)\,\lambda h\}$ from the nearest point on the true contour line that passes through $x$. Moreover, along a portion of the curve segment, the portion having length equal to at least a constant multiple of $\lambda h$ for all sufficiently large $n$, the nearest distance is at least a constant multiple of $(\eta + \lambda h\zeta)\,\lambda h$. Let $\bar g_{\rm cont}(x\mid\lambda)$ denote the average of $g(z)$ for $z$ in the contour segment $\mathcal{D}(x\mid\lambda)$. In view of (C1g) and the results just noted,
$$\|g - \bar g(x\mid\theta,c,\lambda)\|_{\mathcal{C}(x\mid\theta,c,\lambda)} - \|g - \bar g_{\rm cont}(x\mid\lambda)\|_{\mathcal{D}(x\mid\lambda)} = b_n^{1/2}\,(\eta + \lambda h\zeta)\,\lambda h, \tag{5.14}$$
where $b_n$ has the properties claimed of the quantity at (5.13). A similar argument shows that, since $g$ has two Hölder-continuous derivatives in $\mathcal{S}$,
$$\|g - \bar g(x\mid\theta_x,c_x,\lambda)\|_{\mathcal{C}(x\mid\theta_x,c_x,\lambda)} - \|g - \bar g_{\rm cont}(x\mid\lambda)\|_{\mathcal{D}(x\mid\lambda)} = O\{(\lambda h)^{2+t}\}, \tag{5.15}$$
where $t > 0$ depends on the Hölder exponent. But by definition of the contour line $\mathcal{D}(x)$, $\bar g_{\rm cont}(x\mid\lambda) = g(x)$ and $g(z) = g(x)$ for all $z \in \mathcal{D}(x)$, and so (5.14) and (5.15) are respectively identical to (5.13) and
$$\|g - \bar g(x\mid\theta_x,c_x,\lambda)\|_{\mathcal{C}(x\mid\theta_x,c_x,\lambda)} = O\{(\lambda h)^{2+t}\}. \tag{5.16}$$
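Combining (5.12) and (5.13) we deduce that
$$B_1(x\mid\theta,c,\lambda) = b_n\{(\eta + \lambda h\zeta)\,\lambda h\}^2 + O\{(\eta + \lambda h\zeta)\,\lambda h^3 + h^4\}. \tag{5.17}$$
Likewise, (5.16) and the version of (5.12) for $(\theta, c) = (\theta_x, c_x)$ give
$$B_1(x\mid\theta_x,c_x,\lambda) = O\{(\lambda h)^{4+2t} + h^4\}. \tag{5.18}$$
Combining (5.9), (5.11), (5.17) and (5.18), and noting that the quantity $(\eta + \lambda h\zeta)\,\lambda h^3$ at (5.17) is of smaller order than the term $(\eta + \lambda h\zeta)\,\lambda h^3(\log n)^{1/2}$ at (5.11), we see that with probability 1, uniformly in $x \in \mathcal{S}_\varepsilon$ for each $\varepsilon > 0$,
$$S(x\mid\theta,c,\lambda) - S(x\mid\theta_x,c_x,\lambda) = b_n\{(\eta + \lambda h\zeta)\,\lambda h\}^2 + O\{\delta_n + (\lambda h)^{4+2t} + h^4\}. \tag{5.19}$$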
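Therefore, in the operation of minimising $S(x\mid\theta,c,\lambda)$ over $\theta$ and $c$, $\epsilon_n \equiv \eta + \lambda h\zeta$ can be made as small as a sufficiently large constant multiple of $\epsilon_n^0 \equiv (\lambda h)^{-1}\{\delta_n^{1/2} + (\lambda h)^{2+t} + h^2\}$. Now, the relation $\epsilon_n \asymp \epsilon_n^0$ is equivalent to
$$\epsilon_n \asymp (\lambda^2h)^{-2/3}(\log n)^{1/3} + (\lambda h)^{1+t}. \tag{5.20}$$
Note too that the property $(\lambda^2h)^{-2/3}(\log n)^{1/3} + (\lambda h)^{1+t} = O(\epsilon_n)$ implies (5.10). Therefore, with probability 1,
$$\langle\hat\theta_x - \theta_x\rangle + \lambda h\,|\hat c_x - c_x| = O\{(\lambda^2h)^{-2/3}(\log n)^{1/3} + (\lambda h)^{1+t}\}.$$
The theorem follows directly from this result.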
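The first term at (5.20) comes from balancing the size of $\eta + \lambda h\zeta$ against the leading stochastic term of (5.11). The display below is a sketch of that balance; it writes $\epsilon_n$ for $\eta + \lambda h\zeta$ and assumes that the first term of $\delta_n$ dominates the second. The remaining term at (5.20) is simply $(\lambda h)^{-1}(\lambda h)^{2+t} = (\lambda h)^{1+t}$.

```latex
\begin{align*}
\epsilon_n \asymp (\lambda h)^{-1}\,\delta_n^{1/2}
  &\asymp (\lambda h)^{-1}\,\epsilon_n^{1/4}\, h^{1/2} (\log n)^{1/4}
  && \text{using } \delta_n \asymp \epsilon_n^{1/2}\, h\, (\log n)^{1/2}, \\
\epsilon_n^{3/4} &\asymp \lambda^{-1} h^{-1/2} (\log n)^{1/4}, \\
\epsilon_n &\asymp \lambda^{-4/3} h^{-2/3} (\log n)^{1/3}
  = (\lambda^{2} h)^{-2/3} (\log n)^{1/3}.
\end{align*}
```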
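5.2. Proof of Theorem 3.2
In a slight abuse of notation, write $\check g$, $\bar\gamma$, $\mathcal{C}$ and $\xi$ for $\check g(x\mid\tilde\theta_x,\tilde c_x,\lambda_0)$, $\bar\gamma(x\mid\tilde\theta_x,\tilde c_x,\lambda_0)$, $\mathcal{C}(x\mid\tilde\theta_x,\tilde c_x,\lambda_0)$ and $\xi(\tilde c_x,\lambda_0)$, and let $\check g_0$, $\bar\gamma_0$, $\mathcal{C}_0$ and $\xi_0$ denote the respective versions of those quantities when $(\tilde\theta_x,\tilde c_x)$ is replaced by $(\theta_x, c_x)$. In a slight change of notation from the previous proof, put $\eta = \eta(x) = \langle\tilde\theta_x - \theta_x\rangle$ and $\zeta = \zeta(x) = |\tilde c_x - c_x|$.
Standard methods of strong approximation, similar to those used to derive (5.11), may be used to show that under the conditions of the theorem, $|\hat g(z) - \gamma(z)| = O\{h^2(\log n)^{1/2}\}$ and $|D(\hat g - \gamma)|(z) = O\{h(\log n)^{1/2}\}$ uniformly in $z \in \mathcal{S}_\varepsilon$, for each $\varepsilon > 0$, with probability 1. Using this result, (5.4), (5.5) and the representations
$$\check g - \bar\gamma = \xi^{-1}\int_{\mathcal{C}}(\hat g - \gamma), \qquad \check g_0 - \bar\gamma_0 = \xi_0^{-1}\int_{\mathcal{C}_0}(\hat g - \gamma),$$
we may prove that with probability 1,
$$\bigl|\check g(x\mid\tilde\theta_x,\tilde c_x,\lambda_0) - \bar\gamma(x\mid\tilde\theta_x,\tilde c_x,\lambda_0) - \{\check g(x\mid\theta_x,c_x,\lambda_0) - \bar\gamma(x\mid\theta_x,c_x,\lambda_0)\}\bigr| = O\{(\eta + h\zeta)\,h^2(\log n)^{1/2}\}. \tag{5.21}$$
Similarly, using the fact that $|D(\gamma - g)|(z) = O(h)$ uniformly in $z \in \mathcal{S}_\varepsilon$, and applying (5.4), (5.5) and the relation
$$\bar\gamma(x\mid\tilde\theta_x,\tilde c_x,\lambda_0) - \bar\gamma(x\mid\theta_x,c_x,\lambda_0) = \xi^{-1}\int_{\mathcal{C}}(\gamma - g) - \xi_0^{-1}\int_{\mathcal{C}_0}(\gamma - g),$$
we may prove that
$$|\bar\gamma(x\mid\tilde\theta_x,\tilde c_x,\lambda_0) - \bar\gamma(x\mid\theta_x,c_x,\lambda_0)| = O\{(\eta + h\zeta)\,h^2\}. \tag{5.22}$$
Likewise, recalling that $\check g_{\rm cont}(x\mid\lambda_0)$ is the average value of $\hat g$ along the contour segment $\mathcal{D}(x\mid\lambda_0)$, and noting that, in view of the Hölder continuity of second derivatives of $g$, $\mathcal{D}(x\mid\lambda_0)$ and the parabola segment $\mathcal{C}(x\mid\theta_x,c_x,\lambda_0)$ are uniformly distant $h^2n^{-\varepsilon}$ apart, for some $\varepsilon > 0$, we may show that with probability 1,
$$|\check g(x\mid\theta_x,c_x,\lambda_0) - \check g_{\rm cont}(x\mid\lambda_0)| = o(h^2). \tag{5.23}$$
Combining (5.21)–(5.23) we deduce that with probability 1, and uniformly in $x \in \mathcal{S}_\varepsilon$,
$$\check g(x\mid\tilde\theta_x,\tilde c_x,\lambda_0) - \check g_{\rm cont}(x\mid\lambda_0) = O\{(\eta + h\zeta)\,h^2(\log n)^{1/2}\} + o(h^2). \tag{5.24}$$
The theorem follows from this property and (3.4).
5.3. Proof of Theorem 3.3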
Here we show that $(C_K)$, (3.8) and (3.9) are sufficient for (3.4) when $h = c_1n^{-2/11}$ and $\lambda_0 = c_2n^{1/11}$, and that estimators $(\tilde\theta_x, \tilde c_x)$ satisfying (3.9) are readily constructed when (3.8) holds.
The arguments leading to (5.21) and (5.22) apply as before, although the terms $\zeta h$ and $h^2$ on the right-hand sides of those formulae should be replaced by $\zeta\lambda_0h$ and $(\lambda_0h)^2$, respectively. Therefore, in view of (3.9), for the present choices of $h$ and $\lambda_0$, the right-hand sides of (5.21) and (5.22) equal $o(h^2)$ with probability 1, uniformly in $x \in \mathcal{S}_\varepsilon$.
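For some $\xi > 0$,
$$|E\hat g(y_1) - g(y_1) - \{E\hat g(y_2) - g(y_2)\}| = O(h^2\,|y_1 - y_2|^{\xi}),$$
$$|\hat g(y_1) - E\hat g(y_1) - \{\hat g(y_2) - E\hat g(y_2)\}| = O\{(nh^2\lambda_0)^{-1/2}\,(\|y_1 - y_2\|/h)^{\xi}\,(\log n)^{1/2}\}$$
uniformly in points $y_1 \in \mathcal{D}(x\mid\lambda_0)$ and $y_2 \in \mathcal{C}(x\mid\theta_x,c_x,\lambda_0)$ that are both distant $s$ from $x$ and are on the same side of $x$, and in $x \in \mathcal{S}_\varepsilon$. (In the case of the second identity the result holds with probability 1.) For some $\eta > 0$, $\|y_2 - y_1\| = O\{(\lambda_0h)^{2+2\eta}\} = O(h^{1+\eta})$, uniformly in pairs $(y_1, y_2)$. Therefore, with probability 1 the difference between the integral averages of $\hat g - g$ over $\mathcal{D}(x\mid\lambda_0)$ and $\mathcal{C}(x\mid\theta_x,c_x,\lambda_0)$ equals $o(h^2)$, uniformly in $x \in \mathcal{S}_\varepsilon$.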
Given $y_1$ and $y_2$ as before,
$$g(y_2) = g(y_1) + \sum_{i=1}^{2}(y_2 - y_1)^{(i)}g_i(y_1) + O(\|y_2 - y_1\|^2),$$
where $g_1$ and $g_2$ represent first partial derivatives, and bracketed superscripts denote vector components. Recall that $\|y_2 - y_1\| = o(h)$, uniformly in $(y_1, y_2)$, and observe that the integral average of $(y_2 - y_1)^{(i)}g_i(y_1)$ over $y_2 \in \mathcal{C}(x\mid\theta_x,c_x,\lambda_0)$ is bounded by a constant multiple of the integral average of $\|y_2 - y_1\|^2$, and hence equals $o(h^2)$. From this property and the fact that $g(y_1) = g(x)$ for each $y_1$ we deduce that the integral average of $g(y_2)$ over $\mathcal{C}(x\mid\theta_x,c_x,\lambda_0)$ equals the integral average of $g(y_1)$ over $\mathcal{D}(x\mid\lambda_0)$, plus a term equal to $o(h^2)$.
Combining these results we see that the difference between the integral averages of $\hat g$ over $\mathcal{D}(x\mid\lambda_0)$ and $\mathcal{C}(x\mid\theta_x,c_x,\lambda_0)$ equals $o(h^2)$, uniformly in $x \in \mathcal{S}_\varepsilon$. This is the analogue of (5.23) in the present setting. Combining this property and the versions of (5.21) and (5.22) we obtain the following version of (5.24): $\check g(x\mid\tilde\theta_x,\tilde c_x,\lambda_0) - \check g_{\rm cont}(x\mid\lambda_0) = o(h^2)$ uniformly in $x \in \mathcal{S}_\varepsilon$, with probability 1. This is equivalent to (3.4).
Next we show that, if (3.8) holds, estimators $\tilde\theta_x$ and $\tilde c_x$ can be constructed such that (3.9) is true. Note that, in view of the present choice of $h$ and $\lambda_0$, (3.9) is equivalent to
$$n^{2/11}(\log n)^{1/2}\sup_{x\in\mathcal{S}_\varepsilon}\langle\tilde\theta_x - \theta_x\rangle \to 0, \qquad n^{1/11}(\log n)^{1/2}\sup_{x\in\mathcal{S}_\varepsilon}|\tilde c_x - c_x| \to 0 \tag{5.25}$$
with probability 1. Now, (3.8) implies that, simply by forming the respective derivatives of $\hat g$, one may estimate first and second derivatives of $g$ with respective rates $n^{-(5/22)-\eta}$ and $n^{-(1/11)-\eta}$, for uniform convergence in $\mathcal{S}_\varepsilon$ with probability 1. Therefore we may estimate contour tangent angle and contour curvature with the same respective rates. (In fact we may achieve this end by fitting a local quadratic to contours, as suggested in Section 2.1.) Result (5.25) follows from this property.
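The powers of $n$ appearing at (5.25) can be recovered by exponent bookkeeping. The sketch below assumes the modified bound described earlier in this proof, namely $(\eta + \lambda_0h\zeta)(\lambda_0h)^2(\log n)^{1/2}$, and asks which powers of $n$ multiply $\eta$ and $\zeta$ once that bound is divided by $h^2$. The variable names are hypothetical and constants such as $c_1$ and $c_2$ are ignored.

```python
from fractions import Fraction

# Exponents of n in h = c1 * n**(-2/11) and lambda0 = c2 * n**(1/11).
h_exp = Fraction(-2, 11)
lam_exp = Fraction(1, 11)

# Exponent of n in lambda0 * h.
lam_h_exp = lam_exp + h_exp

# Bound: (eta + lambda0*h*zeta) * (lambda0*h)**2 * (log n)**(1/2).
# Dividing by h**2 leaves lambda0**2 multiplying eta and
# lambda0**3 * h multiplying zeta; compute their exponents of n.
eta_factor_exp = 2 * lam_h_exp - 2 * h_exp
zeta_factor_exp = 3 * lam_h_exp - 2 * h_exp

print("lambda0 * h      ~ n^", lam_h_exp)
print("factor on eta    ~ n^", eta_factor_exp)
print("factor on zeta   ~ n^", zeta_factor_exp)
```

Under these assumptions the two factors carry the exponents 2/11 and 1/11, matching the normalisations shown at (5.25).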
If, when using the local quadratic contour estimation method outlined in Section 2.1, we choose the bandwidth for $\hat g$ to be $h = n^{-(3-\xi_2)/22}$, where $0 < \xi_2 < 9\xi_1/(2 + 3\xi_1)$ and $\xi_1 \in (0, \tfrac{1}{3})$ is the Hölder coefficient mentioned in (3.8), then for some $\eta > 0$ the convergence rates $n^{-(5/22)-\eta}$ and $n^{-(1/11)-\eta}$ (for, respectively, first and second derivatives of $g$) mentioned in the previous paragraph are obtained. It follows that the local quadratic contour estimators also enjoy these rates.
Acknowledgments
Helpful comments of an editor and two reviewers have helped improve the paper.
References
[1] M.-Y. Cheng, J. Fan, J.S. Marron, On automatic boundary corrections, Ann. Statist. 25 (1997) 1691–1708.
[1a] A. Cowling, P. Hall, On pseudodata methods for removing boundary effects in kernel density estimation, J. Roy. Statist. Soc. Ser. B 58 (1996) 551–563.
[2] R.A. Djojosugito, P.L. Speckman, Boundary bias correction in nonparametric density estimation, Comm. Statist. Theory Methods 21 (1992) 69–88.
[3] R.L. Eubank, P. Speckman, A bias reduction theorem with applications in nonparametric regression, Scand. J. Statist. 18 (1991) 211–222.
[4] J. Fan, Local linear regression smoothers and their minimax efficiencies, Ann. Statist. 21 (1993) 196–216.
[6] T. Gasser, H.G. Mu¨ller, Kernel estimation of regression functions, in: T. Gasser, M. Rosenblatt (Eds.), Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, Vol. 757, Springer, New York, 1979, pp. 23–68.
[7] T. Gasser, H.G. Mu¨ller, V. Mammitzsch, Kernels for nonparametric curve estimation, J. Roy. Statist. Soc. Ser. B 47 (1985) 238–252.
[8] B.L. Granovsky, H.G. Mu¨ller, Optimizing kernel methods: a unifying variational principle, Internat. Statist. Rev. 59 (1991) 373–388.
[9] P. Hall, C. Huber, A. Owen, A. Coventry, Asymptotically optimal balloon density estimates, J. Multivariate Anal. 51 (1994) 352–371.
[10] P. Hall, T.E. Wehrly, A geometrical method for removing edge effects from kernel-type nonparametric regression estimators, J. Amer. Statist. Assoc. 86 (1991) 665–672.
[11] T. Hastie, C. Loader, Local regression: automatic kernel carpentry, Statist. Sci. 8 (1993) 120–143. [12] M.C. Jones, Simple boundary correction for kernel density estimation, Statist. Comput. 3 (1993)
135–146.
[13] H.G. Mu¨ller, Smooth optimal kernel estimators near endpoints, Biometrika 78 (1991) 521–530. [14] H.G. Mu¨ller, U. Stadtmu¨ller, Multivariate boundary kernels and a continuous least squares principle,
J. Roy. Statist. Soc. Ser. B 61 (1999) 439–458.
[15] J. Rice, Boundary modification for kernel estimators, Commun. Statist. Theory Methods 13 (1984) 893–900.
[16] D. Ruppert, M.P. Wand, Multivariate locally weighted least squares regression, Ann. Statist. 22 (1994) 1346–1370.
[17] E.F. Schuster, Incorporating support constraints into nonparametric estimators of densities, Commun. Statist. Theory Methods 14 (1985) 1123–1136.
[18] D.W. Scott, Multivariate Density Estimation—Theory, Practice and Visualization, Wiley, New York, 1992.
[19] J.G. Staniswalis, K. Messer, D.R. Finston, Kernel estimators for multivariate regression, J. Nonparamet. Statist. 3 (1993) 103–121.
[20] J.G. Staniswalis, K. Messer, Addendum to Staniswalis, Messer and Finston (1993), J. Nonparamet. Statist. 7 (1996) 67–68.
[21] C.J. Stone, Optimal rates of convergence for nonparametric estimators, Ann. Statist. 8 (1980) 1348–1360.