
Journal of Multivariate Analysis 57, 191-214 (1996)

Estimation of the Location of the Maximum of a

Regression Function Using Extreme Order Statistics*

Hung Chen

† Department of Mathematics, National Taiwan University, Taipei, Taiwan 10764, Republic of China

Mong-Na Lo Huang and Wen-Jang Huang

Institute of Applied Mathematics, National Sun Yat-sen University, Kaohsiung, Taiwan 80424, Republic of China

In this paper, we consider the problem of approximating the location, x0 ∈ C, of a maximum of a regression function, θ(x), under certain weak assumptions on θ. Here C is a bounded interval in R. A specific algorithm considered in this paper is as follows. Taking a random sample X1, ..., Xn from a distribution over C, we have (Xi, Yi), where Yi is the outcome of a noisy measurement of θ(Xi). Arrange the Yi's in nondecreasing order and take the average of the r Xi's which are associated with the r largest order statistics of the Yi's. This average, x̂0, will then be used as an estimate of x0. The utility of such an algorithm with fixed r is evaluated in this paper. To be specific, the convergence rates of x̂0 to x0 are derived. Those rates will depend on the right tail of the noise distribution and the shape of θ(·) near x0. © 1996 Academic Press, Inc.

1. Introduction

Let θ be a real function defined on a bounded interval C ⊂ R, and suppose there is an x0 ∈ C with θ(x0) > θ(x) for any x ≠ x0 in C. It is further assumed that θ(·) is continuous. The objective is to determine x0 based on n samples (X1, Y1), ..., (Xn, Yn) with Yi = θ(Xi) + εi, where n is a predetermined number. Here {εi} are independent and identically distributed (i.i.d.) random variables with zero expectation.

Article No. 0029.

Received April 28, 1994; revised July 1995.

AMS 1991 subject classifications: primary 62G05, 62G20, 62G30; secondary 62N99.
Key words and phrases: extreme-value distribution, global function optimization from noisy samples, errors of measurement, ranking selection.

0047-259X/96 $18.00
Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

* Part of this research is supported by the National Science Council of the Republic of China.
† Hung Chen's research is also sponsored by the National Science Foundation under Grant


method used in Changchien [5]. This method has been used to search for the optimum range of burden distribution indices of a blast furnace used to extract iron from large quantities of iron-bearing materials. A quick introduction to metal production using an electric furnace can be found in Lawson [12]. More explicitly, for given n samples (X1, Y1), ..., (Xn, Yn), where X


under certain situations the noise level is low. Intuitively, the regression function value associated with the largest order statistic of the Yi's should be large compared to θ(X1), ..., θ(Xn). It is also clear that the central order statistics of the Yi's cannot carry asymptotic information concerning x0 without additional global conditions on θ(·). On the other hand, the second largest order statistic of


[θ(X1), ..., θ(Xn)]. To answer this question, we reformulate the problem as follows. Suppose (Yi, Zi) (i = 1, 2, ..., n) is a random sample of n observations of a bivariate random variable (Y, Z) with Y = Z + ε, where Z and ε are independent variables. Note that only the Yi's are observed and the corresponding Zi is θ(Xi). Some notation will be introduced first. If we place the Yi's in nondecreasing order as Y_{1:n} ≤ ··· ≤ Y_{n:n}, then the Z


Since

P(|X − x0| ≤ (z/c3)^{1/ρ}) ≤ P(θ(x0) − θ(X) ≤ z) ≤ P(|X − x0| ≤ (z/c4)^{1/ρ}),

Z satisfies Condition R with τ = 1/ρ. As an example, if θ(x) is twice differentiable near x0 and θ''(x0) < 0, then Condition R holds with τ = 1/2 (ρ = 2).

Throughout this paper, we assume that ε satisfies either Condition E(1) or E(2), described in the following.

Condition E. (1) w_ε < ∞ and, for some k ≥ 0, f_ε and F_ε satisfy

(−1)^k f_ε^{(k)}(w_ε) > 0,  f_ε^{(j)}(w_ε) = 0 for every 0 ≤ j ≤ k − 1,  and  lim_{t↑w_ε} (w_ε − t) f_ε(t)/[1 − F_ε(t)] = k + 1.

(2) w_ε = ∞ and f_ε satisfies

f_ε(x) ∼ ABv x^{−u+v−1} exp(−B x^v) as x → ∞,

where v > 1, u ≥ 0, and A, B are positive constants.

Here ``g(x) ∼ h(x) as x → ∞'' denotes lim_{x→∞} g(x)/h(x) = 1.

Theorem 1. Suppose Y = Z + ε, where Z and ε are independent. Let Z satisfy Condition R. Then

(a) Z_{[n−l:n]} − w_Z = O_p((log n/n)^{1/(1+(k+1)/τ)}) for all l < r under Condition E(1);

(b) Z_{[n−l:n]} − w_Z = O_p((log n)^{−((v−1)/v)τ} (log log n)^τ) for all l < r under Condition E(2).

Since we only consider the case that r is fixed throughout this paper, for simplicity we denote x̂0(r) by x̂0. We now describe the asymptotic behavior of x̂0 − x0, which follows from Theorem 1 and Example 1.
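The estimator x̂0(r) itself takes only a few lines of code. The following Python sketch (ours — the paper gives no code, and the function name and sample sizes are our choices) makes the definition concrete:

```python
import numpy as np

def best_r_average(x, y, r):
    """Best-r-points-average estimate of the location of the maximum.

    Averages the r design points x_i whose responses y_i are the
    r largest order statistics of y, as described in the paper.
    """
    x, y = np.asarray(x), np.asarray(y)
    top_r = np.argsort(y)[-r:]          # indices of the r largest y's
    return x[top_r].mean()

# Example: theta(x) = 1 + 3 exp(-(x-0.5)^2/0.01) with uniform noise of variance 0.25.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
theta = 1 + 3 * np.exp(-(x - 0.5) ** 2 / 0.01)
y = theta + rng.uniform(-np.sqrt(3 * 0.25), np.sqrt(3 * 0.25), size=50)
print(best_r_average(x, y, r=5))        # should land near x0 = 0.5
```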

Theorem 2. Assume there exist some positive constants c3 and c4 such that (1) holds. Then x̂0 − x0 = O_p((log n/n)^{1/(1+(k+1)/τ)}) or O_p((log n)^{−((v−1)/v)τ} (log log n)^τ) when ε satisfies Condition E(1) or E(2), respectively.

Remark 1. When ε is uniformly distributed, Condition E(1) is satisfied with k = 0. Theorem 2 states that x̂0 − x0 = O_p((log n)^{1/2} n^{−1/2}) when θ(·) is ``wedge-shaped'' (τ = 1) around x = x0. If θ(x) is twice differentiable near x0 and θ''(x0) < 0, then x̂0 − x0 = O_p((log n)^{1/3} n^{−1/3}). Muller [15] proposes an estimate of x0, x̂0^M, obtained by estimating θ(x0) with a kernel smoother. If |θ(x) − θ(x0)| ≥ c|x − x0|^ρ for some c > 0 and ρ ≥ 1 in a neighborhood of x0, then x̂0^M − x0 = O_p([(n/log n)^{−2/5}]^τ). Comparing the estimate obtained by the best-r-points-average method with the one in Muller [15] and with the so-called passive stochastic approximation method in Tsybakov [18], whose convergence rate is about the same as that in Muller [15], it is easy to see that the estimator in this paper is better when ε is a uniform random variable. On the other hand, when ε is a normal random variable (i.e., u = 1, v = 2 in Condition E(2)), Theorem 2 states that x̂0 − x0 = O_p([(log n)^{−1/2}]^τ (log log n)^τ), which implies that this estimator is not as good as those considered in Muller [15] and Tsybakov [18]. But the estimator based on the best-r-points-average is much simpler and more easily understood, so it can be implemented easily in practical applications. Furthermore, the result derived in Muller [15] cannot be improved even when ε is known to be uniformly distributed.
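As a quick check of the uniform case in Remark 1 (our verification, not part of the original text), write ε ~ U[−a, a], so that w_ε = a, f_ε(t) = 1/(2a), and 1 − F_ε(t) = (a − t)/(2a); Condition E(1) then holds with k = 0:

```latex
% Uniform noise on [-a,a]: Condition E(1) with k = 0.
% Here f_\varepsilon(t) = 1/(2a), 1-F_\varepsilon(t) = (a-t)/(2a), w_\varepsilon = a.
\[
(-1)^0 f_\varepsilon(w_\varepsilon) = \tfrac{1}{2a} > 0,
\qquad
\lim_{t \uparrow a} \frac{(a-t)\, f_\varepsilon(t)}{1-F_\varepsilon(t)}
  = \lim_{t \uparrow a} \frac{(a-t)\cdot \tfrac{1}{2a}}{\tfrac{a-t}{2a}} = 1 = k+1 .
\]
```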

Remark 2. Theorem 2 states that the best-r-points-average method for locating the peak works better under Condition E(1) than under Condition E(2). By formulating the problem in the framework of the ranking selection problem as described in Section 4, the rate of x̂0 − x0 depends critically on whether w_ε is finite or not and on the local behavior of F_ε near w_ε. In particular, the best-r-points-average method works best when w_ε < ∞ or ε has a short-tailed distribution. When the tail of f_ε is long, as discussed in Remark 1, other estimates, such as that in Muller [15], are perhaps more suitable.

3. Extreme-Value Distributions

To facilitate the discussions in Section 4, we briefly review those aspects of extreme-value distribution theory, on the topic of the domain of attraction, which can be found in Section 5.1 of Reiss [16]. They are summarized in two lemmas and will be used repeatedly in Sections 4 and 5.

Let ε1, ..., εn be independent random variables with common distribution F_ε(·). Let ε_{n:n} = max(ε1, ..., εn). Denote the (left continuous) inverse of F_Y by F_Y^{−}(u), defined as inf{y : F_Y(y) ≥ u}. The distribution F_ε is said to be in the domain of (maximum) attraction of a distribution G (written F_ε ∈ D(G)) if there are {a_n} (a_n > 0) and {b_n} so that

lim_{n→∞} F_ε^n(a_n x + b_n) = G(x)

at every continuity point of G. It is well known (see, e.g., Reiss [16]) that G must belong, up to location and scale, to one of the three classes of extreme-value distributions described in the following lemma.

Lemma 1. Suppose there exist a_n > 0, b_n ∈ R, n ≥ 1, such that

P[(ε_{n:n} − b_n)/a_n ≤ x] = F_ε^n(a_n x + b_n) → G(x)

weakly as n → ∞, where G is assumed to be nondegenerate. Then, with suitable numbers A > 0 and B, G(Ax + B) belongs to one of the following three classes:

(a) Φ_α(x) = exp(−x^{−α}) if x > 0, and 0 if x ≤ 0, for some α > 0;
(b) Ψ_α(x) = 1 if x > 0, and exp(−(−x)^α) if x ≤ 0, for some α > 0;
(c) Λ(x) = exp(−e^{−x}), x ∈ R.

To prove Theorem 1, we need a bound for the variational distance

e_n = sup_{x ∈ R} |F_ε^n(a_n x + b_n) − G(x)|,

where F_ε ∈ D(G), between the exact and limiting distributions. The following lemma gives a prescription for the choice of the normalizing constants {a_n} and {b_n}.

Lemma 2. Assume that f_ε is positive on (t0, w_ε), where t0 < w_ε.

(a) If w_ε = ∞ and

lim_{t→∞} t f_ε(t)/(1 − F_ε(t)) = α  (2)

with some α > 0, then there are constants a_n > 0 and b_n such that the distribution of (ε_{n:n} − b_n)/a_n converges to Φ_α. Moreover, the constants can be chosen as a_n = F_ε^{−}(1 − 1/n) and b_n = 0.

(b) If w_ε < ∞ and lim_{t↑w_ε} (w_ε − t) f_ε(t)/[1 − F_ε(t)] = α, then there are constants a_n > 0 and b_n such that the distribution of (ε_{n:n} − b_n)/a_n converges to Ψ_α. Moreover, the constants can be chosen as a_n = w_ε − F_ε^{−}(1 − 1/n) and b_n = w_ε.

(c) If ∫^{w_ε} (1 − F_ε(t)) dt < ∞ and

lim_{t↑w_ε} [f_ε(t)/[1 − F_ε(t)]²] ∫_t^{w_ε} (1 − F_ε(s)) ds = 1,

then there are constants a_n > 0 and b_n such that the distribution of (ε_{n:n} − b_n)/a_n converges to Λ. Moreover, the constants can be chosen as a_n = [n f_ε(b_n)]^{−1} and b_n = F_ε^{−}(1 − 1/n).

(d) e_n = o(1) if one of (a), (b), and (c) applies.

If none of (a), (b), and (c) applies to F_ε(t), then there are no constants a_n > 0 and b_n such that the distribution of (ε_{n:n} − b_n)/a_n would converge.
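As a concrete illustration of Lemma 2(c) (our example; the normal error case is only discussed later, in Section 4), the norming constants for the standard normal can be computed numerically and checked against the Gumbel limit. The sample size and replication count below are arbitrary choices; agreement is only approximate since a_n → 0 slowly, at rate (2 log n)^{−1/2}:

```python
import numpy as np
from scipy.stats import norm, gumbel_r

n = 10_000
bn = norm.ppf(1 - 1 / n)          # b_n = F^{-1}(1 - 1/n), Lemma 2(c)
an = 1 / (n * norm.pdf(bn))       # a_n = [n f(b_n)]^{-1}

rng = np.random.default_rng(1)
maxima = rng.standard_normal((1000, n)).max(axis=1)
z = (maxima - bn) / an            # should be approximately Gumbel (Lambda)
print(np.mean(z <= 0), gumbel_r.cdf(0))   # both near exp(-1) ~ 0.368
```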

4. Discussion and Monte Carlo Study

In this section, we give a heuristic argument to illuminate when and why the best-r-points-average method for locating the maximum works. In order to get some idea of the finite sample properties of the best-r-points-average method, we also run a Monte Carlo study as in Muller [15] with r = 1, 5. The results are summarized in Tables II and III, which indicate the advantage of using r > 1. We also compare our simulation results with those in Muller [15]. The theoretical development in this paper states that the best-r-points-average method can be useful in locating the global maximum of a regression function with a local maximum. A Monte Carlo experiment is conducted to confirm this.

We now give a heuristic argument to explain why the convergence rates of the proposed estimate should depend on the behavior of the tail of 1 − F_ε(x) as x increases. This argument is essentially used in Section 5 to prove Theorem 1. Although Z is assumed to be a continuous random variable in Section 2, we here consider the case where Z takes values on discrete levels 0 ≤ θ_{n1} < ··· < θ_{nK} ≤ 1, for some integer K ≥ 1. Let n = KN, where N is an integer. For each θ_{nj}, we further assume that there are N samples from Y = θ_{nj} + ε. In other words, we have i.i.d. random variables Y_{j1}, ..., Y_{jN} from the jth population with distribution F_ε(· − θ_{nj}) for 1 ≤ j ≤ K. It is then clear that the utility of the best-r-points-average method with r = 1 depends on whether the location parameter associated with the population yielding Y_{n:n} is close to θ_{nK}. This problem can then be viewed as using the largest order statistic from each population to discriminate among the location parameter families F_ε(· − θ) for θ ∈ {θ_{n1}, ..., θ_{nK}}. Obviously, this problem is related to the ranking selection problem introduced by Bechhofer [2].

When Y_{j1}, ..., Y_{jN} follow the distribution F_ε(· − θ_{nj}), the sample mean is the complete sufficient statistic for θ_{nj} when ε is normally distributed. In this case, it seems reasonable to use the sample means from those K populations to discriminate the location parameter families F_ε(· − θ). In the above setting, Muller's curve fitting approach [15] reduces to discriminating the location parameter families {F_ε(· − θ); θ = θ_{n1}, ..., θ_{nK}} with sample means. When ε is uniformly distributed, the largest order statistics from those K populations can be used to discriminate the location parameter families F_ε(· − θ) effectively.

It is then expected that the estimate derived by Muller's curve fitting approach will converge to x0 at a faster rate than the estimate derived from the best-r-points-average method with r = 1 for normal error. Also, Muller's estimate should have a slower convergence rate than the estimate obtained by the best-r-points-average method with r = 1 for uniform error. This conjecture is confirmed by Theorem 2 and the discussions at the end of Section 2.

Next, we will demonstrate that the best-r-points-average method fails when the right tail of the error distribution is heavy. Let now ε1, ε2, ..., εN be a random sample of size N from a unit double exponential distribution. Then F_ε ∈ D(Λ) with a_N = 1 and b_N = log N by Lemmas 1 and 2. Therefore, the largest order statistic from the Kth population (with location parameter θ_{nK}) is not necessarily greater than the largest order statistic from the first population (with location parameter θ_{n1}) with probability 1, even when θ_{n1} = 0 and θ_{nK} = 1. According to the above discussion, it is expected that the best-r-points-average method will fail to give a consistent estimate of x0 when the limit of a_n is nonzero.
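This failure mechanism can be illustrated numerically. The sketch below (ours; the population gap θ_{nK} − θ_{n1} = 1 and the sample sizes mirror the discussion above) estimates the probability that the top population fails to produce the overall maximum under double exponential noise:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000                       # samples per population
reps = 2000
# Population 1 has location 0; population K has location 1.
m0 = rng.laplace(0.0, 1.0, (reps, N)).max(axis=1)
m1 = rng.laplace(1.0, 1.0, (reps, N)).max(axis=1)
# With a_N = 1 the normalized maxima stay nondegenerate, so the
# ordering of the two maxima is wrong with nonvanishing probability.
print(np.mean(m1 < m0))        # stays bounded away from 0 as N grows
```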

By Lemma 2(a), we have lim_{N→∞} a_N = ∞ for F ∈ D(Φ_α). Hence, we exclude those F ∈ D(Φ_α) from our study of the utility of the best-r-points-average method. Also, by Lemma 2(b), b_N is finite and lim_{N→∞} a_N = 0 for F ∈ D(Ψ_α). Therefore, we consider a class of distributions in D(Ψ_α) with a_N = O(N^{−1/(k+1)}), as described in Theorem 1(a). When F ∈ D(Λ), lim_{N→∞} a_N may take any nonnegative value, as is the case for the double exponential distribution with a_N = 1 and the normal distribution with a_N = (2 log N)^{−1/2}. Hence, we consider a class of distributions in D(Λ), as described in Theorem 1(b), whose tail is ``lighter'' than that of the double exponential distribution (with a_N = (B^{−1} log N)^{(1−v)/v} for v > 1).

Motivated by the problem of estimating the distance to a stellar system from measurements of the apparent magnitude of a few of the brightest objects in the system, Rohatgi [17] considers the problem of determining which of the objects should be observed. She finds that the extreme order statistics are asymptotically sufficient for estimating distance when the distribution of the apparent magnitude of a star in that galaxy is known up to a location parameter. Her conclusion is close in spirit to the above discussion.

In order to get some idea of how well the asymptotic results for the best-r-points-average method predict what transpires for finite samples, we consider a Monte Carlo study as in Muller [15] with r = 1, 5 and sample size n = 50. In this study, θ(x) = 1 + 3 exp(−(x − 0.5)²/0.01) (symmetric peak at (0.5, 4.0) = (x0, θ(x0))) with 50 points Xi from the uniform distribution over [0, 1]. Since the performance of the proposed estimate x̂0(r) depends on the error distribution, we consider three error distributions: uniform, normal, and double exponential. Note that Muller [15] only reports results for the normal error distribution. For ease of comparison with Muller's study, we also consider three noise levels with variances 0.25, 0.5, and 1.0. The number of Monte Carlo runs is 200. This experiment is repeated 100 times. Tables II-IV show a typical result from one of these one hundred experiments. The notation used in the tables is defined as follows. Let σ², Average, ASE, and Range denote the variance of the noise variable, the average estimated location, the average squared error for the location, and the range of the estimated locations, respectively. Also, 5.272−4 should be read as 5.272 × 10⁻⁴. We first give Table I, which is taken from Table 1 of Muller [15] and reflects the performance of his proposed procedure for adaptive peak estimation.

TABLE I
Muller's Adaptive Procedure for Peak Estimation

          σ² = 0.25   σ² = 0.5   σ² = 1.0
Average   0.5005      0.5012     0.5012
ASE       1.247−4     1.893−4    2.860−4
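For readers who want to replicate a cell of Tables II-IV, the following Python sketch (ours, not the authors' original code; function and parameter names are our choices) follows the stated design — n = 50 uniform design points, 200 runs, and the error scalings given in the table headings:

```python
import numpy as np

def best_r_average(x, y, r):
    return x[np.argsort(y)[-r:]].mean()

def mc_cell(error, sigma2, r, n=50, runs=200, seed=0):
    """Average and ASE of the best-r-points-average estimate over `runs` replications."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(sigma2)
    est = np.empty(runs)
    for i in range(runs):
        x = rng.uniform(0.0, 1.0, n)
        theta = 1 + 3 * np.exp(-(x - 0.5) ** 2 / 0.01)
        if error == "uniform":                 # U[-sigma*sqrt(3), sigma*sqrt(3)]
            eps = rng.uniform(-sigma * np.sqrt(3), sigma * np.sqrt(3), n)
        elif error == "normal":                # N(0, sigma^2)
            eps = rng.normal(0.0, sigma, n)
        else:                                  # (sigma/sqrt(2)) DE(1), variance sigma^2
            eps = rng.laplace(0.0, sigma / np.sqrt(2), n)
        est[i] = best_r_average(x, theta + eps, r)
    return est.mean(), np.mean((est - 0.5) ** 2)

print(mc_cell("uniform", 0.25, r=5))   # compare with Table II
```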

Tables I, II, and III indicate the following:

• The best-r-points-average method performs better when the error is uniform, in that it has a smaller ASE and a tighter range than when the error is normal. This is qualitatively consistent with Theorem 2, since according to Theorem 2, x̂0 − x0 = O_p((log n)^{1/3} n^{−1/3}) when the error distribution is uniform and O_p((log n)^{−1/4} (log log n)^{1/2}) when it is normal (taking τ = 1/2). Tables II and III also illustrate the advantage of using r > 1.

• The best-r-points-average method with r = 1, 5 is not as good as the adaptive procedure in Muller [15] from an ASE standpoint in this particular setting. Also, for r = 1, the faster rate of the best-1-point-average method claimed in Section 2 is not realized in the comparison of Tables I and II, perhaps because the sample size n = 50 is not large enough to reflect the asymptotic results.

TABLE II
Uniform Error, U[−σ√3, σ√3]

          σ² = 0.25           σ² = 0.5            σ² = 1.0
          r = 1     r = 5     r = 1     r = 5     r = 1     r = 5
Average   0.4998    0.4982    0.5006    0.4958    0.4988    0.4965
ASE       6.627−4   3.211−4   9.290−4   6.157−4   1.283−3   1.189−3

TABLE III
Normal Error, N(0, σ²)

          σ² = 0.25           σ² = 0.5            σ² = 1.0
          r = 1     r = 5     r = 1     r = 5     r = 1     r = 5
Average   0.5012    0.5014    0.5017    0.4995    0.5008    0.4966
ASE       7.521−4   5.310−4   1.106−3   1.311−3   2.480−3   2.470−3
Range     (0.437, 0.565)  (0.414, 0.601)  (0.387, 0.575)  (0.360, 0.668)  (0.270, 0.884)  (0.306, 0.669)

As a consequence, it is recommended to use r > 1 and to derive better asymptotic results such as the asymptotic distribution of x̂0 − x0. Research on the asymptotic distribution of the estimate based on the best-r-points-average method and on the practical choice of r is underway, and the results will be reported elsewhere.

As a remark, Chen [6] shows that a modified Kiefer-Wolfowitz procedure in Fabian [7] achieves the optimal rates of convergence. However, it is known that the finite-sample performance of the Kiefer-Wolfowitz procedure depends crucially on the choice of a starting point. The best-r-points-average method has the potential to be used in determining a ``good'' starting point for the Kiefer-Wolfowitz procedure.

According to the discussion at the beginning of this section, a problem with the best-r-points-average method is that it is not consistent when the tail of the error distribution is heavy. Table IV summarizes the Monte Carlo results for the setting in which the error distribution is double exponential.

TABLE IV
Double Exponential Error, (σ/√2) DE(1), with Peak (0.5, 4.0)

          σ² = 0.25           σ² = 0.5            σ² = 1.0
          r = 1     r = 5     r = 1     r = 5     r = 1     r = 5
Average   0.4968    0.4980    0.4909    0.5008    0.5076    0.5040
ASE       1.202−3   6.616−4   5.261−3   1.522−3   1.088−2   4.124−3

Table IV supports the discussion on the failure of the best-1-point-average method, as the range of the estimator is much wider than in the other cases and either the left or the right end point of the range interval is close to one of the boundary points of the design interval from which the sample is taken.

Based on a similar discussion about the range of the estimator, it might appear that the best-5-points-average method works. This is actually an artifact due to the facts that the peak is at 0.5 and the design points are uniformly distributed over [0, 1]. This explanation is supported by the following simulation study, in which we consider θ(x) = 1 + 3 exp(−(x − 0.7)²/0.01), with the peak at (0.7, 4.0), while the rest of the settings remain the same. The results are summarized in Table V.

TABLE V
Double Exponential Error with Peak (0.7, 4.0)

          σ² = 0.25           σ² = 0.5            σ² = 1.0
          r = 1     r = 5     r = 1     r = 5     r = 1     r = 5
Average   0.6968    0.6969    0.6926    0.6886    0.6708    0.6665
ASE       2.680−3   6.525−4   4.118−3   1.828−3   1.689−2   4.958−3
Range     (0.097, 0.793)  (0.565, 0.760)  (0.097, 0.799)  (0.508, 0.788)  (0.024, 0.976)  (0.333, 0.770)

Table V supports the preceding explanation of the results summarized in Table IV: the average of the estimators shifts away from 0.7, the maximizer in this example, as the variance gets larger. Tables IV and V now clearly indicate that the best-r-points-average method is not consistent for double exponential error, supporting the discussion on its failure when the tail of the error distribution is heavy.

As discussed in Section 1, some commonly used sequential approaches for estimating x0 may fail to approach the global maximum if the regression function has multiple stationary points. We now assess the performance of the best-r-points-average method when the regression function has two well-separated stationary points. Here we consider θ(x) = 3.2 exp(−(x − 0.4)²/0.01) + 4 exp(−(x − 0.6)²/0.01), with the peak (0.5769, 3.9365) and a local maximum (0.4053, 3.3811). The error distribution is uniform with variance 0.25. Based on 200 Monte Carlo runs with n = 50 and r = 1, there are 9 runs falling in (0.3900, 0.4380), and the rest are between 0.5036 and 0.6453. When n = 100, only 3 out of 200 runs are within 0.02 of the local maximizer 0.4053. This indicates that the best-r-points-average method can pick up the global maximum when the signal-to-noise ratio is large enough.
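A sketch of this two-peak experiment (ours, under the settings just stated) counts the runs captured by the local maximizer:

```python
import numpy as np

rng = np.random.default_rng(3)
runs, n, r = 200, 50, 1
sigma = 0.5                                # variance 0.25, uniform error
near_local = 0
for _ in range(runs):
    x = rng.uniform(0.0, 1.0, n)
    theta = 3.2 * np.exp(-(x - 0.4) ** 2 / 0.01) + 4 * np.exp(-(x - 0.6) ** 2 / 0.01)
    y = theta + rng.uniform(-sigma * np.sqrt(3), sigma * np.sqrt(3), n)
    est = x[np.argsort(y)[-r:]].mean()     # best-r-points-average estimate
    near_local += abs(est - 0.4053) < 0.02 # runs captured by the local maximum
print(near_local, "of", runs, "runs near the local maximizer")
```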


5. Proof of Theorem 1

Let {K_n} denote a sequence of positive integers such that K_n → ∞ and K_n/n → 0. For brevity we omit the subscript of K_n later on. Given K + 1 knots α_Z = t_0 < t_1 < ··· < t_{K−1} < t_K = w_Z, let I = [α_Z, w_Z] be partitioned into the subintervals

I_{Kj} = [t_{j−1}, t_j) for 1 ≤ j < K,  I_{KK} = [t_{K−1}, t_K].

Set 𝒥_{Kj} = {i : 1 ≤ i ≤ n and Z_i ∈ I_{Kj}} and denote the cardinality of 𝒥_{Kj} by N_j(K). Assume that n/K (≡ N) is an integer. Consider a particular choice of knots {t_{n1}, ..., t_{n,K−1}} such that N_j(K) ≡ N for 1 ≤ j ≤ K. For i ∈ 𝒥_{Kj}, denote those Z_i's by Z_{j1}, ..., Z_{jN}. We also denote the associated Y_i's and ε_i's by Y_{j1}, ..., Y_{jN} and ε_{j1}, ..., ε_{jN}, respectively. Arrange the Y_{jl} (ε_{jl}, respectively) in nondecreasing order as the order statistics Y_{l:N,j} (ε_{l:N,j}, respectively) for 1 ≤ l ≤ N.

Suppose that the following statement holds for r < J:

lim_{n→∞} P( inf_{K−r<l≤K} Y_{N:N,l} ≥ sup_{1≤j≤K−J} Y_{N:N,j} ) = 1.  (3)

By (3) and the definition of Y_{N:N,j}, we have w_Z − Z_{[n−r+1:n]} ≤ (J − r)/K_n with probability tending to one. In other words, Z_{[n−r+1:n]} − w_Z = O_p((J − r)K_n^{−1}).

Recall that Y = Z + ε. It follows easily that for 1 ≤ j ≤ K,

ε_{N:N,j} + t_{nj} > Y_{N:N,j} ≥ ε_{N:N,j} + t_{n,j−1}.  (4)

By (4) and the Bonferroni inequality,

P( inf_{K−r<l≤K} Y_{N:N,l} ≥ sup_{1≤j≤K−J} Y_{N:N,j} ) ≥ 1 − Σ_{l=K−r+1}^{K} Σ_{j=1}^{K−J} P(ε_{N:N,l} − ε_{N:N,j} ≤ t_{nj} − t_{n,l−1}).  (5)

Note that the Z_i's are independent of the ε_i's. This implies that {ε_{jl}}_{1≤l≤N, 1≤j≤K} are independent, since the new labels jl attached to the ε's are determined by the Z_i. Hence, the sample maxima ε_{N:N,j} for 1 ≤ j ≤ K are i.i.d. random variables. Denote d_{nlj} = t_{n,l−1} − t_{nj} > 0. Write

P(ε_{N:N,l} − ε_{N:N,j} ≤ −d_{nlj}) = ∫_{α_ε}^{w_ε} [F_ε(t − d_{nlj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt,  (6)

where N[F_ε(t)]^{N−1} f_ε(t) is the density function of ε_{N:N,j}. Unless the distribution of ε_{N:N,j} (suitably normalized) can be approximated by a limiting extreme-value distribution, (6) is difficult to evaluate directly. From now on, we assume that F_ε belongs to the domain of attraction of an extreme-value distribution. Then the limiting distribution of extreme order statistics must take one of the three forms of limiting extreme-value distribution. Refer to Lemmas 1 and 2 for the details.

Assume that F_ε^n(a_n t + b_n) → G(t), where G(t) is one of the extreme-value distribution functions described in Lemma 1. The right-hand side of (6) will be evaluated via the following:

∫_{a_0}^{b_0} [F_ε(t − d_{nlj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt
  = ∫_{(a_0−b_N−d_{nlj})/a_N}^{(b_0−b_N−d_{nlj})/a_N} [F_ε(a_N u + b_N)]^N N[F_ε(a_N u + b_N + d_{nlj})]^{N−1} f_ε(a_N u + b_N + d_{nlj}) a_N du
  ≤ e_N + N a_N ∫_{(a_0−b_N−d_{nlj})/a_N}^{(b_0−b_N−d_{nlj})/a_N} G(u)[F_ε(a_N u + b_N + d_{nlj})]^{N−1} f_ε(a_N u + b_N + d_{nlj}) du
  = e_N + N a_{N−1} ∫_{(a_0−b_{N−1})/a_{N−1}}^{(b_0−b_{N−1})/a_{N−1}} G((a_{N−1} t + b_{N−1} − b_N − d_{nlj})/a_N)[F_ε(a_{N−1} t + b_{N−1})]^{N−1} f_ε(a_{N−1} t + b_{N−1}) dt
  ≤ e_N + N a_{N−1} ∫_{(a_0−b_{N−1})/a_{N−1}}^{(b_0−b_{N−1})/a_{N−1}} G((a_{N−1} t + b_{N−1} − b_N − d_{nlj})/a_N) G(t) f_ε(a_{N−1} t + b_{N−1}) dt
    + e_{N−1} N a_{N−1} ∫_{(a_0−b_{N−1})/a_{N−1}}^{(b_0−b_{N−1})/a_{N−1}} G((a_{N−1} t + b_{N−1} − b_N − d_{nlj})/a_N) f_ε(a_{N−1} t + b_{N−1}) dt
  = (I) + (II) + (III).  (7)

Recall that t_{nj} is the j/K_n quantile of F_Z(·). A bound for t_{nj} − F_Z^{−}(j/K_n) can be obtained from the following lemma on sample quantiles, which can be found as Proposition 2 in Lo [13].

Lemma 3. Let F(z) be a continuous distribution on the real line, let {p_n} be a positive monotone increasing sequence between 0 and 1, let ξ_{p_n} denote the p_n-th quantile of the distribution F, and let ξ̂_{p_n} = F_n^{−}(p_n) be the sample quantile, where F_n is the usual empirical distribution function of F. Then, for (log n)/[n(1 − p_n)] = O(1),

P(|ξ̂_{p_n} − ξ_{p_n}| > 3√2 (log n)^{1/2} (1 − p_n)^{1/2} n^{−1/2}) ≤ 4 exp(−2 log n) = O(n^{−2}).

5.1. A Family of Distributions in the Domain of Attraction of Ψ_α(x)

In this section we consider the case that ε satisfies Condition E(1), which implies that ε is a random variable with w_ε < ∞ and a density function f_ε that behaves like c(w_ε − x)^k in a neighborhood of w_ε, for some constant c and nonnegative integer k. This particular setting is used to illustrate how the behavior of f_ε near w_ε affects the behavior of Z_{[n−l+1:n]} − w_Z. Before we prove Theorem 1(a), we need a preliminary result on a_n and b_n.

Lemma 4. F_ε ∈ D(Ψ_{k+1}) with a_n = c n^{−1/(k+1)} and b_n = w_ε, where c = [(−1)^k (k+1)!/f_ε^{(k)}(w_ε)]^{1/(k+1)}.

Proof. According to Condition E(1) and Lemma 2(b), F_ε ∈ D(Ψ_{k+1}), a_n = w_ε − F_ε^{−}(1 − 1/n), and b_n = w_ε. Note that F_ε(w_ε) − F_ε(F_ε^{−}(1 − 1/n)) = 1/n. Set c_n = F_ε^{−}(1 − 1/n). Hence,

n^{−1} = ∫_{c_n}^{w_ε} [f_ε(t) − (f_ε^{(k)}(w_ε)/k!)(t − w_ε)^k] dt + (f_ε^{(k)}(w_ε)/k!) ∫_{c_n}^{w_ε} (t − w_ε)^k dt,
n^{−1} = O((f_ε^{(k)}(w_ε)/(k+1)!)(c_n − w_ε)^{k+1}).

This proves the result for a_n. ∎

Proof of Theorem 1(a). According to the discussion at the beginning of Section 5, it remains to study P(ε_{N:N,l} − ε_{N:N,j} ≤ −d_{nlj}). It will be evaluated by dividing the interval of integration (α_ε, w_ε) in (6) into two intervals (α_ε, a_0) and [a_0, w_ε), with a_0 = F_ε^{−}(1/2). Using (7) with b_0 = w_ε, a_N = cN^{−1/(k+1)}, b_N = w_ε, and G(·) = Ψ_{k+1}(·), we have

∫_{a_0}^{w_ε} [F_ε(t − d_{nlj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt
  ≤ e_N + 2c e_{N−1} N^{k/(k+1)} ∫_{((a_0−w_ε)/c)N^{1/(k+1)}}^{0} Ψ_{k+1}(((N−1)/N)^{1/(k+1)} u − N^{1/(k+1)} d_{nlj}/c) f_ε(c(N−1)^{−1/(k+1)} u + w_ε) du
  + 2c N^{k/(k+1)} ∫_{((a_0−w_ε)/c)N^{1/(k+1)}}^{0} Ψ_{k+1}(((N−1)/N)^{1/(k+1)} u − N^{1/(k+1)} d_{nlj}/c) Ψ_{k+1}(u) f_ε(c(N−1)^{−1/(k+1)} u + w_ε) du.  (8)

We will study the third term and the second term on the right-hand side of (8), respectively. Note that Ψ_{k+1}(·) is a nondecreasing function. We then have

∫_{((a_0−w_ε)/c)N^{1/(k+1)}}^{0} Ψ_{k+1}(((N−1)/N)^{1/(k+1)} u − N^{1/(k+1)} d_{nlj}/c) Ψ_{k+1}(u) f_ε(c(N−1)^{−1/(k+1)} u + w_ε) du
  = ∫_{((a_0−w_ε)/c)N^{1/(k+1)}}^{0} exp((−1)^k {[((N−1)/N)^{1/(k+1)} u − N^{1/(k+1)} d_{nlj}/c]^{k+1} + u^{k+1}}) f_ε(c(N−1)^{−1/(k+1)} u + w_ε) du
  ≤ exp(−((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N d_{nlj}^{k+1})  (9)

and

e_{N−1} ∫_{((a_0−w_ε)/c)N^{1/(k+1)}}^{0} Ψ_{k+1}(((N−1)/N)^{1/(k+1)} u − N^{1/(k+1)} d_{nlj}/c) f_ε(c(N−1)^{−1/(k+1)} u + w_ε) du
  ≤ e_{N−1} exp(−((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N d_{nlj}^{k+1}).  (10)

Since e_{N−1} = o(1) by Lemma 2(d), the right-hand side of (10) is of smaller order than the right-hand side of (9).

By (8), (9), and (10), it is clear that P(ε_{N:N,l} − ε_{N:N,j} ≤ −d_{nlj}) tends to zero if N d_{nlj}^{k+1} tends to infinity. Recall that (−1)^k f_ε^{(k)}(w_ε) > 0. Observe that

∫_{α_ε}^{a_0} [F_ε(t − d_{nlj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt ≤ N[F_ε(a_0)]^{2N}.

Hence, for r < J, we have

P( inf_{K−r<l≤K} Y_{N:N,l} ≥ sup_{1≤j≤K−J} Y_{N:N,j} )
  ≥ 1 − O(e_N) − Σ_{l=K−r+1}^{K} Σ_{j=1}^{K−J} N[F_ε(a_0)]^{2N} − 2c Σ_{l=K−r+1}^{K} Σ_{j=1}^{K−J} exp(−((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N d_{nlj}^{k+1}) − r(K − J) O(N^{−1}).

Finally, we evaluate the magnitude of d_{nlj}. Recall that d_{nlj} = t_{n,l−1} − t_{nj} and that t_{nj} is the j/K_n quantile of F_Z(·). Note that (log n)/[n(1 − j/K_n)] = O(1) for 1 ≤ j ≤ K when (log n)/N → 0. It follows from Lemma 3 that

P( sup_{1≤j≤K} |t_{nj} − F_Z^{−}(j/K_n)| > 3√2 (log n)^{1/2} n^{−1/2} ) ≤ 4K exp(−2 log n) = O(Kn^{−2}) = O(n^{−γ})

for some γ > 1. By the Borel-Cantelli lemma, we have

t_{nj} − F_Z^{−}(j/K_n) = O((log n)^{1/2} n^{−1/2}) a.s.  (11)

For ease of presentation, we first consider the case τ = 1. It follows from (11) and Condition R that [w_Z − F_Z^{−}(1 − j/K_n)]/(j/K_n) is bounded away from zero and infinity. Hence, we have

Σ_{l=K−r+1}^{K} Σ_{j=1}^{K−J} exp(−((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N d_{nlj}^{k+1}) ≤ rK exp(−M_1 ((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N[(J − r)K^{−1}]^{k+1})

for some positive constant M_1. Set K = O((n/log n)^{1/(k+2)}). Note that

Σ_{l=K−r+1}^{K} Σ_{j=1}^{K−J} N[F_ε(a_0)]^{2N} ≤ rn[F_ε(a_0)]^{2N} → 0.

It follows easily that lim_{n→∞} rN^{k/(k+1)} K exp(−M_1((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N[(J − r)K^{−1}]^{k+1}) = 0 when J → ∞. Since r(K − J) O(N^{−1}) = O(n^{−k/(k+2)}(log n)^{−2/(k+2)}) = o(1) and e_N = o(1) by Lemma 2(d), the above discussions conclude that Z_{[n−l:n]} − w_Z = O_p((log n/n)^{1/(k+2)}) for all l < r under Condition R with τ = 1.

For general τ, F_Z^{−}(1 − j/K) = w_Z − O((j/K)^{1/τ}) when j is small and Condition R holds, which implies d_{n,K−r+1,K−J} = O([J^{1/τ} − (r − 1)^{1/τ}]K^{−1/τ}). It follows that

N^{k/(k+1)} Σ_{l=K−r+1}^{K} Σ_{j=1}^{K−J} exp(−((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N d_{nlj}^{k+1}) ≤ rN^{k/(k+1)} K exp(−M_2 ((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N[(J^{1/τ} − r^{1/τ})K^{−1/τ}]^{k+1})

for some positive constant M_2. Set K_n = O((n/log n)^{1/[1+(k+1)/τ]}). It follows easily that lim_{n→∞} rN^{k/(k+1)} K exp(−M_2((−1)^k f_ε^{(k)}(w_ε)/(k+1)!) N[(J^{1/τ} − r^{1/τ})K^{−1/τ}]^{k+1}) = 0 when J → ∞. Since r(K − J) O(N^{−1}) = O((log n)^{−2τ/(k+1+τ)} n^{−(k+1−τ)/(k+1+τ)}) = o(1), the above discussions conclude that Z_{[n−l+1:n]} − w_Z = O_p((log n/n)^{1/[1+(k+1)/τ]}) for all l < r under Condition R. ∎

5.2. A Family of Distributions in the Domain of Attraction of Λ(x)

In this section, we consider the case that ε satisfies Condition E(2). Before we prove Theorem 1(b), we need the following lemma.

Lemma 5. (a) 1 − F_ε(x) ∼ A x^{−u} exp(−B x^v) as x → ∞.

(b) F_ε ∈ D(Λ) with a_n = (n f_ε(b_n))^{−1} ∼ (Bv)^{−1} b_n^{1−v} as n → ∞ and

b_n = (B^{−1} log n)^{1/v} − u log(B^{−1} log n)/[v² B^{1/v} (log n)^{(v−1)/v}].

Proof. When u = 0, (a) follows easily. When u > 0 and t > 0, integration by parts gives

A t^{−u} exp(−B t^v) = ∫_t^∞ (1 + u/(Bv x^v)) ABv x^{−u+v−1} exp(−B x^v) dx > 1 − F_ε(t) > (1 + u/(Bv t^v))^{−1} A t^{−u} exp(−B t^v),

whence the conclusion of (a) follows for u > 0 as well.

By Lemma 2(c), Condition E(2), and (a), F_ε ∈ D(Λ) by simple algebra. We now find acceptable choices of the norming constants. Since 1 − F_ε(x) ∼ A x^{−u} exp(−B x^v), taking the logarithm of both sides of A b_n^{−u} exp(−B b_n^v) = n^{−1} gives

−log A + u log b_n + B b_n^v = log n.  (12)

Hence b_n → ∞ and b_n ∼ (B^{−1} log n)^{1/v}, by dividing both sides of (12) by b_n^v. Since a_n = (n f_ε(b_n))^{−1}, we see that an acceptable choice for a_n is (Bv)^{−1}(B^{−1} log n)^{(1−v)/v}.

Next, try an expansion of b_n by writing b_n = (B^{−1} log n)^{1/v} + r_n, where r_n is a remainder which is o((log n)^{1/v}). Substituting this b_n into (12), we find

o(1) + (u/v) log(B^{−1} log n) + (log n){[1 + r_n/(B^{−1} log n)^{1/v}]^v − 1} = 0.

Hence we conclude that

b_n = (B^{−1} log n)^{1/v} − u log(B^{−1} log n)/[v² B^{1/v} (log n)^{(v−1)/v}]. ∎

Proof of Theorem 1(b). Recall that

P(ε_{N:N,l} − ε_{N:N,j} ≤ −d_{nlj}) = ∫_{−∞}^{∞} [F_ε(t − d_{nlj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt,

where N[F_ε(t)]^{N−1} f_ε(t) is the density function of ε_{N:N,j}. The proof argument is motivated by the following heuristic. Since a_N^{−1}(ε_{N:N,j} − b_N) → Λ in distribution by Lemma 2, it is expected that P(ε_{N:N,l} − ε_{N:N,j} ≤ −d_{nlj}) → 0 when d_{nlj}/a_N → ∞. To avoid notational complexity, we only consider r = 1. For fixed r, the result can be derived accordingly.

The above-mentioned probability will be evaluated by dividing (−∞, ∞) into three intervals (−∞, a_0), [a_0, b_0), and [b_0, ∞), with a_0 = 0 and b_0 = b_N + c_N a_N, where c_N = [4(v−1)/v] log log N. Observe that

∫_{b_0}^{∞} [F_ε(t − d_{nKj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt ≤ N ∫_{b_0}^{∞} [F_ε(t)]^{2N−1} f_ε(t) dt ≤ (1/2)[1 − F_ε^{2N}(b_0)]

and

∫_{−∞}^{a_0} [F_ε(t − d_{nKj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt ≤ [F_ε(0)]^{2N−1} N ∫_{−∞}^{a_0} f_ε(t) dt ≤ N[F_ε(0)]^{2N}.  (13)

Since 1 − F_ε(t) ∼ A t^{−u} exp(−B t^v) and b_0 → ∞, we have

F_ε(b_0) ≥ 1 − 2A b_N^{−u} e^{−B b_0^v} ≥ 1 − [2A/(B^{−1} log N)^{u/v}] exp(−B b_N^v) exp(−Bv c_N a_N b_N^{v−1}/2) ≥ 1 − (A/N)(log N)^{−(u+2v−2)/v}

and

1 − F_ε^{2N}(b_0) ≤ 1 − [1 − A(log N)^{−(u+2v−2)/v}/N]^{2N} ≤ 2A(log N)^{−(u+2v−2)/v}.

Hence

∫_{b_0}^{∞} [F_ε(t − d_{nKj})]^N N[F_ε(t)]^{N−1} f_ε(t) dt ≤ 2A(log N)^{−(u+2v−2)/v}.  (14)

Note that (II) in (7) can be written as

N a_{N−1} ∫_{(a_0−b_{N−1})/a_{N−1}}^{(b_0−b_{N−1})/a_{N−1}} Λ((a_{N−1} t + b_{N−1} − b_N − d_{nKj})/a_N) Λ(t) f_ε(a_{N−1} t + b_{N−1}) dt
  ≤ N ∫_0^{b_0} Λ((t − b_N − d_{nKj})/a_N) Λ((t − b_{N−1})/a_{N−1}) f_ε(t) dt
  = N ∫_0^{b_0} [Λ((t − b_N)/a_N)]^{exp(d_{nKj}/a_N)} Λ((t − b_{N−1})/a_{N−1}) f_ε(t) dt.  (15)

Observe that Λ(0) = exp(−1),

Λ((b_0 − b_N)/a_N) = Λ(c_N) = exp(−e^{−c_N}),  (16)

Λ((c b_N − b_N)/a_N) ≤ 2Λ(v(c − 1) log N) = 2 exp(−N^{v(1−c)}) for 0 < c < 1,  (17)

Λ((b_0^* − b_N)/a_N) = Λ(log 4) = exp(−0.25), where b_0^* = b_N + a_N log 4, and B b_N^v ∼ log N.

Now, we find an upper bound for the right-hand side of (15) by dividing the interval of integration [0, b_0] into [0, c b_N), [c b_N, b_N), and [b_N, b_0]. By (16), (17), N[1 − F_ε(b_N)] = 1, and Λ(·) ≤ 1, we have

N ∫_{b_N}^{b_0} [Λ((t − b_N)/a_N)]^{exp(d_{nKj}/a_N)} Λ((t − b_{N−1})/a_{N−1}) f_ε(t) dt ≤ [Λ(c_N)]^{exp(d_{nKj}/a_N)} N[F_ε(b_0) − F_ε(b_N)] ≤ exp(−e^{d_{nKj}/a_N − c_N}),  (18)

N ∫_{c b_N}^{b_N} [Λ((t − b_N)/a_N)]^{exp(d_{nKj}/a_N)} Λ((t − b_{N−1})/a_{N−1}) f_ε(t) dt ≤ exp(−e^{d_{nKj}/a_N}) N[F_ε(b_N) − F_ε(c b_N)] ≤ 2(1 − c) b_N a_N^{−1} exp(−e^{d_{nKj}/a_N}) ≤ 2v(1 − c)(log N) exp(−e^{d_{nKj}/a_N}),  (19)

and

N ∫_0^{c b_N} [Λ((t − b_N)/a_N)]^{exp(d_{nKj}/a_N)} Λ((t − b_{N−1})/a_{N−1}) f_ε(t) dt < N[exp(−N^{v(1−c)})]^{exp(d_{nKj}/a_N)}.  (20)

Note that (III) in (7) can be evaluated similarly; the resulting bound is the same as in (18)-(20) except for an extra factor e_{N−1}, which is o(1).  (21)

It follows from (5), (6), (13)-(15), and (18)-(21) that

P(Y_{N:N,K} ≥ sup_{1≤j≤K−J} Y_{N:N,j})
  ≥ 1 − [K(log N)^{−(u+2v−2)/v} + N[F_ε(0)]^{2N} + O(e_N)]
  − Σ_{j=1}^{K−J} [exp(−exp(d_{nKj}/a_N − c_N)) + 2v(1 − c)(log N)(1/e)^{exp(d_{nKj}/a_N)} + N[exp(−N^{v(1−c)})]^{exp(d_{nKj}/a_N)}].

Again, when Condition R holds with τ = 1, we have

(log N) Σ_{j=1}^{K−J} e^{−exp(d_{nKj}/a_N)} ≤ K(log N) exp[−exp(Bv(log N/B)^{(v−1)/v} JK^{−1})],

by the same argument used in Section 5.1. Set K = Bv(B^{−1} log n)^{(v−1)/v} (log log n)^{−1}. Hence, e_N = o(1) by Lemma 2(d). It follows easily that

K(log N)^{−(u+2v−2)/v} = o(1),
(log N) Σ_{j=1}^{K−J} exp(−exp(d_{nKj}/a_N)) = o(1),
N[F_ε(0)]^{2N} = o(1),
Σ_{j=1}^{K−J} e^{−exp(d_{nKj}/a_N − c_N)} ≤ K exp(−exp((log N/B)^{(v−1)/v} J log log n/(B^{−1} log n)^{(v−1)/v} − log log n)) ≤ K exp(−(log n)^{J/2−1}) = o(1),

and

N Σ_{j=1}^{K−J} [exp(−N^{v(1−c)})]^{exp(d_{nKj}/a_N)} ≤ n[exp(−N^{v(1−c)})]^{exp(d_{nK1}/a_N)} = o(1).

The above discussions conclude that Z_{[n:n]} − w_Z = O_p((log n)^{−(v−1)/v} log log n) under Condition R with τ = 1.

For general τ,

(log N) Σ_{j=1}^{K−J} e^{−exp(d_{nKj}/a_N)} ≤ K(log N) exp[−exp((2 log N)^{(v−1)/v} JK^{−1/τ})].

Set K = (log n)^{((v−1)/v)τ} (log log n)^{−τ}. Applying the same argument used in deriving the result for τ = 1, we have Z_{[n:n]} − w_Z = O_p((log n)^{−((v−1)/v)τ} (log log n)^τ) under Condition R. ∎

Acknowledgments

The authors thank the referees for helpful comments which led to significant improvements of this paper.

References

[1] Banks, D. (1993). Is industrial statistics out of control? (with discussion). Statist. Sci. 8 356-409.

[2] Bechhofer, R. E. (1954). A single-sample multiple decision procedure for ranking means of normal populations with known variances. Ann. Math. Statist. 25 16-39.

[3] Bhattacharya, P. K. (1984). Induced order statistics: Theory and applications. In Handbook of Statistics (P. R. Krishnaiah and P. K. Sen, Eds.), Vol. 4, pp. 383-403. Elsevier, Amsterdam.

[4] Box, G. E. P., and Wilson, K. B. (1951). On the experimental attainment of optimum conditions. J. Roy. Statist. Soc. Ser. B 13 1-45.

[5] Changchien, G. M. (1990). Optimization of blast furnace burden distribution. In Proceedings of the 1990 Taipei Symposium in Statistics, June 28-30, 1990 (M. T. Chao and P. E. Cheng, Eds.), pp. 63-78.

[6] Chen, H. (1988). Lower rate of convergence for locating a maximum of a function. Ann. Statist. 16 1330-1334.

[14] Muller, H. G. (1985). Kernel estimators of zeros and of location and size of extrema of regression functions. Scand. J. Statist. 12 221-232.

[15] Muller, H. G. (1989). Adaptive nonparametric peak estimation. Ann. Statist. 17 1053-1069.

[16] Reiss, R. D. (1989). Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. Springer-Verlag, New York.

[17] Rohatgi, M. S. (1962). On the asymptotic sufficiency of certain order statistics. J. Roy. Statist. Soc. Ser. B 24 167-176.

[18] Tsybakov, A. B. (1988). Passive Stochastic Approximation. University of Bonn, SFB 303 Discussion Paper.
