Variance Reduction in Multiparameter Likelihood Models

Ming-Yen CHENG and Liang PENG

Local likelihood modeling is a unified and effective approach to establishing the dependence of a response variable, which can be of various types, on independent variables. Therefore, these models have become popular in a wide range of applications. There is an increasing interest in employing multiparameter local likelihood models to investigate trends of sample extremes in environmental statistics. When sample maxima are modeled by a generalized extreme value distribution, the sample size is small in general and local likelihood estimation exhibits a large variation. In this article variance reduction techniques are employed to improve the efficiency of the inference. A simulation study and an application to annual maximum temperatures show that our methods are very effective in finite samples.

KEY WORDS: Bootstrap; Extreme value distribution; Generalized linear models; Local likelihood; Local linear MLE; Logistic regression; Variance reduction.

1. INTRODUCTION

Suppose that $(X_1, Y_1), \ldots, (X_n, Y_n)$ are independent bivariate observations from the distribution of $(X, Y)$. Consider a multiparameter likelihood model

$$f\bigl(y; \theta_1(x), \ldots, \theta_d(x)\bigr) \tag{1.1}$$

for the conditional density of the response $Y$ given that the covariate $X = x$, where the form of the probability density function $f$ is known and the unknown parameters $\theta(x) = (\theta_1(x), \ldots, \theta_d(x))^T$ depend on $X = x$. Under model (1.1), the dependence of $Y$ on $X$ is specified by a parametric law, with probability density function $f$, in which the parameter vector $\theta$ is an unknown $d$-dimensional function of $X$.

Statistical inference for the underlying population $(X, Y)$ based on the observed data $(X_1, Y_1), \ldots, (X_n, Y_n)$ relies heavily on nonparametric estimation of the curves $\theta_1(x), \ldots, \theta_d(x)$. An efficient approach is local linear maximum likelihood estimation: Locally around every $x$ the curves $\theta_1(\cdot), \ldots, \theta_d(\cdot)$ are modeled as linear functions and then estimated by maximizing a kernel-weighted likelihood function. Specifically, define the local linear log-likelihood at $\theta(x)$ as

$$L_n(\theta^*(x)) = \sum_{i=1}^{n} K_h(X_i - x) \log f\bigl(Y_i; \theta(x, X_i)\bigr),$$

where $K_h(x) = K(x/h)/h$, $K$ is a kernel, $h > 0$ is a bandwidth with $h = h(n) \to 0$ as $n \to \infty$, $\theta^*(x) = (\theta_1(x), \theta_1'(x), \ldots, \theta_d(x), \theta_d'(x))^T$, and $\theta(x, u) = \theta(x) + (\theta_1'(x)(u - x), \ldots, \theta_d'(x)(u - x))^T$. Then the local linear maximum likelihood estimator $\hat\theta^*(x) = (\hat\theta_1(x), \hat\theta_1'(x), \ldots, \hat\theta_d(x), \hat\theta_d'(x))^T$ of $\theta^*(x)$ maximizes $L_n(\theta^*(x))$, that is,

$$\hat\theta^*(x) = \arg\max_{\theta^*(x)} L_n(\theta^*(x)).$$

Note that in model (1.1) what is essential is $\theta(x) = (\theta_1(x), \ldots, \theta_d(x))^T$, and the derivatives $\alpha(x) = (\theta_1'(x), \ldots, \theta_d'(x))^T$ are irrelevant. Hence, the local linear maximum likelihood estimator (MLE) for $\theta(x)$ is defined as

$$\hat\theta(x) = \bigl(\hat\theta_1(x), \hat\theta_2(x), \ldots, \hat\theta_d(x)\bigr)^T. \tag{1.2}$$
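As an illustration of the definition above, the following is a minimal sketch of the local linear MLE in the one-parameter logistic case ($d = 1$), where $\log f(y; \theta) = y\theta - \log(1 + e^{\theta})$. The function names, the use of Newton iterations, the biweight kernel, and the stopping rule are assumptions made for this example; the article does not prescribe a numerical optimizer.

```python
import math
import random


def biweight(u):
    """Biweight kernel K(u) = (15/16)(1 - u^2)^2 for |u| <= 1, else 0."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def _score_and_information(a, b, pts):
    """Gradient and negative Hessian of the kernel-weighted log-likelihood.

    pts holds triples (X_i - x0, Y_i, K_h(X_i - x0)); the local model is
    theta(x0, X_i) = a + b (X_i - x0).
    """
    g0 = g1 = i00 = i01 = i11 = 0.0
    for z, y, w in pts:
        p = 1.0 / (1.0 + math.exp(-(a + b * z)))
        g0 += w * (y - p)
        g1 += w * (y - p) * z
        v = w * p * (1.0 - p)
        i00 += v
        i01 += v * z
        i11 += v * z * z
    return (g0, g1), (i00, i01, i11)


def local_linear_logistic(x0, xs, ys, h, max_iter=50, tol=1e-10):
    """Return (a_hat, b_hat, score): a_hat estimates theta(x0), b_hat theta'(x0)."""
    pts = [(x - x0, y, biweight((x - x0) / h) / h) for x, y in zip(xs, ys)]
    pts = [p for p in pts if p[2] > 0.0]  # only points inside the window matter
    a = b = 0.0
    for _ in range(max_iter):
        (g0, g1), (i00, i01, i11) = _score_and_information(a, b, pts)
        det = i00 * i11 - i01 * i01
        da = (i11 * g0 - i01 * g1) / det  # Newton step: information^{-1} * score
        db = (i00 * g1 - i01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    score, _ = _score_and_information(a, b, pts)
    return a, b, score


# Hypothetical usage: binary data whose log-odds curve is theta(x) = 2 - x^2.
rng = random.Random(0)
xs = [rng.random() for _ in range(2000)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(2.0 - x * x))) else 0 for x in xs]
theta_hat, slope_hat, score = local_linear_logistic(0.5, xs, ys, h=0.3)
```

Only the first returned component corresponds to (1.2); the slope is a by-product of the local linear fit. For $d > 1$ the same scheme applies with a $2d$-dimensional Newton step.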

Ming-Yen Cheng is Professor, Department of Mathematics, National Taiwan University, Taipei 106, Taiwan (Email: [email protected]). Liang Peng is Associate Professor, School of Mathematics, Georgia Institute of Technology, Atlanta GA 30332 (Email: [email protected]). This article was supported in part by NSF grant DMS-04-03443 and a Humboldt research fellowship. The comments from a joint editor, an associate editor, and two reviewers are greatly appreciated.

We refer to Aerts and Claeskens (1997) and Fan, Farmen, and Gijbels (1998) for the asymptotic properties of $\hat\theta^*(x)$ and $\hat\theta(x)$ and the choice of bandwidth. In addition, see Claeskens and Van Keilegom (2003) for a study on confidence bands of $\theta(x)$.

To provide a motivating example for studying variance reduction in the construction of multiparameter local likelihood models, we mention modeling of extremes and exceedances.

Recently, there has been an increasing interest in applying likelihood models to investigate the trend in sample extremes; see Davison and Ramesh (2000) for fitting a generalized extreme value distribution to sample maxima, Hall and Tajvidi (2000) for fitting a generalized extreme value distribution and a generalized Pareto distribution to data, and Beirlant and Goegebeur (2004) for fitting a generalized Pareto distribution to exceedances by taking the unknown high threshold into account. Some applications of fitting an extreme value distribution locally to environmental data can be found in Ramesh and Davison (2002) and Chavez-Demoulin and Davison (2005).

When we model sample maxima by a generalized extreme value distribution, the sample size is not large in general. Then the estimation procedure can become much more reliable and efficient provided that variance reduction techniques are implemented.

Kogure (1998) studied general order polynomial interpolation of kernel density estimation and showed that the asymptotic integrated variance becomes smaller. Cheng, Peng, and Wu (2005) introduced variance reduction techniques for nonparametric regression, which reduce the pointwise asymptotic variance uniformly. In this article we adopt the approach of Cheng et al. (2005) because it is more effective. Theoretical study shows that our method reduces the asymptotic variances of the $d$ parameter estimators by a common and known constant factor. Interestingly, this variance reduction in estimating $d$ parameters is achieved simultaneously by applying the technique once. These results are nontrivial given those of Cheng et al. (2005): The multiparameter local likelihood model specifies the conditional distribution of the response variable given the covariates, whereas nonparametric regression considers the conditional mean, and the asymptotic behaviors of nonparametric kernel regression are different from those of local likelihood estimation.

Here we discuss briefly the scope of local likelihood models outside extremes and exceedances. Local likelihood methods effectively model the dependence of various kinds of response variables on covariates in a unified framework. If $f(y; \theta_1(x)) = \exp[C + \{y - \theta_1(x)\}^2/\sigma]$, where $C$ and $\sigma$ are given constants, then the local linear MLE reduces to the local linear regression estimator; see, for example, Loader (1999). Yu and Jones (2004) adapted an analogous local Normal likelihood model for estimation of the conditional variance function in nonparametric regression. Tibshirani and Hastie (1987) suggested (1.2) when $d = 1$ and applied it to local logistic regression and local partial likelihood estimation of Cox's proportional hazard model. Staniswalis (1989) considered a local constant MLE and allowed $X_i$ to be multivariate. Another special case of local likelihood modeling arises in generalized linear models, when the conditional distribution of $Y$ given $X$ belongs to a one-parameter exponential family and the parameter depends on $X$. In this regard, Loader (1999) discussed various examples, including local Poisson regression and a local Gamma model for survival analysis. Fan, Heckman, and Wand (1995) extended the idea to local quasi-likelihood estimation. The literature further includes Irizarry (2001), Eguchi, Kim, and Park (2003), and Signorini and Jones (2004).

This article is organized as follows. In Section 2 we demonstrate a way to incorporate the variance reduction techniques of Cheng et al. (2005) to improve the local linear maximum likelihood estimator (1.2). The main theoretical results are also presented there, together with a discussion of additional variants. A simulation study and a real application to annual maximum temperatures are presented in Section 3. All proofs of the theoretical results are given in the Appendix.

2. METHODOLOGY AND MAIN RESULTS

The idea of our variance reduction strategy is as follows. For each point of estimation, construct a linear combination of local linear maximum likelihood estimates at three points around the point of estimation such that the asymptotic bias remains unchanged. Specifically, for any given point $x$, let $\{\beta_{x,0}, \beta_{x,1}, \beta_{x,2}\}$ be an equally spaced grid of points with bin width $\delta h = \beta_{x,1} - \beta_{x,0} = \beta_{x,2} - \beta_{x,1}$ such that $x = \beta_{x,1} + r\delta h$, where $r \in (-1, 1) \setminus \{0\}$ and $\delta > 0$ are given constants. Then, as in Cheng et al. (2005), a variance reduction estimator for $\theta(x)$ is defined as

$$\tilde\theta(x) = \frac{r(r-1)}{2}\,\hat\theta(\beta_{x,0}) + (1 - r^2)\,\hat\theta(\beta_{x,1}) + \frac{r(r+1)}{2}\,\hat\theta(\beta_{x,2}), \tag{2.1}$$

where $\hat\theta(x) = (\hat\theta_1(x), \ldots, \hat\theta_d(x))^T$ is the local linear maximum likelihood estimate given in (1.2). If $\operatorname{supp}(X)$ is bounded, $\operatorname{supp}(X) = [0, 1]$, say, then because $x - (1 + r)\delta h = \beta_{x,0} < x < \beta_{x,2} = x + (1 - r)\delta h$, the grid points $\beta_{x,0}$ and $\beta_{x,2}$ would fall outside $\operatorname{supp}(X)$ if $x$ is close to the endpoints. Therefore, we take

$$\delta(x) = \min\left\{\delta,\ \frac{x}{(1 + r)h},\ \frac{1 - x}{(1 - r)h}\right\}$$

so that $\{\beta_{x,0}, \beta_{x,1}, \beta_{x,2}\} \subset \operatorname{supp}(X) = [0, 1]$ at all times.
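The construction in (2.1), together with the boundary adjustment $\delta(x)$, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, `theta_hat` stands for any routine that returns one coordinate of the local linear MLE at a point, and the support is assumed to be $[0, 1]$ as in the text.

```python
def vr_weights(r):
    """Weights r(r-1)/2, 1-r^2, r(r+1)/2 attached to the three grid points in (2.1)."""
    return (r * (r - 1.0) / 2.0, 1.0 - r * r, r * (r + 1.0) / 2.0)


def vr_grid(x, h, r, delta=1.0, lo=0.0, hi=1.0):
    """Grid (beta_0, beta_1, beta_2) with x = beta_1 + r*d*h, shrunk to stay in [lo, hi]."""
    d = min(delta, (x - lo) / ((1.0 + r) * h), (hi - x) / ((1.0 - r) * h))
    centre = x - r * d * h
    return (centre - d * h, centre, centre + d * h)


def variance_reduced(theta_hat, x, h, r, delta=1.0, lo=0.0, hi=1.0):
    """Linear combination (2.1) of theta_hat evaluated at the three grid points."""
    grid = vr_grid(x, h, r, delta, lo, hi)
    return sum(w * theta_hat(b) for w, b in zip(vr_weights(r), grid))


# The weights are the quadratic Lagrange interpolation weights at x, so the
# combination reproduces any quadratic exactly; this is why the leading bias
# term is left unchanged.
quadratic = lambda t: 1.0 + 2.0 * t - 3.0 * t * t
value = variance_reduced(quadratic, 0.4, 0.1, 2.0 ** -0.5)
```

For a vector-valued $\hat\theta$, the same weights are applied to each coordinate.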

Next we compare the asymptotic distributions of our variance reduction estimator $\tilde\theta(x)$ and the local linear maximum likelihood estimator $\hat\theta(x)$. For simplicity, we consider the case $d = 2$. Generalization of the results to general $d$ is straightforward. Define the local Fisher information matrix of $\theta(x) = (\theta_1(x), \theta_2(x))^T$ as

$$I(\theta_1(x), \theta_2(x)) = \begin{pmatrix} I_{11}(\theta_1(x), \theta_2(x)) & I_{12}(\theta_1(x), \theta_2(x)) \\ I_{21}(\theta_1(x), \theta_2(x)) & I_{22}(\theta_1(x), \theta_2(x)) \end{pmatrix},$$

where

$$I_{st}(\theta_1(x), \theta_2(x)) = E_x\left\{-\frac{\partial^2}{\partial\theta_s\,\partial\theta_t} \log f(Y; \theta_1(x), \theta_2(x))\right\} = E_x\left\{\frac{\partial}{\partial\theta_s} \log f(Y; \theta_1(x), \theta_2(x)) \times \frac{\partial}{\partial\theta_t} \log f(Y; \theta_1(x), \theta_2(x))\right\}$$

and $E_x$ denotes the expectation conditional on $X = x$. Let $f_X(x)$ denote the marginal probability density function of $X$. Define

$$\nu_{i,j} = \int z^i K^j(z)\,dz, \qquad C(s, t) = \int K(u - st)K(u + st)\,du, \qquad C(s) = \tfrac{3}{2}C(0, s) - 2C(\tfrac{1}{2}, s) + \tfrac{1}{2}C(1, s).$$

The following theorem states the asymptotic normality of our variance reduction estimator $\tilde\theta(x)$.

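Theorem 1. Under the same regularity conditions given by Aerts and Claeskens (1997), for an interior point $x$ we have, as $n \to \infty$,

$$\sqrt{nh}\left\{\tilde\theta(x) - \theta(x) - \tfrac{1}{2}h^2\nu_{2,1}\theta''(x)\right\} \xrightarrow{d} Z_1, \tag{2.2}$$

where $\theta''(x) = (\theta_1''(x), \theta_2''(x))^T$ and $Z_1$ is a $d$-dimensional normal random vector with mean 0 and covariance matrix $\{\nu_{0,2} - r^2(1 - r^2)C(\delta)\} f_X(x)^{-1} I(\theta_1(x), \theta_2(x))^{-1}$.

It follows from Aerts and Claeskens (1997) that, as $n \to \infty$,

$$\sqrt{nh}\left\{\hat\theta(x) - \theta(x) - \tfrac{1}{2}h^2\nu_{2,1}\theta''(x)\right\} \xrightarrow{d} Z_2, \tag{2.3}$$

where $Z_2$ is a $d$-dimensional normal random vector with mean 0 and covariance matrix $\nu_{0,2} f_X(x)^{-1} I(\theta_1(x), \theta_2(x))^{-1}$.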

Remark 1. When $x$ is a boundary point, that is, $x$ is close to the endpoints of $\operatorname{supp}(X)$, $\hat\theta(x)$ and $\tilde\theta(x)$ each still have an asymptotic normal distribution, and only the constant factors in the asymptotic bias vector and covariance matrix change. Typically, the asymptotic variances are inflated because of a reduced number of data points there.

Define the asymptotic mean squared error (AMSE) of an estimator of $\theta(x)$ as the sum of the trace of the covariance matrix and the squared norm of the asymptotic bias vector in its asymptotic Normal distribution. Then, from (2.2) and (2.3), the asymptotic mean squared errors of $\tilde\theta(x)$ and $\hat\theta(x)$ are, respectively,

$$\operatorname{AMSE}\{\tilde\theta(x)\} = \{nhf_X(x)\}^{-1} \operatorname{tr}\{I(\theta_1(x), \theta_2(x))^{-1}\}\,\{\nu_{0,2} - r^2(1 - r^2)C(\delta)\} + \tfrac{1}{4}h^4\nu_{2,1}^2\{(\theta_1''(x))^2 + (\theta_2''(x))^2\}, \tag{2.4}$$

$$\operatorname{AMSE}\{\hat\theta(x)\} = \{nhf_X(x)\}^{-1} \operatorname{tr}\{I(\theta_1(x), \theta_2(x))^{-1}\}\,\nu_{0,2} + \tfrac{1}{4}h^4\nu_{2,1}^2\{(\theta_1''(x))^2 + (\theta_2''(x))^2\}. \tag{2.5}$$

Comparing (2.4) and (2.5), note that the asymptotic mean squared error of $\tilde\theta(x)$ differs from that of $\hat\theta(x)$ by the term

$$-r^2(1 - r^2)C(\delta)\{nhf_X(x)\}^{-1} \operatorname{tr}\{I(\theta_1(x), \theta_2(x))^{-1}\}.$$

Note that $0 < r^2(1 - r^2) \le 1/4$ for any $r \in (-1, 1) \setminus \{0\}$, and the maximum is attained at $r = \pm 2^{-1/2}$. Moreover, for any symmetric kernel $K$, $0 \le C(\delta) \le 3\nu_{0,2}/2$ for all $\delta > 0$, and $C(\delta)$ is increasing in $\delta$ if $K$ is, in addition, unimodal and concave; see Cheng et al. (2005). Hence, the variance reduction estimator is better than the local linear maximum likelihood estimator in terms of asymptotic mean squared error.
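To make the variance factor $\nu_{0,2} - r^2(1 - r^2)C(\delta)$ concrete, the sketch below evaluates $C(\delta)$ numerically for the biweight kernel and compares the closed-form factor with a direct calculation from the weights in (2.1). The midpoint-rule integration, its step count, and all function names are assumptions of this example, not part of the article.

```python
def biweight(u):
    """Biweight kernel K(u) = (15/16)(1 - u^2)^2 for |u| <= 1, else 0."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def overlap(lag, kernel=biweight, n_steps=4000):
    """Midpoint-rule value of the integral of K(u) K(u + lag) over [-1, 1]."""
    step = 2.0 / n_steps
    total = 0.0
    for i in range(n_steps):
        u = -1.0 + (i + 0.5) * step
        total += kernel(u) * kernel(u + lag)
    return total * step


def C2(s, t):
    """C(s, t) = int K(u - st) K(u + st) du, i.e. the kernel overlap at lag 2st."""
    return overlap(2.0 * s * t)


def C(delta):
    """C(delta) = (3/2) C(0, delta) - 2 C(1/2, delta) + (1/2) C(1, delta)."""
    return 1.5 * C2(0.0, delta) - 2.0 * C2(0.5, delta) + 0.5 * C2(1.0, delta)


def variance_factor(r, delta):
    """Closed form nu_{0,2} - r^2 (1 - r^2) C(delta) from Theorem 1."""
    return overlap(0.0) - r * r * (1.0 - r * r) * C(delta)


def variance_factor_direct(r, delta):
    """Same factor computed as sum_j sum_k A_j A_k * overlap(|j - k| * delta)."""
    a = (r * (r - 1.0) / 2.0, 1.0 - r * r, r * (r + 1.0) / 2.0)
    return sum(a[j] * a[k] * overlap(abs(j - k) * delta)
               for j in range(3) for k in range(3))


nu02 = overlap(0.0)                       # equals 5/7 for the biweight kernel
factor = variance_factor(2.0 ** -0.5, 1.0)
```

With $r = \pm 2^{-1/2}$ the factor reduces to $\nu_{0,2} - C(\delta)/4$, which is the quantity appearing in the results that follow.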

Remark 2. Comparing (2.2) and (2.3), note that our variance reduction method simultaneously reduces the asymptotic variances in estimating all the parameters $\theta_1(x), \ldots, \theta_d(x)$, no matter what $d$, the number of parameters in the local likelihood model, is. This property holds for all our estimators $\tilde\theta^{(j)}(x)$, $j = 1, 2, 3$, which are introduced in (2.6), (2.7), and (2.14) and are constructed based on $\tilde\theta(x)$.

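Note that $r$ in the definition of $\tilde\theta(x)$ is an arbitrary constant in $(-1, 1) \setminus \{0\}$. As discussed earlier, choosing $r = \pm 2^{-1/2}$ achieves the most variance reduction regardless of what $h$, $\delta$, and $K$ are, and the resultant estimators are

$$\tilde\theta^{(1)}(x) = \frac{1 - 2^{1/2}}{4}\,\hat\theta\bigl(x - (1 + 2^{-1/2})\delta h\bigr) + \frac{1}{2}\,\hat\theta\bigl(x - 2^{-1/2}\delta h\bigr) + \frac{1 + 2^{1/2}}{4}\,\hat\theta\bigl(x - (2^{-1/2} - 1)\delta h\bigr) \tag{2.6}$$

and

$$\tilde\theta^{(2)}(x) = \frac{1 + 2^{1/2}}{4}\,\hat\theta\bigl(x + (2^{-1/2} - 1)\delta h\bigr) + \frac{1}{2}\,\hat\theta\bigl(x + 2^{-1/2}\delta h\bigr) + \frac{1 - 2^{1/2}}{4}\,\hat\theta\bigl(x + (2^{-1/2} + 1)\delta h\bigr). \tag{2.7}$$

The next theorem states the asymptotic mean squared error of the preceding variance reduction estimators.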

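Theorem 2. Under the same regularity conditions given by Aerts and Claeskens (1997), for an interior point $x$ we have, as $n \to \infty$,

$$\operatorname{AMSE}\{\tilde\theta^{(j)}(x)\} = \{nhf_X(x)\}^{-1} \operatorname{tr}\{I(\theta_1(x), \theta_2(x))^{-1}\}\left\{\nu_{0,2} - \frac{C(\delta)}{4}\right\} + \tfrac{1}{4}h^4\nu_{2,1}^2\{(\theta_1''(x))^2 + (\theta_2''(x))^2\}, \qquad j = 1, 2. \tag{2.8}$$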

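Remark 3. It follows from (2.5) that the optimal bandwidth minimizing $\operatorname{AMSE}\{\hat\theta(x)\}$ is

$$h_0 = \left\{\frac{\nu_{0,2}}{f_X(x)\,\nu_{2,1}^2}\right\}^{1/5} \left\{\frac{\operatorname{tr}(I(\theta_1(x), \theta_2(x))^{-1})}{(\theta_1''(x))^2 + (\theta_2''(x))^2}\right\}^{1/5} n^{-1/5}. \tag{2.9}$$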

Similarly, for $j = 1, 2$, the optimal bandwidth minimizing $\operatorname{AMSE}\{\tilde\theta^{(j)}(x)\}$ given in (2.8) is

$$h_j = \left\{\nu_{0,2} - \frac{C(\delta)}{4}\right\}^{1/5} \nu_{0,2}^{-1/5}\, h_0. \tag{2.10}$$

Remark 4. The bandwidth $h_0$ given in (2.9) yields the optimal AMSE of $\hat\theta(x)$:

$$\operatorname{AMSE}_0(x) = \frac{5}{4}\left\{\frac{\nu_{0,2} \operatorname{tr}(I(\theta_1(x), \theta_2(x))^{-1})}{f_X(x)}\right\}^{4/5} \left[\nu_{2,1}^2\{(\theta_1''(x))^2 + (\theta_2''(x))^2\}\right]^{1/5} n^{-4/5}. \tag{2.11}$$

For $j = 1, 2$, the optimal bandwidth given in (2.10) yields the optimal AMSE of $\tilde\theta^{(j)}(x)$:

$$\operatorname{AMSE}_j(x) = \left\{\nu_{0,2} - \frac{C(\delta)}{4}\right\}^{4/5} \nu_{0,2}^{-4/5} \operatorname{AMSE}_0(x), \tag{2.12}$$

and, hence, the asymptotic relative efficiency of $\tilde\theta^{(j)}(x)$ compared to $\hat\theta(x)$ is

$$\operatorname{eff}\{\tilde\theta^{(j)}(x), \hat\theta(x)\} = \left\{\nu_{0,2} - \frac{C(\delta)}{4}\right\}^{-4/5} \nu_{0,2}^{4/5} \ge 1. \tag{2.13}$$

Although the theoretical results imply that larger values of $\delta$ achieve more variance reduction, large $\delta$ may introduce substantial finite-sample bias. We suggest taking $\delta = 1$ for general purposes. Slightly larger values of $\delta$, for example, $\delta = 1.2$, may be useful in applications when the second derivatives of the curves $\theta_j(x)$, $j = 1, \ldots, d$, are small. A simple way to judge is to examine the departures of the curve estimates $\tilde\theta^{(1)}(\cdot)$ and $\tilde\theta^{(2)}(\cdot)$ from $\hat\theta(\cdot)$ under different choices of $\delta$.

Each of the variance reduction estimators $\tilde\theta^{(1)}(x)$ and $\tilde\theta^{(2)}(x)$ uses more information from data points on one side of $x$ than from those on the other side; see (2.6) and (2.7). One way to cancel out the resulting finite-sample biases is to take the average of the two estimators,

$$\tilde\theta^{(3)}(x) = \tfrac{1}{2}\{\tilde\theta^{(1)}(x) + \tilde\theta^{(2)}(x)\}. \tag{2.14}$$

When $\operatorname{supp}(X) = [0, 1]$, to keep the points $\{\beta_{x,0}, \beta_{x,1}, \beta_{x,2}\}$ with both $r = 2^{-1/2}$ and $r = -2^{-1/2}$ within the data range $[0, 1]$, we let

$$\delta(x) = \min\left\{\delta,\ \frac{x}{(1 + 2^{-1/2})h},\ \frac{1 - x}{(1 + 2^{-1/2})h}\right\}$$

for a given positive constant $\delta$, $\delta = 1$ say.

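Theorem 3. Under the same regularity conditions of Aerts and Claeskens (1997), for an interior point $x$ we have, as $n \to \infty$,

$$\sqrt{nh}\left\{\tilde\theta^{(3)}(x) - \theta(x) - \tfrac{1}{2}h^2\nu_{2,1}\theta''(x)\right\} \xrightarrow{d} Z_3, \tag{2.15}$$

where $Z_3$ is a $d$-dimensional normal random vector with mean 0 and covariance matrix $\{\nu_{0,2} - C(\delta)/4 - D(\delta)/2\} f_X(x)^{-1} I(\theta_1(x), \theta_2(x))^{-1}$ and

$$D(\delta) = \nu_{0,2} - \frac{C(\delta)}{4} - \frac{1}{16}\Bigl\{4(1 + \sqrt{2})\,C(\sqrt{2} - 1, \delta/2) + (3 + 2\sqrt{2})\,C(2 - \sqrt{2}, \delta/2) + 2\,C(\sqrt{2}, \delta/2) + 4(1 - \sqrt{2})\,C(\sqrt{2} + 1, \delta/2) + (3 - 2\sqrt{2})\,C(\sqrt{2} + 2, \delta/2)\Bigr\}.$$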

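It follows from (2.15) that the asymptotic mean squared error of $\tilde\theta^{(3)}(x)$ is

$$\operatorname{AMSE}\{\tilde\theta^{(3)}(x)\} = \{nhf_X(x)\}^{-1} \operatorname{tr}\{I(\theta_1(x), \theta_2(x))^{-1}\}\left\{\nu_{0,2} - \frac{C(\delta)}{4} - \frac{D(\delta)}{2}\right\} + \tfrac{1}{4}h^4\nu_{2,1}^2\{(\theta_1''(x))^2 + (\theta_2''(x))^2\}. \tag{2.16}$$

Remark 5. The optimal bandwidth of $\tilde\theta^{(3)}(x)$ minimizing $\operatorname{AMSE}\{\tilde\theta^{(3)}(x)\}$ in (2.16) is

$$h_3 = \left\{\nu_{0,2} - \frac{C(\delta)}{4} - \frac{D(\delta)}{2}\right\}^{1/5} \nu_{0,2}^{-1/5}\, h_0, \tag{2.17}$$

giving the optimal AMSE of $\tilde\theta^{(3)}(x)$:

$$\operatorname{AMSE}_3(x) = \left\{\nu_{0,2} - \frac{C(\delta)}{4} - \frac{D(\delta)}{2}\right\}^{4/5} \nu_{0,2}^{-4/5} \operatorname{AMSE}_0(x). \tag{2.18}$$

Hence, the asymptotic relative efficiency of $\tilde\theta^{(3)}(x)$ compared to $\hat\theta(x)$ is

$$\operatorname{eff}\{\tilde\theta^{(3)}(x), \hat\theta(x)\} = \left\{\nu_{0,2} - \frac{C(\delta)}{4} - \frac{D(\delta)}{2}\right\}^{-4/5} \nu_{0,2}^{4/5} \ge 1. \tag{2.19}$$

For any kernel $K$, $0 \le D(\delta) \le \tfrac{5}{8}\nu_{0,2}$ for all $\delta > 0$. Comparing (2.13) and (2.19), we find that $\tilde\theta^{(3)}(x)$ has a better asymptotic efficiency than $\tilde\theta^{(j)}(x)$, $j = 1, 2$, and the improvement is significant.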
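The relative efficiencies (2.13) and (2.19) depend only on the kernel and on $\delta$, so they can be tabulated once. The following sketch does this for the biweight kernel; the numerical integration scheme and the function names are assumptions of the example. As a check on the expression for $D(\delta)$, the bracketed sum divided by 16 is compared with the covariance between the two one-sided estimators computed directly from their weights and grid positions.

```python
import math


def biweight(u):
    """Biweight kernel K(u) = (15/16)(1 - u^2)^2 for |u| <= 1, else 0."""
    return 15.0 / 16.0 * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def overlap(lag, kernel=biweight, n_steps=4000):
    """Midpoint-rule value of int K(u) K(u + lag) du; note C(s, t) = overlap(2 s t)."""
    step = 2.0 / n_steps
    return step * sum(kernel(-1.0 + (i + 0.5) * step) * kernel(-1.0 + (i + 0.5) * step + lag)
                      for i in range(n_steps))


def cross_term(delta):
    """The (1/16){...} sum in D(delta); C(s, delta/2) is the overlap at lag s*delta."""
    s2 = math.sqrt(2.0)
    c = lambda s: overlap(s * delta)
    return (4 * (1 + s2) * c(s2 - 1) + (3 + 2 * s2) * c(2 - s2) + 2 * c(s2)
            + 4 * (1 - s2) * c(s2 + 1) + (3 - 2 * s2) * c(s2 + 2)) / 16.0


def cross_term_direct(delta):
    """Same quantity from the weights and grid offsets (in units of h) in (2.6)-(2.7)."""
    r = 2.0 ** -0.5
    w1 = ((1 - math.sqrt(2)) / 4, 0.5, (1 + math.sqrt(2)) / 4)
    p1 = (-(1 + r) * delta, -r * delta, (1 - r) * delta)
    w2 = w1[::-1]
    p2 = (-(1 - r) * delta, r * delta, (1 + r) * delta)
    return sum(a * b * overlap(abs(p - q)) for a, p in zip(w1, p1) for b, q in zip(w2, p2))


def efficiencies(delta=1.0):
    """Return (eff of theta~(j), eff of theta~(3), D(delta)) as in (2.13) and (2.19)."""
    nu = overlap(0.0)
    c_delta = 1.5 * nu - 2.0 * overlap(delta) + 0.5 * overlap(2.0 * delta)
    d_delta = nu - c_delta / 4.0 - cross_term(delta)
    eff_j = ((nu - c_delta / 4.0) / nu) ** -0.8
    eff_3 = ((nu - c_delta / 4.0 - d_delta / 2.0) / nu) ** -0.8
    return eff_j, eff_3, d_delta


eff_j, eff_3, d_one = efficiencies(1.0)
```

Under these assumptions, the first efficiency for the biweight kernel with $\delta = 1$ comes out at about 1.29, and the averaged estimator is at least as efficient because $D(\delta) \ge 0$.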

Remark 6. The conclusions in Remarks 3–5 are all based on the total mean squared error measure given in (2.4), (2.8), and (2.16). In some circumstances, the coordinatewise mean squared errors may be used to accommodate different accuracy requirements. Then the coordinatewise optimal bandwidths for the different estimators follow the same relations as in (2.10) and (2.17), and the componentwise relative efficiencies of our estimators compared to the local linear MLE remain as on the right sides of (2.13) and (2.19).

Remark 7. Existing data-driven bandwidth selection rules for the local linear maximum likelihood estimator $\hat\theta(x)$ include the cross-validation method of Aerts and Claeskens (1997) and the grid search approach of Fan et al. (1998). To implement our estimators, one can simply replace $\hat\theta(x)$ by our estimators in the previously mentioned procedures.

Remark 8. For any $j = 1, 2, 3$, in the construction of our variance reduction estimator $\tilde\theta^{(j)}$, the same value of $\delta$ is applied to obtain all $d$ parameter estimators $\tilde\theta_1^{(j)}(x), \ldots, \tilde\theta_d^{(j)}(x)$. If the curvature of the parameter curves $\theta_1(x), \ldots, \theta_d(x)$ varies greatly from one to another, then it may be preferable to use different $\delta$ values for $\tilde\theta_1^{(j)}(x), \ldots, \tilde\theta_d^{(j)}(x)$, which can be done coordinatewise.

3. SIMULATION STUDY AND REAL APPLICATION

3.1 Simulation Study

We consider the following two models in our simulation study.

Model A (Extreme value distribution).

$$P(Y \le y \mid X = x) = \exp\left\{-\left(1 + \gamma(x)\,\frac{y - \mu(x)}{\sigma(x)}\right)_+^{-1/\gamma(x)}\right\}, \tag{3.1}$$

where $\theta(x) = (\gamma(x), \mu(x), \sigma(x))^T$, $\sigma(x) = 1 + x^2$, $\mu(x) = -1 + 2x$, $\gamma(x) = -.2$ or $0$, $(u)_+ = u$ for positive $u$ and $(u)_+ = 0$ otherwise, and $X$ is uniformly distributed on $[0, 1]$.

Model B (Logistic regression).

$$P(Y = 1 \mid X = x) = \frac{\exp\{\theta(x)\}}{1 + \exp\{\theta(x)\}} \quad\text{and}\quad P(Y = 0 \mid X = x) = \frac{1}{1 + \exp\{\theta(x)\}},$$

where $\theta(x) = \theta_1(x) = 7[\exp\{-(x + 1)^2\} + \exp\{-(x - 1)^2\}] - 5.5$ or $\theta(x) = \theta_2(x) = 2 - x^2$, and $X$ is uniformly distributed on $[0, 1]$.

Note that $\gamma(x)$, $\mu(x)$, and $\sigma(x)$ in Model A are called the shape, location, and scale parameter curves, respectively. We take $\gamma(x)$ to be a constant because the real data application in Section 3.2 suggests this. Model B was considered by Fan et al. (1998) as well.
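For readers who wish to reproduce data of this kind, the sketch below draws samples from Models A and B by inverting the distribution function in (3.1) and by Bernoulli sampling, respectively. The function names and the treatment of $\gamma = 0$ as the Gumbel limit are assumptions of this example; the article does not describe its sampling code.

```python
import math
import random


def gev_cdf(y, gamma, mu, sigma):
    """Distribution function (3.1); gamma = 0 is read as the Gumbel limit."""
    if gamma == 0.0:
        return math.exp(-math.exp(-(y - mu) / sigma))
    t = 1.0 + gamma * (y - mu) / sigma
    if t <= 0.0:  # outside the support
        return 1.0 if gamma < 0.0 else 0.0
    return math.exp(-t ** (-1.0 / gamma))


def gev_quantile(u, gamma, mu, sigma):
    """Inverse of gev_cdf for 0 < u < 1."""
    e = -math.log(u)  # standard exponential when u is uniform on (0, 1)
    if gamma == 0.0:
        return mu - sigma * math.log(e)
    return mu + sigma * (e ** (-gamma) - 1.0) / gamma


def open_uniform(rng):
    """Uniform draw strictly inside (0, 1), so that the logarithms are finite."""
    u = rng.random()
    while u <= 0.0:
        u = rng.random()
    return u


def sample_model_a(n, gamma, rng):
    """Pairs (X, Y) with mu(x) = -1 + 2x and sigma(x) = 1 + x^2."""
    xs = [rng.random() for _ in range(n)]
    return [(x, gev_quantile(open_uniform(rng), gamma, -1.0 + 2.0 * x, 1.0 + x * x))
            for x in xs]


def theta1(x):
    return 7.0 * (math.exp(-(x + 1.0) ** 2) + math.exp(-(x - 1.0) ** 2)) - 5.5


def theta2(x):
    return 2.0 - x * x


def sample_model_b(n, theta, rng):
    """Pairs (X, Y) with P(Y = 1 | X = x) = exp(theta(x)) / (1 + exp(theta(x)))."""
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 if rng.random() < 1.0 / (1.0 + math.exp(-theta(x))) else 0) for x in xs]


rng = random.Random(2006)
data_a = sample_model_a(400, -0.2, rng)
data_b = sample_model_b(400, theta2, rng)
```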

We drew 400 random samples of size $n = 400$ and $n = 600$ from both Models A and B and took

$$\delta = \delta(x) = \min\left\{1,\ \frac{x}{(1 + \sqrt{1/2})h},\ \frac{1 - x}{(1 + \sqrt{1/2})h}\right\}$$

in computing the variance reduction estimates. The biweight kernel $K(u) = \tfrac{15}{16}(1 - u^2)^2 I(|u| \le 1)$ was employed.

For Model A, we kept the bandwidth $h$ at .15 for both $\hat\theta(x)$ and $\tilde\theta^{(3)}(x)$. For Model B, we employed the data-driven method of Fan et al. (1998) to choose the optimal bandwidth. That is, we searched for the optimal $h$ from .1 to .4 in increments of .01 to minimize the median of the integrated squared errors of the local linear maximum likelihood estimate. Then this optimal bandwidth was applied to both $\hat\theta(x)$ and $\tilde\theta^{(3)}(x)$. Under Model B, we also experimented with $h = .15$ for both estimates.

One way to measure the performance of our variance reduction estimate $\tilde\theta^{(3)}(x)$ relative to the local linear maximum likelihood estimate $\hat\theta(x)$ on each sample is to compute the ratio of the integrated squared error (ISE) of the latter to that of the former. Table 1 reports the mean and standard deviation of the ISE ratios obtained from the 400 samples. From Table 1, we clearly observe the effectiveness of the variance reduction techniques, especially when a small bandwidth is employed. In addition, the fact that, in all cases considered, the relative performance stays roughly the same when $n$ changes from 400 to 600 indicates that $\tilde\theta^{(3)}(x)$ already achieves the asymptotic relative efficiency at moderate sample sizes.
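The ISE ratio used as the performance measure can be computed along the following lines. This is a sketch under stated assumptions: the article does not say how the integral was approximated, so the trapezoidal rule on a user-supplied grid and the function names here are illustrative choices only.

```python
def integrated_squared_error(estimate, truth, grid):
    """Trapezoidal approximation of the integral of (estimate - truth)^2 over the grid."""
    sq = [(estimate(x) - truth(x)) ** 2 for x in grid]
    return sum(0.5 * (sq[i] + sq[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))


def ise_ratio(local_mle, variance_reduced, truth, grid):
    """ISE of the local linear MLE divided by ISE of the variance reduction estimate.

    Values above 1 favour the variance reduction estimate.
    """
    return (integrated_squared_error(local_mle, truth, grid)
            / integrated_squared_error(variance_reduced, truth, grid))


# Toy check with known answers: constant errors of .2 and .1 give a ratio of 4.
grid = [i / 100.0 for i in range(101)]
truth = lambda x: 2.0 - x * x
ratio = ise_ratio(lambda x: truth(x) + 0.2, lambda x: truth(x) + 0.1, truth, grid)
```

In a simulation, one ratio would be computed per sample and per parameter curve, and the mean and standard deviation of the ratios would then be summarized across samples.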

Table 1. Relative Performance

Model A
h     Parameter   γ = −.2, n = 400   γ = −.2, n = 600   γ = 0, n = 400   γ = 0, n = 600
.15   γ           1.394 (.504)       1.359 (.489)       1.476 (.488)     1.468 (.480)
.15   µ           1.419 (.404)       1.411 (.425)       1.406 (.465)     1.406 (.457)
.15   σ           1.384 (.455)       1.332 (.350)       1.395 (.405)     1.404 (.418)

Model B
h         θ1(x), n = 400   θ1(x), n = 600   θ2(x), n = 400   θ2(x), n = 600
Optimal   1.041 (.154)     1.027 (.152)     1.044 (.158)     1.052 (.157)
.15       1.227 (.380)     1.227 (.372)     1.311 (.474)     1.281 (.416)

NOTE: The numbers denote the mean of the ratios of the integrated squared errors of the local linear maximum likelihood estimate $\hat\theta(x)$ to those of the variance reduction estimate $\tilde\theta^{(3)}(x)$, based on both $h = .15$ and the optimal $h$ in the sense of minimizing the median of the integrated squared errors of the local linear maximum likelihood estimate. The corresponding standard deviations are given in parentheses.

3.2 Real Application

We analyze the annual maximum temperatures (degrees Celsius) measured at Station De Bilt, the Netherlands, from January 1, 1901 to December 31, 2003, say $y_1, \ldots, y_{103}$; see Figure 1. Here the year $1900 + i$ was transformed to $x_i = i/104$ for $i = 1, \ldots, 103$. This dataset is constructed by taking the maximum of daily maximum temperatures available at http://www.knmi.nl/voorl/kd/lijsten/daggem/etmgeg_downl.cgi?language=eng.

First, we applied the extreme value distribution (3.1) to this dataset with $\gamma(x)$ and $\sigma(x)$ being constants. The setup for estimation is the same as in the simulation study except for the choice of bandwidth. Here the cross-validation bandwidth method of Aerts and Claeskens (1997) was employed. More specifically, we first computed $CV(h) = \sum_{i=1}^{n} \log f(Y_i; \hat\theta^{[i]}(x_i))$ for $h = .1, .101, .102, \ldots, .4$, where $\hat\theta^{[i]}(x)$ is the local linear maximum likelihood estimate computed without the observation $(x_i, y_i)$, and then chose $h$ to minimize the quantity $CV(h)$. Figure 2 depicts $CV(h)$ versus $h$. Based on Figure 2, we employed $h = .165$ and $h = .191$, which correspond to the largest two values of $h$ where local minima of $CV(h)$ occur, to compute the local linear maximum likelihood estimates and the corresponding variance reduction estimates $\tilde\theta^{(3)}(x)$; see Figures 3 and 4. In both cases, our curve estimate fluctuates less than the original local linear estimate, while the two suggest similar patterns in the parameter curves. The results from $h = .191$ are more useful because the curve estimates are smoother.

Figure 1. Annual Maximum Temperatures (degrees Celsius) Measured at Station De Bilt, the Netherlands, During the Period January 1, 1901–December 31, 2003.

To estimate the variances of these estimates, a bootstrapping approach similar to that of Davison and Ramesh (2000) was employed. That is, we draw with replacement 1,000 bootstrap samples $\{\varepsilon^*_{1,j}, \ldots, \varepsilon^*_{103,j}\}_{j=1}^{1{,}000}$ from $\{(1 + \hat\gamma(x_i)(y_i - \hat\mu(x_i))/\hat\sigma(x_i))^{-1/\hat\gamma(x_i)},\ i = 1, \ldots, 103\}$. For each $j = 1, \ldots, 1{,}000$, we form a bootstrap sample by setting $y^*_{i,j} = \hat\mu(x_i) + \hat\sigma(x_i)\{(\varepsilon^*_{i,j})^{-\hat\gamma(x_i)} - 1\}/\hat\gamma(x_i)$, $i = 1, \ldots, 103$, and then recalculate the local linear maximum likelihood estimator $\hat\theta(x_i)$ and the variance reduction estimator $\tilde\theta^{(3)}(x_i)$ based on the bootstrap sample $(x_1, y^*_{1,j}), \ldots, (x_{103}, y^*_{103,j})$ to give bootstrap estimates. Finally, the bootstrap estimates of the variances of $(\hat\gamma(x_i), \hat\mu(x_i), \hat\sigma(x_i))^T$ and $(\tilde\gamma^{(3)}(x_i), \tilde\mu^{(3)}(x_i), \tilde\sigma^{(3)}(x_i))^T$ are obtained as the respective sample variances of the 1,000 bootstrap estimates. They are depicted in Figures 5 and 6. We observe from Figures 5 and 6 that, for each of the parameter curves, the variance reduction estimator has a substantially smaller bootstrap variance estimate than the local linear maximum likelihood estimator at the interior points of $x$.

Figure 2. Cross-Validation Function for Annual Maximum Temperature Data When γ and σ Are Constants. The cross-validation function CV(h) is plotted against h from .1 to .4 with increments of .001. The largest two values of h where local minima of CV(h) occur are h = .165 and h = .191.

Figure 3. Estimators for the Case When γ and σ Are Constants. The solid and dashed lines represent the local linear maximum likelihood estimator and the variance reduction estimator with bandwidth h = .165, respectively. The upper left, upper right, and lower plots correspond to γ(x), µ(x), and σ(x), respectively.
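The resampling step of this bootstrap can be sketched as follows. The sketch covers only the transformation to residuals, the resampling, and the back-transformation; refitting the estimators on each bootstrap sample is left out. The function names are hypothetical, `fits` is assumed to hold the fitted triples $(\hat\gamma(x_i), \hat\mu(x_i), \hat\sigma(x_i))$ with $\hat\gamma(x_i) \ne 0$, and every observation is assumed to lie inside the fitted support.

```python
import random


def to_residual(y, gamma, mu, sigma):
    """Residual (1 + gamma (y - mu) / sigma)^(-1/gamma); requires gamma != 0."""
    return (1.0 + gamma * (y - mu) / sigma) ** (-1.0 / gamma)


def from_residual(eps, gamma, mu, sigma):
    """Back-transformation y* = mu + sigma (eps^(-gamma) - 1) / gamma."""
    return mu + sigma * (eps ** (-gamma) - 1.0) / gamma


def bootstrap_sample(ys, fits, rng):
    """One bootstrap response vector: resample residuals, then map back at each x_i."""
    residuals = [to_residual(y, *fit) for y, fit in zip(ys, fits)]
    drawn = rng.choices(residuals, k=len(residuals))  # sampling with replacement
    return [from_residual(e, *fit) for e, fit in zip(drawn, fits)]


# Hypothetical fitted values (gamma, mu, sigma) at five design points.
ys = [30.1, 33.4, 31.8, 35.0, 32.2]
fits = [(-0.2, 32.0 + 0.1 * i, 1.5) for i in range(5)]
rng = random.Random(1)
y_star = bootstrap_sample(ys, fits, rng)
```

Repeating `bootstrap_sample` 1,000 times, refitting both estimators each time, and taking pointwise sample variances would give variance estimates of the kind described above.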

Second, we applied model (3.1) to this dataset with all three parameters being functions of $x$. We took $h = .191$. As before, the curve estimates and the bootstrapped variance estimates are plotted in Figures 7 and 8. Again, we observe that the variance reduction estimator has a substantially smaller bootstrap variance estimate than the local linear maximum likelihood estimator at the interior points of $x$. Moreover, both Figures 7 and 8 indicate that $\gamma$ and $\sigma$ may be modeled as constants, especially for interior points of $x$. From Figures 6 and 8, we see that, for each of $\hat\theta(x)$ and $\tilde\theta^{(3)}(x)$, the bootstrap variance estimate of $\mu(x)$ is much smaller when $\gamma$ and $\sigma$ are treated as constants than when they depend on $x$.

Figure 4. Estimators for the Case When γ and σ Are Constants. The solid and dashed lines represent the local linear maximum likelihood estimator and the variance reduction estimator with bandwidth h = .191, respectively. The upper left, upper right, and lower plots correspond to γ(x), µ(x), and σ(x), respectively.

Figure 5. Bootstrapped Variance Estimates for the Case When γ and σ Are Constants. The solid and dashed lines represent the bootstrapped variances of the local linear maximum likelihood estimator and the variance reduction estimator with bandwidth h = .165, respectively. The upper left, upper right, and lower plots correspond to γ(x), µ(x), and σ(x), respectively.

Figure 6. Bootstrapped Variance Estimates for the Case When γ and σ Are Constants. The solid and dashed lines represent the bootstrapped variances of the local linear maximum likelihood estimator and the variance reduction estimator with bandwidth h = .191, respectively. The upper left, upper right, and lower plots correspond to γ(x), µ(x), and σ(x), respectively.

Figure 7. Estimators for the Case When All Three Parameters Depend on x. The solid and dashed lines represent the local linear maximum likelihood estimator and the variance reduction estimator with bandwidth h = .191, respectively. The upper left, upper right, and lower plots correspond to γ(x), µ(x), and σ(x), respectively.

Figure 8. Bootstrapped Variance Estimates for the Case When All Three Parameters Depend on x. The solid and dashed lines represent the bootstrapped variances of the local linear maximum likelihood estimator and the variance reduction estimator with bandwidth h = .191, respectively. The upper left, upper right, and lower plots correspond to γ(x), µ(x), and σ(x), respectively.

APPENDIX: PROOFS

Proof of Theorem 1

Write $I_{st} = I_{st}(\theta_1(x), \theta_2(x))$ for brevity. Put

$$Q_1 = \begin{pmatrix} 1 & 0 \\ 0 & \nu_{2,1} \end{pmatrix}, \qquad Q_2 = \begin{pmatrix} \nu_{0,2} & \nu_{1,2} \\ \nu_{1,2} & \nu_{2,2} \end{pmatrix}, \qquad Q_3 = \begin{pmatrix} 0 & \nu_{2,1} \\ \nu_{2,1} & 0 \end{pmatrix},$$

$$\Sigma_x = f_X(x)\begin{pmatrix} I_{11}Q_1 & I_{12}Q_1 \\ I_{21}Q_1 & I_{22}Q_1 \end{pmatrix}, \qquad \Gamma_x = f_X(x)\begin{pmatrix} I_{11}Q_2 & I_{12}Q_2 \\ I_{21}Q_2 & I_{22}Q_2 \end{pmatrix},$$

$$\Lambda_x = \begin{pmatrix} \dfrac{d}{dx}\{f_X(x)I_{11}\}Q_3 & \dfrac{d}{dx}\{f_X(x)I_{21}\}Q_3 \\[2ex] \dfrac{d}{dx}\{f_X(x)I_{12}\}Q_3 & \dfrac{d}{dx}\{f_X(x)I_{22}\}Q_3 \end{pmatrix},$$

$$V_n(x) = \sqrt{nh}\,\bigl(\hat\theta_1(x) - \theta_1(x),\ h\{\hat\theta_1'(x) - \theta_1'(x)\},\ \hat\theta_2(x) - \theta_2(x),\ h\{\hat\theta_2'(x) - \theta_2'(x)\}\bigr)^T,$$

$$q_1(y; u_1, u_2) = \frac{\partial}{\partial s}\log f(y; s, t)\Big|_{(s,t)=(u_1,u_2)}, \qquad q_2(y; u_1, u_2) = \frac{\partial}{\partial t}\log f(y; s, t)\Big|_{(s,t)=(u_1,u_2)},$$

$$W_n(x) = \bigl(W_{n1}(x), \ldots, W_{n4}(x)\bigr)^T,$$

$$W_{n(2k+l-1)}(x) = \frac{\sqrt{h}}{\sqrt{n}}\sum_{i=1}^{n}\left(\frac{X_i - x}{h}\right)^{l} K_h(X_i - x)\, q_k\bigl(Y_i;\ \theta_1(x) + \theta_1'(x)(X_i - x),\ \theta_2(x) + \theta_2'(x)(X_i - x)\bigr),$$

where $k = 1, 2$ and $l = 0, 1$. (The symbols $\Sigma_x$, $\Gamma_x$, and $\Lambda_x$ label the three block matrices built from $Q_1$, $Q_2$, and $Q_3$, respectively.) Then it follows from Aerts and Claeskens (1997) that

$$(\Sigma_x + h\Lambda_x)V_n(x) - E\{W_n(x)\} = W_n(x) - E\{W_n(x)\} + o_p(1), \tag{A.1}$$

$$E\{W_n(x)\} = \sqrt{nh} \times \begin{pmatrix} \tfrac{1}{2}h^2 f_X(x)\nu_{2,1}\sum_{j=1,2} I_{1j}\,\theta_j''(x)\{1 + o(1)\} \\ o(h^2) \\ \tfrac{1}{2}h^2 f_X(x)\nu_{2,1}\sum_{j=1,2} I_{2j}\,\theta_j''(x)\{1 + o(1)\} \\ o(h^2) \end{pmatrix}. \tag{A.2}$$

Define

$$V_n^*(x) = \operatorname{diag}(1, h, 1, h)\sqrt{nh}\,\{\tilde\theta^*(x) - \theta^*(x)\}, \qquad \tilde\theta^*(x) = \sum_{j=0,1,2} A_j(r)\,\hat\theta^*(\beta_{x,j}),$$

where $A_0(r) = 2^{-1}r(r - 1)$, $A_1(r) = 1 - r^2$, and $A_2(r) = 2^{-1}r(1 + r)$. Note that

$$\tilde\theta^*(x) - \theta^*(x) = \sum_{j=0,1,2}\Bigl[A_j(r)\{\hat\theta^*(\beta_{x,j}) - \theta^*(\beta_{x,j})\} + A_j(r)\{\theta^*(\beta_{x,j}) - \theta^*(x)\}\Bigr], \tag{A.3}$$

$$1 = \sum_{j=0,1,2} A_j(r), \qquad 0 = \sum_{j=0,1,2}(-1 + j - r)A_j(r), \qquad 0 = \sum_{j=0,1,2}(-1 + j - r)^2 A_j(r). \tag{A.4}$$

Define

$$C_1^*(a, b) = \int K(s + a\delta h)K(s + b\delta h)\,ds,$$

$$C_2^*(a, b) = \int K(s + a\delta h)K(s + b\delta h)(s + a\delta h)\,ds,$$

$$C_3^*(a, b) = \int K(s + a\delta h)K(s + b\delta h)(s + a\delta h)(s + b\delta h)\,ds,$$

$$\gamma_{ij} = \operatorname{cov}\Bigl(\sum_{l=0,1,2} A_l(r)W_{ni}(\beta_{x,l}),\ \sum_{l=0,1,2} A_l(r)W_{nj}(\beta_{x,l})\Bigr).$$

It is easy to check that $\gamma_{ij} = \gamma_{ji}$,

$$\gamma_{ll} = f_X(x)I_{11}\sum_{i=0,1,2}\sum_{j=0,1,2} A_i(r)A_j(r)\,C^*_{2l-1}(1 - i + r,\ 1 - j + r), \qquad l = 1, 2,$$

$$\gamma_{12} = f_X(x)I_{11}\sum_{i=0,1,2}\sum_{j=0,1,2} A_i(r)A_j(r)\,C_2^*(1 - i + r,\ 1 - j + r),$$

$$(\gamma_{13}, \gamma_{14}, \gamma_{23}, \gamma_{24})^T = (\gamma_{11}, \gamma_{12}, \gamma_{12}, \gamma_{22})^T\,\frac{I_{12}}{I_{11}}, \qquad (\gamma_{33}, \gamma_{34}, \gamma_{44})^T = (\gamma_{11}, \gamma_{12}, \gamma_{22})^T\,\frac{I_{22}}{I_{11}}.$$

From (A.2)–(A.4), we have

$$\begin{aligned} E\{(\Sigma_x + h\Lambda_x)V_n^*(x)\} &= \sum_{j=0,1,2} A_j(r)E\{W_n(\beta_{x,j})\}\{1 + o(1)\} + \sum_{j=0,1,2} A_j(r)\operatorname{diag}(1, h, 1, h)\sqrt{nh}\,\{\theta^*(\beta_{x,j}) - \theta^*(x)\} \\ &= \sum_{j=0,1,2} A_j(r)E\{W_n(x)\}\{1 + o(1)\} + \sum_{j=0,1,2} A_j(r)\operatorname{diag}(1, h, 1, h)\sqrt{nh} \\ &\qquad \times \begin{pmatrix} \theta_1'(x)(\beta_{x,j} - x) + \tfrac{1}{2}\theta_1''(x)(\beta_{x,j} - x)^2 + O(h^3) \\ \theta_1''(x)(\beta_{x,j} - x) + O(h^2) \\ \theta_2'(x)(\beta_{x,j} - x) + \tfrac{1}{2}\theta_2''(x)(\beta_{x,j} - x)^2 + O(h^3) \\ \theta_2''(x)(\beta_{x,j} - x) + O(h^2) \end{pmatrix} \\ &= E\{W_n(x)\}\{1 + o(1)\} + O(\sqrt{nh}\,h^3). \end{aligned} \tag{A.5}$$

Arguing as in the proof of (A.1), (A.2) and (A.5) imply that

$$(\Sigma_x + h\Lambda_x)V_n^*(x) - E\{W_n(x)\} = \sum_{j=0,1,2} A_j(r)\bigl[W_n(\beta_{x,j}) - E\{W_n(\beta_{x,j})\}\bigr]\{1 + o_p(1)\} \xrightarrow{d} N\bigl(0, (\gamma_{ij})\bigr).$$

Hence, (2.2) can be shown by noting that

$$f_X(x)\,\bigl|I(\theta_1(x), \theta_2(x))\bigr|\,\Sigma_x^{-1} = \begin{pmatrix} J_{22} & -J_{12} \\ -J_{21} & J_{11} \end{pmatrix},$$

where $|\cdot|$ denotes the determinant and $J_{ij} = \operatorname{diag}(I_{ij}, \nu_{2,1}^{-1}I_{ij})$.

Proof of Theorem 2

Follows directly from Theorem 1.

Proof of Theorem 3

Similar to the proof of Theorem 1.

[Received May 2005. Revised May 2006.]

REFERENCES

Aerts, M., and Claeskens, G. (1997), “Local Polynomial Estimators in Multiparameter Likelihood Models,” Journal of the American Statistical Association, 92, 1536–1545.

Beirlant, J., and Goegebeur, Y. (2004), “Local Polynomial Maximum Likelihood Estimation for Pareto-Type Distributions,” Journal of Multivariate Analysis, 89, 97–118.

Chavez-Demoulin, V., and Davison, A. C. (2005), “Generalized Additive Modelling of Sample Extremes,” Journal of the Royal Statistical Society, Ser. C, 54, 207–222.

Cheng, M.-Y., Peng, L., and Wu, J.-S. (2005), “Reducing Variance in Univariate Smoothing,” technical report.

Claeskens, G., and Van Keilegom, I. (2003), “Bootstrap Confidence Bands for Regression Curves and Their Derivatives,” The Annals of Statistics, 31, 1852–1884.

Davison, A. C., and Ramesh, N. I. (2000), “Local Likelihood Smoothing of Sample Extremes,” Journal of the Royal Statistical Society, Ser. B, 62, 191–208.

Eguchi, S., Kim, T. Y., and Park, B. U. (2003), “Local Likelihood Method: A Bridge Over Parametric and Nonparametric Regression,” Journal of Nonparametric Statistics, 15, 665–683.

Fan, J., Farmen, M., and Gijbels, I. (1998), “Local Maximum Likelihood Estimation and Inference,” Journal of the Royal Statistical Society, Ser. B, 60, 591–608.

Fan, J., Heckman, N. E., and Wand, M. P. (1995), “Local Polynomial Kernel Regression for Generalized Linear Models and Quasi-Likelihood Functions,” Journal of the American Statistical Association, 90, 141–150.

Hall, P., and Tajvidi, N. (2000), “Nonparametric Analysis of Temporal Trend When Fitting Parametric Models to Extreme-Value Data,” Statistical Science, 15, 153–167.

Irizarry, R. A. (2001), “Information and Posterior Probability Criteria for Model Selection in Local Likelihood Estimation,” Journal of the American Statistical Association, 96, 303–315.

Kogure, A. (1998), “Effective Interpolations for Kernel Density Estimators,” Journal of Nonparametric Statistics, 9, 165–195.

Loader, C. (1999), Local Regression and Likelihood, New York: Springer-Verlag.

Ramesh, N. I., and Davison, A. C. (2002), “Local Models for Exploratory Analysis of Hydrological Extremes,” Journal of Hydrology, 256, 106–119.

Signorini, D. F., and Jones, M. C. (2004), “Kernel Estimators for Univariate Binary Regression,” Journal of the American Statistical Association, 99, 119–126.

Staniswalis, J. G. (1989), “The Kernel Estimate of a Regression Function in Likelihood-Based Models,” Journal of the American Statistical Association, 84, 276–283.

Tibshirani, R., and Hastie, T. (1987), “Local Likelihood Estimation,” Journal of the American Statistical Association, 82, 559–567.

Yu, K., and Jones, M. C. (2004), “Likelihood-Based Local Linear Estimation of the Conditional Variance Function,” Journal of the American Statistical Association, 99, 139–144.