
Statistical Estimation in Generalized Multiparameter Likelihood Models

Ming-Yen Cheng, Wenyang Zhang and Lu-Hung Chen

Abstract

Multiparameter likelihood models (MLM) with multiple covariates have a wide range of applications. However, they encounter the "curse of dimensionality" problem when the dimension of the covariates is large. We develop a generalized multiparameter likelihood model that copes with multiple covariates and adapts well to dynamic structural changes. It includes some popular models, such as the partially linear and varying coefficient models, as special cases. When the model is fixed, a simple and effective two-step method is developed to estimate both the parametric and the nonparametric components. The proposed estimator of the parametric component has the $n^{-1/2}$ convergence rate, and the estimator of the nonparametric component enjoys an adaptivity property. A data-driven procedure is suggested to select the bandwidths involved. Also proposed is a new initial estimator in profile likelihood estimation of the parametric part to ensure stability of the approach in general settings.

We further develop an automatic procedure to identify constant parameters in the underlying model. A simulation study and an application to the infant mortality data of China are given to demonstrate performance of the proposed methods.

KEY WORDS: bandwidth selection, model selection, multiple covariate, partially linear models, profile likelihood, varying coefficient models, semiparametric models.

SHORT TITLE: Generalized multiparameter likelihood models.

Ming-Yen Cheng is Professor, Department of Mathematics, National Taiwan University, Taipei 106, Taiwan (Email: cheng@math.ntu.edu.tw) and Professor of Statistics, Department of Statistical Science, University College London, Gower Street, London, WC1E 6BT, UK (Email: ming-yen.cheng@ucl.ac.uk). Wenyang Zhang is Reader in Statistics, Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK (Email: W.Zhang@bath.ac.uk). Lu-Hung Chen is Ph.D. Candidate, Department of Mathematics, National Taiwan University, Taipei 106, Taiwan (Email: d95221006@ntu.edu.tw). Cheng's research was supported in part by the National Science Council (grant NSC95-2118-M-002-003-MY2). This project began while Wenyang Zhang visited National Taiwan University, with funding supported by the Mathematics Division of the National Center of Theoretical Sciences (Taipei). We thank the associate editor and two referees for constructive comments which led to substantial improvement of this paper.


1 Introduction

Consider statistical modeling of the relationship between a response variable and some covariates. Maximum likelihood estimation is most powerful when the joint distribution of the response variable and covariates is specified by a parametric form.

However, parametric approaches are at risk of model misspecification, which could result in seriously biased estimation, misinterpretation of data, and other difficulties.

Nonparametric modeling is more flexible and allows the data to reveal the unknown truth. However, one always comes up against the "curse of dimensionality" problem, i.e. model instability when the dimension of the covariates is large. Many hybrids of parametric and nonparametric models, generally called semiparametric models, have been proposed to achieve a good balance between flexibility and stability in model specification. Examples are varying coefficient models (Hastie and Tibshirani, 1993; Xia and Li, 1999; Fan and Huang, 2005; Cai, Fan, Jiang and Zhou, 2008; Li and Liang, 2008), partially linear models (Engle, Granger, Rice and Weiss, 1986; Härdle, Liang and Gao, 2000; Cai, Fan, Jiang and Zhou, 2007) and additive models (Hastie and Tibshirani, 1990; Fan, Härdle and Mammen, 1998; Fan and Jiang, 2005; Jiang and Zhou, 2007).

In this paper we suggest a semiparametric model for a population (X, U, Y ) in which U is a continuous variable and the conditional density function of Y given (X, U ) is specified by

$$ f\{Y; X, \theta, x_1^T a_1(U), \cdots, x_\ell^T a_\ell(U)\}, \qquad (1.1) $$

where $f$ is a known parametric density function, $\theta = (\theta_1, \cdots, \theta_q)^T$ is an unknown constant vector, $X = (X_1, \cdots, X_p)^T$ with $X_1 \equiv 1$, $x_j$ is a $p_j$-dimensional subvector of $X$, and $a_j(\cdot) = (a_{j1}(\cdot), \cdots, a_{jp_j}(\cdot))^T$ is an unknown function, $j = 1, \cdots, \ell$. Here, $1 \le \ell \le d$, where $d$ is defined in (1.2). Model (1.1) is a hybrid of the standard MLM, which assumes the conditional density function of $Y$ given $X$ follows the form

$$ f\{Y; a_1(X), \cdots, a_d(X)\}, \qquad (1.2) $$

where $f$ has $d$ identifiable parameters and $a_1(X), \cdots, a_d(X)$ are unknown functions, i.e. $Y$ depends on $X$ through the $d$ identifiable parameters in $f$ being modeled as


nonparametric functions of X. Aerts and Claeskens (1997) studied a locally linear maximum likelihood estimator of MLM when X is univariate, and Cheng and Peng (2007) proposed a variance reduction technique to improve the estimation. MLM provides a general framework for specifying the statistical relationship between response and covariates under a wide range of data configurations, including continuous, categorical, binary and count variables as the response and cases where the response is univariate or vector valued. In addition, it can easily be adopted to cope with various statistical problems such as mean regression, variance estimation, quantile regression, hazard regression, logistic regression and longitudinal data analysis. See Aerts and Claeskens (1997), Loader (1999), Claeskens and Aerts (2000) and Cheng and Peng (2007), among others, for further details.

With the availability of U, model (1.1) specifies ℓ of the d parameter functions in model (1.2) by some nonparametric or semiparametric form, and if d − ℓ > 0 the other d − ℓ parameter functions in (1.2) are now modeled parametrically in (1.1), with θ consisting of all the constant parameters. Like MLM (1.2), (1.1) offers a unified approach to modeling a wide range of data settings and to dealing with various inference problems. Nonetheless, (1.1) avoids the curse of dimensionality problem that (1.2) has when the dimension of X is large, and it allows parametric, nonparametric or semiparametric modeling of the parameter functions in (1.2). Further, (1.1) broadens the application of MLM since it can cope with categorical covariates, which often arise in practice. Model (1.1) is a very general semiparametric model provided there exists a continuous variable U and other covariates X. First, it reduces to a partially linear model when ℓ = 1, x1 = X1 ≡ 1 and θ interacts with X through a linear function.

When ℓ = 1, q = 0 and x1 = X, model (1.1) becomes the varying coefficient model of Hastie and Tibshirani (1993) with the same modifying variable U. Thus, (1.1) inherits the model stability, flexibility and interpretability that varying coefficient models enjoy. In addition, it is closely related to the Regression Model II in Section 4.3 of Bickel, Klaassen, Ritov and Wellner (1993).

We propose a simple, effective and fast two-step procedure to estimate both the constant and functional parameters in (1.1). Its implementation does not involve any iteration, which is usually required by conventional approaches, e.g. profile likelihood


and backfitting. Further, we develop an AIC data-driven procedure to select the bandwidths required in the two-step estimation. Using an AIC criterion, or modified versions, to select smoothing parameters in nonparametric regression and local likelihood modeling has been extensively discussed and implemented, c.f. Hurvich, Simonoff and Tsai (1998), Loader (1999), Schucany (2004), among others. For local likelihood estimation, Aerts and Claeskens (1997) considered cross-validation and plug-in bandwidths, and Farmen and Gijbels (1998) suggested a bandwidth selector based on an approximation to the integrated mean squared error. A profile likelihood approach can be applied to estimate the constant parameters. A new initial estimator is proposed to ensure stability of the profile likelihood approach no matter what types of features, e.g. location, scale or shape, the constant parameters play in the model. In general, neither profile likelihood nor the two-step estimator of the constant parameters is consistently superior to the other; see the asymptotic results and discussion in Section 5 and the simulation results reported in Section 6. Nevertheless, the major strength of the two-step estimation is its simple and fast implementation and numerical stability, as no iteration is required.

In practice, the real challenge is that we are often given a collection of significant covariates but do not know which of the parameter functions are constant and which are functional in (1.1), i.e. we are not sure about the specification of θ and x1, · · · , x_ℓ. In an attempt to solve this fundamental identification problem, we suggest a stepwise procedure based on a version of the BIC criterion adapted to our model. Identification of constant parameters and bandwidth selection interact with each other.

We propose to select the bandwidths first and then keep them fixed throughout the procedure for identifying the constant parameters. This approach resolves a complex problem in an effective, fast and stable fashion, as confirmed by a simulation study and a real data analysis. We are not aware of any existing methods for identifying constant parameters or covariates in the parametric component of a semiparametric model, although there exists an abundant literature on the different issue of variable selection for parametric models, nonparametric models, and parametric or nonparametric components in semiparametric models. For example, Irizarry (2001) derived weighted versions of AIC and BIC and posterior


probability model selection criteria for one-parameter local likelihood models. Fan and Li (2002) employed profile likelihood techniques in their nonconcave penalized likelihood approach to selecting variables in the parametric part of Cox’s proportional hazards model. Fan and Li (2004) incorporated profiling ideas in their construction of penalized least squares for variable selection in the parametric component of a semiparametric model for longitudinal data analysis. Bunea (2004) constructed a penalized least squares criterion for variable selection in the linear part of a partially linear model. For a generalized varying-coefficient partially linear model, Li and Liang (2008) used a nonconcave penalized likelihood to select significant variables in the parametric component and a generalized likelihood ratio test to select significant variables in the nonparametric component, assuming the two sets of covariates in the parametric and nonparametric components are separated in advance.

Section 2 provides some motivating examples for model (1.1) and discusses the identifiability issue. Our two-step estimation procedure for both the constant and functional parameters and a new initial estimator for profile likelihood estimation of the constant parameters are given in Section 3. Bandwidth selection and identification of the constant parameters are addressed in Section 4. Asymptotic properties of the two-step estimators are investigated in Section 5. Section 6 presents a simulated example and an analysis of a motivating example about infant mortality. Proofs of the theoretical results are deferred to the Appendix.

2 Motivating examples and model identifiability

In applications, some of the unknown functional parameters in MLM (1.2) may simply be unknown constants. Under such circumstances, we would pay a price in efficiency if the unknown constants were still treated as unknown functions. An example is the analysis of 103 annual maximum temperatures (Cheng and Peng, 2007) in which Y | X is modeled by an extreme value distribution, where X is year. The estimates of the shape and scale parameter curves are flat except in the boundary regions, which is reasonable as the two parameters are unlikely to change much within one hundred years. To accommodate such situations, (1.2) needs to be restricted to the following


semiparametric model

$$ f\{Y; X, \theta, a_1(X), \cdots, a_\ell(X)\}, \qquad (2.1) $$

where 1 ≤ ℓ < d and θ is a q-dimensional unknown parameter. Here, ℓ out of the d parameter functions in (1.2) remain unknown functions of X, and the other d − ℓ parameter functions are formulated by certain parametric forms, e.g. unknown constants, with θ consisting of all the constant parameters. The model studied by Severini and Wong (1992) is a special case of (2.1) with d = 2, ℓ = 1, q = 1, and the covariate being univariate. Profile likelihood estimation of θ, along with consistent estimators of a least favorable curve, was studied therein.

When the dimension of X is large, neither (1.2) nor (2.1) would work, due to the curse of dimensionality problem. Claeskens and Aerts (2000) suggested alleviating this problem by restricting a1(·), · · · , a_d(·) in (1.2) to additive models and estimating them by a backfitting algorithm. Alternatively, the following restriction of (2.1)

$$ f\{Y; X, \theta, X^T\beta_1, \cdots, X^T\beta_\ell\}, \qquad (2.2) $$

where β1, · · · , β_ℓ are unknown constant vectors, would cope with the curse of dimensionality problem. However, (2.2) actually implies that the impact of X on Y is constant, which is somewhat implausible in practice. For example, in the analysis of infant mortality in China detailed later, the impact of type of region of residence on mortality would not be constant along the time U, as China has changed very much since 1950 and the difference between rural and urban regions has been changing.

The impact must vary with U and the pattern of the change is of interest. To fit the data more accurately and to catch the dynamic pattern of the changes in the impact, we extend (2.2) to

$$ f\{Y; X, \theta, X^T a_1(U), \cdots, X^T a_\ell(U)\}, \qquad (2.3) $$

where θ is an unknown constant vector, and $a_j(\cdot) = (a_{j1}(\cdot), \cdots, a_{jp}(\cdot))^T$, j = 1, · · · , ℓ.

In (2.3), a1(·), · · · , a`(·) have to share the same dimension p and all the aij(·)’s are assumed to be functional. This model assumption may be unnecessary in some situations. The analysis of infant mortality in China is an example: the impact of ethnic group or type of feeding on infant mortality can be formulated as an unknown


constant parameter. To remove such an unnecessary restriction and to make the model more versatile, we generalize (2.3) to (1.1), with all the a_{ij}(·)'s in (2.3) that are constant absorbed into θ in (1.1).

When a1(·), · · · , a_ℓ(·) are all constant, model (2.3) reduces to model (2.2) and I(γ) defined in Theorem 1 becomes $\tilde{I}$, where $\tilde{I}$ is I(γ) with $a_j(U)$ replaced by $\beta_j$. Condition (7) in the Appendix ensures the smallest eigenvalue of $\tilde{I}$ is greater than the positive number $\lambda_0$ in Condition (7). If $(X_i, Y_i)$, i = 1, · · · , n, is a sample from model (2.2) then, under Condition (7), the Fisher information matrix is

$$ \sum_{i=1}^n \mathrm{diag}(I_q, I_\ell \otimes X_i)\, \tilde{I}_i\, \mathrm{diag}(I_q, I_\ell \otimes X_i^T) > \lambda_0 \sum_{i=1}^n \mathrm{diag}(I_q, I_\ell \otimes X_i)\, \mathrm{diag}(I_q, I_\ell \otimes X_i^T) \approx n\lambda_0\, \mathrm{diag}\{I_q, I_\ell \otimes E(XX^T)\} > 0, $$

where $\tilde{I}_i$ is $\tilde{I}$ with X replaced by $X_i$. Here, $I_k$ denotes the k × k identity matrix and, for any matrices A and B, diag(A, B) denotes the block-diagonal matrix $\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}$.

Hence, Condition (7) ensures the Fisher information matrix of the parametric model (2.2) is positive-definite, i.e. model (2.2) is identifiable. Further, for any given value of U , the local version of model (2.3) is model (2.2). Therefore, under Condition (7), model (2.3) is identifiable for any given value of U , and so model (2.3) is identifiable.

Also, model (1.1) specifies some of the aij(·)’s in model (2.3) as constant and therefore is identifiable. Based on the above arguments, we have the following lemma.

Lemma 1. Under Condition (7) in the Appendix, both models (1.1) and (2.3) are identifiable.

3 Estimation procedures

Suppose we have a sample (Xi, Ui, Yi), i = 1, · · · , n, from (X, U, Y ) which obeys model (1.1). Let xi,j be the pj-dimensional subvector of Xi that corresponds to xj, j = 1, · · · , `, i = 1, · · · , n. We introduce our two-step and profile likelihood procedures for estimating both the constant and functional parameters in Sections 3.1 and 3.2.


3.1 Two-step estimation

Our two-step approach first produces an estimator for the constant vector θ, and then this estimator is plugged into the local likelihood function to estimate the functions a_j(·), j = 1, · · · , ℓ.

The estimation procedure for θ consists of two stages. First, we treat θ as an unknown function of U and appeal to the local likelihood approach to get a preliminary estimator $\tilde{\theta}(U_i)$ of θ(U_i) for each U_i, i = 1, · · · , n. Then, we average $\tilde{\theta}(U_i)$ over i = 1, · · · , n to get the final estimator of θ. The procedure is detailed as follows.

The conditional log-likelihood function is

$$ L_0(\theta, a) = \sum_{i=1}^n \log f\{Y_i; X_i, \theta, x_{i,1}^T a_1(U_i), \cdots, x_{i,\ell}^T a_\ell(U_i)\}, \qquad (3.1) $$

where $a(\cdot) = (a_1(\cdot)^T, \cdots, a_\ell(\cdot)^T)^T$. Given u, by Taylor's expansion, for each j we have

$$ a_j(U_i) \approx a_j(u) + \dot{a}_j(u)(U_i - u) $$

when $U_i$ is in a neighborhood of u, where $\dot{a}_j(u) = d a_j(u)/du$. This leads to the following local log-likelihood function

$$ \sum_{i=1}^n K_h(U_i - u) \log f\big[Y_i; X_i, \theta, x_{i,1}^T\{a_1 + b_1(U_i - u)\}, \cdots, x_{i,\ell}^T\{a_\ell + b_\ell(U_i - u)\}\big], \qquad (3.2) $$

where $K_h(\cdot) = K(\cdot/h)/h$, $K(\cdot)$ is a kernel function, and h > 0 is a bandwidth. Maximizing (3.2) with respect to $(\theta^T, a_1^T, b_1^T, \cdots, a_\ell^T, b_\ell^T)^T$, we get the maximizer $(\tilde{\theta}(u)^T, \tilde{a}_1(u)^T, \tilde{b}_1(u)^T, \cdots, \tilde{a}_\ell(u)^T, \tilde{b}_\ell(u)^T)^T$. In the above local likelihood estimation, θ is fitted by a local constant vector, since θ is constant under model (1.1) and fitting it this way stabilizes the procedure. For i = 1, · · · , n, setting $u = U_i$ gives an initial estimator $\tilde{\theta}(U_i)$ of θ. The final estimator of θ is taken to be

$$ \hat{\theta} = n^{-1} \sum_{i=1}^n \tilde{\theta}(U_i). \qquad (3.3) $$

For $\hat{\theta}$ to achieve the $n^{-1/2}$ convergence rate, we have to choose a relatively small bandwidth h so that the biases of $\tilde{\theta}(\cdot)$ and $\tilde{a}_j(\cdot)$, j = 1, · · · , ℓ, are dominated by $n^{-1/2}$. This ensures that estimating the constant and the functional parts simultaneously in the first step does not create extra bias for θ. Then, averaging over $\tilde{\theta}(U_i)$, i = 1, · · · , n,


as in (3.3) brings the variance from the order $(nh)^{-1}$ in nonparametric estimation back to the order $n^{-1}$ in parametric estimation. We will show later that $\hat{\theta}$ is root-n consistent when h is properly chosen. Like any other maximum local likelihood estimation procedure, the bandwidth h cannot be chosen too small; otherwise one runs into problems with singularity of the design matrix. From the asymptotic point of view, Condition (5) prohibits the bandwidth h from being too small. Thus, Conditions (5) and (7) guarantee that the estimators $\tilde{\theta}(U_1), \cdots, \tilde{\theta}(U_n)$ exist. Further, the method of Cheng, Wu and Yang (2008) can be employed to modify the local likelihood function (3.2) to overcome the singularity problem caused by a small h or sparsity in the design points $U_i$'s. This approach can also be applied to (3.4) when estimating the function a(u).

With $\hat{\theta}$, we can estimate a(u) using the maximum local likelihood approach again.

Note that the estimator $\tilde{a}(u) = (\tilde{a}_1(u)^T, \cdots, \tilde{a}_\ell(u)^T)^T$ obtained before is too noisy and is not appropriate for this purpose, since the bandwidth h is intentionally chosen small in order to get a good estimator of θ. Hence, we will use another, larger bandwidth to estimate a(u). We replace θ in (3.2) by $\hat{\theta}$ to get a local log-likelihood function for a(u):

$$ \sum_{i=1}^n K_{h_1}(U_i - u) \log f\big[Y_i; X_i, \hat{\theta}, x_{i,1}^T\{a_1 + b_1(U_i - u)\}, \cdots, x_{i,\ell}^T\{a_\ell + b_\ell(U_i - u)\}\big], \qquad (3.4) $$

where $h_1 > 0$ is a bandwidth different from h. We could use a kernel other than K at this step, but that does not matter much. Maximizing (3.4) with respect to $(a_1^T, b_1^T, \cdots, a_\ell^T, b_\ell^T)^T$, we get the maximizer $(\hat{a}_1(u)^T, \hat{b}_1(u)^T, \cdots, \hat{a}_\ell(u)^T, \hat{b}_\ell(u)^T)^T$. Our estimator of a(u) is taken to be $\hat{a}(u) = (\hat{a}_1(u)^T, \cdots, \hat{a}_\ell(u)^T)^T$. As the convergence rate of $\hat{\theta}$ is $n^{-1/2}$ (see Section 5), $\hat{a}(u)$ works as well as it would if θ were known in the local log-likelihood (3.4). That is, $\hat{a}(u)$ has the adaptivity property.

In some cases, local likelihood estimation of the varying coefficients a_j(·), j = 1, · · · , ℓ, may require different amounts of smoothing; see, for example, Claeskens and Aerts (2000). Backfitting ideas can be implemented to achieve this goal: (i) use $\hat{a}(\cdot)$ as the initial estimate; (ii) for each j, substitute all the local linear coefficient functions except the jth, and h1 in (3.4), respectively by the previous estimates and the bandwidth for smoothing the jth functional parameter, and then maximize the resulting local likelihood to find an estimate of a_j(·); (iii) iterate step (ii) until


convergence. The convergence is usually attained quickly.
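The two-step procedure above can be illustrated numerically. The sketch below is ours, not the paper's implementation: it assumes a Gaussian toy instance of (1.1) in which the constant parameter θ is the error variance and there is a single varying coefficient, so each local likelihood maximization reduces to weighted least squares plus a weighted residual variance. All names (local_fit, theta_true, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Gaussian instance of model (1.1): Y | X, U ~ N(X * a(U), theta),
# where theta (the error variance) is the constant parameter and
# a(U) = 2 + sin(2*pi*U) is the varying coefficient.
n = 400
U = rng.uniform(0.0, 1.0, n)
X = rng.normal(0.0, 1.0, n)
theta_true = 0.25
Y = X * (2.0 + np.sin(2.0 * np.pi * U)) + rng.normal(0.0, np.sqrt(theta_true), n)

def epanechnikov(t):
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def local_fit(u, h):
    """Maximize the local Gaussian log-likelihood at u: weighted least
    squares gives the local linear coefficients (a, b), and the weighted
    residual variance is the local MLE of theta."""
    w = epanechnikov((U - u) / h) / h
    sw = np.sqrt(w)
    G = np.column_stack([X, X * (U - u)])          # local linear design
    coef, *_ = np.linalg.lstsq(G * sw[:, None], Y * sw, rcond=None)
    resid = Y - G @ coef
    return coef[0], np.sum(w * resid ** 2) / np.sum(w)

# Step 1: undersmooth with a small bandwidth h, estimate theta locally at
# each U_i, then average as in (3.3).
h = n ** (-0.3)
theta_hat = np.mean([local_fit(u, h)[1] for u in U])

# Step 2: plug theta_hat into (3.4) with a larger bandwidth h1.  In this
# Gaussian toy the maximizer for a(u) does not depend on theta, so the
# second step is again weighted least squares.
h1 = 0.15
def a_hat(u):
    return local_fit(u, h1)[0]
```

In simulations of this kind, theta_hat concentrates around the true variance 0.25 and a_hat(u) tracks the sine curve; the averaging in step 1 is what restores the parametric $n^{-1/2}$ rate for θ from the slower local estimates.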

3.2 Profile likelihood estimation

A profile likelihood estimator for θ maximizes, with respect to θ, the profiled log-likelihood

$$ \sum_{i=1}^n \log f\{Y_i; X_i, \theta, x_{i,1}^T \tilde{a}_{1,\theta}(U_i), \cdots, x_{i,\ell}^T \tilde{a}_{\ell,\theta}(U_i)\}, $$

where, for any given θ, $\tilde{a}_\theta(\cdot) = (\tilde{a}_{1,\theta}(\cdot)^T, \cdots, \tilde{a}_{\ell,\theta}(\cdot)^T)^T$ is an estimator for a(·). In practice, one needs to find the minimizer of

$$ \frac{\partial L_0}{\partial \theta}(\theta, \tilde{a}_\theta) + \frac{\partial L_0}{\partial a}(\theta, \tilde{a}_\theta)\, \frac{\partial}{\partial \theta}\tilde{a}_\theta \qquad (3.5) $$

by iteration, where $L_0$ is given in (3.1). When the specified semiparametric model is general like (1.1), in which θ may involve shape or scale parameters in f, stability of the iteration heavily relies on a proper choice of the initial estimate. Under semiparametric models for the regression mean, Fan and Huang (2005) and Lam and Fan (2008) employed difference-based methods to obtain a reliable initial estimate.

However, difference-based methods may not work for model (1.1) since some of the elements in θ can be other than mean parameters. We propose a new initial estimate in the following.

First, we derive some rough estimates of $a_j(U_i)$, i = 1, · · · , n, j = 1, · · · , ℓ. Consider a model obtained by replacing θ in (1.1) with $a_0(U)$, a q-dimensional unknown function of U; it specifies the conditional density of Y given X and U as

$$ f\{Y; X, a_0(U), x_1^T a_1(U), \cdots, x_\ell^T a_\ell(U)\}. \qquad (3.6) $$

For any given u, let $(\bar{a}_0(u)^T, \bar{b}_0(u)^T, \bar{a}_1(u)^T, \bar{b}_1(u)^T, \cdots, \bar{a}_\ell(u)^T, \bar{b}_\ell(u)^T)^T$ be the maximizer, with respect to $(a_0^T, b_0^T, a_1^T, b_1^T, \cdots, a_\ell^T, b_\ell^T)^T$, of the local log-likelihood function

$$ \sum_{i=1}^n K_{h_1}(U_i - u) \log f\big[Y_i; X_i, a_0 + b_0(U_i - u), x_{i,1}^T\{a_1 + b_1(U_i - u)\}, \cdots, x_{i,\ell}^T\{a_\ell + b_\ell(U_i - u)\}\big]. $$

Here, h1 can be taken as the bandwidth $\hat{h}_1$ in Section 4.2, since that bandwidth is selected for local likelihood estimation under model (3.6). Letting $u = U_i$ in the above


procedure, we have $\bar{a}_j(U_i)$, j = 1, · · · , ℓ, i = 1, · · · , n. Then, our initial estimate $\bar{\theta}$ is the maximizer of

$$ \sum_{i=1}^n \log f\{Y_i; X_i, \theta, x_{i,1}^T \bar{a}_1(U_i), \cdots, x_{i,\ell}^T \bar{a}_\ell(U_i)\}. $$

During the iteration in finding the minimizer of (3.5), $\tilde{a}_\theta(\cdot)$ is taken as the estimator that solves (3.4) with $\hat{\theta}$ and h1 replaced by θ and $\hat{h}_1$ respectively. With this choice of bandwidth the least favorable curve is approximated well, by the nature of model (3.6). Upon convergence of the iteration, we obtain the profile likelihood estimator of θ. Then we can estimate a(·) and select the bandwidth in the same manner as described in Sections 3.1 and 4.2, with $\hat{\theta}$ replaced by the profile likelihood estimator of θ.
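The profiling idea can also be sketched on a toy case. The snippet below is an illustrative sketch, not the paper's algorithm: it assumes a Gaussian model Y = θ + X·a(U) + ε, replaces the iteration on (3.5) with a simple grid search over θ for clarity, and computes the profiled curve estimate by local linear weighted least squares; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Gaussian case: Y = theta + X * a(U) + N(0, 0.25), with constant
# location theta and varying coefficient a(U) = 1 + U.
n = 300
U = rng.uniform(0.0, 1.0, n)
X = rng.normal(0.0, 1.0, n)
theta_true = 1.0
Y = theta_true + X * (1.0 + U) + rng.normal(0.0, 0.5, n)

def epanechnikov(t):
    return 0.75 * np.maximum(1.0 - t ** 2, 0.0)

def a_tilde(theta, h1=0.2):
    """For fixed theta, local linear estimates of a(U_i) at each U_i,
    playing the role of the profiled curve estimate from (3.4)."""
    fits = np.empty(n)
    for k, u in enumerate(U):
        w = epanechnikov((U - u) / h1) / h1
        sw = np.sqrt(w)
        G = np.column_stack([X, X * (U - u)])
        coef, *_ = np.linalg.lstsq(G * sw[:, None], (Y - theta) * sw, rcond=None)
        fits[k] = coef[0]
    return fits

def profiled_loglik(theta):
    # Gaussian log-likelihood up to constants, with a(.) profiled out
    resid = Y - theta - X * a_tilde(theta)
    return -np.sum(resid ** 2)

# Grid search stands in for the iteration on (3.5).
grid = np.linspace(0.0, 2.0, 41)
theta_profile = grid[int(np.argmax([profiled_loglik(t) for t in grid]))]
```

The grid search makes the profiling explicit: each candidate θ gets its own curve estimate before the likelihoods are compared, which is exactly what makes a good initial estimate matter when (3.5) is instead solved by Newton-type iteration.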

4 Bandwidth selection and identifying constant parameters

In reality, we do not know which of the parameters are constant and which are functional in model (1.1). This is essentially a model selection problem. There are many model selection criteria under parametric assumptions, e.g. cross-validation (Stone, 1974), AIC (Akaike, 1970), BIC (Schwarz, 1978), and nonconcave penalized likelihood (Fan and Li, 2001). Among these various criteria, due to their easy implementation, AIC and BIC are probably the most commonly used in practice. We appeal to the AIC and BIC ideas to select the bandwidths h1 and h in the estimation procedures and to identify the constant parameters in model (1.1).

4.1 Model selection criteria

The standard AIC and BIC formulae contain the number of unknown parameters in the model, and so do the versions of AIC and BIC for model (1.1). To work out that number for (1.1), we have to find out how many unknown parameters each unknown function $a_{ij}(\cdot)$ amounts to. In nonparametric modeling, when a locally linear approximation is used, Fan and Gijbels (1996) suggested that an unknown


function amounts to

$$ \mathrm{tr}\{(G_0^T W_0 G_0)^{-1} G_0^T W_0^2 G_0\} $$

unknown parameters, where

$$ G_0 = \begin{pmatrix} 1 & U_1 - u \\ \vdots & \vdots \\ 1 & U_n - u \end{pmatrix}, \qquad W_0 = \mathrm{diag}\{K_{h_1}(U_1 - u), \cdots, K_{h_1}(U_n - u)\}. $$

To simplify the calculation, we look into its asymptotic version: when n is large enough,

$$ \mathrm{tr}\{(G_0^T W_0 G_0)^{-1} G_0^T W_0^2 G_0\} \approx h_1^{-1}(\nu_0 + \nu_2/\mu_2), $$

where $\nu_i = \int t^i K^2(t)\,dt$ and $\mu_i = \int t^i K(t)\,dt$. Thus, we take

$$ \mathcal{K} = q + h_1^{-1}(\nu_0 + \nu_2/\mu_2)(p_1 + \cdots + p_\ell) $$

as the number of parameters involved in our estimation procedure for model (1.1). When the Epanechnikov kernel $K(t) = 0.75(1 - t^2)_+$ is used, $\nu_0 + \nu_2/\mu_2 = 1.028571$.

The AIC and BIC formulae given later apply to any model of the form (1.1), which can have different q, ℓ or $x_j$, and to model (3.6); in the latter case $\mathcal{K} = h_1^{-1}(\nu_0 + \nu_2/\mu_2)(q + p_1 + \cdots + p_\ell)$ and $\hat{\theta}$, $\hat{a}_j(\cdot)$, j = 1, · · · , ℓ, are replaced by $\bar{a}_j(\cdot)$, j = 0, 1, · · · , ℓ, in the formulae.
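As a quick sanity check on the kernel constants in the effective-parameter formula, one can evaluate ν0, ν2 and µ2 for the Epanechnikov kernel numerically and recover the constant 1.028571. This is our own check, with illustrative variable names:

```python
import numpy as np

# Riemann-sum evaluation of the kernel constants for the Epanechnikov
# kernel K(t) = 0.75 * (1 - t^2) on [-1, 1].
t = np.linspace(-1.0, 1.0, 200001)
dt = t[1] - t[0]
K = 0.75 * (1.0 - t ** 2)

nu0 = np.sum(K ** 2) * dt           # int K(t)^2 dt     = 0.6
nu2 = np.sum(t ** 2 * K ** 2) * dt  # int t^2 K(t)^2 dt = 3/35
mu2 = np.sum(t ** 2 * K) * dt       # int t^2 K(t) dt   = 0.2

c = nu0 + nu2 / mu2                 # the constant in the formula, ~1.028571

# Effective number of parameters for model (1.1) with illustrative
# dimensions q = 2, one functional part of dimension 3, and bandwidth 0.1:
n_params = 2 + c * 3 / 0.1
```

The exact values are ν0 = 3/5, ν2 = 3/35 and µ2 = 1/5, so ν0 + ν2/µ2 = 36/35 ≈ 1.028571, matching the number quoted above.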

4.2 Bandwidth selection

Suppose (1.1) is the true model and is used to analyze the data. The choice of the bandwidths h and h1 determines the performance of the two-step estimators in Section 3.1. Compared to choosing h1, selecting h is relatively simple, since h is employed to get undersmoothed estimators of the functional parameters. It follows from Theorem 1 that we get a good estimator of θ by letting h be of order $n^{-\alpha}$ for any α ∈ (1/4, 1).

In practice, we may take α = 1/4 + δ for a small δ > 0 to avoid difficulties in the maximization of (3.2) caused by design sparsity. Proper selection of h1 is crucial for $\hat{a}(\cdot)$ to perform well. We propose to obtain a reasonable choice of h1 first, then utilize the relationship between the optimal rate of h1 and a suitable rate of h to determine h, and finally select h1.


Define the AIC criterion for model (1.1) as

$$ \mathrm{AIC} = -2 \sum_{i=1}^n \log f\{Y_i; X_i, \hat{\theta}, x_{i,1}^T \hat{a}_1(U_i), \cdots, x_{i,\ell}^T \hat{a}_\ell(U_i)\} + 2\mathcal{K}. $$

To get a reasonable choice of h1, we compute the version of AIC for model (3.6) for different values of h1. This yields an AIC function of h1 only; h is not involved since there are no constant parameters in (3.6). Then $\hat{h}_1$, the minimizer of this AIC function of h1, is a rough approximation to the optimal value of h1 in the two-step estimator $\hat{a}(\cdot)$ of a(·) in (1.1). The reason is that, when modeling data from (1.1) using (3.6), the true value of $a_0(\cdot)$ is the constant vector θ, so the curve estimate of $a_0(\cdot)$ would be roughly flat for a wide range of h1 and the AIC criterion for (3.6) mainly measures the performance of the estimators of $a_j(\cdot)$, j = 1, · · · , ℓ, as h1 varies.

Selection of the bandwidth h in the two-step estimation of θ is not a major issue, as explained before. Any bandwidth h will do as long as it is relatively small but not too small. In the light of Theorem 2, which suggests that the optimal rate of h1 is $n^{-1/5}$, and the earlier discussion on the choice of h, we take $\hat{h} = n^{-0.051}\hat{h}_1$. Note that the value of $n^{-0.051}$ falls in the narrow range (0.5559, 0.7907) for n ∈ [10^2, 10^5].
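The narrow-range claim for $n^{-0.051}$ is easy to verify directly. This is a one-off check of ours; h1_hat below is an arbitrary illustrative value:

```python
# n**(-0.051) varies little over n in [10**2, 10**5], so the rule
# h_hat = n**(-0.051) * h1_hat shrinks the AIC-selected bandwidth by a
# factor between roughly 0.556 and 0.791 for typical sample sizes.
ratios = {n: n ** (-0.051) for n in (10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5)}

h1_hat = 0.2                               # illustrative AIC-selected bandwidth
h_hat = {n: r * h1_hat for n, r in ratios.items()}
```

Since the factor is bounded well away from 0, $\hat{h}$ is genuinely smaller than $\hat{h}_1$ but of a comparable magnitude, consistent with the requirement that h be small without being too small.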

The bandwidth $\hat{h}_1$ is chosen for estimating the functions $a_0(\cdot), a_1(\cdot), \cdots, a_\ell(\cdot)$ in model (3.6), which specifies the constant vector θ in the true model (1.1) as functional.

Now, we refine our data-driven selection of h1, which is required in the two-step estimation of a(·) in the true model (1.1). Based on $\hat{h}$, we obtain the two-step estimator $\hat{\theta}$ of θ in (1.1). Plugging $\hat{\theta}$ into (3.4), we find $\hat{a}(\cdot)$ and compute the AIC criterion for model (1.1) over a range of h1. Denote the minimizer of this AIC function as $\tilde{h}_1$.

4.3 Identifying constant parameters

Define the version of BIC for model (1.1) as

$$ \mathrm{BIC} = -2 \sum_{i=1}^n \log f\{Y_i; X_i, \hat{\theta}, x_{i,1}^T \hat{a}_1(U_i), \cdots, x_{i,\ell}^T \hat{a}_\ell(U_i)\} + \mathcal{K}\log(n). $$

We propose a procedure to identify which parameters are constant and which are functional in model (1.1) based on the BIC criterion. This model selection problem interacts with the bandwidth selection problem: the BIC formula depends on the bandwidth h1. In fact, it is almost impossible to choose the bandwidth and the


constant parameters simultaneously. This is because either a complex model or a small bandwidth can result in a small bias and a large variance, and either a simple model or a large bandwidth can result in a large bias and a small variance. Thus, a complex model with a large bandwidth would have the same effect as a simple model with a small bandwidth. A sensible solution is to choose the bandwidths h1 and h first and then identify the constant parameters, as suggested in the following.

We start with a model M0 of the form (3.6), and then decide which parameters in M0 are functional and which are constant. The choice of $a_0(U)$ and x1, · · · , x_ℓ, and how they determine the dependence of Y on X and U in M0, should come from the basic assumptions on the model; in practice they are determined by the analyst. Due to the curse of dimensionality, we have to impose some basic assumptions on the model based on knowledge about the data we analyze, which is usually available from the background of the data or from people working in the area where the data arise. Since all the unknown parameters in M0 are functions, as in Section 4.2 we choose the bandwidth h1 for estimating the unknown functions by minimizing the version of the AIC criterion for model M0. For simplicity of notation, denote this bandwidth as $\hat{h}_1$ again and let $\hat{h} = \hat{h}_1 n^{-0.051}$. We then fix these two bandwidths throughout the model selection procedure.

Ideally, we could compute the BICs for all possible combinations, and the chosen combination would be the one with the smallest BIC value. Unfortunately, this approach immediately becomes computationally infeasible when κ, the number of parameters that can be either functional or constant, is not very small, as there are $2^\kappa$ possible combinations. We propose the following iterative procedure to reduce the computational burden. We start with M0 as the candidate model, and at the L-th step of the iteration we examine whether one of the functional parameters in the candidate model M_L can be further reduced to a constant.

(a): Set L = 0. Based on model M0, compute local likelihood estimates of all the unknown parameter functions using bandwidth $\hat{h}_1$.

(b): If L = κ (all κ parameters have been reduced to constants in M_L), then M_L is the chosen model and the model selection is completed. Otherwise, for each of the unknown functions in the candidate model M_L, say $a_{ij}(\cdot)$, that could be


reduced to a constant, calculate

$$ S_{ij} = \sum_{k=1}^n \{\hat{a}_{ij}(U_k) - \bar{a}_{ij}\}^2, \qquad \bar{a}_{ij} = n^{-1}\sum_{k=1}^n \hat{a}_{ij}(U_k). $$

Changing the function $a_{ij}(\cdot)$ in M_L that has the smallest $S_{ij}$ to a constant parameter, we get a new model M_{L+1}.

(c): Based on the new model M_{L+1} and the bandwidths $\hat{h}_1$ and $\hat{h}$, compute the estimates of the unknown functions and constants. Compute the BIC of M_{L+1}, and compare it with that of M_L. If M_L has a smaller BIC, then M_L is the chosen model and the model selection is completed. Otherwise, M_{L+1} becomes the candidate model, so we denote the new constant parameter in (b) as $\theta_{L+1}$ and change L to L + 1, then go to (b).

The above iterative process continues until M_L has a smaller BIC than M_{L+1} for some L < κ (M_L is the chosen model) or until L = κ (the chosen model has all the considered parameters constant). Clearly, the final chosen model can be written in the form of (1.1).
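Step (b) of the procedure reduces, for each candidate curve, to a flatness score. The sketch below illustrates just that step on synthetic curve estimates; the curves and names are made up, standing in for the local likelihood fits $\hat{a}_{ij}(\cdot)$:

```python
import numpy as np

# Synthetic stand-ins for the fitted coefficient curves a_ij evaluated
# at the design points U_k.
U = np.linspace(0.0, 1.0, 100)
curve_estimates = {
    "a11": np.sin(2.0 * np.pi * U),   # clearly varying
    "a12": np.full_like(U, 0.8),      # essentially constant
    "a21": 1.0 + 0.5 * U,             # mildly varying
}

def flatness_score(values):
    # S_ij = sum_k {a_ij(U_k) - mean}^2; a small score suggests the
    # curve may be reduced to a constant parameter.
    return float(np.sum((values - values.mean()) ** 2))

scores = {name: flatness_score(v) for name, v in curve_estimates.items()}
next_constant = min(scores, key=scores.get)   # candidate for theta_{L+1}
```

Only the single flattest curve is demoted per step; whether the demotion is kept is then decided by comparing the BICs of the two models, as in step (c).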

5 Asymptotic properties

This section investigates the asymptotic distributions of the two-step estimators given in Section 3.1. Theory for profile likelihood estimation of θ can be developed under another set of regularity conditions. The derivation is straightforward given the existing literature (Severini and Wong, 1992; Murphy and van der Vaart, 2000; Fan and Huang, 2005) and hence is omitted.

For simplicity of notation, the theory presented here concerns the case where x_j = X, j = 1, · · · , ℓ. The established theory carries over straightforwardly to the general case where x_1, · · · , x_ℓ are different. Let π(u) be the density of U and ä_j(u) be the second derivative of a_j(u), j = 1, · · · , ℓ. Write

$$ z = (z_1, \ldots, z_\ell)^{T}, \quad z_j = X^{T} a_j(u), \; j = 1, \ldots, \ell, \qquad D = I_\ell \otimes (X^{T}, 0_{1\times p})^{T}, \quad D_c = I_\ell \otimes (0_{1\times p}, X^{T})^{T}. $$
Theorem 1 gives the asymptotic distribution of ˆθ and shows that ˆθ is asymptotically unbiased as an estimator of the constant parameter θ, provided that the bandwidth h is of an order smaller than that of the optimal bandwidths used in univariate smoothing.


Theorem 1. Under the regularity conditions stated in the Appendix, if h = o(n^{−1/4}) and nh/log² n → ∞, we have
$$ n^{1/2}(\hat\theta - \theta) \xrightarrow{D} N(0_{q\times 1}, \Delta) \quad \text{as } n \to \infty, $$
where
$$ \Delta = (I_q, 0_{q\times 2p\ell})\, E\{V_c(U)^{-1} V_0(U) V_c(U)^{-1}\}\, (I_q, 0_{q\times 2p\ell})^{T}, $$
$$ V_0(u) = E\{H I(\gamma) H^{T} \mid U = u\}, \qquad V_c(u) = V_0(u) + E\{\mu_2 H_c I(\gamma) H_c^{T} \mid U = u\}, $$
$$ H = \operatorname{diag}(I_q, D), \qquad H_c = \operatorname{diag}(0_{q\times q}, D_c), \qquad I(\gamma) = -E\{\dot g(Y; X, \gamma) \mid X, U\}, $$
$$ g(Y; X, \gamma) = \frac{\partial \log f(Y; X, \gamma)}{\partial \gamma}, \qquad \dot g(Y; X, \gamma) = \frac{\partial g(Y; X, \gamma)}{\partial \gamma}, \qquad \gamma = (\theta^{T}, z^{T})^{T}. $$

In general, neither the profile likelihood nor the two-step estimator of the constant parameters θ is consistently superior to the other in asymptotic performance. The profile likelihood estimator may have a smaller asymptotic variance than the two-step estimator when both are asymptotically normal; cf. Severini and Wong (1992). On the other hand, there are situations in which Theorem 1 holds but the profile likelihood estimator does not work, since it requires existence of the least favorable curves; see the example discussed in Fan and Wong (2000).

Theorem 2. Under the regularity conditions stated in the Appendix, if h_1 → 0 and nh_1/log² n → ∞, we have
$$ (nh_1)^{1/2}\{\hat a(u) - a(u) + B\} \xrightarrow{D} N(0_{p\ell\times 1}, \Sigma) \quad \text{as } n \to \infty, $$
where
$$ B = 2^{-1}\mu_2 h_1^{2}\, \big[I_\ell \otimes \{(1, 0) \otimes I_p\}\big] G_c^{-1} \Gamma, \qquad \Gamma = E\big\{ D I_1(z)\, (\ddot a_1(u), \ldots, \ddot a_\ell(u))^{T} X \,\big|\, U = u \big\}, $$
$$ \Sigma = \big[I_\ell \otimes \{(1, 0) \otimes I_p\}\big] G_c^{-1} G\, G_c^{-1} \pi(u)^{-1} \big[I_\ell \otimes \{(1, 0)^{T} \otimes I_p\}\big], $$
$$ G = E\big\{ \nu_0 D I_1(z) D^{T} + \nu_2 D_c I_1(z) D_c^{T} \,\big|\, U = u \big\}, \qquad G_c = E\big\{ D I_1(z) D^{T} + \mu_2 D_c I_1(z) D_c^{T} \,\big|\, U = u \big\}, $$
$$ I_1(z) = -E\left\{ \frac{\partial^2 \log f(Y; X, \theta, z)}{\partial z\, \partial z^{T}} \,\Big|\, X, U \right\}\Big|_{U=u}. $$


Theorem 2 shows that our estimator ˆa(·) has the adaptivity property: it has the same asymptotic distribution as the estimator of a(·) obtained by maximizing (3.4) with ˆθ replaced by the true value of θ. In addition, the optimal bandwidth h_1 is of the order n^{−1/5} and the optimal convergence rate of ˆa(·) is n^{−2/5}. We leave the proofs of these two theorems to the Appendix.

6 Simulation study and data analysis

6.1 A simulated example

We use a simulated example to demonstrate how well the proposed methods work.

Suppose that, conditionally on X = x and U = u, Y has a Weibull distribution with density function
$$ f(y; x, \theta, a(u)x) = \frac{\theta}{\{a(u)x\}^{\theta}}\, y^{\theta-1} \exp\big[-\{y/a(u)x\}^{\theta}\big], \quad y > 0, \tag{6.1} $$
where the constant θ > 0 is taken to be 1, the function a(·) is set to be a(u) = 2 + sin(2πu), U ∼ Uniform(0, 1), X ∼ Uniform(1, 2), and X and U are independent.

We simulated 300 samples of size 1000 from this model and applied our estimation and model selection procedures to the samples.
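For concreteness, drawing one such sample of size n from model (6.1) can be sketched as below. This is our own NumPy sketch (the function name is an assumption, not the paper's code); it uses the fact that a Weibull variate with shape θ and scale s is s times a standard Weibull(θ) draw.

```python
import numpy as np

def simulate_weibull_model(n, theta=1.0, rng=None):
    """One sample of size n from model (6.1):
    U ~ Uniform(0,1), X ~ Uniform(1,2) independent,
    Y | X, U ~ Weibull(shape=theta, scale=a(U)*X) with a(u) = 2 + sin(2*pi*u)."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(0.0, 1.0, n)
    x = rng.uniform(1.0, 2.0, n)
    scale = (2.0 + np.sin(2.0 * np.pi * u)) * x
    y = scale * rng.weibull(theta, n)   # scale times a standard Weibull(theta)
    return y, x, u
```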

In the two-step estimation procedure, the bandwidths ĥ and h̃_1 in Section 4.2 were employed, with model (3.6) specifying the conditional density
$$ f(y; x, a_0(u), a_1(u)x) = \frac{a_0(u)}{\{a_1(u)x\}^{a_0(u)}}\, y^{a_0(u)-1} \exp\big[-\{y/a_1(u)x\}^{a_0(u)}\big], \quad y > 0. \tag{6.2} $$
The kernel function K was taken to be the Epanechnikov kernel. We use the mean integrated absolute error (MIAE) to assess the accuracy of an estimator. The MIAE of an estimator of an unknown constant is defined as its mean absolute error. The MIAE of an estimator ˆa(·) of an unknown function a(·) is defined as

$$ \mathrm{MIAE} = E(\mathrm{IAE}), \qquad \text{where } \mathrm{IAE} = \int \big| \hat a(u) - a(u) \big|\, du. $$
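In practice the IAE integral is approximated numerically; a minimal sketch (trapezoidal rule on a grid; function and argument names are ours) is:

```python
import numpy as np

def integrated_absolute_error(a_hat, a_true, grid):
    """IAE of a curve estimate: the integral of |a_hat - a_true| over the
    grid, approximated by the trapezoidal rule."""
    diff = np.abs(a_hat(grid) - a_true(grid))
    return float(np.sum((diff[1:] + diff[:-1]) * np.diff(grid)) / 2.0)
```

The MIAE is then the Monte Carlo average of the IAE over the simulation replications.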

The MIAE for θ and a(·) are respectively 0.0154 and 0.1067. The bias and standard deviation for θ are respectively −0.0016 and 0.0194, agreeing with the theory that ˆθ is asymptotically unbiased; see Theorem 1. Panel (a) of Fig. 1 plots the pointwise 10%, 50% and 90% quantiles of the 300 curve estimates of a(·). Both estimators of θ and a(·) are quite accurate. Also, the constant parameter θ is estimated with a higher level of accuracy than the functional parameter a(·). This coincides with our theory that ˆθ has a faster rate of convergence than ˆa(·) does. Panel (b) of Fig. 1 plots the estimates of a(·) based on the sample with median IAE performance when θ is treated as unknown (dotted line) and known (dashed line). They are close to each other, indicating that our estimator of a(·) has the adaptivity property.
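The quantile curves in Fig. 1(a) are pointwise quantiles across the 300 replications; computing them is a one-liner (our own sketch, names assumed):

```python
import numpy as np

def pointwise_quantile_bands(curve_matrix, probs=(0.1, 0.5, 0.9)):
    """Pointwise quantiles across simulation replications: `curve_matrix`
    has one row per estimated curve, one column per grid point."""
    return np.quantile(curve_matrix, probs, axis=0)
```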


Figure 1: Estimates of the functional parameter in the Weibull example. Panel (a) plots the pointwise 10%, 50% and 90% quantiles (long-dash) of the 300 estimates of a(·) (solid). Panel (b) plots the estimates of a(·) (solid) based on the sample with median IAE performance when the constant parameter θ is treated as unknown (dotted) or known (dashed).

The profile likelihood method in Section 3.2 was applied to the same 300 samples.

The MIAEs for θ and a(·) are respectively 0.0168 and 0.1077. The bias and standard deviation for θ are respectively 0.0011 and 0.0206. In this example, θ is the shape parameter and a(·) determines the scale parameter in the conditional Weibull distribution. The two-step method performs slightly better than the profile likelihood method in estimating the constant (shape) parameter, while the two perform equally well in estimating the functional (scale) parameter.
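To illustrate the flavour of estimating the constant shape parameter with the scale curve plugged in, one can maximize the Weibull log-likelihood over θ. The sketch below is our own toy version (a grid search, not the paper's likelihood-equation solver; names are assumptions):

```python
import numpy as np

def fit_shape_given_scale(y, scale, grid=None):
    """Estimate the constant Weibull shape theta in model (6.1), with the
    scales a(U)*X treated as known, by maximizing the log-likelihood over
    a grid of candidate values."""
    if grid is None:
        grid = np.linspace(0.2, 5.0, 961)
    # log-likelihood of model (6.1) at each candidate shape t
    ll = [np.sum(np.log(t) + (t - 1.0) * np.log(y)
                 - t * np.log(scale) - (y / scale) ** t) for t in grid]
    return float(grid[int(np.argmax(ll))])
```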

Suppose it is not known which of the two parameters are constant and which are functional. For each of the 300 samples simulated from (6.1), the model selection procedure in Section 4.3, with the starting model M_0 given by model (6.2), was applied to select the constant parameters. The correct model was selected 296 times, and for the other 4 samples the model with both θ and a as functions of u was selected. Thus, our model selection criterion has a high success rate of 98%. Furthermore, it never makes the mistake of specifying the functional parameter a(·) as constant, which would result in a large bias and inconsistency in post model selection inference. Comparatively, misspecifying the constant parameter θ as functional, which our model selection rule does 2% of the time, is a minor problem, as the nonparametric estimates under the wrong model would still look flat. Thus, it is a good sign that when our model selection method misspecifies the model, the fitted model still resembles the true model. As evidence, the MIAE of estimating θ after model selection is 0.0167 and that of estimating a(·) is 0.1068, both only slightly larger than the corresponding MIAEs with the correct model specified.

To further examine performance of the model selection procedure, for each of the 300 samples, we use the selected model and the corresponding parameter estimates to predict the true values of the parameters θ and a(U_0) associated with a future observation (Y_0, X_0, U_0). The mean absolute prediction errors for θ and a(U_0) are 0.0170 and 0.1130, respectively. In Fig. 2, panel (a) is a boxplot of the 300 predicted values of θ ≡ 1, and panel (b) depicts the point cloud of the predicted value against the true value of a(U_0) for the 300 samples. We can see that both predictions are quite satisfactory even though the model selection procedure misspecifies the model 2% of the time. Thus, we can conclude from this example that our proposed estimation and model selection procedures work together as a powerful tool for multiparameter likelihood modeling even when there is little knowledge about whether or not some of the parameters are constant.

In model (6.1), the Weibull conditional distribution has d = 2 parameters, of which ℓ = 1 follows a nonparametric form and the other d − ℓ = 1 follows a parametric form. A more complex model is to change the conditional distribution to Weibull(θ + a_1(u)x, a(u)x), where x, u, θ and a(·) are the same as in (6.1), and a_1(u) = 0.1 + 0.1 cos(2πu). In this case, ℓ = d = 2, with the shape parameter modeled semiparametrically and the scale parameter modeled nonparametrically.
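Simulating from this more complex design only changes the shape argument; a sketch under the same assumptions as before (names are ours; note that the shape now varies per observation):

```python
import numpy as np

def simulate_extended_model(n, theta=1.0, rng=None):
    """One sample of size n from the extended design:
    shape = theta + a1(U)*X with a1(u) = 0.1 + 0.1*cos(2*pi*u),
    scale = a(U)*X with a(u) = 2 + sin(2*pi*u)."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(0.0, 1.0, n)
    x = rng.uniform(1.0, 2.0, n)
    shape = theta + (0.1 + 0.1 * np.cos(2.0 * np.pi * u)) * x
    scale = (2.0 + np.sin(2.0 * np.pi * u)) * x
    y = scale * rng.weibull(shape)   # per-observation shape via array argument
    return y, x, u
```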
