### Statistical Estimation in Generalized Multiparameter Likelihood Models

Ming-Yen Cheng, Wenyang Zhang and Lu-Hung Chen

Abstract

Multiparameter likelihood models (MLM) with multiple covariates have a wide range of applications. However, they encounter the “curse of dimensionality” problem when the dimension of the covariates is large. We develop a generalized multiparameter likelihood model that copes with multiple covariates and adapts well to dynamic structural changes. It includes some popular models, such as the partially linear and varying coefficient models, as special cases. When the model is fixed, a simple and effective two-step method is developed to estimate both the parametric and the nonparametric components. The proposed estimator of the parametric component has the n^{−1/2} convergence rate, and the estimator of the nonparametric component enjoys an adaptivity property. A data-driven procedure is suggested to select the bandwidths involved. Also proposed is a new initial estimator in profile likelihood estimation of the parametric part to ensure stability of the approach in general settings.

We further develop an automatic procedure to identify constant parameters in the underlying model. A simulation study and an application to the infant mortality data of China are given to demonstrate performance of the proposed methods.

KEY WORDS: bandwidth selection, model selection, multiple covariate, partially linear models, profile likelihood, varying coefficient models, semiparametric models.

SHORT TITLE: Generalized multiparameter likelihood models.

Ming-Yen Cheng is Professor, Department of Mathematics, National Taiwan University, Taipei 106, Taiwan (Email: cheng@math.ntu.edu.tw) and Professor of Statistics, Department of Statistical Science, University College London, Gower Street, London, WC1E 6BT, UK (Email: ming-yen.cheng@ucl.ac.uk). Wenyang Zhang is Reader in Statistics, Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK (Email: W.Zhang@bath.ac.uk). Lu-Hung Chen is Ph.D. Candidate, Department of Mathematics, National Taiwan University, Taipei 106, Taiwan (Email: d95221006@ntu.edu.tw). Cheng’s research was supported in part by the National Science Council (grant NSC95-2118-M-002-003-MY2). This project began while Wenyang Zhang visited National Taiwan University, with funding supported by the Mathematics Division of the National Center of Theoretical Sciences (Taipei). We thank the associate editor and two referees for constructive comments which led to substantial improvement of this paper.

### 1 Introduction

Consider statistical modeling of the relationship between a response variable and some covariates. Maximum likelihood estimation is most powerful when the joint distribution of the response variable and covariates is specified by a parametric form.

However, parametric approaches are at risk of model misspecification, which could result in seriously biased estimation, misinterpretation of data, and other difficulties.

Nonparametric modeling is more flexible and allows the data to reveal the unknown truth. However, one always comes up against the “curse of dimensionality” problem, i.e. model instability when the dimension of the covariates is large. Many hybrids of parametric and nonparametric models, generally called semiparametric models, have been proposed to achieve a good balance between flexibility and stability in model specification. Examples are varying coefficients models (Hastie and Tibshirani, 1993;

Xia and Li, 1999; Fan and Huang, 2005; Cai, Fan, Jiang and Zhou, 2008; Li and Liang, 2008), partially linear models (Engle, Granger, Rice and Weiss, 1986; Härdle, Liang and Gao, 2000; Cai, Fan, Jiang and Zhou, 2007) and additive models (Hastie and Tibshirani, 1990; Fan, Härdle and Mammen, 1998; Fan and Jiang, 2005; Jiang and Zhou, 2007).

In this paper we suggest a semiparametric model for a population (X, U, Y ) in which U is a continuous variable and the conditional density function of Y given (X, U ) is specified by

$$f\bigl\{Y;\, X,\, \theta,\, x_1^{T}a_1(U), \ldots, x_\ell^{T}a_\ell(U)\bigr\}, \qquad (1.1)$$

where f is a known parametric density function, θ = (θ_1, …, θ_q)^T is an unknown constant vector, X = (X_1, …, X_p)^T with X_1 ≡ 1, x_j is a p_j-dimensional subvector of X, and a_j(·) = (a_{j1}(·), …, a_{jp_j}(·))^T is an unknown function, j = 1, …, ℓ. Here, 1 ≤ ℓ ≤ d, where d is defined in (1.2). Model (1.1) is a hybrid of the standard MLM,
which assumes the conditional density function of Y given X follows the form

$$f\bigl\{Y;\, a_1(X), \ldots, a_d(X)\bigr\}, \qquad (1.2)$$

where f has d identifiable parameters and a_1(X), …, a_d(X) are unknown functions, i.e. Y depends on X through the d identifiable parameters in f being modeled as nonparametric functions of X. Aerts and Claeskens (1997) studied a locally linear maximum likelihood estimator of MLM when X is univariate, and Cheng and Peng (2007) proposed a variance reduction technique to improve the estimation. MLM provides a general framework for specifying the statistical relationship between response and covariates under a wide range of data configurations, including continuous, categorical, binary and count variables as the response, and cases where the response is univariate or vector valued. In addition, it can be easily adapted to cope with various statistical problems such as mean regression, variance estimation, quantile regression, hazard regression, logistic regression and longitudinal data analysis. See Aerts and Claeskens (1997), Loader (1999), Claeskens and Aerts (2000) and Cheng and Peng (2007), among others, for further details.
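As a small concrete illustration of locally linear maximum likelihood estimation in a one-parameter MLM (a Poisson instance of (1.2), in the spirit of Aerts and Claeskens, 1997), here is a minimal sketch. The simulated rate function, Gaussian kernel weights, bandwidth and Newton solver are all illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = rng.uniform(size=n)
Y = rng.poisson(np.exp(np.sin(2 * np.pi * X)))   # true log-rate: a(X) = sin(2*pi*X)

def local_loglam(x, h=0.1, steps=30):
    """Locally linear MLE of a(x) = log lambda(x) for the one-parameter
    Poisson MLM, via Newton iterations on the kernel-weighted log-likelihood."""
    w = np.exp(-0.5 * ((X - x) / h) ** 2)        # Gaussian kernel weights
    d = X - x
    beta = np.array([np.log(Y.mean() + 1e-8), 0.0])   # start from the global fit
    for _ in range(steps):
        mu = np.exp(beta[0] + beta[1] * d)
        g = np.array([np.sum(w * (Y - mu)), np.sum(w * (Y - mu) * d)])
        H = -np.array([[np.sum(w * mu), np.sum(w * mu * d)],
                       [np.sum(w * mu * d), np.sum(w * mu * d * d)]])
        beta = beta - np.linalg.solve(H, g)      # Newton step (concave problem)
    return beta[0]
```

Since the weighted Poisson log-likelihood is strictly concave in the local coefficients, the Newton iteration converges quickly; `local_loglam(0.5)` should be close to sin(π) = 0.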

With the availability of U, model (1.1) specifies ℓ of the d parameter functions in model (1.2) by some nonparametric or semiparametric form, and if d − ℓ > 0 the other d − ℓ parameter functions in (1.2) are now modeled parametrically in (1.1), with
θ consisting of all the constant parameters. Like MLM (1.2), (1.1) offers a unified
approach to modeling a wide range of data settings and to dealing with various inference problems. Nonetheless, (1.1) avoids the curse of dimensionality problem that (1.2) has when the dimension of X is large, and it allows parametric, nonparametric or
semiparametric modeling of the parameter functions in (1.2). Further, (1.1) broadens
the application of MLM since it can cope with categorical covariates, which often arise
in practice. Model (1.1) is a very general semiparametric model provided there exists
a continuous variable U and other covariates X. First, it reduces to a partially linear
model when ℓ = 1, x_1 = X_1 ≡ 1, and θ interacts with X through a linear function.

When ℓ = 1, q = 0 and x_1 = X, model (1.1) becomes the varying coefficients model
of Hastie and Tibshirani (1993) with the same modifying variable U . Thus, (1.1)
inherits the model stability, flexibility and interpretability that varying coefficients
models enjoy. In addition, it is closely related the Regression Model II in Section 4.3
of Bickel, Klaassen, Ritov and Wellner (1993).

We propose a simple, effective and fast two-step procedure to estimate both the constant and functional parameters in (1.1). Its implementation does not involve any iteration which is usually required by conventional approaches, e.g. profile likelihood

and backfitting. Further, we develop an AIC data-driven procedure to select the bandwidths required in the two-step estimation. Using an AIC criterion, or modified versions, to select smoothing parameters in nonparametric regression and local likelihood modeling has been extensively discussed and implemented, cf. Hurvich, Simonoff and Tsai (1998), Loader (1999) and Schucany (2004), among others. For local likelihood estimation, Aerts and Claeskens (1997) considered cross-validation and plug-in bandwidths, and Farmen and Gijbels (1998) suggested a bandwidth selector based on an approximation to the integrated mean squared error. A profile likelihood approach can be applied to estimate the constant parameters. A new initial estimator is proposed to ensure stability of the profile likelihood approach no matter what types of features, e.g. location, scale or shape, in the model the constant parameters represent. In general, neither profile likelihood nor the two-step estimator of the constant parameters is consistently superior to the other; see the asymptotic results and discussion in Section 5 and the simulation results reported in Section 6. Nevertheless, the major strength of the two-step estimation is its simple and fast implementation and numerical stability, as no iteration is required.

In practice, the real challenge is that we are often given a collection of significant
covariates but do not know which of the parameter functions are constant and which
are functional in (1.1), i.e. we are not sure about the specification of θ and x_1, …, x_ℓ.
In an attempt to solve this fundamental identification problem, we suggest a stepwise
procedure based on a version of the BIC criterion adapted to our model. Identification of constant parameters and bandwidth selection interact with each other.

We propose to select the bandwidths first and then keep them fixed throughout the procedure for identifying the constant parameters. This approach indeed resolves a complex problem in an effective, fast and stable fashion, and is confirmed to have these properties by a simulation study and a real data analysis. We are not aware of any existing methods for identifying constant parameters or covariates in the parametric component of a semiparametric model, although there exists an abundant literature on a different issue of variable selection for parametric models, nonparametric models, and parametric or nonparametric components in semiparametric models. For example, Irizarry (2001) derived weighted versions of AIC and BIC and posterior

probability model selection criteria for one-parameter local likelihood models. Fan and Li (2002) employed profile likelihood techniques in their nonconcave penalized likelihood approach to selecting variables in the parametric part of Cox’s proportional hazards model. Fan and Li (2004) incorporated profiling ideas in their construction of penalized least squares for variable selection in the parametric component of a semiparametric model for longitudinal data analysis. Bunea (2004) constructed a penalized least squares criterion for variable selection in the linear part of a partially linear model. For a generalized varying-coefficient partially linear model, Li and Liang (2008) used a nonconcave penalized likelihood to select significant variables in the parametric component and a generalized likelihood ratio test to select significant variables in the nonparametric component, assuming the two sets of covariates in the parametric and nonparametric components are separated in advance.

Section 2 provides some motivating examples for model (1.1) and discusses the identifiability issue. Our two-step estimation procedure for both the constant and functional parameters and a new initial estimator for profile likelihood estimation of the constant parameters are given in Section 3. Bandwidth selection and identification of the constant parameters are addressed in Section 4. Asymptotic properties of the two-step estimators are investigated in Section 5. Section 6 presents a simulated example and an analysis of a motivating example about infant mortality. Proofs of the theoretical results are deferred to the Appendix.

### 2 Motivating examples and model identifiability

In applications, some of the unknown functional parameters in MLM (1.2) may be simply unknown constants. Under such circumstances, we would pay a price in efficiency if the unknown constants were still treated as unknown functions. An example is the analysis of 103 annual maximum temperatures (Cheng and Peng, 2007) in which Y |X is modeled by an extreme value distribution, where X is year. The estimates of the shape and scale parameter curves are flat except in the boundary regions, which is reasonable as the two parameters are unlikely to change much within one hundred years. To accommodate such situations, (1.2) needs to be restricted to the following

semiparametric model

$$f\bigl\{Y;\, X,\, \theta,\, a_1(X), \ldots, a_\ell(X)\bigr\}, \qquad (2.1)$$

where 1 ≤ ℓ < d and θ is a q-dimensional unknown parameter. Here, ℓ out of the d parameter functions in (1.2) remain as unknown functions of X, and the other d − ℓ parameter functions are formulated by certain parametric forms, e.g. unknown constants, with θ consisting of all the constant parameters. The model studied by Severini and Wong (1992) is a special case of (2.1) with d = 2, ℓ = 1, q = 1, and the covariate being univariate. Profile likelihood estimation of θ, along with consistent estimators of a least favorable curve, was studied therein.

When the dimension of X is large, neither (1.2) nor (2.1) would work, due to the curse of dimensionality problem. Claeskens and Aerts (2000) suggested alleviating this problem by restricting a_1(·), …, a_d(·) in (1.2) to additive models and estimating them by a backfitting algorithm. Alternatively, the following restriction of (2.1)

$$f\bigl\{Y;\, X,\, \theta,\, X^{T}\beta_1, \ldots, X^{T}\beta_\ell\bigr\}, \qquad (2.2)$$

where β_1, …, β_ℓ are unknown constant vectors, would cope with the curse of dimensionality problem. However, (2.2) actually implies that the impact of X on Y is constant, which is somewhat implausible in practice. For example, in the analysis of infant mortality in China detailed later, the impact of type of region of residence on mortality would not be constant over time U, as China has changed very much since 1950 and the difference between rural and urban regions has been changing.

The impact must vary with U and the pattern of the change is of interest. To fit the data more accurately and to catch the dynamic pattern of the changes in the impact, we extend (2.2) to

$$f\bigl\{Y;\, X,\, \theta,\, X^{T}a_1(U), \ldots, X^{T}a_\ell(U)\bigr\}, \qquad (2.3)$$

where θ is an unknown constant vector, and a_j(·) = (a_{j1}(·), …, a_{jp}(·))^T, j = 1, …, ℓ.

In (2.3), a_1(·), …, a_ℓ(·) have to share the same dimension p, and all the a_{ij}(·)’s are assumed to be functional. This model assumption may be unnecessary in some situations. The analysis of infant mortality in China is an example: the impact of ethnic group or type of feeding on infant mortality can be formulated as an unknown

constant parameter. To remove this unnecessary restriction and to make the model more versatile, we generalize (2.3) to (1.1), with all the a_{ij}(·)’s in (2.3) that are constant absorbed into θ in (1.1).

When a_1(·), …, a_ℓ(·) are all constant, model (2.3) reduces to model (2.2) and I(γ) defined in Theorem 1 becomes Ĩ, where Ĩ is I(γ) with a_j(U) replaced by β_j. Condition (7) in the Appendix ensures that the smallest eigenvalue of Ĩ is greater than the positive number λ_0 in Condition (7). If (X_i, Y_i), i = 1, …, n, is a sample from model (2.2) then, under Condition (7), the Fisher information matrix is

$$\sum_{i=1}^{n} \mathrm{diag}(I_q,\, I_\ell \otimes X_i)\,\tilde{I}_i\,\mathrm{diag}(I_q,\, I_\ell \otimes X_i^{T}) > \lambda_0 \sum_{i=1}^{n} \mathrm{diag}(I_q,\, I_\ell \otimes X_i)\,\mathrm{diag}(I_q,\, I_\ell \otimes X_i^{T}) \approx n\lambda_0\, \mathrm{diag}\bigl(I_q,\, I_\ell \otimes E(XX^{T})\bigr) > 0,$$

where Ĩ_i is Ĩ with X replaced by X_i. Here, I_k denotes the k × k identity matrix and, for any matrices A and B, diag(A, B) denotes the block-diagonal matrix

$$\mathrm{diag}(A, B) = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.$$

Hence, Condition (7) ensures the Fisher information matrix of the parametric model (2.2) is positive definite, i.e. model (2.2) is identifiable. Further, for any given value of U, the local version of model (2.3) is model (2.2). Therefore, under Condition (7), model (2.3) is identifiable for any given value of U, and so model (2.3) is identifiable. Also, model (1.1) specifies some of the a_{ij}(·)’s in model (2.3) as constant and therefore is identifiable. Based on the above arguments, we have the following lemma.

Lemma 1. Under Condition (7) in the Appendix, both models (1.1) and (2.3) are identifiable.
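The eigenvalue argument behind Lemma 1 can be checked numerically. The sketch below is an illustration, not part of the paper: it builds random per-observation information matrices Ĩ_i whose smallest eigenvalue exceeds λ_0 (mimicking Condition (7)) and verifies both the matrix inequality and positive definiteness of the summed Fisher information. The dimensions and distributions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
q, ell, p, n = 1, 2, 3, 200
lam0 = 0.5

def block_diag(Xi):
    """diag(I_q, I_ell (x) X_i) as a (q + ell*p) x (q + ell) matrix."""
    top = np.hstack([np.eye(q), np.zeros((q, ell))])
    bot = np.hstack([np.zeros((ell * p, q)), np.kron(np.eye(ell), Xi[:, None])])
    return np.vstack([top, bot])

fisher = np.zeros((q + ell * p, q + ell * p))
bound = np.zeros_like(fisher)
for _ in range(n):
    Xi = np.concatenate([[1.0], rng.normal(size=p - 1)])   # X_1 == 1
    A = rng.normal(size=(q + ell, q + ell))
    I_tilde = lam0 * np.eye(q + ell) + A @ A.T   # eigenvalues > lam0
    B = block_diag(Xi)
    fisher += B @ I_tilde @ B.T
    bound += B @ B.T

# fisher - lam0 * bound = sum_i B_i (I_tilde_i - lam0 I) B_i^T is a sum of
# positive semidefinite terms, and bound is positive definite generically.
diff_min = np.linalg.eigvalsh(fisher - lam0 * bound).min()
fisher_min = np.linalg.eigvalsh(fisher).min()
```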

### 3 Estimation procedures

Suppose we have a sample (X_i, U_i, Y_i), i = 1, …, n, from (X, U, Y), which obeys model (1.1). Let x_{i,j} be the p_j-dimensional subvector of X_i that corresponds to x_j, j = 1, …, ℓ, i = 1, …, n. We introduce our two-step and profile likelihood procedures for estimating both the constant and functional parameters in Sections 3.1 and 3.2.

### 3.1 Two-step estimation

Our two-step approach produces an estimator for the constant vector θ first, and then this estimator is plugged into the local likelihood function to estimate the functions a_j(·), j = 1, …, ℓ.

The estimation procedure for θ consists of two stages. First, we treat θ as an unknown function of U and appeal to the local likelihood approach to obtain a preliminary estimator θ̃(U_i) of θ(U_i) for each U_i, i = 1, …, n. Then, we average the θ̃(U_i) over i = 1, …, n to get the final estimator for θ. The procedure is detailed as follows.

The conditional log-likelihood function is

$$L_0(\theta, a) = \sum_{i=1}^{n} \log f\bigl\{Y_i;\, X_i,\, \theta,\, x_{i,1}^{T}a_1(U_i), \ldots, x_{i,\ell}^{T}a_\ell(U_i)\bigr\}, \qquad (3.1)$$

where a(·) = (a_1(·)^T, …, a_ℓ(·)^T)^T. Given u, by Taylor expansion we have, for each j,

$$a_j(U_i) \approx a_j(u) + \dot{a}_j(u)(U_i - u)$$

when U_i is in a neighborhood of u, where ȧ_j(u) = da_j(u)/du. This leads to the following local log-likelihood function

$$\sum_{i=1}^{n} K_h(U_i - u)\, \log f\bigl\{Y_i;\, X_i,\, \theta,\, x_{i,1}^{T}\{a_1 + b_1(U_i - u)\}, \ldots, x_{i,\ell}^{T}\{a_\ell + b_\ell(U_i - u)\}\bigr\}, \qquad (3.2)$$

where K_h(·) = K(·/h)/h, K(·) is a kernel function, and h > 0 is a bandwidth. Maximizing (3.2) with respect to (θ^T, a_1^T, b_1^T, …, a_ℓ^T, b_ℓ^T)^T we get the maximizer (θ̃(u)^T, ã_1(u)^T, b̃_1(u)^T, …, ã_ℓ(u)^T, b̃_ℓ(u)^T)^T. In the above local likelihood estimation, θ is fitted by a local constant vector, since θ is constant under model (1.1) and fitting it by a local constant vector stabilizes the procedure. For i = 1, …, n, let u = U_i and we get an initial estimator θ̃(U_i) of θ. The final estimator of θ is taken to be

$$\hat{\theta} = n^{-1} \sum_{i=1}^{n} \tilde{\theta}(U_i). \qquad (3.3)$$

For θ̂ to achieve the n^{−1/2} convergence rate, we have to choose a relatively small bandwidth h so that the biases of θ̃(·) and ã_j(·), j = 1, …, ℓ, are dominated by n^{−1/2}. This ensures that estimating the constant and the functional parts simultaneously in the first step does not create extra bias for θ. Then, averaging over θ̃(U_i), i = 1, …, n, as in (3.3) brings the variance from the order (nh)^{−1} in nonparametric estimation back to the order n^{−1} in parametric estimation. We will show later that θ̂ is root-n consistent when h is properly chosen. As in any other maximum local likelihood estimation procedure, the bandwidth h cannot be chosen too small; otherwise one runs into problems with singularity of the design matrix. From the asymptotic point of view, Condition (5) prohibits the bandwidth h from being too small. Thus, Conditions (5) and (7) guarantee that the estimators θ̃(U_1), …, θ̃(U_n) exist. Further, the method of Cheng, Wu and Yang (2008) can be employed to modify the local likelihood function (3.2) to overcome the singularity problem caused by a small h or sparsity in the design points U_i. This approach can also be applied to (3.4) when estimating the function a(u).

With θ̂, we can estimate a(u) using the maximum local likelihood approach again. Note that the estimator ã(u) = (ã_1(u)^T, …, ã_ℓ(u)^T)^T obtained before is too noisy and is not appropriate for this purpose, since the bandwidth h is intentionally chosen small in order to get a good estimator of θ. Hence, we use another, larger bandwidth to estimate a(u). We replace θ in (3.2) by θ̂ to get a local log-likelihood function for a(u):

$$\sum_{i=1}^{n} K_{h_1}(U_i - u)\, \log f\bigl\{Y_i;\, X_i,\, \hat{\theta},\, x_{i,1}^{T}\{a_1 + b_1(U_i - u)\}, \ldots, x_{i,\ell}^{T}\{a_\ell + b_\ell(U_i - u)\}\bigr\}, \qquad (3.4)$$

where h_1 > 0 is a bandwidth different from h. We could use a kernel other than K at this step, but that does not matter much. Maximizing (3.4) with respect to (a_1^T, b_1^T, …, a_ℓ^T, b_ℓ^T)^T we get the maximizer (â_1(u)^T, b̂_1(u)^T, …, â_ℓ(u)^T, b̂_ℓ(u)^T)^T. Our estimator of a(u) is taken to be â(u) = (â_1(u)^T, …, â_ℓ(u)^T)^T. As the convergence rate of θ̂ is n^{−1/2} (see Section 5), â(u) works as well as when θ is known in the local log-likelihood (3.4). That is, â(u) has the adaptivity property.
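To make the two steps concrete, here is a minimal sketch (not from the paper) for the Gaussian partially linear special case of (1.1), Y = θZ + a(U) + ε, in which local likelihood estimation reduces to kernel-weighted least squares. The simulated model, the bandwidth constants, and the Gaussian kernel (chosen instead of a compactly supported kernel to avoid singular local designs) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
U = rng.uniform(size=n)
Z = rng.normal(size=n)                                 # covariate with constant effect
theta_true = 2.0
Y = theta_true * Z + np.sin(2 * np.pi * U) + 0.2 * rng.normal(size=n)

def local_fit(u, h, theta=None):
    """Kernel-weighted least squares at u. With theta=None this is step 1:
    local constant theta plus local linear a (columns Z, 1, U-u). With theta
    fixed it is step 2: local linear a only, applied to Y - theta*Z."""
    w = np.sqrt(np.exp(-0.5 * ((U - u) / h) ** 2))     # sqrt of Gaussian weights
    if theta is None:
        G = np.column_stack([Z, np.ones(n), U - u])
        r = Y
    else:
        G = np.column_stack([np.ones(n), U - u])
        r = Y - theta * Z
    coef, *_ = np.linalg.lstsq(w[:, None] * G, w * r, rcond=None)
    return coef

# Step 1: undersmoothed bandwidth h, then average the local estimates of theta
h = n ** -0.3
theta_hat = np.mean([local_fit(u, h)[0] for u in U])

# Step 2: larger bandwidth h1 (fixed by eye here; Section 4.2 selects it by AIC)
h1 = 0.08
def a_hat(u):
    return local_fit(u, h1, theta=theta_hat)[0]
```

With this setup, `theta_hat` should be close to 2.0 and `a_hat(u)` close to sin(2πu), illustrating the adaptivity of the second step.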

In some cases, local likelihood estimation of the varying coefficients a_j(·), j = 1, …, ℓ, may require different amounts of smoothing; see, for example, Claeskens and Aerts (2000). Backfitting ideas can be implemented to achieve this goal: (i) use â(·) as the initial estimate; (ii) for each j, substitute all the local linear coefficient functions except the jth, and h_1 in (3.4), respectively by the previous estimates and the bandwidth for smoothing the jth functional parameter, and then maximize the resulting local likelihood to find an estimate of a_j(·); (iii) iterate step (ii) until convergence. Convergence is usually attained quickly.

### 3.2 Profile likelihood estimation

A profile likelihood estimator for θ maximizes, with respect to θ, the profiled log-likelihood

$$\sum_{i=1}^{n} \log f\bigl\{Y_i;\, X_i,\, \theta,\, x_{i,1}^{T}\tilde{a}_{1,\theta}(U_i), \ldots, x_{i,\ell}^{T}\tilde{a}_{\ell,\theta}(U_i)\bigr\},$$

where, for any given θ, ã_θ(·) = (ã_{1,θ}(·)^T, …, ã_{ℓ,θ}(·)^T)^T is an estimator for a(·). In practice, one needs to solve

$$\frac{\partial L_0}{\partial \theta}(\theta, \tilde{a}_\theta) + \frac{\partial L_0}{\partial a}(\theta, \tilde{a}_\theta)\, \frac{\partial \tilde{a}_\theta}{\partial \theta} = 0 \qquad (3.5)$$

by iteration, where L_0 is given in (3.1). When the specified semiparametric model is general, like (1.1), in which θ may involve shape or scale parameters in f, stability of the iteration relies heavily on a proper choice of the initial estimate. Under semiparametric models for the regression mean, Fan and Huang (2005) and Lam and Fan (2008) employed difference-based methods to obtain a reliable initial estimate. However, difference-based methods may not work for model (1.1), since some of the elements in θ can be other than mean parameters. We propose a new initial estimate in the following.

First, we derive some rough estimates of a_j(U_i), i = 1, …, n, j = 1, …, ℓ. Consider a model obtained by replacing θ in (1.1) with a_0(U), a q-dimensional unknown function of U; it specifies the conditional density of Y given X and U as

$$f\bigl\{Y;\, X,\, a_0(U),\, x_1^{T}a_1(U), \ldots, x_\ell^{T}a_\ell(U)\bigr\}. \qquad (3.6)$$

For any given u, let (ā_0(u)^T, b̄_0(u)^T, ā_1(u)^T, b̄_1(u)^T, …, ā_ℓ(u)^T, b̄_ℓ(u)^T)^T be the maximizer, with respect to (a_0^T, b_0^T, a_1^T, b_1^T, …, a_ℓ^T, b_ℓ^T)^T, of the local log-likelihood function

$$\sum_{i=1}^{n} K_{h_1}(U_i - u)\, \log f\bigl\{Y_i;\, X_i,\, a_0 + b_0(U_i - u),\, x_{i,1}^{T}\{a_1 + b_1(U_i - u)\}, \ldots, x_{i,\ell}^{T}\{a_\ell + b_\ell(U_i - u)\}\bigr\}.$$

Here, h_1 can be taken as the bandwidth ĥ_1 in Section 4.2, since that bandwidth is selected for local likelihood estimation under model (3.6). Letting u = U_i in the above procedure, we obtain ā_j(U_i), j = 1, …, ℓ, i = 1, …, n. Then, our initial estimate θ̄ is the maximizer of

$$\sum_{i=1}^{n} \log f\bigl\{Y_i;\, X_i,\, \theta,\, x_{i,1}^{T}\bar{a}_1(U_i), \ldots, x_{i,\ell}^{T}\bar{a}_\ell(U_i)\bigr\}.$$

During the iteration in solving (3.5), ã_θ(·) is taken as the estimator that solves (3.4) with θ̂ and h_1 replaced by θ and ĥ_1, respectively. With this choice of bandwidth the least favorable curve is approximated well, by the nature of model (3.6). Upon convergence of the iteration, we obtain the profile likelihood estimator for θ. Then we can estimate a(·) and select the bandwidth in the same manner as described in Sections 3.1 and 4.2, with θ̂ replaced by the profile likelihood estimator for θ.

### 4 Bandwidth selection and identifying constant parameters

In reality, we do not know which of the parameters are constant and which are functional in model (1.1). This is essentially a model selection problem. There are many model selection criteria under parametric assumptions, e.g. cross-validation (Stone, 1974), AIC (Akaike, 1970), BIC (Schwarz, 1978), and nonconcave penalized likelihood (Fan and Li, 2001). Among these various criteria, AIC and BIC are probably the most commonly used in practice, due to their easy implementation. We appeal to the AIC and BIC ideas to select the bandwidths h_1 and h in the estimation procedures and to identify the constant parameters in model (1.1).

### 4.1 Model selection criteria

The standard AIC and BIC formulae contain the number of unknown parameters in the model, and so do the versions of AIC and BIC for model (1.1). To work out that number for (1.1), we have to find out how many unknown parameters each unknown function a_{ij}(·) amounts to. In nonparametric modeling, when a locally linear approximation is used, Fan and Gijbels (1996) suggested that an unknown function amounts to

$$\mathrm{tr}\bigl\{(G_0^{T}W_0G_0)^{-1}G_0^{T}W_0^{2}G_0\bigr\}$$

unknown parameters, where

$$G_0 = \begin{pmatrix} 1 & U_1 - u \\ \vdots & \vdots \\ 1 & U_n - u \end{pmatrix}, \qquad W_0 = \mathrm{diag}\bigl\{K_{h_1}(U_1 - u), \ldots, K_{h_1}(U_n - u)\bigr\}.$$

To simplify the calculation, we look into its asymptotic version: when n is large enough,

$$\mathrm{tr}\bigl\{(G_0^{T}W_0G_0)^{-1}G_0^{T}W_0^{2}G_0\bigr\} \approx h_1^{-1}(\nu_0 + \nu_2/\mu_2),$$

where ν_i = ∫ t^i K^2(t) dt and µ_i = ∫ t^i K(t) dt. Thus, we take

$$K = q + h_1^{-1}(\nu_0 + \nu_2/\mu_2)(p_1 + \cdots + p_\ell)$$

as the number of parameters involved in our estimation procedure for model (1.1). When the Epanechnikov kernel K(t) = 0.75(1 − t^2)_+ is used, ν_0 + ν_2/µ_2 = 1.028571. The AIC and BIC formulae given later apply to any model of the form (1.1), which can have different q, ℓ or x_j, and to model (3.6); in the latter case, K = h_1^{-1}(ν_0 + ν_2/µ_2)(q + p_1 + ⋯ + p_ℓ) and θ̂, â_j(·), j = 1, …, ℓ, are replaced by ā_j(·), j = 0, 1, …, ℓ, in the formulae.
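The Epanechnikov constant ν_0 + ν_2/µ_2 quoted above can be verified by exact polynomial integration. The following sketch (illustrative, using exact rational arithmetic) reproduces the value 1.028571:

```python
from fractions import Fraction

def poly_int(coeffs):
    """Integrate sum_k coeffs[k] * t^k over [-1, 1] exactly."""
    total = Fraction(0)
    for k, c in enumerate(coeffs):
        if k % 2 == 0:                 # odd powers integrate to zero on [-1, 1]
            total += Fraction(c) * Fraction(2, k + 1)
    return total

c = Fraction(3, 4)                     # Epanechnikov: K(t) = 0.75 (1 - t^2) on [-1, 1]
mu2 = poly_int([0, 0, c, 0, -c])                        # ∫ t^2 K(t) dt = 1/5
nu0 = poly_int([c * c, 0, -2 * c * c, 0, c * c])        # ∫ K(t)^2 dt = 3/5
nu2 = poly_int([0, 0, c * c, 0, -2 * c * c, 0, c * c])  # ∫ t^2 K(t)^2 dt = 3/35

effective = nu0 + nu2 / mu2            # = 36/35 ≈ 1.028571, as in Section 4.1
```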

### 4.2 Bandwidth selection

Suppose (1.1) is the true model and is used to analyze the data. The choice of the bandwidths h and h_1 determines the performance of the two-step estimators in Section 3.1. Compared to choosing h_1, selecting h is relatively simple, since h is employed to get undersmoothed estimators of the functional parameters. It follows from Theorem 1 that we get a good estimator of θ by letting h be of order n^{−α} for any α ∈ (1/4, 1). In practice, we may take α = 1/4 + δ for a small δ > 0 to avoid difficulties in the maximization of (3.2) caused by design sparsity. Proper selection of h_1 is crucial for â(·) to perform well. We propose to obtain a reasonable choice of h_1 first, then utilize the relationship between the optimal rate of h_1 and a suitable rate of h to determine h, and finally select h_1.

Define the AIC criterion for model (1.1) as

$$\mathrm{AIC} = -2\sum_{i=1}^{n} \log f\bigl\{Y_i;\, X_i,\, \hat{\theta},\, x_{i,1}^{T}\hat{a}_1(U_i), \ldots, x_{i,\ell}^{T}\hat{a}_\ell(U_i)\bigr\} + 2K.$$

To get a reasonable choice of h_1, we compute the version of AIC for model (3.6) for different values of h_1. This yields an AIC function of h_1 only; h is not involved since there are no constant parameters in (3.6). Then ĥ_1, the minimizer of this AIC function of h_1, is a rough approximation to the optimal value of h_1 in the two-step estimator â(·) for a(·) in (1.1). The reason is that, when modeling data from (1.1) using (3.6), the true value of a_0(·) is the constant vector θ, so the curve estimate of a_0(·) would be roughly flat for a wide range of h_1, and the AIC criterion for (3.6) mainly measures the performance of the estimators of a_j(·), j = 1, …, ℓ, as h_1 varies.

Selection of the bandwidth h in the two-step estimation of θ is not a major issue, as we explained before. Any bandwidth h will do as long as it is relatively small but not too small. In light of Theorem 2, which suggests that the optimal rate of h_1 is n^{−1/5}, and the earlier discussion on the choice of h, we take ĥ = n^{−0.051}ĥ_1. Note that the value of n^{−0.051} falls in the narrow range (0.5559, 0.7907) for n ∈ [10^2, 10^5].
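The narrow range quoted for n^{−0.051} is easy to confirm numerically; a small sketch (the helper name `h_from_h1` is ours, not the paper's):

```python
def h_from_h1(h1, n):
    """Two-step bandwidth rule of Section 4.2: undersmooth h1 by n^{-0.051}."""
    return n ** -0.051 * h1

# n^{-0.051} stays in a narrow band over three orders of magnitude in n
lo = 1e5 ** -0.051   # ~0.5559, the lower end of the quoted range
hi = 1e2 ** -0.051   # ~0.7907, the upper end of the quoted range
```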

The bandwidth ĥ_1 is chosen for estimating the functions a_0(·), a_1(·), …, a_ℓ(·) in model (3.6), which specifies the constant vector θ in the true model (1.1) as functional. Now, we refine our data-driven selection of h_1, which is required in the two-step estimation of a(·) in the true model (1.1). Based on ĥ we obtain the two-step estimator θ̂ for θ in (1.1). Plugging θ̂ into (3.4), we find â(·) and compute the AIC criterion for model (1.1) over a range of h_1. Denote the minimizer of this AIC function as h̃_1.

### 4.3 Identifying constant parameters

Define the version of BIC for model (1.1) as

$$\mathrm{BIC} = -2\sum_{i=1}^{n} \log f\bigl\{Y_i;\, X_i,\, \hat{\theta},\, x_{i,1}^{T}\hat{a}_1(U_i), \ldots, x_{i,\ell}^{T}\hat{a}_\ell(U_i)\bigr\} + K\log(n).$$

We propose a procedure to identify which parameters are constant and which are functional in model (1.1), based on the BIC criterion. This model selection problem interacts with the bandwidth selection problem: the BIC formula depends on the bandwidth h_1. In fact, it is almost impossible to choose the bandwidth and the constant parameters simultaneously. This is because either a complex model or a small bandwidth can result in a small bias and a large variance, and either a simple model or a large bandwidth can result in a large bias and a small variance. Thus, a complex model with a large bandwidth would have the same effect as a simple model with a small bandwidth. A sensible solution is to choose the bandwidths h_1 and h first and then identify the constant parameters, as suggested in the following.

We start with a model M_0 which is of the form (3.6), and then decide which parameters in M_0 are functional and which are constant. The choice of a_0(U) and x_1, …, x_ℓ, and how they determine the dependence of Y on X and U in M_0, should come from the basic assumptions on the model; in practice they are determined by the analyst. Due to the curse of dimensionality issue, we have to impose some basic assumptions on the model based on knowledge about the data we analyze, which is usually available from the background of the data or from people working in the area where the data arise. Since all the unknown parameters in M_0 are functions, as in Section 4.2 we choose the bandwidth h_1 for estimating the unknown functions by minimizing the version of the AIC criterion for model M_0. For simplicity of notation, denote this bandwidth as ĥ_1 and let ĥ = ĥ_1 n^{−0.051} again. We then fix these two bandwidths throughout the model selection procedure.

Ideally, we could compute the BICs for all possible combinations and choose the combination with the smallest BIC value. Unfortunately, this approach becomes computationally infeasible when κ, the number of parameters that can be either functional or constant, is not very small, as there are 2^κ possible combinations. We propose the following iterative procedure to reduce the computational burden. We start with M_0 as the candidate model, and at the L-th step of the iteration we examine whether one of the functional parameters in the candidate model M_L can be further reduced to a constant.

(a): Set L = 0. Based on model M_{0}, compute local likelihood estimates of all the
unknown parameter functions using bandwidth ˆh_{1}.

(b): If L = κ (all the κ parameters are reduced to constants in M_{L}) then M_{L}
is the chosen model and the model selection is completed. Otherwise, for each
of the unknown functions in the candidate model M_{L}, say a_{ij}(·), that could be
reduced to a constant, calculate

S_{ij} = Σ_{k=1}^{n} {ˆa_{ij}(U_{k}) − ¯a_{ij}}^{2},  ¯a_{ij} = n^{−1} Σ_{k=1}^{n} ˆa_{ij}(U_{k}).

Changing the function a_{ij}(·) in M_{L} that has the smallest S_{ij} to a constant
parameter, we get a new model M_{L+1}.

(c): Based on the new model M_{L+1} and the bandwidths ˆh_{1} and ˆh, compute the
estimates of the unknown functions and constants. Compute the BIC of M_{L+1},
and compare it with that of M_{L}. If M_{L} has a smaller BIC, then M_{L} is the
chosen model and the model selection is completed. Otherwise, M_{L+1} becomes
the candidate model, so we denote the new constant parameter in (b) as θ_{L+1},
change L to L + 1, and then go to (b).

The above iterative process continues until M_{L} has a smaller BIC than M_{L+1} for
some L < κ (M_{L} is the chosen model) or until L = κ (the chosen model
has all the considered parameters constant). Clearly, the final chosen model can
be written in the form of (1.1).
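To make steps (a)–(c) concrete, the greedy search can be sketched as below. This is a minimal illustration, not the paper's implementation: `estimates` stands in for the local likelihood fit with bandwidth ˆh_{1}, and `bic_of` for the BIC evaluation under bandwidths ˆh_{1} and ˆh; both names and the toy inputs are ours.

```python
import numpy as np

def greedy_bic_selection(estimates, bic_of, kappa):
    """Greedy backward search of steps (a)-(c).

    estimates(constant_set): {j: fitted curve hat a_j(U_k)} for each parameter j
        still treated as functional (placeholder for the local likelihood fit).
    bic_of(constant_set): BIC of the model whose parameters in constant_set
        are constants (placeholder for the fit with bandwidths hat h_1, hat h).
    """
    constant = set()                          # step (a): M_0, all functional
    best_bic = bic_of(frozenset(constant))
    while len(constant) < kappa:              # stop when all kappa are constant
        fits = estimates(frozenset(constant))
        # step (b): S_ij measures variability of each curve around its mean
        S = {j: float(np.sum((a - a.mean()) ** 2)) for j, a in fits.items()}
        j_star = min(S, key=S.get)            # flattest curve -> candidate
        trial = frozenset(constant | {j_star})
        trial_bic = bic_of(trial)
        if trial_bic >= best_bic:             # step (c): M_L beats M_{L+1}
            break
        constant, best_bic = set(trial), trial_bic
    return constant, best_bic

# Toy illustration: parameter 0 is truly constant, parameter 1 truly varies.
U = np.linspace(0.0, 1.0, 50)
def estimates(c):
    fits = {}
    if 0 not in c:
        fits[0] = np.full(50, 1.0)             # essentially flat curve
    if 1 not in c:
        fits[1] = 2.0 + np.sin(2 * np.pi * U)  # clearly functional curve
    return fits
bics = {frozenset(): 100.0, frozenset({0}): 90.0,
        frozenset({1}): 95.0, frozenset({0, 1}): 120.0}
selected, best = greedy_bic_selection(estimates, lambda c: bics[c], kappa=2)
```

With these toy BIC values the search makes parameter 0 constant (its curve has the smallest S_{ij} and the BIC drops from 100 to 90), then stops because making parameter 1 constant would raise the BIC. The search evaluates at most κ + 1 models rather than 2^{κ}.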

### 5 Asymptotic properties

This section investigates asymptotic distributions of the two-step estimators given in Section 3.1. Theory for profile likelihood estimation of θ can be developed under another set of regularity conditions. The derivation is straightforward given the existing literature (Severini and Wong, 1992; Murphy and van der Vaart, 2000; Fan and Huang, 2005) and hence is omitted.

For simplicity of notation, the theory presented here concerns the case where
x_{j} = X, j = 1, · · · , `. The established theory straightforwardly carries over to the
general case where x_{1}, · · · , x_{`}, are different. Let π(u) be the density of U and ¨a_{j}(u)
be the second derivative of a_{j}(u), j = 1, · · · , `. Write

z = (z_{1}, · · · , z_{`})^{T}, z_{j} = X^{T}a_{j}(u), j = 1, · · · , `, D = I_{`} ⊗ (X^{T}, 0_{1×p})^{T}, D_{c} = I_{`} ⊗ (0_{1×p}, X^{T})^{T}.
Theorem 1 gives the asymptotic distribution of ˆθ and shows that ˆθ is asymptotically
unbiased as an estimator of the constant parameter θ, provided that the bandwidth h
is of an order smaller than that of optimal bandwidths used in univariate smoothing.

Theorem 1. Under the regularity conditions stated in the Appendix, if h = o(n^{−1/4})
and nh/ log^{2}n −→ ∞, we have

n^{1/2}(ˆθ − θ) −→_{D} N(0_{q×1}, ∆) when n −→ ∞,
where

∆ = (I_{q}, 0_{q×2p`}) E{V_{c}(U)^{−1} V_{0}(U) V_{c}(U)^{−1}} (I_{q}, 0_{q×2p`})^{T},

V_{0}(u) = E{H I(γ) H^{T} | U = u}, V_{c}(u) = V_{0}(u) + E{µ_{2} H_{c} I(γ) H^{T}_{c} | U = u},

H = diag(I_{q}, D), H_{c} = diag(0_{q×q}, D_{c}), I(γ) = −E{ ˙g(Y ; X, γ) | X, U },

g(Y ; X, γ) = ∂ log f(Y ; X, γ)/∂γ, ˙g(Y ; X, γ) = ∂g(Y ; X, γ)/∂γ, γ = (θ^{T}, z^{T})^{T}.

In general, neither the profile likelihood estimator nor the two-step estimator of the constant parameters θ is uniformly superior to the other in asymptotic performance.

The profile likelihood estimator may have a smaller asymptotic variance than the two-step estimator when both are asymptotically normal, cf. Severini and Wong (1992). On the other hand, there are situations in which Theorem 1 holds but the profile likelihood estimator does not work, since it requires existence of the least favorable curves; see the example discussed in Fan and Wong (2000).

Theorem 2. Under the regularity conditions stated in the Appendix, if h_{1} −→ 0 and
nh_{1}/ log^{2}n −→ ∞, we have

(nh_{1})^{1/2}{ˆa(u) − a(u) + B} −→_{D} N(0_{p`×1}, Σ) when n −→ ∞,
where

B = 2^{−1} µ_{2} h^{2}_{1} [I_{`} ⊗ {(1, 0) ⊗ I_{p}}] G^{−1}_{c} Γ,

Γ = E{D I_{1}(z) (¨a_{1}(u), · · · , ¨a_{`}(u))^{T} X | U = u},

Σ = [I_{`} ⊗ {(1, 0) ⊗ I_{p}}] G^{−1}_{c} G G^{−1}_{c} π(u)^{−1} [I_{`} ⊗ {(1, 0)^{T} ⊗ I_{p}}],

G = E{ν_{0} D I_{1}(z) D^{T} + ν_{2} D_{c} I_{1}(z) D^{T}_{c} | U = u},

G_{c} = E{D I_{1}(z) D^{T} + µ_{2} D_{c} I_{1}(z) D^{T}_{c} | U = u},

I_{1}(z) = −E{∂^{2} log f(Y ; X, θ, z)/∂z∂z^{T} | X, U}|_{U=u}.

Theorem 2 shows that our estimator ˆa(·) has the adaptivity property: it has the
same asymptotic distribution as the estimator of a(·) obtained by maximizing
(3.4) with ˆθ replaced by the true value of θ. In addition, the optimal bandwidth h_{1}
is of order n^{−1/5} and the optimal convergence rate of ˆa(·) is n^{−2/5}. We leave the
proofs of these two theorems to the Appendix.
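For context, the order n^{−1/5} of the optimal h_{1} follows from the usual bias-variance tradeoff implied by Theorem 2: the bias B is of order h_{1}^{2} and the variance of ˆa(u) is of order (nh_{1})^{−1}, so, with c_{1}, c_{2} denoting generic constants, minimizing the asymptotic mean squared error gives

```latex
\mathrm{MSE}\{\hat a(u)\} \asymp c_1 h_1^4 + c_2 (n h_1)^{-1},
\qquad
\frac{d}{dh_1}\bigl\{c_1 h_1^4 + c_2 (n h_1)^{-1}\bigr\} = 0
\;\Longrightarrow\;
h_1 \asymp n^{-1/5},
\quad
\mathrm{MSE}^{1/2}\{\hat a(u)\} \asymp n^{-2/5}.
```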

### 6 Simulation study and data analysis

### 6.1 A simulated example

We use a simulated example to demonstrate how well the proposed methods work.

Suppose that, conditionally on X = x and U = u, Y has a Weibull distribution with density function

f(y; x, θ, a(u)x) = θ{a(u)x}^{−θ} y^{θ−1} exp[−{y/(a(u)x)}^{θ}], y > 0, (6.1)

where the constant θ > 0 is taken to be 1, the function a(·) is set to be a(u) =
2 + sin(2πu), U ∼ Uniform(0, 1), X ∼ Uniform(1, 2), and X and U are independent.

We simulated 300 samples of size 1000 from this model and applied our estimation and model selection procedures to the samples.
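Since (6.1) is the density of a Weibull distribution with shape θ and scale a(u)x, one replicate of such a sample is easy to generate. The sketch below assumes NumPy; the function names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def a(u):
    # functional parameter in the simulation design: a(u) = 2 + sin(2*pi*u)
    return 2.0 + np.sin(2.0 * np.pi * u)

def simulate(n, theta=1.0):
    # covariates: U ~ Uniform(0,1) and X ~ Uniform(1,2), independent
    U = rng.uniform(0.0, 1.0, n)
    X = rng.uniform(1.0, 2.0, n)
    # conditional on (X, U), Y is Weibull with shape theta and scale a(U)X,
    # matching density (6.1); rng.weibull draws a unit-scale variate
    Y = a(U) * X * rng.weibull(theta, n)
    return Y, X, U

Y, X, U = simulate(1000)
```

With θ = 1 the conditional distribution is exponential with mean a(U)X, so the sample mean of Y should be near E{a(U)}E(X) = 2 × 1.5 = 3.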

In the two-step estimation procedure, the bandwidths ˆh and ˜h_{1} in Section 4.2 were
employed, with model (3.6) specifying the conditional density

f(y; x, a_{0}(u), a_{1}(u)x) = a_{0}(u){a_{1}(u)x}^{−a_{0}(u)} y^{a_{0}(u)−1} exp[−{y/(a_{1}(u)x)}^{a_{0}(u)}], y > 0. (6.2)
The kernel function K was taken to be the Epanechnikov kernel. We use the mean
integrated absolute error (MIAE) to assess the accuracy of an estimator. The MIAE
of an estimator of an unknown constant is defined as its mean absolute error. The
MIAE of an estimator ˆa(·) of an unknown function a(·) is defined as

MIAE = E(IAE), where IAE = ∫ |ˆa(u) − a(u)| du.
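In practice the IAE of a curve estimate is approximated on a grid, and the MIAE is the average of the IAE values over the simulated samples. A minimal sketch using the trapezoidal rule; `iae` and the shifted "estimate" below are our illustrations, not the paper's code.

```python
import numpy as np

def iae(a_hat, a_true, grid):
    """Trapezoidal approximation of IAE = integral |a_hat(u) - a_true(u)| du."""
    err = np.abs(a_hat(grid) - a_true(grid))
    return float(np.sum(0.5 * (err[:-1] + err[1:]) * np.diff(grid)))

# Sanity check: an "estimate" shifted by a constant 0.1 over [0, 1] has IAE 0.1.
grid = np.linspace(0.0, 1.0, 1001)
a_true = lambda u: 2.0 + np.sin(2.0 * np.pi * u)
a_shifted = lambda u: a_true(u) + 0.1
val = iae(a_shifted, a_true, grid)
```

Averaging such IAE values over the 300 replicates gives a Monte Carlo estimate of the MIAE.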

The MIAEs for θ and a(·) are respectively 0.0154 and 0.1067. The bias and standard deviation for θ are respectively −0.0016 and 0.0194, agreeing with the theory that ˆθ is asymptotically unbiased; see Theorem 1. Panel (a) of Fig. 1 plots the pointwise 10%, 50% and 90% quantiles of the 300 curve estimates of a(·). Both estimators of θ and a(·) are quite accurate. Also, the constant parameter θ is estimated with a higher level of accuracy than the functional parameter a(·). This coincides with our theory that ˆθ has a faster rate of convergence than ˆa(·) does. Panel (b) of Fig. 1 plots the estimates of a(·) based on the sample with median IAE performance when θ is treated as unknown (dotted line) and known (dashed line). They are close to each other, indicating that our estimator of a(·) has the adaptivity property.


Figure 1: Estimates of the functional parameter in the Weibull example. Panel (a) plots the pointwise 10%, 50% and 90% quantiles (long-dash) of the 300 estimates of a(·) (solid). Panel (b) plots the estimates of a(·) (solid) based on the sample with median IAE performance when the constant parameter θ is treated as unknown (dotted) or known (dashed).

The profile likelihood method in Section 3.2 was applied to the same 300 samples.

The MIAEs for θ and a(·) are respectively 0.0168 and 0.1077. The bias and standard deviation for θ are respectively 0.0011 and 0.0206. In this example, θ is the shape parameter and a(·) determines the scale parameter in the conditional Weibull dis- tribution. The two-step method performs slightly better than the profile likelihood method in estimating the constant (shape) parameter, while the two perform equally well in estimating the functional (scale) parameter.

Suppose it is not known which of the two parameters is constant and which is functional. For each of the 300 samples simulated from (6.1), the model selection procedure in Section 4.3, with the starting model M_{0} given by (6.2), was applied to select the constant parameters. The correct model was selected 296 times; for the other 4 samples, the model with both θ and a as functions of u was selected.

Thus, our model selection criterion has a high success rate of 98%. Furthermore, it never makes the mistake of specifying the functional parameter a(·) as constant, which would result in a large bias and inconsistency in post-model-selection inference. Comparatively, misspecifying the constant parameter θ as functional, which our model selection rule does 2% of the time, is a minor problem, as the nonparametric estimates under the wrong model would still look flat. Thus, it is a good sign that when our model selection method misspecifies the model, the fitted model still resembles the true model. As evidence, the MIAE of estimating θ after model selection is 0.0167 and that of estimating a(·) is 0.1068; both are only slightly larger than the corresponding MIAEs with the correct model specified.

To further examine performance of the model selection procedure, for each of the
300 samples, we use the selected model and the corresponding parameter estimates
to predict the true values of the parameters θ and a(U_{0}) associated with a future
observation (Y_{0}, X_{0}, U_{0}). The mean absolute prediction errors for θ and a(U_{0}) are
0.0170 and 0.1130 respectively. In Fig. 2, panel (a) is a boxplot of the 300 predicted
values of θ ≡ 1, and panel (b) depicts the point cloud of the predicted value against the
true value of a(U_{0}) for the 300 samples. We can see that the predictions are both quite
satisfactory even though the model selection procedure misspecifies the model 2% of the time. Thus, we can conclude from this example that our proposed estimation and model selection procedures work together as a powerful tool for multiparameter likelihood modeling even when there is little knowledge about whether or not some of the parameters are constant.

In model (6.1), the Weibull conditional distribution has d = 2 parameters, of which ` = 1 follows a nonparametric form and the other d − ` = 1 follows a parametric form. A more complex model is obtained by changing the conditional distribution to Weibull(θ + a_{1}(u)x, a(u)x), where x, u, θ and a(·) are the same as in (6.1), and a_{1}(u) = 0.1 + 0.1 cos(2πu). In this case, ` = d = 2, with the shape parameter modeled semiparametrically and the scale parameter modeled nonparametrically.