
Identifying constant parameters

In document: 廣義多參數概似模型之估計 (Estimation of Generalized Multiparameter Likelihood Models), pp. 40-109

We propose a procedure, based on the BIC criterion, for identifying which parameters in model (1.1) are constant and which are functional. This model selection problem interacts with the bandwidth selection problem, because the BIC formula depends on the bandwidth $h_1$. In fact, choosing the bandwidth and the constant parameters simultaneously is almost impossible: either a complex model or a small bandwidth can result in a small bias and a large variance, and either a simple model or a large bandwidth can result in a large bias and a small variance. Thus a complex model with a large bandwidth would have much the same effect as a simple model with a small bandwidth. A sensible solution is to first choose the bandwidths $h_1$ and $h$ and then identify the constant parameters.

We start with a model $M_0$ of the form (5.2) and then determine which parameters in $M_0$ are functional and which are constant. The choice of $a_0(U)$ and $x_1, \dots, x_\ell$, and the knowledge of how they determine the dependence of $Y$ on $X$ and $U$ in $M_0$, should come from the basic assumptions on the model; in practice, they are determined by the analyst. Because of the curse of dimensionality, we need to impose some basic assumptions on the model based on knowledge about the data being analyzed, which is usually available from the background of the data or from people working in the area where the data arise. Because all of the unknown parameters in $M_0$ are functions, as in Section 5.2, we choose the bandwidth $h_1$ for estimating the unknown functions by minimizing the version of the AIC for model $M_0$. For simplicity of notation, we denote this bandwidth by $\hat h_1$ and again let $\hat h = \hat h_1 n^{-0.051}$. We then keep these two bandwidths fixed throughout the model selection procedure.

Ideally, we would compute the BIC for every possible combination and choose the combination with the smallest BIC value. Unfortunately, this approach quickly becomes computationally infeasible when $\kappa$ (the number of parameters that can be either functional or constant) is not very small, because there are $2^\kappa$ possible combinations. We propose the following iterative procedure to reduce the computational burden. We start with $M_0$ as the candidate model, and at the $L$th step of the iteration we examine whether one of the functional parameters in the candidate model $M_L$ can be further reduced to a constant.

(a): Set $L = 0$. Based on model $M_0$, compute local likelihood estimates of all of the unknown parameter functions using bandwidth $\hat h_1$.

(b): If $L = \kappa$ (i.e., all of the $\kappa$ parameters are reduced to constants in $M_L$), then $M_L$ is the chosen model and model selection is completed. Otherwise, for each of the unknown functions in the candidate model $M_L$, say $a_{ij}(\cdot)$, that could be reduced to a constant, calculate

\[
S_{ij} = \sum_{k=1}^{n} \{\hat a_{ij}(U_k) - \bar a_{ij}\}^2, \qquad \bar a_{ij} = n^{-1} \sum_{k=1}^{n} \hat a_{ij}(U_k).
\]

Changing the function $a_{ij}(\cdot)$ in $M_L$ that has the smallest $S_{ij}$ to a constant parameter results in a new model, $M_{L+1}$.

(c): Based on the new model $M_{L+1}$ and the bandwidths $\hat h_1$ and $\hat h$, compute the estimates of the unknown functions and constants. Compute the BIC of $M_{L+1}$ and compare it with that of $M_L$. If $M_L$ has a smaller BIC, then $M_L$ is the chosen model and the model selection is completed. Otherwise, $M_{L+1}$ becomes the candidate model; we denote the new constant parameter from (b) by $\theta_{L+1}$, change $L$ to $L + 1$, and go to (b).

The foregoing iterative process continues until $M_L$ has a smaller BIC than $M_{L+1}$ for some $L < \kappa$ (i.e., $M_L$ is the chosen model) or until $L = \kappa$ (i.e., the chosen model has all of the considered parameters constant). The final chosen model can evidently be written in the form of (1.1).
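The control flow of steps (a) to (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_constant_parameters`, the parameter labels, and in particular the callable `fit` are hypothetical stand-ins introduced here. `fit` is assumed to carry out the local likelihood estimation at the fixed bandwidths and to return the BIC of the fitted model together with the fitted values of each still-functional parameter at $U_1, \dots, U_n$; neither the estimation nor the BIC formula is implemented below.

```python
from typing import Callable, Dict, FrozenSet, Tuple

import numpy as np

# Hypothetical fit result: (BIC, {parameter name: fitted values at U_1, ..., U_n}).
FitResult = Tuple[float, Dict[str, np.ndarray]]


def select_constant_parameters(
    candidates: FrozenSet[str],
    fit: Callable[[FrozenSet[str]], FitResult],
) -> FrozenSet[str]:
    """Greedy search of steps (a)-(c); returns the names chosen as constant.

    `fit(constants)` is a user-supplied stand-in (an assumption of this sketch).
    It fits the model in which `constants` are constant and every other
    candidate is functional, and returns that model's BIC and fitted curves.
    """
    constants: FrozenSet[str] = frozenset()
    bic, fitted = fit(constants)                   # step (a): fit M_0
    while constants != candidates:                 # step (b): stop once L = kappa
        # S = sum_k (a_hat(U_k) - mean)^2 for each parameter that is still functional.
        spread = {
            name: float(np.sum((fitted[name] - fitted[name].mean()) ** 2))
            for name in candidates - constants
        }
        flattest = min(spread, key=spread.get)     # smallest S becomes constant
        trial = constants | {flattest}
        trial_bic, trial_fitted = fit(trial)       # step (c): fit M_{L+1}
        if bic < trial_bic:                        # M_L has the smaller BIC: keep M_L
            break
        constants, bic, fitted = trial, trial_bic, trial_fitted
    return constants
```

The loop calls `fit` at most $\kappa + 1$ times, compared with the $2^\kappa$ fits an exhaustive search would need.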

6 Asymptotic properties

In this section we investigate asymptotic distributions of the backfitting estimators given in Section 4.1, the profile likelihood estimators given in Section 4.2, and the two-step estimators given in Section 4.3.

For simplicity of notation, the theory presented here concerns the case with $x_j = X$, $j = 1, \dots, \ell$. The established theory carries over straightforwardly to the general case where $x_1, \dots, x_\ell$ are different. Let $\pi(u)$ be the density of $U$ and let $\ddot a_j(u)$ be the second derivative of $a_j(u)$, $j = 1, \dots, \ell$. Write

\[
z = (z_1, \dots, z_\ell)^T, \quad z_j = X^T a_j(u), \ j = 1, \dots, \ell, \quad D = I_\ell \otimes (X^T, 0_{1\times p})^T, \quad D_c = I_\ell \otimes (0_{1\times p}, X^T)^T.
\]

Theorems 1 and 2 give the asymptotic distributions of $\hat\theta_{BF}$ and $\hat\theta_{PR}$. Note that the backfitting and profiling procedures produce estimators with the same asymptotic distribution. The backfitting procedure requires that undersmoothing be used when computing $\hat a(U)$, whereas the profiling procedure does not.

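Theorem 1. Assume that the regularity conditions (S2)–(S4) and (BF1)–(BF5) stated in Appendices A and C hold, and that the bandwidth $h$ satisfies $nh^4 \to 0$ (which rules out $h \propto n^{-1/5}$). Then

\[
n^{1/2}(\hat\theta_{BF} - \theta) \overset{D}{\longrightarrow} N\left(0_{q\times 1},\ G_0^{-1} \Sigma_1 G_0^{-1}\right) \quad \text{as } n \to \infty,
\]

where

\[
G(\theta) = \frac{d}{d\theta} E\left[\frac{\partial}{\partial\theta} L\{\theta, a^0_{\theta_0}(U)\}\right],
\qquad
\Sigma_1 = \operatorname{cov}\left[\frac{\partial}{\partial\theta} L(\theta_0, a^0_{\theta_0}) + \frac{\partial}{\partial a} L(\theta_0, a^0_{\theta_0})\, \frac{\partial}{\partial\theta} a^0_{\theta_0}(U)\right].
\]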

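Theorem 2. Assume that the regularity conditions (PR1)–(PR4) stated in Appendix B hold, and allow $h \propto n^{-1/5}$. Then

\[
n^{1/2}(\hat\theta_{PR} - \theta) \overset{D}{\longrightarrow} N\left(0_{q\times 1},\ G_0^{-1} \Sigma_1 G_0^{-1}\right) \quad \text{as } n \to \infty.
\]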

The asymptotic distributions of the proposed two-step estimators are discussed by Cheng et al. (2009). We state them in Theorems 3 and 4 as follows. Theorem 3 gives the asymptotic distribution of $\hat\theta$ and shows that $\hat\theta$ is asymptotically unbiased as an estimator of the constant parameter $\theta$, provided that the bandwidth $h$ is of an order smaller than that of the optimal bandwidths used in univariate smoothing.

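Theorem 3. Under the regularity conditions (S1)–(S7) stated in the Appendix, if $h = o(n^{-1/4})$ and $nh/\log^2 n \to \infty$, then

\[
n^{1/2}(\hat\theta - \theta) \overset{D}{\longrightarrow} N(0_{q\times 1}, \Delta) \quad \text{as } n \to \infty,
\]

where

\[
\begin{aligned}
\Delta &= (I_q, 0_{q\times 2p\ell})\, E\{V_c(U)^{-1} V_0(U) V_c(U)^{-1}\}\, (I_q, 0_{q\times 2p\ell})^T, \\
V_0(u) &= E\{H I(\gamma) H^T \mid U = u\}, \\
V_c(u) &= V_0(u) + E\{\mu_2 H_c I(\gamma) H_c^T \mid U = u\}, \\
H &= \operatorname{diag}(I_q, D), \qquad H_c = \operatorname{diag}(0_{q\times q}, D_c), \\
I(\gamma) &= -E\{\dot g(Y; X, \gamma) \mid X, U\}, \\
g(Y; X, \gamma) &= \frac{\partial \log f(Y; X, \gamma)}{\partial\gamma}, \qquad \dot g(Y; X, \gamma) = \frac{\partial g(Y; X, \gamma)}{\partial\gamma}, \qquad \gamma = (\theta^T, z^T)^T.
\end{aligned}
\]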

In general, neither the profile likelihood estimator nor the two-step estimator of the constant parameter $\theta$ is consistently superior to the other in asymptotic performance. The profile likelihood estimator may have a smaller asymptotic variance than the two-step estimator when both are asymptotically normal (Severini and Wong 1992). On the other hand, there are situations in which Theorem 3 holds but the profile likelihood estimator does not work, because it requires the existence of the least favorable curves (see the example discussed in Fan and Wong 2000).

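Theorem 4. Under the regularity conditions stated in the Appendix, if $h_1 \to 0$ and $nh_1/\log^2 n \to \infty$, then

\[
(nh_1)^{1/2}\{\hat a(u) - a(u) + B\} \overset{D}{\longrightarrow} N(0_{p\ell\times 1}, \Sigma) \quad \text{as } n \to \infty,
\]

where

\[
\begin{aligned}
B &= 2^{-1} \mu_2 h_1^2\, I_\ell \otimes \{(1, 0) \otimes I_p\}\, G_c^{-1} \Gamma, \\
\Gamma &= E\{D I_1(z) (\ddot a_1(u), \dots, \ddot a_\ell(u))^T X \mid U = u\}, \\
\Sigma &= I_\ell \otimes \{(1, 0) \otimes I_p\}\, G_c^{-1} G G_c^{-1}\, \pi(u)^{-1}\, I_\ell \otimes \{(1, 0)^T \otimes I_p\}, \\
G &= E\{\nu_0 D I_1(z) D^T + \nu_2 D_c I_1(z) D_c^T \mid U = u\}, \\
G_c &= E\{D I_1(z) D^T + \mu_2 D_c I_1(z) D_c^T \mid U = u\}, \\
I_1(z) &= -E\left\{\frac{\partial^2 \log f(Y; X, \theta, z)}{\partial z\, \partial z^T} \,\Big|\, X, U\right\}\Big|_{U = u}.
\end{aligned}
\]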

Theorem 4 says that our estimator $\hat a(\cdot)$ has the adaptivity property: it has the same asymptotic distribution as the estimator of $a(\cdot)$ obtained by maximizing (4.10) with $\hat\theta$ replaced by the true value of $\theta$. In addition, the optimal bandwidth $h_1$ is of order $n^{-1/5}$, and the optimal convergence rate of $\hat a(\cdot)$ is $n^{-2/5}$. We defer the proofs of these two theorems to the Appendix.
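A heuristic reading of Theorem 4 makes the $n^{-1/5}$ order explicit. The calculation below is a sketch, not part of the proof. The constants $C_B$ and $C_V$ are placeholders introduced here (standing for the squared size of the $h_1^2$ coefficient in $B$ and for the size of $\Sigma$); they are assumed positive and free of $n$ and $h_1$.

```latex
\begin{align*}
\mathrm{MSE}(h_1)
  &\approx \underbrace{C_B h_1^4}_{\text{squared bias, since } B = O(h_1^2)}
   + \underbrace{\frac{C_V}{n h_1}}_{\text{variance of order } (n h_1)^{-1}}, \\
\frac{d}{dh_1}\,\mathrm{MSE}(h_1) = 0
  &\iff 4 C_B h_1^3 = \frac{C_V}{n h_1^2}
   \iff h_1 = \left(\frac{C_V}{4 C_B}\right)^{1/5} n^{-1/5}, \\
\mathrm{MSE}(h_1)
  &\propto n^{-4/5} \text{ at this bandwidth, so } \hat a(u) - a(u) = O_P(n^{-2/5}).
\end{align*}
```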

7 Simulation study and data analysis

7.1 Logistic Regression

Consider the following logistic regression model

\[
\log\left\{\frac{P(Y = 1 \mid X = x, U = u)}{1 - P(Y = 1 \mid X = x, U = u)}\right\} = a_1(u)\, x_1 + a_2(u)\, x_2 + a_3 x_3 + a_4 x_4, \tag{7.1}
\]

where $a_1(\cdot)$ and $a_2(\cdot)$ are unknown functional parameters, $a_3$ and $a_4$ are unknown constant parameters, and $X$ and $U$ are independent. Further, assume $X \sim N(0, I)$, $U \sim \mathrm{Uniform}(0, 1)$, $a_1(u) = \sin(2\pi u)$, $a_2(u) = \cos(2\pi u)$, $a_3 = 2$, and $a_4 = 1$. The sample sizes were set to 500 and 1000. For each sample size we repeated the experiment 300 times.
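The data-generating design of model (7.1) can be reproduced with a short NumPy sketch. The function name `simulate_logistic`, the seed, and the reading of $X \sim N(0, I)$ as a four-dimensional standard normal vector (one component per covariate $x_1, \dots, x_4$) are assumptions made here for illustration. The code only generates data; it does not implement any of the estimators.

```python
import numpy as np


def simulate_logistic(n: int, rng: np.random.Generator):
    """Draw one sample of size n from model (7.1) with the stated settings."""
    u = rng.uniform(0.0, 1.0, size=n)         # U ~ Uniform(0, 1)
    x = rng.standard_normal(size=(n, 4))      # X ~ N(0, I_4), independent of U
    # Linear predictor a1(u) x1 + a2(u) x2 + a3 x3 + a4 x4 with a3 = 2, a4 = 1.
    eta = (np.sin(2 * np.pi * u) * x[:, 0]
           + np.cos(2 * np.pi * u) * x[:, 1]
           + 2.0 * x[:, 2]
           + 1.0 * x[:, 3])
    p = 1.0 / (1.0 + np.exp(-eta))            # P(Y = 1 | X, U)
    y = rng.binomial(1, p)                    # Bernoulli responses
    return y, x, u


rng = np.random.default_rng(0)                # seed chosen arbitrarily
samples = [simulate_logistic(n=1000, rng=rng) for _ in range(300)]
```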

The kernel function $K$ was set to be the Epanechnikov kernel. The bandwidths $h$ and $h_1$ were taken to be the data-driven AIC bandwidths $\hat h$ and $\tilde h_1$, respectively, given in Section 5.2, with model (5.2) specifying the conditional distribution

\[
\log\left\{\frac{P(Y = 1 \mid X = x, U = u)}{1 - P(Y = 1 \mid X = x, U = u)}\right\} = a_1(u)\, x_1 + a_2(u)\, x_2 + a_3(u)\, x_3 + a_4(u)\, x_4. \tag{7.2}
\]

We use the mean integrated absolute error (MIAE) to assess the accuracy of an estimator. The MIAE of an estimator of an unknown constant is defined as its mean absolute error. The MIAE of an estimator $\hat a(\cdot)$ of an unknown function $a(\cdot)$ is defined as

\[
\mathrm{MIAE} = E(\mathrm{IAE}), \quad \text{where } \mathrm{IAE} = \int |\hat a(u) - a(u)|\, du.
\]
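In a simulation, the IAE is typically approximated on a grid and the MIAE by averaging over replications. The sketch below shows one way to do this. The helper names, the trapezoidal rule, and the choice of $[0, 1]$ (the support of $U$) as the integration range are assumptions made here, since the text does not state how the integral was evaluated.

```python
import numpy as np


def integrated_absolute_error(a_hat, a_true, grid):
    """Trapezoidal approximation of the integral of |a_hat(u) - a_true(u)| over the grid."""
    err = np.abs(a_hat(grid) - a_true(grid))
    return float(np.sum(0.5 * (err[1:] + err[:-1]) * np.diff(grid)))


def miae(estimates, a_true, grid):
    """Monte Carlo MIAE: the average IAE over replicated curve estimates."""
    return float(np.mean([integrated_absolute_error(a, a_true, grid) for a in estimates]))


grid = np.linspace(0.0, 1.0, 201)        # integration grid on [0, 1] (an assumption)


def a1(u):
    """True curve a1(u) = sin(2 pi u) from the logistic example."""
    return np.sin(2 * np.pi * u)


def shifted(u):
    """Toy 'estimate': the true curve shifted upward by 0.1."""
    return a1(u) + 0.1


iae_value = integrated_absolute_error(shifted, a1, grid)
```

For the toy estimate the absolute error is 0.1 everywhere, so `iae_value` equals 0.1 up to floating-point rounding.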

The proposed two-step estimation method and the profile likelihood estimation method using our suggested initial value were employed to estimate $a_1(\cdot)$, $a_2(\cdot)$, $a_3$, and $a_4$ for the 300 random samples. Table 1 compares the performance of the two methods under different sample sizes. The results suggest that both the two-step and the profile likelihood estimation methods work well. The profile likelihood method is slightly better than the two-step method in estimating the constant parameters, while they perform equally well in estimating the functional parameters.

Table 1: MIAEs of different estimation methods for logistic regression.

              Two-step                            Profile likelihood
Sample size   a1(·)    a2(·)    a3       a4       a1(·)    a2(·)    a3       a4
1000          0.1775   0.1753   0.1887   0.1174   0.1781   0.1748   0.1650   0.1121
500           0.2488   0.2465   0.2123   0.1368   0.2509   0.2514   0.2014   0.1406

To give a visual picture of how well the two-step estimators of the functional parameters work, the pointwise 10%, 50%, and 90% quantiles of the 300 estimates of $a_1(\cdot)$ and $a_2(\cdot)$ are plotted in Fig. 1 for sample size $n = 1000$ and in Fig. 2 for $n = 500$. The solid lines are the true curves. Further, for each sample size we single out the sample with median total IAE performance, i.e., the one that yields the median of the IAEs for $a_1(\cdot)$ and $a_2(\cdot)$. In these samples, the estimates of $a_3$ and $a_4$ are 2.280 and 1.094, respectively, when the sample size is 1000, and 1.9590 and 0.8112 when the sample size is 500. The dotted lines are the estimates, based on this sample, when $a_3$ and $a_4$ are treated as unknown. The dashed lines are the estimates, based on the same sample, when $a_3$ and $a_4$ are treated as known and replaced by their true values in the local likelihood function (4.10). From Figs. 1 and 2, we can see that the proposed method works quite well. Also, the estimators of the unknown functional parameters work as well as when the unknown constant parameters are replaced by their true values. This means that the proposed estimators of the functional parameters do have the adaptivity property.

Suppose we do not know which of the four parameters are constant and which are functional. The BIC model selection procedure proposed in Section 5.3, with the starting model $M_0$ specified by (7.2), was applied to the simulated samples. When the sample size is 1000, 276 of the 300 samples specify the true model, 7 samples pick $a_1$, $a_2$, and $a_3$ as constant parameters, 9 samples take $a_1$, $a_2$, and $a_4$ as constant parameters, and the remaining 8 samples determine all the parameters as constant parameters. When the sample size is 500, 251 of the 300 samples specify the true model, 19 samples select $a_3$ as a constant parameter, 2 samples prefer $a_4$ as a constant parameter, 8 samples pick $a_1$, $a_3$, and $a_4$ as constant parameters, 9 samples take $a_2$, $a_3$, and $a_4$ as constant parameters, 8 samples determine all the parameters as constant parameters, and the remaining 3 samples determine all the parameters as functional parameters. The left panels of Figs. 3 and 4 show boxplots of the 300 predicted values of $a_3 \equiv 2$ and $a_4 \equiv 1$, and the right panels of Figs. 3 and 4 depict the point clouds of the predicted values against the true values of $a_1(U_0)$ and $a_2(U_0)$ for the 300 samples.

Figure 1: Functional parameters in logistic regression when the sample size is 1000. The left and right columns depict results for $a_1(\cdot)$ and $a_2(\cdot)$, respectively. In the upper row, the long-dash lines are the pointwise 10%, 50%, and 90% quantiles of the 300 estimates. The bottom row plots the estimates based on the sample with median total IAE performance when the constant coefficients $a_3$ and $a_4$ are treated as unknown (dotted) or known (dashed). The solid lines represent the true functions.

Figure 2: Functional parameters in logistic regression when the sample size is 500. The left and right columns depict results for $a_1(\cdot)$ and $a_2(\cdot)$, respectively. In the upper row, the long-dash lines are the pointwise 10%, 50%, and 90% quantiles of the 300 estimates. The bottom row plots the estimates based on the sample with median total IAE performance when the constant coefficients $a_3$ and $a_4$ are treated as unknown (dotted) or known (dashed). The solid lines represent the true functions.

Instead of the BIC criterion, we can also use the AIC criterion to build the model selection procedure. In this case, when the sample size is 1000, 242 of the 300 samples specify the true model, and the remaining 58 samples determine all the parameters as functional parameters. When the sample size is 500, 211 of the 300 samples specify the true model, and the remaining 89 samples select all the parameters as functional parameters. The left panels of Figs. 5 and 6 show boxplots of the 300 predicted values of $a_3 \equiv 2$ and $a_4 \equiv 1$, and the right panels of Figs. 5 and 6 depict the point clouds of the predicted values against the true values of $a_1(U_0)$ and $a_2(U_0)$ for the 300 samples. Note that although the AIC criterion does not select the correct model as often as the BIC criterion does, it generates a smaller prediction error. This is because the AIC criterion tends to select more complex models (for example, the model with all $a_j$ as functions, $j = 1, \dots, 4$), while the BIC criterion tends to select simpler models (for example, the model with all $a_j$ as constants, $j = 1, \dots, 4$), owing to their different penalties on model complexity. When the model is misspecified, specifying the functional parameters $a_1(\cdot)$ and $a_2(\cdot)$ as constants would result in a large bias and inconsistency in post-model-selection inference, while misspecifying the constant parameters $a_3$ and $a_4$ as functions is only a minor problem because the nonparametric estimates under the wrong model still look flat.

Our simulation results lend empirical support to the view that the consistency property of BIC and the efficiency property of AIC in parametric models also hold in our model.

Figure 3: Parameter predictions in the logistic example when the sample size is 1000. Left: boxplots of the predicted values of $a_3 \equiv 2$ and $a_4 \equiv 1$ based on the selected model for the 300 samples. Right: scatterplots of the predicted value against the true value of $a_1(U_0)$ and $a_2(U_0)$ for the 300 samples.

Figure 4: Parameter predictions in the logistic example when the sample size is 500. Left: boxplots of the predicted values of $a_3 \equiv 2$ and $a_4 \equiv 1$ based on the selected model for the 300 samples. Right: scatterplots of the predicted value against the true value of $a_1(U_0)$ and $a_2(U_0)$ for the 300 samples.

7.2 Weibull model

Suppose that, conditional on $X = x$ and $U = u$, $Y$ has a Weibull distribution with density function

\[
f(y; x, \theta, a(u)x) = \frac{\theta}{\{a(u)x\}^{\theta}}\, y^{\theta - 1} \exp\left[-\{y/a(u)x\}^{\theta}\right], \quad y > 0, \tag{7.3}
\]

where the constant $\theta > 0$ is the shape parameter and is taken to be 2, the function $a(\cdot)$ is the scale parameter and is set to be the quadratic function $a(u) = \beta_0 + \beta_1 u + \beta_2 u^2$, $U \sim \mathrm{Uniform}(0, 1)$, $X \sim \mathrm{Uniform}(1, 2)$, and $X$ and $U$ are independent. This example is motivated by some real applications. For example, in reliability data analysis, Meeker and Escobar (1997), Nelson (1984), and Wang and Kececioglu (2000) studied the low-cycle fatigue life data for a strain-controlled test on 26 cylindrical specimens of a nickel-base superalloy to estimate the curve giving the number of cycles at which 0.1% of the population of such specimens would fail, as a function of the pseudostress $U$. They assumed that the logarithm of the number of cycles, conditional on the pseudostress, follows a Weibull distribution with a constant shape parameter (independent of the pseudostress) and a functional scale parameter. The scale parameter was often assumed to be a linear, quadratic, or log-linear function of the pseudostress. In our implementation, we set $\beta_0 = 2$, $\beta_1 = -1.6$, and $\beta_2 = 3.6$. The sample sizes are taken to be 250, 500, and 1000; for each size we simulated 300 samples from this model and applied our estimation and model selection procedures to the samples.

Figure 5: Parameter predictions in the logistic example by the AIC criterion when the sample size is 1000. Left: boxplots of the predicted values of $a_3 \equiv 2$ and $a_4 \equiv 1$ based on the selected model for the 300 samples. Right: scatterplots of the predicted value against the true value of $a_1(U_0)$ and $a_2(U_0)$ for the 300 samples.

Figure 6: Parameter predictions in the logistic example by the AIC criterion when the sample size is 500. Left: boxplots of the predicted values of $a_3 \equiv 2$ and $a_4 \equiv 1$ based on the selected model for the 300 samples. Right: scatterplots of the predicted value against the true value of $a_1(U_0)$ and $a_2(U_0)$ for the 300 samples.
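The Weibull design of model (7.3) can be simulated as in the following sketch. The function names and the seed are hypothetical, introduced here for illustration. The code relies on the fact that a Weibull variable with shape $\theta$ and scale $\lambda$ equals $\lambda$ times a standard Weibull variable with shape $\theta$, which is what NumPy's `Generator.weibull` draws.

```python
import numpy as np


def scale_function(u, beta=(2.0, -1.6, 3.6)):
    """Quadratic a(u) = beta0 + beta1 * u + beta2 * u**2 with the stated coefficients."""
    b0, b1, b2 = beta
    return b0 + b1 * u + b2 * u ** 2


def simulate_weibull(n: int, rng: np.random.Generator, theta: float = 2.0):
    """Draw (Y, X, U) of size n from model (7.3): shape theta, scale a(U) * X."""
    u = rng.uniform(0.0, 1.0, size=n)         # U ~ Uniform(0, 1)
    x = rng.uniform(1.0, 2.0, size=n)         # X ~ Uniform(1, 2), independent of U
    # Generator.weibull draws from the standard Weibull (scale 1); rescale by a(U) * X.
    y = scale_function(u) * x * rng.weibull(theta, size=n)
    return y, x, u


rng = np.random.default_rng(1)                # seed chosen arbitrarily
y, x, u = simulate_weibull(n=250, rng=rng)
```

A quick sanity check on such a draw is that $\{Y / (a(U) X)\}^{\theta}$ should behave like a standard exponential variable, with mean close to 1 in large samples.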

In the two-step estimation procedure, we used the bandwidths $\hat h$ and $\tilde h_1$ in Section 5.2, with model (5.2) specifying the conditional density

\[
f(y; x, a_0(u), a(u)x) = \frac{a_0(u)}{\{a(u)x\}^{a_0(u)}}\, y^{a_0(u) - 1} \exp\left[-\{y/a(u)x\}^{a_0(u)}\right], \quad y > 0. \tag{7.4}
\]

The kernel function $K$ was taken to be the Epanechnikov kernel. The MIAEs for $\theta$ and $a(\cdot)$ are 0.0175 and 0.1090 with sample size 1000, 0.0318 and 0.1369 with sample size 500, and 0.0623 and 0.1774 with sample size 250. The bias and standard deviation for $\theta$ are 0.00149 and 0.0496 with sample size 1000, in agreement with the theory that $\hat\theta$ is asymptotically unbiased (see Theorem 3). The bias and standard deviation for $\theta$ with sample sizes 500 and 250 are reported in Tables 3 and 4. The left panel of Fig. 7 plots the pointwise 10%, 50%, and 90% quantiles of the 300 curve estimates of $a(\cdot)$ with sample sizes 1000, 500, and 250 from top to bottom. The estimators of both $\theta$ and $a(\cdot)$ are quite accurate. In addition, the constant parameter $\theta$ is estimated with a higher level of accuracy than the functional parameter $a(\cdot)$. This coincides with our theory that $\hat\theta$ has a faster rate of convergence than $\hat a(\cdot)$. The right panel of Fig. 7 plots the estimates of $a(\cdot)$ based on the sample with median IAE performance when $\theta$ is treated as unknown (dotted line) and known (dashed line). The estimates are close to each other, indicating that our estimator of $a(\cdot)$ has the adaptivity property.

We also applied the profile likelihood method described in Section 4.2 to the same 300 samples. The MIAEs, biases, and standard deviations for $\theta$ and $a(\cdot)$ with sample sizes $n = 1000$, 500, and 250 are summarized in Tables 2, 3, and 4, respectively. Note that the profile likelihood estimators diverge in some samples with sample size 250, which may be due to design sparsity. In this example, $\theta$ is the shape parameter and $a(\cdot)$ determines the scale parameter in the conditional Weibull distribution. The two-step method performs slightly better than the profile likelihood method in estimating both the constant (shape) parameter and the functional (scale) parameter.

We also fitted parametric models to the same 300 samples. In what follows, "quadratic model" denotes the case where $a(\cdot)$ is correctly specified as a quadratic function, "cubic model" the case where $a(\cdot)$ is misspecified as a cubic function, and "linear model" the case where $a(\cdot)$ is assumed to be a linear function. Tables 2–4 list the performance of the different methods.

Suppose that it is not known which of the two parameters are constant and which are functional. For each of the 300 samples simulated from (7.3), we used the model selection procedure in Section 5.3, with the starting model $M_0$ given by model (7.4), to select the constant parameters. When the sample size is 1000, 295 samples specify the true model, and for all of the other 4 samples, the model with both $\theta$ and $a$ as functions of $u$ was selected. When the sample size is 500, 271 samples specify the true model, and the other 22 samples take the model with both $\theta$ and $a$ as functions. When the sample size is 250, 160 samples specify the true model, 62 samples prefer the model with both $\theta$ and $a$ as constants, and the other 78 samples opt for the model with both $\theta$ and $a$ as functions. This indicates that our model selection criterion has a high success rate when the sample size is moderately large (98% when the sample size is 1000, and 90% when the sample size is 500).

Figure 7: Estimates of the functional parameter in the Weibull example. Left panel: pointwise 10%, 50%, and 90% quantiles (long-dashed lines) of the 300 estimates of $a(\cdot)$ (solid line) for sample sizes 1000, 500, and 250 from top to bottom. Right panel: estimates of $a(\cdot)$ (solid line) based on the sample with median IAE performance, with the constant parameter $\theta$ treated as unknown (dotted line) or known (dashed line).

Table 2: Performance of different estimation methods on the Weibull example when the sample size is 1000.

Method               MAE of θ   Bias of θ   Std of θ   MIAE of a(·)
Two-step             0.0175     0.00149     0.0496     0.1090
Profile likelihood   0.0181     0.00155     0.0504     0.1121
Linear model         0.0266     0.00248     0.0501     0.2001
Quadratic model      0.0168     0.00111     0.0488     0.0542
Cubic model          0.0180     -0.00156    0.0524     0.1080

Table 3: Performance of different estimation methods on the Weibull example when the sample size is 500.

Method               MAE of θ   Bias of θ   Std of θ   MIAE of a(·)
Two-step             0.0318     0.00169     0.0711     0.1367
Profile likelihood   0.0344     0.00195     0.0721     0.1400
Linear model         0.0339     0.00221     0.0702     0.1989
Quadratic model      0.0299     0.00152     0.0698     0.0704
Cubic model          0.0320     0.00177     0.0780     0.1360

Table 4: Performance of different estimation methods on the Weibull example when the sample size is 250.

Method               MAE of θ   Bias of θ   Std of θ   MIAE of a(·)
Two-step             0.0623     0.0149      0.1201     0.1774
Profile likelihood   1.2555     1.8400      21.8001    18.6441
Linear model         0.0666     0.0183      0.0911     0.1802
Quadratic model      0.0414     0.0100      0.1989     0.0780
Cubic model          0.0517     0.0126      0.1075     0.1511

To further examine the performance of the model selection procedure, for each sample size we used the selected model and the corresponding parameter estimates to predict the true values of the parameters $\theta$ and $a(U_0)$ associated with a future observation $(Y_0, X_0, U_0)$ for each of the 300 samples. The mean absolute prediction errors (MAPE) for $\theta$ and $a(U_0)$ are reported in Table 5. The left panel of Fig. 8 shows boxplots of the 300 predicted values of $\theta \equiv 2$ with sample sizes 1000, 500, and 250 from top to bottom, and the right panel of Fig. 8 depicts the point clouds of the predicted values against the true values of $a(U_0)$ for the 300 samples. We can see that both sets of predictions are quite satisfactory, even though the model selection procedure may misspecify the model when the sample size is moderately large.
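The prediction summary can be computed as in the sketch below. It assumes, since the text gives no formula, that the MAPE is the average over the 300 samples of the absolute difference between the predicted and the true parameter value. The function name and the toy numbers are hypothetical and are not results from the study.

```python
import numpy as np


def mean_absolute_prediction_error(predicted, truth):
    """Average of |predicted - truth| over samples; `truth` may be a scalar or an array."""
    predicted = np.asarray(predicted, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.mean(np.abs(predicted - truth)))


# Toy illustration only: three made-up predictions of the constant theta = 2.
toy_mape = mean_absolute_prediction_error([1.9, 2.1, 2.0], 2.0)
```

For a functional parameter, `truth` would be the array of true values $a(U_0)$, one per sample, matched with the array of predictions.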

Thus we can conclude from this example that our proposed estimation and model selection procedures work together to provide a powerful tool for multiparameter likelihood modeling even when there is little knowledge regarding whether or not some of the parameters are constant.
