
$$\left[\frac{\partial}{\partial\beta} AS\big(\hat\beta^{(\infty)}\big) - \frac{\partial}{\partial\beta} AS\big(\hat\beta^{(ml)}\big)\right]^{T} \frac{\hat\beta^{(\infty)} - \hat\beta^{(ml)}}{\big\|\hat\beta^{(\infty)} - \hat\beta^{(ml)}\big\|} \;\ge\; c_{\min}\tau/2 - 2\sqrt{q}\,\lambda/\tau \;>\; 0,$$

which contradicts the fact that

$$0 \in \left[\frac{\partial}{\partial\beta} AS\big(\hat\beta^{(\infty)}\big) - \frac{\partial}{\partial\beta} AS\big(\hat\beta^{(ml)}\big)\right]^{T} \big(\hat\beta^{(\infty)} - \hat\beta^{(ml)}\big).$$

Thus $\hat\beta^{(\infty)} = \hat\beta^{(ml)}$ on $F$.

Finally, it is straightforward to show that P (F ) → 1 as n → ∞. This completes the proof.

Note that under some regularity conditions,

$$n^{1/2}\big(\hat\beta^{(ml)} - \beta_0\big) \stackrel{d}{\longrightarrow} N\big(0, I^{-1}\big), \qquad (5.2)$$

where $I$ is the Fisher information. Since $\hat\beta^{(\infty)} = \hat\beta^{(ml)}$ with probability tending to one, it follows that $n^{1/2}\big(\hat\beta^{(\infty)} - \beta_0\big) \stackrel{d}{\longrightarrow} N\big(0, I^{-1}\big)$ under (5.2) and the conditions given in the above theorem.

6 Numerical Examples

In this chapter, we consider two simulation experiments, one for Poisson regression and the other for logistic regression. We are interested in the performance of the proposed PML method in comparison with AIC, BIC, and their stepwise versions, in terms of the Kullback-Leibler (KL) risk and the probability of identifying the true model.

6.1 Poisson Regression

We generate data $y_1, \ldots, y_n$ independently from Poisson distributions with means $\mu_1, \ldots, \mu_n$ according to the model

$$\log \mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i^T\beta; \quad i = 1, \ldots, n,$$

where the $x_i$'s are generated independently from $N(0, \Sigma)$ with $\rho^{|i-j|}$ being the $(i,j)$th entry of $\Sigma$, and $\beta = (0, 2, -1, 0, \ldots, 0)^T$. We apply the proposed PML method for $\tau = \tau_1, \ldots, \tau_5$ and $\lambda = \lambda_1, \ldots, \lambda_{100}$, which are equally spaced on the log scale. Throughout the simulation, we choose $\tau_1 = 5$ and $\tau_5 = 0.1$. In addition, we choose $\lambda_1$ to be the smallest value such that $\hat\beta_1^{(\infty)}(\lambda, 5) = \cdots = \hat\beta_p^{(\infty)}(\lambda, 5) = 0$, which can be obtained from the R package "glmnet" of Friedman et al. (2010), though only for the Lasso penalty, and we choose $\lambda_{100} = 0.001$.
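As a concrete illustration, here is a minimal R sketch of this data-generating process and of the two tuning grids; the variable names are ours, and glmnet is used only to obtain $\lambda_1$ as described above.

```r
library(glmnet)  # Friedman et al. (2010)

set.seed(1)
n <- 50; p <- 6; rho <- 0.5
Sigma <- rho^abs(outer(1:p, 1:p, "-"))     # (i,j)th entry of Sigma is rho^|i-j|
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
beta <- c(2, -1, rep(0, p - 2))            # slopes of beta = (0, 2, -1, 0, ..., 0)^T
mu <- as.vector(exp(X %*% beta))           # log mu_i = x_i^T beta, with beta_0 = 0
y <- rpois(n, mu)

## lambda_1 is the smallest lambda whose Lasso fit is all zero (the first
## value on the glmnet path); the 100 lambda values are equally spaced on
## the log scale down to lambda_100 = 0.001, and similarly for tau.
lambda1 <- max(glmnet(X, y, family = "poisson")$lambda)
lambda.grid <- exp(seq(log(lambda1), log(0.001), length.out = 100))
tau.grid <- exp(seq(log(5), log(0.1), length.out = 5))  # tau_1 = 5, ..., tau_5 = 0.1
```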

Table 1: Performance of various methods for Poisson regression with p = 6 and n = 50, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                        ρ = 0.5
                  KL risk        Error rate    KL risk        Error rate
Full model MLE    3.822 (0.273)  1.00          3.582 (0.319)  1.00
True model MLE    1.610 (0.173)  0.00          1.603 (0.191)  0.00
AIC               2.978 (0.298)  0.53          2.717 (0.325)  0.42
BIC               2.135 (0.244)  0.17          2.042 (0.258)  0.14
β̂(λ̂, 0.1)         2.045 (0.245)  0.13          1.983 (0.258)  0.12
β̂(λ̂, 0.27)        1.728 (0.181)  0.08          1.771 (0.212)  0.09
β̂(λ̂, 0.71)        1.909 (0.216)  0.14          2.428 (0.338)  0.23
β̂(λ̂, 1.88)        3.393 (0.282)  0.65          3.386 (0.347)  0.70
β̂(λ̂, 5)           3.396 (0.278)  0.74          3.439 (0.342)  0.73
β̂(λ̂, ∞)           3.396 (0.278)  0.74          3.439 (0.342)  0.73
β̂(λ̂, τ̂)           2.030 (0.243)  0.12          1.975 (0.258)  0.11

For each method, we estimate the Kullback-Leibler risk (KL risk), which is the expectation of the KL loss function; the KL loss measures the discrepancy between the estimates and the true parameters. The KL loss for the Poisson case is defined by

$$\sum_{i=1}^{n}\left\{\mu_i\left(\log\mu_i - \log\hat\mu_i\right) - \left(\mu_i - \hat\mu_i\right)\right\}.$$
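In R, this loss is a direct transcription of the display above (the function name is ours):

```r
## KL loss between the true Poisson means mu and the fitted means mu.hat
kl.loss.poisson <- function(mu, mu.hat) {
  sum(mu * (log(mu) - log(mu.hat)) - (mu - mu.hat))
}
```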

From the KL risk, we can see whether our method outperforms the other competitors.

Tables 1 and 2 present the results of 100 simulation replications and show the KL risk together with its standard error. The error rate is estimated by the proportion of the 100 replications in which a method selected a wrong model. When determining the tuning parameters, we utilize the BIC score method, which chooses the estimates corresponding to the smallest BIC score.
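A hedged sketch of this tuning rule follows, where `fits` is a hypothetical list of PML fits over the $(\lambda, \tau)$ grid, each carrying its log-likelihood `loglik` and its number of nonzero coefficients `df`:

```r
## BIC score of one fit: -2 * log-likelihood + (model size) * log(n)
bic.score <- function(fit, n) -2 * fit$loglik + fit$df * log(n)

## keep the fit (i.e., the (lambda, tau) pair) with the smallest BIC score
best.fit <- fits[[which.min(sapply(fits, bic.score, n = n))]]
```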

From the results in Tables 1 and 2, we notice that when there are more zero coefficients than nonzero coefficients, our method performs better than the others. We can also see that the ML estimates of $\beta$ based on the full model are more unstable. With a suitably chosen tuning parameter pair $(\lambda, \tau)$, the PML method performs model selection and estimation as well as AIC/BIC do, and sometimes outperforms them. When the number of variables of interest increases, AIC and BIC tend to consume much computation time. Therefore, PML offers an efficient way to select a suitable model and estimate the coefficients simultaneously, which not only saves the effort of examining each possible

Table 2: Performance of various methods for Poisson regression with p = 10 and n = 50, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                        ρ = 0.5
                  KL risk        Error rate    KL risk        Error rate
Full model MLE    5.731 (0.320)  1.00          5.830 (0.370)  1.00
True model MLE    1.610 (0.173)  0.00          1.603 (0.191)  0.00
AIC               3.931 (0.346)  0.75          3.955 (0.373)  0.73
BIC               2.529 (0.286)  0.27          2.522 (0.311)  0.26
β̂(λ̂, 0.1)         2.050 (0.243)  0.14          2.317 (0.306)  0.20
β̂(λ̂, 0.27)        1.689 (0.174)  0.08          1.915 (0.251)  0.15
β̂(λ̂, 0.71)        1.957 (0.235)  0.19          2.812 (0.390)  0.36
β̂(λ̂, 1.88)        4.012 (0.327)  0.80          4.421 (0.366)  0.86
β̂(λ̂, 5)           3.902 (0.299)  0.89          4.405 (0.361)  0.90
β̂(λ̂, ∞)           3.902 (0.299)  0.89          4.405 (0.361)  0.90
β̂(λ̂, τ̂)           2.045 (0.242)  0.14          2.309 (0.306)  0.19

model but also provides better estimates with smaller risk and standard error.

Figures 3 and 4 show how the KL loss and the BIC score behave along the path of $\lambda$ on the log scale, with each curve corresponding to one candidate value of $\tau$. The curve for each $\tau$ has a clear minimum, and the two figures present almost the same patterns, so it is natural to use the BIC score as a criterion for determining the tuning parameters. Even though we consider different values of $\tau$, the BIC scores help us find a good $\lambda$ for each $\tau$ and yield reasonable estimates. Note that of the two tuning parameters, $\lambda$ plays a more important role in model selection than $\tau$ does.

The small-$p$ cases show that the PML method can provide estimates that are more precise and accurate than the others. However, high-dimensional problems have become increasingly important, so we are interested in whether PML remains feasible as $p$ increases. We try larger $p = 20$ and $p = 40$ for Poisson regression. Note that $p$ is too large for the exact AIC/BIC methods, since there are $2^p$ models to compare; we utilize stepwise AIC/BIC instead. The results are shown in Tables 3 to 5.

Under high-dimensional conditions, the benefits of the PML method are easier to see. Compared with the other methods, our method provides smaller risk and smaller standard errors, and its proportion of selecting a wrong model is lower than that of its competitors.

[Figure: KL loss (left panel) and BIC score (right panel) plotted against log(λ); one curve per τ ∈ {0.1, 0.5, 1, ∞}.]

Figure 3: KL losses (left) and the corresponding BIC scores based on PML (right) for Poisson regression with p = 6 and n = 50.

[Figure: KL loss (left panel) and BIC score (right panel) plotted against log(λ); one curve per τ ∈ {0.1, 0.5, 1, ∞}.]

Figure 4: KL losses (left) and the corresponding BIC scores based on PML (right) for Poisson regression with p = 10 and n = 50.

Table 3: Performance of various methods for Poisson regression with p = 20 and n = 50, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                         ρ = 0.5
                  KL risk         Error rate    KL risk         Error rate
Full model MLE    12.016 (0.610)  1.00          12.216 (0.528)  1.00
True model MLE    1.610 (0.173)   0.00          1.603 (0.191)   0.00
Stepwise AIC      8.223 (0.571)   0.94          8.306 (0.537)   0.97
Stepwise BIC      4.628 (0.502)   0.52          4.500 (0.505)   0.54
β̂(λ̂, 0.1)         2.367 (0.300)   0.20          2.928 (0.430)   0.29
β̂(λ̂, 0.27)        1.746 (0.181)   0.10          1.846 (0.256)   0.15
β̂(λ̂, 0.71)        2.107 (0.251)   0.25          3.324 (0.425)   0.54
β̂(λ̂, 1.88)        5.104 (0.340)   0.93          6.203 (0.456)   0.98
β̂(λ̂, 5)           4.963 (0.325)   0.95          6.234 (0.455)   1.00
β̂(λ̂, ∞)           4.963 (0.325)   0.95          6.234 (0.455)   1.00
β̂(λ̂, τ̂)           2.310 (0.295)   0.18          2.858 (0.429)   0.26

Table 4: Performance of various methods for Poisson regression with p = 20 and n = 100, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                         ρ = 0.5
                  KL risk         Error rate    KL risk         Error rate
Full model MLE    10.555 (0.317)  1.00          10.786 (0.310)  1.00
True model MLE    1.652 (0.145)   0.00          1.725 (0.145)   0.00
Stepwise AIC      7.057 (0.354)   0.92          7.185 (0.332)   0.94
Stepwise BIC      3.411 (0.268)   0.39          3.076 (0.257)   0.31
β̂(λ̂, 0.1)         1.927 (0.162)   0.11          1.954 (0.163)   0.08
β̂(λ̂, 0.27)        1.758 (0.151)   0.09          1.766 (0.145)   0.04
β̂(λ̂, 0.71)        1.760 (0.151)   0.11          2.310 (0.281)   0.08
β̂(λ̂, 1.88)        5.278 (0.399)   0.89          6.222 (0.396)   0.88
β̂(λ̂, 5)           5.378 (0.389)   0.87          6.442 (0.390)   0.90
β̂(λ̂, ∞)           5.378 (0.389)   0.97          6.442 (0.390)   0.90
β̂(λ̂, τ̂)           1.927 (0.162)   0.11          1.954 (0.163)   0.08

Table 5: Performance of various methods for Poisson regression with p = 40 and n = 100, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                          ρ = 0.5
                  KL risk          Error rate    KL risk          Error rate
Full model MLE    22.497 (0.554)   1.00          22.585 (0.519)   1.00
True model MLE    1.652 (0.145)    0.00          1.725 (0.145)    0.00
Stepwise AIC      14.538 (0.585)   1.00          14.246 (0.501)   1.00
Stepwise BIC      5.364 (0.435)    0.58          5.011 (0.400)    0.55
β̂(λ̂, 0.1)         1.968 (0.184)    0.13          2.001 (0.157)    0.11
β̂(λ̂, 0.27)        1.864 (0.166)    0.12          1.750 (0.145)    0.03
β̂(λ̂, 0.71)        1.938 (0.187)    0.15          2.905 (0.375)    0.22
β̂(λ̂, 1.88)        6.430 (0.409)    0.92          8.435 (0.451)    0.98
β̂(λ̂, 5)           7.098 (0.399)    1.00          8.481 (0.444)    1.00
β̂(λ̂, ∞)           7.098 (0.399)    1.00          8.481 (0.444)    1.00
β̂(λ̂, τ̂)           1.968 (0.184)    0.13          1.982 (0.158)    0.10

Note that our objective function (3.1) can approximate AIC/BIC when $\tau$ is close to zero. We are therefore interested in whether our $L_0$ approach can approximate exhaustive AIC and BIC: whether AIC/BIC and our approach select the same model, and whether their AIC/BIC scores are close, in small-$p$ cases. Tables 6 and 7 show the proportions of simulations in which our approach well approximates the results of AIC and BIC.
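To fix ideas, here is a hedged sketch of the limiting behavior; we assume a truncated-$L_1$ penalty in the spirit of Shen et al. (2010), whereas the exact form of (3.1) is given earlier in the thesis:

$$\lambda \sum_{j=1}^{p} \min\left(\frac{|\beta_j|}{\tau},\, 1\right) \;\longrightarrow\; \lambda \sum_{j=1}^{p} 1\{\beta_j \neq 0\} \qquad \text{as } \tau \to 0,$$

so that, with $\lambda$ calibrated to the AIC or BIC penalty per parameter, minimizing the penalized objective mimics minimizing the corresponding information criterion.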

From the results, we can see that a larger sample size leads to a higher proportion of good approximation.

6.2 Logistic Regression

For the logistic regression simulation settings, the observations $y_1, y_2, \ldots, y_n$ are generated from Bernoulli distributions with probabilities $p_i$ satisfying the model

$$\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} = x_i^T\beta,$$

where $x_i = (x_{i0}, x_{i1}, \ldots, x_{ip})^T$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ are $(p+1)$-dimensional vectors.

The predictors $x_i$ are generated from the i.i.d. normal $N(0, \Sigma_{p\times p})$ distribution, where $\Sigma_{p\times p}$ denotes the correlation matrix of an AR(1) time series model with $(i,j)$th entry $\rho^{|i-j|}$. We assign $\beta = (0, 3, -2, 0, \ldots, 0)^T$ and follow the same procedure as in the Poisson

Table 6: Approximations to AIC and BIC for Poisson regression with p = 6.

                     proportion of selecting   proportion of having   proportion of both
                     the same model            the same estimates     events occurring
                     AIC      BIC              AIC      BIC           AIC      BIC
n = 50    τ = 0.5    0.02     0                0        0             0        0
          τ = 0.1    0.26     0.63             0.10     0.55          0.10     0.55
          τ = 0.01   0.41     0.46             0.41     0.46          0.41     0.46
          τ = 0.001  0.05     0.04             0.05     0.04          0.05     0.04
n = 100   τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0.05     0.27             0        0.26          0        0.26
          τ = 0.01   0.45     0.78             0.44     0.78          0.44     0.78
          τ = 0.001  0.14     0.05             0.14     0.05          0.14     0.05
n = 200   τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0.03     0.10             0        0.08          0        0.08
          τ = 0.01   0.67     0.85             0.64     0.85          0.64     0.85
          τ = 0.001  0.43     0.26             0.43     0.26          0.43     0.26
n = 500   τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0.02     0                0        0             0        0
          τ = 0.01   0.76     0.91             0.59     0.91          0.59     0.91
          τ = 0.001  0.55     0.89             0.55     0.89          0.55     0.89

Table 7: Approximations to AIC and BIC for Poisson regression with p = 10.

                     proportion of selecting   proportion of having   proportion of both
                     the same model            the same estimates     events occurring
                     AIC      BIC              AIC      BIC           AIC      BIC
n = 50    τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0.08     0.40             0.04     0.34          0.04     0.34
          τ = 0.01   0.16     0.38             0.16     0.38          0.16     0.38
          τ = 0.001  0.04     0.01             0.03     0             0.03     0
n = 100   τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0        0.10             0        0.05          0        0.05
          τ = 0.01   0.21     0.62             0.21     0.62          0.21     0.62
          τ = 0.001  0.09     0.05             0.09     0.05          0.09     0.05
n = 200   τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0        0.02             0        0.01          0        0.01
          τ = 0.01   0.27     0.71             0.26     0.71          0.26     0.71
          τ = 0.001  0.30     0.14             0.30     0.14          0.30     0.14
n = 500   τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0        0                0        0             0        0
          τ = 0.01   0.52     0.81             0.31     0.81          0.31     0.81
          τ = 0.001  0.24     0.71             0.24     0.71          0.24     0.71

Table 8: Performance of various methods for logistic regression with p = 6 and n = 100, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                        ρ = 0.5
                  KL risk        Error rate    KL risk        Error rate
Full model MLE    4.906 (0.399)  1.00          4.133 (0.242)  1.00
True model MLE    2.004 (0.204)  0.00          1.591 (0.128)  0.00
AIC               3.727 (0.352)  0.47          3.074 (0.234)  0.47
BIC               2.558 (0.255)  0.12          2.017 (0.187)  0.10
β̂(λ̂, 0.1)         2.558 (0.255)  0.12          2.017 (0.187)  0.10
β̂(λ̂, 0.27)        2.561 (0.200)  0.11          2.017 (0.187)  0.10
β̂(λ̂, 0.71)        2.317 (0.236)  0.09          1.875 (0.188)  0.09
β̂(λ̂, 1.88)        3.693 (0.256)  0.56          3.045 (0.183)  0.74
β̂(λ̂, 5)           4.095 (0.312)  0.60          3.154 (0.166)  0.74
β̂(λ̂, ∞)           4.108 (0.313)  0.60          3.154 (0.166)  0.74
β̂(λ̂, τ̂)           2.558 (0.255)  0.12          2.017 (0.187)  0.10

regression. The KL loss for logistic regression is defined by

$$\sum_{i=1}^{n}\left\{p_i\log\frac{p_i}{\hat p_i} + (1 - p_i)\log\frac{1 - p_i}{1 - \hat p_i}\right\}.$$
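A minimal R sketch of the logistic analogue of the earlier Poisson setup, reusing `X`, `n`, and `p` from that sketch; again, the function name is ours:

```r
## generate Bernoulli responses under log(p_i / (1 - p_i)) = x_i^T beta, beta_0 = 0
beta <- c(3, -2, rep(0, p - 2))
p.true <- as.vector(plogis(X %*% beta))
y <- rbinom(n, size = 1, prob = p.true)

## KL loss between the true probabilities p and the fitted probabilities p.hat,
## transcribing the display above
kl.loss.logistic <- function(p, p.hat) {
  sum(p * log(p / p.hat) + (1 - p) * log((1 - p) / (1 - p.hat)))
}
```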

First we consider smaller-dimension cases with $p = 6$ and $p = 10$, each under $\rho = 0$ and $\rho = 0.5$. Tables 8 and 9 present the results of 100 simulation replications and show the KL risk together with its standard error. The error rate is estimated by the proportion of the 100 replications in which a method selected a wrong model. When determining the tuning parameters, we utilize the BIC score method and choose the estimates with the smallest BIC score.

Compared with the results of the Poisson simulations, the results for logistic regression are very unstable. The reason can be seen from Figures 5 and 6: the patterns in the KL loss and the BIC scores do not match. We choose our estimates by the smallest BIC value, but the chosen estimates do not attain the point with the smallest loss. The most likely reason is that the sample size is too small to estimate the coefficients precisely and accurately; the responses are binary and carry little information. Although we cannot choose the optimal solution, the PML method still competes with the other methods. We can see that the KL risk of PML is smaller than those of the others, and the proportion

Table 9: Performance of various methods for logistic regression with p = 10 and n = 100, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                        ρ = 0.5
                  KL risk        Error rate    KL risk        Error rate
Full model MLE    9.528 (0.741)  1.00          8.442 (0.674)  1.00
True model MLE    1.804 (0.210)  0.00          2.001 (0.310)  0.00
AIC               6.405 (0.505)  0.85          5.899 (0.574)  0.89
BIC               2.949 (0.288)  0.26          3.133 (0.388)  0.24
β̂(λ̂, 0.1)         2.955 (0.307)  0.25          3.092 (0.393)  0.22
β̂(λ̂, 0.27)        2.680 (0.289)  0.18          2.849 (0.390)  0.14
β̂(λ̂, 0.71)        2.139 (0.237)  0.12          2.359 (0.340)  0.18
β̂(λ̂, 1.88)        4.449 (0.267)  0.82          4.706 (0.309)  0.82
β̂(λ̂, 5)           4.408 (0.210)  0.75          4.831 (0.254)  0.80
β̂(λ̂, ∞)           4.408 (0.210)  0.75          4.985 (0.299)  0.80
β̂(λ̂, τ̂)           2.898 (0.286)  0.25          3.092 (0.393)  0.22

[Figure: KL loss (left panel) and BIC score (right panel) plotted against log(λ); one curve per τ ∈ {0.1, 0.5, 1, 5, ∞}.]

Figure 5: KL losses (left) and the corresponding BIC scores based on PML (right) for logistic regression with p = 6 and n = 50.

[Figure: KL loss (left panel) and BIC score (right panel) plotted against log(λ); one curve per τ ∈ {0.1, 0.5, 1, 5, ∞}.]

Figure 6: KL losses (left) and the corresponding BIC scores based on PML (right) for logistic regression with p = 10 and n = 50.

of selecting a wrong model is smaller than that of its competitors, so we can still conclude that our method performs better.

We are also interested in high-dimensional logistic regression, so we try $p = 20$ and $p = 40$. The exact AIC and BIC methods are replaced by the stepwise AIC/BIC procedures. Tables 10 to 12 show the simulation results.

From Tables 10 to 12, we can see that our method still gives reasonable estimates, with smaller risk and error rate than the others; we therefore conclude that our method is effective.

Similarly, we are interested in whether our $L_0$ approach works for approximating the exact AIC and BIC methods. Tables 13 and 14 show the proportions of simulations in which our approach well approximates the results of AIC and BIC.

6.3 Data Analysis: Low Birth Weight Data

We apply our method to the low birth weight dataset of Hosmer and Lemeshow (1989). The data on 189 newborn babies were collected at Baystate Medical Center, Springfield, MA, in 1986. The data contain a binary response that indicates whether a newborn baby has a low birth weight, together with several risk factors associated with low birth weight, which are used as explanatory variables. We apply the following

Table 10: Performance of various methods for logistic regression with p = 20 and n = 100, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                         ρ = 0.5
                  KL risk         Error rate    KL risk         Error rate
Full model MLE    83.310 (8.767)  1.00          42.549 (6.512)  1.00
True model MLE    1.839 (0.209)   0.00          2.001 (0.309)   0.00
Stepwise AIC      75.063 (9.184)  0.98          33.391 (6.673)  0.98
Stepwise BIC      67.630 (9.637)  0.70          24.598 (6.801)  0.55
β̂(λ̂, 0.1)         8.111 (1.527)   0.52          5.304 (0.595)   0.38
β̂(λ̂, 0.27)        6.947 (1.757)   0.27          4.033 (0.519)   0.22
β̂(λ̂, 0.71)        4.777 (1.390)   0.25          2.690 (0.389)   0.34
β̂(λ̂, 1.88)        10.523 (1.966)  0.73          6.690 (0.403)   0.88
β̂(λ̂, 5)           11.642 (2.470)  0.65          8.335 (1.611)   0.84
β̂(λ̂, ∞)           6.624 (0.400)   0.64          6.733 (0.305)   0.84
β̂(λ̂, τ̂)           9.297 (1.976)   0.52          5.341 (0.594)   0.39

Table 11: Performance of various methods for logistic regression with p = 20 and n = 200, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                         ρ = 0.5
                  KL risk         Error rate    KL risk         Error rate
Full model MLE    15.659 (0.801)  1.00          13.712 (0.334)  1.00
True model MLE    1.767 (0.127)   0.00          1.576 (0.090)   0.00
Stepwise AIC      9.683 (0.591)   0.93          8.632 (0.316)   0.96
Stepwise BIC      3.618 (0.243)   0.36          3.556 (0.217)   0.38
β̂(λ̂, 0.1)         3.456 (0.239)   0.32          3.286 (0.202)   0.35
β̂(λ̂, 0.27)        2.111 (0.164)   0.05          2.231 (0.169)   0.10
β̂(λ̂, 0.71)        1.852 (0.128)   0.05          1.739 (0.093)   0.11
β̂(λ̂, 1.88)        4.835 (0.224)   0.75          6.290 (0.319)   0.81
β̂(λ̂, 5)           6.759 (0.182)   0.63          7.904 (0.275)   0.81
β̂(λ̂, ∞)           6.759 (0.182)   0.63          7.904 (0.275)   0.81
β̂(λ̂, τ̂)           3.456 (0.239)   0.32          3.286 (0.202)   0.35

Table 12: Performance of various methods for logistic regression with p = 40 and n = 200, where values given in parentheses are the corresponding standard errors.

Method            ρ = 0                           ρ = 0.5
                  KL risk           Error rate    KL risk          Error rate
Full model MLE    124.839 (9.943)   1.00          48.942 (2.483)   1.00
True model MLE    1.767 (0.127)     0.00          1.576 (0.090)    0.00
Stepwise AIC      93.931 (10.402)   1.00          26.552 (1.370)   1.00
Stepwise BIC      67.207 (10.908)   0.61          6.841 (0.472)    0.63
β̂(λ̂, 0.1)         6.228 (0.695)     0.43          5.472 (0.402)    0.51
β̂(λ̂, 0.27)        3.266 (0.600)     0.08          2.042 (0.157)    0.07
β̂(λ̂, 0.71)        1.819 (0.127)     0.06          1.696 (0.094)    0.13
β̂(λ̂, 1.88)        7.179 (0.298)     0.77          8.579 (0.382)    0.87
β̂(λ̂, 5)           9.340 (0.198)     0.55          10.947 (0.291)   0.78
β̂(λ̂, ∞)           9.340 (0.198)     0.55          10.947 (0.291)   0.78
β̂(λ̂, τ̂)           6.228 (0.695)     0.43          5.472 (0.402)    0.51

Table 13: Approximations to AIC and BIC for logistic regression with p = 6.

                     proportion of selecting   proportion of having   proportion of both
                     the same model            the same estimates     events occurring
                     AIC      BIC              AIC      BIC           AIC      BIC
n = 100   τ = 0.5    0.23     0.67             0.11     0.58          0.11     0.58
          τ = 0.1    0.50     0.17             0.50     0.09          0.50     0.09
          τ = 0.01   0        0                0        0             0        0
          τ = 0.001  0        0                0        0             0        0
n = 200   τ = 0.5    0.09     0.43             0.02     0.37          0.02     0.37
          τ = 0.1    0.53     0.87             0.53     0.87          0.53     0.87
          τ = 0.01   0        0                0        0             0        0
          τ = 0.001  0        0                0        0             0        0
n = 500   τ = 0.5    0.01     0.14             0        0.12          0        0.12
          τ = 0.1    0.93     0.99             0.56     0.99          0.56     0.99
          τ = 0.01   0.26     0                0.24     0             0.24     0
          τ = 0.001  0        0                0        0             0        0
n = 1000  τ = 0.5    0.01     0.09             0        0.09          0        0.09
          τ = 0.1    0.54     0.97             0.29     0.97          0.29     0.97
          τ = 0.01   0.44     0                0.44     0             0.44     0
          τ = 0.001  0        0                0        0             0        0

Table 14: Approximations to AIC and BIC for logistic regression with p = 10.

                     proportion of selecting   proportion of having   proportion of both
                     the same model            the same estimates     events occurring
                     AIC      BIC              AIC      BIC           AIC      BIC
n = 100   τ = 0.5    0.09     0.43             0        0.33          0        0.33
          τ = 0.1    0.27     0.18             0.27     0.09          0.17     0.09
          τ = 0.01   0        0                0        0             0        0
          τ = 0.001  0        0                0        0             0        0
n = 200   τ = 0.5    0.01     0.17             0        0.16          0        0.16
          τ = 0.1    0.25     0.86             0.25     0.86          0.25     0.86
          τ = 0.01   0        0                0        0             0        0
          τ = 0.001  0        0                0        0             0        0
n = 500   τ = 0.5    0        0                0        0             0        0
          τ = 0.1    0.82     0.91             0.33     0.91          0.33     0.91
          τ = 0.01   0.13     0                0.12     0             0.12     0
          τ = 0.001  0        0                0        0             0        0
n = 1000  τ = 0.5    0        0.01             0        0.01          0        0.01
          τ = 0.1    0.32     0.93             0.12     0.93          0.12     0.93
          τ = 0.01   0.26     0                0.26     0             0.26     0
          τ = 0.001  0        0                0        0             0        0

model:

$$\text{low} = \beta_0 + \beta_1 \times \text{age} + \beta_2 \times \text{lwt} + \beta_3 \times \text{race:white} + \beta_4 \times \text{race:black} + \beta_5 \times \text{smoke} + \beta_6 \times \text{ht} + \beta_7 \times \text{ui} + \beta_8 \times \text{ftv} + \beta_9 \times \text{ptl},$$

where age and lwt are standardized to have mean 0 and variance 1. The details of these predictors are shown in Table 15, and the selected models and estimates are presented in Table 16. The standard deviations of the estimated coefficients, obtained by a parametric bootstrap method, are shown as "boot.std" in Table 16. We also show the standard deviations "glm.std" obtained from the selected model using the R function "glm".
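For reference, here is a hedged sketch of the full-model fit in R; we assume the birthwt data shipped with the MASS package, which corresponds to the Hosmer and Lemeshow (1989) data.

```r
library(MASS)  # provides the birthwt data of Hosmer and Lemeshow (1989)

d <- birthwt
d$age  <- as.numeric(scale(d$age))  # standardize to mean 0, variance 1
d$lwt  <- as.numeric(scale(d$lwt))
d$race <- relevel(factor(d$race, labels = c("white", "black", "other")),
                  ref = "other")    # so race:white and race:black are the dummies

full.fit <- glm(low ~ age + lwt + race + smoke + ht + ui + ftv + ptl,
                family = binomial, data = d)
summary(full.fit)$coefficients      # "glm.std"-type standard errors for a fixed model
```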

From the results shown in Table 16, we can see that AIC selects more variables than the others. The variables lwt and ptl are clearly important risk factors in this study: it is intuitive that the mother's weight and whether she has had premature labors directly influence the chance of having a low birth weight baby.

One thing worth mentioning is that both AIC and BIC select the variable ht but PML does not. Note that only 12 of the 189 observations have a history of hypertension, so it is not easy to tell whether this variable is important. Another reason can be seen by comparing "glm.std" with "boot.std". In contrast to the standard deviations

Table 15: Variables in the low birth weight dataset.

Name    Description
low     indicator of birth weight less than 2.5 kg
age     mother's age in years
lwt     mother's weight in pounds at last menstrual period
race    mother's race ("white", "black", "other")
smoke   smoking status during pregnancy
ht      history of hypertension
ui      presence of uterine irritability
ftv     number of physician visits during the first trimester
ptl     number of previous premature labors

Table 16: Estimated parameters obtained from various methods.

Variable         PML                 AIC                            BIC
λ̂                0.0355
τ̂                0.01
                 coeff (boot.std)    coeff (glm.std) (boot.std)     coeff (glm.std) (boot.std)
intercept        -1.116 (0.260)      -1.211 (0.279) (0.399)         -1.225 (0.200) (0.257)
age              -0.322 (0.274)       0.000 (0.000) (0.189)          0.000 (0.000) (0.083)
lwt              -0.321 (0.278)      -0.431 (0.201) (0.294)         -0.523 (0.207) (0.341)
race:white        0.000 (0.069)      -1.011 (0.396) (0.562)          0.000 (0.000) (0.089)
race:black        0.000 (0.171)       0.000 (0.000) (0.478)          0.000 (0.000) (0.223)
smoke             0.000 (0.127)       0.931 (0.399) (0.530)          0.000 (0.000) (0.122)
ht                0.000 (0.254)       1.848 (0.705) (1.123)          1.888 (0.720) (2.200)
ui                0.000 (0.274)       0.739 (0.461) (0.691)          0.000 (0.000) (0.250)
ftv               0.000 (0.110)       0.000 (0.000) (0.332)          0.000 (0.000) (0.192)
ptl               1.523 (0.733)       1.119 (0.451) (0.607)          1.407 (0.428) (0.679)
Size of model     3                   6                              3
Log-likelihood   -107.181            -99.309                        -105.131
BIC score         230.087             230.068                        225.987

obtained from the parametric bootstrap, the standard deviations obtained by "glm" are much smaller. This is somewhat expected, since they do not take model selection into account and are therefore expected to be underestimated.

7 Discussion

In the context of model selection under generalized linear models, we propose a PML method that enables simultaneous model selection and parameter estimation. Despite using a nonconvex penalty, the proposed estimates can be computed efficiently for high-dimensional data thanks to difference-of-convex programming and the coordinate descent algorithm. In addition, numerical and theoretical results are provided to demonstrate the effectiveness of the proposed method.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, edited by Petrov, B. N. and Csáki, F. Akadémiai Kiadó, Budapest.

[2] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.

[3] An, L. T. H. and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. Journal of Global Optimization, 11, 253-285.

[4] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression (with discussion). The Annals of Statistics, 32, 407-499.

[5] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.

[6] Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302-332.

[7] Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1-22.

[8] Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression. Wiley, New York.

[9] Lange, K. (2004). Optimization. Springer, New York.

[10] Nishii, R. (1984). Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 12, 758-765.

[11] Osborne, M., Presnell, B., and Turlach, B. (2000). On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.

[12] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.

[13] Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-242.

[14] Shen, X. and Huang, H.-C. (2010). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association, 105, 727-739.

[15] Shen, X., Pan, W., Zhu, Y., and Zhou, H. (2010). On L0 regularization in high-dimensional regression. Manuscript.

[16] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

[17] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541-2567.

[18] Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38, 894-942.

Appendices
