
PREPRINT

Preprint, Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/~mathlib/preprint/2012-04.pdf

Variable selection in linear regression through adaptive penalty selection

Hung Chen and Chuan-Fa Tang

September 06, 2012


Variable selection in linear regression through adaptive penalty selection

Hung Chen (1) and Chuan-Fa Tang (2)

(1) Department of Mathematics, National Taiwan University, Taipei, 11167, Taiwan

(2) Department of Statistics, University of South Carolina, Columbia, SC 29208, USA

Abstract

Model selection procedures often use a fixed penalty, such as Mallows’ Cp, to avoid choosing a model which fits a particular data set extremely well. These procedures are often devised to give an unbiased risk estimate when a particular chosen model is used to predict future responses.

As a correction for not including the variability induced by model selection, generalized degrees of freedom is introduced in Ye (1998) as an estimate of the model selection uncertainty that arises from using the same data for both model selection and the associated parameter estimation. Building on generalized degrees of freedom, Shen and Ye (2002) proposed a data-adaptive complexity penalty.

In this article, we evaluate the validity of such an approach for model selection in linear regression when the set of candidate models is nested and includes the true model. It is found that the performance of such an approach is even worse than that of Mallows’ Cp in terms of the probability of correct selection. However, this approach, coupled with a proper selection of the range of the penalty or with the little bootstrap proposed in Breiman (1992), performs better than Cp by increasing the probability of correct selection, but it still cannot match BIC in achieving model selection consistency.

Keywords: Adaptive penalty selection; Mallows’ Cp; Model selection consistency; little bootstrap; generalized degrees of freedom


1 Introduction

In analyzing data from complex scientific experiments or observational studies, one often encounters the problem of choosing one plausible linear regression model from a sequence of nested candidate models. When the exact true model is among the candidate models, BIC (Schwarz, 1978) has been shown to be a consistent model selector in Nishii (1984), while Cp (Mallows, 1973) is shown to be a conservative model selector in Woodroofe (1982), tending to pick a slightly overfitted model. From BIC to Cp (AIC), the penalty placed on model dimension changes from log n to 2, where n represents the number of observations. Many efforts have been made to understand in what circumstances these criteria allow one to identify the right model asymptotically.

Liu and Yang (2011) proposed a model selector in terms of a data-adaptive choice of penalty between AIC and BIC. When the exact true model is among the candidate models, the newly proposed selector is a consistent model selector. Since both the number of candidate predictors and the sample size influence the performance of any model selector, an alternative approach is to choose the penalty adaptively. Shen and Ye (2002) proposed such an approach in which the restriction on the penalty is lifted and the generalized degrees of freedom (GDF) is used to compensate for possible overfitting due to model selection. To avoid overfitting the available data caused by the collection of candidate models, the little bootstrap has been advocated in Breiman (1992) to decrease possible overfitting due to multi-model inference. Since GDF is often difficult to compute, Shen and Ye (2002) demonstrated that the use of the little bootstrap (i.e., data perturbation) to replace the calculation of GDF can lead to a better regression model in terms of mean prediction error as compared to the use of Cp or BIC.

In this paper, we evaluate whether the adaptive penalty selection procedure proposed in Shen and Ye (2002) leads to a consistent model selector or merely reduces the overfitting of multi-model inference for nested candidate models when the candidate models include the correct model. Based on the results in this paper, we recommend the use of the little bootstrap (data perturbation) instead of calculating GDF when a fully adaptive penalty procedure is applied to nested candidate models.

The nested candidate models are of the form

Y = X_K β + ϵ,   (1)

where Y = (Y_1, . . . , Y_n)^T is the response vector, E(Y) = µ_0 = (µ_1, . . . , µ_n)^T, ϵ = (ϵ_1, . . . , ϵ_n)^T ∼ N_n(0, σ²I_n), x_j = (x_{1j}, . . . , x_{nj})^T for j = 1, . . . , K, X_K = (x_1, . . . , x_K) is an n × K matrix of potential explanatory variables to be used in predicting the response vector linearly, and β = (β_1, . . . , β_K)^T is a vector of unknown coefficients. Here I_n is the n × n identity matrix and σ² is known. In this paper, we further assume that µ_0 = X_K β and that some of the coefficients may equal zero.


Consider an adaptive choice of λ over λ ∈ Λ, where Λ denotes the collection of penalties to choose from. It is a two-step procedure. At the first step, for each λ ∈ Λ, choose the optimal model M̂(λ) by minimizing (Y − μ̂_M)^T(Y − μ̂_M) + λ|M|σ² over M, where |M| denotes the number of unknown parameters used to prescribe model M. Then calculate g0(λ), which is twice the generalized degrees of freedom associated with penalty λ. At the second step, choose the optimal λ ∈ Λ based on the data as follows:

λ̂ = argmin_{λ∈Λ} { RSS(M̂(λ)) + g0(λ)·σ² },

where RSS(M̂(λ)) = ||Y − μ̂_{M̂(λ)}||², ∥·∥ is the usual L2 norm, μ̂_{M̂(λ)} = Σ_{k=1}^K μ̂_{M_k}·1{M̂(λ)=M_k}, 1{·} is the usual indicator function, and μ̂_{M_k} is the estimator of µ under model M_k. Here Λ = [0, ∞) or (0, ∞) as recommended in Shen and Ye (2002) and Shen and Huang (2006). When g0(λ) is difficult to compute, data perturbation is recommended.
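As a point of reference, the two-step procedure is straightforward to code. The sketch below is our own illustration in R (the software the paper uses for its numerical work): it assumes nested candidate models given by the first k columns of X_K, a known σ², and a user-supplied function g0 returning twice the generalized degrees of freedom for a given λ (for instance the analytic form of Lemma 4 below, or a data-perturbation estimate); the function and variable names are ours, not the paper's.

```r
# Minimal sketch of the two-step adaptive penalty selection for nested models
# M_1 subset ... subset M_K, with sigma2 known and g0(lambda) supplied.
adaptive_select <- function(Y, X, sigma2, g0, lambda_grid) {
  K <- ncol(X)
  # RSS(M_k) for the nested least-squares fits on the first k columns of X
  rss <- sapply(1:K, function(k) {
    sum(lm.fit(x = X[, 1:k, drop = FALSE], y = Y)$residuals^2)
  })
  # Step 1: for each lambda, M_hat(lambda) minimizes RSS(M_k) + lambda * k * sigma2
  khat <- sapply(lambda_grid, function(lam) which.min(rss + lam * (1:K) * sigma2))
  # Step 2: lambda_hat minimizes RSS(M_hat(lambda)) + g0(lambda) * sigma2
  crit <- rss[khat] + sapply(lambda_grid, g0) * sigma2
  best <- which.min(crit)
  list(lambda_hat = lambda_grid[best], model_size = khat[best])
}
```

The grid passed as lambda_grid plays the role of Λ; Sections 2.2 to 2.4 study how the choice of this range ([2, log n] versus [0, log n]) affects the selected model.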

Although GDF already takes into account the data-driven selection of the sequence of estimators (i.e., candidate models), it is found in this paper that the model selected by Cp is better than M̂(λ̂) in terms of the probability of selecting the correct model unless λ is restricted to fall between the AIC and BIC penalties. However, the model selector obtained by replacing g0(λ) with the estimate given by the little bootstrap can beat the performance of Cp in terms of the probability of selecting the correct model.

The rest of the paper is organized as follows. In Section 2, we introduce unbiased risk estimation and generalized degrees of freedom in the setting of model selection with nested linear regression models. It also addresses the performance of the adaptive penalty selection procedure in Shen and Ye (2002) with approximated GDF in terms of model selection consistency. Section 3 evaluates the performance of adaptive penalty selection with GDF. Section 4 concludes with a discussion on the merit of the data perturbation method. The Appendix contains technical proofs.

2 Adaptive penalty selection with nested candidate models

In this section, we start with the derivation of the generalized degrees of freedom defined in Ye (1998) through unbiased risk estimation, describe its potential benefit in increasing the probability of selecting the correct model, and point out the pitfalls of not imposing a proper constraint on the choice of penalty.

2.1 Unbiased risk estimation under nested linear regression model selection

The available data contain n observations, each with K explanatory variables, x_1, . . . , x_K, and a response y. Denote the matrix formed by the first k explanatory variables as X_k = (x_1, . . . , x_k) and β_k = (β_1, . . . , β_k)^T, where k = 1, . . . , K. We then have K candidate models, and the kth candidate model M_k is

Y = X_k β_k + ϵ.

The set of candidate models is denoted as M = {M_k | k = 1, . . . , K}, and the true model M_{k0} ∈ M. Namely, µ_0 = X_{k0} β_{k0} with 1 ≤ k0 ≤ K. For ease of presentation, denote the collection {M_j | k0 ≤ j ≤ K} by M_{k0}; it includes the true model and the overfitted models.

For the kth candidate model M_k with the least squares fit, we have μ̂_{M_k} = P_k Y, where P_k stands for the orthogonal projection operator onto the space spanned by X_k. We evaluate the model μ̂_{M_k} by its prediction accuracy. Denote the future response vector at X_K by Y^0 and assume that E(Y^0) = E(Y) and Var(Y^0) = σ²I_n. The unobservable prediction error is defined as PE(M_k) = ∥Y^0 − μ̂_{M_k}∥² and the residual sum of squares as RSS(M_k) = ∥Y − μ̂_{M_k}∥². It is well known that the residual sum of squares RSS(M_k) is often smaller than PE(M_k); namely, E[PE(M_k)] > E[RSS(M_k)]. Mallows (1973) proposed Cp by inflating RSS(M_k) with 2 tr(P_k)σ², where tr(P_k) is often called the degrees of freedom.

Since both PE(M_k) and RSS(M_k) are of quadratic form, it follows easily that E[PE(M_k)] = E[RSS(M_k)] + 2kσ².
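For completeness, here is a short derivation of this identity (a standard argument, not taken from the paper), writing μ̂ = μ̂_{M_k} = P_k Y, ϵ = Y − µ_0, and using the independence of Y^0 and Y:

\[
\begin{aligned}
E[PE(M_k)]  &= E\|Y^0-\mu_0\|^2 + E\|\mu_0-\hat\mu\|^2 = n\sigma^2 + E\|\mu_0-\hat\mu\|^2,\\
E[RSS(M_k)] &= E\|\epsilon\|^2 + E\|\mu_0-\hat\mu\|^2 - 2E[\epsilon^{T}(\hat\mu-\mu_0)]
             = n\sigma^2 + E\|\mu_0-\hat\mu\|^2 - 2k\sigma^2,
\end{aligned}
\]

since E[ϵ^T(μ̂ − µ_0)] = E[ϵ^T P_k ϵ] + E[ϵ^T](P_k − I_n)µ_0 = tr(P_k)σ² = kσ². Subtracting the second display from the first gives E[PE(M_k)] = E[RSS(M_k)] + 2kσ².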

We now derive the generalized degrees of freedom through an unbiased risk estimate over M when the true model is M_{k0}. For simplicity, we assume that σ² is known from now on. The model selected through Mallows’ Cp criterion among M is denoted as M̂(2). We can then write the post-model-selection estimator of µ_0 through Cp as μ̂_{M̂(2)} = Σ_{k=1}^K μ̂_{M_k}·1{M̂(2)=M_k}.

The prediction risk is defined as follows:

RISK(Cp) = E[ (Y^0 − μ̂_{M̂(2)})^T (Y^0 − μ̂_{M̂(2)}) ].

By a standard argument, we have

RISK(Cp) = E[ (µ_0 − μ̂_{M̂(2)})^T (µ_0 − μ̂_{M̂(2)}) ] + nσ²
         = E[RSS(M̂(2))] + 2E[ϵ^T(μ̂_{M̂(2)} − µ_0)]
         = E[RSS(M̂(2))] + g0(2)·σ²,

where g0(2) = 2E[ϵ^T(μ̂_{M̂(2)} − µ_0)]/σ²; that is, g0(2) is twice the GDF defined in Ye (1998). In general, the calculation of the GDF associated with Cp can be complicated, but its analytic form can be obtained under the settings of this paper.

When K − k0 = 20 and n = 404, g0(2) is found to be around 2k0 + 5.02, as given in Lemma 4. In other words, the unbiased risk estimator when using Cp over M should be RSS(M̂(2)) + g0(2)·σ² instead of RSS(M̂(2)) + 2|M̂(2)|σ², where |M̂(2)| is the total number of unknown parameters of the model chosen by Cp. It will be shown in Lemma 4(a) that g0(2) ≥ 2k0 under some regularity conditions.

Since BIC achieves model selection consistency as shown in Schwarz (1978), its associated unbiased risk estimator will be RSS(M̂(log n)) + 2k0σ². This leads to the proposal in Shen and Ye (2002) of employing an adaptive choice of penalty incorporating GDF. We now investigate whether the adaptive penalty selection proposal incorporating GDF in Shen and Ye (2002) is a consistent model selection procedure.

2.2 Adaptive choice of λ over Λ = {0, 2, log n}

As an illustration, we consider how an adaptive choice of penalty, λ ∈ Λ = {0, 2, log n}, with GDF improves on a fixed choice of penalty. Consider k0 = 1 and K − k0 = 20, a setting in which selecting an underfitted model is not possible. When λ = 0, model M_K will always be chosen since it gives the smallest residual sum of squares. Then g0(0) = 2K, which coincides with the penalty 2|M_K| that Mallows’ Cp assigns to the full model. When λ = log n, we have M̂(λ) = M_{k0} with probability close to 1, and then g0(log n) ≈ 2k0. When λ = 2, Woodroofe (1982) shows that the probability of selecting the correct model is slightly larger than 0.7, while no more than one superfluous explanatory variable is included on average.

Then g0(2) ≈ 2k0 + 5.02, where the constant 5.02 depends on K − k0 (here K − k0 = 20). It will be shown that the adaptive penalty selection procedure with GDF improves on Cp by increasing the probability of selecting the correct model M_{k0} when K − k0 ≥ 20. This improvement mainly comes from reducing the probability that Cp selects M_{k0+1}. Technical details will be provided later on.

We now give a lemma which will be used repeatedly to give an upper bound on |M̂(λ)| − k0. It can also be used to give an upper bound on the probability of choosing M_{k0} when the adaptive penalty selection procedure is used.

Lemma 1. Suppose that log n ∈ Λ, λ ≤ log n for all λ ∈ Λ, and M̂(log n) = M_{k0}.

(a) For all λ ∈ Λ,

P( M̂(λ̂) = M_{k0} ) ≤ P( λ(|M̂(λ)| − |M_{k0}|) ≤ g0(λ) − g0(log n) ).

(b) M̂(λ̂) ≠ M_{k0} if there exists λ ∈ Λ such that λ(|M̂(λ)| − |M_{k0}|) > g0(λ) − g0(log n).

Proof. Recall that M̂(λ) = argmin_{M_k ∈ M} { RSS(M_k) + λ|M_k|σ² } for any given λ ∈ Λ; that is, M̂(λ) is the chosen model for the given λ. We conclude that RSS(M̂(λ)) + λ|M̂(λ)|σ² ≤ RSS(M_{k0}) + λ|M_{k0}|σ². Hence, for all λ,

λ[ |M̂(λ)| − |M_{k0}| ]σ² ≤ RSS(M_{k0}) − RSS(M̂(λ)).   (2)

When the event {λ̂ = log n} occurs, the following must hold:

RSS(M̂(λ)) + g0(λ)σ² ≥ RSS(M̂(log n)) + g0(log n)σ² for all λ.

Hence, P(λ̂ = log n) ≤ P( RSS(M̂(λ)) + g0(λ)σ² ≥ RSS(M̂(log n)) + g0(log n)σ² ). We conclude that

P(λ̂ = log n) ≤ P( RSS(M̂(log n)) − RSS(M̂(λ)) ≤ [g0(λ) − g0(log n)]σ² )
            ≤ P( λ(|M̂(λ)| − |M_{k0}|)σ² ≤ RSS(M̂(log n)) − RSS(M̂(λ)) ≤ [g0(λ) − g0(log n)]σ² )
            ≤ P( λ(|M̂(λ)| − |M_{k0}|) ≤ g0(λ) − g0(log n) ).

The second inequality holds by (2) and the assumption M̂(log n) = M_{k0}. We conclude the proof of this lemma. □

The next lemma states that |M̂(λ)| is a non-increasing step function of λ.

Lemma 2. When the set of candidate models M is a sequence of nested linear regression models, |M̂(λ)| is a non-increasing step function over Λ. When |M̂(λ′)| = k for some 1 ≤ k ≤ K, |M̂(λ)| ≤ k for all λ ≥ λ′.

Proof. Recall that M̂(λ) denotes the selected model for a given penalty λ|M_k|σ². It follows from Proposition 5.1 of Breiman (1992) that RSS(M̂(λ)) must be at an extreme point of the lower convex envelope of the graph {(k, RSS(M_k)) : k = 1, . . . , K}. Hence, this lemma holds. □

Adaptive choice of λ over Λ = {0, 2, log n} depends on the comparison of RSS(M̂(0)) + g0(0)σ², RSS(M̂(log n)) + g0(log n)σ², and RSS(M̂(2)) + g0(2)σ². We start with the comparison of RSS(M̂(0)) + g0(0)σ² and RSS(M̂(log n)) + g0(log n)σ². Note that, for K − k0 = 20,

P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ² )
= P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ², M̂(log n) = M_{k0} )
  + P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ², M̂(log n) ≠ M_{k0} )
≥ P( Y^T(P_K − P_{k0})Y < 40σ², M̂(log n) = M_{k0} ).

Since Y^T(P_K − P_{k0})Y/σ² is a chi-square random variable with 20 degrees of freedom, it follows from the consistency of BIC on model selection that P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ² ) is close to 1.

Next, we evaluate the difference between RSS(M̂(2)) + g0(2)σ² and RSS(M̂(log n)) + g0(log n)σ². When M̂(log n) = M_{k0} and M̂(2) = M_k with k − k0 ≥ 3, it follows from Lemma 1 that 2(|M̂(2)| − |M_{k0}|) ≥ 6 > g0(2) − g0(log n), so M̂(λ̂) cannot be M_{k0}. Since g0(2) − 2k0 ≈ 5.02, the adaptive choice of λ over {0, 2, log n} might increase the probability of selecting M_{k0}; this increase must come from the occurrence of the event {1 ≤ |M̂(2)| − k0 ≤ 2}. That is, it does improve on Cp in terms of the probability of correct selection. Refer to Lemma 5 for further details.


2.3 Adaptive choice of λ over Λ = [2, log n]

The choice of Λ can be difficult in practice if users do not know the behavior of fixed-penalty model selection criteria. A natural choice of Λ is [0, ∞), as suggested in Shen and Ye (2002). In this subsection, Λ is chosen to be [2, log n]. The case Λ = [0, log n] is deferred to the next subsection.

Note that g0(λ) = 2 Σ_{k=1}^{K−k0} P( χ²_{k+2} > kλ ) + 2k0 under some regularity conditions, by Lemma 4(a) in Section 3.
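This expression is easy to evaluate numerically. The R snippet below (our own check, using only base R) computes g0(λ) − 2k0 for K − k0 = 20 and recovers the constant quoted for λ = 2; the same function underlies the plot in Figure 1.

```r
# g0(lambda) - 2*k0 = 2 * sum_{j=1}^{K-k0} P(chi^2_{j+2} > j*lambda), by Lemma 4(a)
g0_excess <- function(lambda, K_minus_k0 = 20) {
  j <- seq_len(K_minus_k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE))
}
g0_excess(2)        # approximately 5.02, the constant used throughout Section 2
g0_excess(log(404)) # about 0.27, so g0(log n) is only slightly above 2*k0
```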

When Λ = [2, log n], the adaptive choice of λ, λ̂, is defined as follows:

λ̂ = argmin_{λ∈[2, log n]} { RSS(M̂(λ)) + g0(λ)σ² }.

When the event {M̂(2) = M_{k0}} occurs, it follows from Lemma 2 and the fact that g0(λ) is decreasing that λ̂ = log n. Hence, the adaptive penalty selection procedure selects M_{k0} whenever Mallows’ Cp selects M_{k0}. This shows that adaptive penalty selection with λ ∈ [2, log n] does improve on Cp in the probability of correct selection. Table 1 presents the findings of a simulation experiment; the general discussion is left to Section 3. The simulation result shows that the adaptive choice of λ over [2, log n] performs better than Cp but cannot match BIC. Compared with Cp, the adaptive choice of λ over [2, log n] increases the probability of correct selection from 71.29% to 74.88%, and this improvement is observed consistently over simulations. Section 3 quantifies the increase in the probability of selecting M_{k0} for the adaptive choice of λ over [2, log n] relative to the fixed penalty 2; the bound matches well with the above simulation experiment.

2.4 Adaptive choice of λ over Λ = [0, log n]

Woodroofe (1982) showed that Mallows’ Cp selection procedure can be viewed as finding the global minimum, over {k0, . . . , K}, of a random walk with negative drift whose increments are the squared standardized coefficient estimates minus 2, i.e., [(β̂_k − β_k)²/Var(β̂_k)] − 2. When λ = 0, the procedure selects M_K, the largest candidate model. Section 2.2 shows that g0(0) = 2K, which is twice the GDF defined in Ye (1998).

As in the last subsection, we again use a simulation study to evaluate the adaptive penalty selection procedure when Λ = [0, log n]. The findings are summarized in Table 2. Clearly, adaptive penalty selection of λ over [0, log n] not only fails to improve on Cp but decreases the probability of selecting M_{k0} to about 52.6%, as shown in Table 2. This finding is consistent with the simulation results in Atkinson (1980) and Zhang (1992). Both papers suggested that the penalty factor for Atkinson's generalized Mallows’ Cp statistic should be chosen between 1.5 and 6.

When λ = 1, the size of the selected model is the one that reaches the global minimum of a random walk without drift over {k0, . . . , K}. This leads to an overfitted model with many superfluous explanatory variables. We now employ Lemma 1 to explain why adaptive penalty selection performs worse than Cp when Λ = [0, log n].


|M̂(λ̂)|    λ ∈ [2, log n]   λ = 2        |M̂(λ̂)|    λ ∈ [2, log n]   λ = 2
k0          0.7488           0.7129       k0 + 11    0.0016           0.0016
k0 + 1      0.0834           0.1116       k0 + 12    0.0011           0.0011
k0 + 2      0.0513           0.0578       k0 + 13    0.0022           0.0022
k0 + 3      0.0345           0.0352       k0 + 14    0.0014           0.0014
k0 + 4      0.0237           0.0240       k0 + 15    0.0014           0.0014
k0 + 5      0.0147           0.0149       k0 + 16    0.0015           0.0015
k0 + 6      0.0091           0.0090       k0 + 17    0.0007           0.0007
k0 + 7      0.0081           0.0082       k0 + 18    0.0006           0.0006
k0 + 8      0.0058           0.0057       k0 + 19    0.0009           0.0009
k0 + 9      0.0054           0.0055       k0 + 20    0.0002           0.0002
k0 + 10     0.0036           0.0036

Table 1: The empirical distribution of the chosen model through adaptive model selection when λ ∈ [2, log n], based on 100,000 simulated runs.

|M̂(λ̂)|    λ ∈ [0, log n]   λ = 2        |M̂(λ̂)|    λ ∈ [0, log n]   λ = 2
k0          0.5259           0.7129       k0 + 11    0.0154           0.0016
k0 + 1      0.0611           0.1116       k0 + 12    0.0169           0.0011
k0 + 2      0.0348           0.0578       k0 + 13    0.0169           0.0022
k0 + 3      0.0271           0.0352       k0 + 14    0.0162           0.0014
k0 + 4      0.0235           0.0240       k0 + 15    0.0189           0.0014
k0 + 5      0.0189           0.0149       k0 + 16    0.0216           0.0015
k0 + 6      0.0159           0.0090       k0 + 17    0.0225           0.0007
k0 + 7      0.0168           0.0082       k0 + 18    0.0277           0.0006
k0 + 8      0.0132           0.0057       k0 + 19    0.0308           0.0009
k0 + 9      0.0174           0.0055       k0 + 20    0.0438           0.0002
k0 + 10     0.0149           0.0036

Table 2: The empirical distribution of the chosen model through adaptive model selection when λ ∈ [0, log n], based on 100,000 simulated runs.
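Because only overfitting is at stake here (k0 = 1, so no candidate model is underfitted) and, for k > k0, the drops RSS(M_{k−1}) − RSS(M_k) are i.i.d. σ²χ²_1, the empirical distributions in Tables 1 and 2 can be reproduced without simulating a full regression. The sketch below is our own R illustration of how such a simulation could be run (the paper does not provide its code, and details such as the λ grid resolution and the number of runs are our assumptions).

```r
set.seed(1)
k0 <- 1; Kmk0 <- 20; n <- 404
nsim <- 10000                                  # the paper uses 100,000 runs
g0 <- function(lambda) {                       # twice the GDF, Lemma 4(a)
  j <- seq_len(Kmk0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE)) + 2 * k0
}
lam_grid <- seq(2, log(n), length.out = 200)   # use seq(0, log(n), ...) for Table 2
g0_grid <- sapply(lam_grid, g0)

sizes <- replicate(nsim, {
  V <- rchisq(Kmk0, df = 1)                    # (RSS(M_{k-1}) - RSS(M_k))/sigma^2, k > k0
  rss_drop <- c(0, cumsum(V))                  # (RSS(M_{k0}) - RSS(M_{k0+m}))/sigma^2, m = 0..Kmk0
  m <- 0:Kmk0
  cp <- which.min(-rss_drop + 2 * m) - 1       # Cp (lambda = 2): number of superfluous variables
  khat <- sapply(lam_grid, function(l) which.min(-rss_drop + l * m) - 1)  # M_hat(lambda)
  ada <- khat[which.min(-rss_drop[khat + 1] + g0_grid)]                   # adaptive lambda_hat
  c(cp = cp, adaptive = ada)
})
table(sizes["cp", ]) / nsim        # compare with the lambda = 2 column of Table 1
table(sizes["adaptive", ]) / nsim  # compare with the lambda in [2, log n] column
```

With the grid over [2, log n], the frequencies at |M̂(λ̂)| = k0 should come out near 0.71 for Cp and 0.75 for the adaptive rule, matching Table 1; switching the grid to [0, log n] should reproduce the deterioration reported in Table 2.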


Recall that g0(λ) = 2E[ϵ^T(μ̂_{M̂(λ)} − µ_0)]/σ², where μ̂_{M̂(λ)} = Σ_{k=1}^K μ̂_{M_k}·1{M̂(λ)=M_k}. Note that P_k does not depend on ϵ, and hence

E[ϵ^T(μ̂_{M̂(λ)} − µ_0)] = E[ Σ_{k=1}^K ϵ^T P_k ϵ · 1{M̂(λ)=M_k} ] = σ²g0(λ)/2 > 0.

When |M̂(λ)| ≥ k0, g0(λ) must be greater than 2k0. Since ϵ^T(μ̂_{M̂(λ)} − µ_0) is a mixture of χ² random variables and cannot be observed, it is estimated by its expected value, as suggested in Mallows (1973). Note that g0(λ) − 2k0 equals 25.35, 21.98, 18.78, 15.98, 13.36, and 11.22 as λ takes the values 1.0, 1.1, 1.2, 1.3, 1.4, and 1.5. Since the distribution of a χ²_1 random variable is positively skewed, the replacement of such a random variable by its mean can be problematic. Moreover, the following must hold for all λ ∈ Λ when M̂(λ̂) = M_{k0}:

|M̂(λ)| − k0 ≤ ⌊ (g0(λ) − 2k0) / λ ⌋,

where ⌊·⌋ denotes the integer part. When M̂(λ̂) = M_{k0}, it follows from Lemma 1 that |M̂(1.1)| − k0 ≤ 19, |M̂(1.2)| − k0 ≤ 15, |M̂(1.3)| − k0 ≤ 12, |M̂(1.4)| − k0 ≤ 9, and |M̂(1.5)| − k0 ≤ 7. These constraints reduce the probability of choosing M_{k0}, which provides an explanation of why adaptive penalty selection is so problematic when λ ∈ [1, 1.5] is included in Λ.
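These values, and the resulting bounds, are easy to recompute; the short R check below (ours) reuses the Lemma 4(a) formula with K − k0 = 20.

```r
g0_excess <- function(lambda, K_minus_k0 = 20) {   # g0(lambda) - 2*k0, Lemma 4(a)
  j <- seq_len(K_minus_k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE))
}
lam <- seq(1.0, 1.5, by = 0.1)
excess <- sapply(lam, g0_excess)
round(excess, 2)      # about 25.35 21.98 18.78 15.98 13.36 11.22, as quoted above
floor(excess / lam)   # 25 19 15 12 9 7: the bounds on |M_hat(lambda)| - k0
```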

3 Performance of Adaptive Penalty Selection with GDF

In this section, we start with assumptions which ensure that BIC is a consistent model selector and that Cp never misses an explanatory variable with a nonzero coefficient. Under the setting of this paper, the analytic form of g0(λ) is available, and its numerical values can be obtained easily with any statistical software. To evaluate whether the method of adaptive penalty selection can improve on fixed penalty selection, we combine a simulation study and analytic approximation for those (K, k0) with K − k0 ≥ 20. When K − k0 = 20, numerical values of g0(λ) are obtained with R. We then combine the numerical values of g0(λ) for K − k0 = 20 with the exponential inequality in Teicher (1984) to give a bound on g0(λ) for K − k0 > 20.

3.1 Key assumptions

We now state three assumptions on the class of competing models M. The following assumptions ensure that employing BIC leads to consistent model selection and that we never get an underfitted model when λ ∈ [0, log n]. They mainly prescribe the relation between n, the total number of observations, K, the total number of explanatory variables, and k0, the total number of nonzero regression coefficients.

As a remark, the setting of simulation studies in Section 2 satisfies these three assumptions.


Assumption B. X_K is of full rank and there exists a constant c > 0 such that µ_0^T(I_n − P_j)µ_0 ≥ cn for all j < k0, where µ_0 = X_{k0}β_{k0} is the mean vector of the true model.

Assumption N1. cn > 2k0 log n.

Assumption N2. log n > 2 log(K− k0).

Assumptions B and N1 quantify the magnitude of the bias when the fitted model leaves out an explanatory variable with a nonzero coefficient, while Assumption N2 is specified to guard against the inclusion of the zero-coefficient explanatory variables through max_{k0≤j≤K} |β̂_j|. When these assumptions hold, BIC selects the correct model with probability tending to one as n goes to infinity. Moreover, for all λ ∈ [0, log n], the probability of the event {|M̂(λ)| ≤ k0 − 1} goes to zero, as shown in the following lemma. Its proof is deferred to the Appendix.

Lemma 3. When k0 > 1 and Assumptions B and N1 hold,

P( |M̂(λ)| ≥ k0 for all λ ∈ [0, log n] ) → 1

as n goes to infinity. When Assumption N2 also holds, P( M̂(log n) = M_{k0} ) → 1 as n goes to infinity.

Under the above assumptions, we only need to address adaptive penalty selection over Λ = [0, log n] within the collection M_{k0}, that is, the inclusion of superfluous explanatory variables. It remains to study

M̂(λ) = argmin_{M_k ∈ M_{k0}} { RSS(M_k) + λkσ² }

and λ̂, which is defined as

λ̂ = argmin_{λ∈Λ} { RSS(M̂(λ)) + g0(λ)σ² },

in order to quantify |M̂(λ)| − k0.

We now employ the argument in Woodroofe (1982) to calculate g0(λ), defined by 2E[ϵ^T(μ̂_{M̂(λ)} − µ_0)]/σ², over M_{k0}; details are deferred to the Appendix.

Lemma 4. When Assumptions B, N1, and N2 hold, the following statements hold.

(a) For λ ∈ [0, log n], g0(λ) = 2 Σ_{j=1}^{K−k0} P( χ²_{j+2} > jλ ) + 2k0, and g0(λ) is a strictly decreasing function of λ, where χ²_ℓ denotes a chi-square random variable with ℓ degrees of freedom.

(b) For λ ∈ [2, log n], 2 Σ_{j=21}^{K−k0} P( χ²_{j+2} > jλ ) ≤ 2.6102 when K − k0 ≥ 21.

Figure 1 gives the plot of (λ, g0(λ)) when K − k0 = 20, in which g0(λ) is approximated with R to an accuracy of four decimal places. Lemma 4(b) gives a general idea of the plot of (λ, g0(λ)) when K − k0 > 20.


[Figure 1 here: plot of g0(λ) against λ for 0 ≤ λ ≤ 6, with g0(λ) ranging from 0 to about 40.]

Figure 1: g0(λ) when K − k0 = 20.

3.2 Performance of adaptive penalty selection procedure

To evaluate the adaptive penalty selection procedure, we start with the case (K − k0, n) = (20, 404) and Λ = [2, log n]. Since log 404 > 2 log 20, Assumption N2 holds. Assume that Assumptions B and N1 hold. It follows from Lemma 3 that the adaptive penalty selection procedure in Shen and Ye (2002) will not miss any explanatory variable with a nonzero coefficient. Lemma 4(a) gives g0(λ) for all 0 ≤ λ ≤ log n, which can be obtained by numerical approximation using R.

The simulation results presented in Table 1 indicate that λ̂ is often smaller than log n, and M̂(λ̂) cannot be M_{k0} whenever |M̂(2)| > k0 + 2. An estimate of a lower bound on the probability of the event {M̂(λ̂) = M_{k0}} will be given.

When the event {M̂(2) = M_{k0}} occurs, it follows from Lemma 2 and the fact that g0(λ) is decreasing that λ̂ = log n and M̂(λ̂) = M_{k0}. Hence, adaptive penalty selection selects M_{k0} whenever Mallows’ Cp selects M_{k0}. We now demonstrate how adaptive penalty selection improves over Cp when Cp overfits by one explanatory variable, that is, when the event {M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0}} occurs. For i = k0 + 1, . . . , K, define V_i = [RSS(M_{i−1}) − RSS(M_i)]/σ². Note that the V_i are i.i.d. chi-square random variables with one degree of freedom. When the event {M̂(2) = M_{k0+1}} occurs, it follows from Lemma 2 that M̂(λ) = M_{k0} when λ ∈ [V_{k0+1}, log n] and M̂(λ) = M_{k0+1} when λ ∈ [2, V_{k0+1}).

The following lemma gives a lower bound on the probability of the event {M̂(2) = M_{k0+1}, λ̂ = log n} when Λ = [2, log n]. Define p_{λ,k0,K−k0}(k) to be the probability that the event {M̂(λ) = M_k} occurs for a given penalty λ, where k = 1, . . . , K.

Lemma 5. Suppose Assumptions B, N1, and N2 hold and Λ = [2, log n]. A lower bound on the probability of {M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0}} is

p_{2,k0+1,K−k0−1}(k0 + 1) · P( 2 < V_{k0+1} ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} ).

Proof. Define E_1 = { M̂(λ) = M_{k0+1} for all λ ∈ [2, V_{k0+1}), M̂(λ̂) = M_{k0} } and E_2 = { M̂(λ) = M_{k0} for all λ ∈ [V_{k0+1}, log n], M̂(λ̂) = M_{k0} }. Then

E_1 = { RSS(M_{k0}) − RSS(M_{k0+1}) > λσ²,
        RSS(M_{k0+1}) − RSS(M_{k0+1+i}) ≤ iλσ² for all 0 < i ≤ K − k0 − 1,
        RSS(M_{k0}) + 2k0σ² ≤ RSS(M̂(λ)) + g0(λ)σ² for all λ ∈ [2, V_{k0+1}) }
    = { V_{k0+2} + ··· + V_{k0+2+i} ≤ (i + 1)V_{k0+1} for all 0 ≤ i ≤ K − k0 − 2,
        2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 }

and

E_2 = { RSS(M_{k0}) − RSS(M_{k0+i}) ≤ iλσ² for all 0 < i ≤ K − k0,
        RSS(M_{k0}) + 2k0σ² ≤ RSS(M̂(λ)) + g0(λ)σ² for all λ ∈ [V_{k0+1}, log n) }
    = { V_{k0+1} + ··· + V_{k0+1+i} ≤ (i + 1)V_{k0+1} for all 0 ≤ i ≤ K − k0 − 1 }.

Observe that

{ M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0} } = E_1 ∩ E_2
= { V_{k0+2} + ··· + V_{k0+2+i} ≤ (i + 1)V_{k0+1} for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0,
    V_{k0+1} + ··· + V_{k0+1+j} ≤ (j + 1)V_{k0+1} for all 0 ≤ j ≤ K − k0 − 1 }
⊃ { V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0,
    V_{k0+2} + ··· + V_{k0+1+j} ≤ jV_{k0+1} for all 1 ≤ j ≤ K − k0 − 1 }
⊃ { V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0,
    V_{k0+2} + ··· + V_{k0+1+j} ≤ 2j for all 1 ≤ j ≤ K − k0 − 1 }
⊃ { V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 }.

Then we have

P( M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0} ) ≥ P( 2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 )
    · P( V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2 ).   (3)

Note that the second factor on the right-hand side of (3) is equal to p_{2,k0+1,K−k0−1}(k0 + 1). We conclude the proof. □

Remark 1. When K − k0 = 20, p_{2,k0+1,19}(k0 + 1) is around 0.7151 by simulation. Since g0(λ) is strictly decreasing, a lower bound on the first factor of (3) is obtained as follows:

P( 2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 ) ≥ P( 2 < V_{k0+1} ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} ) ≈ P( 2 ≤ V_{k0+1} ≤ 2.56 ) ≈ 0.04770,   (4)

where 2.56 is a numerical approximation of sup{λ > 0 : g0(λ) − 2k0 > λ} for K − k0 = 20 and 0.04770 is a numerical approximation of P( 2 ≤ V_{k0+1} ≤ 2.56 ) obtained with R. It follows from (3) and (4) that, for Λ = [2, log n], P( M̂(2) = M_{k0+1}, λ̂ = log n ) ≥ 0.7151 · 0.04770 = 0.0341.
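Both numerical ingredients of (4) can be recomputed directly. The R snippet below (ours) finds the crossing point sup{λ > 0 : g0(λ) − 2k0 > λ} for K − k0 = 20 and the χ²_1 probability used in the bound.

```r
g0_excess <- function(lambda, K_minus_k0 = 20) {   # g0(lambda) - 2*k0, Lemma 4(a)
  j <- seq_len(K_minus_k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE))
}
# sup{lambda > 0 : g0(lambda) - 2*k0 > lambda} is the root of g0_excess(l) - l = 0
lam_star <- uniroot(function(l) g0_excess(l) - l, interval = c(2, 10))$root
lam_star                              # roughly 2.56
p <- diff(pchisq(c(2, lam_star), df = 1))
p                                     # close to the 0.04770 quoted in (4)
0.7151 * p                            # the lower bound of about 0.034 in Remark 1
```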

In fact, {M̂(λ̂) = M_{k0}} = ∪_{k0 ≤ k ≤ K} {M̂(2) = M_k, M̂(λ̂) = M_{k0}}. The adaptive penalty selection procedure therefore picks M_{k0} with probability

P( M̂(λ̂) = M_{k0} ) = Σ_{k0 ≤ k ≤ K} P( M̂(2) = M_k, M̂(λ̂) = M_{k0} )
  ≥ Σ_{k0 ≤ k ≤ k0+1} P( M̂(2) = M_k, M̂(λ̂) = M_{k0} )
  ≥ p_{2,k0,20}(k0) + p_{2,k0+1,19}(k0 + 1) · P( 2 ≤ χ²_1 ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} )
  ≈ 0.7129 + 0.7151 · 0.04770 = 0.7470.

Since P( M̂(λ̂) = M_{k0} ) ≥ p_{2,k0,K−k0}(k0), the adaptive penalty selector with λ ∈ [2, log n] improves on Cp in the probability of correct selection. Moreover, a lower bound on the increase in the probability of correct selection is around 0.034 for K − k0 = 20 and k0 = 1. We conclude that when λ ∈ [2, log n], the adaptive penalty selector improves on Cp by at least about 3.4% in the probability of correct selection. As suggested by the simulation study, the improvement mainly comes from the event {M̂(2) = M_{k0+1}, λ̂ = log n}.

Furthermore, we provide a crude upper bound obtained from Lemma 1. When K − k0 = 20, we have

g0(λ) = 2 Σ_{j=1}^{20} P( χ²_{j+2} > jλ ) + 2k0 for all 2 ≤ λ ≤ log n, and

g0(2) − 2k0 = 5.02 when K − k0 = 20.   (5)


As suggested by (5), an upper bound on the probability of selecting the correct model M_{k0} by adaptive penalty selection is 0.88, which is obtained by adding up P(|M̂(2)| = k) for k = k0, . . . , k0 + 2 in Table 2. As revealed in Table 1 of Section 2, the adaptive penalty selection procedure improves on fixed penalty Cp by increasing the probability of correct selection by around 3.5%. This gain must come from runs in which Cp includes no more than 2 superfluous zero-coefficient explanatory variables.

According to the above discussion and the results presented in Table 1, we have the following lemma.

Lemma 6. When K − k0 = 20, all three assumptions hold if n ≥ 404. The adaptive penalty selection procedure proposed in Shen and Ye (2002) improves over Mallows’ Cp in terms of the probability of correct selection when λ ∈ [2, log n]. However, it cannot match BIC in terms of model selection consistency.

We now compare the performance of adaptive penalty selection under K − k0 > 20 (where n > (K − k0)²) versus K − k0 = 20 (where n = 404). When K − k0 > 20, it follows from Lemma 4 that

g0(λ) = 2 [ Σ_{j=1}^{20} P( χ²_{j+2} > jλ ) + Σ_{j=21}^{K−k0} P( χ²_{j+2} > jλ ) ] + 2k0

for all 2 ≤ λ ≤ log n, and 2 Σ_{j=21}^{K−k0} P( χ²_{j+2} > jλ ) ≤ 2.61. We then have

g0(2) − 2k0 < 5.02 + 2.61 = 7.63 when K − k0 > 20.   (6)

As suggested by (6), an upper bound on the probability of selecting the correct model M_{k0} by adaptive penalty selection is 0.91, which is obtained by adding up P(|M̂(2)| = k) for k = k0, . . . , k0 + 3 in Table 2. As revealed in Table 1 of Section 2, adaptive penalty selection improves on fixed penalty Cp by increasing the probability of correct selection by around 3.5%. The gain must come from runs in which Cp includes no more than 3 superfluous zero-coefficient explanatory variables.

We conclude with the following theorem.

Theorem 1. When Assumptions B, N1, and N2 hold, the adaptive penalty selection procedure with generalized degrees of freedom over λ ∈ [2, log n] is still a conservative model selection procedure. In terms of the probability of selecting overparametrized models, it improves on the fixed penalty selection procedure Cp but cannot match the fixed penalty selection procedure BIC.

As a remark, a lower bound on the probability of selecting M_{k0} for K − k0 > 20 can also be obtained by the method used in the proof of Lemma 5. One only needs to replace p_{2,1,20}(1) with p_{2,1,K−k0}(1) and p_{2,1,19}(1) with p_{2,1,K−k0−1}(1). Since g0(λ) increases with K − k0, sup{λ > 0 : g0(λ) − 2k0 > λ} increases accordingly. Hence, there is no need to adjust the probability P( 2 ≤ χ²_1 ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} ) in the lower bound. A lower bound for K − k0 > 20 is therefore given by p_{2,1,K−k0}(1) + p_{2,1,K−k0−1}(1) · 0.04770, and we have the approximate lower bound

P( λ̂ = log n ) ≥ p_{2,1,K−k0}(1) + p_{2,1,K−k0−1}(1) · P( 2 ≤ χ²_1 ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} )
              ≈ (0.7129 + 0.7151 · 0.04770)(1 − 0.0119) = 0.7382.

This approximate lower bound decreases only slightly when K − k0 > 20.

Note that the difference between p_{2,1,K−k0}(1) and p_{2,1,20}(1) is small. That is because, for j > 0 and λ > 2,

p_{λ,1,K−k0}(1) − p_{λ,1,K−k0+j}(1)
= P( V_{k0+1} ≤ λ, . . . , V_{k0+1} + ··· + V_K ≤ λ(K − k0) )
  − P( V_{k0+1} ≤ λ, . . . , V_{k0+1} + ··· + V_K ≤ λ(K − k0), V_{k0+1} + ··· + V_{K+j} ≤ λ(K − k0 + j) )
= P( V_{k0+1} ≤ λ, . . . , V_{k0+1} + ··· + V_K ≤ λ(K − k0), V_{k0+1} + ··· + V_{K+j} > λ(K − k0 + j) )
≤ P( V_{k0+1} + ··· + V_{K+j} > λ(K − k0 + j) ).

The last probability is close to 0 by the law of large numbers. For example, when λ ≥ 2 and K − k0 = 15, p_{λ,1,K−k0}(1) − p_{λ,1,K−k0+j}(1) ≤ P( V_{k0+1} + ··· + V_K > 2(15) ) ≈ 0.0119 by R. Using the same argument, we can show that the difference between p_{λ,1,K−k0}(ℓ) and p_{λ,1,K−k0+j}(ℓ) is small for ℓ = 1, 2.
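The 0.0119 figure can be checked directly, since V_{k0+1} + ··· + V_K is chi-square distributed with K − k0 = 15 degrees of freedom in this example:

```r
pchisq(2 * 15, df = 15, lower.tail = FALSE)   # P(chi^2_15 > 30), about 0.0119
```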

4 Discussion and Conclusion

Our main purpose in this paper is to show that a fully adaptive choice of penalty should be practiced with caution. In particular, we have demonstrated that a fully adaptive penalty selection procedure based on the unbiased risk estimator can be a bad practice for nested candidate models. The adaptive penalty selection procedure does improve on the performance of Cp when the range of the penalty is chosen properly. However, such a procedure still cannot achieve model selection consistency as the fixed penalty BIC approach does.

A proper selection of the range of the penalty can be achieved by using the little bootstrap, or data perturbation, suggested in Shen and Ye (2002). When the data perturbation method with the suggested perturbation variance 0.25σ² is employed to estimate GDF over Λ = [0, log n], it outperforms the procedure that adds the exact GDF, as reported in Table 3.
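For concreteness, the sketch below indicates one way such a data-perturbation estimate ĝ(λ) could be computed. It follows the generic Monte Carlo covariance form of the GDF estimator in Ye (1998) with perturbation standard deviation τ = 0.5σ; the exact estimator and implementation behind Table 3 are not spelled out here, so this R code is an assumption-laden illustration (all names are ours), not the authors' procedure.

```r
# Sketch: data-perturbation (little bootstrap) estimate of g0(lambda), i.e. twice
# the GDF of the lambda-penalized selection rule over nested candidate models.
# Assumptions (ours): tau = 0.5*sigma, B Monte Carlo perturbations, GDF estimated
# by the covariance form sum_i cov(muhat_i, delta_i)/tau^2, centered at the mean fit.
g_hat <- function(Y, X, sigma2, lambda, B = 100, tau = 0.5 * sqrt(sigma2)) {
  n <- length(Y); K <- ncol(X)
  fit_selected <- function(y) {                 # muhat_{M_hat(lambda)}(y)
    fits <- lapply(1:K, function(k) lm.fit(X[, 1:k, drop = FALSE], y))
    rss <- sapply(fits, function(f) sum(f$residuals^2))
    fits[[which.min(rss + lambda * (1:K) * sigma2)]]$fitted.values
  }
  delta <- matrix(rnorm(n * B, sd = tau), n, B)  # perturbations delta_b ~ N(0, tau^2 I)
  mu_pert <- sapply(1:B, function(b) fit_selected(Y + delta[, b]))   # n x B fits
  gdf <- sum(delta * (mu_pert - rowMeans(mu_pert))) / (B * tau^2)    # estimated GDF
  2 * gdf                                        # g0(lambda) is twice the GDF
}
```

Plugging ĝ(λ) in place of g0(λ) into the adaptive criterion λ̂ = argmin_λ {RSS(M̂(λ)) + ĝ(λ)σ²} gives the rule whose empirical distribution is reported in the ĝ(λ) column of Table 3.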


|M̂(λ̂)|    ĝ(λ)      g0(λ)        |M̂(λ̂)|    ĝ(λ)      g0(λ)
k0          0.7487    0.5259       k0 + 11    0.0071    0.0154
k0 + 1      0.0518    0.0611       k0 + 12    0.0082    0.0169
k0 + 2      0.0258    0.0348       k0 + 13    0.0075    0.0169
k0 + 3      0.0194    0.0271       k0 + 14    0.0061    0.0162
k0 + 4      0.0163    0.0235       k0 + 15    0.0065    0.0189
k0 + 5      0.0130    0.0189       k0 + 16    0.0072    0.0216
k0 + 6      0.0088    0.0159       k0 + 17    0.0076    0.0225
k0 + 7      0.0097    0.0168       k0 + 18    0.0092    0.0277
k0 + 8      0.0085    0.0132       k0 + 19    0.0086    0.0308
k0 + 9      0.0092    0.0174       k0 + 20    0.0127    0.0438
k0 + 10     0.0081    0.0149

Table 3: The empirical distribution of the chosen model through adaptive model selection with ĝ(λ) (data perturbation, τ/σ = 0.5) and with g0(λ), when λ ∈ [0, log n], based on 10,000 simulated runs.

5 Appendix

5.1 Proof of Lemma 3

It follows easily from Lemma 2 that

P( |M̂(log n)| ≥ |M_{k0}| ) ≤ P( |M̂(λ)| ≥ |M_{k0}| for all λ ∈ [0, log n] ),

or

P( |M̂(log n)| < |M_{k0}| ) ≥ P( |M̂(λ)| < |M_{k0}| for some λ ∈ [0, log n] ).

Observe that

P( |M̂(λ)| < |M_{k0}| for some λ ∈ [0, log n] ) ≤ P( |M̂(log n)| < |M_{k0}| )
≤ Σ_{k=1}^{k0−1} P( |M̂(log n)| = k )
≤ Σ_{k=1}^{k0−1} P( RSS(M_k) − RSS(M_{k0}) ≤ (k0 − k)σ² log n )
= Σ_{k=1}^{k0−1} P( µ_0^T(I − P_k)µ_0 + 2ϵ^T(I − P_k)µ_0 + ϵ^T(P_{k0} − P_k)ϵ ≤ (k0 − k)σ² log n )
≤ Σ_{k=1}^{k0−1} P( µ_0^T(I − P_k)µ_0 + 2ϵ^T(I − P_k)µ_0 ≤ k0σ² log n ).
