**PREPRINT**

Preprint, Department of Mathematics, National Taiwan University

### www.math.ntu.edu.tw/~mathlib/preprint/2012-04.pdf

## Variable selection in linear regression through adaptive penalty selection

### Hung Chen and Chuan-Fa Tang

### September 06, 2012


Hung Chen^{1} Chuan-Fa Tang^{2}

1Department of Mathematics, National Taiwan University, Taipei, 11167, Taiwan

2Department of Statistics, University of South Carolina, Columbia, SC 29208, USA

**Abstract**

*Model selection procedures often use a fixed penalty, such as Mallows' $C_p$, to avoid choosing a model which fits a particular data set extremely well. These procedures are often devised to give an unbiased risk estimate when a particular chosen model is used to predict future responses. As a correction for not accounting for the variability induced by model selection, generalized degrees of freedom was introduced in Ye (1998) as an estimate of the model selection uncertainty that arises from using the same data for both model selection and the associated parameter estimation. Building upon generalized degrees of freedom, Shen and Ye (2002) proposed a data-adaptive complexity penalty. In this article, we evaluate the validity of such an approach for model selection in linear regression when the set of candidate models is nested and includes the true model. It is found that the performance of this approach is even worse than Mallows' $C_p$ in terms of the probability of correct selection. However, this approach, coupled with a proper selection of the range of penalties or with the little bootstrap proposed in Breiman (1992), performs better than $C_p$, with an increased probability of correct selection, but still cannot match BIC in achieving model selection consistency.*

*Keywords:* Adaptive penalty selection; Mallows' $C_p$; Model selection consistency; Little bootstrap; Generalized degrees of freedom

**1** **Introduction**

In analyzing data from complex scientific experiments or observational studies, one often encounters the problem of choosing one plausible linear regression model from a sequence of nested candidate models. When the exact true model is among the candidate models, BIC (Schwarz, 1978) has been shown to be a consistent model selector in Nishii (1984), but $C_p$ (Mallows, 1973) is shown in Woodroofe (1982) to be a conservative model selector that picks a slightly overfitted model. From BIC to $C_p$ (AIC), the chosen criteria differ in the penalty placed on the dimension of the model, from $\log n$ down to $2$ per parameter, where $n$ represents the number of observations. Many efforts have been made to understand in what circumstances these criteria identify the right model asymptotically.

Liu and Yang (2011) proposed a model selector based on a data-adaptive choice of penalty between AIC and BIC. When the exact true model is among the candidate models, their selector is a consistent model selector. Since both the number of candidate predictors and the sample size influence the performance of any model selector, an alternative approach is to choose the penalty adaptively. Shen and Ye (2002) proposed such an approach, in which the restriction on the penalty is lifted and the generalized degrees of freedom (GDF) is used to compensate for possible overfitting due to model selection. To avoid overfitting the available data through the collection of candidate models, the little bootstrap was advocated in Breiman (1992) to decrease possible overfitting due to multi-model inference. Since GDF is often difficult to compute, Shen and Ye (2002) demonstrated that using the little bootstrap (i.e., data perturbation) in place of the GDF calculation can lead to a better regression model in terms of mean prediction error compared with $C_p$ or BIC.

In this paper, we evaluate whether the adaptive penalty selection procedure proposed in Shen and Ye (2002) leads to a consistent model selector, or merely reduces the overfitting of multi-model inference, for nested candidate models that include the correct model. Based on the results in this paper, we recommend the use of the little bootstrap (data perturbation) instead of calculating GDF with a fully adaptive penalty procedure for nested candidate models.

The nested candidate models are of the form

$$\mathbf{Y} = \mathbf{X}_K \boldsymbol\beta + \boldsymbol\epsilon, \tag{1}$$

where $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ is the response vector, $E(\mathbf{Y}) = \boldsymbol\mu_0 = (\mu_1, \ldots, \mu_n)^T$, $\boldsymbol\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T \sim N_n(\mathbf{0}, \sigma^2 \mathbf{I}_n)$, $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})^T$ for $j = 1, \ldots, K$, $\mathbf{X}_K = (\mathbf{x}_1, \ldots, \mathbf{x}_K)$ is an $n \times K$ matrix of potential explanatory variables to be used in predicting the response vector linearly, and $\boldsymbol\beta = (\beta_1, \ldots, \beta_K)^T$ is a vector of unknown coefficients. Here $\mathbf{I}_n$ is the $n \times n$ identity matrix and $\sigma^2$ is known. In this paper, we further assume that $\boldsymbol\mu_0 = \mathbf{X}_K \boldsymbol\beta$ and that some of the coefficients may equal zero.

Consider the adaptive choice of $\lambda$ that minimizes an estimated risk over $\lambda \in \Lambda$, where $\Lambda$ denotes the collection of penalties to choose from. It is a two-step procedure. At the first step, for each $\lambda \in \Lambda$, choose the optimal model $\hat{M}(\lambda)$ by minimizing $(\mathbf{Y} - \hat{\boldsymbol\mu})^T(\mathbf{Y} - \hat{\boldsymbol\mu}) + \lambda |M| \sigma^2$, where $|M|$ denotes the number of unknown parameters used to prescribe model $M$. Then calculate $g_0(\lambda)$, which is twice the generalized degrees of freedom with penalty $\lambda$. At the second step, choose the optimal $\lambda \in \Lambda$ based on the data as follows:

$$\hat\lambda = \operatorname*{argmin}_{\lambda \in \Lambda} \left\{ RSS(\hat{M}(\lambda)) + g_0(\lambda) \cdot \sigma^2 \right\},$$

where $RSS(\hat{M}(\lambda)) = \|\mathbf{Y} - \hat{\boldsymbol\mu}_{\hat{M}(\lambda)}\|^2$, $\|\cdot\|$ is the usual $L_2$ norm, $\hat{\boldsymbol\mu}_{\hat{M}(\lambda)} = \sum_{k=1}^{K} \hat{\boldsymbol\mu}_{M_k} \cdot 1_{\{\hat{M}(\lambda) = M_k\}}$, $1_{\{\cdot\}}$ is the usual indicator function, and $\hat{\boldsymbol\mu}_{M_k}$ is the estimator of $\boldsymbol\mu$ in model $M_k$. Here $\Lambda = [0, \infty)$ or $(0, \infty)$ as recommended in Shen and Ye (2002) and Shen and Huang (2006). When $g_0(\lambda)$ is difficult to compute, data perturbation is recommended.
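To make the two-step procedure concrete, here is a minimal sketch in Python. The RSS sequence and the decreasing correction `g0` used below are toy assumptions for illustration, not the paper's simulation design.

```python
# Sketch of the two-step adaptive penalty procedure on nested models.
# rss[k-1] plays the role of RSS(M_k) for M_1 ⊂ ... ⊂ M_K; sigma2 is the
# known error variance.  Both rss and g0 here are toy assumptions.

def select_model(rss, lam, sigma2=1.0):
    """Step 1: hat-M(lambda) minimizes RSS(M_k) + lambda * k * sigma2 over k."""
    return min(range(1, len(rss) + 1),
               key=lambda k: rss[k - 1] + lam * k * sigma2)

def select_lambda(rss, lambdas, g0, sigma2=1.0):
    """Step 2: hat-lambda minimizes RSS(hat-M(lambda)) + g0(lambda) * sigma2."""
    def criterion(lam):
        k = select_model(rss, lam, sigma2)
        return rss[k - 1] + g0(lam) * sigma2
    return min(lambdas, key=criterion)

# Toy nested-fit RSS values: a large drop when the second predictor enters,
# then only marginal improvement, so a C_p-type penalty should pick k = 2.
rss = [30.0, 10.0, 9.5, 9.3, 9.0]
print(select_model(rss, lam=2.0))  # -> 2
```

Note that the second step only re-ranks the models already produced by the first step; when the first-step choice is stable over a range of penalties, a decreasing $g_0$ pushes $\hat\lambda$ toward the large end of $\Lambda$.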

Although GDF already takes into account the data-driven selection of the sequence of estimators (i.e., candidate models), it is found in this paper that the model selected by $C_p$ is better than $\hat{M}(\hat\lambda)$ in terms of the probability of selecting the correct model unless $\lambda$ falls between the AIC and BIC penalties. However, the model selector obtained by replacing $g_0(\lambda)$ with the little bootstrap estimate can beat $C_p$ in terms of the probability of selecting the correct model.

The rest of the paper is organized as follows. In Section 2, we introduce unbiased risk estimation and generalized degrees of freedom in the setting of model selection with nested linear regression models; we also address the performance of the adaptive penalty selection procedure of Shen and Ye (2002) with approximated GDF in terms of model selection consistency. Section 3 evaluates the performance of adaptive penalty selection with GDF. Section 4 concludes with a discussion of the merits of the data perturbation method. The Appendix contains technical proofs.

**2** **Adaptive penalty selection with nested candidate models**

In this section, we start with the derivation of the generalized degrees of freedom defined in Ye (1998) through unbiased risk estimation, describe its potential benefit in increasing the probability of selecting the correct model, and point out its pitfalls when no proper constraint is placed on the choice of penalty.

**2.1** **Unbiased risk estimation under nested linear regression model selection**
The available data contain $n$ observations, each with $K$ explanatory predictors, $x_1, \ldots, x_K$, and a response $y$. Denote the matrix formed by the first $k$ explanatory variables as $\mathbf{X}_k = (\mathbf{x}_1, \ldots, \mathbf{x}_k)$ and let $\boldsymbol\beta_k = (\beta_1, \ldots, \beta_k)^T$, where $k = 1, \ldots, K$. We then have $K$ candidate models, and the $k$th candidate model $M_k$ is

$$\mathbf{Y} = \mathbf{X}_k \boldsymbol\beta_k + \boldsymbol\epsilon.$$

The set of candidate models is denoted by $\mathcal{M} = \{M_k \mid k = 1, \ldots, K\}$, and the true model $M_{k_0} \in \mathcal{M}$; namely, $\boldsymbol\mu_0 = \mathbf{X}_{k_0}\boldsymbol\beta_{k_0}$ with $1 \leq k_0 \leq K$. For ease of presentation, denote $\{M_j \mid k_0 \leq j \leq K\}$ by $\mathcal{M}_{k_0}$, which includes the true model and the overfitted models.

For the $k$th candidate model $M_k$ with least squares fit, we have $\hat{\boldsymbol\mu}_{M_k} = \mathbf{P}_k \mathbf{Y}$, where $\mathbf{P}_k$ stands for the orthogonal projection operator onto the space spanned by $\mathbf{X}_k$. We evaluate the model $\hat{\boldsymbol\mu}_{M_k}$ by its prediction accuracy. Denote the future response vector at $\mathbf{X}_K$ by $\mathbf{Y}^0$ and assume that $E(\mathbf{Y}^0) = E(\mathbf{Y})$ and $Var(\mathbf{Y}^0) = \sigma^2 \mathbf{I}_n$. The unobservable prediction error is defined as $PE(M_k) = \|\mathbf{Y}^0 - \hat{\boldsymbol\mu}_{M_k}\|^2$ and the residual sum of squares as $RSS(M_k) = \|\mathbf{Y} - \hat{\boldsymbol\mu}_{M_k}\|^2$. It is well known that the residual sum of squares $RSS(M_k)$ is often smaller than $PE(M_k)$; namely, $E[PE(M_k)] > E[RSS(M_k)]$. Mallows (1973) proposed $C_p$ by inflating $RSS(M_k)$ with $2\,tr(\mathbf{P}_k)\sigma^2$, where $tr(\mathbf{P}_k)$ is often called the degrees of freedom. Since both $PE(M_k)$ and $RSS(M_k)$ are of quadratic form, it follows easily that

$$E[PE(M_k)] = E[RSS(M_k)] + 2k\sigma^2.$$
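The identity above is easy to check by simulation. The sketch below is an illustrative Monte Carlo check (not the paper's experiment); it uses an orthonormal design so that the projection $\mathbf{P}_k$ simply keeps the first $k$ coordinates of $\mathbf{Y}$.

```python
import random

# Monte Carlo check of E[PE(M_k)] = E[RSS(M_k)] + 2*k*sigma^2 for a model
# M_k that contains the truth.  With an orthonormal design, P_k Y keeps the
# first k coordinates of Y and zeroes out the rest.
random.seed(0)
n, k, sigma2 = 20, 3, 1.0
mu = [2.0, -1.5, 1.0] + [0.0] * (n - 3)  # true mean, supported on first 3 coords

reps = 20000
rss_sum = pe_sum = 0.0
for _ in range(reps):
    y  = [m + random.gauss(0.0, 1.0) for m in mu]  # observed response Y
    y0 = [m + random.gauss(0.0, 1.0) for m in mu]  # independent future Y^0
    muhat = y[:k] + [0.0] * (n - k)                # P_k Y
    rss_sum += sum((a - b) ** 2 for a, b in zip(y, muhat))
    pe_sum  += sum((a - b) ** 2 for a, b in zip(y0, muhat))

gap = (pe_sum - rss_sum) / reps
print(round(gap, 1))  # close to 2*k*sigma2 = 6.0
```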

We now derive the generalized degrees of freedom through an unbiased risk estimate over $\mathcal{M}$ when the true model is $M_{k_0}$. For simplicity, we assume that $\sigma^2$ is known from now on. The model selected through Mallows' $C_p$ criterion among $\mathcal{M}$ is denoted by $\hat{M}(2)$. We can then write the post-model-selection estimator of $\boldsymbol\mu_0$ through $C_p$ as $\hat{\boldsymbol\mu}_{\hat{M}(2)} = \sum_{k=1}^{K} \hat{\boldsymbol\mu}_{M_k} \cdot 1_{\{\hat{M}(2) = M_k\}}$. The prediction risk is defined as follows:

$$RISK(C_p) = E\left[ \left( \mathbf{Y}^0 - \hat{\boldsymbol\mu}_{\hat{M}(2)} \right)^T \left( \mathbf{Y}^0 - \hat{\boldsymbol\mu}_{\hat{M}(2)} \right) \right].$$

By a standard argument, we have

$$RISK(C_p) = E\left[ (\boldsymbol\mu_0 - \hat{\boldsymbol\mu}_{\hat{M}(2)})^T (\boldsymbol\mu_0 - \hat{\boldsymbol\mu}_{\hat{M}(2)}) \right] + n\sigma^2 = E[RSS(\hat{M}(2))] + 2E[\boldsymbol\epsilon^T (\hat{\boldsymbol\mu}_{\hat{M}(2)} - \boldsymbol\mu_0)] = E[RSS(\hat{M}(2))] + g_0(2) \cdot \sigma^2,$$

where $g_0(2) = 2E[\boldsymbol\epsilon^T(\hat{\boldsymbol\mu}_{\hat{M}(2)} - \boldsymbol\mu_0)]/\sigma^2$; $g_0(2)$ is twice the GDF defined in Ye (1998). In general, the calculation of the GDF associated with $C_p$ can be complicated, but its analytic form can be obtained under the settings of this paper.

When $K - k_0 = 20$ and $n = 404$, $g_0(2)$ is found to be around $2k_0 + 5.02$, as given in Lemma 4. In other words, the unbiased risk estimator for $C_p$ over $\mathcal{M}$ should be $RSS(\hat{M}(2)) + g_0(2) \cdot \sigma^2$ instead of $RSS(\hat{M}(2)) + 2|\hat{M}(2)|\sigma^2$, where $|\hat{M}(2)|$ is the total number of unknown parameters of the model chosen by $C_p$. It will be shown in Lemma 4(a) that $g_0(2) \geq 2k_0$ under some regularity conditions.

Since BIC achieves model selection consistency, as shown in Schwarz (1978), its associated unbiased risk estimator will be $RSS(\hat{M}(\log n)) + 2k_0\sigma^2$. This leads to the proposal in Shen and Ye (2002) of employing an adaptive choice of penalty incorporating GDF. We now investigate whether the adaptive penalty selection proposal incorporating GDF in Shen and Ye (2002) is a consistent model selection procedure.

**2.2** **Adaptive choice of λ over Λ = {0, 2, log n}**

As an illustration, we consider how the adaptive choice of penalty, $\lambda \in \Lambda = \{0, 2, \log n\}$, with GDF improves on a fixed choice of penalty. Consider $k_0 = 1$ and $K - k_0 = 20$, in which case selecting an underfitted model is not possible. When $\lambda = 0$, model $M_K$ will always be chosen since it gives the smallest residual sum of squares; then $g_0(0) = 2K$, which equals the degrees of freedom assigned by Mallows' $C_p$. When $\lambda = \log n$, we have $\hat{M}(\lambda) = M_{k_0}$ with probability close to 1, and then $g_0(\log n) \approx 2k_0$. When $\lambda = 2$, Woodroofe (1982) shows that the probability of selecting the correct model is slightly larger than $0.7$ while choosing no more than one superfluous explanatory variable on average. Then $g_0(2) \approx 2k_0 + 5.02$, where $5.02$ depends on $k_0$ and $K - k_0$. It will be shown that the adaptive penalty selection procedure with GDF improves on $C_p$ by increasing the probability of selecting the correct model $M_{k_0}$ when $K - k_0 \geq 20$. This improvement mainly comes from the reduction of the probability that $C_p$ selects $M_{k_0+1}$. Technical details will be provided later.

We now give a lemma which will be used repeatedly to give an upper bound on $|\hat{M}(\lambda)| - k_0$. It can also be used to give an upper bound on the probability of choosing $M_{k_0}$ when the adaptive penalty selection procedure is used.

**Lemma 1.** *Suppose that $\log n \in \Lambda$, $\lambda \leq \log n$ for all $\lambda \in \Lambda$, and $\hat{M}(\log n) = M_{k_0}$.*

*(a) For all $\lambda \in \Lambda$,*

$$P\left( \hat{M}(\hat\lambda) = M_{k_0} \right) \leq P\left( \lambda \left( |\hat{M}(\lambda)| - |M_{k_0}| \right) \leq g_0(\lambda) - g_0(\log n) \right).$$

*(b) $\hat{M}(\hat\lambda) \neq M_{k_0}$ if there exists $\lambda$ such that $\lambda \left( |\hat{M}(\lambda)| - |M_{k_0}| \right) > g_0(\lambda) - g_0(\log n)$.*

*Proof.* Recall that $\hat{M}(\lambda) = \operatorname{argmin}_{M_k \in \mathcal{M}} \{RSS(M_k) + \lambda |M_k| \sigma^2\}$ for any given $\lambda \in \Lambda$; that is, $\hat{M}(\lambda)$ is the chosen model for given $\lambda$. We conclude that $RSS(\hat{M}(\lambda)) + \lambda |\hat{M}(\lambda)| \sigma^2 \leq RSS(M_{k_0}) + \lambda |M_{k_0}| \sigma^2$. Hence, for all $\lambda$,

$$\lambda \left[ |\hat{M}(\lambda)| - |M_{k_0}| \right] \sigma^2 \leq RSS(M_{k_0}) - RSS(\hat{M}(\lambda)). \tag{2}$$

When the event $\{\hat\lambda = \log n\}$ occurs, the following must hold:

$$RSS(\hat{M}(\lambda)) + g_0(\lambda)\sigma^2 \geq RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2 \quad \text{for all } \lambda.$$

Hence, $P(\hat\lambda = \log n) \leq P\left( RSS(\hat{M}(\lambda)) + g_0(\lambda)\sigma^2 \geq RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2 \right)$. We conclude that

$$\begin{aligned} P(\hat\lambda = \log n) &\leq P\left( RSS(\hat{M}(\log n)) - RSS(\hat{M}(\lambda)) \leq \left[ g_0(\lambda) - g_0(\log n) \right] \sigma^2 \right) \\ &\leq P\left( \lambda \left( |\hat{M}(\lambda)| - |M_{k_0}| \right) \sigma^2 \leq RSS(\hat{M}(\log n)) - RSS(\hat{M}(\lambda)) \leq \left[ g_0(\lambda) - g_0(\log n) \right] \sigma^2 \right) \\ &\leq P\left( \lambda \left( |\hat{M}(\lambda)| - |M_{k_0}| \right) \leq g_0(\lambda) - g_0(\log n) \right). \end{aligned}$$

The second inequality holds by (2) together with the assumption $\hat{M}(\log n) = M_{k_0}$. This concludes the proof of the lemma. $\Box$
The next lemma states that $|\hat{M}(\lambda)|$ is a non-increasing step function of $\lambda$.

**Lemma 2.** *When the set of candidate models $\mathcal{M}$ is a sequence of nested linear regression models, $|\hat{M}(\lambda)|$ is a non-increasing step function over $\Lambda$. When $|\hat{M}(\lambda_\ell)| = k_\ell$ for some $1 \leq k_\ell \leq K$, we have $|\hat{M}(\lambda)| \leq k_\ell$ for all $\lambda \geq \lambda_\ell$.*

*Proof.* Recall that $\hat{M}(\lambda)$ denotes the selected model for the given penalty $\lambda |M_k| \sigma^2$. It follows from Proposition 5.1 of Breiman (1992) that $RSS(\hat{M}(\lambda))$ must be at an extreme point of the lower convex envelope of the graph $\{(k, RSS(M_k)) : k = 1, \ldots, K\}$. Hence, the lemma holds. $\Box$
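Lemma 2 can also be checked numerically for any decreasing RSS sequence: the size of the minimizer of $RSS(M_k) + \lambda k \sigma^2$ can only step down as $\lambda$ grows. A small sketch with toy RSS values (an assumption for illustration):

```python
def model_size(rss, lam, sigma2=1.0):
    """Size of the model minimizing RSS(M_k) + lambda * k * sigma2."""
    return min(range(1, len(rss) + 1),
               key=lambda k: rss[k - 1] + lam * k * sigma2)

# Toy decreasing RSS sequence for nested models M_1 ⊂ ... ⊂ M_6.
rss = [40.0, 18.0, 12.0, 10.5, 10.1, 10.0]
sizes = [model_size(rss, 0.05 * i) for i in range(200)]  # lambda grid on [0, 10)

# |hat-M(lambda)| is a non-increasing step function of lambda (Lemma 2).
assert all(a >= b for a, b in zip(sizes, sizes[1:]))
print(sizes[0], sizes[-1])  # full model at lambda = 0, a small model for large lambda
```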
The adaptive choice of $\lambda$ over $\Lambda = \{0, 2, \log n\}$ depends on the comparison of $RSS(\hat{M}(0)) + g_0(0)\sigma^2$, $RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2$, and $RSS(\hat{M}(2)) + g_0(2)\sigma^2$. We start with the comparison of $RSS(\hat{M}(0)) + g_0(0)\sigma^2$ and $RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2$. Note that, for $K - k_0 = 20$,

$$\begin{aligned} & P\left( RSS(\hat{M}(0)) + g_0(0)\sigma^2 > RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2 \right) \\ &= P\left( RSS(\hat{M}(0)) + g_0(0)\sigma^2 > RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2,\ \hat{M}(\log n) = M_{k_0} \right) \\ &\quad + P\left( RSS(\hat{M}(0)) + g_0(0)\sigma^2 > RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2,\ \hat{M}(\log n) \neq M_{k_0} \right) \\ &\geq P\left( \mathbf{Y}^T (\mathbf{P}_K - \mathbf{P}_{k_0}) \mathbf{Y} < 40\sigma^2,\ \hat{M}(\log n) = M_{k_0} \right). \end{aligned}$$

Since $\mathbf{Y}^T(\mathbf{P}_K - \mathbf{P}_{k_0})\mathbf{Y}/\sigma^2$ is a chi-square random variable with 20 degrees of freedom, it follows from the consistency of BIC in model selection that $P\left( RSS(\hat{M}(0)) + g_0(0)\sigma^2 > RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2 \right)$ is close to 1.
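The chi-square probability used above is easy to evaluate with the standard library: for even degrees of freedom $2m$, $P(\chi^2_{2m} > x) = e^{-x/2}\sum_{i=0}^{m-1}(x/2)^i/i!$ (the Poisson-sum identity). A quick check that $P(\chi^2_{20} < 40)$ is indeed close to 1:

```python
import math

def chi2_sf_even(x, df):
    """P(chi2_df > x) for even df, via the Poisson-sum identity."""
    assert df % 2 == 0 and df > 0
    z = x / 2.0
    return math.exp(-z) * sum(z ** i / math.factorial(i) for i in range(df // 2))

# P(Y^T (P_K - P_{k0}) Y < 40*sigma^2) when the quadratic form is chi-square
# with 20 degrees of freedom:
p = 1.0 - chi2_sf_even(40.0, 20)
print(p > 0.99)  # -> True
```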

Next, we evaluate the difference between $RSS(\hat{M}(2)) + g_0(2)\sigma^2$ and $RSS(\hat{M}(\log n)) + g_0(\log n)\sigma^2$. When $\hat{M}(\log n) = M_{k_0}$ and $\hat{M}(2) = M_k$ with $k - k_0 \geq 3$, we have $2\left( |\hat{M}(2)| - |M_{k_0}| \right) \geq 6 > 5.02 \approx g_0(2) - 2k_0$, so by Lemma 1(b) the adaptive procedure cannot select $M_{k_0}$. The adaptive choice of $\lambda$ over $\{0, 2, \log n\}$ might therefore increase the probability of selecting $M_{k_0}$, and this increase must come from the occurrence of the event $\{1 \leq |\hat{M}(2)| - k_0 \leq 2\}$. In other words, it does improve on $C_p$ by increasing the probability of correct selection. Refer to Lemma 5 for further details.

**2.3** **Adaptive choice of λ over Λ = [2, log n]**

The choice of $\Lambda$ can be difficult in practice if the user does not know the behavior of fixed-penalty model selection criteria. The natural choice of $\Lambda$ would be $[0, \infty)$, as suggested in Shen and Ye (2002). In this subsection, $\Lambda$ is chosen to be $[2, \log n]$; the case $\Lambda = [0, \log n]$ is deferred to the next subsection.

Note that $g_0(\lambda) = 2\sum_{k=1}^{K-k_0} P\left( \chi^2_{k+2} > k\lambda \right) + 2k_0$ under some regularity conditions, by Lemma 4(a) in Section 3.

When $\Lambda = [2, \log n]$, the adaptive choice of $\lambda$, $\hat\lambda$, is defined as follows:

$$\hat\lambda = \operatorname*{argmin}_{\lambda \in [2, \log n]} \left\{ RSS(\hat{M}(\lambda)) + g_0(\lambda)\sigma^2 \right\}.$$

When the event $\{\hat{M}(2) = M_{k_0}\}$ occurs, it follows from Lemma 2 and the fact that $g_0(\lambda)$ is decreasing that $\hat\lambda = \log n$. Hence, the adaptive penalty selection procedure selects $M_{k_0}$ whenever Mallows' $C_p$ selects $M_{k_0}$. This shows that adaptive penalty selection with $\lambda \in [2, \log n]$ does improve on $C_p$ in the probability of correct selection. Table 1 presents the findings of a simulation experiment; the general discussion is left to Section 3. This simulation shows that the adaptive choice of $\lambda$ over $[2, \log n]$ performs better than $C_p$ but cannot match BIC. Compared to $C_p$, the adaptive choice of $\lambda$ over $[2, \log n]$ increases the probability of correct selection from $71.29\%$ to $74.88\%$, and this improvement is observed consistently over simulations. Section 3 quantifies the increase in the probability of selecting $M_{k_0}$ from the adaptive choice of $\lambda$ over $[2, \log n]$ relative to the fixed penalty $2$; the result matches well with this simulation experiment.

**2.4** **Adaptive choice of λ over Λ = [0, log n]**

Woodroofe (1982) showed that the Mallows' $C_p$ selection procedure can be viewed as finding the global minimum of a random walk with negative drift $-1$, whose increments are $\left[ \sqrt{n}(\hat\beta_k - \beta_k)/\sqrt{Var(\hat\beta_k)} \right]^2 - 2$, over $\{k_0, \ldots, K\}$. When $\lambda = 0$, the procedure selects $M_K$, the largest candidate model. Section 2.2 shows that $g_0(0) = 2K$, which is twice the GDF defined in Ye (1998).

As in the last subsection, we again use a simulation study to evaluate the adaptive penalty selection procedure, now with $\Lambda = [0, \log n]$. The findings are summarized in Table 2. Clearly, the adaptive selection of $\lambda$ over $[0, \log n]$ not only fails to improve on $C_p$ but decreases the probability of selecting $M_{k_0}$ to about $52.6\%$, as shown in Table 2. This finding is consistent with the simulation results in Atkinson (1980) and Zhang (1992); both papers suggested that the penalty factor for Atkinson's generalized Mallows' $C_p$ statistic should be chosen between $1.5$ and $6$.

When $\lambda = 1$, the size of the selected model is the one that reaches the global minimum of a random walk without drift over $\{k_0, \ldots, K\}$, which leads to an overfitted model with many superfluous explanatory variables. We now employ Lemma 1 to explain why adaptive penalty selection performs worse than

| $\|\hat{M}(\hat\lambda)\|$ | $\lambda \in [2, \log n]$ | $\lambda = 2$ | $\|\hat{M}(\hat\lambda)\|$ | $\lambda \in [2, \log n]$ | $\lambda = 2$ |
|---|---|---|---|---|---|
| $k_0$ | 0.7488 | 0.7129 | $k_0+11$ | 0.0016 | 0.0016 |
| $k_0+1$ | 0.0834 | 0.1116 | $k_0+12$ | 0.0011 | 0.0011 |
| $k_0+2$ | 0.0513 | 0.0578 | $k_0+13$ | 0.0022 | 0.0022 |
| $k_0+3$ | 0.0345 | 0.0352 | $k_0+14$ | 0.0014 | 0.0014 |
| $k_0+4$ | 0.0237 | 0.0240 | $k_0+15$ | 0.0014 | 0.0014 |
| $k_0+5$ | 0.0147 | 0.0149 | $k_0+16$ | 0.0015 | 0.0015 |
| $k_0+6$ | 0.0091 | 0.0090 | $k_0+17$ | 0.0007 | 0.0007 |
| $k_0+7$ | 0.0081 | 0.0082 | $k_0+18$ | 0.0006 | 0.0006 |
| $k_0+8$ | 0.0058 | 0.0057 | $k_0+19$ | 0.0009 | 0.0009 |
| $k_0+9$ | 0.0054 | 0.0055 | $k_0+20$ | 0.0002 | 0.0002 |
| $k_0+10$ | 0.0036 | 0.0036 | | | |

Table 1: The empirical distribution of the chosen model through adaptive model selection with $\lambda \in [2, \log n]$, based on 100,000 simulated runs.

| $\|\hat{M}(\hat\lambda)\|$ | $\lambda \in [0, \log n]$ | $\lambda = 2$ | $\|\hat{M}(\hat\lambda)\|$ | $\lambda \in [0, \log n]$ | $\lambda = 2$ |
|---|---|---|---|---|---|
| $k_0$ | 0.5259 | 0.7129 | $k_0+11$ | 0.0154 | 0.0016 |
| $k_0+1$ | 0.0611 | 0.1116 | $k_0+12$ | 0.0169 | 0.0011 |
| $k_0+2$ | 0.0348 | 0.0578 | $k_0+13$ | 0.0169 | 0.0022 |
| $k_0+3$ | 0.0271 | 0.0352 | $k_0+14$ | 0.0162 | 0.0014 |
| $k_0+4$ | 0.0235 | 0.0240 | $k_0+15$ | 0.0189 | 0.0014 |
| $k_0+5$ | 0.0189 | 0.0149 | $k_0+16$ | 0.0216 | 0.0015 |
| $k_0+6$ | 0.0159 | 0.0090 | $k_0+17$ | 0.0225 | 0.0007 |
| $k_0+7$ | 0.0168 | 0.0082 | $k_0+18$ | 0.0277 | 0.0006 |
| $k_0+8$ | 0.0132 | 0.0057 | $k_0+19$ | 0.0308 | 0.0009 |
| $k_0+9$ | 0.0174 | 0.0055 | $k_0+20$ | 0.0438 | 0.0002 |
| $k_0+10$ | 0.0149 | 0.0036 | | | |

Table 2: The empirical distribution of the chosen model through adaptive model selection with $\lambda \in [0, \log n]$, based on 100,000 simulated runs.

$C_p$ when $\Lambda = [0, \log n]$. Recall that $g_0(\lambda) = 2E[\boldsymbol\epsilon^T(\hat{\boldsymbol\mu}_{\hat{M}(\lambda)} - \boldsymbol\mu_0)]/\sigma^2$, where $\hat{\boldsymbol\mu}_{\hat{M}(\lambda)} = \sum_{k=1}^{K} \hat{\boldsymbol\mu}_{M_k} \cdot 1_{\{\hat{M}(\lambda) = M_k\}}$. Note that $\mathbf{P}_k$ does not depend on $\boldsymbol\epsilon$, and hence

$$E[\boldsymbol\epsilon^T(\hat{\boldsymbol\mu}_{\hat{M}(\lambda)} - \boldsymbol\mu_0)] = E\left[ \sum_{k=1}^{K} \boldsymbol\epsilon^T \mathbf{P}_k \boldsymbol\epsilon \cdot 1_{\{\hat{M}(\lambda) = M_k\}} \right] = \sigma^2 g_0(\lambda)/2 > 0.$$

When $|\hat{M}(\lambda)| \geq k_0$, $g_0(\lambda)$ must be greater than $2k_0$. Since $\boldsymbol\epsilon^T(\hat{\boldsymbol\mu}_{\hat{M}(\lambda)} - \boldsymbol\mu_0)$ is a mixture of $\chi^2$ random variables and cannot be observed, it is estimated by its expected value, as suggested in Mallows (1973). Note that $g_0(\lambda) - 2k_0$ equals $25.35$, $21.98$, $18.78$, $15.98$, $13.36$, and $11.22$ as $\lambda$ takes the values $1.0$, $1.1$, $1.2$, $1.3$, $1.4$, and $1.5$. Since the distribution of a $\chi^2_1$ random variable is positively skewed, replacing such a random variable by its mean can be problematic. Moreover, the following must hold for all $\lambda \in \Lambda$ when $\hat{M}(\hat\lambda) = M_{k_0}$:

$$|\hat{M}(\lambda)| - k_0 \leq \left\lfloor \frac{g_0(\lambda) - 2k_0}{\lambda} \right\rfloor,$$

where $\lfloor \cdot \rfloor$ is the floor function. When $\hat{M}(\hat\lambda) = M_{k_0}$, it follows from Lemma 1 that $|\hat{M}(1.1)| - k_0 \leq 19$, $|\hat{M}(1.2)| - k_0 \leq 15$, $|\hat{M}(1.3)| - k_0 \leq 12$, $|\hat{M}(1.4)| - k_0 \leq 9$, and $|\hat{M}(1.5)| - k_0 \leq 7$. This reduces the probability of choosing $M_{k_0}$, which explains why adaptive penalty selection is so problematic when $\lambda \in [1, 1.5]$.
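The floor bound is mechanical given the quoted values of $g_0(\lambda) - 2k_0$; the following minimal check reproduces the caps listed above (the $g_0(\lambda) - 2k_0$ values themselves are taken from the text):

```python
import math

# Values of g0(lambda) - 2*k0 quoted in the text for K - k0 = 20.
excess = {1.1: 21.98, 1.2: 18.78, 1.3: 15.98, 1.4: 13.36, 1.5: 11.22}

# Cap on |hat-M(lambda)| - k0 when hat-M(hat-lambda) = M_{k0} (Lemma 1):
caps = {lam: math.floor(e / lam) for lam, e in excess.items()}
print(caps)  # caps of 19, 15, 12, 9, 7 at lambda = 1.1, ..., 1.5
```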

**3** **Performance of Adaptive Penalty Selection with GDF**

In this section, we first state assumptions which ensure that BIC is a consistent model selector and that $C_p$ never misses explanatory variables with nonzero coefficients. Under the setting of this paper, the analytic form of $g_0(\lambda)$ is available, and its numerical values can be obtained easily with any statistical software. To evaluate whether adaptive penalty selection can improve on fixed-penalty selection, we combine a simulation study and analytic approximation for those $(K, k_0)$ with $K - k_0 \geq 20$. When $K - k_0 = 20$, numerical values of $g_0(\lambda)$ are obtained with R. We then combine the numerical values of $g_0(\lambda)$ for $K - k_0 = 20$ with the exponential inequality in Teicher (1984) to give a bound on $g_0(\lambda)$ for $K - k_0 > 20$.

**3.1** **Key assumptions**

We now state three assumptions on the class of competing models $\mathcal{M}$. The following assumptions ensure that employing BIC leads to consistent model selection and that we never obtain an underfitted model when $\lambda \in [0, \log n]$. They mainly prescribe the relations among $n$, the total number of observations, $K$, the total number of explanatory variables, and $k_0$, the total number of nonzero regression coefficients. As a remark, the setting of the simulation studies in Section 2 satisfies all three assumptions.

**Assumption B.** *$\mathbf{X}_K$ is of full rank and there exists a constant $c > 0$ such that $\boldsymbol\mu_0^T(\mathbf{I}_n - \mathbf{P}_j)\boldsymbol\mu_0/\sigma^2 \geq cn$ for all $j < k_0$, where $\boldsymbol\mu_0 = \mathbf{X}_{k_0}\boldsymbol\beta_{k_0}$ is the mean vector of the true model.*

**Assumption N1.** *$cn > 2k_0 \log n$.*

**Assumption N2.** *$\log n > 2\log(K - k_0)$.*

Assumptions B and N1 quantify the magnitude of the bias when the fitted model leaves out an explanatory variable with a nonzero coefficient, while Assumption N2 guards against the inclusion of zero-coefficient explanatory variables by controlling $\max_{k_0 \leq j \leq K} |\hat\beta_j|$. When these assumptions hold, BIC selects the correct model with probability tending to 1 as $n$ goes to infinity. Moreover, for all $\lambda \in [0, \log n]$, the probability of the event $\{|\hat{M}(\lambda)| \leq k_0 - 1\}$ goes to zero, as shown in the following lemma, whose proof is deferred to the Appendix.

**Lemma 3.** *When $k_0 > 1$ and Assumptions B and N1 hold,*

$$P\left( |\hat{M}(\lambda)| \geq k_0 \text{ for all } \lambda \in [0, \log n] \right) \to 1 \quad \text{as } n \to \infty.$$

*When Assumption N2 also holds,*

$$P\left( \hat{M}(\log n) = M_{k_0} \right) \to 1 \quad \text{as } n \to \infty.$$

Under the above assumptions, we only need to address the adaptive penalty selection over $\Lambda = [0, \log n]$ within $\mathcal{M}_{k_0}$, that is, the inclusion of superfluous explanatory variables. It remains to study

$$\hat{M}(\lambda) = \operatorname*{argmin}_{M_k \in \mathcal{M}_{k_0}} \left\{ RSS(M_k) + \lambda k \sigma^2 \right\}$$

and $\hat\lambda$, defined as

$$\hat\lambda = \operatorname*{argmin}_{\lambda \in \Lambda} \left\{ RSS(\hat{M}(\lambda)) + g_0(\lambda)\sigma^2 \right\},$$

in order to quantify $|\hat{M}(\lambda)| - k_0$. We now employ the argument in Woodroofe (1982) to calculate $g_0(\lambda) = 2E[\boldsymbol\epsilon^T(\hat{\boldsymbol\mu}_{\hat{M}(\lambda)} - \boldsymbol\mu_0)]/\sigma^2$ over $\mathcal{M}_{k_0}$; the details are deferred to the Appendix.

**Lemma 4.** *When Assumptions B, N1, and N2 hold, the following hold.*

*(a) For $\lambda \in [0, \log n]$,*

$$g_0(\lambda) = 2\sum_{j=1}^{K-k_0} P\left( \chi^2_{j+2} > j\lambda \right) + 2k_0,$$

*and $g_0(\lambda)$ is a strictly decreasing function of $\lambda$, where $\chi^2_\ell$ is chi-square distributed with $\ell$ degrees of freedom.*

*(b) For $\lambda \in [2, \log n]$, $2\sum_{j=21}^{K-k_0} P\left( \chi^2_{j+2} > j\lambda \right) \leq 2.6102$ when $K - k_0 \geq 21$.*

Figure 1 gives the plot of $(\lambda, g_0(\lambda))$ when $K - k_0 = 20$, in which $g_0(\lambda)$ is approximated with R to an accuracy of four decimal places. Lemma 4(b) gives a general idea of the plot of $(\lambda, g_0(\lambda))$ when $K - k_0 > 20$.

Figure 1: $g_0(\lambda)$ when $K - k_0 = 20$.

**3.2** **Performance of adaptive penalty selection procedure**

To evaluate the adaptive penalty selection procedure, we start with the case $(K - k_0, n) = (20, 404)$ and $\Lambda = [2, \log n]$. Since $\log 404 > 2\log 20$, Assumption N2 holds. Assume that Assumptions B and N1 hold as well. It follows from Lemma 3 that the adaptive penalty selection procedure in Shen and Ye (2002) will not miss any explanatory variable with a nonzero coefficient. Lemma 4(a) gives $g_0(\lambda)$ for all $0 \leq \lambda \leq \log n$, which can be obtained by numerical approximation in R.

By the simulation results presented in Table 1, $\hat\lambda$ is often smaller than $\log n$, and $\hat{M}(\hat\lambda)$ cannot be $M_{k_0}$ whenever $|\hat{M}(2)| > k_0 + 2$. An estimate of a lower bound on the probability of the event $\{\hat{M}(\hat\lambda) = M_{k_0}\}$ will be given.

When the event $\{\hat{M}(2) = M_{k_0}\}$ occurs, it follows from Lemma 2 and the fact that $g_0(\lambda)$ is decreasing that $\hat\lambda = \log n$ and $\hat{M}(\hat\lambda) = M_{k_0}$. Hence, adaptive penalty selection selects $M_{k_0}$ whenever Mallows' $C_p$ selects $M_{k_0}$. We now demonstrate how adaptive penalty selection improves on $C_p$ when $C_p$ overfits by one explanatory variable, that is, when the event $\{\hat{M}(2) = M_{k_0+1},\ \hat{M}(\hat\lambda) = M_{k_0}\}$ occurs. For $i = k_0 + 1, \ldots, K$, define $V_i = [RSS(M_{i-1}) - RSS(M_i)]/\sigma^2$. Note that the $V_i$ are i.i.d. chi-square random variables with one degree of freedom. When the event $\{\hat{M}(2) = M_{k_0+1}\}$ occurs, it follows from Lemma 2 that $\hat{M}(\lambda) = M_{k_0}$ when $\lambda \in [V_{k_0+1}, \log n]$ and $\hat{M}(\lambda) = M_{k_0+1}$ when $\lambda \in [2, V_{k_0+1})$.

The following lemma gives a lower bound on the probability of the event $\{\hat{M}(2) = M_{k_0+1},\ \hat\lambda = \log n\}$ when $\Lambda = [2, \log n]$. Define $p_{\lambda, k_0, K-k_0}(k)$ to be the probability of the event $\{\hat{M}(\lambda) = M_k\}$ for the given penalty $\lambda$, where $k = 1, \ldots, K$.

**Lemma 5.** *Suppose Assumptions B, N1, and N2 hold and $\Lambda = [2, \log n]$. A lower bound on the probability of $\{\hat{M}(2) = M_{k_0+1},\ \hat{M}(\hat\lambda) = M_{k_0}\}$ is*

$$p_{2, k_0+1, K-k_0-1}(k_0 + 1) \cdot P\left( 2 < V_{k_0+1} \leq \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\} \right).$$

*Proof.* Define $E_1 = \{\hat{M}(\lambda) = M_{k_0+1} \text{ for all } \lambda \in [2, V_{k_0+1}),\ \hat{M}(\hat\lambda) = M_{k_0}\}$ and $E_2 = \{\hat{M}(\lambda') = M_{k_0} \text{ for all } \lambda' \in [V_{k_0+1}, \log n],\ \hat{M}(\hat\lambda) = M_{k_0}\}$. Then

$$\begin{aligned} E_1 &= \{ RSS(M_{k_0}) - RSS(M_{k_0+1}) > \lambda\sigma^2, \\ &\qquad RSS(M_{k_0+1}) - RSS(M_{k_0+1+i}) \leq i\lambda\sigma^2 \text{ for all } 0 < i \leq K - k_0 - 1, \\ &\qquad RSS(M_{k_0}) + 2k_0\sigma^2 \leq RSS(\hat{M}(\lambda)) + g_0(\lambda)\sigma^2 \text{ for all } \lambda \in [2, V_{k_0+1}) \} \\ &= \{ V_{k_0+2} + \cdots + V_{k_0+2+i} \leq (i+1)V_{k_0+1} \text{ for all } 0 \leq i \leq K - k_0 - 2, \\ &\qquad 2 < V_{k_0+1} \leq g_0(V_{k_0+1}) - 2k_0 \} \end{aligned}$$

and

$$\begin{aligned} E_2 &= \{ RSS(M_{k_0}) - RSS(M_{k_0+i}) \leq i\lambda'\sigma^2 \text{ for all } 0 < i \leq K - k_0, \\ &\qquad RSS(M_{k_0}) + 2k_0\sigma^2 \leq RSS(\hat{M}(\lambda')) + g_0(\lambda')\sigma^2 \text{ for all } \lambda' \in [V_{k_0+1}, \log n) \} \\ &= \{ V_{k_0+1} + \cdots + V_{k_0+1+j} \leq (j+1)V_{k_0+1} \text{ for all } 0 \leq j \leq K - k_0 - 1 \}. \end{aligned}$$

Observe that

$$\begin{aligned} \{\hat{M}(2) = M_{k_0+1},\ \hat{M}(\hat\lambda) = M_{k_0}\} &= E_1 \cap E_2 \\ &= \{ V_{k_0+2} + \cdots + V_{k_0+2+i} \leq (i+1)V_{k_0+1} \text{ for all } 0 \leq i \leq K - k_0 - 2, \\ &\qquad 2 < V_{k_0+1} \leq g_0(V_{k_0+1}) - 2k_0, \\ &\qquad V_{k_0+1} + \cdots + V_{k_0+1+j} \leq (j+1)V_{k_0+1} \text{ for all } 0 \leq j \leq K - k_0 - 1 \} \\ &\supset \{ V_{k_0+2} + \cdots + V_{k_0+2+i} \leq 2(i+1) \text{ for all } 0 \leq i \leq K - k_0 - 2, \\ &\qquad 2 < V_{k_0+1} \leq g_0(V_{k_0+1}) - 2k_0,\ V_{k_0+2} + \cdots + V_{k_0+1+j} \leq jV_{k_0+1} \text{ for all } 1 \leq j \leq K - k_0 - 1 \} \\ &\supset \{ V_{k_0+2} + \cdots + V_{k_0+2+i} \leq 2(i+1) \text{ for all } 0 \leq i \leq K - k_0 - 2, \\ &\qquad 2 < V_{k_0+1} \leq g_0(V_{k_0+1}) - 2k_0,\ V_{k_0+2} + \cdots + V_{k_0+1+j} \leq 2j \text{ for all } 1 \leq j \leq K - k_0 - 1 \} \\ &\supset \{ V_{k_0+2} + \cdots + V_{k_0+2+i} \leq 2(i+1) \text{ for all } 0 \leq i \leq K - k_0 - 2,\ 2 < V_{k_0+1} \leq g_0(V_{k_0+1}) - 2k_0 \}. \end{aligned}$$

Then we have

$$P(\hat{M}(2) = M_{k_0+1},\ \hat{M}(\hat\lambda) = M_{k_0}) \geq P\left( 2 < V_{k_0+1} \leq g_0(V_{k_0+1}) - 2k_0 \right) \cdot P\left( V_{k_0+2} + \cdots + V_{k_0+2+i} \leq 2(i+1) \text{ for all } 0 \leq i \leq K - k_0 - 2 \right). \tag{3}$$

Note that the second term on the right-hand side of (3) is equal to $p_{2, k_0+1, K-k_0-1}(k_0+1)$. This concludes the proof. $\Box$

**Remark 1.** When $K - k_0 = 20$, $p_{2, k_0+1, 19}(k_0+1)$ is around $0.7151$ by simulation. Since $g_0(\lambda)$ is strictly decreasing, a lower bound on the first term of (3) is obtained as follows:

$$P\left( 2 < V_{k_0+1} \leq g_0(V_{k_0+1}) - 2k_0 \right) \geq P\left( 2 < V_{k_0+1} \leq \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\} \right) \approx P(2 \leq V_{k_0+1} \leq 2.56) \approx 0.04770, \tag{4}$$

where $2.56$ is obtained numerically as an approximation of $\sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}$ for $K - k_0 = 20$, and $0.04770$ is a numerical approximation of $P(2 \leq V_{k_0+1} \leq 2.56)$ obtained with R. It follows from (3) and (4) that, for $\Lambda = [2, \log n]$, $P(\hat{M}(2) = M_{k_0+1},\ \hat\lambda = \log n) \geq 0.7151 \cdot 0.04770 \approx 0.0341$.
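The interval probability in (4) needs no simulation: for one degree of freedom, $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$. The sketch below reproduces the two numerical values quoted above ($0.7151$ is the simulated factor from the text, taken as given):

```python
import math

def chi2_1_sf(x):
    """P(chi2_1 > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

# First factor of (3), approximated as P(2 <= chi2_1 <= 2.56), as in (4):
p_interval = chi2_1_sf(2.0) - chi2_1_sf(2.56)
# Lower bound in Remark 1; 0.7151 is p_{2,k0+1,19}(k0+1), simulated in the text.
lower_bound = 0.7151 * p_interval
print(round(p_interval, 4), round(lower_bound, 4))  # about 0.0477 and 0.0341
```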

In fact, $\{\hat{M}(\hat\lambda) = M_{k_0}\} = \cup_{k_0 \leq k \leq K}\{\hat{M}(2) = M_k,\ \hat{M}(\hat\lambda) = M_{k_0}\}$. The adaptive penalty selection procedure therefore picks $M_{k_0}$ with probability

$$\begin{aligned} P(\hat{M}(\hat\lambda) = M_{k_0}) &\geq \sum_{k_0 \leq k \leq K} P(\hat{M}(2) = M_k,\ \hat{M}(\hat\lambda) = M_{k_0}) \\ &\geq \sum_{k_0 \leq k \leq k_0+1} P(\hat{M}(2) = M_k,\ \hat{M}(\hat\lambda) = M_{k_0}) \\ &\geq p_{2,k_0,20}(k_0) + p_{2,k_0+1,19}(k_0+1) \cdot P\left( 2 \leq \chi^2_1 \leq \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\} \right) \\ &\approx 0.7129 + 0.7151 \cdot 0.04770 \approx 0.7470. \end{aligned}$$

Since $P(\hat M(\hat\lambda) = M_{k_0}) \ge p_{2,k_0,K-k_0}(k_0)$, the adaptive penalty selector with $\lambda \in [2, \log n]$ improves on $C_p$ in correct selection probability. Moreover, a lower bound on the increase in correct selection probability is around 0.034 for $K - k_0 = 20$ and $k_0 = 1$. We conclude that when $\lambda \in [2, \log n]$, the adaptive penalty selector improves on $C_p$ by at least 3.4% in correct selection probability. As suggested by the simulation study, the improvement comes mainly from the event $\{\hat M(2) = M_{k_0+1},\ \hat\lambda = \log n\}$.
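The constant $p_1(2,1,20)$ above can be reproduced by direct simulation. The sketch below assumes the convention used later in this section, namely $p_{2,1,m}(1) = P(V_{k_0+1} + \cdots + V_{k_0+j} \le 2j \text{ for all } 1 \le j \le m)$ with i.i.d. $\chi^2_1$ variables $V_i$; the text reports a value of about 0.7129:

```python
import random

# Monte Carlo estimate of p_1(2,1,20): the probability that the partial
# sums S_j of i.i.d. chi-square(1) variables (squared standard normals)
# satisfy S_j <= 2j for all j = 1, ..., 20.  Assumed definition; the
# text reports approximately 0.7129.
rng = random.Random(2012)
T = 200_000
hits = 0
for _ in range(T):
    s = 0.0
    ok = True
    for j in range(1, 21):
        s += rng.gauss(0.0, 1.0) ** 2
        if s > 2 * j:      # a partial-sum constraint fails: C_p overfits
            ok = False
            break
    if ok:
        hits += 1
est = hits / T
print(est)
```

With 200,000 replications the standard error is about 0.001, which is enough to confirm the quoted value to two decimal places.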

Furthermore, we provide a crude upper bound obtained via Lemma 1. When $K - k_0 = 20$, we have
$$g_0(\lambda) = 2\sum_{j=1}^{20} P(\chi^2_{j+2} > j\lambda) + 2k_0 \quad \text{for all } 2 \le \lambda \le \log n, \quad \text{and} \quad g_0(2) - 2k_0 = 5.02 \text{ when } K - k_0 = 20. \qquad (5)$$
As suggested by (5), an upper bound on the probability of selecting the correct model $M_{k_0}$ by adaptive penalty selection is 0.88, obtained by adding up $P(|\hat M(2)| = k)$ for $k = k_0, \ldots, k_0+2$ in Table 2. As revealed in Table 1 of Section 2, the adaptive penalty selection procedure improves on fixed-penalty $C_p$ by increasing the probability of correct selection by around 3.5%. This gain must come from models that include no more than 2 superfluous zero-coefficient explanatory variables.
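The constant in (5) can be reproduced numerically. The sketch below evaluates $2\sum_{j=1}^{20} P(\chi^2_{j+2} > 2j)$ using a small standard-library implementation of the $\chi^2$ survival function via the series expansion of the regularized lower incomplete gamma function; the text reports 5.02:

```python
import math

def chi2_sf(x, k):
    """Survival function P(chi2_k > x), computed from the regularized
    lower incomplete gamma function P(k/2, x/2) by its series expansion:
    P(a,x) = x^a e^{-x} * sum_{n>=0} x^n / (a (a+1) ... (a+n))."""
    a, x2 = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term >= total * 1e-15:
        n += 1
        term *= x2 / (a + n)
        total += term
    p_lower = total * math.exp(-x2 + a * math.log(x2) - math.lgamma(a))
    return 1.0 - p_lower

# g_0(2) - 2 k_0 = 2 * sum_{j=1}^{20} P(chi2_{j+2} > 2j); the text gives 5.02.
val = 2 * sum(chi2_sf(2 * j, j + 2) for j in range(1, 21))
print(round(val, 2))
```

The series converges quickly for the arguments involved here (at most $x/2 = 20$ with $a = 11$), so no continued-fraction branch is needed.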

According to the above discussion and the results presented in Table 1, we have the following lemma.

**Lemma 6.** *When $K - k_0 = 20$, all three assumptions hold if $n \ge 404$. The adaptive penalty selection procedure proposed in Shen and Ye (2002) with $\lambda \in [2, \log n]$ improves over Mallows' $C_p$ in terms of model selection consistency. However, it cannot match BIC in terms of model selection consistency.*

We now compare the performance of adaptive penalty selection under $K - k_0 > 20$ (where $n > (K-k_0)^2$) versus $K - k_0 = 20$ (where $n = 404$). When $K - k_0 > 20$, it follows from Lemma 4 that
$$g_0(\lambda) = 2\sum_{j=1}^{20} P(\chi^2_{j+2} > j\lambda) + 2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda) + 2k_0$$
for all $2 \le \lambda \le \log n$, and $2\sum_{j=21}^{K-k_0} P(\chi^2_{j+2} > j\lambda) \le 2.61$. We then have
$$g_0(2) - 2k_0 < 5.02 + 2.61 = 7.63 \quad \text{when } K - k_0 > 20. \qquad (6)$$
As suggested by (6), an upper bound on the probability of selecting the correct model $M_{k_0}$ by adaptive penalty selection is 0.91, obtained by adding up $P(|\hat M(2)| = k)$ for $k = k_0, \ldots, k_0+3$ in Table 2. As revealed in Table 1 of Section 2, adaptive penalty selection improves on fixed-penalty $C_p$ by increasing the probability of correct selection by around 3.5%. This gain must come from models that include no more than 3 superfluous zero-coefficient explanatory variables.

We conclude with the following theorem.

**Theorem 1.** *When Assumption B, Assumption N1, and Assumption N2 hold, the adaptive penalty selection procedure with generalized degrees of freedom over $\lambda \in [2, \log n]$ is still a conservative model selection procedure. In terms of the probability of selecting overparametrized models, it improves on the fixed-penalty selection procedure $C_p$ but cannot match the fixed-penalty BIC.*

As a remark, a lower bound on the probability of selecting $M_{k_0}$ can also be obtained by the method used in the proof of Lemma 5. One only needs to replace $p_{2,1,20}(1)$ with $p_{2,1,K-k_0}(1)$ and $p_{2,1,19}(1)$ with $p_{2,1,K-k_0-1}(1)$ for $K - k_0 > 20$. Since $g_0(\lambda)$ increases with $n$, $\sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}$ increases accordingly. Hence, there is no need to adjust the probability $P(2 \le \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\})$ in the lower bound. A lower bound for $K - k_0 > 20$ is thus given by $p_{2,1,K-k_0}(1) + p_{2,1,K-k_0-1}(1) \cdot 0.04770$, and we have the approximate lower bound
$$P(\hat\lambda = \log n) = p_1(2,1,K-k_0) + p_1(2,1,K-k_0-1) \cdot P\Big(2 \le \chi^2_1 \le \sup\{\lambda > 0 : g_0(\lambda) - 2k_0 > \lambda\}\Big) \approx (0.7129 + 0.7151 \cdot 0.04770)(1 - 0.0119) = 0.7382.$$
This approximate lower bound decreases only slightly when $K - k_0 > 20$.

Note that the difference between $p_{2,1,K-k_0}(1)$ and $p_{2,1,20}(1)$ is small. This is because, for $j > 0$ and $\lambda > 2$,
$$p_{\lambda,1,K-k_0}(1) - p_{\lambda,1,K-k_0+j}(1)$$
$$= P(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_K \le \lambda(K-k_0)) - P(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_K \le \lambda(K-k_0),\ V_{k_0+1} + \cdots + V_{K+j} \le \lambda(K-k_0+j))$$
$$= P(V_{k_0+1} \le \lambda, \ldots, V_{k_0+1} + \cdots + V_K \le \lambda(K-k_0),\ V_{k_0+1} + \cdots + V_{K+j} > \lambda(K-k_0+j))$$
$$\le P(V_{k_0+1} + \cdots + V_{K+j} > \lambda(K-k_0+j)).$$
The last probability is close to 0 by the law of large numbers. For example, when $\lambda \ge 2$ and $K - k_0 = 15$, $p_{\lambda,1,K-k_0}(1) - p_{\lambda,1,K-k_0+j}(1) \le P(V_{k_0+1} + \cdots + V_K > 2(15)) \approx 0.0119$ by R. By the same argument, the difference between $p_{\lambda,1,K-k_0}(\ell)$ and $p_{\lambda,1,K-k_0+j}(\ell)$ is also small for $\ell = 1, 2$.
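The tail probability in this example, $P(V_{k_0+1} + \cdots + V_K > 30) = P(\chi^2_{15} > 30)$ for $K - k_0 = 15$, can be checked by a quick Monte Carlo sketch, simulating each $\chi^2_1$ variable as a squared standard normal; the text reports roughly 0.0119:

```python
import random

# Monte Carlo check of the law-of-large-numbers tail bound:
# P(chi2_15 > 30), i.e. the sum of 15 squared standard normals
# exceeding twice its mean.  The text reports about 0.0119.
rng = random.Random(7)
T = 200_000
hits = 0
for _ in range(T):
    s = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(15))
    if s > 30.0:
        hits += 1
est = hits / T
print(est)
```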

**4** **Discussion and Conclusion**

Our main purpose in this paper is to show that a fully adaptive choice of penalty should be practiced with caution. In particular, we have demonstrated that a fully adaptive penalty selection procedure based on the unbiased risk estimator can be a bad practice for nested candidate models. The adaptive penalty selection procedure does, however, improve on the performance of $C_p$ when the range of the penalty is chosen properly. Even so, such a procedure still cannot achieve model selection consistency as the fixed-penalty BIC approach does.

A proper range of the penalty can be chosen using the little bootstrap or the data perturbation method suggested in Shen and Ye (2002). When the data perturbation method with the suggested perturbation variance $0.25\sigma^2$ is employed to estimate GDF over $\Lambda = [0, \log n]$, it outperforms the procedure of adding GDF, as reported in Table 3.

| $\lvert\hat M(\hat\lambda)\rvert$ | $\hat g_{0\tau}(\lambda)$ | $g_0(\lambda)$ | $\lvert\hat M(\hat\lambda)\rvert$ | $\hat g_{0\tau}(\lambda)$ | $g_0(\lambda)$ |
|---|---|---|---|---|---|
| $k_0$ | 0.7487 | 0.5259 | $k_0+11$ | 0.0071 | 0.0154 |
| $k_0+1$ | 0.0518 | 0.0611 | $k_0+12$ | 0.0082 | 0.0169 |
| $k_0+2$ | 0.0258 | 0.0348 | $k_0+13$ | 0.0075 | 0.0169 |
| $k_0+3$ | 0.0194 | 0.0271 | $k_0+14$ | 0.0061 | 0.0162 |
| $k_0+4$ | 0.0163 | 0.0235 | $k_0+15$ | 0.0065 | 0.0189 |
| $k_0+5$ | 0.0130 | 0.0189 | $k_0+16$ | 0.0072 | 0.0216 |
| $k_0+6$ | 0.0088 | 0.0159 | $k_0+17$ | 0.0076 | 0.0225 |
| $k_0+7$ | 0.0097 | 0.0168 | $k_0+18$ | 0.0092 | 0.0277 |
| $k_0+8$ | 0.0085 | 0.0132 | $k_0+19$ | 0.0086 | 0.0308 |
| $k_0+9$ | 0.0092 | 0.0174 | $k_0+20$ | 0.0127 | 0.0438 |
| $k_0+10$ | 0.0081 | 0.0149 | | | |

Table 3: The empirical distribution of the chosen model through adaptive model selection with $\hat g_{0\tau}(\lambda)$, $\tau/\sigma = 0.5$, when $\lambda \in [0, \log n]$, based on 10,000 simulated runs.
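To make the data perturbation idea concrete, the sketch below estimates GDF for a toy nested-selection procedure on an orthogonal design. The helper `fit_and_select`, the perturbation size `tau`, and the Monte Carlo settings are illustrative assumptions, not the exact setup of Shen and Ye (2002); the estimator itself is the perturbation form $\widehat{\mathrm{GDF}} = \sum_i \operatorname{cov}\big(\hat\mu_i(y+\delta),\, \delta_i\big)/\tau^2$ in the spirit of Ye (1998):

```python
import random

def fit_and_select(y, lam):
    """Toy nested model selection on an orthogonal design: y holds the
    marginal coefficient estimates, and model M_k keeps the first k of
    them.  Choose k minimizing RSS(M_k) + lam * k (sigma^2 = 1 assumed),
    then return the fitted mean (selected coordinates kept, rest zeroed)."""
    K = len(y)
    best_k, best_crit = 0, sum(v * v for v in y)  # null model M_0
    rss = best_crit
    for k in range(1, K + 1):
        rss -= y[k - 1] ** 2          # RSS(M_k) = sum of dropped squares
        crit = rss + lam * k
        if crit < best_crit:
            best_k, best_crit = k, crit
    return [y[i] if i < best_k else 0.0 for i in range(K)]

def gdf_perturbation(y, lam, tau=0.5, T=400, seed=1):
    """Data perturbation estimate of generalized degrees of freedom:
    average delta_i * mu_hat_i(y + delta) over draws delta ~ N(0, tau^2 I)
    and divide by tau^2 (delta has mean zero, so no centering is needed)."""
    rng = random.Random(seed)
    K = len(y)
    cov_acc = [0.0] * K
    for _ in range(T):
        delta = [rng.gauss(0.0, tau) for _ in range(K)]
        mu_hat = fit_and_select([yi + di for yi, di in zip(y, delta)], lam)
        for i in range(K):
            cov_acc[i] += delta[i] * mu_hat[i]
    return sum(cov_acc) / (T * tau * tau)
```

For a response with a few strong signal coordinates, the resulting estimate typically exceeds the number of selected variables, reflecting the selection cost that a fixed-penalty unbiased risk estimate ignores; $\tau = 0.5$ matches the $\tau/\sigma = 0.5$ used for Table 3 when $\sigma = 1$.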

**5** **Appendix**

**5.1** **Proof of Lemma 3**

It follows easily from Lemma 2 that
$$P\left(|\hat M(\log n)| \ge |M_{k_0}|\right) \le P\left(|\hat M(\lambda)| \ge |M_{k_0}| \text{ for all } \lambda \in [0, \log n]\right),$$
or, taking complements,
$$P\left(|\hat M(\log n)| < |M_{k_0}|\right) \ge P\left(|\hat M(\lambda)| < |M_{k_0}| \text{ for some } \lambda \in [0, \log n]\right).$$
Observe that
$$P\left(|\hat M(\lambda)| < |M_{k_0}| \text{ for some } \lambda \in [0, \log n]\right) \le P\left(|\hat M(\log n)| < |M_{k_0}|\right) \le \sum_{k=1}^{k_0-1} P\left(|\hat M(\log n)| = k\right)$$
$$\le \sum_{k=1}^{k_0-1} P\left(RSS(M_k) - RSS(M_{k_0}) \le (k_0 - k)\sigma^2 \log n\right)$$
$$= \sum_{k=1}^{k_0-1} P\left(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 + \epsilon^T(I - P_k)\epsilon \le (k_0 - k)\sigma^2 \log n\right)$$
$$\le \sum_{k=1}^{k_0-1} P\left(\mu_0^T(I - P_k)\mu_0 + 2\epsilon^T(I - P_k)\mu_0 \le k_0\sigma^2 \log n\right),$$
where the last inequality holds because $\epsilon^T(I - P_k)\epsilon \ge 0$ and $k_0 - k \le k_0$.