
PREPRINT

Preprint, Department of Mathematics, National Taiwan University

www.math.ntu.edu.tw/~mathlib/preprint/2012-04.pdf

Variable selection in linear regression through adaptive penalty selection

Hung Chen and Chuan-Fa Tang

September 06, 2012


Variable selection in linear regression through adaptive penalty selection

Hung Chen (1) and Chuan-Fa Tang (2)

(1) Department of Mathematics, National Taiwan University, Taipei, 11167, Taiwan

(2) Department of Statistics, University of South Carolina, Columbia, SC 29208, USA

Abstract

Model selection procedures often use a fixed penalty, such as Mallows’ Cp, to avoid choosing a model which fits a particular data set extremely well. These procedures are often devised to give an unbiased risk estimate when a particular chosen model is used to predict future responses.

As a correction for not including the variability induced by model selection, generalized degrees of freedom is introduced in Ye (1998) as an estimate of the model selection uncertainty that arises from using the same data for both model selection and the associated parameter estimation. Building on generalized degrees of freedom, Shen and Ye (2002) proposed a data-adaptive complexity penalty.

In this article, we evaluate the validity of such an approach for model selection in linear regression when the set of candidate models is nested and includes the true model. It is found that the performance of such an approach is even worse than that of Mallows’ Cp in terms of the probability of correct selection. However, this approach, coupled with a proper selection of the range of the penalty or with the little bootstrap proposed in Breiman (1992), performs better than Cp by increasing the probability of correct selection, but it still cannot match BIC in achieving model selection consistency.

Keywords: Adaptive penalty selection; Mallows’ Cp; Model selection consistency; little bootstrap; generalized degrees of freedom


1 Introduction

In analyzing data from complex scientific experiments or observational studies, one often encounters the problem of choosing one plausible linear regression model from a sequence of nested candidate models. When the exact true model is among the candidate models, BIC (Schwarz, 1978) has been shown to be a consistent model selector in Nishii (1984), while Cp (Mallows, 1973) is shown to be a conservative model selector in Woodroofe (1982), tending to pick a slightly overfitted model. From BIC to Cp (AIC), the penalty placed on model dimension changes from log n to 2, where n represents the number of observations. Many efforts have been made to understand in what circumstances these criteria allow one to identify the right model asymptotically.

Liu and Yang (2011) proposed a model selector in terms of a data-adaptive choice of penalty between AIC and BIC. When the exact true model is among the candidate models, the newly proposed selector is a consistent model selector. Since both the number of candidate predictors and the sample size influence the performance of any model selector, an alternative approach is to choose the penalty adaptively. Shen and Ye (2002) proposed such an approach in which the restriction on the penalty is lifted and the generalized degrees of freedom (GDF) is used to compensate for possible overfitting due to model selection. To avoid overfitting the available data caused by the collection of candidate models, the little bootstrap has been advocated in Breiman (1992) to decrease possible overfitting due to multi-model inference. Since GDF is often difficult to compute, Shen and Ye (2002) demonstrated that the use of the little bootstrap (i.e., data perturbation) to replace the calculation of GDF can lead to a better regression model in terms of mean prediction error as compared to the use of Cp or BIC.

In this paper, we evaluate whether the adaptive penalty selection procedure proposed in Shen and Ye (2002) leads to a consistent model selector or merely reduces the overfitting of multi-model inference for nested candidate models when the candidate models include the correct model. Based on the results in this paper, we recommend the use of the little bootstrap (data perturbation) instead of calculating GDF when a fully adaptive penalty procedure is applied to nested candidate models.

The nested candidate models are of the form

Y = X_K β + ϵ,   (1)

where Y = (Y_1, . . . , Y_n)^T is the response vector, E(Y) = µ_0 = (µ_1, . . . , µ_n)^T, ϵ = (ϵ_1, . . . , ϵ_n)^T ∼ N_n(0, σ²I_n), x_j = (x_{1j}, . . . , x_{nj})^T for j = 1, . . . , K, X_K = (x_1, . . . , x_K) is an n × K matrix of potential explanatory variables to be used in predicting the response vector linearly, and β = (β_1, . . . , β_K)^T is a vector of unknown coefficients. Here I_n is the n × n identity matrix and σ² is known. In this paper, we further assume that µ_0 = X_K β and that some of the coefficients may equal zero.


Consider an adaptive choice of λ over λ ∈ Λ, where Λ denotes the collection of penalties to choose from. It is a two-step procedure. At the first step, for each λ ∈ Λ, choose the optimal model M̂(λ) by minimizing (Y − μ̂_M)^T(Y − μ̂_M) + λ|M|σ² over M, where |M| denotes the number of unknown parameters used to prescribe model M. Then calculate g0(λ), which is twice the generalized degrees of freedom associated with penalty λ. At the second step, choose the optimal λ ∈ Λ based on the data as follows:

λ̂ = argmin_{λ∈Λ} { RSS(M̂(λ)) + g0(λ)·σ² },

where RSS(M̂(λ)) = ||Y − μ̂_{M̂(λ)}||², ∥·∥ is the usual L2 norm, μ̂_{M̂(λ)} = Σ_{k=1}^K μ̂_{M_k}·1{M̂(λ)=M_k}, 1{·} is the usual indicator function, and μ̂_{M_k} is the estimator of µ under model M_k. Here Λ = [0, ∞) or (0, ∞) as recommended in Shen and Ye (2002) and Shen and Huang (2006). When g0(λ) is difficult to compute, data perturbation is recommended.
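As a point of reference, the two-step procedure is straightforward to code. The sketch below is our own illustration in R (the software the paper uses for its numerical work): it assumes nested candidate models given by the first k columns of X_K, a known σ², and a user-supplied function g0 returning twice the generalized degrees of freedom for a given λ (for instance the analytic form of Lemma 4 below, or a data-perturbation estimate); the function and variable names are ours, not the paper's.

```r
# Minimal sketch of the two-step adaptive penalty selection for nested models
# M_1 subset ... subset M_K, with sigma2 known and g0(lambda) supplied.
adaptive_select <- function(Y, X, sigma2, g0, lambda_grid) {
  K <- ncol(X)
  # RSS(M_k) for the nested least-squares fits on the first k columns of X
  rss <- sapply(1:K, function(k) {
    sum(lm.fit(x = X[, 1:k, drop = FALSE], y = Y)$residuals^2)
  })
  # Step 1: for each lambda, M_hat(lambda) minimizes RSS(M_k) + lambda * k * sigma2
  khat <- sapply(lambda_grid, function(lam) which.min(rss + lam * (1:K) * sigma2))
  # Step 2: lambda_hat minimizes RSS(M_hat(lambda)) + g0(lambda) * sigma2
  crit <- rss[khat] + sapply(lambda_grid, g0) * sigma2
  best <- which.min(crit)
  list(lambda_hat = lambda_grid[best], model_size = khat[best])
}
```

The grid passed as lambda_grid plays the role of Λ; Sections 2.2 to 2.4 study how the choice of this range ([2, log n] versus [0, log n]) affects the selected model.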

Although GDF already takes into account the data-driven selection of the sequence of estimators (i.e., candidate models), it is found in this paper that the model selected by Cp is better than M̂(λ̂) in terms of the probability of selecting the correct model unless λ is restricted to fall between the AIC and BIC penalties. However, the model selector obtained by replacing g0(λ) with the estimate given by the little bootstrap can beat the performance of Cp in terms of the probability of selecting the correct model.

The rest of the paper is organized as follows. In Section 2, we introduce unbiased risk estimation and generalized degrees of freedom in the setting of model selection with nested linear regression models. It also addresses the performance of the adaptive penalty selection procedure in Shen and Ye (2002) with approximated GDF in terms of model selection consistency. Section 3 evaluates the performance of adaptive penalty selection with GDF. Section 4 concludes with a discussion on the merit of the data perturbation method. The Appendix contains technical proofs.

2 Adaptive penalty selection with nested candidate models

In this section, we start with the derivation of the generalized degrees of freedom defined in Ye (1998) through unbiased risk estimation, describe its potential benefit in increasing the probability of selecting the correct model, and point out the pitfalls of not imposing a proper constraint on the choice of penalty.

2.1 Unbiased risk estimation under nested linear regression model selection

The available data contain n observations, each with K explanatory variables, x_1, . . . , x_K, and a response y. Denote the matrix formed by the first k explanatory variables as X_k = (x_1, . . . , x_k) and β_k = (β_1, . . . , β_k)^T, where k = 1, . . . , K. We then have K candidate models, and the kth candidate model M_k is

Y = X_k β_k + ϵ.

The set of candidate models is denoted as M = {M_k | k = 1, . . . , K}, and the true model M_{k0} ∈ M. Namely, µ_0 = X_{k0} β_{k0} with 1 ≤ k0 ≤ K. For ease of presentation, denote the collection {M_j | k0 ≤ j ≤ K} by M_{k0}; it includes the true model and the overfitted models.

For the kth candidate model M_k with the least squares fit, we have μ̂_{M_k} = P_k Y, where P_k stands for the orthogonal projection operator onto the space spanned by X_k. We evaluate the model μ̂_{M_k} by its prediction accuracy. Denote the future response vector at X_K by Y^0 and assume that E(Y^0) = E(Y) and Var(Y^0) = σ²I_n. The unobservable prediction error is defined as PE(M_k) = ∥Y^0 − μ̂_{M_k}∥² and the residual sum of squares as RSS(M_k) = ∥Y − μ̂_{M_k}∥². It is well known that the residual sum of squares RSS(M_k) is often smaller than PE(M_k); namely, E[PE(M_k)] > E[RSS(M_k)]. Mallows (1973) proposed Cp by inflating RSS(M_k) with 2 tr(P_k)σ², where tr(P_k) is often called the degrees of freedom.

Since both PE(M_k) and RSS(M_k) are of quadratic form, it follows easily that E[PE(M_k)] = E[RSS(M_k)] + 2kσ².
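For completeness, here is a short derivation of this identity (a standard argument, not taken from the paper), writing μ̂ = μ̂_{M_k} = P_k Y, ϵ = Y − µ_0, and using the independence of Y^0 and Y:

\[
\begin{aligned}
E[PE(M_k)]  &= E\|Y^0-\mu_0\|^2 + E\|\mu_0-\hat\mu\|^2 = n\sigma^2 + E\|\mu_0-\hat\mu\|^2,\\
E[RSS(M_k)] &= E\|\epsilon\|^2 + E\|\mu_0-\hat\mu\|^2 - 2E[\epsilon^{T}(\hat\mu-\mu_0)]
             = n\sigma^2 + E\|\mu_0-\hat\mu\|^2 - 2k\sigma^2,
\end{aligned}
\]

since E[ϵ^T(μ̂ − µ_0)] = E[ϵ^T P_k ϵ] + E[ϵ^T](P_k − I_n)µ_0 = tr(P_k)σ² = kσ². Subtracting the second display from the first gives E[PE(M_k)] = E[RSS(M_k)] + 2kσ².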

We now derive the generalized degrees of freedom through an unbiased risk estimate over M when the true model is M_{k0}. For simplicity, we assume that σ² is known from now on. The model selected through Mallows’ Cp criterion among M is denoted as M̂(2). We can then write the post-model-selection estimator of µ_0 through Cp as μ̂_{M̂(2)} = Σ_{k=1}^K μ̂_{M_k}·1{M̂(2)=M_k}.

The prediction risk is defined as follows:

RISK(Cp) = E[ (Y^0 − μ̂_{M̂(2)})^T (Y^0 − μ̂_{M̂(2)}) ].

By a standard argument, we have

RISK(Cp) = E[ (µ_0 − μ̂_{M̂(2)})^T (µ_0 − μ̂_{M̂(2)}) ] + nσ²
         = E[RSS(M̂(2))] + 2E[ϵ^T(μ̂_{M̂(2)} − µ_0)]
         = E[RSS(M̂(2))] + g0(2)·σ²,

where g0(2) = 2E[ϵ^T(μ̂_{M̂(2)} − µ_0)]/σ²; that is, g0(2) is twice the GDF defined in Ye (1998). In general, the calculation of the GDF associated with Cp can be complicated, but its analytic form can be obtained under the settings of this paper.

When K − k0 = 20 and n = 404, g0(2) is found to be around 2k0 + 5.02, as given in Lemma 4. In other words, the unbiased risk estimator when using Cp over M should be RSS(M̂(2)) + g0(2)·σ² instead of RSS(M̂(2)) + 2|M̂(2)|σ², where |M̂(2)| is the total number of unknown parameters of the model chosen by Cp. It will be shown in Lemma 4(a) that g0(2) ≥ 2k0 under some regularity conditions.

Since BIC achieves model selection consistency as shown in Schwarz (1978), its associated unbiased risk estimator will be RSS(M̂(log n)) + 2k0σ². This leads to the proposal in Shen and Ye (2002) of employing an adaptive choice of penalty incorporating GDF. We now investigate whether the adaptive penalty selection proposal incorporating GDF in Shen and Ye (2002) is a consistent model selection procedure.

2.2 Adaptive choice of λ over Λ = {0, 2, log n}

As an illustration, we consider how an adaptive choice of penalty, λ ∈ Λ = {0, 2, log n}, with GDF improves on a fixed choice of penalty. Consider k0 = 1 and K − k0 = 20, a setting in which selecting an underfitted model is not possible. When λ = 0, model M_K will always be chosen since it gives the smallest residual sum of squares. Then g0(0) = 2K, which coincides with the penalty 2|M_K| that Mallows’ Cp assigns to the full model. When λ = log n, we have M̂(λ) = M_{k0} with probability close to 1, and then g0(log n) ≈ 2k0. When λ = 2, Woodroofe (1982) shows that the probability of selecting the correct model is slightly larger than 0.7, while no more than one superfluous explanatory variable is included on average.

Then g0(2) ≈ 2k0 + 5.02, where the constant 5.02 depends on K − k0 (here K − k0 = 20). It will be shown that the adaptive penalty selection procedure with GDF improves on Cp by increasing the probability of selecting the correct model M_{k0} when K − k0 ≥ 20. This improvement mainly comes from reducing the probability that Cp selects M_{k0+1}. Technical details will be provided later on.

We now give a lemma which will be used repeatedly to give an upper bound on |M̂(λ)| − k0. It can also be used to give an upper bound on the probability of choosing M_{k0} when the adaptive penalty selection procedure is used.

Lemma 1. Suppose that log n ∈ Λ, λ ≤ log n for all λ ∈ Λ, and M̂(log n) = M_{k0}.

(a) For all λ ∈ Λ,

P( M̂(λ̂) = M_{k0} ) ≤ P( λ(|M̂(λ)| − |M_{k0}|) ≤ g0(λ) − g0(log n) ).

(b) M̂(λ̂) ≠ M_{k0} if there exists λ ∈ Λ such that λ(|M̂(λ)| − |M_{k0}|) > g0(λ) − g0(log n).

Proof. Recall that M̂(λ) = argmin_{M_k ∈ M} { RSS(M_k) + λ|M_k|σ² } for any given λ ∈ Λ; that is, M̂(λ) is the chosen model for the given λ. We conclude that RSS(M̂(λ)) + λ|M̂(λ)|σ² ≤ RSS(M_{k0}) + λ|M_{k0}|σ². Hence, for all λ,

λ[ |M̂(λ)| − |M_{k0}| ]σ² ≤ RSS(M_{k0}) − RSS(M̂(λ)).   (2)

When the event {λ̂ = log n} occurs, the following must hold:

RSS(M̂(λ)) + g0(λ)σ² ≥ RSS(M̂(log n)) + g0(log n)σ² for all λ.

Hence, P(λ̂ = log n) ≤ P( RSS(M̂(λ)) + g0(λ)σ² ≥ RSS(M̂(log n)) + g0(log n)σ² ). We conclude that

P(λ̂ = log n) ≤ P( RSS(M̂(log n)) − RSS(M̂(λ)) ≤ [g0(λ) − g0(log n)]σ² )
            ≤ P( λ(|M̂(λ)| − |M_{k0}|)σ² ≤ RSS(M̂(log n)) − RSS(M̂(λ)) ≤ [g0(λ) − g0(log n)]σ² )
            ≤ P( λ(|M̂(λ)| − |M_{k0}|) ≤ g0(λ) − g0(log n) ).

The second inequality holds by (2) and the assumption M̂(log n) = M_{k0}. We conclude the proof of this lemma. □

The next lemma states that |M̂(λ)| is a non-increasing step function of λ.

Lemma 2. When the set of candidate models M is a sequence of nested linear regression models, |M̂(λ)| is a non-increasing step function over Λ. When |M̂(λ′)| = k for some 1 ≤ k ≤ K, |M̂(λ)| ≤ k for all λ ≥ λ′.

Proof. Recall that M̂(λ) denotes the selected model for a given penalty λ|M_k|σ². It follows from Proposition 5.1 of Breiman (1992) that RSS(M̂(λ)) must be at an extreme point of the lower convex envelope of the graph {(k, RSS(M_k)) : k = 1, . . . , K}. Hence, this lemma holds. □

Adaptive choice of λ over Λ = {0, 2, log n} depends on the comparison of RSS(M̂(0)) + g0(0)σ², RSS(M̂(log n)) + g0(log n)σ², and RSS(M̂(2)) + g0(2)σ². We start with the comparison of RSS(M̂(0)) + g0(0)σ² and RSS(M̂(log n)) + g0(log n)σ². Note that, for K − k0 = 20,

P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ² )
= P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ², M̂(log n) = M_{k0} )
  + P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ², M̂(log n) ≠ M_{k0} )
≥ P( Y^T(P_K − P_{k0})Y < 40σ², M̂(log n) = M_{k0} ).

Since Y^T(P_K − P_{k0})Y/σ² is a chi-square random variable with 20 degrees of freedom, it follows from the consistency of BIC on model selection that P( RSS(M̂(0)) + g0(0)σ² > RSS(M̂(log n)) + g0(log n)σ² ) is close to 1.

Next, we evaluate the difference between RSS(M̂(2)) + g0(2)σ² and RSS(M̂(log n)) + g0(log n)σ². When M̂(log n) = M_{k0} and M̂(2) = M_k with k − k0 ≥ 3, it follows from Lemma 1 that 2(|M̂(2)| − |M_{k0}|) ≥ 6 > g0(2) − g0(log n), so M̂(λ̂) cannot be M_{k0}. Since g0(2) − 2k0 ≈ 5.02, the adaptive choice of λ over {0, 2, log n} might increase the probability of selecting M_{k0}; this increase must come from the occurrence of the event {1 ≤ |M̂(2)| − k0 ≤ 2}. That is, it does improve on Cp in terms of the probability of correct selection. Refer to Lemma 5 for further details.


2.3 Adaptive choice of λ over Λ = [2, log n]

The choice of Λ can be difficult in practice if users do not know the behavior of fixed-penalty model selection criteria. A natural choice of Λ is [0, ∞), as suggested in Shen and Ye (2002). In this subsection, Λ is chosen to be [2, log n]. The case Λ = [0, log n] is deferred to the next subsection.

Note that g0(λ) = 2 Σ_{k=1}^{K−k0} P( χ²_{k+2} > kλ ) + 2k0 under some regularity conditions, by Lemma 4(a) in Section 3.
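This expression is easy to evaluate numerically. The R snippet below (our own check, using only base R) computes g0(λ) − 2k0 for K − k0 = 20 and recovers the constant quoted for λ = 2; the same function underlies the plot in Figure 1.

```r
# g0(lambda) - 2*k0 = 2 * sum_{j=1}^{K-k0} P(chi^2_{j+2} > j*lambda), by Lemma 4(a)
g0_excess <- function(lambda, K_minus_k0 = 20) {
  j <- seq_len(K_minus_k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE))
}
g0_excess(2)        # approximately 5.02, the constant used throughout Section 2
g0_excess(log(404)) # about 0.27, so g0(log n) is only slightly above 2*k0
```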

When Λ = [2, log n], the adaptive choice of λ, λ̂, is defined as follows:

λ̂ = argmin_{λ∈[2, log n]} { RSS(M̂(λ)) + g0(λ)σ² }.

When the event {M̂(2) = M_{k0}} occurs, it follows from Lemma 2 and the fact that g0(λ) is decreasing that λ̂ = log n. Hence, the adaptive penalty selection procedure selects M_{k0} whenever Mallows’ Cp selects M_{k0}. This shows that adaptive penalty selection with λ ∈ [2, log n] does improve on Cp in the probability of correct selection. Table 1 presents the findings of a simulation experiment; the general discussion is left to Section 3. The simulation result shows that the adaptive choice of λ over [2, log n] performs better than Cp but cannot match BIC. Compared with Cp, the adaptive choice of λ over [2, log n] increases the probability of correct selection from 71.29% to 74.88%, and this improvement is observed consistently over simulations. Section 3 quantifies the increase in the probability of selecting M_{k0} for the adaptive choice of λ over [2, log n] relative to the fixed penalty 2; the bound matches well with the above simulation experiment.

2.4 Adaptive choice of λ over Λ = [0, log n]

Woodroofe (1982) showed that Mallows’ Cp selection procedure can be viewed as finding the global minimum, over {k0, . . . , K}, of a random walk with negative drift whose increments are the squared standardized coefficient estimates minus 2, i.e., [(β̂_k − β_k)²/Var(β̂_k)] − 2. When λ = 0, the procedure selects M_K, the largest candidate model. Section 2.2 shows that g0(0) = 2K, which is twice the GDF defined in Ye (1998).

As in the last subsection, we again use a simulation study to evaluate the adaptive penalty selection procedure when Λ = [0, log n]. The findings are summarized in Table 2. Clearly, adaptive penalty selection of λ over [0, log n] not only fails to improve on Cp but decreases the probability of selecting M_{k0} to about 52.6%, as shown in Table 2. This finding is consistent with the simulation results in Atkinson (1980) and Zhang (1992). Both papers suggested that the penalty factor for Atkinson's generalized Mallows’ Cp statistic should be chosen between 1.5 and 6.

When λ = 1, the size of the selected model is the one that reaches the global minimum of a random walk without drift over {k0, . . . , K}. This leads to an overfitted model with many superfluous explanatory variables. We now employ Lemma 1 to explain why adaptive penalty selection performs worse than Cp when Λ = [0, log n].


|M̂(λ̂)|    λ ∈ [2, log n]   λ = 2        |M̂(λ̂)|    λ ∈ [2, log n]   λ = 2
k0          0.7488           0.7129       k0 + 11    0.0016           0.0016
k0 + 1      0.0834           0.1116       k0 + 12    0.0011           0.0011
k0 + 2      0.0513           0.0578       k0 + 13    0.0022           0.0022
k0 + 3      0.0345           0.0352       k0 + 14    0.0014           0.0014
k0 + 4      0.0237           0.0240       k0 + 15    0.0014           0.0014
k0 + 5      0.0147           0.0149       k0 + 16    0.0015           0.0015
k0 + 6      0.0091           0.0090       k0 + 17    0.0007           0.0007
k0 + 7      0.0081           0.0082       k0 + 18    0.0006           0.0006
k0 + 8      0.0058           0.0057       k0 + 19    0.0009           0.0009
k0 + 9      0.0054           0.0055       k0 + 20    0.0002           0.0002
k0 + 10     0.0036           0.0036

Table 1: The empirical distribution of the chosen model through adaptive model selection when λ ∈ [2, log n], based on 100,000 simulated runs.

|M̂(λ̂)|    λ ∈ [0, log n]   λ = 2        |M̂(λ̂)|    λ ∈ [0, log n]   λ = 2
k0          0.5259           0.7129       k0 + 11    0.0154           0.0016
k0 + 1      0.0611           0.1116       k0 + 12    0.0169           0.0011
k0 + 2      0.0348           0.0578       k0 + 13    0.0169           0.0022
k0 + 3      0.0271           0.0352       k0 + 14    0.0162           0.0014
k0 + 4      0.0235           0.0240       k0 + 15    0.0189           0.0014
k0 + 5      0.0189           0.0149       k0 + 16    0.0216           0.0015
k0 + 6      0.0159           0.0090       k0 + 17    0.0225           0.0007
k0 + 7      0.0168           0.0082       k0 + 18    0.0277           0.0006
k0 + 8      0.0132           0.0057       k0 + 19    0.0308           0.0009
k0 + 9      0.0174           0.0055       k0 + 20    0.0438           0.0002
k0 + 10     0.0149           0.0036

Table 2: The empirical distribution of the chosen model through adaptive model selection when λ ∈ [0, log n], based on 100,000 simulated runs.
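Because only overfitting is at stake here (k0 = 1, so no candidate model is underfitted) and, for k > k0, the drops RSS(M_{k−1}) − RSS(M_k) are i.i.d. σ²χ²_1, the empirical distributions in Tables 1 and 2 can be reproduced without simulating a full regression. The sketch below is our own R illustration of how such a simulation could be run (the paper does not provide its code, and details such as the λ grid resolution and the number of runs are our assumptions).

```r
set.seed(1)
k0 <- 1; Kmk0 <- 20; n <- 404
nsim <- 10000                                  # the paper uses 100,000 runs
g0 <- function(lambda) {                       # twice the GDF, Lemma 4(a)
  j <- seq_len(Kmk0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE)) + 2 * k0
}
lam_grid <- seq(2, log(n), length.out = 200)   # use seq(0, log(n), ...) for Table 2
g0_grid <- sapply(lam_grid, g0)

sizes <- replicate(nsim, {
  V <- rchisq(Kmk0, df = 1)                    # (RSS(M_{k-1}) - RSS(M_k))/sigma^2, k > k0
  rss_drop <- c(0, cumsum(V))                  # (RSS(M_{k0}) - RSS(M_{k0+m}))/sigma^2, m = 0..Kmk0
  m <- 0:Kmk0
  cp <- which.min(-rss_drop + 2 * m) - 1       # Cp (lambda = 2): number of superfluous variables
  khat <- sapply(lam_grid, function(l) which.min(-rss_drop + l * m) - 1)  # M_hat(lambda)
  ada <- khat[which.min(-rss_drop[khat + 1] + g0_grid)]                   # adaptive lambda_hat
  c(cp = cp, adaptive = ada)
})
table(sizes["cp", ]) / nsim        # compare with the lambda = 2 column of Table 1
table(sizes["adaptive", ]) / nsim  # compare with the lambda in [2, log n] column
```

With the grid over [2, log n], the frequencies at |M̂(λ̂)| = k0 should come out near 0.71 for Cp and 0.75 for the adaptive rule, matching Table 1; switching the grid to [0, log n] should reproduce the deterioration reported in Table 2.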


Recall that g0(λ) = 2E[ϵ^T(μ̂_{M̂(λ)} − µ_0)]/σ², where μ̂_{M̂(λ)} = Σ_{k=1}^K μ̂_{M_k}·1{M̂(λ)=M_k}. Note that P_k does not depend on ϵ, and hence

E[ϵ^T(μ̂_{M̂(λ)} − µ_0)] = E[ Σ_{k=1}^K ϵ^T P_k ϵ · 1{M̂(λ)=M_k} ] = σ²g0(λ)/2 > 0.

When |M̂(λ)| ≥ k0, g0(λ) must be greater than 2k0. Since ϵ^T(μ̂_{M̂(λ)} − µ_0) is a mixture of χ² random variables and cannot be observed, it is estimated by its expected value, as suggested in Mallows (1973). Note that g0(λ) − 2k0 equals 25.35, 21.98, 18.78, 15.98, 13.36, and 11.22 as λ takes the values 1.0, 1.1, 1.2, 1.3, 1.4, and 1.5. Since the distribution of a χ²_1 random variable is positively skewed, the replacement of such a random variable by its mean can be problematic. Moreover, the following must hold for all λ ∈ Λ when M̂(λ̂) = M_{k0}:

|M̂(λ)| − k0 ≤ ⌊ (g0(λ) − 2k0) / λ ⌋,

where ⌊·⌋ denotes the integer part. When M̂(λ̂) = M_{k0}, it follows from Lemma 1 that |M̂(1.1)| − k0 ≤ 19, |M̂(1.2)| − k0 ≤ 15, |M̂(1.3)| − k0 ≤ 12, |M̂(1.4)| − k0 ≤ 9, and |M̂(1.5)| − k0 ≤ 7. These constraints reduce the probability of choosing M_{k0}, which provides an explanation of why adaptive penalty selection is so problematic when λ ∈ [1, 1.5] is included in Λ.
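These values, and the resulting bounds, are easy to recompute; the short R check below (ours) reuses the Lemma 4(a) formula with K − k0 = 20.

```r
g0_excess <- function(lambda, K_minus_k0 = 20) {   # g0(lambda) - 2*k0, Lemma 4(a)
  j <- seq_len(K_minus_k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE))
}
lam <- seq(1.0, 1.5, by = 0.1)
excess <- sapply(lam, g0_excess)
round(excess, 2)      # about 25.35 21.98 18.78 15.98 13.36 11.22, as quoted above
floor(excess / lam)   # 25 19 15 12 9 7: the bounds on |M_hat(lambda)| - k0
```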

3 Performance of Adaptive Penalty Selection with GDF

In this section, we start with assumptions which ensure that BIC is a consistent model selector and that Cp never misses an explanatory variable with a nonzero coefficient. Under the setting of this paper, the analytic form of g0(λ) is available, and its numerical values can be obtained easily with any statistical software. To evaluate whether the method of adaptive penalty selection can improve on fixed penalty selection, we combine a simulation study and analytic approximation for those (K, k0) with K − k0 ≥ 20. When K − k0 = 20, numerical values of g0(λ) are obtained with R. We then combine the numerical values of g0(λ) for K − k0 = 20 with the exponential inequality in Teicher (1984) to give a bound on g0(λ) for K − k0 > 20.

3.1 Key assumptions

We now state three assumptions on the class of competing models M. The following assumptions ensure that employing BIC leads to consistent model selection and that we never get an underfitted model when λ ∈ [0, log n]. They mainly prescribe the relation between n, the total number of observations, K, the total number of explanatory variables, and k0, the total number of nonzero regression coefficients.

As a remark, the setting of simulation studies in Section 2 satisfies these three assumptions.


Assumption B. X_K is of full rank and there exists a constant c > 0 such that µ_0^T(I_n − P_j)µ_0 ≥ cn for all j < k0, where µ_0 = X_{k0}β_{k0} is the mean vector of the true model.

Assumption N1. cn > 2k0 log n.

Assumption N2. log n > 2 log(K− k0).

Assumptions B and N1 quantify the magnitude of the bias when the fitted model leaves out an explanatory variable with a nonzero coefficient, while Assumption N2 is specified to guard against the inclusion of the zero-coefficient explanatory variables through max_{k0≤j≤K} |β̂_j|. When these assumptions hold, BIC selects the correct model with probability tending to one as n goes to infinity. Moreover, for all λ ∈ [0, log n], the probability of the event {|M̂(λ)| ≤ k0 − 1} goes to zero, as shown in the following lemma. Its proof is deferred to the Appendix.

Lemma 3. When k0 > 1 and Assumptions B and N1 hold,

P( |M̂(λ)| ≥ k0 for all λ ∈ [0, log n] ) → 1

as n goes to infinity. When Assumption N2 also holds, P( M̂(log n) = M_{k0} ) → 1 as n goes to infinity.

Under the above assumptions, we only need to address adaptive penalty selection over Λ = [0, log n] within the collection M_{k0}, that is, the inclusion of superfluous explanatory variables. It remains to study

M̂(λ) = argmin_{M_k ∈ M_{k0}} { RSS(M_k) + λkσ² }

and λ̂, which is defined as

λ̂ = argmin_{λ∈Λ} { RSS(M̂(λ)) + g0(λ)σ² },

in order to quantify |M̂(λ)| − k0.

We now employ the argument in Woodroofe (1982) to calculate g0(λ), defined by 2E[ϵ^T(μ̂_{M̂(λ)} − µ_0)]/σ², over M_{k0}; details are deferred to the Appendix.

Lemma 4. When Assumptions B, N1, and N2 hold, the following statements hold.

(a) For λ ∈ [0, log n], g0(λ) = 2 Σ_{j=1}^{K−k0} P( χ²_{j+2} > jλ ) + 2k0, and g0(λ) is a strictly decreasing function of λ, where χ²_ℓ denotes a chi-square random variable with ℓ degrees of freedom.

(b) For λ ∈ [2, log n], 2 Σ_{j=21}^{K−k0} P( χ²_{j+2} > jλ ) ≤ 2.6102 when K − k0 ≥ 21.

Figure 1 gives the plot of (λ, g0(λ)) when K − k0 = 20, in which g0(λ) is approximated with R to an accuracy of four decimal places. Lemma 4(b) gives a general idea of the plot of (λ, g0(λ)) when K − k0 > 20.


[Figure 1 here: plot of g0(λ) against λ for 0 ≤ λ ≤ 6, with g0(λ) ranging from 0 to about 40.]

Figure 1: g0(λ) when K − k0 = 20.

3.2 Performance of adaptive penalty selection procedure

To evaluate the adaptive penalty selection procedure, we start with the case (K − k0, n) = (20, 404) and Λ = [2, log n]. Since log 404 > 2 log 20, Assumption N2 holds. Assume that Assumptions B and N1 hold. It follows from Lemma 3 that the adaptive penalty selection procedure in Shen and Ye (2002) will not miss any explanatory variable with a nonzero coefficient. Lemma 4(a) gives g0(λ) for all 0 ≤ λ ≤ log n, which can be obtained by numerical approximation using R.

The simulation results presented in Table 1 indicate that λ̂ is often smaller than log n, and M̂(λ̂) cannot be M_{k0} whenever |M̂(2)| > k0 + 2. An estimate of a lower bound on the probability of the event {M̂(λ̂) = M_{k0}} will be given.

When the event {M̂(2) = M_{k0}} occurs, it follows from Lemma 2 and the fact that g0(λ) is decreasing that λ̂ = log n and M̂(λ̂) = M_{k0}. Hence, adaptive penalty selection selects M_{k0} whenever Mallows’ Cp selects M_{k0}. We now demonstrate how adaptive penalty selection improves over Cp when Cp overfits by one explanatory variable, that is, when the event {M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0}} occurs. For i = k0 + 1, . . . , K, define V_i = [RSS(M_{i−1}) − RSS(M_i)]/σ². Note that the V_i are i.i.d. chi-square random variables with one degree of freedom. When the event {M̂(2) = M_{k0+1}} occurs, it follows from Lemma 2 that M̂(λ) = M_{k0} when λ ∈ [V_{k0+1}, log n] and M̂(λ) = M_{k0+1} when λ ∈ [2, V_{k0+1}).

The following lemma gives a lower bound on the probability of the event {M̂(2) = M_{k0+1}, λ̂ = log n} when Λ = [2, log n]. Define p_{λ,k0,K−k0}(k) to be the probability that the event {M̂(λ) = M_k} occurs for a given penalty λ, where k = 1, . . . , K.

Lemma 5. Suppose Assumptions B, N1, and N2 hold and Λ = [2, log n]. A lower bound on the probability of {M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0}} is

p_{2,k0+1,K−k0−1}(k0 + 1) · P( 2 < V_{k0+1} ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} ).

Proof. Define E_1 = { M̂(λ) = M_{k0+1} for all λ ∈ [2, V_{k0+1}), M̂(λ̂) = M_{k0} } and E_2 = { M̂(λ) = M_{k0} for all λ ∈ [V_{k0+1}, log n], M̂(λ̂) = M_{k0} }. Then

E_1 = { RSS(M_{k0}) − RSS(M_{k0+1}) > λσ²,
        RSS(M_{k0+1}) − RSS(M_{k0+1+i}) ≤ iλσ² for all 0 < i ≤ K − k0 − 1,
        RSS(M_{k0}) + 2k0σ² ≤ RSS(M̂(λ)) + g0(λ)σ² for all λ ∈ [2, V_{k0+1}) }
    = { V_{k0+2} + ··· + V_{k0+2+i} ≤ (i + 1)V_{k0+1} for all 0 ≤ i ≤ K − k0 − 2,
        2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 }

and

E_2 = { RSS(M_{k0}) − RSS(M_{k0+i}) ≤ iλσ² for all 0 < i ≤ K − k0,
        RSS(M_{k0}) + 2k0σ² ≤ RSS(M̂(λ)) + g0(λ)σ² for all λ ∈ [V_{k0+1}, log n) }
    = { V_{k0+1} + ··· + V_{k0+1+i} ≤ (i + 1)V_{k0+1} for all 0 ≤ i ≤ K − k0 − 1 }.

Observe that

{ M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0} } = E_1 ∩ E_2
= { V_{k0+2} + ··· + V_{k0+2+i} ≤ (i + 1)V_{k0+1} for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0,
    V_{k0+1} + ··· + V_{k0+1+j} ≤ (j + 1)V_{k0+1} for all 0 ≤ j ≤ K − k0 − 1 }
⊃ { V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0,
    V_{k0+2} + ··· + V_{k0+1+j} ≤ jV_{k0+1} for all 1 ≤ j ≤ K − k0 − 1 }
⊃ { V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0,
    V_{k0+2} + ··· + V_{k0+1+j} ≤ 2j for all 1 ≤ j ≤ K − k0 − 1 }
⊃ { V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2,
    2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 }.

Then we have

P( M̂(2) = M_{k0+1}, M̂(λ̂) = M_{k0} ) ≥ P( 2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 )
    · P( V_{k0+2} + ··· + V_{k0+2+i} ≤ 2(i + 1) for all 0 ≤ i ≤ K − k0 − 2 ).   (3)

Note that the second factor on the right-hand side of (3) is equal to p_{2,k0+1,K−k0−1}(k0 + 1). We conclude the proof. □

Remark 1. When K − k0 = 20, p_{2,k0+1,19}(k0 + 1) is around 0.7151 by simulation. Since g0(λ) is strictly decreasing, a lower bound on the first factor of (3) is obtained as follows:

P( 2 < V_{k0+1} ≤ g0(V_{k0+1}) − 2k0 ) ≥ P( 2 < V_{k0+1} ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} ) ≈ P( 2 ≤ V_{k0+1} ≤ 2.56 ) ≈ 0.04770,   (4)

where 2.56 is a numerical approximation of sup{λ > 0 : g0(λ) − 2k0 > λ} for K − k0 = 20 and 0.04770 is a numerical approximation of P( 2 ≤ V_{k0+1} ≤ 2.56 ) obtained with R. It follows from (3) and (4) that, for Λ = [2, log n], P( M̂(2) = M_{k0+1}, λ̂ = log n ) ≥ 0.7151 · 0.04770 = 0.0341.
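Both numerical ingredients of (4) can be recomputed directly. The R snippet below (ours) finds the crossing point sup{λ > 0 : g0(λ) − 2k0 > λ} for K − k0 = 20 and the χ²_1 probability used in the bound.

```r
g0_excess <- function(lambda, K_minus_k0 = 20) {   # g0(lambda) - 2*k0, Lemma 4(a)
  j <- seq_len(K_minus_k0)
  2 * sum(pchisq(j * lambda, df = j + 2, lower.tail = FALSE))
}
# sup{lambda > 0 : g0(lambda) - 2*k0 > lambda} is the root of g0_excess(l) - l = 0
lam_star <- uniroot(function(l) g0_excess(l) - l, interval = c(2, 10))$root
lam_star                              # roughly 2.56
p <- diff(pchisq(c(2, lam_star), df = 1))
p                                     # close to the 0.04770 quoted in (4)
0.7151 * p                            # the lower bound of about 0.034 in Remark 1
```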

In fact, {M̂(λ̂) = M_{k0}} = ∪_{k0 ≤ k ≤ K} {M̂(2) = M_k, M̂(λ̂) = M_{k0}}. The adaptive penalty selection procedure therefore picks M_{k0} with probability

P( M̂(λ̂) = M_{k0} ) = Σ_{k0 ≤ k ≤ K} P( M̂(2) = M_k, M̂(λ̂) = M_{k0} )
  ≥ Σ_{k0 ≤ k ≤ k0+1} P( M̂(2) = M_k, M̂(λ̂) = M_{k0} )
  ≥ p_{2,k0,20}(k0) + p_{2,k0+1,19}(k0 + 1) · P( 2 ≤ χ²_1 ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} )
  ≈ 0.7129 + 0.7151 · 0.04770 = 0.7470.

Since P( M̂(λ̂) = M_{k0} ) ≥ p_{2,k0,K−k0}(k0), the adaptive penalty selector with λ ∈ [2, log n] improves on Cp in the probability of correct selection. Moreover, a lower bound on the increase in the probability of correct selection is around 0.034 for K − k0 = 20 and k0 = 1. We conclude that when λ ∈ [2, log n], the adaptive penalty selector improves on Cp by at least about 3.4% in the probability of correct selection. As suggested by the simulation study, the improvement mainly comes from the event {M̂(2) = M_{k0+1}, λ̂ = log n}.

Furthermore, we provide a crude upper bound obtained from Lemma 1. When K − k0 = 20, we have

g0(λ) = 2 Σ_{j=1}^{20} P( χ²_{j+2} > jλ ) + 2k0 for all 2 ≤ λ ≤ log n, and

g0(2) − 2k0 = 5.02 when K − k0 = 20.   (5)


As suggested by (5), an upper bound on the probability of selecting the correct model M_{k0} by adaptive penalty selection is 0.88, which is obtained by adding up P(|M̂(2)| = k) for k = k0, . . . , k0 + 2 in Table 2. As revealed in Table 1 of Section 2, the adaptive penalty selection procedure improves on fixed penalty Cp by increasing the probability of correct selection by around 3.5%. This gain must come from runs in which Cp includes no more than 2 superfluous zero-coefficient explanatory variables.

According to the above discussion and the results presented in Table 1, we have the following lemma.

Lemma 6. When K − k0 = 20, all three assumptions hold if n ≥ 404. The adaptive penalty selection procedure proposed in Shen and Ye (2002) improves over Mallows’ Cp in terms of the probability of correct selection when λ ∈ [2, log n]. However, it cannot match BIC in terms of model selection consistency.

We now compare the performance of adaptive penalty selection under K − k0 > 20 (where n > (K − k0)²) versus K − k0 = 20 (where n = 404). When K − k0 > 20, it follows from Lemma 4 that

g0(λ) = 2 [ Σ_{j=1}^{20} P( χ²_{j+2} > jλ ) + Σ_{j=21}^{K−k0} P( χ²_{j+2} > jλ ) ] + 2k0

for all 2 ≤ λ ≤ log n, and 2 Σ_{j=21}^{K−k0} P( χ²_{j+2} > jλ ) ≤ 2.61. We then have

g0(2) − 2k0 < 5.02 + 2.61 = 7.63 when K − k0 > 20.   (6)

As suggested by (6), an upper bound on the probability of selecting the correct model M_{k0} by adaptive penalty selection is 0.91, which is obtained by adding up P(|M̂(2)| = k) for k = k0, . . . , k0 + 3 in Table 2. As revealed in Table 1 of Section 2, adaptive penalty selection improves on fixed penalty Cp by increasing the probability of correct selection by around 3.5%. The gain must come from runs in which Cp includes no more than 3 superfluous zero-coefficient explanatory variables.

We conclude with the following theorem.

Theorem 1. When Assumptions B, N1, and N2 hold, the adaptive penalty selection procedure with generalized degrees of freedom over λ ∈ [2, log n] is still a conservative model selection procedure. In terms of the probability of selecting overparametrized models, it improves on the fixed penalty selection procedure Cp but cannot match the fixed penalty selection procedure BIC.

As a remark, a lower bound on the probability of selecting M_{k0} for K − k0 > 20 can also be obtained by the method used in the proof of Lemma 5. One only needs to replace p_{2,1,20}(1) with p_{2,1,K−k0}(1) and p_{2,1,19}(1) with p_{2,1,K−k0−1}(1). Since g0(λ) increases with K − k0, sup{λ > 0 : g0(λ) − 2k0 > λ} increases accordingly. Hence, there is no need to adjust the probability P( 2 ≤ χ²_1 ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} ) in the lower bound. A lower bound for K − k0 > 20 is therefore given by p_{2,1,K−k0}(1) + p_{2,1,K−k0−1}(1) · 0.04770, and we have the approximate lower bound

P( λ̂ = log n ) ≥ p_{2,1,K−k0}(1) + p_{2,1,K−k0−1}(1) · P( 2 ≤ χ²_1 ≤ sup{λ > 0 : g0(λ) − 2k0 > λ} )
              ≈ (0.7129 + 0.7151 · 0.04770)(1 − 0.0119) = 0.7382.

This approximate lower bound decreases only slightly when K − k0 > 20.

Note that the difference between p_{2,1,K−k0}(1) and p_{2,1,20}(1) is small. That is because, for j > 0 and λ > 2,

p_{λ,1,K−k0}(1) − p_{λ,1,K−k0+j}(1)
= P( V_{k0+1} ≤ λ, . . . , V_{k0+1} + ··· + V_K ≤ λ(K − k0) )
  − P( V_{k0+1} ≤ λ, . . . , V_{k0+1} + ··· + V_K ≤ λ(K − k0), V_{k0+1} + ··· + V_{K+j} ≤ λ(K − k0 + j) )
= P( V_{k0+1} ≤ λ, . . . , V_{k0+1} + ··· + V_K ≤ λ(K − k0), V_{k0+1} + ··· + V_{K+j} > λ(K − k0 + j) )
≤ P( V_{k0+1} + ··· + V_{K+j} > λ(K − k0 + j) ).

The last probability is close to 0 by the law of large numbers. For example, when λ ≥ 2 and K − k0 = 15, p_{λ,1,K−k0}(1) − p_{λ,1,K−k0+j}(1) ≤ P( V_{k0+1} + ··· + V_K > 2(15) ) ≈ 0.0119 by R. Using the same argument, we can show that the difference between p_{λ,1,K−k0}(ℓ) and p_{λ,1,K−k0+j}(ℓ) is small for ℓ = 1, 2.
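The 0.0119 figure can be checked directly, since V_{k0+1} + ··· + V_K is chi-square distributed with K − k0 = 15 degrees of freedom in this example:

```r
pchisq(2 * 15, df = 15, lower.tail = FALSE)   # P(chi^2_15 > 30), about 0.0119
```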

4 Discussion and Conclusion

Our main purpose in this paper is to show that a fully adaptive choice of penalty should be practiced with caution. In particular, we have demonstrated that a fully adaptive penalty selection procedure based on the unbiased risk estimator can be a bad practice for nested candidate models. The adaptive penalty selection procedure does improve on the performance of Cp when the range of the penalty is chosen properly. However, such a procedure still cannot achieve model selection consistency as the fixed penalty BIC approach does.

A proper selection of the range of the penalty can be achieved by using the little bootstrap, or data perturbation, suggested in Shen and Ye (2002). When the data perturbation method with the suggested perturbation variance 0.25σ² is employed to estimate GDF over Λ = [0, log n], it outperforms the procedure that adds the exact GDF, as reported in Table 3.
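For concreteness, the sketch below indicates one way such a data-perturbation estimate ĝ(λ) could be computed. It follows the generic Monte Carlo covariance form of the GDF estimator in Ye (1998) with perturbation standard deviation τ = 0.5σ; the exact estimator and implementation behind Table 3 are not spelled out here, so this R code is an assumption-laden illustration (all names are ours), not the authors' procedure.

```r
# Sketch: data-perturbation (little bootstrap) estimate of g0(lambda), i.e. twice
# the GDF of the lambda-penalized selection rule over nested candidate models.
# Assumptions (ours): tau = 0.5*sigma, B Monte Carlo perturbations, GDF estimated
# by the covariance form sum_i cov(muhat_i, delta_i)/tau^2, centered at the mean fit.
g_hat <- function(Y, X, sigma2, lambda, B = 100, tau = 0.5 * sqrt(sigma2)) {
  n <- length(Y); K <- ncol(X)
  fit_selected <- function(y) {                 # muhat_{M_hat(lambda)}(y)
    fits <- lapply(1:K, function(k) lm.fit(X[, 1:k, drop = FALSE], y))
    rss <- sapply(fits, function(f) sum(f$residuals^2))
    fits[[which.min(rss + lambda * (1:K) * sigma2)]]$fitted.values
  }
  delta <- matrix(rnorm(n * B, sd = tau), n, B)  # perturbations delta_b ~ N(0, tau^2 I)
  mu_pert <- sapply(1:B, function(b) fit_selected(Y + delta[, b]))   # n x B fits
  gdf <- sum(delta * (mu_pert - rowMeans(mu_pert))) / (B * tau^2)    # estimated GDF
  2 * gdf                                        # g0(lambda) is twice the GDF
}
```

Plugging ĝ(λ) in place of g0(λ) into the adaptive criterion λ̂ = argmin_λ {RSS(M̂(λ)) + ĝ(λ)σ²} gives the rule whose empirical distribution is reported in the ĝ(λ) column of Table 3.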


|M̂(λ̂)|    ĝ(λ)      g0(λ)        |M̂(λ̂)|    ĝ(λ)      g0(λ)
k0          0.7487    0.5259       k0 + 11    0.0071    0.0154
k0 + 1      0.0518    0.0611       k0 + 12    0.0082    0.0169
k0 + 2      0.0258    0.0348       k0 + 13    0.0075    0.0169
k0 + 3      0.0194    0.0271       k0 + 14    0.0061    0.0162
k0 + 4      0.0163    0.0235       k0 + 15    0.0065    0.0189
k0 + 5      0.0130    0.0189       k0 + 16    0.0072    0.0216
k0 + 6      0.0088    0.0159       k0 + 17    0.0076    0.0225
k0 + 7      0.0097    0.0168       k0 + 18    0.0092    0.0277
k0 + 8      0.0085    0.0132       k0 + 19    0.0086    0.0308
k0 + 9      0.0092    0.0174       k0 + 20    0.0127    0.0438
k0 + 10     0.0081    0.0149

Table 3: The empirical distribution of the chosen model through adaptive model selection with ĝ(λ) (data perturbation, τ/σ = 0.5) and with g0(λ), when λ ∈ [0, log n], based on 10,000 simulated runs.

5 Appendix

5.1 Proof of Lemma 3

It follows easily from Lemma 2 that

P( |M̂(log n)| ≥ |M_{k0}| ) ≤ P( |M̂(λ)| ≥ |M_{k0}| for all λ ∈ [0, log n] ),

or

P( |M̂(log n)| < |M_{k0}| ) ≥ P( |M̂(λ)| < |M_{k0}| for some λ ∈ [0, log n] ).

Observe that

P( |M̂(λ)| < |M_{k0}| for some λ ∈ [0, log n] ) ≤ P( |M̂(log n)| < |M_{k0}| )
≤ Σ_{k=1}^{k0−1} P( |M̂(log n)| = k )
≤ Σ_{k=1}^{k0−1} P( RSS(M_k) − RSS(M_{k0}) ≤ (k0 − k)σ² log n )
= Σ_{k=1}^{k0−1} P( µ_0^T(I − P_k)µ_0 + 2ϵ^T(I − P_k)µ_0 + ϵ^T(P_{k0} − P_k)ϵ ≤ (k0 − k)σ² log n )
≤ Σ_{k=1}^{k0−1} P( µ_0^T(I − P_k)µ_0 + 2ϵ^T(I − P_k)µ_0 ≤ k0σ² log n ).
