• 沒有找到結果。

高維度時間序列並帶有測量誤差模型之模型選擇

N/A
N/A
Protected

Academic year: 2022

Share "高維度時間序列並帶有測量誤差模型之模型選擇"

Copied!
35
0
0

加載中.... (立即查看全文)

全文

(1)

國立臺灣大學理學院應用數學科學研究所 碩士論文

Institute of Applied Mathematical Sciences College of Science

National Taiwan University Master Thesis

高維度時間序列並帶有測量誤差模型之模型選擇

Model Selection for High-Dimensional Time Series Models with Measurement Errors

黃學涵

Hsueh-Han Huang

指導教授:銀慶剛 博士

Advisor: Ching-Kang Ing, Ph.D.

(2)

摘要

我們使用一個叫做正交化貪婪演算法的快速逐步迴歸對高維度時間序列並帶有測量誤差 的模型做模型選擇。在一個弱稀疏的條件下,我們推導出了正交化貪婪演算法預測誤差的收 斂速度。在一個強稀疏的條件下,發展出一套擁有一致性的選模準則。

關鍵詞:高維度、測量誤差、正交化貪婪演算法、稀疏性、時間序列

(3)

Abstract

We use a fast stepwise regression method, called orthogonal greedy algorithm (OGA) to select variables for high-dimensional time series model with measurement errors. Under a weak sparsity condition, we derive a convergence rate of OGA, which is expressed in terms of the number of iterations, the sample size and the order of the moment imposed on the error process. Under a strong sparsity condition, we develop a consistent model selection procedure using OGA and a

high-dimensional information criterion.

Keywords: High-dimensional, measurement error, OGA, sparsity, time series.

(4)

目 錄

中文摘要……… i

英文摘要………. ii

1.Introduction……….. 1

2.OGA and Noiseless OGA……….. 3

3.Uniform Convergence Rate of Empirical Prediction Error………... 4

4.Sure Screening Property and Model Selection Consistency……….. 9

5 . S i m u l a t i o n S t u d i e s … … … . . 1 6 參考文獻……….…… 23

附錄………..………. .24

(5)

表 目 錄

Table1……… 18

Table2……… 21

Table3……….…. 22

(6)

1 Introduction

Consider the simple linear regression without intercept y = β

e

Tx e

+ ξ, where x

e

= (x1, x2, ..., xp)T is a p-dimensional random vector satisfying E(x e

) = 0

e

= (0, 0, ..., 0)T and E(x e x e

T) = Σx. β e

= (β1, β2, ..., βp)T is a p-dimensional constant vector. E(ξ) = 0, E(ξ2) = σξ2 > 0, and x

e

and ξ are independent.

Assume that w e

= x e

+η e

is observed instead of x e

, where w e

= (w1, w2, ..., wp)T, η

e

= (η1, η2, ..., ηp)T is a vector of measurement errors, and y? = y + ηy is ob- served instead of y, ηy is a measurement error, where E(η

e ) = 0

e , E(η

e η e

T) = Ση and η

e

is independent of (x e

, ξ), E(ηy) = 0, E(ηy2) = σ2ηy and ηy is inde- pendent of (x

e , η

e

, ξ). To make complicated things simple, we assume that Ση = diag(σ2η

1, ση2

2, ..., σ2η

p) is a diagonal matrix. Note that since ηy can be absorb into ξ, we still denote y as y? for simplicity and view ξ as the random errors after absorbing ηy.

If we regress y on w e

, then it follows that y = β

e

?Tw e

+ ξ?, where β

e

? = β

e

− U e

and U e

= (Σx + Ση)−1Σηβ e

, noting that β e

? = β

e if

∀i = 1, 2, ..., p, σηi = 0.

Let (yt, w e

T

t), t = 1, 2, ..., n, be observations, where w e

t= (wt1, wt2, ..., wtp)T. We allow (yt, x

e

T t, η

e

T

t, ξt) be a stationary time series and p >> n. When p is larger than n, there are computational difficulties in estimating the regres- sion coefficients by standard regression methods. Ing and Lai (2011) propose

(7)

the orthogonal greedy algorithm (OGA) to circumvent the computation in high dimensional inversion matrix. They derive the convergence rate of OGA and provide a consistent model selection procedure for high-dimensional time independent models. Ing and Huang (2016) generalize the results to multi- variate time series model setting and relax the moment bound assumptions from exponential moment bounds to polynomial moment bounds. However, none of them consider measurement errors in their models. Since we often face data with measurement errors that cannot be ignored in many applica- tions, recently, high-dimensional models with measurement errors has been widely studied. Loh and Wainwright (2012) propose a non-convex modifica- tion of Lasso for doing high-dimensional models with measurement errors and missing data, they also consider a time series model setting but only the cases within class of VAR(1) models with an upper restricted eigenvalue condition for sample covariance matrix. Datta and Zou (2016) propose a modification which is called Convex Conditioned Lasso (CoCoLasso) to circumvent the problem of non-convexity and the method can handle with a general class of corrupted data, but they only develop theories of the case that the true regressors are fixed design and there is no theory of model selection for a time series model setting. Belloni, Rosenbaum and Tsybakov (2014), (2016) use a Dantzig Selector type method named matrix uncertainty (MU) selector for doing high-dimensional model with measurement errors, but they do not con- sider a time series model setting. To our best knowledge, the existing papers that are related with high-dimensional model with measurement errors sel- dom consider a time series setting additionally, and none of them use greedy algorithm to do model selection with the previous model settings. This paper focuses on the OGA method and generalizes the results in Ing and Lai (2011) and Ing and Huang (2016) to a new dimension: high-dimensional time series

(8)

models with measurement errors under polynomial moment bound assump- tions.

In this paper, We provide an upper bound of the number of iterations and derive the uniform convergence rate of empirical prediction error of OGA un- der a weak sparsity condition. We prove the sure screening property of OGA under a strong sparsity condition. We propose an information criterion to do model selection, together with a trimming method, the whole procedure is shown to achieve the oracle property. We also provide simulation studies to show that with proper order of moment bounds, OGA+HDBIC+Trim successfully identifies the smallest correct model with high ratios in some general model settings. Although some additional conditions are needed, the necessary conditions for OGA to do consistent model selection for models with measurement errors remain simple.

The rest of this paper is organized as follows: in Section 2, we intro- duce OGA and noiseless OGA. In Section 3, we derive the convergence rate of OGA. In Section 4, we prove the sure screening property of OGA, and introduce our model selection criterion along the OGA path which is called high-dimensional information criterion (HDIC). We also proposed a trimming method to exclude redundant variables and prove that OGA+HDIC+Trim achieves model selection consistency. In Section 5, we present simulation studies to illustrate the performance of OGA+HDIC+Trim.

2 OGA and Noiseless OGA

In this secition, we briefly introduce OGA and noiseless OGA that are proposed by Ing and Lai (2011).

(9)

Denote ˆyk(w e

) as a sequence of linear approximations of the regression function y(w

e ) = β

e

?Tw e

. Initializing with ˆy0(·) = 0, it computes the residuals Ut(k) := yt− ˆyk(w

e

t), 1 ≤ t ≤ n, at the end of the kth iterations and chooses wt,ˆj

k+1 on which Ut(k) is regressed, such that ˆjk+1= arg min

1≤j≤p n

X

t=1

(Ut(k)− ˜βj(k)wtj)2,

where ˜βj(k)=

Pn

t=1Ut(k)wtj

Pn

t=1w2tj . We update ˆ

yk+1(w e

t) = ˆyk(w e

t) + ˆβˆ(k)

jk+1wt,ˆj

k+1

where ˆβˆ(k)

jk+1 =

Pn

t=1Ut(k)w

t,ˆjk+1

Pn t=1w⊥2

t,ˆjk+1

, w

t,ˆjk+1 is the tth component of vector wˆ

jk+1 = wˆj

k+1− ˆwˆj

k+1, ˆwˆj

k+1 is the projection of wˆj

k+1 into the linear space spanned by (wˆj1, wˆj2, ..., wˆjk), where wj = (w1j, w2j, ..., wnj)T. The orthogonalization of the predictor variables allows us to use componentwise linear regression to compute OLS, thereby circumventing the difficulties with computing the inverse of high-dimensional matrix.

Noiseless OGA is similar to OGA but replaces yt by its mean y(w e

t). In the next section, we’ll use noiseless OGA to derive the convergence rate of the empirical prediction error of OGA. More details of OGA and noiseless OGA can be found in Ing and Lai (2011).

3 Uniform Convergence Rate of Empirical Pre- diction Error

In this section, we derive the convergence rate for OGA in linear re- gression time series models with measurement errors in which the number of

(10)

regressors is allowed to be much larger than the number of observations.

According to OGA, ˆym(w e

t) = w e

T t( ˆJm) ˆβ

e

( ˆJm), where ˆJm is the index set of the variable selected by OGA after m iterations, w

e

t(J ) = (wti, i ∈ J )T and ˆβ

e

(J ) = (Pn t=1w

e

t(J )w e

T

t(J ))−1Pn t=1w

e

t(J )yt is the LSE based on model J. Let Kn denote a prescribed upper bound on the number m of OGA it- erations. To provide the uniform convergence rate of the empirical norm

1 n

Pn

t=1(ˆym(w e

t) − β e

?Tw e

t)2, 1 ≤ m ≤ Kn, we make the following assumptions below.

Assume {ξt}nt=1 is a martingale difference sequence with respect to an increasing sequence of σ-fields {Ft}, {ηti}nt=1, i = 1, 2, ..., n are martingale difference sequences with respect to an increasing sequence of σ-fields n ˜Fto

, wti, i = 1, 2, ..., n, are Ft−1-measurable, xti, i = 1, 2, ..., n, are ˜Ft−1-measurable and there exist q1, q2 with q2 > q1 ≥ 2 s.t.

(C1) max

1≤t≤n,1≤i≤pE|xti|2q1 = O(1), sup

1≤t<∞,1≤i≤p

E[|ηti|2q1| ˜Ft−1] ≤ C1 < ∞ a.s., for some C1 > 0, sup

1≤t<∞

E[|ξt|q1|Ft−1] ≤ C2 < ∞ a.s., for some C2 > 0,

(C2) max

1≤i,j≤pE|1nPn

t=1(wtiwtj− σij)|2q2 = O(1), max

1≤i,j≤pE|1nPn

t=1(xtixtj− σxij)|2q2 = O(1), where σij = E(wtiwtj), σxij = E(xtixtj).

Remark. If wtj has a linear representation

wtj =

X

k=−∞

a(k)αj(t − k)

(11)

where (αj(t), Ft) are martingale difference sequences with −∞ < t < ∞ and E[αj(t)2|Ft−1] = 1

and there exists a positive constant Cq2 s.t.

sup

−∞<t<∞

E[αj(t)4q2|Ft−1] ≤ Cq2

with the spectral density function of wtj, denoted fj is square integrable,

1≤j≤pmax

X

k=−∞

[E(wtjwt+k,j)]2 = max

1≤j≤p

1 2π

Z π

−π

fj2(λ)dλ = O(1).

Then, by (2.10) in Findley and Wei (1993), the first condition in (C2) holds.

(C3) ||β e

||1 < ∞.

This assumption is the weak sparsity condition on the uncontaminated re- gression coefficients.

(C4) ||U e

||1 = ||(Σx+ Ση)−1Σηβ e

||1 < ∞.

This assumption and (C4) assure the weak sparsity condition on the regres- sion coefficients contaminated by measurement errors.

Remark. There are many ways to achieve (C4), for example, if the values of measurement errors are restricted by the number of regressors, say

1≤i≤pmaxση2

i = O(1p), then (C4) holds, since ||β e

?||1 = ||β e

− (Σx+ Ση)−1Σηβ e

||1

||β e

||1(1 +√ p

1≤i≤pmax σ2ηi λminx)+ min

1≤i≤pσηi2 ). Another important example is the case that (xt1, xt2..., xtp) has special covariance structure, for example, uncorrelated structure. But, in general, (C4) does not hold without further conditions.

(C5) n

p

q12 → ∞ as n → ∞.

(12)

The following theorem gives the rate of convergence, which holds uniformly over 1 ≤ m ≤ Kn, for the empirical prediction error of OGA. The uni- form convergence rate varies with the prescribed order of moments q1 in (C1). When the order of moments q1 is smaller, the uniform convergence rate becomes larger due to the weaker moment assumptions. Define R(J ) = E(w

e

1(J )w e

1(J )T) and γ e

i(J ) = E(w1iw e

1(J )).

Theorem 1. Assume (C1)-(C5). Suppose Kn = O(q n

p

q12 ), and min

1≤#(J )≤Kn

λmin(R(J )) > δ, max

1≤#(J )≤Kn,i /∈J||R−1(J )γ e

i(J )||1 < C? < ∞, (3.1) for some δ, C? > 0. Then

1≤m≤Kmaxn

n−1Pn t=1ym(w

e

t)−β

e

?Tw

e

t)2 m−1+mn−1p

2 q1



= Op(1).

Proof.

1 n

Pn

t=1(ˆym(w e

t) − β e

?Tw e

t)2

= n1(Y (w) − HJˆmY )T(Y (w) − HJˆmY )

= n1YT(w)(I − HJˆm)Y (w) + n1(Y − Y (w))THJˆm(Y − Y (w)), where Y (w) = (w

e

T 1β

e

?, w e

T 2β

e

?, ..., w e

T nβ

e

?)T, Y = (y1, y2, ..., yn)T, and HJ is a projection matrix project vectors into the linear space spanned by (wi, i ∈ J ), where wi = (w1i, w2i, ..., wni)T. Let

µJ,i= YT(w)(I−HJ)wi

n12||wi|| , ˆµJ,i= YT(I−HJ)wi

n12||wi|| ,

where || · || = || · ||2 denotes the L2-norm in this paper. Consider two events An(k) =

(

max

(J,i):#(J )≤k−1,i /∈J|ˆµJ,i− µJ,i| ≤ s(

r

p

2 q1

n ) )

,

Bn(k) = (

0≤i≤k−1min max

1≤j≤pJˆi,j| > ˜ξ0s(

r

p

q12

n ) )

,

(13)

where s is a positive constant independent of n and k, ˜ξ0 = 2/(1 − ξ0), for 0 < ξ0 < 1.

On An(m) ∩ Bn(m), for 1 ≤ q ≤ m,

Jˆ

q−1jq| ≥ −|ˆµJˆ

q−1jq − µJˆ

q−1jq| + |ˆµJˆ

q−1jq|

≥ −2s r

p

2 q1

n + max

1≤j≤pJˆ

q−1,j|

≥ ξ0 max

1≤j≤pJˆq−1,j|.

This is the generalization of noiseless OGA in the Appendix B in Ing and Lai (2011). So, by Lemma B1 in Ing and Lai (2011), (C3), and (C4),

1

nYT(w)(I − HJˆm)Y (w) = Op( 1

1 + mξ02). (3.2) On Bnc(m), by (C3), (C4) and Lemma 1 in Appendix,

1

nYT(w)(I − HJˆm)Y (w) ≤ min

1≤i≤m−1 1

nYT(w)(I − HJˆi)Y (w)

≤ max

1≤j≤p||β e

?||1||wn1/2j||ξ˜0s r

p

q12

n

= Op(1

m). (3.3)

It remains to prove that ∀ > 0, ∃s > 0 s.t.

P (Acn(m)) ≤ , (3.4)

and the proof is shown in the Appendix.

So, by (3.2)-(3.4), we have 1

nYT(w)(I − HJˆm)Y (w) = Op(1

m). (3.5)

On the other hand,

(14)

1

n(Y − Y (w))THJˆm(Y − Y (w))

= n1ξ?THJˆmξ?

≤ || ˆR−1( ˆJm)||m max

1≤i≤p(n1Pn

t=1ξt?wti)2

= Op(mpq12

n ), (3.6)

where ξ? = (ξ1?, ξ2?, ..., ξn?)T. Theorem 1 follows form (3.5) and (3.6).

4 Sure Screening Property and Model Selec- tion Consistency

In the first part of this section, we prove the sure screening property of OGA under a strong sparsity condition:

(C6) ∃Ln satisfies Ln → 0 and q n

p

q12 L2n → ∞ as n → ∞ s.t. for any βj 6= 0, |βj| ≥ (

max

1≤i≤pσηi2||β

e

||1

λminx)+ min

1≤i≤pσ2ηi) + Ln.

Theorem 2. Assume (C1)-(C6), (3.1) and Kn= O(q n

p

q12 ). Then lim

n→∞P (N ⊆ JˆKn) = 1, where N = {1 ≤ j ≤ p : βj 6= 0} denote the set of relevant input variables.

Proof. Let m0 = baL−2n c = o(Kn), for some positive constant a. Consider a event

A?n(k) = (

max

(J,i):#(J )≤k−1,i /∈J|ˆµJ,i− µJ,i| ≤ sL2n )

,

for some positive constant s independent of n and k. By (3.4), we have

∀s > 0, lim

n→∞P (A?nc(Kn)) = 0, which implies lim

n→∞P (A?nc(m0)) = 0. So, by similar arguments in the proof of Theorem 1, lim

n→∞P (Fn) = 0, where

(15)

Fn = {n1YT(w)(I − HJˆm0)Y (w) > Cm−10 },

for some C > 0. By (C3), (C6), it follows that #(N ) = O(1), yielding

#(N ∪ ˆJm0) = o(Kn). So, on {N ∩ ˆJmc0 6= ∅}, when n is large,

1

nYT(w)(I − HJˆm0)Y (w)

= n1β?TN ∩ ˆJc m0wT

N ∩ ˆJm0c (I − HJˆm0)wN ∩ ˆJc m0β?

N ∩ ˆJm0c

≥ (min

j∈N βj?2) min

1≤#(J )≤Kn

λmin( ˆR(J ))

≥ bL2n, for some b > 0, where wN ∩ ˆJc

m0 = (wi, i ∈ N ∩ ˆJmc0), β?

N ∩ ˆJm0c = (βi?, i ∈ N ∩ ˆJmc0)T. The last inequality above follows from Lemma 3, (C6) and (3.1).

By choosing a in m0 = baL−2n c large enough, we have bL2n > Cm−10 , and the proof of Theorem 2 is complete.

To choose the smallest number of iterations that include all relevant variables, we propose a high-dimensional information criterion (HDIC). De- fine ˆσJ2 = n−1Pn

t=1(yt − ˆyt;J)2, where ˆyt;J denotes the fitted value of yt when Y = (y1, y2, ..., yn)T is projected into the linear space spanned by wj, j ∈ J 6= ∅, setting ˆyt;J = 0 if J = ∅. Let

HDIC(J ) = n log ˆσJ2 + #(J )wnpq12 , kˆn= arg min

1≤k≤Kn

HDIC( ˆJk),

wn→ ∞, wnpq12 = o(nLn4), (4.1) k˜n= min{k : 1 ≤ k ≤ Kn, N ⊆ ˆJk}(min ∅ = Kn).

Note that ˆkn is the number of OGA iterations we choose according to HDIC, and ˜kn is the minimal number of iterations that includes all relevant regres- sors along an OGA path.

(16)

To achieve consistency of model selection under (C6), the strong sparsity condition, we need to assume the contaminated regression coefficients con- verges to the uncontaminated regression coefficients in an appropriate rate, which means the measurement errors must converges to 0 in probability with some rate:

(C7) ||U e

||1 = O(

r

p

q12

n ), note that max

1≤i≤pσ2ηi = O(1p r

p

q12

n ) assures (C7). If the regressors are uncorre- lated, then max

1≤i≤pσ2ηi = O(

r

p

q12

n ) assures (C7), which is weaker than general conditions.

In addition, we assume a weak dependency on the square of regression errors:

(C8) max

1≤t≤nE(ξt4) = O(1) and E(ξt2ξt+h2 ) − σ4ξ = o(1) as h → ∞, where σξ2 = E(ξt2), ∀t = 1, 2, ..., n.

This assumption is used to derive weak law of large numbers of ξt2. The fol- lowing theorem proves that ˆk approaches ˜k when n grows in probability sense.

Theorem 3. With the same notation and assumptions as in Theorem 2, suppose (3.1), (C7) and (C8) holds, Kn = O(q n

p

q12 ). Then lim

n→∞P (ˆkn = k˜n) = 1.

Proof. For notational simplicity, dropping the subscript n in ˜knand ˆkn. Let Dn= {N ⊆ ˆJm0} = {˜k ≤ m0}, by Theorem 2, lim

n→∞P (Dn) = 1. On {ˆk < ˜k}, by definition of ˆk, it follows that

exp(HDIC( ˆJˆk)/n) ≤ exp(HDIC( ˆJ˜k)/n),

(17)

so, ˆ σJ2ˆ

˜k−1

− ˆσJ2ˆ

k˜

≤ ˆσJ2ˆ

ˆk

− ˆσJ2ˆ

k˜

≤ ˆσ2Jˆ

˜k

{exp(n−1wnkp˜ q12 ) − exp(n−1wnkpˆ q12 )}. (4.2) Note that

n−1{Pn

t=1(yt− ˆyt; ˆJ

˜k−1)2−Pn

t=1(yt− ˆyt; ˆJ

˜k)2}

= n−1ˆ?

j˜k

wˆj

k˜ +P

l /∈ ˆJ˜kβl?wl+ ξ?)T(HJˆ

k˜ − HJˆ

k−1˜ )(βˆ?

j˜k

wˆj

˜k +P

l /∈ ˆJ˜kβl?wl+ ξ?)

=

ˆ?

k

wTˆ

k

(I−Hˆk−1

)wˆk

+wTˆ

k

(I−Hˆk−1

?}2 nwTˆ

k

(I−Hˆk−1

)wˆk

+n−1(P

l /∈ ˆJk˜βl?wl)T(HJˆk˜ − HJˆk−1˜ )(P

l /∈ ˆJ˜kβl?wl) +2n−1(P

l /∈ ˆJ˜kβl?wl)T(HJˆ˜k − HJˆ˜k−1)(βˆj?

˜k

wˆj˜k+ ξ?).

By (4.2),

βˆ?2

jk˜

n+ 2βˆ?

j˜k

n+ ˆA−1nn2 + ˆDn+ 2 ˆEn

≤ λn−1wnpq12 m0( ˆCn+ σξ2?) on {ˆk < ˜k }\

Dn, (4.3)

for some λ > 0, where Aˆn = n−1wTˆ

j˜k

(I − HJˆk−1˜ )wˆj˜k

n = n−1wˆT

j˜k

(I − HJˆ˜k−1?n= ˆσ2ˆ

J˜k

− σξ2?

n= n−1(P

l /∈ ˆJk˜βl?wl)T(HJˆk˜ − HJˆk−1˜ )(P

l /∈ ˆJ˜kβl?wl) Eˆn = n−1(P

l /∈ ˆJ˜kβl?wl)T(HJˆ˜k− HJˆ˜k−1)(βˆj?

˜k

wˆjk˜ + ξ?).

In the Appendix, it is shown that ∀θ > 0, P ( ˆAn< vn

2 , Dn)+P (| ˆBn| ≥ θLn, Dn)+P (| ˆCn| ≥ θ, Dn)+P (| ˆEn| ≥ θL2n, Dn) = o(1), (4.4) where vn= min1≤#(J )≤m0λmin(R(J )). From (4.3), (4.4), (C6), lim

n→∞P (Dn) = 1, ˆDn, Aˆ−1nn2 ≥ 0, and θ is arbitrary, it follows that P (ˆk < ˜k) = o(1).

On {ˆk > ˜k}, by definition of ˆk, it follows that

(18)

ˆ σ2ˆ

Jˆk

exp(n−1wnkpˆ q12) ≤ ˆσ2ˆ

J˜k

exp(n−1wn˜kpq12 ), so, it can be derived that

ξ?T(HJˆˆk − HJˆ˜k? +(P

l /∈ˆj˜kβl?wl)T(HJˆˆk− HJˆ˜k)(P

l /∈ˆjk˜βl?wl) +2(P

l /∈ˆj˜kβl?wl)T(HJˆˆk− HJˆ˜k?

≥ {ξ?T(I − HJˆ

˜k? +(P

l /∈ˆj˜kβl?wl)T(I − HJˆ

˜k)(P

l /∈ˆj˜kβl?wl) +2(P

l /∈ˆj˜kβl?wl)T(I − HJˆ˜k?}

×(1 − exp(−n−1wn(ˆk − ˜k)pq12 )). (4.5) Let Fˆk,˜k denote the n × (ˆk − ˜k) matrix whose column vectors are wj, j ∈ ˆJˆk− ˆJk˜. Since

ξ?T(HJˆˆk − HJˆ˜k?

= ξ?T(I − HJˆ˜k)Fk,˜ˆk{FˆT

k,˜k(I − HJˆk˜)Fˆk,˜k}−1FˆT

k,˜k(I − HJˆ˜k?

≤ || ˆR−1( ˆJKn)||||n12FˆT

k,˜k(I − HJˆ

˜k?||2

≤ || ˆR−1( ˆJKn)||(2||n12FˆT

k,˜kξ?||2+ 2||n12FˆT

k,˜kHJˆ

k˜ξ?||2)

≤ 2(ˆk − ˜k)(ˆan+ ˆbn), where

ˆ

an= || ˆR−1( ˆJKn)|| max

1≤i≤p(n12 Pn

t=1wtiξt?)2,

(4.6) ˆbn= || ˆR−1( ˆJKn)|| max

1≤#(J )≤˜k,i /∈J

(n12 Pn

t=1ti;Jξt?)2,

(19)

and it is shown in the Appendix that

P ((ˆk − ˜k)(ˆan+ ˆbn) ≥ θn(1 − exp(−n−1wn(ˆk − ˜k)pq12 )), ˆk > ˜k) +P ((P

l /∈ ˆJ˜kβl?wl)T(HJˆˆk−HJˆ˜k)(P

l /∈ ˆJ˜kβl?wl) ≥ θn(1−exp(−n−1wn(ˆk−˜k)pq12 )), ˆk >

˜k) +P (|(P

l /∈ ˆJ˜kβl?wl)T(HJˆ

ˆk− HJˆ

k˜?| ≥ θn(1 − exp(−n−1wn(ˆk − ˜k)pq12 )), ˆk > ˜k) +P ((P

l /∈ˆj˜kβl?wl)T(I − HJˆ

˜k)(P

l /∈ˆj˜kβl?wl) ≥ θn, ˆk > ˜k) +P (|(P

l /∈ˆj˜kβl?wl)T(I − HJˆ˜k?| ≥ θn, ˆk > ˜k)

= o(1). (4.7)

So, by (A.9), (4.5), (4.7), it follows that P (ˆk > ˜k) = o(1), and the proof of Theorem 3 is complete.

Even the true model will be included by OGA+HDIC, some redundant variables could be contained. So, we provide a trimming method to trim out redundant variables, Let

N = {ˆˆ jl: HDIC( ˆJkˆ− {ˆjl}) > HDIC( ˆJkˆ), 1 ≤ l ≤ ˆk} if ˆk > 1,

and ˆN = {ˆj1} if ˆk = 1. ˆN is the subset of ˆJˆk after trimming. The following theorem shows that OGA+HDIC+Trim will achieve the oracle property.

Theorem 4. Under the same assumption as in Theorem 3, lim

n→∞P ( ˆN = N ) = 1.

Proof. For ˆk > 1, define δl = 1 if HDIC( ˆJ˜k− {ˆjl}) >HDIC( ˆJk˜) and δl = 0 otherwise. Then

(20)

P ( ˆN 6= N ) ≤ P ( ˆN 6= N, ˆk > 1, N ⊆ ˆJˆk) + P (N * ˆJkˆ) + P ( ˆN 6= N, ˆk = 1)

≤ P (δl = 1 and βˆjl = 0 for some 1 ≤ l ≤ ˜k, N ⊆ ˆJ˜k, ˜k > 1) +P (δl= 0 and βˆjl 6= 0 for some 1 ≤ l ≤ ˜k, N ⊆ ˆJk˜, ˜k > 1) +P (ˆk 6= ˜k) + P (N * ˆJˆk) + P ( ˆN 6= N, ˆk = 1). (4.8) Let ˆJ˜k− {ˆjl} = Ql. On {ˆk = ˜k}, Since by similar arguments in the proof of Theorem 3, it can be derived that ∀θ > 0, 1 ≤ l ≤ ˆk,

P ((˜an+ ˜bn) ≥ θn(1 − exp(−n−1wn(ˆk − ˜k)pq12 ))) = o(1), (4.9) in which ˜an and ˜bn are the same as ˆan and ˆbn in (4.6) but with Kn replaced by ˜k, and ˜k replaced by ˜k − 1, and

P ((X

r /∈Ql

βr?wr)T(HJˆ˜k−HQl)(X

r /∈Ql

βr?wr) ≥ θn(1−exp(−n−1wn(ˆk−˜k)pq12 ))) = o(1), (4.10) P (|(X

r /∈Ql

βr?wr)T(HJˆ˜k − HQl?| ≥ θn(1 − exp(−n−1wn(ˆk − ˜k)pq12 ))) = o(1), (4.11) P (|n−1ξ?T(I − HQl?− σξ2?| ≥ θ) = o(1), (4.12)

P ((X

r /∈Ql

βr?wr)T(I − HQl)(X

r /∈Ql

βr?wr) ≥ θn) = o(1), (4.13)

P (|(X

r /∈Ql

βr?wr)T(I − HQl?| ≥ θn) = o(1). (4.14) So, by (4.9)-(4.14), it follows that

P (δl= 1 and βˆjl = 0 for some 1 ≤ l ≤ ˜k, N ⊆ ˆJk˜, ˜k > 1) = o(1). (4.15) On the other hand,

P (|n−1wTˆj

l(I − HQl?| ≥ θLn, Dn) = o(1), (4.16)

(21)

P (n−1wTˆj

l(I − HQl)wˆjl ≤ vn

2 ) = o(1), (4.17)

P (|n−1(X

l /∈ ˆJk˜

βl?wl)T(HJˆk˜ − HQl)(βˆjlwˆjl+ ξ?)| ≥ θL2n) = o(1). (4.18) So, by (A.9), (4.16)-(4.18) and similar arguments in the proof of Theorem 3, it follows that

P (δl= 0 and βˆjl 6= 0 for some 1 ≤ l ≤ ˜k, N ⊆ ˆJk˜, ˜k > 1) = o(1). (4.19) Finally, by (4.8), (4.15), (4.19) and Theorem 2 and 3, we have the desired conclusion.

5 Simulation Studies

In this section, we report simulation studies of the performance of OGA+

HDBIC+Trim. These simulations consider the regression model y?t =

p0

X

j=1

βj?wtj +

p

X

j=p0+1

βj?wtj+ ξ?t, t = 1, 2, ..., n, (5.1) where βp0+1, βp0+2, ..., βp = 0, p  n, ηtjare i.i.d. N (0, σ2η), ∀t = 1, 2, ..., n, j = 1, 2, ..., p, and are independent of xtj. ξt are i.i.d. N (0, σ2ξ) and are indepen- dent of xtj, ηtj. ηyt are i.i.d. N (0, ση2y) and are independent of xtj, ηtj, ξt

Examples 1 and 2 consider the case

xtj = dtj + ˜η ˜xt, (5.2) in which ˜η ≥ 0 and (dt1, dt2, ..., dtj, ˜xt)T, t = 1, 2, ..., n are i.i.d. normal with mean (1, 1, ..., 1, 0)T and covariance matrix I. We standardize the variance of xtj by replacing xtj with √xtj

1+˜η2. Since for any J ⊂ {1, 2, ..., p} and 1 ≤ i ≤ p with i /∈ J,

λmin(R(J )) = 1

1 + ˜η2 + ση2 > 0 and ||R−1(J )γi(J )||1 < 1,

(22)

(3.1) is satisfied. Moreover, Corr(wti, wtj) = 1+˜η˜2η2 increases when ˜η grows.

Example 1. Consider (5.1) with p0 = 5, (β1, β2, ..., β5) = (3, −3.5, 4, −2.8, 3.2), σξ2 = 1, ση2y = 0.01 and assume that (5.2) holds. The cases ˜η = 0, which means the regressors are uncorrelated, ση2 = 0.01, 0.5, 0.1, and (n, p) = (50, 1000), (100, 2000), (200, 4000) are considered here. We choose Kn = b5(n/pq12 )12c and allow q1 to vary between 4 and 15. We have also allowed D in Kn = bD(n/pq12 )12c to vary between 3 and 10, and the results are similar to those for D = 5. We perform 1000 simulations on each case. Define the mean squared prediction errors

MSPE = 1 1000

1000

X

l=1

(

p

X

j=1

βj?wn+1(l) − ˆy(l)n+1)2

in which x(l)n+1,1, x(l)n+1,2, ..., x(l)n+1,p are the regressors associated with yn+1(l) , the new outcome in the lth simulation run, and ˆyn+1(l) denotes the predictor of y(l)n+1. Table 1 shows that OGA+HDBIC+Trim is very sensitive to the order of moment bounds q1, it performs well with proper q1, but performs poorly with improper q1. If q1 is too small, the penalty for the number of predictor variables in HDBIC is too large, so, OGA+HDBIC tends to be underfitting;

if q1 is too large, the penalty for the number of predictor variables in HD- BIC is too small, so, OGA+HDBIC tends to be overfitting. With moderate order of moment bounds (q1 = 8, 10), in the simulations for n ≥ 100, OGA includes the 5 relevant regressors within Kn iterations for 99.9% or more of the simulations, and HDBIC+Trim identify the smallest correct model for 98% or more of the simulations.

(23)

Table1. Frequency, in 1000 simulations, of including all five relevant variables (Correct), of selecting exactly the relevant variables (E), of selecting all relevant variables and i irrelevant variables (E+i).

ση2 q1 n p E E+1 E+2 E+3 E+4 E+5 Correct MSPE

0.01 4 50 1000 0 0 0 0 0 0 0 64.02502

100 2000 0 0 0 0 0 0 0 53.08281

200 4000 0 0 0 0 0 0 0 55.59686

6 50 1000 623 0 0 0 0 0 623 24.54740

100 2000 1000 0 0 0 0 0 1000 0.15931

200 4000 1000 0 0 0 0 0 1000 0.08096

8 50 1000 911 18 0 0 0 0 929 4.34789

100 2000 1000 0 0 0 0 0 1000 0.17550

200 4000 1000 0 0 0 0 0 1000 0.08053

10 50 1000 571 129 43 17 17 7 922 10.29011

100 2000 983 16 1 0 0 0 1000 0.17837

200 4000 999 1 0 0 0 0 1000 0.16207

15 50 1000 0 0 0 0 0 0 914 14.44628

100 2000 21 12 10 7 3 2 1000 4.70902

200 4000 677 225 75 14 5 2 1000 0.19443

0.05 5 50 1000 0 0 0 0 0 0 0 65.54043

100 2000 0 0 0 0 0 0 0 53.91476

200 4000 0 0 0 0 0 0 0 47.64495

6 50 1000 2 0 0 0 0 0 2 59.51543

100 2000 689 0 0 0 0 0 689 16.94148

200 4000 1000 0 0 0 0 0 1000 0.18862

8 50 1000 816 16 2 0 0 0 834 13.14926

100 2000 1000 0 0 0 0 0 1000 0.39365

200 4000 1000 0 0 0 0 0 1000 0.17408

10 50 1000 522 118 36 21 8 14 861 13.67555

100 2000 983 16 1 0 0 0 1000 0.44005

200 4000 998 2 0 0 0 0 1000 0.17630

15 50 1000 0 0 0 0 0 0 854 26.64408

100 2000 11 17 12 10 3 0 1000 10.66257

200 4000 683 218 75 19 1 2 1000 0.43310

數據

表    目    錄
Table 2 shows that OGA+HDBIC+Trim agrees with the asymptotic theory of Theorem 4. In the cases of n = 50, p = 1000, OGA can include all relevant variables at least 90% of the time if C ≤ 1 (σ η 2 ≤ 0.02), furthermore, with proper order of moment bounds (q

參考文獻

相關文件

Once we introduce time dummy into our models, all approaches show that the common theft and murder rate are higher with greater income inequality, which is also consistent with

Our model system is written in quasi-conservative form with spatially varying fluxes in generalized coordinates Our grid system is a time-varying grid. Extension of the model to

In this section, we consider a solution of the Ricci flow starting from a compact manifold of dimension n 12 with positive isotropic curvature.. Our goal is to establish an analogue

Other advantages of our ProjPSO algorithm over current methods are (1) our experience is that the time required to generate the optimal design is gen- erally a lot faster than many

Reading Task 6: Genre Structure and Language Features. • Now let’s look at how language features (e.g. sentence patterns) are connected to the structure

• ‘ content teachers need to support support the learning of those parts of language knowledge that students are missing and that may be preventing them mastering the

◆ Understand the time evolutions of the matrix model to reveal the time evolution of string/gravity. ◆ Study the GGE and consider the application to string and

In the past researches, all kinds of the clustering algorithms are proposed for dealing with high dimensional data in large data sets.. Nevertheless, almost all of