高維度時間序列並帶有測量誤差模型之模型選擇


國立臺灣大學理學院應用數學科學研究所碩士論文

Institute of Applied Mathematical Sciences College of Science

National Taiwan University Master Thesis

高維度時間序列並帶有測量誤差模型之模型選擇

Model Selection for High-Dimensional Time Series Models with Measurement Errors

黃學涵

Hsueh-Han Huang

指導教授：銀慶剛博士

Advisor: Ching-Kang Ing, Ph.D.


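Abstract (translated from the Chinese)

We use a fast stepwise regression procedure, the orthogonal greedy algorithm, to perform model selection for high-dimensional time series models with measurement errors. Under a weak sparsity condition, we derive the convergence rate of the prediction error of the orthogonal greedy algorithm. Under a strong sparsity condition, we develop a consistent model selection criterion.

Keywords: high-dimensional, measurement error, orthogonal greedy algorithm, sparsity, time series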


Abstract

We use a fast stepwise regression method, called the orthogonal greedy algorithm (OGA), to select variables for high-dimensional time series models with measurement errors. Under a weak sparsity condition, we derive a convergence rate of OGA, which is expressed in terms of the number of iterations, the sample size and the order of the moment imposed on the error process. Under a strong sparsity condition, we develop a consistent model selection procedure using OGA and a high-dimensional information criterion.

Keywords: High-dimensional, measurement error, OGA, sparsity, time series.


Abstract (Chinese)

Abstract (English)

1. Introduction

2. OGA and Noiseless OGA

3. Uniform Convergence Rate of Empirical Prediction Error

4. Sure Screening Property and Model Selection Consistency

5. Simulation Studies

References

Appendix


List of Tables

Table 1

Table 2

Table 3


1 Introduction

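Consider the linear regression model without intercept $y = \boldsymbol{\beta}^T\mathbf{x} + \xi$, where $\mathbf{x} = (x_1, x_2, \ldots, x_p)^T$ is a $p$-dimensional random vector satisfying $E(\mathbf{x}) = \mathbf{0} = (0, 0, \ldots, 0)^T$ and $E(\mathbf{x}\mathbf{x}^T) = \Sigma_x$, $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is a $p$-dimensional constant vector, $E(\xi) = 0$, $E(\xi^2) = \sigma_\xi^2 > 0$, and $\mathbf{x}$ and $\xi$ are independent.

Assume that $\mathbf{w} = \mathbf{x} + \boldsymbol{\eta}$ is observed instead of $\mathbf{x}$, where $\mathbf{w} = (w_1, w_2, \ldots, w_p)^T$ and $\boldsymbol{\eta} = (\eta_1, \eta_2, \ldots, \eta_p)^T$ is a vector of measurement errors, and that $y^* = y + \eta_y$ is observed instead of $y$, where $\eta_y$ is a measurement error. Here $E(\boldsymbol{\eta}) = \mathbf{0}$, $E(\boldsymbol{\eta}\boldsymbol{\eta}^T) = \Sigma_\eta$, $\boldsymbol{\eta}$ is independent of $(\mathbf{x}, \xi)$, $E(\eta_y) = 0$, $E(\eta_y^2) = \sigma^2_{\eta_y}$, and $\eta_y$ is independent of $(\mathbf{x}, \boldsymbol{\eta}, \xi)$. For simplicity, we assume that $\Sigma_\eta = \mathrm{diag}(\sigma^2_{\eta_1}, \sigma^2_{\eta_2}, \ldots, \sigma^2_{\eta_p})$ is a diagonal matrix. Since $\eta_y$ can be absorbed into $\xi$, we continue to write $y$ for $y^*$ and view $\xi$ as the random error after absorbing $\eta_y$.

If we regress $y$ on $\mathbf{w}$, then it follows that $y = \boldsymbol{\beta}^{*T}\mathbf{w} + \xi^*$, where $\boldsymbol{\beta}^* = \boldsymbol{\beta} - \mathbf{U}$ and $\mathbf{U} = (\Sigma_x + \Sigma_\eta)^{-1}\Sigma_\eta\boldsymbol{\beta}$. Note that $\boldsymbol{\beta}^* = \boldsymbol{\beta}$ if $\sigma_{\eta_i} = 0$ for all $i = 1, 2, \ldots, p$.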
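The relation between the uncontaminated and contaminated coefficients can be checked numerically. The following is a minimal sketch, not part of the thesis: the function name `contaminated_coefficients` and the input values are hypothetical and chosen only for illustration.

```python
import numpy as np


def contaminated_coefficients(beta, sigma_x, sigma_eta):
    """Return beta* = beta - (Sigma_x + Sigma_eta)^{-1} Sigma_eta beta."""
    beta = np.asarray(beta, dtype=float)
    sigma_x = np.asarray(sigma_x, dtype=float)
    sigma_eta = np.asarray(sigma_eta, dtype=float)
    # U = (Sigma_x + Sigma_eta)^{-1} Sigma_eta beta, obtained by solving a
    # linear system rather than forming the inverse explicitly.
    u = np.linalg.solve(sigma_x + sigma_eta, sigma_eta @ beta)
    return beta - u


# Hypothetical inputs: three uncorrelated unit-variance regressors and a
# common measurement-error variance of 0.5 (values assumed for illustration).
beta = np.array([3.0, -3.5, 0.0])
beta_star = contaminated_coefficients(beta, np.eye(3), 0.5 * np.eye(3))
print(beta_star)  # every coefficient is shrunk by the factor 1 / (1 + 0.5)
```

In this special case the contamination only shrinks the coefficients toward zero; with correlated regressors the vector U can also move mass onto coefficients that are zero in the uncontaminated model.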

Let $(y_t, \mathbf{w}_t^T)$, $t = 1, 2, \ldots, n$, be observations, where $\mathbf{w}_t = (w_{t1}, w_{t2}, \ldots, w_{tp})^T$. We allow $(y_t, \mathbf{x}_t^T, \boldsymbol{\eta}_t^T, \xi_t)$ to be a stationary time series and $p \gg n$. When $p$ is larger than $n$, there are computational difficulties in estimating the regression coefficients by standard regression methods. Ing and Lai (2011) propose the orthogonal greedy algorithm (OGA) to circumvent the inversion of high-dimensional matrices. They derive the convergence rate of OGA and provide a consistent model selection procedure for high-dimensional time-independent models. Ing and Huang (2016) generalize the results to a multivariate time series setting and relax the moment assumptions from exponential moment bounds to polynomial moment bounds. However, neither paper considers measurement errors. Since data with non-negligible measurement errors arise in many applications, high-dimensional models with measurement errors have recently been widely studied.

Loh and Wainwright (2012) propose a non-convex modification of the Lasso for high-dimensional models with measurement errors and missing data. They also consider a time series setting, but only within the class of VAR(1) models and under an upper restricted eigenvalue condition on the sample covariance matrix. Datta and Zou (2016) propose a modification called the Convex Conditioned Lasso (CoCoLasso) to circumvent the non-convexity. Their method can handle a general class of corrupted data, but the theory is developed only for the case in which the true regressors form a fixed design, and there is no model selection theory for a time series setting. Belloni, Rosenbaum and Tsybakov (2014, 2016) use a Dantzig-selector-type method, the matrix uncertainty (MU) selector, for high-dimensional models with measurement errors, but they do not consider a time series setting. To the best of our knowledge, the existing papers on high-dimensional models with measurement errors seldom consider a time series setting, and none of them uses a greedy algorithm for model selection in such models.

This paper focuses on OGA and generalizes the results of Ing and Lai (2011) and Ing and Huang (2016) in a new direction: high-dimensional time series models with measurement errors under polynomial moment bound assumptions.

In this paper, we provide an upper bound on the number of iterations and derive the uniform convergence rate of the empirical prediction error of OGA under a weak sparsity condition. We prove the sure screening property of OGA under a strong sparsity condition. We propose an information criterion for model selection; together with a trimming method, the whole procedure is shown to achieve the oracle property. We also provide simulation studies showing that, with a proper order of moment bounds, OGA+HDBIC+Trim identifies the smallest correct model with high frequency in some general model settings. Although some additional conditions are needed, the conditions under which OGA achieves consistent model selection for models with measurement errors remain simple.

The rest of this paper is organized as follows. In Section 2, we introduce OGA and noiseless OGA. In Section 3, we derive the convergence rate of OGA. In Section 4, we prove the sure screening property of OGA and introduce our model selection criterion along the OGA path, which is called the high-dimensional information criterion (HDIC). We also propose a trimming method to exclude redundant variables and prove that OGA+HDIC+Trim achieves model selection consistency. In Section 5, we present simulation studies to illustrate the performance of OGA+HDIC+Trim.

2 OGA and Noiseless OGA

In this section, we briefly introduce OGA and noiseless OGA, which were proposed by Ing and Lai (2011).


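Denote by $\hat{y}_k(\mathbf{w})$ a sequence of linear approximations of the regression function $y(\mathbf{w}) = \boldsymbol{\beta}^{*T}\mathbf{w}$. Initializing with $\hat{y}_0(\cdot) = 0$, OGA computes the residuals $U_t^{(k)} := y_t - \hat{y}_k(\mathbf{w}_t)$, $1 \le t \le n$, at the end of the $k$th iteration and chooses $w_{t,\hat{j}_{k+1}}$ on which $U_t^{(k)}$ is regressed, such that

$$\hat{j}_{k+1} = \arg\min_{1 \le j \le p} \sum_{t=1}^{n} \bigl(U_t^{(k)} - \tilde{\beta}_j^{(k)} w_{tj}\bigr)^2,$$

where $\tilde{\beta}_j^{(k)} = \sum_{t=1}^{n} U_t^{(k)} w_{tj} \big/ \sum_{t=1}^{n} w_{tj}^2$. We update

$$\hat{y}_{k+1}(\mathbf{w}_t) = \hat{y}_k(\mathbf{w}_t) + \hat{\beta}^{(k)}_{\hat{j}_{k+1}} w^{\perp}_{t,\hat{j}_{k+1}},$$

where

$$\hat{\beta}^{(k)}_{\hat{j}_{k+1}} = \frac{\sum_{t=1}^{n} U_t^{(k)} w^{\perp}_{t,\hat{j}_{k+1}}}{\sum_{t=1}^{n} \bigl(w^{\perp}_{t,\hat{j}_{k+1}}\bigr)^2},$$

$w^{\perp}_{t,\hat{j}_{k+1}}$ is the $t$th component of the vector $w^{\perp}_{\hat{j}_{k+1}} = w_{\hat{j}_{k+1}} - \hat{w}_{\hat{j}_{k+1}}$, and $\hat{w}_{\hat{j}_{k+1}}$ is the projection of $w_{\hat{j}_{k+1}}$ onto the linear space spanned by $(w_{\hat{j}_1}, w_{\hat{j}_2}, \ldots, w_{\hat{j}_k})$, where $w_j = (w_{1j}, w_{2j}, \ldots, w_{nj})^T$. The orthogonalization of the predictor variables allows us to use componentwise linear regression to compute the OLS fit, thereby circumventing the difficulty of inverting a high-dimensional matrix.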
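The iteration described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the code used in the thesis: the function name `oga` is hypothetical, the columns of `W` are assumed to be nonzero, and Gram-Schmidt orthogonalization is used to carry out the projection step.

```python
import numpy as np


def oga(W, y, n_iter):
    """Run up to n_iter OGA steps and return the selected column indices in order."""
    W = np.asarray(W, dtype=float)
    residual = np.asarray(y, dtype=float).copy()
    col_norms = np.linalg.norm(W, axis=0)  # assumes no column is identically zero
    selected = []
    basis = []  # orthonormalized versions of the selected columns
    for _ in range(n_iter):
        # Minimizing the componentwise residual sum of squares over j is
        # equivalent to maximizing |<U, w_j>| / ||w_j||.
        scores = np.abs(W.T @ residual) / col_norms
        if selected:
            scores[selected] = -np.inf  # never pick the same column twice
        j = int(np.argmax(scores))
        # Orthogonalize w_j against the columns chosen so far.
        w_perp = W[:, j].copy()
        for q in basis:
            w_perp -= (q @ w_perp) * q
        norm = np.linalg.norm(w_perp)
        if norm < 1e-12:
            break  # w_j lies in the span of the selected columns
        q_new = w_perp / norm
        # Componentwise regression of the residual on the orthogonalized column.
        residual -= (q_new @ residual) * q_new
        basis.append(q_new)
        selected.append(j)
    return selected
```

Because each new column is orthogonalized before the residual is updated, the fit after k steps coincides with the least squares fit on the k selected columns, and no matrix inversion is required.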

Noiseless OGA is similar to OGA but replaces $y_t$ by its mean $y(\mathbf{w}_t)$. In the next section, we use noiseless OGA to derive the convergence rate of the empirical prediction error of OGA. More details on OGA and noiseless OGA can be found in Ing and Lai (2011).

3 Uniform Convergence Rate of Empirical Prediction Error

In this section, we derive the convergence rate of OGA in linear regression time series models with measurement errors in which the number of regressors is allowed to be much larger than the number of observations.

According to OGA, $\hat{y}_m(\mathbf{w}_t) = \mathbf{w}_t^T(\hat{J}_m)\hat{\boldsymbol{\beta}}(\hat{J}_m)$, where $\hat{J}_m$ is the index set of the variables selected by OGA after $m$ iterations, $\mathbf{w}_t(J) = (w_{ti}, i \in J)^T$ and

$$\hat{\boldsymbol{\beta}}(J) = \Bigl(\sum_{t=1}^{n} \mathbf{w}_t(J)\mathbf{w}_t^T(J)\Bigr)^{-1} \sum_{t=1}^{n} \mathbf{w}_t(J) y_t$$

is the least squares estimate (LSE) based on model $J$. Let $K_n$ denote a prescribed upper bound on the number $m$ of OGA iterations. To provide the uniform convergence rate of the empirical norm $\frac{1}{n}\sum_{t=1}^{n}\bigl(\hat{y}_m(\mathbf{w}_t) - \boldsymbol{\beta}^{*T}\mathbf{w}_t\bigr)^2$, $1 \le m \le K_n$, we make the following assumptions.

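Assume that $\{\xi_t\}_{t=1}^{n}$ is a martingale difference sequence with respect to an increasing sequence of $\sigma$-fields $\{\mathcal{F}_t\}$, that $\{\eta_{ti}\}_{t=1}^{n}$, $i = 1, 2, \ldots, p$, are martingale difference sequences with respect to an increasing sequence of $\sigma$-fields $\{\tilde{\mathcal{F}}_t\}$, that $w_{ti}$, $i = 1, 2, \ldots, p$, are $\mathcal{F}_{t-1}$-measurable, that $x_{ti}$, $i = 1, 2, \ldots, p$, are $\tilde{\mathcal{F}}_{t-1}$-measurable, and that there exist $q_1, q_2$ with $q_2 > q_1 \ge 2$ such that

(C1) $\max_{1 \le t \le n,\, 1 \le i \le p} E|x_{ti}|^{2q_1} = O(1)$, $\sup_{1 \le t < \infty,\, 1 \le i \le p} E\bigl[|\eta_{ti}|^{2q_1} \,\big|\, \tilde{\mathcal{F}}_{t-1}\bigr] \le C_1 < \infty$ a.s. for some $C_1 > 0$, and $\sup_{1 \le t < \infty} E\bigl[|\xi_t|^{q_1} \,\big|\, \mathcal{F}_{t-1}\bigr] \le C_2 < \infty$ a.s. for some $C_2 > 0$,

(C2) $\max_{1 \le i,j \le p} E\Bigl|\frac{1}{\sqrt{n}}\sum_{t=1}^{n}(w_{ti}w_{tj} - \sigma_{ij})\Bigr|^{2q_2} = O(1)$ and $\max_{1 \le i,j \le p} E\Bigl|\frac{1}{\sqrt{n}}\sum_{t=1}^{n}(x_{ti}x_{tj} - \sigma_{x,ij})\Bigr|^{2q_2} = O(1)$, where $\sigma_{ij} = E(w_{ti}w_{tj})$ and $\sigma_{x,ij} = E(x_{ti}x_{tj})$.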

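Remark. Suppose that $w_{tj}$ has a linear representation

$$w_{tj} = \sum_{k=-\infty}^{\infty} a(k)\,\alpha_j(t-k),$$

where $(\alpha_j(t), \mathcal{F}_t)$, $-\infty < t < \infty$, are martingale difference sequences with $E[\alpha_j(t)^2 \mid \mathcal{F}_{t-1}] = 1$, that there exists a positive constant $C_{q_2}$ such that

$$\sup_{-\infty < t < \infty} E\bigl[\alpha_j(t)^{4q_2} \,\big|\, \mathcal{F}_{t-1}\bigr] \le C_{q_2},$$

and that the spectral density function of $w_{tj}$, denoted by $f_j$, is square integrable with

$$\max_{1 \le j \le p} \sum_{k=-\infty}^{\infty} \bigl[E(w_{tj}w_{t+k,j})\bigr]^2 = \max_{1 \le j \le p} \frac{1}{2\pi}\int_{-\pi}^{\pi} f_j^2(\lambda)\,d\lambda = O(1).$$

Then, by (2.10) in Findley and Wei (1993), the first condition in (C2) holds.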

(C3) $\|\boldsymbol{\beta}\|_1 < \infty$.

This assumption is the weak sparsity condition on the uncontaminated regression coefficients.

(C4) $\|\mathbf{U}\|_1 = \|(\Sigma_x + \Sigma_\eta)^{-1}\Sigma_\eta\boldsymbol{\beta}\|_1 < \infty$.

This assumption and (C3) ensure the weak sparsity condition on the regression coefficients contaminated by measurement errors.

Remark. There are many ways to achieve (C4). For example, if the variances of the measurement errors are bounded in terms of the number of regressors, say $\max_{1 \le i \le p}\sigma^2_{\eta_i} = O(1/\sqrt{p})$, then (C4) holds, since

$$\|\boldsymbol{\beta}^*\|_1 = \|\boldsymbol{\beta} - (\Sigma_x + \Sigma_\eta)^{-1}\Sigma_\eta\boldsymbol{\beta}\|_1 \le \|\boldsymbol{\beta}\|_1\Bigl(1 + \sqrt{p}\,\frac{\max_{1 \le i \le p}\sigma^2_{\eta_i}}{\lambda_{\min}(\Sigma_x) + \min_{1 \le i \le p}\sigma^2_{\eta_i}}\Bigr).$$

Another important example is the case in which $(x_{t1}, x_{t2}, \ldots, x_{tp})$ has a special covariance structure, for example, an uncorrelated structure. In general, however, (C4) does not hold without further conditions.
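The bound in the remark can be unpacked into intermediate steps. The following is a sketch of one way to obtain it, assuming that $\|\cdot\|$ applied to a matrix denotes the spectral norm and using Weyl's inequality $\lambda_{\min}(\Sigma_x + \Sigma_\eta) \ge \lambda_{\min}(\Sigma_x) + \lambda_{\min}(\Sigma_\eta)$; neither is stated explicitly in the text.

```latex
\begin{align*}
\|\boldsymbol{\beta}^*\|_1
  &\le \|\boldsymbol{\beta}\|_1
     + \|(\Sigma_x+\Sigma_\eta)^{-1}\Sigma_\eta\boldsymbol{\beta}\|_1
     && \text{(triangle inequality)} \\
  &\le \|\boldsymbol{\beta}\|_1
     + \sqrt{p}\,\|(\Sigma_x+\Sigma_\eta)^{-1}\Sigma_\eta\boldsymbol{\beta}\|_2
     && (\|v\|_1 \le \sqrt{p}\,\|v\|_2) \\
  &\le \|\boldsymbol{\beta}\|_1
     + \sqrt{p}\,\frac{\max_{1\le i\le p}\sigma^2_{\eta_i}}
                      {\lambda_{\min}(\Sigma_x+\Sigma_\eta)}\,
       \|\boldsymbol{\beta}\|_2
     && (\|\Sigma_\eta\| = \max_i \sigma^2_{\eta_i}) \\
  &\le \|\boldsymbol{\beta}\|_1
     \Bigl(1 + \sqrt{p}\,\frac{\max_{1\le i\le p}\sigma^2_{\eta_i}}
       {\lambda_{\min}(\Sigma_x)+\min_{1\le i\le p}\sigma^2_{\eta_i}}\Bigr)
     && (\|\boldsymbol{\beta}\|_2 \le \|\boldsymbol{\beta}\|_1,\ \text{Weyl}).
\end{align*}
```

If $\max_i \sigma^2_{\eta_i} = O(1/\sqrt{p})$ and the denominator stays bounded away from zero, the factor $\sqrt{p}\,\max_i \sigma^2_{\eta_i}$ is $O(1)$, so the same chain applied to $\mathbf{U}$ alone gives $\|\mathbf{U}\|_1 = O(\|\boldsymbol{\beta}\|_1)$.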

(C5) $n / p^{2/q_1} \to \infty$ as $n \to \infty$.

The following theorem gives the rate of convergence, which holds uniformly over $1 \le m \le K_n$, for the empirical prediction error of OGA. The uniform convergence rate varies with the prescribed order of moments $q_1$ in (C1). When $q_1$ is smaller, the moment assumptions are weaker and the bound on the prediction error becomes larger, that is, the convergence is slower. Define $R(J) = E\bigl(\mathbf{w}_1(J)\mathbf{w}_1(J)^T\bigr)$ and $\boldsymbol{\gamma}_i(J) = E\bigl(w_{1i}\mathbf{w}_1(J)\bigr)$.

Theorem 1. Assume (C1)-(C5). Suppose $K_n = O\bigl(\sqrt{n / p^{2/q_1}}\bigr)$ and

$$\min_{1 \le \#(J) \le K_n} \lambda_{\min}(R(J)) > \delta, \qquad \max_{1 \le \#(J) \le K_n,\, i \notin J} \|R^{-1}(J)\boldsymbol{\gamma}_i(J)\|_1 < C^* < \infty, \tag{3.1}$$

for some $\delta, C^* > 0$. Then

$$\max_{1 \le m \le K_n} \frac{n^{-1}\sum_{t=1}^{n}\bigl(\hat{y}_m(\mathbf{w}_t) - \boldsymbol{\beta}^{*T}\mathbf{w}_t\bigr)^2}{m^{-1} + m n^{-1} p^{2/q_1}} = O_p(1).$$

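Proof. We have

$$\frac{1}{n}\sum_{t=1}^{n}\bigl(\hat{y}_m(\mathbf{w}_t) - \boldsymbol{\beta}^{*T}\mathbf{w}_t\bigr)^2 = \frac{1}{n}\bigl(Y(w) - H_{\hat{J}_m}Y\bigr)^T\bigl(Y(w) - H_{\hat{J}_m}Y\bigr)$$

$$= \frac{1}{n}Y^T(w)(I - H_{\hat{J}_m})Y(w) + \frac{1}{n}\bigl(Y - Y(w)\bigr)^T H_{\hat{J}_m}\bigl(Y - Y(w)\bigr),$$

where $Y(w) = (\mathbf{w}_1^T\boldsymbol{\beta}^*, \mathbf{w}_2^T\boldsymbol{\beta}^*, \ldots, \mathbf{w}_n^T\boldsymbol{\beta}^*)^T$, $Y = (y_1, y_2, \ldots, y_n)^T$, and $H_J$ is the projection matrix that projects vectors onto the linear space spanned by $(w_i, i \in J)$, where $w_i = (w_{1i}, w_{2i}, \ldots, w_{ni})^T$. Let

$$\mu_{J,i} = \frac{Y^T(w)(I - H_J)w_i}{n^{1/2}\|w_i\|}, \qquad \hat{\mu}_{J,i} = \frac{Y^T(I - H_J)w_i}{n^{1/2}\|w_i\|},$$

where $\|\cdot\| = \|\cdot\|_2$ denotes the $L_2$-norm in this paper. Consider the two events

$$A_n(k) = \Bigl\{\max_{(J,i):\,\#(J) \le k-1,\, i \notin J} |\hat{\mu}_{J,i} - \mu_{J,i}| \le s\sqrt{\frac{p^{2/q_1}}{n}}\Bigr\},$$

$$B_n(k) = \Bigl\{\min_{0 \le i \le k-1}\,\max_{1 \le j \le p} |\mu_{\hat{J}_i,j}| > \tilde{\xi}_0\, s\sqrt{\frac{p^{2/q_1}}{n}}\Bigr\},$$

where $s$ is a positive constant independent of $n$ and $k$, and $\tilde{\xi}_0 = 2/(1 - \xi_0)$ for $0 < \xi_0 < 1$.

On $A_n(m) \cap B_n(m)$, for $1 \le q \le m$,

$$|\mu_{\hat{J}_{q-1},\hat{j}_q}| \ge -|\hat{\mu}_{\hat{J}_{q-1},\hat{j}_q} - \mu_{\hat{J}_{q-1},\hat{j}_q}| + |\hat{\mu}_{\hat{J}_{q-1},\hat{j}_q}|$$

$$\ge -2s\sqrt{\frac{p^{2/q_1}}{n}} + \max_{1 \le j \le p}|\mu_{\hat{J}_{q-1},j}|$$

$$\ge \xi_0 \max_{1 \le j \le p}|\mu_{\hat{J}_{q-1},j}|.$$

This is the generalization of noiseless OGA in Appendix B of Ing and Lai (2011). So, by Lemma B1 in Ing and Lai (2011), (C3), and (C4),

$$\frac{1}{n}Y^T(w)(I - H_{\hat{J}_m})Y(w) = O_p\Bigl(\frac{1}{1 + m\xi_0^2}\Bigr). \tag{3.2}$$

On $B_n^c(m)$, by (C3), (C4) and Lemma 1 in the Appendix,

$$\frac{1}{n}Y^T(w)(I - H_{\hat{J}_m})Y(w) \le \min_{1 \le i \le m-1}\frac{1}{n}Y^T(w)(I - H_{\hat{J}_i})Y(w) \le \max_{1 \le j \le p}\|\boldsymbol{\beta}^*\|_1\frac{\|w_j\|}{n^{1/2}}\,\tilde{\xi}_0\, s\sqrt{\frac{p^{2/q_1}}{n}} = O_p\Bigl(\frac{1}{m}\Bigr). \tag{3.3}$$

It remains to prove that for all $\epsilon > 0$ there exists $s > 0$ such that

$$P(A_n^c(m)) \le \epsilon, \tag{3.4}$$

and the proof is given in the Appendix.

So, by (3.2)-(3.4), we have

$$\frac{1}{n}Y^T(w)(I - H_{\hat{J}_m})Y(w) = O_p\Bigl(\frac{1}{m}\Bigr). \tag{3.5}$$

On the other hand,

$$\frac{1}{n}\bigl(Y - Y(w)\bigr)^T H_{\hat{J}_m}\bigl(Y - Y(w)\bigr) = \frac{1}{n}\xi^{*T}H_{\hat{J}_m}\xi^* \le \|\hat{R}^{-1}(\hat{J}_m)\|\, m \max_{1 \le i \le p}\Bigl(\frac{1}{n}\sum_{t=1}^{n}\xi_t^* w_{ti}\Bigr)^2 = O_p\Bigl(\frac{m\,p^{2/q_1}}{n}\Bigr), \tag{3.6}$$

where $\xi^* = (\xi_1^*, \xi_2^*, \ldots, \xi_n^*)^T$. Theorem 1 follows from (3.5) and (3.6).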

4 Sure Screening Property and Model Selection Consistency

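In the first part of this section, we prove the sure screening property of OGA under a strong sparsity condition:

(C6) There exists $L_n$ satisfying $L_n \to 0$ and $\sqrt{n / p^{2/q_1}}\,L_n^2 \to \infty$ as $n \to \infty$ such that for any $\beta_j \ne 0$,

$$|\beta_j| \ge \frac{\max_{1 \le i \le p}\sigma^2_{\eta_i}\,\|\boldsymbol{\beta}\|_1}{\lambda_{\min}(\Sigma_x) + \min_{1 \le i \le p}\sigma^2_{\eta_i}} + L_n.$$

Theorem 2. Assume (C1)-(C6), (3.1) and $K_n = O\bigl(\sqrt{n / p^{2/q_1}}\bigr)$. Then $\lim_{n \to \infty} P(N \subseteq \hat{J}_{K_n}) = 1$, where $N = \{1 \le j \le p : \beta_j \ne 0\}$ denotes the set of relevant input variables.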

Proof. Let $m_0 = \lfloor aL_n^{-2} \rfloor = o(K_n)$ for some positive constant $a$. Consider the event

$$A_n^*(k) = \Bigl\{\max_{(J,i):\,\#(J) \le k-1,\, i \notin J} |\hat{\mu}_{J,i} - \mu_{J,i}| \le sL_n^2\Bigr\},$$

for some positive constant $s$ independent of $n$ and $k$. By (3.4), we have $\lim_{n \to \infty} P(A_n^{*c}(K_n)) = 0$ for all $s > 0$, which implies $\lim_{n \to \infty} P(A_n^{*c}(m_0)) = 0$. So, by arguments similar to those in the proof of Theorem 1, $\lim_{n \to \infty} P(F_n) = 0$, where

$$F_n = \Bigl\{\frac{1}{n}Y^T(w)(I - H_{\hat{J}_{m_0}})Y(w) > Cm_0^{-1}\Bigr\},$$

for some $C > 0$. By (C3) and (C6), it follows that $\#(N) = O(1)$, yielding $\#(N \cup \hat{J}_{m_0}) = o(K_n)$. So, on $\{N \cap \hat{J}_{m_0}^c \ne \emptyset\}$, when $n$ is large,

$$\frac{1}{n}Y^T(w)(I - H_{\hat{J}_{m_0}})Y(w) = \frac{1}{n}\beta^{*T}_{N \cap \hat{J}_{m_0}^c} w^T_{N \cap \hat{J}_{m_0}^c}(I - H_{\hat{J}_{m_0}}) w_{N \cap \hat{J}_{m_0}^c}\beta^*_{N \cap \hat{J}_{m_0}^c}$$

$$\ge \Bigl(\min_{j \in N}\beta_j^{*2}\Bigr)\min_{1 \le \#(J) \le K_n}\lambda_{\min}(\hat{R}(J)) \ge bL_n^2,$$

for some $b > 0$, where $w_{N \cap \hat{J}_{m_0}^c} = (w_i, i \in N \cap \hat{J}_{m_0}^c)$ and $\beta^*_{N \cap \hat{J}_{m_0}^c} = (\beta_i^*, i \in N \cap \hat{J}_{m_0}^c)^T$. The last inequality above follows from Lemma 3, (C6) and (3.1).

By choosing $a$ in $m_0 = \lfloor aL_n^{-2} \rfloor$ large enough, we have $bL_n^2 > Cm_0^{-1}$, and the proof of Theorem 2 is complete.

To choose the smallest number of iterations that includes all relevant variables, we propose a high-dimensional information criterion (HDIC). Define $\hat{\sigma}_J^2 = n^{-1}\sum_{t=1}^{n}(y_t - \hat{y}_{t;J})^2$, where $\hat{y}_{t;J}$ denotes the fitted value of $y_t$ when $Y = (y_1, y_2, \ldots, y_n)^T$ is projected onto the linear space spanned by $w_j$, $j \in J \ne \emptyset$, setting $\hat{y}_{t;J} = 0$ if $J = \emptyset$. Let

$$\mathrm{HDIC}(J) = n\log\hat{\sigma}_J^2 + \#(J)\,w_n\,p^{2/q_1}, \qquad \hat{k}_n = \arg\min_{1 \le k \le K_n}\mathrm{HDIC}(\hat{J}_k),$$

$$w_n \to \infty, \qquad w_n\,p^{2/q_1} = o(nL_n^4), \tag{4.1}$$

$$\tilde{k}_n = \min\{k : 1 \le k \le K_n,\ N \subseteq \hat{J}_k\} \quad (\min\emptyset = K_n).$$

Note that $\hat{k}_n$ is the number of OGA iterations we choose according to HDIC, and $\tilde{k}_n$ is the minimal number of iterations that includes all relevant regressors along an OGA path.
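The criterion and the choice of the number of iterations can be sketched as follows. This is an illustration only: the function names `hdic` and `select_iterations` are hypothetical, the OGA path is assumed to be supplied as an ordered list of column indices, and the penalty weight `w_n` is left as a user-supplied argument because the text only requires the growth conditions in (4.1).

```python
import numpy as np


def hdic(W, y, subset, w_n, q1):
    """HDIC(J) = n log(sigma_hat_J^2) + #(J) * w_n * p^(2/q1)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = W.shape
    subset = list(subset)
    if subset:
        # Least squares fit of y on the columns indexed by the subset.
        coef, *_ = np.linalg.lstsq(W[:, subset], y, rcond=None)
        resid = y - W[:, subset] @ coef
    else:
        resid = y  # fitted values are set to 0 for the empty model
    sigma2 = np.mean(resid ** 2)
    return n * np.log(sigma2) + len(subset) * w_n * p ** (2.0 / q1)


def select_iterations(W, y, path, w_n, q1):
    """Return k_hat, the minimizer of HDIC over the nested models path[:k]."""
    values = [hdic(W, y, path[:k], w_n, q1) for k in range(1, len(path) + 1)]
    return int(np.argmin(values)) + 1
```

A perfect fit gives a zero residual variance and hence a logarithm of zero, so in practice the path should stop before the number of selected columns reaches the sample size.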


To achieve consistency of model selection under the strong sparsity condition (C6), we need to assume that the contaminated regression coefficients converge to the uncontaminated regression coefficients at an appropriate rate, which means that the measurement errors must converge to 0 in probability at a suitable rate:

(C7) $\|\mathbf{U}\|_1 = O\Bigl(\sqrt{\dfrac{p^{2/q_1}}{n}}\Bigr)$.

Note that $\max_{1 \le i \le p}\sigma^2_{\eta_i} = O\Bigl(\dfrac{1}{\sqrt{p}}\sqrt{\dfrac{p^{2/q_1}}{n}}\Bigr)$ ensures (C7). If the regressors are uncorrelated, then $\max_{1 \le i \le p}\sigma^2_{\eta_i} = O\Bigl(\sqrt{\dfrac{p^{2/q_1}}{n}}\Bigr)$ ensures (C7), which is weaker than the condition needed in general.

In addition, we assume weak dependence of the squared regression errors:

(C8) $\max_{1 \le t \le n} E(\xi_t^4) = O(1)$ and $E(\xi_t^2\xi_{t+h}^2) - \sigma_\xi^4 = o(1)$ as $h \to \infty$, where $\sigma_\xi^2 = E(\xi_t^2)$ for all $t = 1, 2, \ldots, n$.

This assumption is used to derive a weak law of large numbers for $\xi_t^2$. The following theorem shows that $\hat{k}_n$ equals $\tilde{k}_n$ with probability approaching 1 as $n$ grows.

Theorem 3. With the same notation and assumptions as in Theorem 2, suppose that (3.1), (C7) and (C8) hold and $K_n = O\bigl(\sqrt{n / p^{2/q_1}}\bigr)$. Then $\lim_{n \to \infty} P(\hat{k}_n = \tilde{k}_n) = 1$.

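Proof. For notational simplicity, we drop the subscript $n$ in $\tilde{k}_n$ and $\hat{k}_n$. Let $D_n = \{N \subseteq \hat{J}_{m_0}\} = \{\tilde{k} \le m_0\}$. By Theorem 2, $\lim_{n \to \infty} P(D_n) = 1$. On $\{\hat{k} < \tilde{k}\}$, by the definition of $\hat{k}$, it follows that

$$\exp\bigl(\mathrm{HDIC}(\hat{J}_{\hat{k}})/n\bigr) \le \exp\bigl(\mathrm{HDIC}(\hat{J}_{\tilde{k}})/n\bigr),$$

so

$$\hat{\sigma}^2_{\hat{J}_{\tilde{k}-1}} - \hat{\sigma}^2_{\hat{J}_{\tilde{k}}} \le \hat{\sigma}^2_{\hat{J}_{\hat{k}}} - \hat{\sigma}^2_{\hat{J}_{\tilde{k}}} \le \hat{\sigma}^2_{\hat{J}_{\tilde{k}}}\bigl\{\exp(n^{-1}w_n\tilde{k}\,p^{2/q_1}) - \exp(n^{-1}w_n\hat{k}\,p^{2/q_1})\bigr\}. \tag{4.2}$$

Note that

$$n^{-1}\Bigl\{\sum_{t=1}^{n}\bigl(y_t - \hat{y}_{t;\hat{J}_{\tilde{k}-1}}\bigr)^2 - \sum_{t=1}^{n}\bigl(y_t - \hat{y}_{t;\hat{J}_{\tilde{k}}}\bigr)^2\Bigr\}$$

$$= n^{-1}\Bigl(\beta^*_{\hat{j}_{\tilde{k}}}w_{\hat{j}_{\tilde{k}}} + \sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l + \xi^*\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{\hat{J}_{\tilde{k}-1}}\bigr)\Bigl(\beta^*_{\hat{j}_{\tilde{k}}}w_{\hat{j}_{\tilde{k}}} + \sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l + \xi^*\Bigr)$$

$$= \frac{\bigl\{\beta^*_{\hat{j}_{\tilde{k}}}w^T_{\hat{j}_{\tilde{k}}}(I - H_{\hat{J}_{\tilde{k}-1}})w_{\hat{j}_{\tilde{k}}} + w^T_{\hat{j}_{\tilde{k}}}(I - H_{\hat{J}_{\tilde{k}-1}})\xi^*\bigr\}^2}{n\,w^T_{\hat{j}_{\tilde{k}}}(I - H_{\hat{J}_{\tilde{k}-1}})w_{\hat{j}_{\tilde{k}}}}$$

$$\quad + n^{-1}\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{\hat{J}_{\tilde{k}-1}}\bigr)\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)$$

$$\quad + 2n^{-1}\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{\hat{J}_{\tilde{k}-1}}\bigr)\bigl(\beta^*_{\hat{j}_{\tilde{k}}}w_{\hat{j}_{\tilde{k}}} + \xi^*\bigr).$$

By (4.2),

$$\beta^{*2}_{\hat{j}_{\tilde{k}}}\hat{A}_n + 2\beta^*_{\hat{j}_{\tilde{k}}}\hat{B}_n + \hat{A}_n^{-1}\hat{B}_n^2 + \hat{D}_n + 2\hat{E}_n \le \lambda n^{-1}w_n\,p^{2/q_1}\,m_0\bigl(\hat{C}_n + \sigma^2_{\xi^*}\bigr) \quad \text{on } \{\hat{k} < \tilde{k}\} \cap D_n, \tag{4.3}$$

for some $\lambda > 0$, where

$$\hat{A}_n = n^{-1}w^T_{\hat{j}_{\tilde{k}}}(I - H_{\hat{J}_{\tilde{k}-1}})w_{\hat{j}_{\tilde{k}}},$$

$$\hat{B}_n = n^{-1}w^T_{\hat{j}_{\tilde{k}}}(I - H_{\hat{J}_{\tilde{k}-1}})\xi^*,$$

$$\hat{C}_n = \hat{\sigma}^2_{\hat{J}_{\tilde{k}}} - \sigma^2_{\xi^*},$$

$$\hat{D}_n = n^{-1}\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{\hat{J}_{\tilde{k}-1}}\bigr)\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr),$$

$$\hat{E}_n = n^{-1}\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{\hat{J}_{\tilde{k}-1}}\bigr)\bigl(\beta^*_{\hat{j}_{\tilde{k}}}w_{\hat{j}_{\tilde{k}}} + \xi^*\bigr).$$

In the Appendix, it is shown that for all $\theta > 0$,

$$P\bigl(\hat{A}_n < \tfrac{v_n}{2}, D_n\bigr) + P\bigl(|\hat{B}_n| \ge \theta L_n, D_n\bigr) + P\bigl(|\hat{C}_n| \ge \theta, D_n\bigr) + P\bigl(|\hat{E}_n| \ge \theta L_n^2, D_n\bigr) = o(1), \tag{4.4}$$

where $v_n = \min_{1 \le \#(J) \le m_0}\lambda_{\min}(R(J))$. From (4.3), (4.4), (C6), $\lim_{n \to \infty} P(D_n) = 1$, the nonnegativity of $\hat{D}_n$ and $\hat{A}_n^{-1}\hat{B}_n^2$, and the fact that $\theta$ is arbitrary, it follows that $P(\hat{k} < \tilde{k}) = o(1)$.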

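On $\{\hat{k} > \tilde{k}\}$, by the definition of $\hat{k}$, it follows that

$$\hat{\sigma}^2_{\hat{J}_{\hat{k}}}\exp(n^{-1}w_n\hat{k}\,p^{2/q_1}) \le \hat{\sigma}^2_{\hat{J}_{\tilde{k}}}\exp(n^{-1}w_n\tilde{k}\,p^{2/q_1}),$$

so it can be derived that

$$\xi^{*T}\bigl(H_{\hat{J}_{\hat{k}}} - H_{\hat{J}_{\tilde{k}}}\bigr)\xi^* + \Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\hat{k}}} - H_{\hat{J}_{\tilde{k}}}\bigr)\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr) + 2\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\hat{k}}} - H_{\hat{J}_{\tilde{k}}}\bigr)\xi^*$$

$$\ge \Bigl\{\xi^{*T}(I - H_{\hat{J}_{\tilde{k}}})\xi^* + \Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T(I - H_{\hat{J}_{\tilde{k}}})\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr) + 2\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T(I - H_{\hat{J}_{\tilde{k}}})\xi^*\Bigr\}$$

$$\quad \times \bigl(1 - \exp(-n^{-1}w_n(\hat{k} - \tilde{k})p^{2/q_1})\bigr). \tag{4.5}$$

Let $F_{\hat{k},\tilde{k}}$ denote the $n \times (\hat{k} - \tilde{k})$ matrix whose column vectors are $w_j$, $j \in \hat{J}_{\hat{k}} - \hat{J}_{\tilde{k}}$. We have

$$\xi^{*T}\bigl(H_{\hat{J}_{\hat{k}}} - H_{\hat{J}_{\tilde{k}}}\bigr)\xi^* = \xi^{*T}(I - H_{\hat{J}_{\tilde{k}}})F_{\hat{k},\tilde{k}}\bigl\{F^T_{\hat{k},\tilde{k}}(I - H_{\hat{J}_{\tilde{k}}})F_{\hat{k},\tilde{k}}\bigr\}^{-1}F^T_{\hat{k},\tilde{k}}(I - H_{\hat{J}_{\tilde{k}}})\xi^*$$

$$\le \|\hat{R}^{-1}(\hat{J}_{K_n})\|\,\bigl\|n^{-1/2}F^T_{\hat{k},\tilde{k}}(I - H_{\hat{J}_{\tilde{k}}})\xi^*\bigr\|^2$$

$$\le \|\hat{R}^{-1}(\hat{J}_{K_n})\|\Bigl(2\bigl\|n^{-1/2}F^T_{\hat{k},\tilde{k}}\xi^*\bigr\|^2 + 2\bigl\|n^{-1/2}F^T_{\hat{k},\tilde{k}}H_{\hat{J}_{\tilde{k}}}\xi^*\bigr\|^2\Bigr)$$

$$\le 2(\hat{k} - \tilde{k})(\hat{a}_n + \hat{b}_n),$$

where

$$\hat{a}_n = \|\hat{R}^{-1}(\hat{J}_{K_n})\|\max_{1 \le i \le p}\Bigl(n^{-1/2}\sum_{t=1}^{n}w_{ti}\xi_t^*\Bigr)^2, \qquad \hat{b}_n = \|\hat{R}^{-1}(\hat{J}_{K_n})\|\max_{1 \le \#(J) \le \tilde{k},\, i \notin J}\Bigl(n^{-1/2}\sum_{t=1}^{n}\hat{w}_{ti;J}\xi_t^*\Bigr)^2, \tag{4.6}$$

and it is shown in the Appendix that

$$P\Bigl((\hat{k} - \tilde{k})(\hat{a}_n + \hat{b}_n) \ge \theta n\bigl(1 - \exp(-n^{-1}w_n(\hat{k} - \tilde{k})p^{2/q_1})\bigr),\ \hat{k} > \tilde{k}\Bigr)$$

$$+ P\Bigl(\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\hat{k}}} - H_{\hat{J}_{\tilde{k}}}\bigr)\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr) \ge \theta n\bigl(1 - \exp(-n^{-1}w_n(\hat{k} - \tilde{k})p^{2/q_1})\bigr),\ \hat{k} > \tilde{k}\Bigr)$$

$$+ P\Bigl(\Bigl|\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\hat{k}}} - H_{\hat{J}_{\tilde{k}}}\bigr)\xi^*\Bigr| \ge \theta n\bigl(1 - \exp(-n^{-1}w_n(\hat{k} - \tilde{k})p^{2/q_1})\bigr),\ \hat{k} > \tilde{k}\Bigr)$$

$$+ P\Bigl(\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T(I - H_{\hat{J}_{\tilde{k}}})\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr) \ge \theta n,\ \hat{k} > \tilde{k}\Bigr)$$

$$+ P\Bigl(\Bigl|\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T(I - H_{\hat{J}_{\tilde{k}}})\xi^*\Bigr| \ge \theta n,\ \hat{k} > \tilde{k}\Bigr) = o(1). \tag{4.7}$$

So, by (A.9), (4.5) and (4.7), it follows that $P(\hat{k} > \tilde{k}) = o(1)$, and the proof of Theorem 3 is complete.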

Even though OGA+HDIC includes the true model, the selected model may still contain some redundant variables. We therefore provide a trimming method to remove them. Let

$$\hat{N} = \bigl\{\hat{j}_l : \mathrm{HDIC}(\hat{J}_{\hat{k}} - \{\hat{j}_l\}) > \mathrm{HDIC}(\hat{J}_{\hat{k}}),\ 1 \le l \le \hat{k}\bigr\} \quad \text{if } \hat{k} > 1,$$

and $\hat{N} = \{\hat{j}_1\}$ if $\hat{k} = 1$. $\hat{N}$ is the subset of $\hat{J}_{\hat{k}}$ after trimming. The following theorem shows that OGA+HDIC+Trim achieves the oracle property.
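The trimming rule keeps a selected variable only if removing it increases HDIC. A minimal sketch follows; the function names `hdic` and `trim` are hypothetical, and the penalty weight `w_n` is again an input chosen by the user rather than a value prescribed by the text.

```python
import numpy as np


def hdic(W, y, subset, w_n, q1):
    """HDIC(J) = n log(sigma_hat_J^2) + #(J) * w_n * p^(2/q1)."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = W.shape
    subset = list(subset)
    if subset:
        coef, *_ = np.linalg.lstsq(W[:, subset], y, rcond=None)
        resid = y - W[:, subset] @ coef
    else:
        resid = y  # fitted values are set to 0 for the empty model
    return n * np.log(np.mean(resid ** 2)) + len(subset) * w_n * p ** (2.0 / q1)


def trim(W, y, selected, w_n, q1):
    """Keep the indices in `selected` whose removal makes HDIC strictly larger."""
    selected = list(selected)
    if len(selected) == 1:
        return selected  # the rule keeps the single selected variable
    full_value = hdic(W, y, selected, w_n, q1)
    kept = []
    for j in selected:
        reduced = [i for i in selected if i != j]
        if hdic(W, y, reduced, w_n, q1) > full_value:
            kept.append(j)
    return kept
```

Each candidate is tested against the full selected set rather than against the already trimmed set, which mirrors the one-at-a-time comparison in the definition above.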

Theorem 4. Under the same assumptions as in Theorem 3, $\lim_{n \to \infty} P(\hat{N} = N) = 1$.

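Proof. For $\hat{k} > 1$, define $\delta_l = 1$ if $\mathrm{HDIC}(\hat{J}_{\tilde{k}} - \{\hat{j}_l\}) > \mathrm{HDIC}(\hat{J}_{\tilde{k}})$ and $\delta_l = 0$ otherwise. Then

$$P(\hat{N} \ne N) \le P(\hat{N} \ne N,\ \hat{k} > 1,\ N \subseteq \hat{J}_{\hat{k}}) + P(N \not\subseteq \hat{J}_{\hat{k}}) + P(\hat{N} \ne N,\ \hat{k} = 1)$$

$$\le P(\delta_l = 1 \text{ and } \beta_{\hat{j}_l} = 0 \text{ for some } 1 \le l \le \tilde{k},\ N \subseteq \hat{J}_{\tilde{k}},\ \tilde{k} > 1)$$

$$\quad + P(\delta_l = 0 \text{ and } \beta_{\hat{j}_l} \ne 0 \text{ for some } 1 \le l \le \tilde{k},\ N \subseteq \hat{J}_{\tilde{k}},\ \tilde{k} > 1)$$

$$\quad + P(\hat{k} \ne \tilde{k}) + P(N \not\subseteq \hat{J}_{\hat{k}}) + P(\hat{N} \ne N,\ \hat{k} = 1). \tag{4.8}$$

Let $Q_l = \hat{J}_{\tilde{k}} - \{\hat{j}_l\}$. On $\{\hat{k} = \tilde{k}\}$, by arguments similar to those in the proof of Theorem 3, it can be derived that for all $\theta > 0$ and $1 \le l \le \hat{k}$,

$$P\Bigl((\tilde{a}_n + \tilde{b}_n) \ge \theta n\bigl(1 - \exp(-n^{-1}w_n(\hat{k} - \tilde{k})p^{2/q_1})\bigr)\Bigr) = o(1), \tag{4.9}$$

in which $\tilde{a}_n$ and $\tilde{b}_n$ are the same as $\hat{a}_n$ and $\hat{b}_n$ in (4.6) but with $K_n$ replaced by $\tilde{k}$ and $\tilde{k}$ replaced by $\tilde{k} - 1$, and

$$P\Bigl(\Bigl(\sum_{r \notin Q_l}\beta_r^* w_r\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{Q_l}\bigr)\Bigl(\sum_{r \notin Q_l}\beta_r^* w_r\Bigr) \ge \theta n\bigl(1 - \exp(-n^{-1}w_n(\hat{k} - \tilde{k})p^{2/q_1})\bigr)\Bigr) = o(1), \tag{4.10}$$

$$P\Bigl(\Bigl|\Bigl(\sum_{r \notin Q_l}\beta_r^* w_r\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{Q_l}\bigr)\xi^*\Bigr| \ge \theta n\bigl(1 - \exp(-n^{-1}w_n(\hat{k} - \tilde{k})p^{2/q_1})\bigr)\Bigr) = o(1), \tag{4.11}$$

$$P\bigl(|n^{-1}\xi^{*T}(I - H_{Q_l})\xi^* - \sigma^2_{\xi^*}| \ge \theta\bigr) = o(1), \tag{4.12}$$

$$P\Bigl(\Bigl(\sum_{r \notin Q_l}\beta_r^* w_r\Bigr)^T(I - H_{Q_l})\Bigl(\sum_{r \notin Q_l}\beta_r^* w_r\Bigr) \ge \theta n\Bigr) = o(1), \tag{4.13}$$

$$P\Bigl(\Bigl|\Bigl(\sum_{r \notin Q_l}\beta_r^* w_r\Bigr)^T(I - H_{Q_l})\xi^*\Bigr| \ge \theta n\Bigr) = o(1). \tag{4.14}$$

So, by (4.9)-(4.14), it follows that

$$P(\delta_l = 1 \text{ and } \beta_{\hat{j}_l} = 0 \text{ for some } 1 \le l \le \tilde{k},\ N \subseteq \hat{J}_{\tilde{k}},\ \tilde{k} > 1) = o(1). \tag{4.15}$$

On the other hand,

$$P\bigl(|n^{-1}w^T_{\hat{j}_l}(I - H_{Q_l})\xi^*| \ge \theta L_n,\ D_n\bigr) = o(1), \tag{4.16}$$

$$P\bigl(n^{-1}w^T_{\hat{j}_l}(I - H_{Q_l})w_{\hat{j}_l} \le \tfrac{v_n}{2}\bigr) = o(1), \tag{4.17}$$

$$P\Bigl(\Bigl|n^{-1}\Bigl(\sum_{l \notin \hat{J}_{\tilde{k}}}\beta_l^* w_l\Bigr)^T\bigl(H_{\hat{J}_{\tilde{k}}} - H_{Q_l}\bigr)\bigl(\beta_{\hat{j}_l}w_{\hat{j}_l} + \xi^*\bigr)\Bigr| \ge \theta L_n^2\Bigr) = o(1). \tag{4.18}$$

So, by (A.9), (4.16)-(4.18) and arguments similar to those in the proof of Theorem 3, it follows that

$$P(\delta_l = 0 \text{ and } \beta_{\hat{j}_l} \ne 0 \text{ for some } 1 \le l \le \tilde{k},\ N \subseteq \hat{J}_{\tilde{k}},\ \tilde{k} > 1) = o(1). \tag{4.19}$$

Finally, by (4.8), (4.15), (4.19) and Theorems 2 and 3, we have the desired conclusion.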

5 Simulation Studies

In this section, we report simulation studies of the performance of OGA+HDBIC+Trim. These simulations consider the regression model

$$y_t^* = \sum_{j=1}^{p^0}\beta_j^* w_{tj} + \sum_{j=p^0+1}^{p}\beta_j^* w_{tj} + \xi_t^*, \qquad t = 1, 2, \ldots, n, \tag{5.1}$$

where $\beta_{p^0+1} = \beta_{p^0+2} = \cdots = \beta_p = 0$, $p \gg n$, $\eta_{tj}$ are i.i.d. $N(0, \sigma_\eta^2)$ for all $t = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$, and are independent of $x_{tj}$; $\xi_t$ are i.i.d. $N(0, \sigma_\xi^2)$ and are independent of $x_{tj}$ and $\eta_{tj}$; and $\eta_{y_t}$ are i.i.d. $N(0, \sigma^2_{\eta_y})$ and are independent of $x_{tj}$, $\eta_{tj}$ and $\xi_t$.

Examples 1 and 2 consider the case

$$x_{tj} = d_{tj} + \tilde{\eta}\tilde{x}_t, \tag{5.2}$$

in which $\tilde{\eta} \ge 0$ and $(d_{t1}, d_{t2}, \ldots, d_{tp}, \tilde{x}_t)^T$, $t = 1, 2, \ldots, n$, are i.i.d. normal with mean $(1, 1, \ldots, 1, 0)^T$ and covariance matrix $I$. We standardize the variance of $x_{tj}$ by replacing $x_{tj}$ with $x_{tj}/\sqrt{1 + \tilde{\eta}^2}$. Since for any $J \subset \{1, 2, \ldots, p\}$ and $1 \le i \le p$ with $i \notin J$,

$$\lambda_{\min}(R(J)) = \frac{1}{1 + \tilde{\eta}^2} + \sigma_\eta^2 > 0 \quad \text{and} \quad \|R^{-1}(J)\boldsymbol{\gamma}_i(J)\|_1 < 1,$$

(3.1) is satisfied. Moreover, $\mathrm{Corr}(w_{ti}, w_{tj}) = \tilde{\eta}^2/(1 + \tilde{\eta}^2)$ increases when $\tilde{\eta}$ grows.
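The data-generating design in (5.1) and (5.2) can be sketched as below. This is an illustrative generator, not the simulation code of the thesis: the function name `simulate` and its argument names are hypothetical, and the response is generated from the uncontaminated regressors before the measurement error in the response is added.

```python
import numpy as np


def simulate(n, p, beta, eta_tilde, sigma2_eta, sigma2_xi, sigma2_eta_y, rng):
    """Draw one sample (W, y_obs) of contaminated regressors and responses."""
    beta = np.asarray(beta, dtype=float)
    # d_tj ~ N(1, 1) and the common factor x_tilde_t ~ N(0, 1), all independent.
    d = rng.normal(loc=1.0, scale=1.0, size=(n, p))
    x_common = rng.normal(loc=0.0, scale=1.0, size=(n, 1))
    # x_tj = d_tj + eta_tilde * x_tilde_t, rescaled to unit variance.
    x = (d + eta_tilde * x_common) / np.sqrt(1.0 + eta_tilde ** 2)
    # Observed regressors: w = x + eta with i.i.d. N(0, sigma2_eta) errors.
    w = x + rng.normal(scale=np.sqrt(sigma2_eta), size=(n, p))
    # Response with regression error xi, then measurement error eta_y.
    y = x @ beta + rng.normal(scale=np.sqrt(sigma2_xi), size=n)
    y_obs = y + rng.normal(scale=np.sqrt(sigma2_eta_y), size=n)
    return w, y_obs


# Hypothetical usage with the first five coefficients of Example 1.
rng = np.random.default_rng(0)
beta = np.zeros(1000)
beta[:5] = [3.0, -3.5, 4.0, -2.8, 3.2]
W, y_obs = simulate(50, 1000, beta, 0.0, 0.01, 1.0, 0.01, rng)
```

Setting `eta_tilde` to zero gives uncorrelated regressors, while larger values increase the correlation induced by the shared factor.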

Example 1. Consider (5.1) with $p^0 = 5$, $(\beta_1, \beta_2, \ldots, \beta_5) = (3, -3.5, 4, -2.8, 3.2)$, $\sigma_\xi^2 = 1$, $\sigma^2_{\eta_y} = 0.01$ and assume that (5.2) holds. The cases $\tilde{\eta} = 0$ (which means that the regressors are uncorrelated), $\sigma_\eta^2 = 0.01, 0.5, 0.1$, and $(n, p) = (50, 1000), (100, 2000), (200, 4000)$ are considered here. We choose $K_n = \lfloor 5(n/p^{2/q_1})^{1/2} \rfloor$ and allow $q_1$ to vary between 4 and 15. We have also allowed $D$ in $K_n = \lfloor D(n/p^{2/q_1})^{1/2} \rfloor$ to vary between 3 and 10, and the results are similar to those for $D = 5$. We perform 1000 simulations for each case. Define the mean squared prediction error

$$\mathrm{MSPE} = \frac{1}{1000}\sum_{l=1}^{1000}\Bigl(\sum_{j=1}^{p}\beta_j^* w^{(l)}_{n+1,j} - \hat{y}^{(l)}_{n+1}\Bigr)^2,$$

in which $w^{(l)}_{n+1,1}, w^{(l)}_{n+1,2}, \ldots, w^{(l)}_{n+1,p}$ are the regressors associated with $y^{(l)}_{n+1}$, the new outcome in the $l$th simulation run, and $\hat{y}^{(l)}_{n+1}$ denotes the predictor of $y^{(l)}_{n+1}$.

Table 1 shows that OGA+HDBIC+Trim is very sensitive to the order of moment bounds $q_1$: it performs well with a proper $q_1$ but poorly with an improper one. If $q_1$ is too small, the penalty for the number of predictor variables in HDBIC is too large, so OGA+HDBIC tends to underfit; if $q_1$ is too large, the penalty is too small, so OGA+HDBIC tends to overfit. With a moderate order of moment bounds ($q_1 = 8, 10$), in the simulations for $n \ge 100$, OGA includes the 5 relevant regressors within $K_n$ iterations in 99.9% or more of the simulations, and HDBIC+Trim identifies the smallest correct model in 98% or more of the simulations.
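The iteration bound and the error measure used in this example are simple to compute. The helpers below are a sketch under the formulas stated above; the names `iteration_bound` and `mspe` are hypothetical.

```python
import math

import numpy as np


def iteration_bound(n, p, q1, D=5.0):
    """K_n = floor(D * (n / p^(2/q1))^(1/2))."""
    return math.floor(D * math.sqrt(n / p ** (2.0 / q1)))


def mspe(targets, predictions):
    """Average squared difference between targets and predictions over runs."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    return float(np.mean((targets - predictions) ** 2))


# K_n for the three (n, p) settings of Example 1 at one moment order q1.
for n, p in [(50, 1000), (100, 2000), (200, 4000)]:
    print(n, p, iteration_bound(n, p, q1=8))
```

For a fixed pair (n, p), a larger `q1` shrinks the factor p^(2/q1) and therefore allows more OGA iterations, which is consistent with the lighter penalty and the overfitting seen for large `q1`.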


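Table 1. Frequency, in 1000 simulations, of including all five relevant variables (Correct), of selecting exactly the relevant variables (E), and of selecting all relevant variables and i irrelevant variables (E+i).

| $\sigma_\eta^2$ | $q_1$ | $n$ | $p$ | E | E+1 | E+2 | E+3 | E+4 | E+5 | Correct | MSPE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 4 | 50 | 1000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 64.02502 |
| | | 100 | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53.08281 |
| | | 200 | 4000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55.59686 |
| | 6 | 50 | 1000 | 623 | 0 | 0 | 0 | 0 | 0 | 623 | 24.54740 |
| | | 100 | 2000 | 1000 | 0 | 0 | 0 | 0 | 0 | 1000 | 0.15931 |
| | | 200 | 4000 | 1000 | 0 | 0 | 0 | 0 | 0 | 1000 | 0.08096 |
| | 8 | 50 | 1000 | 911 | 18 | 0 | 0 | 0 | 0 | 929 | 4.34789 |
| | | 100 | 2000 | 1000 | 0 | 0 | 0 | 0 | 0 | 1000 | 0.17550 |
| | | 200 | 4000 | 1000 | 0 | 0 | 0 | 0 | 0 | 1000 | 0.08053 |
| | 10 | 50 | 1000 | 571 | 129 | 43 | 17 | 17 | 7 | 922 | 10.29011 |
| | | 100 | 2000 | 983 | 16 | 1 | 0 | 0 | 0 | 1000 | 0.17837 |
| | | 200 | 4000 | 999 | 1 | 0 | 0 | 0 | 0 | 1000 | 0.16207 |
| | 15 | 50 | 1000 | 0 | 0 | 0 | 0 | 0 | 0 | 914 | 14.44628 |
| | | 100 | 2000 | 21 | 12 | 10 | 7 | 3 | 2 | 1000 | 4.70902 |
| | | 200 | 4000 | 677 | 225 | 75 | 14 | 5 | 2 | 1000 | 0.19443 |
| 0.05 | 5 | 50 | 1000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 65.54043 |
| | | 100 | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53.91476 |
| | | 200 | 4000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47.64495 |
| | 6 | 50 | 1000 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 59.51543 |
| | | 100 | 2000 | 689 | 0 | 0 | 0 | 0 | 0 | 689 | 16.94148 |
| | | 200 | 4000 | 1000 | 0 | 0 | 0 | 0 | 0 | 1000 | 0.18862 |
| | 8 | 50 | 1000 | 816 | 16 | 2 | 0 | 0 | 0 | 834 | 13.14926 |
| | | 100 | 2000 | 1000 | 0 | 0 | 0 | 0 | 0 | 1000 | 0.39365 |
| | | 200 | 4000 | 1000 | 0 | 0 | 0 | 0 | 0 | 1000 | 0.17408 |
| | 10 | 50 | 1000 | 522 | 118 | 36 | 21 | 8 | 14 | 861 | 13.67555 |
| | | 100 | 2000 | 983 | 16 | 1 | 0 | 0 | 0 | 1000 | 0.44005 |
| | | 200 | 4000 | 998 | 2 | 0 | 0 | 0 | 0 | 1000 | 0.17630 |
| | 15 | 50 | 1000 | 0 | 0 | 0 | 0 | 0 | 0 | 854 | 26.64408 |
| | | 100 | 2000 | 11 | 17 | 12 | 10 | 3 | 0 | 1000 | 10.66257 |
| | | 200 | 4000 | 683 | 218 | 75 | 19 | 1 | 2 | 1000 | 0.43310 |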
