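In this section, we report simulation studies of the performance of OGA+HDBIC+Trim. These simulations consider the regression model
\[
y_t^{*} = \sum_{j=1}^{p_0} \beta_j^{*} w_{tj} + \sum_{j=p_0+1}^{p} \beta_j^{*} w_{tj} + \xi_t^{*}, \quad t = 1, 2, \ldots, n, \tag{5.1}
\]
where $\beta_{p_0+1}, \beta_{p_0+2}, \ldots, \beta_p = 0$ and $p \gg n$. The $\eta_{tj}$ are i.i.d. $N(0, \sigma_\eta^2)$ for all $t = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$, and are independent of $x_{tj}$. The $\xi_t$ are i.i.d. $N(0, \sigma_\xi^2)$ and are independent of $x_{tj}$ and $\eta_{tj}$. The $\eta_{yt}$ are i.i.d. $N(0, \sigma_{\eta_y}^2)$ and are independent of $x_{tj}$, $\eta_{tj}$ and $\xi_t$.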
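Examples 1 and 2 consider the case
\[
x_{tj} = d_{tj} + \tilde\eta \tilde x_t, \tag{5.2}
\]
in which $\tilde\eta \ge 0$ and $(d_{t1}, d_{t2}, \ldots, d_{tp}, \tilde x_t)^T$, $t = 1, 2, \ldots, n$, are i.i.d. normal with mean $(1, 1, \ldots, 1, 0)^T$ and covariance matrix $I$. We standardize the variance of $x_{tj}$ by replacing $x_{tj}$ with $x_{tj}/\sqrt{1+\tilde\eta^2}$. Since, for any $J \subset \{1, 2, \ldots, p\}$ and $1 \le i \le p$ with $i \notin J$,
\[
\lambda_{\min}(R(J)) = \frac{1}{1+\tilde\eta^2} + \sigma_\eta^2 > 0 \quad \text{and} \quad \|R^{-1}(J)\gamma_i(J)\|_1 < 1,
\]
(3.1) is satisfied. Moreover, $\mathrm{Corr}(w_{ti}, w_{tj}) = \tilde\eta^2/(1+\tilde\eta^2)$ increases as $\tilde\eta$ grows.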
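As a rough illustration of the design in (5.2), the following NumPy sketch draws the regressors, applies the variance standardization, and adds measurement error. The function name `simulate_design`, the additive form w = x + eta for the observed regressors, the seed, and the sample size used for the check are assumptions made for this sketch; they are not taken from the code behind the reported tables.

```python
import numpy as np

def simulate_design(n, p, eta_tilde, sigma2_eta, rng):
    """Draw regressors x as in (5.2) and noisy versions w = x + eta (assumed form)."""
    # d_tj ~ N(1, 1), independent across t and j
    d = rng.normal(loc=1.0, scale=1.0, size=(n, p))
    # x~_t ~ N(0, 1), one common factor per observation t
    x_common = rng.normal(loc=0.0, scale=1.0, size=(n, 1))
    # (5.2) followed by the standardization x / sqrt(1 + eta_tilde^2)
    x = (d + eta_tilde * x_common) / np.sqrt(1.0 + eta_tilde**2)
    # eta_tj ~ N(0, sigma2_eta), independent of x
    eta = rng.normal(loc=0.0, scale=np.sqrt(sigma2_eta), size=(n, p))
    return x, x + eta

rng = np.random.default_rng(0)  # arbitrary seed
x, w = simulate_design(n=20000, p=4, eta_tilde=2.0, sigma2_eta=0.01, rng=rng)

# Empirical correlation between two regressors versus eta_tilde^2 / (1 + eta_tilde^2)
emp_corr = np.corrcoef(x, rowvar=False)[0, 1]
theory = 2.0**2 / (1.0 + 2.0**2)
print(f"empirical {emp_corr:.3f}, theoretical {theory:.3f}")
```

A large n is used here only so that the empirical correlation is close to its theoretical value; the simulations in this section use n = 50, 100, 200.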
Example 1. Consider (5.1) with $p_0 = 5$, $(\beta_1, \beta_2, \ldots, \beta_5) = (3, -3.5, 4, -2.8, 3.2)$, $\sigma_\xi^2 = 1$, $\sigma_{\eta_y}^2 = 0.01$, and assume that (5.2) holds. We consider $\tilde\eta = 0$, which means the regressors are uncorrelated, $\sigma_\eta^2 = 0.01, 0.05, 0.1$, and $(n, p) = (50, 1000), (100, 2000), (200, 4000)$. We choose $K_n = \lfloor 5(n/p^{2/q_1})^{1/2} \rfloor$ and let $q_1$ vary between 4 and 15. We have also let $D$ in $K_n = \lfloor D(n/p^{2/q_1})^{1/2} \rfloor$ vary between 3 and 10, and the results are similar to those for $D = 5$. We perform 1000 simulations for each case. Define the mean squared prediction error
\[
\mathrm{MSPE} = \frac{1}{1000} \sum_{l=1}^{1000} \Big( \sum_{j=1}^{p} \beta_j^{*} w_{n+1,j}^{(l)} - \hat y_{n+1}^{(l)} \Big)^2,
\]
in which $x_{n+1,1}^{(l)}, x_{n+1,2}^{(l)}, \ldots, x_{n+1,p}^{(l)}$ are the regressors associated with $y_{n+1}^{(l)}$, the new outcome in the $l$th simulation run, and $\hat y_{n+1}^{(l)}$ denotes the predictor of $y_{n+1}^{(l)}$. Table 1 shows that OGA+HDBIC+Trim is very sensitive to the order of the moment bound $q_1$: it performs well for a suitable $q_1$ but poorly otherwise. If $q_1$ is too small, the penalty on the number of predictor variables in HDBIC is too large, so OGA+HDBIC tends to underfit; if $q_1$ is too large, the penalty is too small, so OGA+HDBIC tends to overfit. With a moderate order of the moment bound ($q_1 = 8, 10$) and $n \ge 100$, OGA includes the 5 relevant regressors within $K_n$ iterations in at least 99.9% of the simulations, and HDBIC+Trim identifies the smallest correct model in at least 98% of the simulations.
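To make the iteration cap and the error measure concrete, here is a small Python sketch. It reads the cap as K_n = floor(D (n / p^(2/q1))^(1/2)) with D = 5 by default, and computes the MSPE as the average squared gap between a target and its prediction over simulation runs. The helper names `k_n` and `mspe` are hypothetical and chosen only for this illustration.

```python
import math
import numpy as np

def k_n(n, p, q1, D=5.0):
    """Iteration cap K_n = floor(D * sqrt(n / p**(2/q1)))."""
    return math.floor(D * math.sqrt(n / p ** (2.0 / q1)))

def mspe(target, predicted):
    """Mean over simulation runs of (target - predicted)^2."""
    target = np.asarray(target, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((target - predicted) ** 2))

# Iteration caps for the (n, p) pairs and several values of q1 used in Example 1
for n, p in [(50, 1000), (100, 2000), (200, 4000)]:
    caps = {q1: k_n(n, p, q1) for q1 in (4, 6, 8, 10, 15)}
    print(n, p, caps)
```

For fixed (n, p), a larger q1 gives a larger cap, so OGA is allowed more iterations.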
Table 1. Frequency, in 1000 simulations, of including all five relevant variables (Correct), of selecting exactly the relevant variables (E), and of selecting all relevant variables and i irrelevant variables (E+i).
σ_η^2   q1    n     p      E      E+1   E+2   E+3   E+4   E+5   Correct   MSPE
0.01    4     50    1000   0      0     0     0     0     0     0         64.02502
0.01    4     100   2000   0      0     0     0     0     0     0         53.08281
0.01    4     200   4000   0      0     0     0     0     0     0         55.59686
0.01    6     50    1000   623    0     0     0     0     0     623       24.54740
0.01    6     100   2000   1000   0     0     0     0     0     1000      0.15931
0.01    6     200   4000   1000   0     0     0     0     0     1000      0.08096
0.01    8     50    1000   911    18    0     0     0     0     929       4.34789
0.01    8     100   2000   1000   0     0     0     0     0     1000      0.17550
0.01    8     200   4000   1000   0     0     0     0     0     1000      0.08053
0.01    10    50    1000   571    129   43    17    17    7     922       10.29011
0.01    10    100   2000   983    16    1     0     0     0     1000      0.17837
0.01    10    200   4000   999    1     0     0     0     0     1000      0.16207
0.01    15    50    1000   0      0     0     0     0     0     914       14.44628
0.01    15    100   2000   21     12    10    7     3     2     1000      4.70902
0.01    15    200   4000   677    225   75    14    5     2     1000      0.19443
0.05    5     50    1000   0      0     0     0     0     0     0         65.54043
0.05    5     100   2000   0      0     0     0     0     0     0         53.91476
0.05    5     200   4000   0      0     0     0     0     0     0         47.64495
0.05    6     50    1000   2      0     0     0     0     0     2         59.51543
0.05    6     100   2000   689    0     0     0     0     0     689       16.94148
0.05    6     200   4000   1000   0     0     0     0     0     1000      0.18862
0.05    8     50    1000   816    16    2     0     0     0     834       13.14926
0.05    8     100   2000   1000   0     0     0     0     0     1000      0.39365
0.05    8     200   4000   1000   0     0     0     0     0     1000      0.17408
0.05    10    50    1000   522    118   36    21    8     14    861       13.67555
0.05    10    100   2000   983    16    1     0     0     0     1000      0.44005
0.05    10    200   4000   998    2     0     0     0     0     1000      0.17630
0.05    15    50    1000   0      0     0     0     0     0     854       26.64408
0.05    15    100   2000   11     17    12    10    3     0     1000      10.66257
0.05    15    200   4000   683    218   75    19    1     2     1000      0.43310
0.1     5.5   50    1000   0      0     0     0     0     0     0         59.78768
0.1     5.5   100   2000   0      0     0     0     0     0     0         50.27432
0.1     5.5   200   4000   28     0     0     0     0     0     28        45.76851
0.1     6     50    1000   0      0     0     0     0     0     0         59.57889
0.1     6     100   2000   20     0     0     0     0     0     20        46.85445
0.1     6     200   4000   973    0     0     0     0     0     973       1.57907
0.1     8     50    1000   507    9     0     0     0     0     516       28.21477
0.1     8     100   2000   999    0     0     0     0     0     999       0.65213
0.1     8     200   4000   1000   0     0     0     0     0     1000      0.31026
0.1     10    50    1000   493    94    36    12    8     9     744       22.35131
0.1     10    100   2000   987    13    0     0     0     0     1000      0.63598
0.1     10    200   4000   999    1     0     0     0     0     1000      0.29356
0.1     15    50    1000   0      0     0     0     0     0     773       43.94780
0.1     15    100   2000   16     13    8     4     5     0     1000      16.68930
0.1     15    200   4000   684    222   66    20    3     2     1000      0.83435
Example 2. The settings of this example are the same as in Example 1, except that we let $\sigma_\eta^2$ converge at a rate such that
\[
\|\underset{\sim}{U}\|_1 \le C \sqrt{\frac{p^{2/q_1}}{n}}, \tag{5.3}
\]
for some $C$ varying between 0.01 and 45. Two cases are considered here:
Case 1: Consider $\tilde\eta = 0$, which means the regressors are uncorrelated, and let $\sigma_\eta^2 = C\sqrt{p^{2/q_1}/n}$. In this case, the inequality in (5.3) becomes an equality.
Table 2 shows that the behavior of OGA+HDBIC+Trim agrees with the asymptotic theory of Theorem 4. For $n = 50$, $p = 1000$, OGA includes all relevant variables at least 90% of the time if $C \le 1$ ($\sigma_\eta^2 \le 0.02$); furthermore, with a suitable order of the moment bound ($q_1 = 8$), OGA+HDBIC+Trim identifies the smallest correct model over 88% of the time. For $n \ge 100$, OGA always includes all relevant variables when $C \le 5$ ($\sigma_\eta^2 \le 0.085$); furthermore, when $q_1 = 8, 10$, HDBIC+Trim identifies the smallest correct model at least 98% of the time. For $n = 200$, $p = 4000$, $q_1 = 12$, even if $\sigma_\eta^2 = 0.625$, which is 62.5% of the variance of the true input variables, OGA+HDBIC+Trim still identifies the smallest correct model 91.5% of the time. Since the penalty per predictor variable in HDBIC is $\log(n)\, p^{2/q_1}$, a small $q_1$ is appropriate when $n$ is small, to prevent overfitting; when $n$ is large, a larger $q_1$ can be tolerated without serious overfitting.
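The dependence of the penalty on q1 can be tabulated directly. The sketch below evaluates the per-variable penalty term quoted above, log(n) p^(2/q1), for the (n, p) pairs of the simulations, and also reports it relative to n. The function name `per_variable_penalty` and the choice of q1 values are illustrative assumptions; the full HDBIC criterion is not reproduced here.

```python
import math

def per_variable_penalty(n, p, q1):
    """Penalty term per selected variable: log(n) * p**(2/q1)."""
    return math.log(n) * p ** (2.0 / q1)

for n, p in [(50, 1000), (100, 2000), (200, 4000)]:
    for q1 in (4, 8, 12, 15):
        pen = per_variable_penalty(n, p, q1)
        # The ratio pen / n shows how heavy the penalty is relative to the sample size
        print(f"n={n:4d} p={p:5d} q1={q1:2d} penalty={pen:9.3f} penalty/n={pen / n:.4f}")
```

For fixed (n, p) the penalty decreases as q1 increases, which is consistent with the underfitting seen for small q1 and the overfitting seen for large q1 in the tables.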
Case 2: Consider $\tilde\eta = 2$, which means the regressors are highly correlated (correlation 0.8), and let $\sigma_\eta^2 = C \frac{1}{\sqrt{p}} \sqrt{p^{2/q_1}/n}$, which implies (5.3). Table 3 shows that for $n = 50$, $p = 1000$, the performance of OGA deteriorates: because the regressors are highly correlated, OGA includes all relevant variables only about 50% to 60% of the time even when $C = 0.01$ ($\sigma_\eta^2$ is about 0.0001). However, when $n \ge 100$, $q_1 = 10, 12$ and $C \le 5$ ($\sigma_\eta^2 \le 0.024$), OGA includes all relevant variables at least 98% of the time, and HDBIC+Trim identifies the smallest correct model at least 80% of the time. For $n = 200$, $p = 4000$, $q_1 = 10$, $C = 35$ ($\sigma_\eta^2$ is about 0.09), HDBIC+Trim identifies the smallest correct model 85% of the time.
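The 80% correlation quoted for Case 2 can be checked by hand from (5.2). The short derivation below is a sketch for the error-free regressors $x_{tj}$, using only the stated independence and unit variances of $d_{tj}$ and $\tilde x_t$; the standardization by $\sqrt{1+\tilde\eta^2}$ does not change the correlation.

```latex
\begin{align*}
\mathrm{Var}(x_{tj}) &= \mathrm{Var}(d_{tj}) + \tilde\eta^{2}\,\mathrm{Var}(\tilde x_t) = 1 + \tilde\eta^{2}, \\
\mathrm{Cov}(x_{ti}, x_{tj}) &= \tilde\eta^{2}\,\mathrm{Var}(\tilde x_t) = \tilde\eta^{2}, \qquad i \neq j, \\
\mathrm{Corr}(x_{ti}, x_{tj}) &= \frac{\tilde\eta^{2}}{1 + \tilde\eta^{2}}
  = \begin{cases} 0, & \tilde\eta = 0, \\ 4/5 = 0.8, & \tilde\eta = 2. \end{cases}
\end{align*}
```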
Table 2. Case 1 in Example 2, with $\tilde\eta = 0$ and $\sigma_\eta^2 = C\sqrt{p^{2/q_1}/n}$. The other notation is the same as in Table 1.
q1    C     n     p      E      E+1   E+2   E+3   E+4   E+5   Correct   MSPE       σ_η^2
8     0.01  50    1000   913    13    0     0     0     0     926       6.28107    0.00020
8     0.01  100   2000   999    1     0     0     0     0     1000      0.11205    0.00016
8     0.01  200   4000   1000   0     0     0     0     0     1000      0.05414    0.00012
8     1     50    1000   887    13    0     0     0     0     900       7.44817    0.02075
8     1     100   2000   1000   0     0     0     0     0     1000      0.18092    0.01592
8     1     200   4000   1000   0     0     0     0     0     1000      0.08328    0.01223
8     5     50    1000   459    8     1     0     0     0     468       31.75064   0.11312
8     5     100   2000   999    1     0     0     0     0     1000      0.53822    0.08503
8     5     200   4000   1000   0     0     0     0     0     1000      0.20888    0.06431
8     20    50    1000   0      0     0     0     0     0     0         45.31566   0.68492
8     20    100   2000   6      0     0     0     0     0     6         34.84350   0.45657
8     20    200   4000   963    0     0     0     0     0     963       1.03303    0.31875
10    0.01  50    1000   569    125   46    24    20    16    920       9.10107    0.00017
10    0.01  100   2000   984    16    0     0     0     0     1000      0.10932    0.00013
10    0.01  200   4000   1000   0     0     0     0     0     1000      0.05647    0.00010
10    1     50    1000   559    130   59    26    11    9     911       8.49444    0.01740
10    1     100   2000   980    19    1     0     0     0     1000      0.19531    0.01313
10    1     200   4000   997    3     0     0     0     0     1000      0.07708    0.00992
10    5     50    1000   493    101   35    18    10    8     763       18.06490   0.09350
10    5     100   2000   983    16    1     0     0     0     1000      0.50227    0.06929
10    5     200   4000   999    1     0     0     0     0     1000      0.19478    0.05165
10    35    50    1000   0      0     0     0     0     0     0         42.26501   1.49096
10    35    100   2000   13     2     0     0     0     0     15        28.75040   0.83021
10    35    200   4000   901    0     0     0     0     0     901       1.74922    0.52387
12    0.01  50    1000   6      2     1     0     1     1     932       14.44258   0.00015
12    0.01  100   2000   789    151   37    16    4     1     1000      0.21749    0.00011
12    0.01  200   4000   970    28    2     0     0     0     1000      0.05420    0.00009
12    1     50    1000   0      4     1     1     0     0     908       17.25881   0.01548
12    1     100   2000   781    170   35    5     4     1     1000      0.39322    0.01155
12    1     200   4000   973    25    1     1     0     0     1000      0.08424    0.00863
12    5     50    1000   3      3     1     0     2     8     777       41.15342   0.08249
12    5     100   2000   798    147   39    7     3     2     1000      0.81430    0.06055
12    5     200   4000   999    1     0     0     0     0     1000      0.19478    0.05165
12    45    50    1000   0      0     0     0     0     0     0         119.84890  2.18342
12    45    100   2000   14     3     1     0     0     0     18        26.23490   1.05687
12    45    200   4000   915    25    2     0     0     0     942       1.53381    0.62584
Table 3. Case 2 in Example 2, with $\tilde\eta = 2$ and $\sigma_\eta^2 = C \frac{1}{\sqrt{p}} \sqrt{p^{2/q_1}/n}$. The other notation is the same as in Table 1.
q1    C     n     p      E      E+1   E+2   E+3   E+4   E+5   Correct   MSPE       σ_η^2
8     0.01  50    1000   507    3     3     0     0     0     513       4.75287    0.00011
8     0.01  100   2000   1000   0     0     0     0     0     1000      0.04985    0.00006
8     0.01  200   4000   1000   0     0     0     0     0     1000      0.02725    0.00003
8     1     50    1000   214    3     0     0     0     0     217       7.69615    0.01061
8     1     100   2000   994    0     0     0     0     0     994       0.09892    0.00578
8     1     200   4000   1000   0     0     0     0     0     1000      0.03679    0.00315
8     5     50    1000   2      0     0     0     0     0     2         11.83424   0.05303
8     5     100   2000   711    0     0     0     0     0     711       0.45313    0.02891
8     5     200   4000   1000   0     0     0     0     0     1000      0.08263    0.01576
8     15    50    1000   0      0     0     0     0     0     0         15.98425   0.15908
8     15    100   2000   2      0     0     0     0     0     2         9.95848    0.08674
8     15    200   4000   971    0     0     0     0     0     971       0.48686    0.04729
10    0.01  50    1000   466    77    21    21    3     4     610       3.65043    0.00010
10    0.01  100   2000   990    10    0     0     0     0     1000      0.05153    0.00005
10    0.01  200   4000   1000   0     0     0     0     0     1000      0.02699    0.00003
10    1     50    1000   343    43    19    6     2     3     429       5.37950    0.00892
10    1     100   2000   982    17    0     0     0     0     999       0.07926    0.00478
10    1     200   4000   997    3     0     0     0     0     1000      0.02730    0.00256
10    5     50    1000   52     8     2     1     0     0     67        9.36974    0.04462
10    5     100   2000   958    21    1     0     0     0     980       0.33916    0.02391
10    5     200   4000   1000   1     0     0     0     0     1000      0.07646    0.01281
10    35    50    1000   0      0     0     0     0     0     0         20.92983   0.31231
10    35    100   2000   1      0     0     0     0     0     1         13.52782   0.16736
10    35    200   4000   851    8     0     0     0     0     859       1.35893    0.08969
12    0.01  50    1000   64     20    13    7     10    14    573       5.72614    0.00008
12    0.01  100   2000   834    122   32    6     5     0     999       0.09410    0.00004
12    0.01  200   4000   980    19    1     0     0     0     1000      0.02973    0.00002
12    1     50    1000   59     17    14    9     8     1     466       7.44604    0.00795
12    1     100   2000   855    105   28    7     4     0     1000      0.10421    0.00421
12    1     200   4000   978    22    0     0     0     0     1000      0.03700    0.00223
12    5     50    1000   12     10    2     2     3     0     97        12.88040   0.03976
12    5     100   2000   803    137   35    10    4     0     993       0.31848    0.02106
12    5     200   4000   969    30    1     0     0     0     1000      0.07507    0.01116
12    40    50    1000   0      0     0     0     0     0     0         26.24179   0.31811
12    40    100   2000   7      7     0     0     0     0     14        11.19198   0.16851
12    40    200   4000   816    144   2     0     0     0     962       1.10595    0.08927
References
Abhirup Datta and Hui Zou. (2016). CoCoLasso for High-dimensional Error-in-variables Regression. https://arxiv.org/abs/1510.07123.
Alexandre Belloni, Mathieu Rosenbaum and Alexandre B. Tsybakov. (2014). An {ℓ1, ℓ2, ℓ∞}-Regularization Approach to High-Dimensional Errors-in-variables Models. https://arxiv.org/abs/1412.7216.
Alexandre Belloni, Mathieu Rosenbaum and Alexandre B. Tsybakov. (2016). Linear and Conic Programming Estimators in High-Dimensional Errors-in-variables Models. https://arxiv.org/abs/1408.0241.
Ching-Kang Ing and Tze Leung Lai. (2011). A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statist. Sinica, 1473-1513.
Ching-Kang Ing and Kunling Huang. (2016). Model Selection for High-Dimensional Multivariate Time Dependent Models (Unpublished master’s thesis). National Taiwan University, Taipei City.
C. Z. Wei. (1987). Adaptive prediction by least squares predictors in stochastic regression models with applications to time series. Ann. Statist. 15(4):1667-1682.
David F. Findley and Ching-Zong Wei. (1993). Moment bounds for deriving time series CLTs and model selection procedures. Statist. Sinica, 453-480.
Po-Ling Loh and Martin J. Wainwright. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Ann. Statist. 40(3):1637-1664.
Temlyakov, V. N. (2000). Weak greedy algorithms. Adv. Comput. Math. 12, 213-227.
Appendix
Before proving (3.4), we need three lemmas.
Lemma 1. Assume (C2), $\max_{1 \le i,j \le p}$
Lemma 2. Assume (C1)-(C4), $\max_{1 \le i \le p}$
it follows from (C2) that $\max_{1 \le i,j \le p}$
Similarly,
Proof of Lemma 2. Note that E( max
where the third inequality comes from Lemma 2 in Wei (1987), $K$ is a positive constant depending only on $q_1$ and $C_2$ in (C1), and the last inequality comes from Jensen’s inequality. From (C1), it follows that
E1 = O(
by (C1) and similar arguments in the derivation of (A.4), it follows that E( max
The last equality comes from Lemma 1, (A.5), (C3) and (C4). By (A.4)-(A.7) and Markov’s inequality, the proof of Lemma 2 is complete.
Proof of Lemma 3. For (A.1), note that max Lemma 1, we have the desired conclusion. By (A7) and (A8) in Ing and Lai (2011) and (A.1), we have (A.2). Furthermore, since $\max_{1 \le \#(J) \le K_n}$ so, we have (A.3), and the tools for proving (3.4) are ready.
Proof of (3.4). Since by Lemma 1, n12||wi||−→ σp ii ≥ min
where $w_{ti;J^{\perp}} = w_{ti} - \gamma_i^T(J) R^{-1}(J) w_t(J)$. Since and the proof of (3.4) is complete.
Proof of (4.4). By the proof of (3.4), it follows that P (| ˆBn| ≥ θLn, Dn) ≤ P ( max
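the equality comes from (C1), (C7), (C8), (A.8) and Lemmas 1-3. Note that
\[
\begin{aligned}
P(\hat A_n < v_n/2, D_n) &\le P(\lambda_{\min}(\hat R(\hat J_{\tilde k})) < v_n/2, D_n) \\
&\le P(\lambda_{\min}(\hat R(\hat J_{m_0})) < v_n/2) \\
&\le P\Big(\max_{1 \le \#(J) \le m_0} \|\hat R(J) - R(J)\| > \delta/2\Big) \\
&= o(1),
\end{aligned} \tag{A.10}
\]
the equality comes from Lemma 3, and
\[
\begin{aligned}
P(|\hat E_n| \ge \theta L_n^2, D_n)
&\le P\Big(\|\underset{\sim}{U}\|_1 \max_{1 \le j \le p} |\beta_j^{*}| \Big( \max_{1 \le i,j \le p} \Big| \frac{1}{n} \sum_{t=1}^{n} w_{ti} w_{tj} - \sigma_{ij} \Big| + \max_{1 \le i,j \le p} |\sigma_{ij}| \Big) \ge \frac{\theta}{2} L_n^2 \Big) \\
&\quad + P\Big(\|\underset{\sim}{U}\|_1 \, 2 (S_{2,n} + S_{3,n}) \ge \frac{\theta}{2} L_n^2 \Big) \\
&= o(1),
\end{aligned} \tag{A.11}
\]
where $S_{2,n}$ and $S_{3,n}$ are the same as those in the proof of (3.4), and the equality comes from (C1), (C3), (C7), Lemma 1 and the proof of (3.4). By (A.8)-(A.11), the proof of (4.4) is complete.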
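The third inequality in (A.10) can be read as a standard eigenvalue perturbation step. The sketch below spells it out under assumptions that are not restated in this appendix: $\|\cdot\|$ is taken to be the spectral norm, the thresholds in (A.10) are read as $v_n/2$ and $\delta/2$, and $v_n$ is assumed to be a lower bound on $\lambda_{\min}(R(J))$ over $1 \le \#(J) \le m_0$ with $v_n > \delta$.

```latex
\begin{align*}
\lambda_{\min}(\hat R(J)) &\ge \lambda_{\min}(R(J)) - \|\hat R(J) - R(J)\|
  \ge v_n - \|\hat R(J) - R(J)\|, \\
\text{so}\quad \lambda_{\min}(\hat R(J)) < \frac{v_n}{2}
  &\;\Longrightarrow\; \|\hat R(J) - R(J)\| > \frac{v_n}{2} > \frac{\delta}{2}.
\end{align*}
```

The first line is Weyl's inequality for symmetric matrices; taking the maximum over $1 \le \#(J) \le m_0$ then gives the event bounded in (A.10).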
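Proof of (4.7). Since there exist a constant $\tilde\lambda > 0$ and $\zeta_n \to \infty$ such that
\[
\frac{n\big(1 - \exp(-n^{-1} w_n (\hat k - \tilde k) p^{2/q_1})\big)}{\hat k - \tilde k}
\ge \tilde\lambda \min\{(n p^{2/q_1})^{1/2}, w_n p^{2/q_1}\} = \zeta_n p^{2/q_1}, \tag{A.12}
\]
it follows from (A.12) that
\[
\begin{aligned}
&P\big((\hat k - \tilde k)(\hat a_n + \hat b_n) \ge \theta n (1 - \exp(-n^{-1} w_n (\hat k - \tilde k) p^{2/q_1})), \hat k > \tilde k\big) \\
&\quad \le P\Big(\|\hat R^{-1}(\hat J_{K_n})\| \max_{1 \le i \le p} \Big(n^{-1/2} \sum_{t=1}^{n} w_{ti} \xi_t^{*}\Big)^2 \ge \frac{\theta}{2} \zeta_n p^{2/q_1}\Big) \\
&\qquad + P\Big(\|\hat R^{-1}(\hat J_{K_n})\| \, \|n(S_{2,n} + S_{3,n})\| \ge \frac{\theta}{2} \zeta_n p^{2/q_1}\Big) \\
&\quad = o(1),
\end{aligned} \tag{A.13}
\]
the equality comes from Lemmas 2 and 3 and the proof of (3.4). Note that
\[
\begin{aligned}
&P\Big(\Big(\sum_{l \notin \hat J_{\tilde k}} \beta_l^{*} w_l\Big)^T (H_{\hat J_{\hat k}} - H_{\hat J_{\tilde k}}) \Big(\sum_{l \notin \hat J_{\tilde k}} \beta_l^{*} w_l\Big) \ge \theta n (1 - \exp(-n^{-1} w_n (\hat k - \tilde k) p^{2/q_1})), \hat k > \tilde k\Big) \\
&\quad \le P\Big(\|\underset{\sim}{U}\|_1^2 \Big( \max_{1 \le i,j \le p} \Big| \frac{1}{n} \sum_{t=1}^{n} w_{ti} w_{tj} - \sigma_{ij} \Big| + \max_{1 \le i,j \le p} |\sigma_{ij}| \Big) \ge \theta \min\Big\{\sqrt{\frac{p^{2/q_1}}{n}}, n^{-1} w_n p^{2/q_1}\Big\} (\hat k - \tilde k), \hat k > \tilde k\Big) \\
&\quad \le P\Big(\|\underset{\sim}{U}\|_1^2 \Big( \max_{1 \le i,j \le p} \Big| \frac{1}{n} \sum_{t=1}^{n} w_{ti} w_{tj} - \sigma_{ij} \Big| + \max_{1 \le i,j \le p} |\sigma_{ij}| \Big) \ge \theta \min\Big\{\sqrt{\frac{p^{2/q_1}}{n}}, L_n^4\Big\}\Big) \\
&\quad = o(1),
\end{aligned} \tag{A.14}
\]
and
\[
\begin{aligned}
&P\Big(\Big|\Big(\sum_{l \notin \hat J_{\tilde k}} \beta_l^{*} w_l\Big)^T (H_{\hat J_{\hat k}} - H_{\hat J_{\tilde k}}) \underset{\sim}{\xi}^{*}\Big| \ge \theta n (1 - \exp(-n^{-1} w_n (\hat k - \tilde k) p^{2/q_1})), \hat k > \tilde k\Big) \\
&\quad \le P\Big(\|\underset{\sim}{U}\|_1 \, 2 (S_{2,n} + S_{3,n}) \ge \theta \min\Big\{\sqrt{\frac{p^{2/q_1}}{n}}, L_n^4\Big\}\Big) \\
&\quad = o(1).
\end{aligned} \tag{A.15}
\]
Similarly to (A.14) and (A.15), it follows that
\[
P\Big(\Big(\sum_{l \notin \hat J_{\tilde k}} \beta_l^{*} w_l\Big)^T (I - H_{\hat J_{\tilde k}}) \Big(\sum_{l \notin \hat J_{\tilde k}} \beta_l^{*} w_l\Big) \ge \theta n, \hat k > \tilde k\Big) = o(1), \tag{A.16}
\]
\[
P\Big(\Big|\Big(\sum_{l \notin \hat J_{\tilde k}} \beta_l^{*} w_l\Big)^T (I - H_{\hat J_{\tilde k}}) \underset{\sim}{\xi}^{*}\Big| \ge \theta n, \hat k > \tilde k\Big) = o(1). \tag{A.17}
\]
So, by (A.13)-(A.17), the proof of (4.7) is complete.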
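The equality at the end of (A.12) is a matter of factoring out $p^{2/q_1}$. The short computation below makes the resulting $\zeta_n$ explicit; it reads the garbled exponents in (A.12) as $p^{2/q_1}$ and $(\cdot)^{1/2}$, and the divergence of $\zeta_n$ additionally assumes that both $n/p^{2/q_1} \to \infty$ and $w_n \to \infty$, which are not restated in this appendix.

```latex
\begin{align*}
\tilde\lambda \min\{(n p^{2/q_1})^{1/2},\, w_n p^{2/q_1}\}
  &= p^{2/q_1}\, \tilde\lambda \min\Big\{\Big(\frac{n}{p^{2/q_1}}\Big)^{1/2},\, w_n\Big\}, \\
\text{so that}\quad \zeta_n
  &= \tilde\lambda \min\Big\{\Big(\frac{n}{p^{2/q_1}}\Big)^{1/2},\, w_n\Big\}.
\end{align*}
```

Under this reading, the first argument of the minimum has the same order as the iteration cap $K_n = \lfloor D (n/p^{2/q_1})^{1/2} \rfloor$ used in the simulations.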