Asymptotic Frameworks - Large Sample Theory for Model Selection in Spatial Statistics

There are two asymptotic frameworks in geostatistics, which differ in their assumptions on the domain $D$. One is the fixed domain asymptotic framework, where data are sampled more and more densely in a bounded fixed region $D$. The other is the increasing domain asymptotic framework, where $|D| \to \infty$ as $n \to \infty$; this is the framework usually considered in time series analysis. The fixed domain asymptotic framework is somewhat unique to geostatistics and tends to exhibit unusual asymptotic behavior, because only limited information is available in a bounded fixed region.

Asymptotic properties under the increasing domain asymptotic framework are more standard. Suppose that we observe data $Z$ according to (2.2), where $\mu(s) = x(s)'\beta$ is known, but $\mathrm{var}(Z)$ depends on some unknown parameter vector $\theta$. Then Mardia and Marshall (1984) show under some regularity conditions that

$$\hat{\theta}_{ML} \sim N\big(\theta_0, I^{-1}(\theta_0)\big), \qquad (2.10)$$

where $\theta_0$ is the true parameter vector and $I(\theta_0)$ is the Fisher information. However, (2.10) is generally not satisfied under the fixed domain asymptotic framework, and in fact some components of $\theta$ cannot be consistently estimated. For example, suppose that $\eta(\cdot)$ is generated from a Matérn covariance function of (2.5) with $\nu$ known but $\sigma_\eta^2$ and $\kappa_\eta$ unknown. Zhang (2004) shows that the ML estimates of $\sigma_\eta^2$ and $\kappa_\eta$ are inconsistent under the fixed domain asymptotic framework. That is,

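$$\lim_{n\to\infty} P\big(|\hat{\sigma}_\eta^2 - \sigma_{\eta,0}^2| > \varepsilon\big) > 0,$$

and

$$\lim_{n\to\infty} P\big(|\hat{\kappa}_\eta - \kappa_{\eta,0}| > \varepsilon\big) > 0,$$

for any $\varepsilon > 0$, where $\sigma_{\eta,0}^2$ and $\kappa_{\eta,0}$ are the corresponding true parameters. However, as shown in the following proposition, some function of $\sigma_\eta^2$ and $\kappa_\eta$ can be consistently estimated.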

Proposition 1 (Zhang, 2004) Consider an increasing sequence of finite subsets $D_n$ of $\mathbb{R}^d$, for $d = 1, 2, 3$, such that $\bigcup_{n=1}^{\infty} D_n$ is bounded and infinite. Suppose that the data $Z$ are observed on $D = D_n$ according to (2.2) with $\beta = 0$ and $\sigma_\epsilon^2 = 0$ known, where $\eta(\cdot)$ is a Gaussian process with a Matérn covariance function of (2.5) and $\nu > 0$ is known.

Assume that $\sigma_{\eta,0}^2 > 0$ and $\kappa_{\eta,0} > 0$ are the true parameters corresponding to $\sigma_\eta^2$ and $\kappa_\eta$. If $\kappa_\eta$ is fixed at some constant $\kappa_1 > 0$ and $\hat{\sigma}_\eta^2$ is the corresponding ML estimate of $\sigma_\eta^2$, then

$$\hat{\sigma}_\eta^2 \kappa_1^{2\nu} \xrightarrow{p} \sigma_{\eta,0}^2 \kappa_{\eta,0}^{2\nu}, \quad \text{as } n \to \infty.$$

Ying (1991) shows similar results for the exponential covariance function, which is the special case of the Matérn class with $\nu = 0.5$.

Proposition 2 (Ying, 1991) Suppose that the data $Z$ are observed on $D = [0, 1]$ according to (2.2) with $\beta = 0$ and $\sigma_\epsilon^2 = 0$ known, where $\eta(\cdot)$ is a zero-mean Gaussian process with an exponential covariance function of (2.3). Let $\Theta$ be the parameter space of $(\sigma_\eta^2, \kappa_\eta)'$. Assume that either $\Theta = [a, b] \times (0, \infty)$ or $\Theta = (0, \infty) \times [a, b]$, where $0 < a \le b < \infty$, and that the true parameter vector $(\sigma_{\eta,0}^2, \kappa_{\eta,0})' \in \Theta$.

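(i) Let $\hat{\sigma}_\eta^2$ and $\hat{\kappa}_\eta$ be the ML estimates of $\sigma_\eta^2$ and $\kappa_\eta$. Then

$$\sqrt{n}\,\big(\hat{\sigma}_\eta^2 \hat{\kappa}_\eta - \sigma_{\eta,0}^2 \kappa_{\eta,0}\big) \xrightarrow{d} N\big(0,\, 2(\sigma_{\eta,0}^2 \kappa_{\eta,0})^2\big), \quad \text{as } n \to \infty.$$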

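(ii) Suppose that $\kappa_\eta$ is fixed at some constant $\kappa_1 > 0$ and $\hat{\sigma}_1^2$ is the corresponding ML estimate of $\sigma_\eta^2$. Then

$$\sqrt{n}\left(\hat{\sigma}_1^2 - \frac{\sigma_{\eta,0}^2 \kappa_{\eta,0}}{\kappa_1}\right) \xrightarrow{d} N\left(0,\, 2\left(\frac{\sigma_{\eta,0}^2 \kappa_{\eta,0}}{\kappa_1}\right)^2\right), \quad \text{as } n \to \infty.$$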

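(iii) Suppose that $\sigma_\eta^2$ is fixed at some constant $\sigma_1^2$ and $\hat{\kappa}_1$ is the corresponding ML estimate of $\kappa_\eta$. Then

$$\sqrt{n}\left(\hat{\kappa}_1 - \frac{\sigma_{\eta,0}^2 \kappa_{\eta,0}}{\sigma_1^2}\right) \xrightarrow{d} N\left(0,\, 2\left(\frac{\sigma_{\eta,0}^2 \kappa_{\eta,0}}{\sigma_1^2}\right)^2\right), \quad \text{as } n \to \infty.$$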
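Part (ii) of Proposition 2 can be illustrated with a small simulation. The sketch below is only a rough numerical check, not part of the cited results: it assumes that the exponential covariance of (2.3) has the form $K(h) = \sigma_\eta^2 \exp(-\kappa_\eta |h|)$, uses a regular design on $[0, 1]$, and picks illustrative parameter values. The function names are hypothetical.

```python
import numpy as np

def exp_cov(s, sigma2, kappa):
    """Assumed form of (2.3): K(h) = sigma2 * exp(-kappa * |h|)."""
    h = np.abs(s[:, None] - s[None, :])
    return sigma2 * np.exp(-kappa * h)

def ml_sigma2_fixed_kappa(z, s, kappa1):
    """ML estimate of sigma2 with kappa held fixed at kappa1 (zero mean, no nugget)."""
    R = exp_cov(s, 1.0, kappa1)              # correlation matrix under kappa1
    return float(z @ np.linalg.solve(R, z)) / len(z)

rng = np.random.default_rng(0)
sigma2_0, kappa_0 = 1.0, 2.0                 # true parameters (illustrative values)
kappa_1 = 4.0                                # deliberately misspecified fixed value
n, n_rep = 400, 20
s = np.linspace(0.0, 1.0, n)                 # dense regular design on D = [0, 1]
chol = np.linalg.cholesky(exp_cov(s, sigma2_0, kappa_0))

est = np.array([
    ml_sigma2_fixed_kappa(chol @ rng.standard_normal(n), s, kappa_1)
    for _ in range(n_rep)
])
mean_sigma2 = float(est.mean())
mean_product = mean_sigma2 * kappa_1

print(f"mean of sigma2_hat          : {mean_sigma2:.3f} (true sigma2 = {sigma2_0})")
print(f"mean of sigma2_hat * kappa_1: {mean_product:.3f} (target = {sigma2_0 * kappa_0})")
```

Under these assumptions the estimate of $\sigma_\eta^2$ alone should settle near $\sigma_{\eta,0}^2 \kappa_{\eta,0} / \kappa_1$ rather than near $\sigma_{\eta,0}^2$, while the product with $\kappa_1$ should be close to $\sigma_{\eta,0}^2 \kappa_{\eta,0}$.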

The parameters $\sigma_\eta^2 \kappa_\eta^{2\nu}$ in Proposition 1 and $\sigma_\eta^2 \kappa_\eta$ in Proposition 2 are called microergodic parameters (Matheron 1971, 1989; Stein 1999); roughly speaking, a microergodic parameter can be recovered with probability 1 from observations in a bounded fixed region. Stein (1999) also shows that these parameters play an important role in spatial prediction. Specifically, consider the spectral density function of $K(h)$, $h \in \mathbb{R}^d$:

$$f(\omega) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \exp(-i\omega' h)\, K(h)\, dh; \quad \omega \in \mathbb{R}^d.$$

Stein shows that under the fixed domain asymptotic framework, $f(\omega)$ contributes to the mean squared prediction error mainly through its behavior at large $|\omega|$, which is governed by some microergodic parameters. He also provides specific examples for exponential and Matérn covariance functions.

For $\sigma_\epsilon^2 > 0$ in (2.2), Chen et al. (2000) provide the following results regarding the ML estimates of $\sigma_\eta^2$, $\kappa_\eta$ and $\sigma_\epsilon^2$.

Proposition 3 (Chen et al. 2000) Suppose that the data $Z$ are observed regularly on $D = [0, 1]$ according to (2.2) with $\beta = 0$ known, where $\eta(\cdot)$ is a zero-mean Gaussian process with an exponential covariance function of (2.3). Assume that $(\sigma_\epsilon^2, \sigma_\eta^2, \kappa_\eta)' \in \Theta$, where $\Theta \subset (0, \infty)^3$ is a compact set, and that the true parameter vector $(\sigma_{\epsilon,0}^2, \sigma_{\eta,0}^2, \kappa_{\eta,0})' \in \Theta$.

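(i) Let $\hat{\sigma}_\epsilon^2$, $\hat{\sigma}_\eta^2$ and $\hat{\kappa}_\eta$ be the ML estimates of $\sigma_\epsilon^2$, $\sigma_\eta^2$ and $\kappa_\eta$. Then, as $n \to \infty$,

$$\begin{pmatrix} n^{1/4}\big(\hat{\sigma}_\eta^2 \hat{\kappa}_\eta - \sigma_{\eta,0}^2 \kappa_{\eta,0}\big) \\ n^{1/2}\big(\hat{\sigma}_\epsilon^2 - \sigma_{\epsilon,0}^2\big) \end{pmatrix} \xrightarrow{d} N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 4\sqrt{2}\, \sigma_{\epsilon,0} (\sigma_{\eta,0}^2 \kappa_{\eta,0})^{3/2} & 0 \\ 0 & 2\sigma_{\epsilon,0}^4 \end{pmatrix} \right).$$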

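(ii) Suppose that $\kappa_\eta$ is known and $\hat{\sigma}_\epsilon^2$ and $\hat{\sigma}_\eta^2$ are the corresponding ML estimates of $\sigma_\epsilon^2$ and $\sigma_\eta^2$. Then, as $n \to \infty$,

$$\begin{pmatrix} n^{1/4}\big(\hat{\sigma}_\eta^2 - \sigma_{\eta,0}^2\big) \\ n^{1/2}\big(\hat{\sigma}_\epsilon^2 - \sigma_{\epsilon,0}^2\big) \end{pmatrix} \xrightarrow{d} N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 4\sqrt{2}\, \sigma_{\epsilon,0}\, \sigma_{\eta,0}^3\, \kappa_{\eta,0}^{-1/2} & 0 \\ 0 & 2\sigma_{\epsilon,0}^4 \end{pmatrix} \right).$$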

In Chapter 5, we shall provide the convergence rates for the ML estimates of $\sigma_\epsilon^2$, $\sigma_\eta^2$ and $\kappa_\eta$ under more general spatial domains with $D = [0, n^\delta]$ and $\delta \in [0, 1)$. In addition, those convergence rates will be given under the geostatistical regression models of (2.2), not only for the true model but also for underfitted and overfitted models.

Chapter 3 Variable Selection

Consider the geostatistical regression model of (2.2). Suppose that we observe spatial data $\{x(s_i), Z(s_i)\}$, $s_i \in D$, $i = 1, \ldots, n$. This model reduces to the usual regression model when $\eta(\cdot) = 0$. As in linear regression, a large model with many insignificant variables tends to produce a large variance, resulting in low predictive power. On the other hand, a small model that ignores some important variables may produce a large bias. To achieve a good compromise between bias and variance, it is essential to identify the significant variables. Clearly, variable selection is essential not only in regression but also in geostatistical regression.

We consider selecting a subset of $\{1, \ldots, p\}$ corresponding to $p$ explanatory variables. Let $\mathcal{A} \subseteq 2^{\{1,\ldots,p\}}$ be the set of all candidate models, and let $\alpha \in \mathcal{A}$ denote a candidate model. Note that the intercept is always included in our models, and $\alpha = \emptyset$ corresponds to the intercept-only model.

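Let $X(\alpha)$ be the $n \times p(\alpha)$ sub-matrix of $X$ containing the columns corresponding to $\alpha$, and let $\beta(\alpha)$ be the sub-vector of $\beta$ corresponding to $X(\alpha)$. A model $\alpha$ is said to be correct if $\mu(s)$ can be written as $\sum_{j \in \alpha} \beta_j x_j(s)$, for $s \in D$. Let $\mathcal{A}^c \subset \mathcal{A}$ be the set of all correct models and let

$$\alpha^c = \arg\min_{\alpha \in \mathcal{A}^c} |\alpha|$$

be the correct model having the smallest number of variables. Then $\mathcal{A}^c = \{\alpha \in \mathcal{A} : \alpha^c \subseteq \alpha\}$.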
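The candidate set, the smallest correct model and the set of correct models can be enumerated directly for small $p$. The following is a minimal sketch under illustrative assumptions: the candidate set is taken to be the full power set of $\{1, \ldots, p\}$, the coefficient values are hypothetical, and the intercept is left implicit.

```python
from itertools import combinations

def all_subsets(p):
    """All subsets of {1, ..., p} as frozensets (the intercept is implicit)."""
    idx = range(1, p + 1)
    return [frozenset(c) for k in range(p + 1) for c in combinations(idx, k)]

def correct_models(candidates, alpha_c):
    """Candidates that contain the smallest correct model alpha_c."""
    return [a for a in candidates if alpha_c <= a]

p = 3
beta = {1: 1.5, 2: 0.0, 3: -0.7}             # hypothetical true coefficients
alpha_c = frozenset(j for j, b in beta.items() if b != 0.0)

candidates = all_subsets(p)                  # here the candidate set is the full power set
A_c = correct_models(candidates, alpha_c)

print(sorted(alpha_c))                       # [1, 3]
print([sorted(a) for a in A_c])              # [[1, 3], [1, 2, 3]]
```

With these values the smallest correct model contains variables 1 and 3, and the only other correct model is the full model.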

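The geostatistical regression model corresponding to $\alpha \in \mathcal{A}$ can be written in matrix form as

$$Z = X(\alpha)\beta(\alpha) + \eta + \epsilon, \qquad (3.1)$$

where $\eta \equiv (\eta(s_1), \ldots, \eta(s_n))' \sim N(0, \Sigma_\eta)$ and $\epsilon \sim N(0, \sigma_\epsilon^2 I)$. Hence the mean and the variance of $Z$ under model $\alpha \in \mathcal{A}$ are $\mu(\alpha) = X(\alpha)\beta(\alpha)$ and

$$\Sigma(\theta) = \Sigma_\eta + \sigma_\epsilon^2 I, \qquad (3.2)$$

where $\theta$ is the covariance parameter vector associated with $\mathrm{var}(Z)$.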

3.1 Loss Functions

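We consider two loss functions: the Kullback-Leibler (KL) loss function and the squared error loss function. First, for model $\alpha$ given in (3.1), the KL loss function is given by

$$\begin{aligned}
L^{KL}(\alpha; \theta) &= \int_{Y \in \mathbb{R}^n} f(Y; \mu, \Sigma(\theta_0)) \log \frac{f(Y; \mu, \Sigma(\theta_0))}{f(Y; \hat{\mu}(\alpha; \theta), \Sigma(\theta))}\, dY \\
&= \frac{1}{2}\log\det(\Sigma(\theta)) - \frac{1}{2}\log\det(\Sigma(\theta_0)) + \frac{1}{2}\mathrm{tr}\big(\Sigma(\theta_0)\Sigma^{-1}(\theta)\big) - \frac{n}{2} \\
&\quad + \frac{1}{2}(\mu - \hat{\mu}(\alpha; \theta))'\Sigma^{-1}(\theta)(\mu - \hat{\mu}(\alpha; \theta)), \qquad (3.3)
\end{aligned}$$

where $\mu = E(Z)$ is the true mean vector, $\theta_0$ is the true covariance parameter vector,

$$\hat{\mu}(\alpha; \theta) = X(\alpha)\hat{\beta}(\alpha; \theta), \qquad (3.4)$$

$$\hat{\beta}(\alpha; \theta) = \big(X(\alpha)'\Sigma^{-1}(\theta)X(\alpha)\big)^{-1}X(\alpha)'\Sigma^{-1}(\theta)Z,$$

and $f(\cdot; \mu, \Sigma)$ is the Gaussian density function defined in (2.8). Now, let

$$M(\alpha; \theta) = X(\alpha)\big(X(\alpha)'\Sigma^{-1}(\theta)X(\alpha)\big)^{-1}X(\alpha)'\Sigma^{-1}(\theta), \qquad (3.5)$$

$$A(\alpha; \theta) = I - M(\alpha; \theta). \qquad (3.6)$$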
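The closed form in (3.3) together with (3.4) and (3.5) translates into a few lines of linear algebra. The sketch below is one possible way to evaluate it, assuming an exponential $\Sigma_\eta$ on a regular design in $[0, 1]$ and hypothetical parameter values; the helper names `gls_projection`, `kl_loss` and `Sigma_of` are illustrative and do not come from the text.

```python
import numpy as np

def gls_projection(X, Sigma):
    """M = X (X' Sigma^{-1} X)^{-1} X' Sigma^{-1}, as in (3.5)."""
    SiX = np.linalg.solve(Sigma, X)                   # Sigma^{-1} X
    return X @ np.linalg.solve(X.T @ SiX, SiX.T)

def kl_loss(mu, mu_hat, Sigma0, Sigma):
    """Closed form (3.3): KL divergence from N(mu, Sigma0) to N(mu_hat, Sigma)."""
    n = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)
    _, logdet0 = np.linalg.slogdet(Sigma0)
    trace_term = np.trace(np.linalg.solve(Sigma, Sigma0))
    r = mu - mu_hat
    return 0.5 * (logdet - logdet0 + trace_term - n + r @ np.linalg.solve(Sigma, r))

rng = np.random.default_rng(1)
n = 50
s = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), s])                  # intercept plus one covariate
mu = X @ np.array([1.0, 2.0])                         # hypothetical true mean

def Sigma_of(sigma2_eta, kappa_eta, sigma2_eps):
    """Sigma(theta) = Sigma_eta + sigma2_eps * I with an assumed exponential Sigma_eta."""
    h = np.abs(s[:, None] - s[None, :])
    return sigma2_eta * np.exp(-kappa_eta * h) + sigma2_eps * np.eye(n)

Sigma0 = Sigma_of(1.0, 3.0, 0.5)                      # plays the role of theta_0
Z = mu + np.linalg.cholesky(Sigma0) @ rng.standard_normal(n)

mu_hat0 = gls_projection(X, Sigma0) @ Z
loss_true_theta = kl_loss(mu, mu_hat0, Sigma0, Sigma0)
r0 = mu - mu_hat0
half_quad = 0.5 * r0 @ np.linalg.solve(Sigma0, r0)    # quadratic term alone

Sigma1 = Sigma_of(2.0, 1.0, 0.2)                      # a misspecified theta
loss_wrong_theta = kl_loss(mu, gls_projection(X, Sigma1) @ Z, Sigma0, Sigma1)
print(loss_true_theta, half_quad, loss_wrong_theta)
```

When the loss is evaluated at the true covariance parameters, the determinant and trace terms cancel and only the quadratic term remains, which is why `loss_true_theta` and `half_quad` agree up to rounding.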

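Note that when $\theta = \theta_0$, $L^{KL}(\alpha; \theta)$ in (3.3) reduces to a simpler form:

$$L^{KL}(\alpha) \equiv L^{KL}(\alpha; \theta_0) = \frac{1}{2}(\mu - \hat{\mu}(\alpha))'\Sigma^{-1}(\mu - \hat{\mu}(\alpha)), \qquad (3.7)$$

where $\hat{\mu}(\alpha; \theta_0)$ and $\Sigma(\theta_0)$ are written as $\hat{\mu}(\alpha)$ and $\Sigma$ to simplify the notation. Since $\mu - \hat{\mu}(\alpha) = A(\alpha)\mu - M(\alpha)(\eta + \epsilon)$ and $A(\alpha)'\Sigma^{-1}M(\alpha) = 0$, we can rewrite (3.7) as

$$L^{KL}(\alpha) = \frac{1}{2}\mu'A(\alpha)'\Sigma^{-1}A(\alpha)\mu + \frac{1}{2}(\eta + \epsilon)'M(\alpha)'\Sigma^{-1}M(\alpha)(\eta + \epsilon), \qquad (3.8)$$

where $A(\alpha; \theta_0)$ and $M(\alpha; \theta_0)$ are also simplified as $A(\alpha)$ and $M(\alpha)$. Clearly, the first term $\mu'A(\alpha)'\Sigma^{-1}A(\alpha)\mu$ on the right-hand side of (3.8) vanishes when $\alpha \in \mathcal{A}^c$. Thus we have the following lemma.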

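Lemma 1 Consider a class of models given by (3.1). Let $L^{KL}(\alpha)$ be the KL loss for model $\alpha$ defined in (3.7). Then

$$E(L^{KL}(\alpha)) = \frac{1}{2}\mu'A(\alpha)'\Sigma^{-1}A(\alpha)\mu + \frac{p(\alpha)}{2}; \quad \alpha \in \mathcal{A}, \qquad (3.9)$$

where $A(\alpha)$ is defined in (3.6). In particular, $E(L^{KL}(\alpha)) = p(\alpha)/2$, for $\alpha \in \mathcal{A}^c$.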
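Lemma 1 can be checked numerically on a toy example. The sketch below evaluates the right-hand side of (3.9) for an underfitted, a smallest correct and an overfitted model, and compares the overfitted case with a Monte Carlo average of the loss in (3.7). The design, the covariance parameters and the helper names are assumptions made for illustration, and $p(\alpha)$ is taken to be the number of columns of $X(\alpha)$, intercept included.

```python
import numpy as np

def gls_projection(X, Sigma):
    """M(alpha) = X (X' Sigma^{-1} X)^{-1} X' Sigma^{-1}, as in (3.5)."""
    SiX = np.linalg.solve(Sigma, X)
    return X @ np.linalg.solve(X.T @ SiX, SiX.T)

def expected_kl(X_alpha, mu, Sigma):
    """Right-hand side of (3.9) at theta = theta_0."""
    r = mu - gls_projection(X_alpha, Sigma) @ mu      # A(alpha) mu
    return 0.5 * r @ np.linalg.solve(Sigma, r) + 0.5 * X_alpha.shape[1]

rng = np.random.default_rng(2)
n, n_rep = 40, 20_000
s = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), s, s**2, rng.standard_normal(n)])
mu = X[:, :2] @ np.array([1.0, 2.0])                  # true mean uses columns 0 and 1
Sigma = np.exp(-3.0 * np.abs(s[:, None] - s[None, :])) + 0.5 * np.eye(n)

models = {"under": [0], "alpha_c": [0, 1], "over": [0, 1, 2, 3]}
risk = {name: expected_kl(X[:, cols], mu, Sigma) for name, cols in models.items()}

# Monte Carlo average of the loss (3.7) for the overfitted model.
M = gls_projection(X[:, models["over"]], Sigma)
E = np.linalg.cholesky(Sigma) @ rng.standard_normal((n, n_rep))  # draws of eta + eps
R = (mu - M @ mu)[:, None] - M @ E                    # mu - mu_hat(alpha), one column per draw
mc_over = float(np.mean(0.5 * np.sum(R * np.linalg.solve(Sigma, R), axis=0)))

print(risk, mc_over)
```

For the two correct models the first term of (3.9) is zero up to rounding, so the expected loss equals half the number of fitted coefficients, while the underfitted model carries an additional positive bias term.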

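Lemma 2 Consider a class of models given by (3.1). Let $L^{KL}(\alpha)$ be the KL loss for model $\alpha$ defined in (3.7). Then

$$\lim_{n\to\infty} P\Big(\alpha^c = \arg\min_{\alpha \in \mathcal{A}^c} L^{KL}(\alpha)\Big) = 1, \qquad (3.10)$$

and

$$\alpha^c = \arg\min_{\alpha \in \mathcal{A}^c} E(L^{KL}(\alpha)). \qquad (3.11)$$

In addition, if $\alpha^c$ is fixed, and

$$\lim_{n\to\infty} \inf_{\alpha \in \mathcal{A} \setminus \mathcal{A}^c} \mu'A(\alpha)'\Sigma^{-1}A(\alpha)\mu = \infty, \qquad (3.12)$$

where $A(\alpha)$ is defined in (3.6), then

$$\lim_{n\to\infty} P\Big(\alpha^c = \arg\min_{\alpha \in \mathcal{A}} L^{KL}(\alpha)\Big) = 1. \qquad (3.13)$$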

In general, (3.12) is satisfied under the increasing domain asymptotic framework. However, under the fixed domain asymptotic framework, it may or may not be satisfied; see Theorem 9 in Section 5.2 and Theorem 12 in Section 5.3, for which (3.12) holds, and Theorems 5 and 6 in Section 5.1, for which (3.12) fails. In fact, as shown in Theorems 5 and 6, the smallest correct model $\alpha^c$ need not have the smallest KL loss under the fixed domain asymptotic framework. In other words, (3.13) is not always satisfied.

The other loss function we consider in this thesis is the squared error loss, which is commonly used in geostatistics, particularly for prediction purposes:

$$L(\alpha) = \|\hat{S}(\alpha) - S\|^2, \qquad (3.14)$$

where $\hat{S}(\alpha)$ is a generic predictor of $S$ based on model $\alpha \in \mathcal{A}$. Throughout the thesis, we consider the universal kriging predictor of $S$ in (2.7) unless indicated otherwise. For $\theta = \theta_0$, the universal kriging predictor based on model $\alpha$ can be written as

$$\hat{S}(\alpha) = H(\alpha)Z, \qquad (3.15)$$

where

$$H(\alpha) \equiv M(\alpha) + \Sigma_\eta\Sigma^{-1}A(\alpha), \qquad (3.16)$$

with $M(\alpha)$ and $A(\alpha)$ defined in (3.5) and (3.6), respectively. Then the corresponding risk can be decomposed into the following:

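$$\begin{aligned}
E(L(\alpha)) &= E\|S - E(S|Z) - \hat{S}(\alpha) + E(S|Z)\|^2 \\
&= E\|\hat{S}(\alpha) - E(S|Z)\|^2 - 2E\big((\hat{S}(\alpha) - E(S|Z))'(S - E(S|Z))\big) + E\|S - E(S|Z)\|^2 \\
&= E\|\hat{S}(\alpha) - E(S|Z)\|^2 + E\|S - E(S|Z)\|^2,
\end{aligned}$$

which is bounded below by $E\|S - E(S|Z)\|^2$, a quantity that does not depend on $\alpha \in \mathcal{A}$. The cross term vanishes because $\hat{S}(\alpha) - E(S|Z)$ is a function of $Z$, while $S - E(S|Z)$ has conditional mean zero given $Z$. The following lemma provides more details on the decomposition of $E(L(\alpha))$, which are useful in establishing asymptotic properties concerning the squared error loss.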

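Lemma 3 Consider a class of models given by (3.1). Let $\hat{S}(\alpha)$ be the UK predictor of $S$ given by (3.15) and $L(\alpha)$ be the corresponding squared error loss defined in (3.14). Then

$$\begin{aligned}
E(L(\alpha)) &= E\|\hat{S}(\alpha) - E(S|Z)\|^2 + E\|S - E(S|Z)\|^2 \\
&= R_1(\alpha) + R_2(\alpha) + \sigma_\epsilon^2\,\mathrm{tr}(\Sigma_\eta\Sigma^{-1}), \qquad (3.17)
\end{aligned}$$

where $E\|\hat{S}(\alpha) - E(S|Z)\|^2 = R_1(\alpha) + R_2(\alpha)$,

$$R_1(\alpha) = \sigma_\epsilon^4\, \mu'A(\alpha)'\Sigma^{-2}A(\alpha)\mu,$$

$$R_2(\alpha) = \sigma_\epsilon^4\, \mathrm{tr}(\Sigma^{-1}M(\alpha)), \qquad (3.18)$$

and $M(\alpha) = M(\alpha; \theta_0)$ is defined in (3.5).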
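The predictor in (3.15) and (3.16) and the decomposition in Lemma 3 can be compared against a simulation. The sketch below is a rough check under stated assumptions: $S$ is taken to be $\mu + \eta$ at the observation locations (the definition in (2.7) is not reproduced here), $\Sigma_\eta$ is exponential with illustrative parameters, and the fitted model is deliberately underfitted so that $R_1(\alpha) > 0$. The helper names are hypothetical.

```python
import numpy as np

def gls_projection(X, Sigma):
    """M(alpha) = X (X' Sigma^{-1} X)^{-1} X' Sigma^{-1}, as in (3.5)."""
    SiX = np.linalg.solve(Sigma, X)
    return X @ np.linalg.solve(X.T @ SiX, SiX.T)

def uk_hat_matrix(X_alpha, Sigma_eta, sigma2_eps):
    """H(alpha) = M(alpha) + Sigma_eta Sigma^{-1} A(alpha), as in (3.16)."""
    n = Sigma_eta.shape[0]
    Sigma = Sigma_eta + sigma2_eps * np.eye(n)
    M = gls_projection(X_alpha, Sigma)
    return M + Sigma_eta @ np.linalg.solve(Sigma, np.eye(n) - M)

def risk_lemma3(X_alpha, mu, Sigma_eta, sigma2_eps):
    """R1, R2 and the optimal prediction error from (3.17) and (3.18)."""
    n = Sigma_eta.shape[0]
    Sigma_inv = np.linalg.inv(Sigma_eta + sigma2_eps * np.eye(n))
    M = gls_projection(X_alpha, Sigma_eta + sigma2_eps * np.eye(n))
    v = Sigma_inv @ (mu - M @ mu)                     # Sigma^{-1} A(alpha) mu
    R1 = sigma2_eps**2 * float(v @ v)
    R2 = sigma2_eps**2 * float(np.trace(Sigma_inv @ M))
    opt = sigma2_eps * float(np.trace(Sigma_eta @ Sigma_inv))
    return R1, R2, opt

rng = np.random.default_rng(3)
n, n_rep, sigma2_eps = 30, 50_000, 0.5
s = np.linspace(0.0, 1.0, n)
Sigma_eta = np.exp(-3.0 * np.abs(s[:, None] - s[None, :]))  # assumed exponential covariance
mu = 1.0 + 2.0 * s                                    # hypothetical true mean
X_alpha = np.ones((n, 1))                             # underfitted model: intercept only

H = uk_hat_matrix(X_alpha, Sigma_eta, sigma2_eps)
R1, R2, opt = risk_lemma3(X_alpha, mu, Sigma_eta, sigma2_eps)
exact_risk = R1 + R2 + opt

# Monte Carlo estimate of E||S_hat(alpha) - S||^2 with S = mu + eta (assumption).
eta = np.linalg.cholesky(Sigma_eta) @ rng.standard_normal((n, n_rep))
S = mu[:, None] + eta
Z = S + np.sqrt(sigma2_eps) * rng.standard_normal((n, n_rep))
mc_risk = float(np.mean(np.sum((H @ Z - S) ** 2, axis=0)))

print(f"R1={R1:.4f} R2={R2:.4f} optimal={opt:.4f} exact={exact_risk:.4f} mc={mc_risk:.4f}")
```

The Monte Carlo average should agree with $R_1(\alpha) + R_2(\alpha) + \sigma_\epsilon^2\,\mathrm{tr}(\Sigma_\eta\Sigma^{-1})$ up to simulation error. The identity $H(\alpha)X(\alpha) = X(\alpha)$, which follows from (3.16), gives a quick sanity check on the hat matrix.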

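Note that the term $R_1(\alpha)$ corresponds to the model misspecification error, which is smaller for a larger model $\alpha$; in particular, $R_1(\alpha) = 0$ for $\alpha \in \mathcal{A}^c$. The term $R_2(\alpha)$ corresponds to the estimation error, which generally increases with $p(\alpha)$ and is bounded by $\sigma_\epsilon^2 p(\alpha)$. Indeed, $\Sigma = \Sigma_\eta + \sigma_\epsilon^2 I \ge \sigma_\epsilon^2 I$ implies $\sigma_\epsilon^2\Sigma^{-2} \le \Sigma^{-1}$ in the positive semidefinite ordering, and hence

$$\begin{aligned}
\sigma_\epsilon^2\,\mathrm{tr}(\Sigma^{-1}M(\alpha)) &= \sigma_\epsilon^2\,\mathrm{tr}\big(\Sigma^{-1}X(\alpha)(X(\alpha)'\Sigma^{-1}X(\alpha))^{-1}X(\alpha)'\Sigma^{-1}\big) \\
&= \mathrm{tr}\big((X(\alpha)'\Sigma^{-1}X(\alpha))^{-1}X(\alpha)'(\sigma_\epsilon^2\Sigma^{-2})X(\alpha)\big) \\
&\le \mathrm{tr}\big((X(\alpha)'\Sigma^{-1}X(\alpha))^{-1}X(\alpha)'\Sigma^{-1}X(\alpha)\big) \\
&= \mathrm{tr}(I_{p(\alpha)}) = p(\alpha).
\end{aligned}$$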

In addition, the term $E\|S - E(S|Z)\|^2 = \sigma_\epsilon^2\,\mathrm{tr}(\Sigma_\eta\Sigma^{-1})$ in (3.17) corresponds to the optimal mean squared prediction error, which provides a lower bound for $E(L(\alpha))$.

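In general,

$$\lim_{n\to\infty} \frac{\sigma_\epsilon^2\,\mathrm{tr}(\Sigma_\eta\Sigma^{-1})}{R_2(\alpha)} = \infty, \quad \text{for } \alpha \in \mathcal{A}^c.$$

It follows from (3.17) that

$$\lim_{n\to\infty} \frac{E(L(\alpha))}{E(L(\alpha^c))} = 1, \quad \text{for } \alpha \in \mathcal{A}^c.$$

In contrast, from (3.9),

$$\lim_{n\to\infty} \frac{E(L^{KL}(\alpha))}{E(L^{KL}(\alpha^c))} > 1, \quad \text{for } \alpha \in \mathcal{A}^c \setminus \{\alpha^c\}.$$

Therefore, the KL loss is preferable for selecting $\alpha^c$ among $\alpha \in \mathcal{A}^c$, since the squared error risk does not asymptotically distinguish between correct models.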
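The contrast between the two risk ratios can be seen numerically. The sketch below computes $R_2(\alpha)$, the bound $\sigma_\epsilon^2 p(\alpha)$ and the optimal prediction error for two nested correct models on increasingly dense regular designs in $[0, 1]$. It is only an illustration under assumed settings: an exponential $\Sigma_\eta$, polynomial covariates, illustrative parameter values, and $p(\alpha)$ counted as the number of columns of $X(\alpha)$ including the intercept.

```python
import numpy as np

def risk_terms(X_alpha, Sigma_eta, sigma2_eps):
    """R2(alpha) from (3.18) and the optimal error sigma2_eps * tr(Sigma_eta Sigma^{-1})."""
    n = Sigma_eta.shape[0]
    Sigma_inv = np.linalg.inv(Sigma_eta + sigma2_eps * np.eye(n))
    SiX = Sigma_inv @ X_alpha
    # tr(Sigma^{-1} M) = tr((X' Sigma^{-1} X)^{-1} X' Sigma^{-2} X)
    R2 = sigma2_eps**2 * float(np.trace(np.linalg.solve(X_alpha.T @ SiX, SiX.T @ SiX)))
    opt = sigma2_eps * float(np.trace(Sigma_eta @ Sigma_inv))
    return R2, opt

sigma2_eps = 0.5
p_c, p_big = 2, 4                                     # sizes of alpha^c and a larger correct model
kl_ratio = (p_big / 2) / (p_c / 2)                    # ratio of expected KL losses from (3.9)
results = {}
for n in (50, 200, 800):
    s = np.linspace(0.0, 1.0, n)
    Sigma_eta = np.exp(-3.0 * np.abs(s[:, None] - s[None, :]))
    X = np.column_stack([np.ones(n), s, s**2, s**3])
    R2_c, opt = risk_terms(X[:, :p_c], Sigma_eta, sigma2_eps)
    R2_big, _ = risk_terms(X, Sigma_eta, sigma2_eps)
    sq_ratio = (R2_big + opt) / (R2_c + opt)          # R1 = 0 for correct models
    results[n] = (R2_c, R2_big, opt, sq_ratio)
    print(f"n={n:4d} R2_c={R2_c:.3f} R2_big={R2_big:.3f} "
          f"optimal={opt:.2f} sq_ratio={sq_ratio:.4f} kl_ratio={kl_ratio:.1f}")
```

In this setting the optimal prediction error grows with $n$ while $R_2(\alpha)$ stays below $\sigma_\epsilon^2 p(\alpha)$, so the squared error risk ratio approaches 1, whereas the KL risk ratio stays at $p(\alpha)/p(\alpha^c)$.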
