
Simplified Causal Relationships Between Two Groups of Multivariate Time Series


Academic year: 2021


National Science Council, Executive Yuan: Research Project Final Report


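Project type: Individual project
Project number: NSC 101-2118-M-004-002-
Project period: August 1, 2012 to September 30, 2013 (ROC years 101 to 102)
Host institution: Department of Statistics, National Chengchi University
Principal investigator: Ying-Chao Hung (洪英超)
Project participants (master's students, part-time assistants): 陳韋成, 陳寬旻, 張欣惠, 林建佑
Public disclosure: this project involves patents or other intellectual property rights; open to public query after 1 year
Date: November 22, 2013 (ROC year 102)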


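Chinese abstract: This project investigates the variable selection problem that arises in validating the causal relationship between two groups of multivariate time series. The main idea is to use the Vector Autoregression (VAR) model to extract the “important variables” from the two groups of multivariate time series and thereby construct a simplified (and new) causal relationship. When the parameters of the VAR model are known, we prove that the solution to this variable selection problem can be characterized completely. When the parameters are unknown, we introduce a statistical hypothesis testing procedure to estimate (or approximate) the solution. Chinese keywords: causal relationship, vector autoregressive process, significant variables, forecasting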

English abstract: In this project we investigate a variable selection problem in the validation of the causal relationship between two groups of multivariate time series data. By utilizing the Vector Autoregression (VAR) model, we show how to extract “significant variables” in both groups of time series data so that a “trimmed causal relationship” can be presented based on precedence and predictability. When the parameters of the VAR model are known, we show that explicit conditions for solving this variable selection problem can be obtained; when the parameters are unknown, a statistical hypothesis testing procedure is used to approximate the solution.

English keywords: Granger causality, Vector Autoregressive Process, significant variables, forecasting


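National Science Council, Executive Yuan: Funded Research Project

□ Interim progress report
■ Final report

(Project title) Simplified Causal Relationships Between Two Groups of Multivariate Time Series

Project type: ■ Individual project □ Integrated project

Project number: NSC 101-2118-M-004-002-

Project period: August 1, 2012 to September 30, 2013 (ROC years 101 to 102)

Host institution and department: Department of Statistics, National Chengchi University

Principal investigator: Ying-Chao Hung (洪英超)

Co-principal investigator:

Project participants: 陳韋成, 陳寬旻, 張欣惠, 林建佑

In addition to the final report, this project includes the following overseas reports, _0_ in total:

□ Report on off-site research

□ Report on attendance at an international academic conference

□ Overseas research report for an international collaborative research project

Disclosure: except for controlled projects and the cases listed below, the report is immediately open to public query

□ Involves patents or other intellectual property rights; open to public query after □ one year □ two years

November 2013 (ROC year 102)

Appendix 1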


Report Content

(1) Introduction and Literature Review

Comput Stat DOI 10.1007/s00180-012-0351-z ORIGINAL PAPER

Extracting informative variables in the validation of two-group causal relationship

Ying-Chao Hung · Neng-Fang Tseng

Y.-C. Hung: Department of Statistics, National Chengchi University, No. 64, Sec. 2, ZhiNan Rd., Wenshan District, Taipei 11605, Taiwan. e-mail: [email protected]

N.-F. Tseng (corresponding author): Department of Mathematical Statistics and Actuarial Science, Aletheia University, 32 Chen-Li Street, Tamsui, Taipei 25103, Taiwan. e-mail: [email protected]

Received: 14 December 2011 / Accepted: 13 July 2012 © Springer-Verlag 2012

Abstract The validation of the causal relationship between two groups of multivariate time series data often requires the precedence knowledge of all variables. However, in practice one finds that some variables may be negligible in describing the underlying causal structure. In this article we provide an explicit definition of “non-informative variables” in a two-group causal relationship and introduce various automatic computer-search algorithms that can be utilized to extract informative variables based on a hypothesis testing procedure. The result allows us to represent a simplified causal relationship by using the minimum possible information on the two groups of variables.

Keywords Causal relationship · Vector autoregression model · Informative variables · Modified Wald test · Automatic computer-search algorithm

1 Introduction

Over the years, the causality system described by a multivariate time series process has been one of the most flexible and popular statistical tools for measuring the dynamic relationships between groups of variables in economics, finance, medicine, science, and engineering. The study of causal relationships dates back to the work of Granger (1969), wherein the Vector Autoregression (VAR) model, a generalization of the univariate AR model, was used to identify the “causality” between two groups of time series data based on precedence and predictability. Since then, a fairly rich literature has extended this work. Some notable contributions are: Granger (1980) proposed a statistical hypothesis testing procedure to validate the bivariate causal relationship; Osborn (1984) discussed “Unidirectional Granger Causality” based on the ARMA model and developed it into a statistical hypothesis testing procedure; Geweke (1982, 1984) considered measures of linear dependence and feedback between multiple time series and provided a comprehensive literature survey of Granger causality; Boudjellaba et al. (1992) tested causality between two vectors in multivariate autoregressive moving average models; Granger and Lin (1995) discussed the measurement of causality using the spectral decomposition based on the Vector Error Correction Model (VECM); Mosconi and Giannini (1992) investigated Granger causality based on a non-stationary VAR model; Roebroeck et al. (2005) used Granger causality mapping (GCM) to explore directed influences between neuronal populations in fMRI data; Hacker and Hatemi-J (2006) developed a method that is not sensitive to deviations from the assumption that the error term is normally distributed; Fujita et al. (2007) proposed an improved VAR model (called DVAR) to estimate time-varying gene regulatory networks based on gene expression profiles obtained from microarray experiments; and Haufe et al. (2010) estimated causal interactions in multivariate time series using the VAR model, to name just a few.

The study of causal relationships usually includes all variable information in the analysis. However, in many practical situations one finds that some variables are not particularly informative and can mislead the interpretation of the underlying causal structure. The work by Hsiao (1982) is closely related to this concept. He introduced three different types of causal relationships (called direct, indirect, and spurious causality) by reducing the information set in a three-variate time series model. However, when the number of variables becomes large, it is much harder to characterize all the causal patterns due to model complexity. To overcome this problem, some graphical techniques have been successfully developed to identify and visualize the causal relationships between the components of multivariate time series data. Readers can refer to the works by Koster (1996, 1999), Lauritzen (1996, 2000), Pearl (1995, 2000), Whittaker (1990), and Arnold et al. (2007) for approaches of this type.

The goal of this study is to extract informative variables in the validation of the causal relationship between two groups of multivariate time series data. These extracted variables are important and useful in the sense that they allow us to forecast the future values of explicit variables by utilizing the minimum data information. The remainder of this paper is organized as follows. In Sect. 2, we introduce some background knowledge required for defining and identifying informative variables in the validation of the two-group causal relationship. In Sect. 3, we show how to extract all informative variables by utilizing a hypothesis testing procedure (called the modified Wald test) and various automatic computer-search algorithms. In Sect. 4, the computer-search algorithms are illustrated on a real example. Some concluding remarks are given in Sect. 5.


(2) Research Objectives

The goal of this study is to extract informative variables in the validation of the causal relationship between two groups of multivariate time series data. These extracted variables are important and useful in the sense that they allow us to forecast the future values of explicit variables by utilizing the minimum data information.

(3) Research Methods



2 Background Knowledge

The notion of causality in multivariate time series data is often discussed in terms of the stationary pth-order Vector Autoregression model (denoted by VAR(p)):

$$W_t = b + \sum_{j=1}^{p} A_j W_{t-j} + a_t, \quad t = 1, \ldots, T, \tag{1}$$

where $b$ is a $(K \times 1)$ constant vector, $W_t = (W_{1,t}, W_{2,t}, \ldots, W_{K,t})'$ is a $(K \times 1)$ random vector, $A_j$ is a $(K \times K)$ coefficient matrix for all $j = 1, \ldots, p$, and $a_t$ is a $(K \times 1)$ error (or noise) vector satisfying (i) $E(a_t) = 0$; (ii) the covariance matrix $E(a_t a_t')$ is positive definite (thus non-singular); and (iii) $E(a_t a_{t-k}') = 0$ for any non-zero $k$. Dividing all the variables of interest into two groups $X_t$ and $Y_t$, we see that $W_t$ can be further represented as

$$W_t = \begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} + \sum_{j=1}^{p} \begin{pmatrix} A_{XX,j} & A_{XY,j} \\ A_{YX,j} & A_{YY,j} \end{pmatrix} \begin{pmatrix} X_{t-j} \\ Y_{t-j} \end{pmatrix} + \begin{pmatrix} a_{X,t} \\ a_{Y,t} \end{pmatrix}, \quad t = 1, \ldots, T, \tag{2}$$

where $X_t = (X_{1,t}, \ldots, X_{n,t})'$ and $Y_t = (Y_{1,t}, \ldots, Y_{m,t})'$ are $(n \times 1)$ and $(m \times 1)$ random vectors, $b_1$ and $b_2$ are $(n \times 1)$ and $(m \times 1)$ constant vectors, $A_{XX,j}$, $A_{XY,j}$, $A_{YX,j}$, and $A_{YY,j}$ are sub-matrices of $A_j$ with orders $(n \times n)$, $(n \times m)$, $(m \times n)$, and $(m \times m)$, respectively, and $a_{X,t}$ and $a_{Y,t}$ are $(n \times 1)$ and $(m \times 1)$ error vectors. The primary goal of the so-called “Granger causality” is to examine whether or not the time series $Y_t$ is useful in forecasting the time series $X_t$.
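The two-group structure of Eqs. (1)-(2) can be illustrated numerically. The following is a minimal sketch, not the authors' code: the function names `simulate_var` and `partition`, the coefficient values, the standard normal noise, and the zero start-up values are all assumptions made for illustration.

```python
import numpy as np

def simulate_var(A_list, b, T, rng):
    """Simulate W_t = b + sum_j A_j W_{t-j} + a_t (Eq. (1)) with N(0, I) noise."""
    p, K = len(A_list), b.shape[0]
    W = np.zeros((T + p, K))              # first p rows: zero start-up values (assumption)
    for t in range(p, T + p):
        W[t] = b + rng.standard_normal(K)
        for j, A in enumerate(A_list, start=1):
            W[t] += A @ W[t - j]
    return W[p:]                          # drop the start-up rows

def partition(A, n):
    """Split a (K x K) matrix into the blocks A_XX, A_XY, A_YX, A_YY of Eq. (2)."""
    return A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]

n, m = 2, 1                               # X_t has 2 components, Y_t has 1
A1 = np.array([[0.5, 0.1, 0.3],
               [0.0, 0.4, 0.0],
               [0.2, 0.0, 0.3]])          # illustrative values only
W = simulate_var([A1], np.zeros(n + m), T=200, rng=np.random.default_rng(0))
X, Y = W[:, :n], W[:, n:]                 # the two groups of series
A_XX, A_XY, A_YX, A_YY = partition(A1, n)
print(A_XY)                               # block linking lagged Y_t to X_t
```

The block `A_XY` is the only part of the coefficient matrix through which lagged values of $Y_t$ enter the equations for $X_t$, which is why the later results concentrate on it.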

Given any point in time $t$, let us consider the two information sets

$$\Omega_{XY} = \{X_{1,t}, \ldots, X_{n,t}, \ldots, X_{1,1}, \ldots, X_{n,1}, Y_{1,t}, \ldots, Y_{m,t}, \ldots, Y_{1,1}, \ldots, Y_{m,1}\}$$

and

$$\Omega_{X} = \{X_{1,t}, \ldots, X_{n,t}, \ldots, X_{1,1}, \ldots, X_{n,1}\}.$$

For any given future time $(t + h)$, we denote the best linear predictors of $X_{t+h}$ based on the information sets $\Omega_{XY}$ and $\Omega_{X}$ by

$$\hat{X}_t(h|\Omega_{XY}) = (\hat{X}_{1,t}(h|\Omega_{XY}), \ldots, \hat{X}_{n,t}(h|\Omega_{XY}))'$$

and

$$\hat{X}_t(h|\Omega_{X}) = (\hat{X}_{1,t}(h|\Omega_{X}), \ldots, \hat{X}_{n,t}(h|\Omega_{X}))',$$

respectively. The two-group causality (also known as a generalization of Granger causality) is defined as follows.


Definition 1 (Two-group Causality up to Horizon c)

Given any positive integer $c$, if $\hat{X}_t(h|\Omega_{X}) \neq \hat{X}_t(h|\Omega_{XY})$ for some $h \le c$, then we say that $Y_t$ causes $X_t$ up to horizon $c$ and denote it by $Y \underset{(c)}{\rightarrow} X$. On the other hand, if $\hat{X}_t(h|\Omega_{X}) = \hat{X}_t(h|\Omega_{XY})$ for all $h \le c$, then we say that $Y_t$ does not cause $X_t$ up to horizon $c$ and denote it by $Y \underset{(c)}{\nrightarrow} X$.

The following proposition is useful for identifying the causality/non-causality between $X_t$ and $Y_t$.

Proposition 1 Based on the model in Eqs. (1)-(2), for any positive integer $c$ we have that $Y \underset{(c)}{\nrightarrow} X$ if and only if $A_{XY,j} = 0_{n \times m}$ for all $j = 1, \ldots, p$.

Proof Since $Y \underset{(c)}{\nrightarrow} X$ is equivalent to $Y \underset{(\infty)}{\nrightarrow} X$ (see Dufour and Renault (1998), Proposition 2.3) and $Y \underset{(\infty)}{\nrightarrow} X$ if and only if $A_{XY,j} = 0_{n \times m}$ for all $j = 1, \ldots, p$ (see Lütkepohl (2005), Corollary 2.2.1), the result follows.

Proposition 1 indicates that the two-group causality based on the VAR model can be determined by examining the coefficient matrices $A_{XY,j}$. We next review some properties that are necessary for establishing the procedure of extracting informative variables in the later section.
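When the coefficient matrices are known, the condition in Proposition 1 reduces to checking that every block $A_{XY,j}$ is zero. A minimal sketch follows; the function name `y_does_not_cause_x`, the numerical tolerance `tol`, and the example matrices are assumptions for illustration. With estimated coefficients an exact zero check is not appropriate and a statistical test is needed instead.

```python
import numpy as np

def y_does_not_cause_x(A_list, n, tol=1e-12):
    """True if every block A_{XY,j} (rows 0..n-1, columns n..K-1) is zero up to tol."""
    return all(np.all(np.abs(A[:n, n:]) <= tol) for A in A_list)

# Hypothetical VAR(2) with n = 1 (one X variable) and m = 1 (one Y variable).
A1 = np.array([[0.5, 0.0],
               [0.3, 0.2]])
A2 = np.array([[0.1, 0.0],
               [0.0, 0.1]])
no_cause = y_does_not_cause_x([A1, A2], n=1)        # both A_XY blocks are zero

A2_alt = np.array([[0.1, 0.4],
                   [0.0, 0.1]])                     # lag-2 effect of Y on X
still_no_cause = y_does_not_cause_x([A1, A2_alt], n=1)
print(no_cause, still_no_cause)
```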

As a result of Definition 1, if $Y \underset{(c)}{\rightarrow} X$ then there exists at least one pair $(i, h) \in \{1, \ldots, n\} \times \{1, \ldots, c\}$ such that

$$E\left(\hat{X}_{i,t}(h|\Omega_{XY}) - X_{i,t+h}\right)^2 < E\left(\hat{X}_{i,t}(h|\Omega_{X}) - X_{i,t+h}\right)^2,$$

where $\hat{X}_{i,t}(h|\Omega_{XY})$ and $\hat{X}_{i,t}(h|\Omega_{X})$ are the $i$-th elements of $\hat{X}_t(h|\Omega_{XY})$ and $\hat{X}_t(h|\Omega_{X})$, respectively. We now show how to calculate $\hat{X}_t(h|\Omega_{XY})$. Based on Eq. (1), for any given time lag $h > 0$ we have that

$$W_{t+h} = \sum_{k=0}^{h-1} A_1^{(k)} (b + a_{t+h-k}) + \sum_{j=1}^{p} A_j^{(h)} W_{t+1-j}, \tag{3}$$

where $A_j^{(k)}$ is a matrix obtained from the recursive formula

$$A_j^{(k)} = \begin{cases} A_j, & k = 1, \\ A_{j+1}^{(k-1)} + A_1^{(k-1)} A_j, & k = 2, 3, \ldots, h, \end{cases} \tag{4}$$

and $j = 1, \ldots, p$ (with the conventions $A_1^{(0)} = I_K$ and $A_j^{(k)} = 0$ for $j > p$). Consider the following partition of the matrix $A_j^{(h)}$:

$$A_j^{(h)} = \begin{pmatrix} A_{XX,j}^{(h)} & A_{XY,j}^{(h)} \\ A_{YX,j}^{(h)} & A_{YY,j}^{(h)} \end{pmatrix},$$

where $A_{XX,j}^{(h)}$ and $A_{XY,j}^{(h)}$ are two sub-matrices with orders $(n \times n)$ and $(n \times m)$, respectively. Denoting the identity matrix of order $n$ by $I_n$, it is clear that $X_{t+h} = (I_n, 0_{n \times m}) W_{t+h}$. Based on the notation introduced above, the best linear predictor (in matrix form) is given by

$$\hat{X}_t(h|\Omega_{XY}) = b_{1,h} + \sum_{j=1}^{p} \left( A_{XX,j}^{(h)} X_{t+1-j} + A_{XY,j}^{(h)} Y_{t+1-j} \right), \tag{5}$$

where $b_{1,h} = (I_n, 0_{n \times m}) \sum_{k=0}^{h-1} A_1^{(k)} b$. Eq. (5) shows that the best linear predictor $\hat{X}_t(h|\Omega_{XY})$ relates to $Y_t$ merely through the coefficient matrices $A_{XY,j}^{(h)}$. This will serve as an important benchmark for the remainder of this study.
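The recursion in Eq. (4) is straightforward to implement. The sketch below assumes the convention $A_j^{(k)} = 0$ for $j > p$ and uses hypothetical names (`horizon_coefficients`, `companion`) and random illustrative coefficients. As a sanity check that is not part of the original text, it compares the result with the first block row of powers of the VAR companion matrix, which satisfies the same recursion.

```python
import numpy as np

def horizon_coefficients(A_list, c):
    """Return coef[h][j-1] = A_j^{(h)} for h = 1..c and j = 1..p, following Eq. (4)."""
    p, K = len(A_list), A_list[0].shape[0]
    zero = np.zeros((K, K))
    coef = {1: [A.copy() for A in A_list]}              # A_j^{(1)} = A_j
    for h in range(2, c + 1):
        prev = coef[h - 1]
        # A_j^{(h)} = A_{j+1}^{(h-1)} + A_1^{(h-1)} A_j, with A_{p+1}^{(h-1)} = 0
        coef[h] = [(prev[j] if j < p else zero) + prev[0] @ A_list[j - 1]
                   for j in range(1, p + 1)]
    return coef

def companion(A_list):
    """Companion matrix of a VAR(p); used here only to cross-check the recursion."""
    p, K = len(A_list), A_list[0].shape[0]
    F = np.zeros((K * p, K * p))
    F[:K, :] = np.hstack(A_list)
    F[K:, :-K] = np.eye(K * (p - 1))
    return F

rng = np.random.default_rng(1)
K, p, c = 3, 2, 4
A_list = [0.3 * rng.standard_normal((K, K)) for _ in range(p)]   # illustrative values
coef = horizon_coefficients(A_list, c)
F = companion(A_list)
max_gap = max(
    np.abs(np.hstack(coef[h]) - np.linalg.matrix_power(F, h)[:K, :]).max()
    for h in range(1, c + 1)
)
print(max_gap)   # discrepancy between the two computations
```

The block $A_{XY,j}^{(h)}$ of Eq. (5) is then `coef[h][j - 1][:n, n:]` for a chosen group size `n`.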

Note that Definition 1 focuses on the causal relationship between the two random vectors $X_t$ and $Y_t$. In particular, it explicitly defines whether or not $Y_t$ can improve the forecasting of $X_{t+h}$. However, by the preceding arguments, if $Y \underset{(c)}{\rightarrow} X$, then adding all variables in $Y_t$ to the information set is guaranteed to improve the forecasting of “some” variables in $X_t$, but not necessarily all. On the other hand, the forecasting of $X_{t+h}$ may be improved by utilizing the information of merely “some” variables in $Y_t$, not necessarily all. Therefore, our goal here is to provide a statistical procedure to extract those “informative variables” in both $X_t$ and $Y_t$. To do this, we first introduce the definition of “non-informative variables” in both $X_t$ and $Y_t$.

Definition 2 (Non-informative Variables in $X_t$ and $Y_t$)

Consider the VAR(p) model described in Eqs. (1)-(2) and assume that $Y \underset{(c)}{\rightarrow} X$ for some given integer $c > 0$.

(a) The variable $Y_{i,t}$ in $Y_t = (Y_{1,t}, \ldots, Y_{m,t})'$ is non-informative if

$$\hat{X}_t(h|\Omega_{XY}) = \hat{X}_t(h|\Omega_{XY_{-i}}) \quad \text{for all } h \le c, \tag{6}$$

where $\Omega_{XY_{-i}} = \Omega_{XY} \setminus \{Y_{i,t}, \ldots, Y_{i,1}\}$ refers to the reduced information set with the $i$-th variable in $Y_t$ excluded.

(b) The variable $X_{i,t}$ in $X_t = (X_{1,t}, \ldots, X_{n,t})'$ is non-informative if

$$\hat{X}_{i,t}(h|\Omega_{XY}) = \hat{X}_{i,t}(h|\Omega_{X}) \quad \text{for all } h \le c. \tag{7}$$

Definition 2 directly implies that, if the prediction of $X_{t+h}$ based on $\Omega_{XY}$ is the same as that based on the reduced information set $\Omega_{XY_{-i}}$, then the variable $Y_{i,t}$ can be excluded from the analysis (since it is non-informative in predicting $X_t$). Analogously, if the prediction of $X_{i,t+h}$ based on $\Omega_{XY}$ is the same as that based on $\Omega_{X}$, then the variable $X_{i,t}$ can be excluded from the analysis. The following two theorems provide useful guidelines for finding the non-informative (or informative) variables in both $X_t$ and $Y_t$.

Theorem 1 (Identification of Non-informative Variables in $Y_t$)

Consider the matrix $A_{XY,j}^{(h)}$ given in Eq. (5) and its column partition


$$A_{XY,j}^{(h)} = \left( A_{XY,j}^{(h)}(:, 1),\; A_{XY,j}^{(h)}(:, 2),\; \ldots,\; A_{XY,j}^{(h)}(:, m) \right), \tag{8}$$

where $A_{XY,j}^{(h)}(:, i)$ refers to the $i$-th column of $A_{XY,j}^{(h)}$. Then for any given $i \in \{1, \ldots, m\}$, $Y_{i,t}$ is non-informative if and only if $A_{XY,j}^{(h)}(:, i) = 0$ for all $(h, j) \in \{1, \ldots, c\} \times \{1, \ldots, p\}$.

Proof Although the proof is quite similar to the one shown by Dufour and Renault (1998), we sketch it here for the sake of completeness. Based on Eq. (5), $\hat{X}_t(h|\Omega_{XY})$ can be further represented as

$$\begin{aligned}
\hat{X}_t(h|\Omega_{XY}) &= b_{1,h} + \sum_{j=1}^{p} A_{XX,j}^{(h)} X_{t+1-j} + \sum_{l=1}^{m} \sum_{j=1}^{p} A_{XY,j}^{(h)}(:, l)\, Y_{l,t+1-j} \\
&= b_{1,h} + \sum_{j=1}^{p} A_{XX,j}^{(h)} X_{t+1-j} + \sum_{j=1}^{p} A_{XY,j}^{(h)}(:, i)\, Y_{i,t+1-j} + \sum_{l \neq i}^{m} \sum_{j=1}^{p} A_{XY,j}^{(h)}(:, l)\, Y_{l,t+1-j},
\end{aligned}$$

where the last equality is obtained by dividing the information set $\Omega_{Y}$ into $\{Y_{i,t}\}$ and $\Omega_{Y_{-i}}$. Thus, by treating $\Omega_{Y_{-i}}$ as the set of “auxiliary variables”, we can conclude that $A_{XY,j}^{(h)}(:, i) = 0$ for all $(h, j) \in \{1, \ldots, c\} \times \{1, \ldots, p\}$ is the necessary and sufficient condition for $\hat{X}_t(h|\Omega_{XY}) = \hat{X}_t(h|\Omega_{XY_{-i}})$ (the result of Theorem 3.1 by Dufour and Renault (1998)). The result then follows.
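A small worked example may help show why the condition in Theorem 1 must hold for every horizon $h \le c$ and not only for $h = 1$. The VAR(1) below is hypothetical (the numerical values are chosen purely for illustration), with $n = 1$ variable $X_1$ and $m = 2$ variables $Y_1, Y_2$, and it uses the convention $A_2^{(1)} = 0$ in Eq. (4).

```latex
% Hypothetical VAR(1); rows and columns are ordered as (X_1, Y_1, Y_2).
A_1 = \begin{pmatrix} 0.5 & 0 & 0.3 \\ 0 & 0.4 & 0 \\ 0 & 0.2 & 0.1 \end{pmatrix},
\qquad
A^{(1)}_{XY,1} = \begin{pmatrix} 0 & 0.3 \end{pmatrix}.

% Eq. (4) with p = 1 gives A_1^{(2)} = A_1 A_1, so the XY block at horizon 2 is
A^{(2)}_{XY,1}
  = A_{XX,1} A_{XY,1} + A_{XY,1} A_{YY,1}
  = 0.5 \begin{pmatrix} 0 & 0.3 \end{pmatrix}
    + \begin{pmatrix} 0 & 0.3 \end{pmatrix}
      \begin{pmatrix} 0.4 & 0 \\ 0.2 & 0.1 \end{pmatrix}
  = \begin{pmatrix} 0.06 & 0.18 \end{pmatrix}.
```

With $c = 1$, the first column of $A^{(1)}_{XY,1}$ is zero, so $Y_{1,t}$ is non-informative. With $c = 2$, the first column of $A^{(2)}_{XY,1}$ equals $0.06 \neq 0$, so $Y_{1,t}$ is informative: it affects $X_1$ indirectly, through its effect on $Y_2$ one step earlier.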

Theorem 2 (Identification of Non-informative Variables in $X_t$)

Consider the matrix $A_{XY,j}^{(h)}$ given in Eq. (5) and its row partition

$$A_{XY,j}^{(h)} = \begin{pmatrix} A_{XY,j}^{(h)}(1, :) \\ A_{XY,j}^{(h)}(2, :) \\ \vdots \\ A_{XY,j}^{(h)}(n, :) \end{pmatrix}, \tag{9}$$

where $A_{XY,j}^{(h)}(i, :)$ refers to the $i$-th row of $A_{XY,j}^{(h)}$. Then for any given $i \in \{1, \ldots, n\}$, $X_{i,t}$ is non-informative if and only if $A_{XY,j}^{(h)}(i, :) = 0$ for all $(h, j) \in \{1, \ldots, c\} \times \{1, \ldots, p\}$.

Proof The proof is quite similar to that of Theorem 1. By expanding the formula given in Eq. (5), we have that

$$\begin{aligned}
\hat{X}_{i,t}(h|\Omega_{XY}) &= b_{1,h}(i) + \sum_{j=1}^{p} A_{XX,j}^{(h)}(i, :)\, X_{t+1-j} + \sum_{j=1}^{p} A_{XY,j}^{(h)}(i, :)\, Y_{t+1-j} \\
&= b_{1,h}(i) + \sum_{j=1}^{p} A_{XX,j}^{(h)}(i, i)\, X_{i,t+1-j} + \sum_{l \neq i}^{n} \sum_{j=1}^{p} A_{XX,j}^{(h)}(i, l)\, X_{l,t+1-j} + \sum_{j=1}^{p} A_{XY,j}^{(h)}(i, :)\, Y_{t+1-j},
\end{aligned}$$


Extracting Informative Variables in the Validation of Two-group Causal Relationship 7

where b1,h(i)refers to the i-th element of vector b1,h, and the last equality is obtained

by dividing the information set ΩX into {Xi,t} and ΩX−i. Thus, by treating ΩX−i as

the set of “auxiliary variables”, we can conclude that A(h)XY,j(i, :) = 0for all (h, j) ∈ {1, . . . , c} × {1, . . . , p} is the necessary and sufficient condition for ˆXi,t(h|ΩXY) =

ˆ

Xi,t(h|ΩX) (the result of Theorem 3.1 by Dufour and Renault (1998)). The result

then follows.
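When the VAR parameters are known, the conditions of Theorems 1 and 2 can be checked numerically. The following is a minimal sketch for the VAR(1) case only, where the $h$-step coefficient matrix is the $h$-th power of the one-step matrix (as in the three-variate example of Section 3.1). The function name `noninformative_indices`, the tolerance `tol`, and the coefficient values are illustrative assumptions, and the variables are assumed to be ordered as $(X_1, \ldots, X_n, Y_1, \ldots, Y_m)$.

```python
import numpy as np

def noninformative_indices(A, n, c, tol=1e-12):
    """For a VAR(1) with coefficient matrix A (variables ordered X first, then Y),
    return (y_non, x_non): zero-based indices of Y variables whose column, and of
    X variables whose row, of the XY block of A^h vanishes for every h = 1..c."""
    K = A.shape[0]
    m = K - n
    # XY block (rows: X equations, columns: lagged Y) of A^h for each horizon h
    blocks = [np.linalg.matrix_power(A, h)[:n, n:] for h in range(1, c + 1)]
    y_non = [i for i in range(m)
             if all(np.all(np.abs(b[:, i]) <= tol) for b in blocks)]
    x_non = [i for i in range(n)
             if all(np.all(np.abs(b[i, :]) <= tol) for b in blocks)]
    return y_non, x_non

# Hypothetical coefficients: Y1 does not enter the X1 equation directly,
# but it drives Y2, which in turn drives X1.
A = np.array([[0.5, 0.0, 0.3],
              [0.1, 0.4, 0.0],
              [0.0, 0.6, 0.2]])

y_non_c1, x_non_c1 = noninformative_indices(A, n=1, c=1)
y_non_c2, x_non_c2 = noninformative_indices(A, n=1, c=2)
```

In this example $Y_{1,t}$ looks non-informative at horizon 1 but becomes informative once horizon 2 is included, which is why the condition has to hold for every $h \in \{1, \ldots, c\}$.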

3 Extracting Informative Variables

Theorem 1 and Theorem 2 state that the informative variables for two-group causality can be explicitly identified by examining the row and column vectors of the coefficient matrix $A^{(h)}_{XY,j}$. However, in practice the parameters in $A^{(h)}_{XY,j}$ are usually unknown and need to be estimated. Therefore, to extract all informative variables one can resort to a study analogous to “model selection”. When the number of variables is large, some commonly used algorithms are stepwise, forward selection, and backward elimination. These algorithms involve a multi-stage procedure of variable selection and/or elimination that is executed based on the so-called modified Wald test proposed by Lütkepohl and Burda (1997). Before we proceed, let us first look at the following simple example, which illustrates how a desired modified Wald test is performed.

3.1 The Modified Wald Test

Let us consider the following three-variate VAR(1) process with

$$
\begin{pmatrix} X_{1,t} \\ Y_{1,t} \\ Y_{2,t} \end{pmatrix} =
\begin{pmatrix} A_{X_1X_1} & A_{X_1Y_1} & A_{X_1Y_2} \\ A_{Y_1X_1} & A_{Y_1Y_1} & A_{Y_1Y_2} \\ A_{Y_2X_1} & A_{Y_2Y_1} & A_{Y_2Y_2} \end{pmatrix}
\begin{pmatrix} X_{1,t-1} \\ Y_{1,t-1} \\ Y_{2,t-1} \end{pmatrix} + a_t.
$$

Given $c = 2$, suppose we would like to test whether or not $Y_{1,t}$ is an informative variable in the causal relation $Y \xrightarrow{(2)} X$. By Definition 2, we can test the null hypothesis

$$
H_0 : \hat{X}_t(h \mid \Omega_{XY}) = \hat{X}_t(h \mid \Omega_{XY_{-1}}) \quad \text{for } h = 1, 2. \tag{10}
$$

Based on the result of Theorem 1, if $A^{(1)}_{XY,1}(:, 1)$ and $A^{(2)}_{XY,1}(:, 1)$ are both close to zero, then the null hypothesis is not rejected and $Y_{1,t}$ is characterized as a non-informative variable; otherwise it is characterized as an informative variable. To perform a general test for any given $c$ and VAR($p$) model, we consider the matrix $A^{(h)} = (A^{(h)}_1, \ldots, A^{(h)}_p)$ for $h = 1, \ldots, c$ (recall that $A^{(h)}_j$ are the matrices defined in Eq. (4)) and the column vector

$$
\alpha = \begin{pmatrix} \mathrm{vec}(A^{(1)}) \\ \vdots \\ \mathrm{vec}(A^{(c)}) \end{pmatrix}.
$$

(4) Results and Discussion


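Here, “vec” is the column-stacking operator, which creates a column vector from a matrix by stacking its column vectors below one another. Thus, at each stage the hypotheses for testing whether or not a particular variable $Y_{i,t}$ (or $X_{i,t}$) is informative can be written in the form

$$
\begin{cases} H_0 : (I_c \otimes R)\,\alpha = 0 \\ H_a : (I_c \otimes R)\,\alpha \neq 0 \end{cases} \tag{11}
$$

where $\otimes$ is the Kronecker product, so that

$$
I_c \otimes R = \begin{pmatrix} 1 \cdot R & 0 \cdot R & \cdots & 0 \cdot R \\ 0 \cdot R & 1 \cdot R & \cdots & 0 \cdot R \\ \vdots & \vdots & \ddots & \vdots \\ 0 \cdot R & 0 \cdot R & \cdots & 1 \cdot R \end{pmatrix}_{cr \times cpK^2},
$$

and $R$ is a “designated” $(r \times pK^2)$ matrix that corresponds to the null hypothesis (here $r$ = # of variables in $X_t$ or $Y_t$ considered in the null hypothesis and $K = m + n$). To illustrate, the preceding example for the null hypothesis in Eq. (10) gives that

$$
A^{(h)} = \begin{pmatrix} A_{X_1X_1} & A_{X_1Y_1} & A_{X_1Y_2} \\ A_{Y_1X_1} & A_{Y_1Y_1} & A_{Y_1Y_2} \\ A_{Y_2X_1} & A_{Y_2Y_1} & A_{Y_2Y_2} \end{pmatrix}^{h} \text{ for } h = 1, 2, \quad \text{and} \quad \alpha = \begin{pmatrix} \mathrm{vec}(A^{(1)}) \\ \mathrm{vec}(A^{(2)}) \end{pmatrix}.
$$

To test if $Y_{1,t}$ is informative, we can simply choose $R = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$ so as to satisfy that

$$
(I_2 \otimes R)\,\alpha = \begin{pmatrix} R & 0 \\ 0 & R \end{pmatrix} \begin{pmatrix} \mathrm{vec}(A^{(1)}) \\ \mathrm{vec}(A^{(2)}) \end{pmatrix} = \begin{pmatrix} A^{(1)}_{XY,1}(:, 1) \\ A^{(2)}_{XY,1}(:, 1) \end{pmatrix}.
$$

As a result, testing the null hypothesis that $(I_2 \otimes R)\,\alpha = 0$ is then equivalent to testing

$$
\begin{pmatrix} A^{(1)}_{XY,1}(:, 1) \\ A^{(2)}_{XY,1}(:, 1) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
$$

Analogously, to test if $Y_{2,t}$ is informative, we can simply choose $R = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}$; while to test if $X_{1,t}$ is informative, we can simply choose

$$
R = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}.
$$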
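The selection matrices above can be checked numerically. The sketch below builds $\alpha$ for the three-variate VAR(1) example with $c = 2$ and applies $I_2 \otimes R$ to it. The coefficient values in `A` are hypothetical, and `vec` is a small helper written for this illustration using NumPy's column-major flattening.

```python
import numpy as np

def vec(M):
    """Column-stacking operator: stack the columns of M below one another."""
    return np.asarray(M, dtype=float).flatten(order="F")

# Hypothetical VAR(1) coefficients, variables ordered (X1, Y1, Y2)
A = np.array([[0.5, 0.0, 0.3],
              [0.1, 0.4, 0.0],
              [0.0, 0.6, 0.2]])
c = 2

# alpha stacks vec(A^(1)) and vec(A^(2)); for a VAR(1), A^(h) is the h-th power of A
alpha = np.concatenate([vec(np.linalg.matrix_power(A, h)) for h in range(1, c + 1)])

# R for testing Y1: picks the 4th entry of vec(A^(h)), i.e. the (X1, Y1) coefficient
R_y1 = np.zeros((1, 9))
R_y1[0, 3] = 1.0

# R for testing X1: picks the (X1, Y1) and (X1, Y2) coefficients
R_x1 = np.zeros((2, 9))
R_x1[0, 3] = 1.0
R_x1[1, 6] = 1.0

restricted_y1 = np.kron(np.eye(c), R_y1) @ alpha   # (A^(1)_XY(:,1), A^(2)_XY(:,1))
restricted_x1 = np.kron(np.eye(c), R_x1) @ alpha   # row 1 of A^(1)_XY, then of A^(2)_XY
```

Here `np.kron(np.eye(c), R)` has shape $(cr \times cpK^2)$, matching the dimensions stated for $I_c \otimes R$.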


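Note that once the hypothesis in Eq. (11) is specified, the modified Wald test statistic is given by

$$
\lambda^{mod}_{Wald} = T \left( (I_c \otimes R)\,\hat{\alpha} + \frac{\hat{w}}{\sqrt{T}} \right)' \left[ (I_c \otimes R)\, \hat{\Sigma}_{\hat{\alpha}}\, (I_c \otimes R)' + \lambda \hat{\Sigma}_{\hat{w}} \right]^{-1} \left( (I_c \otimes R)\,\hat{\alpha} + \frac{\hat{w}}{\sqrt{T}} \right), \tag{12}
$$

where $T$ is the number of data observations, $\hat{\alpha}$ is the least squares estimator of $\alpha$, $\hat{w}$ is a random vector (independent of $\hat{\alpha}$) drawn from a multivariate normal distribution $MN(0, \lambda \hat{\Sigma}_{\hat{w}})$, $\lambda$ is usually a small positive value (relative to $T$) chosen by the user,

$$
\hat{\Sigma}_{\hat{w}} = \begin{pmatrix} 0 & 0 \\ 0 & I_{c-1} \otimes \mathrm{diag}(R\, \hat{\Sigma}_{\hat{\beta}}\, R') \end{pmatrix},
$$

where $\hat{\beta}$ is the least squares estimator of $\beta = \mathrm{vec}(A^{(1)})$, $\hat{\Sigma}_{\hat{\beta}}/T$ is the estimated covariance matrix of $\hat{\beta}$,

$$
\hat{\Sigma}_{\hat{\alpha}} = \begin{pmatrix} I \\ \sum_{i=0}^{1} B^{1-i} \otimes J B^{i} J' \\ \vdots \\ \sum_{i=0}^{c-1} B^{c-1-i} \otimes J B^{i} J' \end{pmatrix} \hat{\Sigma}_{\hat{\beta}} \begin{pmatrix} I \\ \sum_{i=0}^{1} B^{1-i} \otimes J B^{i} J' \\ \vdots \\ \sum_{i=0}^{c-1} B^{c-1-i} \otimes J B^{i} J' \end{pmatrix}',
$$

where

$$
B = \begin{pmatrix} A^{(1)} \\ I_{K(p-1)} \quad 0_{K(p-1) \times K} \end{pmatrix} \quad \text{and} \quad J = \begin{pmatrix} I_K, & 0_{K \times K(p-1)} \end{pmatrix}.
$$

It has been shown (Lütkepohl and Burda, 1997) that

$$
\lambda^{mod}_{Wald} \xrightarrow{d} \chi^2(rc) \quad \text{under the null hypothesis in Eq. (11).}
$$

Therefore, given a significance level $\alpha$, the null hypothesis in Eq. (11) is rejected when $\lambda^{mod}_{Wald} > \chi^2_{1-\alpha}(rc)$.

Remark 1. If $c = 1$, the random vector $\hat{w}$ has a degenerate distribution localized at zero (i.e., $\hat{w} = 0$ and $\hat{\Sigma}_{\hat{w}} = 0$ almost surely). In this case the modified Wald test reduces to the standard Wald test.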
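To make the structure of Eq. (12) concrete, the following sketch evaluates the statistic once $\hat{\alpha}$, $\hat{\Sigma}_{\hat{\alpha}}$ and $\hat{\Sigma}_{\hat{\beta}}$ are available. It is an illustration under stated assumptions rather than the authors' implementation: the function name `modified_wald` and its argument names are hypothetical, the estimates are taken as given inputs, and the comparison with the $\chi^2(rc)$ critical value is left to the caller. Because $\hat{\Sigma}_{\hat{w}}$ is diagonal by construction, $\hat{w}$ is drawn component-wise.

```python
import numpy as np

def modified_wald(alpha_hat, sigma_alpha, sigma_beta, R, c, T, lam, rng=None):
    """Return (statistic, degrees_of_freedom) for the statistic in Eq. (12).

    alpha_hat   : estimate of alpha, length c*p*K^2
    sigma_alpha : estimate of Sigma_alpha, (c*p*K^2) x (c*p*K^2)
    sigma_beta  : estimate of Sigma_beta, (p*K^2) x (p*K^2)
    R           : r x (p*K^2) selection matrix
    """
    rng = np.random.default_rng() if rng is None else rng
    R = np.atleast_2d(np.asarray(R, dtype=float))
    r = R.shape[0]
    C = np.kron(np.eye(c), R)                          # I_c (x) R

    # Diagonal of Sigma_w: r zeros, then diag(R Sigma_beta R') repeated c-1 times
    w_var = np.zeros(c * r)
    w_var[r:] = np.tile(np.diag(R @ sigma_beta @ R.T), c - 1)
    sigma_w = np.diag(w_var)

    # w ~ MN(0, lam * Sigma_w); Sigma_w is diagonal, so draw each component separately
    w = rng.standard_normal(c * r) * np.sqrt(lam * w_var)

    v = C @ alpha_hat + w / np.sqrt(T)
    M = C @ sigma_alpha @ C.T + lam * sigma_w
    stat = float(T * (v @ np.linalg.solve(M, v)))
    return stat, r * c

# Toy check with c = 1, where w = 0 and the statistic is the standard Wald statistic
alpha_hat = np.array([0.1, 0.2, 0.3, 0.4])
R = np.array([[0.0, 1.0, 0.0, 0.0]])
stat, df = modified_wald(alpha_hat, np.eye(4), np.eye(4), R, c=1, T=100, lam=0.1)
```

With $c = 1$ the noise term vanishes, so the function reproduces the behaviour described in Remark 1.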

3.2 Automatic Computer-search Algorithms

Note that if the identification of every informative variable relies merely on one modified Wald test (i.e., with all remaining variables included in the analysis), the search procedure can lead to the dropping of “true” informative variables (the issue is similar to the one that arises in regression analysis when the full model is considered and variables are selected merely based on the t statistics; see Kutner et al. (2008)). On the other hand, if the modified Wald test is conducted by considering all possible subsets of variables, the search procedure can be computationally expensive (especially when the number of variables is large). To overcome these problems, we introduce some automatic computer-search algorithms that have been widely used in regression analysis, such as forward selection, backward elimination, and forward stepwise. We introduce the ideas of these algorithms in the following.

Forward Selection: At each stage the algorithm includes one “most informative” variable based on a predetermined level of the corresponding p-value, but omits the test of whether an included variable should be removed. The algorithm terminates if no further variables can be included.

Backward Elimination: This is the opposite of forward selection. The algorithm starts with the model containing all variables and at each stage removes one “most non-informative” variable based on a predetermined level of the corresponding p-value. The algorithm terminates if no further variables can be removed.

Forward Stepwise: At each stage the algorithm first includes one “most informative” variable based on a predetermined level of the corresponding p-value (called α-to-enter) and, if there are any other variables in the model, removes one “most non-informative” variable based on another predetermined level of the corresponding p-value (called α-to-remove). The algorithm terminates if no further variables can be either included or removed.
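The first two search strategies can be sketched as follows. This is a simplified illustration, not the authors' code: `p_value(variable, model_vars)` is a hypothetical callable assumed to return the p-value of the modified Wald test for `variable` given the variables currently in the model, and the thresholds `alpha_enter` and `alpha_remove` are user-chosen levels.

```python
def forward_selection(candidates, p_value, alpha_enter=0.05):
    """Add the variable with the smallest p-value until none falls below alpha_enter."""
    selected = []
    remaining = list(candidates)
    while remaining:
        pvals = {v: p_value(v, selected) for v in remaining}
        best = min(pvals, key=pvals.get)        # "most informative" candidate
        if pvals[best] >= alpha_enter:
            break                               # nothing else can be included
        selected.append(best)
        remaining.remove(best)
    return selected

def backward_elimination(candidates, p_value, alpha_remove=0.05):
    """Start from all variables; drop the one with the largest p-value above alpha_remove."""
    selected = list(candidates)
    while selected:
        pvals = {v: p_value(v, selected) for v in selected}
        worst = max(pvals, key=pvals.get)       # "most non-informative" variable
        if pvals[worst] <= alpha_remove:
            break                               # nothing else can be removed
        selected.remove(worst)
    return selected

# Toy p-values that ignore the current model (purely for illustration)
toy_p = {"Y1": 0.01, "Y2": 0.50, "Y3": 0.03}
forward_result = forward_selection(toy_p, lambda v, model: toy_p[v])
backward_result = backward_elimination(toy_p, lambda v, model: toy_p[v])
```

A forward stepwise variant would alternate one inclusion step and one removal step per stage, using the two thresholds as α-to-enter and α-to-remove.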

It should be mentioned here that the algorithms introduced above all result in approximations to the “best set” of informative variables. In addition, there is no guarantee that the search results of different algorithms will be the same. We summarize the steps of implementing these algorithms as follows, in which we first extract the informative variables in $Y_t$ and then the informative variables in $X_t$.

Step 1: Select one algorithm and denote the corresponding initial sets of informative variables in $Y_t$ and $X_t$ by $Y^{(0)}$ and $X^{(0)}$, respectively. Set the initial stage $k = 0$.

Step 2: Set the stage $k = k + 1$. Update the set of informative variables in $Y_t$ based on $X^{(0)}$ and the associated modified Wald tests, and denote the updated set by $Y^{(k)}$.

Step 3: Repeat Step 2 until the stopping criterion is satisfied. Denote the resulting estimated set of informative variables in $Y_t$ by $\tilde{Y}_t$. Set the stage $k = 0$.

Step 4: Set the stage $k = k + 1$. Update the set of informative variables in $X_t$ based on $\tilde{Y}_t$ and the associated modified Wald tests, and denote the updated set by $X^{(k)}$.

Step 5: Repeat Step 4 until the stopping criterion is satisfied. Denote the resulting estimated set of informative variables in $X_t$ by $\tilde{X}_t$.

Step 6: Extract the obtained two sets $\tilde{Y}_t$ and $\tilde{X}_t$.
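Steps 1 to 6 amount to running the chosen search twice, first over $Y_t$ and then over $X_t$. The sketch below shows this control flow with backward elimination as the search. All names are hypothetical: `pval_y(v, model, x_set)` and `pval_x(v, model, y_set)` stand in for the p-values of the associated modified Wald tests, and the initial set $X^{(0)}$ is assumed here to be the full $X$ group, which the steps above do not specify.

```python
def backward_elimination(variables, p_value, alpha_remove=0.05):
    """Drop the variable with the largest p-value until all are at or below alpha_remove."""
    kept = list(variables)
    while kept:
        pvals = {v: p_value(v, kept) for v in kept}
        worst = max(pvals, key=pvals.get)
        if pvals[worst] <= alpha_remove:
            break
        kept.remove(worst)
    return kept

def two_stage_extraction(y_vars, x_vars, search, pval_y, pval_x):
    """Search over Y given X(0) (Steps 2-3), then over X given the selected Y (Steps 4-5)."""
    x0 = list(x_vars)  # assumption: X(0) is the full X group
    y_tilde = search(list(y_vars), lambda v, model: pval_y(v, model, x0))
    x_tilde = search(list(x_vars), lambda v, model: pval_x(v, model, y_tilde))
    return y_tilde, x_tilde                      # Step 6

# Toy p-values, purely for illustration
toy_y = {"Y1": 0.01, "Y2": 0.40}
y_tilde, x_tilde = two_stage_extraction(
    ["Y1", "Y2"], ["X1", "X2"], backward_elimination,
    pval_y=lambda v, model, x_set: toy_y[v],
    pval_x=lambda v, model, y_set: 0.02 if "Y1" in y_set else 0.90,
)
```

Passing the search as an argument keeps the two-stage logic independent of whether forward selection, backward elimination, or a stepwise variant is used.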


Issues and Suggestions:

(1) The numerical results show that the algorithms considered in this study achieve fairly high accuracy in forecasting the future values of selected variables.

(2) It should be noted that the algorithms introduced above all result in approximations to the “best set” of informative variables.

(3) It should be mentioned that the algorithms used in this study may fail (i.e., $\tilde{Y}_t = \tilde{X}_t = \emptyset$; readers can refer to Makridakis et al. (1983) for an example in which the forward stepwise algorithm fails). To overcome this problem, we can consider the following two strategies: (i) increase the values of α-to-enter and/or α-to-remove; (ii) utilize other search algorithms or perform an exhaustive search of all possible subsets of variables. If none of the strategies work, it is possible that there is no causal relationship between the two groups of variables.

(4) One may wonder which of the information sets $\tilde{X}_t$ and $\tilde{Y}_t$ obtained from the algorithms are the “best”. The answer is, in fact, case-dependent. One possible solution is to compare the accuracy of forecasting for some selected variables.

(5) It is noted that for small samples, the size of the modified Wald test can be sensitive to the choice of $\lambda$ in Eq. (12). To avoid this problem, for this particular data set we suggest the following rule of thumb for choosing $\lambda$ at each stage of the modified Wald test:

Let $N$ be the number of variables included in the VAR model. Choose $\lambda = 9 - N$ and $\lambda = 35 - 5N$ when extracting the informative variables in $Y_t$ and $X_t$, respectively.

(6) We are currently investigating the computational cost when the number of variables becomes large, and how to extract informative variables based on a suitable multiple hypothesis testing procedure for non-stationary processes. This is quite important, since controlling the type I error (or family-wise error rate) of the tests would let us state the level of accuracy of the two obtained sets of informative variables.

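(5) Self-Evaluation of Project Results

The project was carried out smoothly, and its results met the expected goals and are satisfactory. The personnel who took part in the project received the following training:

1. They became familiar with problems concerning causal relationships in time series.

2. They became familiar with the operation of VAR models and with parameter estimation and forecasting methods. Hands-on computer simulation familiarized them with software such as C++, Matlab, SAS, and R, and also made them more competitive in programming.

The project progressed well, and preliminary results have been published in:

Hung, Y.C.; Tseng, N.F. (2013). Extracting Informative Variables in the Validation of Two-group Causal Relationship. Computational Statistics, 28(3), pp. 1151-1167.

In addition, another paper is in preparation and is expected to be submitted shortly to “The Electronic Journal of Statistics”: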

Tseng, N.F.; Hung, Y.C.; Balakrishnan, N. A trimmed Causal Relationship Between Two Groups of Time Series, preprint.

(6) References

of variables. If none of the strategies work, it is possible that there is no causal relationship between the two groups of variables. Second, one may wonder which of the information sets $\tilde{Y}_t$ and $\tilde{X}_t$ obtained from the algorithms are the “best”. The answer is, in fact, case-dependent. One possible solution is to consider the accuracy of forecasting for some selected variables. To illustrate, let us recall the numerical results in Table 9. Suppose now we are more interested in forecasting the growth rates of “Exports of Goods” ($X_{1,t}$) and “Imports of Goods” ($X_{2,t}$); then the informative variables extracted by the forward selection algorithm should work better (since they result in the smallest values of MSEP and MAEP). Finally, we are currently investigating the computational cost when the number of variables becomes large, and how to extract informative variables based on a suitable hypothesis testing procedure for non-stationary processes.
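The comparison by forecasting accuracy can be sketched as follows, assuming that MSEP and MAEP denote the mean squared error and the mean absolute error of prediction (the abbreviations are not defined in this section, so this reading is an assumption). The function names and the numbers are illustrative only and are not taken from Table 9.

```python
def msep(actual, predicted):
    """Mean squared error of prediction over paired actual/predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def maep(actual, predicted):
    """Mean absolute error of prediction over paired actual/predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical out-of-sample values and forecasts from two candidate variable sets
actual = [1.0, 2.0, 3.0]
forecasts = {
    "forward selection": [1.0, 2.0, 3.5],
    "backward elimination": [1.5, 2.5, 4.0],
}

# Prefer the variable set whose forecasts give the smallest MSEP
best_by_msep = min(forecasts, key=lambda name: msep(actual, forecasts[name]))
```

The same ranking can be repeated with `maep`; the two criteria need not agree in general.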

References

1. Arnold A, Liu Y, Abe N (2007) Temporal causal modeling with graphical Granger methods. Proceed-ings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 66-75.

2. Berberian SK (1961) Introduction to Hilbert space. Oxford University Press, New York.

3. Boudjellaba H, Dufour JN, Roy R (1992) Testing causality between two vectors in multivariate autore-gressive moving average models. J. Amer. Statist. Assoc. 87: 1082-1090.

4. Dufour JM, Renault E (1998) Short-run and long-run causality in time series theory. Econometrica 66: 1099-1125.

5. Fujita A, Sato JR, Garay-Malpartida HM, Morettin PA, Sogayar MC, Ferreira CE (2007) Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics 23: 1623-1630.

6. Geweke J (1982) Measurement of linear dependence and feedback between multiple time series. J. Amer. Statist. Assoc. 77: 304-313.

7. Geweke J (1984) Inference and causality in economic time series, in: Z. Griliches and M.M. Intriligator, eds., Handbook of Econometrics, Vol. 2 (North-Holland, Amsterdam), pp 1101-1144.

8. Granger CWJ (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37: 424-438.

9. Granger CWJ (1980) Testing for causality: a personal viewpoint. J. Econ. Dyn. Control 2: 329-352.

10. Granger CWJ, Lin JL (1995) Causality in the long run. Econometric Theory 11: 530-536.

11. Hacker RS, Hatemi-J A (2006) Tests for causality between integrated variables using asymptotic and bootstrap distributions: theory and application. Appl. Econ. 38: 1489-1500.

12. Haufe S, Müller K-R, Nolte G, Krämer N (2010) Sparse causal discovery in multivariate time series. NIPS 2008 Workshop on Causality, JMLR Workshop and Conference Proceedings 6: 97-106.

13. Hocking RR (1976) The analysis and selection of variables in linear regression. Biometrics 32: 1-49.

14. Hsiao C (1982) Autoregressive modeling and causal ordering of econometric variables. J. Econ. Dyn. Control 4: 243-259.

15. Koster JTA (1996) Markov properties of nonrecursive causal models. Ann. Statist. 24: 2148-2177.

16. Koster JTA (1999) On the validity of the Markov interpretation of path diagrams of Gaussian structural equations systems with correlated errors. Scand. J. Statist. 26: 413-431.

17. Kutner MH, Nachtsheim CJ, Neter J (2008) Applied linear regression models, 4th edn. McGraw Hill, New York.

18. Lauritzen SL (1996) Graphical models. Oxford University Press, Oxford.

19. Lauritzen SL (2000) Causal inference from graphical models, in: E. Barndorff-Nielsen, D.R. Cox, and C. Klüppelberg, eds., Complex Stochastic Systems, CRC Press, London.

20. Lütkepohl H (2005) New introduction to multiple time series analysis, 1st edn. 2nd printing, Springer, Berlin, Heidelberg, New York.

21. Lütkepohl H, Burda MM (1997) Modified Wald tests under nonregular conditions. J. Econometrics 78: 315-332.
