
Finally, we summarize the analytical process and results, and discuss directions for future development and application.


2. Methodology

2.1. Empirical mode decomposition

2.1.1. Introduction to empirical mode decomposition

Empirical mode decomposition (EMD) is a general non-linear and non-stationary data processing method developed by [Huang et al., 1998]. Since the method is empirical, intuitive and adaptive, it is well suited to analyzing non-linear and non-stationary data. Through the sifting process of EMD, an irregular time series can be decomposed into an independent set of nearly periodic oscillatory modes called

“intrinsic mode functions (IMFs)”. Unlike the non-linear and non-stationary original data, the IMFs are based on the local characteristic time scale and oscillate in more regular modes. Owing to their nearly periodic nature, the IMFs may carry particular physical meanings; for instance, if the period of an IMF is about one day (24 hours), it can be recognized as the daily component of the original data; likewise, an IMF whose period approximates one week is the weekly component of the original data.

Consequently, a non-linear and non-stationary time series can be decomposed into a set of IMFs by EMD, and the physical meanings of the IMFs are easier to analyze than those of the original data.


2.1.2. Intrinsic mode functions and sifting process

The empirical mode decomposition (EMD) proposed by [Huang et al., 1998] is a data-analysis method that is useful for dealing with non-linear and non-stationary time series. It assumes that every time series can be decomposed into a set of intrinsic mode functions (IMFs). The IMFs are based on their own local characteristic scales, and they have to satisfy the following conditions:

(1) In the whole data set, the number of extrema (maxima and minima together) and the number of zero-crossings must be equal or differ at most by one;

(2) At any point, the mean value of the upper envelope (defined by the local maxima) and the lower envelope (defined by the local minima) is zero; that is, the IMFs are symmetric with respect to the local zero mean.

We can use the sifting process to extract the IMFs from the original data by the following steps:

(1) Identify all the local maxima and minima of the time series x(t);

(2) Connect all the local maxima, and separately all the local minima, by cubic spline interpolation to generate the upper envelope emax(t) and the lower envelope emin(t);

(3) Calculate the point-by-point mean m1(t) of the upper and lower envelopes:

m1(t) = (emax(t) + emin(t)) / 2

(4) Calculate the difference between the time series x(t) and the mean m1(t) to obtain h1(t):

h1(t) = x(t) - m1(t)


(5) Check the properties of h1(t):

If h1(t) does not satisfy the conditions of an IMF, replace x(t) with h1(t) and repeat steps (1) to (4) until the result of the kth sifting, hk(t), satisfies the stopping criterion:

SD = Σt [hk−1(t) − hk(t)]² / [hk−1(t)]²

A typical value for SD is set between 0.2 and 0.3.

On the contrary, if h1(t) satisfies the conditions of an IMF, it is taken as the first IMF and denoted c1(t); we then separate c1(t) from the original data x(t) to get the residue r1(t):

r1(t) = x(t) − c1(t)

The residue r1(t) is then treated as the new data, and the sifting process is repeated to extract c2(t), c3(t), and so on. The process stops when the component cn(t) or the residue rn(t) becomes so small that it is less than a predetermined value of substantial consequence, or when the residue rn(t) becomes a monotonic function from which no more IMFs can be extracted [Huang et al., 1998].

At the end of the sifting process, the original time series can be expressed as

x(t) = Σi=1..n ci(t) + rn(t)

where n is the number of IMFs, rn(t) is the final residue representing the trend of x(t), and the ci(t) are the IMFs, the independent components of x(t). The IMFs are listed from high frequency to low frequency by the sifting process, so EMD can be regarded as a filter bank that separates components from high frequency to low frequency.
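To make the procedure concrete, the following is a minimal Python sketch of the sifting loop, assuming numpy and scipy are available. The function names (sift_once, extract_imf), the sum-over-sum form of the SD criterion, and the simplified boundary handling are illustrative assumptions, not the exact implementation of [Huang et al., 1998].

```python
# A minimal sketch of EMD sifting; assumes the series has enough extrema
# for the cubic-spline envelopes and omits end-condition handling.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def sift_once(x, t):
    """One sifting pass: envelopes -> point-by-point mean -> candidate h(t)."""
    maxima = argrelextrema(x, np.greater)[0]      # step (1): local maxima
    minima = argrelextrema(x, np.less)[0]         # step (1): local minima
    e_max = CubicSpline(t[maxima], x[maxima])(t)  # step (2): upper envelope
    e_min = CubicSpline(t[minima], x[minima])(t)  # step (2): lower envelope
    m = (e_max + e_min) / 2.0                     # step (3): mean m(t)
    return x - m                                  # step (4): h(t) = x(t) - m(t)

def extract_imf(x, t, sd_tol=0.3, max_iter=50):
    """Repeat sifting until the SD stopping criterion falls below sd_tol."""
    h_prev = sift_once(x, t)
    for _ in range(max_iter):
        h = sift_once(h_prev, t)
        # step (5): SD criterion (sum-over-sum variant used for stability)
        sd = np.sum((h_prev - h) ** 2) / np.sum(h_prev ** 2)
        if sd < sd_tol:
            return h
        h_prev = h
    return h_prev
```

In practice, envelope end conditions and the minimum number of extrema require extra care, which this sketch omits for brevity.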


Fig. 3 The sifting process:

(a) original data x(t);

(b) m1(t) (shown as the solid line) computed by m1(t) = (emax(t) + emin(t))/2, where emax(t) and emin(t) are the upper and lower envelopes (both shown as dotted lines);

(c) h1(t) = x(t) − m1(t) obtained by the 1st sifting pass, but h1(t) does not satisfy the properties of an IMF;

(d) h2(t) obtained by repeating the sifting process, but it is still not an IMF;

(e) the 1st IMF extracted after 9 repetitions of the sifting process.

Source: [Huang et al., 1998]


Fig. 4 Seven IMFs (listed in the 2nd to 8th rows) and one residue (listed in the 9th row) decomposed from the original time series (listed in the 1st row) by the sifting process. Source: [Huang et al., 1998]


2.1.3. Ensemble Empirical Mode Decomposition

EMD has been proved to be a useful data-analysis method, but the extracted IMFs sometimes suffer from mode mixing, in which a single IMF contains oscillations of widely disparate scales; in other words, a mode-mixing IMF includes two or more oscillatory modes, and the superfluous ones are similar to those in other IMFs. The mode-mixing problem is often caused by intermittent signals in the original data; a typical example is shown in fig. 5 and fig. 6.

On account of the mode-mixing problem in EMD, an IMF sometimes cannot represent its own meaning clearly; thus, ensemble empirical mode decomposition (EEMD) was proposed by [Wu and Huang, 2004] to solve the mode-mixing problem by adding white noise. Every observed data set is a mixture of the true signal and noise; since noise spreads over a wide frequency domain, the ensemble mean of many observations is close to the true signal. For instance, white noise covers the whole frequency range uniformly, and the ensemble mean of its residue is close to zero (shown in fig. 7 and fig. 8).

Accordingly, it is reasonable to add white noise to the original data, since we can still extract the true signal by computing the ensemble mean; for the same reason, the added white noise does not noticeably affect the ensemble mean of the upper and lower envelopes in the sifting process.

The procedure of EEMD is as follows:

(1) Add a white-noise time series to the original data;

(2) Decompose the noise-added data into IMFs;


(3) Repeat the above two steps iteratively, adding a different white-noise realization each time. Finally, obtain the ensemble means of the corresponding IMFs of all the decompositions.

A well-established statistical rule proved by [Wu and Huang, 2004] controls the effect of the added white noise:

εn = ε / √N

where N is the ensemble number, ε is the amplitude of the added white noise, and εn is the final standard deviation of error, defined as the difference between the input signal and the corresponding IMFs. Results comparing EMD and EEMD are shown in fig. 5, fig. 6 and fig. 9. In this study, the ensemble number N is set to 100 and ε is set to 0.1.

With this procedure, the added white noise fills the original data with a uniform frequency domain and eliminates most of the mode-mixing problems in the IMFs. Therefore, EEMD is a substantial improvement over the original EMD.
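As an illustration of this procedure, here is a hedged Python sketch. It assumes an emd(x) callable (for example, built around the sifting sketch in section 2.1.2) that returns an array of components (IMFs plus residue) with the same shape on every call; real implementations must also reconcile ensemble members that yield different numbers of IMFs.

```python
# A minimal EEMD sketch: ensemble-average the decompositions of noise-added
# copies of the data. `emd` is an assumed callable, not a library function.
import numpy as np

def eemd(x, emd, ensemble_size=100, eps=0.1):
    scale = eps * np.std(x)  # noise amplitude epsilon, relative to the data
    members = []
    for _ in range(ensemble_size):
        noisy = x + np.random.normal(0.0, scale, size=x.shape)  # step (1)
        members.append(emd(noisy))                              # step (2)
    # step (3): ensemble mean of corresponding components; by the rule of
    # [Wu and Huang, 2004], the residual noise level is about eps / sqrt(N)
    return np.mean(np.array(members), axis=0)
```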


Fig. 5 The original data, a sine wave containing higher-frequency intermittent signals, and the first sifting pass. Source: [Wu and Huang, 2004]

Fig. 6 The IMFs and the residue extracted from the data in fig. 5; C1 obviously includes two oscillatory modes, which is the mode-mixing problem. Source: [Wu and Huang, 2004]


Fig. 7 The original data: uniform white noise recorded over 1000 seconds. Source: [Data Analysis Center, NCU]

Fig. 8 The IMFs and the residue decomposed from the uniform white noise with SD = 1; the mean value of the residue is about zero. Source: [Data Analysis Center, NCU]


Fig. 9 The IMFs and the residue decomposed from the data in fig. 5 by EEMD, where the ensemble number is 50 and ε is 0.1. The mode mixing seen in fig. 6 has been separated into C3 and C5. Source: [Wu and Huang, 2004]

2.2. Statistical Measures

We use the following statistical measures to analyze the IMFs and residues: mean period, Pearson product moment correlation coefficient, Kendall tau rank correlation coefficient, variance, variance percentage, power percentage, and LRCV. The definitions of these statistical measures are given as follows.


2.2.1. Mean period

The mean period is defined as the inverse of the mean frequency, and the mean frequency is the average of the instantaneous frequency. The instantaneous frequency is computed by the Hilbert-Huang transform (HHT), and the steps for computing it are as follows.

First, we extract the IMFs from the original data by EMD or EEMD. Since each IMF X(t) is symmetric about the local zero mean, the Hilbert transform can be used to compute its time-series imaginary part Y(t):

Y(t) = (1/π) P ∫ X(τ) / (t − τ) dτ

where P denotes the Cauchy principal value.

As X(t) and Y(t) are the corresponding real and imaginary parts, they form a time-series complex conjugate pair in polar coordinates:

Z(t) = X(t) + iY(t) = a(t) e^(iθ(t))

where a(t) is the time-series amplitude in polar coordinates, defined as:

a(t) = [X(t)² + Y(t)²]^(1/2)

and θ(t) is the time-series phase in polar coordinates, defined as:

θ(t) = arctan(Y(t) / X(t))

Since X(t) and Y(t) are both instantaneous values, θ(t) indicates the change of phase between two consecutive points. The instantaneous frequency ω(t) is computed as the derivative of θ(t) with respect to time:

ω(t) = dθ(t) / dt

Because the instantaneous frequency is defined by a derivative, it is well suited to computing the frequencies and periods of non-linear and non-stationary time series. We can compute the mean period of each IMF from its instantaneous frequency and then analyze the IMF's meaning and properties. Besides characterizing individual IMFs, the mean period is also an important measure for dividing the IMFs into groups.

On the other hand, it is meaningless to compute the mean period of the residue because it is a monotonic function. For this reason, we only compute the mean period of the IMFs in this study.
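The computation sketched above can be written compactly with scipy's analytic-signal routine; the uniform sampling interval dt and the simple averaging of the instantaneous frequency are illustrative assumptions.

```python
# Hedged sketch: instantaneous frequency and mean period of an IMF via the
# Hilbert transform. scipy.signal.hilbert returns Z(t) = X(t) + iY(t).
import numpy as np
from scipy.signal import hilbert

def mean_period(imf, dt=1.0):
    z = hilbert(imf)                 # analytic signal Z(t) = a(t) e^(i theta(t))
    theta = np.unwrap(np.angle(z))   # phase theta(t), unwrapped over time
    omega = np.diff(theta) / dt      # instantaneous angular frequency d(theta)/dt
    f_mean = np.mean(omega) / (2.0 * np.pi)  # mean frequency (cycles per unit time)
    return 1.0 / f_mean              # mean period = inverse of mean frequency
```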

2.2.2. Pearson product moment correlation coefficient

The Pearson product moment correlation coefficient is a statistical measure often used to gauge the degree of linear dependence between two variables. For two variables X and Y, it is defined as the covariance of X and Y divided by the product of their standard deviations:

ρX,Y = cov(X, Y) / (σX σY) = E[(X − μX)(Y − μY)] / (σX σY)

where μX and σX are the mean and standard deviation of the variable X, and E indicates the expectation.

The Pearson correlation coefficient always ranges from −1 to 1; the larger its absolute value, the stronger the linear correlation between the two variables. Different sorts of relationships between two variables and their corresponding correlation coefficients are shown in fig. 10. A value of 1 means that X and Y always increase or decrease together, so a linear equation describes the relationship between X and Y perfectly. A value of −1 means that X and Y always vary in opposite directions, a perfectly negative linear correlation. A value of zero means there is no linear correlation between X and Y.

The extent of the correlation in different ranges of the Pearson product moment correlation coefficient is shown in table 1.

Table 1 The extent of linear correlation represented by different ranges of the Pearson product moment correlation coefficient. Source: [Linear statistical models, Philip B. Ender, UCLA]

Correlation   Negative         Positive
None          −0.09 to 0.0     0.0 to 0.09
Small         −0.3 to −0.1     0.1 to 0.3
Medium        −0.5 to −0.3     0.3 to 0.5
Large         −1.0 to −0.5     0.5 to 1.0


Fig. 10 Different sets of points and their corresponding values of the Pearson product moment correlation coefficient (r). Source: [Linear statistical models, Philip B. Ender, UCLA]
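As a small illustration with hypothetical data, the coefficient can be computed directly with scipy:

```python
# Hedged example: Pearson r for a hypothetical, roughly linear relationship.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + 0.2 * rng.normal(size=200)  # mostly linear dependence plus noise
r, p_value = pearsonr(x, y)               # r = cov(X, Y) / (sigma_X * sigma_Y)
print(f"Pearson r = {r:.3f}")             # a "large" positive value per table 1
```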

2.2.3. Kendall tau rank correlation coefficient

In statistics, the Kendall tau rank correlation coefficient is used to measure the association between two variables. Unlike the Pearson product moment correlation coefficient, the Kendall tau rank correlation coefficient is a measure of rank correlation; it compares the ordering of concordant ranks between two variables.

The Kendall tau rank correlation coefficient is defined as:

τ = (number of concordant pairs − number of discordant pairs) / (n(n − 1)/2)

where n is the number of data points. Concordant and discordant pairs are defined as follows: for two variables X and Y, if the corresponding values are in concordant ranking, i.e. both Xi > Xj and Yi > Yj, or both Xi < Xj and Yi < Yj, the pair is said to be concordant. On the contrary, if the corresponding values are in discordant ranking, i.e. Xi > Xj but Yi < Yj, or Xi < Xj but Yi > Yj, the pair is said to be discordant.

The Kendall tau rank correlation coefficient ranges from −1 to +1, and values of +1, −1 and zero imply different orderings of the ranks. A value of +1 means the rankings of the two variables are in perfect agreement and the number of discordant pairs is zero. A value of −1 means the rankings are in perfect disagreement and the number of concordant pairs is zero. A value near zero means the two variables are independent of each other.
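A short illustration with hypothetical rankings:

```python
# Hedged example: Kendall tau for two partially concordant rankings.
import numpy as np
from scipy.stats import kendalltau

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])      # two swaps relative to x
tau, p_value = kendalltau(x, y)    # tau = (n_c - n_d) / (n(n-1)/2)
print(f"Kendall tau = {tau:.3f}")  # 8 concordant, 2 discordant pairs -> 0.6
```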

2.2.4. Variance

Since the variance percentage and power percentage are both defined in terms of variance, we introduce these three statistical measures together: first variance, then variance percentage, and finally power percentage.

Variance is defined as the mean of the squared difference between the data and the ensemble average:

Var(X) = E[(X − μ)²]

where X is the value of the data, μ is the ensemble average, and E indicates the expected value.

In probability theory and statistics, variance is used to determine the degree of dispersion of a set of points, i.e. how far the data lie from the ensemble average; variance is therefore a useful measure of whether the data aggregate around or disperse from the ensemble average. Typical examples of differently distributed data and their corresponding variances are shown in fig. 11.

Fig. 11 Six distributions with the same mean (= 1) but different variances, whose values are indicated next to each curve. Note that the greater the variance, the wider the spread of the curve. Source: [Computerized Information Series, FAO]
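A brief numerical illustration of the definition, using hypothetical values:

```python
# Hedged example: variance as a measure of dispersion about the same mean.
import numpy as np

tight = np.array([0.9, 1.0, 1.1])   # clustered around the mean of 1
spread = np.array([0.0, 1.0, 2.0])  # dispersed around the same mean
print(np.var(tight))   # about 0.0067: low dispersion
print(np.var(spread))  # about 0.6667: high dispersion
```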

2.2.5. Power percentage and variance percentage

The definitions of power percentage and variance percentage for a component ci (an IMF or the residue) of the data x(t) are as follows:

power percentage = Var(ci) / Var(x) × 100%

variance percentage = Var(ci) / Σj Var(cj) × 100%


By these definitions, power percentage and variance percentage show the weighting of each IMF or residue from different perspectives: power percentage is relative to the original data, while variance percentage is relative to the summed variance of the components. Note that the summed variance of all IMFs and the residue is not exactly equal to the variance of the original data, due to round-off errors, the added white noise, and the cubic spline end conditions in EEMD [Peel et al., 2005].

Since variance measures how far the data points lie from the local mean, both power percentage and variance percentage express the weighting of each component as a percentage. The larger these two measures are, the more important the component is; thus, they are useful for determining which components are worth analyzing.
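The two measures can be computed side by side as in the following sketch, which assumes components holds the IMFs plus residue from EEMD and x the original series; the formulas follow the definitions given above.

```python
# Hedged sketch: power percentage (relative to the original data) and variance
# percentage (relative to the summed component variances) for each component.
import numpy as np

def weighting_percentages(x, components):
    var_x = np.var(x)
    var_c = np.array([np.var(c) for c in components])
    power_pct = 100.0 * var_c / var_x           # may not sum to exactly 100%
    variance_pct = 100.0 * var_c / var_c.sum()  # sums to 100% by construction
    return power_pct, variance_pct
```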

2.2.6. LRCV

The measure “LRCV” is the abbreviation of “logarithm of the ratio of consecutive financial variable values” [Huang et al., 2003b]. LRCV is often used to analyze financial time-series data; it is defined as follows:

LRCV(n) = ln(Sn / Sn−1)

where Sn is the value of the original data at the nth time step and ln is the natural logarithm. The value of LRCV represents the variability between two consecutive time steps, and larger magnitudes of LRCV correspond to sharper fluctuations in the original data.

This measure is only used to compare the volatility of the original monthly gold prices with that of the high-frequency term in section 6.
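A minimal sketch of the measure, assuming the log-return form LRCV(n) = ln(Sn / Sn−1) given above:

```python
# Hedged sketch: LRCV of a price series S (hypothetical values shown).
import numpy as np

def lrcv(S):
    S = np.asarray(S, dtype=float)
    return np.log(S[1:] / S[:-1])  # LRCV_n = ln(S_n / S_{n-1})

prices = np.array([400.0, 410.0, 405.0])  # hypothetical monthly gold prices
print(lrcv(prices))  # larger magnitudes indicate sharper fluctuations
```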


3. Data and analysis

The hourly electricity consumption, hourly temperature and LME monthly gold prices can each be decomposed by EEMD into a set of IMFs with different time scales plus a residue. Since many components are decomposed from the data, we must sort out the significant components, which are the more meaningful ones to analyze.

The statistical measures “power percentage” and “variance percentage” are used to determine the significant components, and the “Pearson product moment correlation coefficient” and “Kendall tau rank correlation coefficient” are used to measure the correlation between the significant components and the original data.

Among the significant components, specific meanings can be assigned to the IMFs through the statistical measure “mean period”, while the residues exhibit deterministic long-run properties as the trends of the original data [Huang et al., 1998]. For these reasons, we can find the characteristics of the original data by analyzing the IMFs and the residues.

In this section, we first introduce the data sources, then analyze the meanings of the significant components for each data set part by part, and finally summarize the results of the analysis.


3.1. Data

3.1.1. Hourly electricity consumption in NCCU

The hourly electricity consumption data are supplied by [Office of General Affairs, NCCU]. The recording period spans from January 28th, 2008 to February 3rd, 2010. According to sampling frequency, the data are divided into two types: hourly electricity consumption and daily electricity consumption. Both types record the electricity consumption from five major electricity meters in NCCU, named GCB2, GCB3, GCB5, GCB6, and GCB10. The five electricity meters record the electricity consumption of different buildings as follows:

GCB2-Information Building, GCB3-Research Building, GCB5-College of Commerce Building, GCB6-CiSian Building, GCB10-General Building of Colleges.

Since the electricity consumption of the Information Building, College of Commerce Building and General Building of Colleges is higher than that of the others, and the hourly data reveal more information and detail than the daily data, we choose the hourly data of GCB2, GCB5 and GCB10 to analyze the electricity consumption in NCCU. The analytical period we select runs from March 1st, 2008 to June 30th, 2008, because the electricity consumption in the 2nd semester is always much higher than in the 1st semester at NCCU.

3.1.2. Hourly temperature in Taipei

The data “hourly temperature in Taipei” is taken from the hourly atmospheric data recorded by the weather station “Taipei (6920)”; the original data are downloaded from [Data Bank for Atmospheric Research (DBAR), NTU]. The analytical period of the hourly temperature is the same as that of the hourly electricity consumption. The hourly atmospheric data contain 152 rows, each corresponding to a different atmospheric measure. According to the description of the hourly atmospheric data provided by DBAR, the measure in the 29th to 33rd rows, called “dry bulb temperature”, gives the average hourly temperature in Taipei, and it is the analytical data we need in this study.

Because the two data sets share the same sampling frequency and recording period, it is possible to compare the correlation between electricity consumption and temperature in NCCU over the same interval of time.

3.1.3. Monthly gold price

The gold price data are downloaded from the website [KITCO (www.kitco.com)]; the original data are recorded by the London Metal Exchange (LME) in units of US dollars per ounce ($/oz). The original monthly gold prices and the corresponding significant events are shown in fig. 12.

LME records many types of gold price data, such as hourly, monthly, and yearly data. In this study, we choose the monthly gold prices for the following reasons. The monthly data, equivalent to approximately 24 years of records, reveal more information and factors in the long run: they show longer time variation than the hourly data and retain more detail than the yearly data (1833 to 1999). Thus, the monthly data best present the details in the long run (fig. 12).


Fig. 12 LME monthly gold prices and significant events recorded from Jan. 1968 to Nov. 2010.

3.2. Hourly temperature in Taipei

The IMFs and residue extracted from the hourly temperature are shown in fig. 13 and fig. 14, and the statistical measures are listed in table 2. Among all the components of the hourly temperature, the most significant one is the residue, because its weighting percentages and correlation coefficients are all the largest: the power percentage of the residue is about 46%, the variance percentage about 60%, and the Pearson and Kendall correlation coefficients about 0.702 and 0.513, respectively. Besides the residue, another significant component is IMF4, whose mean period is around 24 hours. The power percentage and variance percentage of IMF4 are 13.5% and 17.38%, both the largest among all the IMFs. On the other hand, the Pearson and Kendall correlation coefficients of IMF4 are 0.432


27

and 0.255, respectively; the Pearson correlation coefficient is the highest, but the Kendall
