
Chapter 6 Performance Evaluation

6.2 Performance evaluation of the CMP algorithm

6.2.2 Evaluations on real data

In this sub-section, we compare the CMP algorithm with the modified Apriori and BIDE algorithms using two real datasets: weather and stock. The weather dataset is retrieved from the Data Bank for Atmospheric Research (DBAR) and covers January 2003 to December 2007, for a total of 1825 transactions [16]. Each transaction in the database includes two sequences: the temperature and the relative humidity of one day. These measurements are taken at the weather station in Taipei, and both sequences are recorded hourly. In the experiment, we aim to examine the influence of temperature and relative humidity on cloud formation. Hence, we select the dates with an average cloudiness greater than six on a decimal scale, where a value of six refers to mostly cloudy [7]. As a result, 1348 days are selected. Therefore, the database contains 1348 transactions, and the length of each transaction is 24.

Next, we transform each sequence in a transaction of the weather dataset into a symbolic sequence. The first sequence in a transaction represents temperature and uses five symbols, L₁, ML₁, M₁, MH₁, and H₁, which represent low, medium low, medium, medium high, and high temperature, respectively. The second sequence in a transaction represents relative humidity and likewise uses five symbols, L₂, ML₂, M₂, MH₂, and H₂, from low to high humidity.
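The symbolic transformation can be sketched as follows. This is a minimal illustration that assumes SAX-style equiprobable breakpoints of a standard normal distribution, no piecewise aggregation, and a mean and standard deviation supplied by the caller; the function names, the normalization choice, and the ASCII labels (e.g. `MH1` for MH₁) are assumptions made for illustration, not details taken from the experiments.

```python
from bisect import bisect_right
from statistics import NormalDist

# Level names from low to high, as used in the text.
LEVELS = ["L", "ML", "M", "MH", "H"]

def breakpoints(alphabet_size):
    """Cut points that split a standard normal into equiprobable bins."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

def to_symbols(values, mean, std, seq_id, levels=LEVELS):
    """Map each value to a level symbol tagged with its sequence ID.

    mean and std are assumed to be chosen by the caller (hypothetical
    parameters); each value is z-normalized and binned by the cut points.
    """
    cuts = breakpoints(len(levels))
    symbols = []
    for v in values:
        z = (v - mean) / std
        symbols.append(f"{levels[bisect_right(cuts, z)]}{seq_id}")
    return symbols

# Illustrative temperatures only, not values from the dataset.
print(to_symbols([18.0, 22.0, 25.0, 28.0, 33.0], mean=25.0, std=4.0, seq_id=1))
```

Passing a different `levels` list changes the number of symbols, which is the parameter varied later in the comparison over three to seven symbols.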

The stock dataset is collected from the Taiwan Stock Exchange Corporation (TSEC) [57] and covers January 2000 to December 2007. Seven stock indexes are chosen: the Nasdaq Composite Index (IXIC), the Dow Jones Industrial Average (DJI), the Tokyo Nikkei 225 Price Index (NK-225), the Korea Composite Stock Price Index (KOSPI), the Hong Kong Hang Seng Index (HSI), the Singapore Straits Times Index (STI), and the Taiwan Stock Index (TAIEX). Each of these stock market movements is represented by one of the sequences in a transaction, and each sequence represents the movements of that stock market in a month. In this experiment, we aim to examine whether the Asian stock markets are likely to be affected by the U.S. stock market. That is, we examine whether there exists a pattern that contains all seven stock indexes (multiple sequences) moving in the same direction.

Therefore, the database contains 96 transactions (96 months), where each transaction includes seven sequences and the length of each transaction is about 20. Then, each sequence is transformed into a symbolic sequence. The movements of a stock market are classified into three levels: rising, falling, and constant. That is, each sequence is formed by three symbols: R_i (rising), F_i (falling), and C_i (constant), where i is the sequence ID in a transaction, 1 ≤ i ≤ 7.
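A minimal sketch of the three-level classification is given below. It assumes that a movement is the relative change between consecutive closing values and that a hypothetical tolerance `eps` decides what counts as constant; neither the tolerance nor the function name comes from the text.

```python
def movements(closes, seq_id, eps=0.0):
    """Map consecutive closing values to R (rising), F (falling), C (constant).

    eps is a hypothetical relative tolerance: changes within +/- eps are
    treated as constant. seq_id is the sequence ID i in a transaction.
    """
    symbols = []
    for prev, cur in zip(closes, closes[1:]):
        change = (cur - prev) / prev
        if change > eps:
            level = "R"
        elif change < -eps:
            level = "F"
        else:
            level = "C"
        symbols.append(f"{level}{seq_id}")
    return symbols

# Illustrative index values only, not data from the stock dataset.
print(movements([100.0, 101.0, 101.0, 99.0], seq_id=7))
```

With `eps=0.0` only an unchanged value maps to C; a larger tolerance folds small fluctuations into C and therefore changes how often each symbol appears.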

Fig. 6.7. Runtime versus minimum support: (a) weather and (b) stock.

The performance of the three algorithms on the real datasets is quite similar to that on the synthetic data. Fig. 6.7 illustrates the runtime versus the minimum support threshold for the weather and stock datasets. In the modified BIDE algorithm, a new transaction is formed by appending the sequences of a transaction one after another (in the weather dataset, the second sequence is appended to the first). The length of a transaction is therefore 48 (2 × 24) in the weather dataset and about 140 (7 × 20) in the stock dataset.
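The transaction lengths quoted for the modified BIDE algorithm follow directly from the concatenation. The sketch below only illustrates that arithmetic; the helper name and the placeholder symbols are hypothetical, and a fixed length of 20 stands in for the "about 20" values per stock sequence.

```python
def concatenate(transaction):
    """Flatten a multi-sequence transaction by appending each sequence
    to the end of the previous one."""
    flat = []
    for sequence in transaction:
        flat.extend(sequence)
    return flat

# Placeholder symbols; only the lengths matter here.
weather_day = [["M1"] * 24, ["H2"] * 24]             # 2 sequences of 24 hours
stock_month = [[f"R{i}"] * 20 for i in range(1, 8)]  # 7 sequences of about 20 values

print(len(concatenate(weather_day)), len(concatenate(stock_month)))
```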

Since the modified BIDE algorithm cannot efficiently mine frequent closed patterns in long transactions, it runs slower than the CMP and the modified Apriori algorithms. The CMP algorithm outperforms the modified Apriori algorithm because the latter generates a large number of candidates, especially when the minimum support threshold is low. In addition, the CMP algorithm adopts closure checking and pruning strategies to accelerate the mining process and avoid generating many unnecessary candidates. Therefore, it outperforms the modified Apriori and BIDE algorithms.

[Fig. 6.7 plots: runtime (s), on a logarithmic axis from 1 to 100000, versus minimum support (%); panel (a) spans 4% to 20% and panel (b) spans 10% to 20%; series: Apriori, CMP, BIDE.]

Fig. 6.8. Runtime versus maximum gap: (a) weather and (b) stock.

Fig. 6.8 presents the runtime versus the maximum number of gaps for the weather and stock datasets. As the maximum number of gaps increases, more candidates or patterns are generated and therefore the runtime of all three algorithms increases. Since the runtime of the CMP algorithm increases slowly as the maximum number of gaps increases, the CMP algorithm outperforms the modified Apriori and BIDE algorithms.
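To see why a larger maximum gap admits more patterns, consider the sketch below. It assumes a simplified setting with a single symbolic sequence per transaction and defines the gap as the number of symbols skipped between two consecutive matched symbols; both the definition and the function names are assumptions made for illustration and are not taken from the CMP algorithm itself.

```python
def occurs(pattern, seq, max_gap):
    """True if pattern occurs in seq with at most max_gap skipped symbols
    between consecutive matched symbols (assumed gap definition)."""
    def extend(p_idx, last_pos):
        if p_idx == len(pattern):
            return True
        if last_pos < 0:
            # The first symbol of the pattern may match anywhere.
            candidates = range(len(seq))
        else:
            # Later symbols must fall within the allowed gap window.
            candidates = range(last_pos + 1, min(len(seq), last_pos + 2 + max_gap))
        return any(seq[pos] == pattern[p_idx] and extend(p_idx + 1, pos)
                   for pos in candidates)
    return extend(0, -1)

def support(pattern, database, max_gap):
    """Fraction of transactions in which the pattern occurs."""
    hits = sum(occurs(pattern, seq, max_gap) for seq in database)
    return hits / len(database)

# Toy database; the support of (A, B) grows as the gap limit is relaxed.
db = [list("ABC"), list("AXBC"), list("AXXB")]
for gap in range(3):
    print(gap, support(["A", "B"], db, gap))
```

In this toy database the support of the same pattern can only grow as the limit is relaxed, so more patterns pass a fixed minimum support threshold.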

In Fig. 6.9, we demonstrate how the SAX representation may affect the performance on the real datasets. We apply the CMP algorithm to the weather and stock datasets with a different number of symbols, where the minimum support threshold is 10% in the weather dataset and 30% in the stock dataset. In both real datasets, the number of closed patterns decreases as the number of symbols increases. When the number of symbols increases, the runtime decreases for the weather dataset; however, it increases for the stock dataset.

[Fig. 6.8 plots: runtime (s), on a logarithmic axis, versus max gap (0 to 5); series: Apriori, CMP, BIDE.]

Fig. 6.9. Runtime and number of closed patterns versus number of symbols: (a) weather and (b) stock.

We observe that the runtime trend for the weather dataset differs from the one observed on the synthetic data, where the runtime of the CMP algorithm increases as the number of symbols increases. In the weather dataset, both temperature and relative humidity are recorded hourly, and the hourly fluctuations of these variables are small for most days. Thus, many values in a sequence are transformed into the same symbol in the SAX representation. As the dataset contains more repeated symbols, more closed patterns but fewer frequent 1-patterns may be mined. In other words, discretizing with three symbols or with more symbols makes little difference to the number of frequent 1-patterns generated. Thus, the runtime of the CMP algorithm depends on the number of closed patterns mined from the dataset. With a smaller number of symbols, more closed patterns can be mined, and hence the CMP algorithm requires more time in the mining process.
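The notion of a frequent 1-pattern used in this argument can be sketched as follows. The sketch assumes that a symbol is frequent when it appears in at least a fraction `min_sup` of the transactions, with the threshold count rounded up; the rounding rule and the function name are assumptions, and the toy data is illustrative only.

```python
from collections import Counter
from math import ceil

def frequent_1_patterns(database, min_sup):
    """Symbols that appear in at least min_sup (a fraction) of the transactions."""
    threshold = ceil(min_sup * len(database))  # assumed rounding rule
    counts = Counter()
    for transaction in database:
        counts.update(set(transaction))        # count each symbol once per transaction
    return {symbol for symbol, count in counts.items() if count >= threshold}

# Toy sequences with small fluctuations: most values share one symbol.
db = [
    ["M1", "M1", "MH1"],
    ["M1", "M1", "M1"],
    ["M1", "H1", "M1"],
    ["M1", "MH1", "M1"],
]
print(sorted(frequent_1_patterns(db, 0.5)))
```

When the fluctuations are small, adding symbols to the alphabet adds few distinct symbols to the data, so the set returned here changes little.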

In contrast to the weather data, the movements of the stock market indexes are collected on a daily basis and have larger fluctuations. Hence, fewer closed patterns but more frequent 1-patterns are generated in the stock dataset. As more frequent 1-patterns are generated, the number of ways of forming candidate patterns from a frequent pattern increases. This leads to more computational effort to check whether these candidate patterns are frequent and closed. Notice that the runtime in the case of three symbols is slightly greater than that in the cases of four and five symbols. In the case of three symbols, the effort of generating closed patterns dominates the total runtime. As the number of symbols increases, fewer closed patterns may be mined, and the effort that dominates the total runtime shifts to forming candidate patterns and checking whether the candidates formed are frequent and closed. However, in the cases of four and five symbols, the effort of forming and checking candidate patterns is not yet apparent, since the number of symbols increases only a little, while the effort of generating closed patterns decreases because fewer closed patterns can be found. Hence, the runtime in the cases of four and five symbols may be less than in the case of three symbols. As the effort of forming candidate patterns becomes apparent in the cases of six and seven symbols, the performance becomes worse than in the case of three symbols. Therefore, as the number of symbols increases, the runtime decreases for the weather dataset; however, it increases for the stock dataset.

[Fig. 6.9 plots: runtime (s) and number of patterns versus number of symbols (3 to 7), one panel per dataset.]

Mining the weather dataset yields some interesting patterns, such as [pattern garbled in the source; only the symbol "MH" survives]. The patterns show that when the temperature and the relative humidity are medium high or greater, the day is likely to have high cloudiness.

In Figs. 6.10-6.12, we project the patterns back to the original data. Fig. 6.10 shows a continuous period of high (or medium high) temperature and high (or medium high) relative humidity from the 1st to the 10th hours, and demonstrates that warm and moist air produces clouds during this period. However, when the relative humidity starts to drop in the 11th hour and remains at a lower level afterward, the cloudiness also drops dramatically despite the temperature remaining at the high level.

Fig. 6.10. Example I: projecting the pattern back to the raw time-series sequence.

Fig. 6.11. Example II: projecting the pattern back to the raw time-series sequence.

In Fig. 6.11, the temperature remains at the medium high level and the relative humidity remains at the high level, and hence high cloudiness can be observed throughout the day. In Fig. 6.12, it is mostly cloudy in the early morning, when the temperature and relative humidity are high (or medium high) through the first eight hours. However, the relative humidity decreases to a lower level from late morning to the afternoon while the temperature remains at a high level. Hence, a trend of decreasing cloudiness may be observed during this period. Nevertheless, from the 20th to the 24th hour, the cloudiness starts to increase as the relative humidity starts to increase and the temperature remains at a medium high level.

[Example plots: temperature, relative humidity, and cloudiness versus hours; the pattern symbols H₂ and MH₁ appear as annotations.]

Fig. 6.12. Example III: projecting the pattern back to the raw time-series sequence.

In summary, the CMP algorithm is more efficient and scalable than the modified Apriori and BIDE algorithms. It requires much less execution time, especially when the minimum support threshold is low. This is because the CMP algorithm mines closed patterns with a projected database together with closure checking and pruning strategies, which avoids generating many unnecessary candidates. Therefore, the experimental results show that the CMP algorithm outperforms the modified Apriori and BIDE algorithms.
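For readers unfamiliar with the closure property, the sketch below filters a set of frequent patterns down to the closed ones, using the standard definition that a pattern is closed when no proper super-pattern has the same support. It is a post-hoc filter over single-sequence patterns with an ordinary subsequence test, written only to illustrate the definition; it is not the closure checking that the CMP algorithm performs during mining, and all names are hypothetical.

```python
def is_subsequence(small, big):
    """True if small can be obtained from big by deleting symbols."""
    remaining = iter(big)
    return all(symbol in remaining for symbol in small)

def closed_patterns(supports):
    """Keep patterns that have no longer super-pattern with equal support.

    supports maps a pattern (tuple of symbols) to its support count.
    """
    closed = {}
    for pattern, sup in supports.items():
        absorbed = any(
            len(other) > len(pattern)
            and other_sup == sup
            and is_subsequence(pattern, other)
            for other, other_sup in supports.items()
        )
        if not absorbed:
            closed[pattern] = sup
    return closed

# Toy supports: (A,) is absorbed by (A, B), which has the same support.
toy = {("A",): 3, ("A", "B"): 3, ("B",): 4, ("A", "B", "C"): 2}
print(closed_patterns(toy))
```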

In the document: Closed Pattern Mining in Time-Series Databases (pages 93-99)