
Chapter 6 Performance Evaluation

6.3 Performance evaluation of the CFP algorithm

To evaluate the performance of the CFP algorithm, we compare it with a modified Apriori algorithm. For the purpose of comparison, we adapt the Apriori algorithm [53] to mine closed flexible patterns in a time-series database by slightly modifying its join step. Instead of combining two frequent k-patterns as described in [53], we join these patterns with all possible combinations of gap intervals to generate candidate (k+1)-patterns. Then, we scan the database once to count the support of each candidate (k+1)-pattern and check whether it is frequent. These steps are repeated until no more frequent patterns can be found. To compare with the modified Apriori algorithm, we derive the complete set of frequent patterns from the closed flexible patterns found by the CFP algorithm.
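The modified join step and the support-counting scan can be sketched for the first level (frequent symbols joined into candidate 2-patterns) as follows. This is a minimal illustration, not the implementation used in the experiments. It assumes that a gap interval is a pair (lo, hi) with 0 <= lo <= hi <= max_gap that bounds the number of elements skipped between two successive items. The function names and the toy database are hypothetical.

```python
from itertools import product


def gap_intervals(max_gap):
    """All intervals (lo, hi) with 0 <= lo <= hi <= max_gap (assumed form)."""
    return [(lo, hi) for lo in range(max_gap + 1) for hi in range(lo, max_gap + 1)]


def candidate_2_patterns(frequent_symbols, max_gap):
    """Join every ordered pair of frequent symbols under every gap interval."""
    return [(a, gap, b)
            for a, b in product(frequent_symbols, repeat=2)
            for gap in gap_intervals(max_gap)]


def contains(sequence, candidate):
    """True if `a` is followed by `b` with between lo and hi elements skipped."""
    a, (lo, hi), b = candidate
    for i, symbol in enumerate(sequence):
        # b may sit at positions i+1+lo .. i+1+hi
        if symbol == a and b in sequence[i + 1 + lo:i + 2 + hi]:
            return True
    return False


def frequent_2_patterns(database, frequent_symbols, max_gap, min_support):
    """Count the support of every candidate in one scan; keep the frequent ones."""
    candidates = candidate_2_patterns(frequent_symbols, max_gap)
    counts = {c: 0 for c in candidates}
    for sequence in database:
        for c in candidates:
            if contains(sequence, c):
                counts[c] += 1
    threshold = min_support * len(database)
    return {c: n for c, n in counts.items() if n >= threshold}


# Toy database (hypothetical, not the synthetic data of this section).
database = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
frequent = frequent_2_patterns(database, ["A", "B", "C"], max_gap=1, min_support=1.0)
print(frequent)  # {('A', (0, 1), 'C'): 3}
```

Even in this toy setting, 3 symbols and 3 gap intervals yield 27 candidates, each of which is checked against every sequence, which is the cost the comparison in this section attributes to the Apriori-style approach.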

6.3.1 Evaluations on synthetic data

In Fig. 6.13, we investigate the effect of varying the minimum support threshold from 2% to 12% on the runtime of the CFP and the modified Apriori algorithms. The result shows that as the minimum support threshold decreases, the runtime of both algorithms increases sharply, because more patterns can be mined with a smaller minimum support threshold. However, compared with the modified Apriori algorithm, the CFP algorithm achieves a significant performance improvement, especially when the minimum support threshold is low. One reason is that the CFP algorithm uses projected databases, so it can localize pattern extension to a small number of projected databases. Moreover, the two pruning strategies and the closure checking scheme eliminate many unnecessary patterns, which results in a major reduction in execution time. In contrast, the modified Apriori algorithm incurs higher computational costs because it generates a large number of candidates and needs multiple scans of the database. Overall, the CFP algorithm runs about 5-19 times faster than the modified Apriori algorithm.
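The idea of localizing pattern extension can be illustrated with a simplified occurrence-list projection. This sketch is not the data structure of the CFP algorithm, which this section does not spell out. It assumes that a projected database records, for each occurrence of the current pattern, the sequence id and the position of its last item, and that a gap interval (lo, hi) bounds the number of skipped elements. The names project, extend, and support and the toy data are hypothetical.

```python
def project(database, symbol):
    """Occurrences of a 1-pattern as (sequence id, position of its last item)."""
    return [(sid, pos)
            for sid, seq in enumerate(database)
            for pos, s in enumerate(seq) if s == symbol]


def extend(database, projected, symbol, gap):
    """Extend each recorded occurrence by `symbol` within the gap interval.

    Only the positions right after a recorded occurrence are inspected, so the
    full database is not rescanned for every extension.
    """
    lo, hi = gap
    extended = set()
    for sid, pos in projected:
        seq = database[sid]
        for nxt in range(pos + 1 + lo, min(pos + 2 + hi, len(seq))):
            if seq[nxt] == symbol:
                extended.add((sid, nxt))
    return sorted(extended)


def support(projected):
    """Number of distinct sequences that contain the pattern."""
    return len({sid for sid, _ in projected})


# Toy database (hypothetical).
database = [list("ABAC"), list("BCA"), list("AXC")]
occ_a = project(database, "A")
occ_ac = extend(database, occ_a, "C", gap=(0, 1))
print(occ_ac, support(occ_ac))  # [(0, 3), (2, 2)] 2
```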

Generally speaking, increasing the number of transactions may lead to a larger number of frequent patterns being generated. As shown in Fig. 6.14, the runtime of both algorithms increases linearly as the number of transactions grows from 10K to 100K.


Fig. 6.13. Runtime versus minimum support (CFP).

Fig. 6.14. Runtime versus number of transactions (CFP).

Fig. 6.15. Runtime versus transaction length (CFP).

Fig. 6.16. Runtime versus maximum gap (CFP).

Fig. 6.15 illustrates the runtime versus the transaction length for both algorithms, where the transaction length varies from 6 to 10. As the transaction length increases, the runtime of both algorithms increases. The reason is that longer flexible patterns may be discovered from longer transactions and thus require more effort in the join and support-counting steps. Nevertheless, the CFP algorithm outperforms the modified Apriori algorithm in all cases.

[Plots for Figs. 6.13 to 6.16: runtime (s) of the Apriori and CFP algorithms versus minimum support (%), number of transactions (K), transaction length, and max gap, respectively.]

In Fig. 6.16, we evaluate the performance of the CFP and the modified Apriori algorithms with maximum gap thresholds from 0 to 5. We observe that the runtime of both algorithms increases as the maximum gap threshold increases, because the number of possible combinations of gap intervals between two successive items in a pattern also increases. This results in more frequent patterns at each level, and hence more execution time is required. However, the modified Apriori algorithm spends much of its time generating a large number of candidates and scanning the database to count their supports. In contrast, the CFP algorithm employs projected databases and saves a lot of time on support counting. Therefore, the CFP algorithm is more efficient and scalable than the modified Apriori algorithm.
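As a rough illustration of this growth, assume (the section does not state the exact definition) that a gap interval is any pair [l, u] of integers with 0 <= l <= u <= G, where G is the maximum gap threshold. The symbols N, G, l, u, and k below are introduced only for this illustration. The number of intervals available between two successive items is then

```latex
\[
  N(G) \;=\; \sum_{l=0}^{G} (G - l + 1) \;=\; \frac{(G+1)(G+2)}{2},
  \qquad
  N(0), \dots, N(5) \;=\; 1,\ 3,\ 6,\ 10,\ 15,\ 21 .
\]
```

Under this assumption, a pattern with k items has k-1 gaps, so a fixed sequence of k symbols admits up to N(G)^{k-1} gap assignments, which is why the number of candidates at each level rises quickly with the maximum gap.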

Fig. 6.17. Runtime versus number of symbols (CFP).

We further examine the effect of the number of distinct symbols on the performance of both algorithms. As shown in Fig. 6.17, both curves slope downward as the number of distinct symbols increases from 3 to 7. This is because a larger number of distinct symbols may reduce the chances of forming frequent patterns. Nevertheless, the CFP algorithm performs better than the modified Apriori algorithm.

[Plot for Fig. 6.17: runtime (s) of the Apriori and CFP algorithms versus number of symbols.]

In summary, since the CFP algorithm employs the projected database, two pruning strategies, and the closure checking scheme to mine closed flexible patterns in a DFS manner, it is more efficient and scalable than the modified Apriori algorithm in all cases, especially when the minimum support threshold is low or the average length of transactions is large. The experimental results show that the CFP algorithm outperforms the modified Apriori algorithm by an order of magnitude.

6.3.2 Evaluations on real data

To validate the CFP algorithm on a real dataset, we perform experiments on the house price index (HPI) data of the United States. The dataset is retrieved from Standard & Poor’s official website [55] and covers the period from July 2005 to December 2007. This period can be characterized as a turning point in the U.S. economy, generally attributed to the strong housing demand in 2005 and the subprime mortgage crisis in 2007.

The real data consist of the home price indices of 20 metropolitan regions in the United States, including Boston, New York, Washington, Charlotte, Atlanta, Tampa, Miami, Detroit, Cleveland, Chicago, Minneapolis, Dallas, Denver, Phoenix, Las Vegas, Seattle, Portland, San Francisco, Los Angeles, and San Diego. The indices are recorded on a monthly basis. Hence, the database has 20 transactions, each of which contains 30 elements, one for each of the 30 months from July 2005 to December 2007. In the transformation phase, each element of a transaction is transformed into one of the distinct symbols HP, LP, FT, LN, and HN, which denote high positive, low positive, flat, low negative, and high negative rates of change in home-price appreciation, respectively.
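The transformation phase can be sketched as follows. This is a minimal illustration under stated assumptions: the rate of change is taken to be the month-over-month percentage change, and the cut-offs `flat` and `high` are hypothetical values chosen for the example, not the thresholds used in this work. The index values are made up.

```python
def rates_of_change(index_values):
    """Month-over-month percentage change of an index series (assumed definition)."""
    return [100.0 * (cur - prev) / prev
            for prev, cur in zip(index_values, index_values[1:])]


def symbolize(rate, flat=0.1, high=1.0):
    """Map one rate of change (in %) to HP, LP, FT, LN, or HN.

    `flat` and `high` are hypothetical cut-offs used only for illustration.
    """
    if rate > high:
        return "HP"   # high positive
    if rate > flat:
        return "LP"   # low positive
    if rate >= -flat:
        return "FT"   # flat
    if rate >= -high:
        return "LN"   # low negative
    return "HN"       # high negative


# Made-up index values for one region (not taken from the HPI dataset).
index_values = [100.0, 100.05, 100.6, 102.0, 101.5, 99.0]
transaction = [symbolize(r) for r in rates_of_change(index_values)]
print(transaction)  # ['FT', 'LP', 'HP', 'LN', 'HN']
```

Each region then contributes one transaction of symbols, which is the form in which the data are mined.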

Fig. 6.18 shows the runtime versus the minimum support threshold, which varies from 20% to 70%. The runtime of the modified Apriori algorithm grows sharply as the minimum support threshold decreases, since it requires a major effort to generate a huge number of candidates and scan the database multiple times. In contrast, the runtime of the CFP algorithm increases moderately even when the minimum support threshold is low.


Fig. 6.18. Runtime versus minimum support for HPI data.

Fig. 6.19. Runtime versus maximum gap for HPI data.

Fig. 6.19 depicts the result of mining the real data with respect to different maximum gap values, where the minimum support threshold is set to 90%. The runtime of both algorithms rises slowly at low maximum gap thresholds because there are only a limited number of patterns. However, as the maximum gap increases, the performance difference between the modified Apriori algorithm and the CFP algorithm becomes noticeable.

By mining the real dataset, we find some interesting patterns and project them back onto the original data to show the trend in a certain time period. For instance, one of the patterns is {LN LN HN HN HN}, as shown in Fig. 6.20. The pattern implies a continuing downward trend in the rate of change of house prices in the wake of the subprime mortgage crisis in 2007. We obtain this pattern in West Coast cities such as Las Vegas, Phoenix, Los Angeles, and San Diego, which indicates that these cities all show the same movement in house prices as the subprime mortgage crisis breaks out.

The movement demonstrates that home prices start to falter in spring 2007 and continue at a higher negative rate in summer and winter 2007. Moreover, the pattern begins in February 2007 in Las Vegas, whereas it lags by three months in Phoenix. In Los Angeles and San Diego, the pattern appears at the same time, in June 2007, which is a month later than in Phoenix. In other words, we observe an earlier fall in house prices caused by the subprime mortgage crisis in Las Vegas and Phoenix than in the other West Coast cities.

[Plots for Figs. 6.18 and 6.19: runtime (s, logarithmic scale) of the Apriori and CFP algorithms versus minimum support (%) and max gap, respectively.]

Fig. 6.20. HPI rate of change for West Coast cities.

Another pattern, {FT LP LP LP LP FT LN LN LN LN}, is found in the southern cities Atlanta and Dallas, as shown in Fig. 6.21. Both cities show the same trend in the rate of house price movement. The pattern indicates a small rise in house price appreciation from March 2006 to August 2006, before the subprime mortgage crisis hits, followed by several consecutive monthly falls in the rate of change of house prices starting in September 2006.

The other pattern, {FT FT LN [0,1] HN [0,1] HN HN}, is discovered in the Midwest cities Minneapolis, Chicago, and Cleveland, as shown in Fig. 6.22. We plot the pattern separately for these three cities to give a clear view of the flexible intervals. We observe that all three cities have the same movement {FT FT LN} from June to August 2007, and then, within a gap interval of 0-1, the HPI rate of change drops to high negative. Furthermore, the second gap interval [0, 1], between HN and HN, indicates that there is less fluctuation in house price movement within this interval and that the change remains at a high negative rate.

[Plot for Fig. 6.20 (West): HPI rate of change by month, Feb. 2007 to Oct. 2007, for Las Vegas, Phoenix, Los Angeles, and San Diego, with the LN and HN symbols of the pattern labeled on each curve.]
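Projecting a flexible pattern back onto a symbol sequence amounts to finding positions that satisfy every gap interval. The sketch below is one way to do this with simple backtracking. It is an illustration, not the procedure used in this work. It assumes that a gap interval (lo, hi) bounds the number of elements skipped between two successive items, with (0, 0) meaning adjacency. The function names and the three sequences are hypothetical and are not the actual HPI data.

```python
def match_from(seq, pos, symbols, gaps):
    """Match symbols[0] at seq[pos] and the remaining symbols under `gaps`.

    Returns the list of matched positions, or None if no match exists.
    """
    if pos >= len(seq) or seq[pos] != symbols[0]:
        return None
    if len(symbols) == 1:
        return [pos]
    lo, hi = gaps[0]
    for skipped in range(lo, hi + 1):
        rest = match_from(seq, pos + 1 + skipped, symbols[1:], gaps[1:])
        if rest is not None:
            return [pos] + rest
    return None


def find_pattern(seq, symbols, gaps):
    """Positions of the first occurrence of the flexible pattern, or None."""
    for start in range(len(seq)):
        hit = match_from(seq, start, symbols, gaps)
        if hit is not None:
            return hit
    return None


# The pattern {FT FT LN [0,1] HN [0,1] HN HN}; unmarked gaps are taken as (0, 0).
symbols = ["FT", "FT", "LN", "HN", "HN", "HN"]
gaps = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 0)]

# Hypothetical monthly symbol sequences for three regions.
region_a = ["FT", "FT", "LN", "HN", "HN", "HN", "LN"]
region_b = ["FT", "FT", "LN", "LN", "HN", "HN", "HN"]
region_c = ["FT", "FT", "LN", "HN", "FT", "HN", "HN"]

print(find_pattern(region_a, symbols, gaps))  # [0, 1, 2, 3, 4, 5]
print(find_pattern(region_b, symbols, gaps))  # [0, 1, 2, 4, 5, 6]
print(find_pattern(region_c, symbols, gaps))  # [0, 1, 2, 3, 5, 6]
```

The first matched position gives the month in which the pattern begins in a region, so comparing it across regions exposes leads and lags of the kind discussed for the West Coast cities.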

Fig. 6.21. HPI rate of change for South cities.

Fig. 6.22. HPI rate of change for Midwest cities.

In summary, the relative performance of the CFP and the modified Apriori algorithms on the real dataset is quite similar to that on the synthetic datasets. Since the CFP algorithm takes advantage of two pruning strategies and the closure checking scheme to reduce the search space and thus speed up mining, it outperforms the modified Apriori algorithm.

[Plot for Fig. 6.21 (South): HPI rate of change by month, Mar. 2006 to Dec. 2006, for Atlanta and Dallas, with the FT, LP, and LN symbols of the pattern labeled on each curve.]

[Plot for Fig. 6.22 (Midwest): HPI rate of change by month, Jun. 2007 to Dec. 2007, in separate panels for Minneapolis, Chicago, and Cleveland, with the FT, LN, and HN symbols of the pattern labeled in each panel.]