
Chapter 4 The Proposed Algorithms

4.2 Algorithm GIAMS-MED

As mentioned earlier, GIAMS-MED differs from GIAMS-IND only in the second process, which generates indirect associations. In this section we therefore describe only the design of this process, a procedure named IndirectAssociationGen-Med. The basic idea is to exploit a property of mediators, namely the support threshold that a qualified mediator must satisfy, to reduce the number of frequent itemsets considered as candidate mediators.

We first show that for an itemset X to be a qualified mediator, the support of X must be no less than a threshold σm, where σm = 2σf − σs.

Theorem 1. The support of a mediator M is no less than σm = 2σf − σs, i.e., sup(M) ≥ σm.

Proof: First, consider any three sets A, B, and C. From set theory, we have the following relation (see Figure 4-13):

C ⊇ (A ∩ C) ∪ (B ∩ C) (3)

Figure 4-13. Visualization of the concept revealed in (3).

Then from (3), by the inclusion-exclusion principle and the fact that A ∩ B ∩ C ⊆ A ∩ B, we can derive

|C| ≥ |(A ∩ C) ∪ (B ∩ C)|

= |A ∩ C| + |B ∩ C| − |A ∩ B ∩ C|

≥ |A ∩ C| + |B ∩ C| − |A ∩ B| (4)
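Inequality (4) can be checked mechanically on small sets. The following Python sketch is only an illustration: the helper names `check_inequality_4` and `random_subset`, the universe size, and the random test sets are assumptions made here and are not part of the proposed algorithms.

```python
import random


def check_inequality_4(a: set, b: set, c: set) -> bool:
    """Return True if |C| >= |A ∩ C| + |B ∩ C| - |A ∩ B| holds."""
    return len(c) >= len(a & c) + len(b & c) - len(a & b)


def random_subset(universe: range, rng: random.Random) -> set:
    """Pick each element of the universe independently with probability 1/2."""
    return {u for u in universe if rng.random() < 0.5}


rng = random.Random(0)
universe = range(12)
all_hold = all(
    check_inequality_4(
        random_subset(universe, rng),
        random_subset(universe, rng),
        random_subset(universe, rng),
    )
    for _ in range(1000)
)
print(all_hold)
```

Because (4) is a consequence of (3), the check should succeed for every triple of sets, not only for the sampled ones.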

Now, let us consider a qualified indirect association 〈{x, y} | M〉 (see Figure 4-14).

Figure 4-14. The relations between x, y and M.

Let TXM, TYM, and TM be the sets of transactions containing {x} ∪ M, {y} ∪ M, and M, respectively. Then, following the derivation of (4), we have

|TM| ≥ | TXM | + | TYM | − | TXM ∩ TYM | (5)

That is,

sup(M) ≥ sup({x}∪ M) + sup({y}∪ M) − sup({x, y}∪ M )

From Definition 1, sup({x} ∪ M) and sup({y} ∪ M) are both at least σf, and the support of the indirect itempair {x, y} is less than σs. Note also that sup({x, y} ∪ M) ≤ sup({x, y}) < σs. Substituting these bounds into the inequality above gives sup(M) ≥ σf + σf − σs = 2σf − σs. We thus conclude that if M is a mediator, then the support of M is no less than 2σf − σs.
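As a numeric illustration of Theorem 1, take σf = 0.012 and σs = 0.01, which gives σm = 0.014. The Python sketch below computes the threshold and filters a few candidate mediators. The function names and the support values are made up for illustration and do not come from the experiments.

```python
def mediator_threshold(sigma_f: float, sigma_s: float) -> float:
    """Theorem 1: a mediator M must satisfy sup(M) >= 2*sigma_f - sigma_s."""
    return 2 * sigma_f - sigma_s


def prune_mediators(supports: dict, sigma_f: float, sigma_s: float) -> dict:
    """Keep only the itemsets whose support reaches the mediator threshold."""
    sigma_m = mediator_threshold(sigma_f, sigma_s)
    return {m: s for m, s in supports.items() if s >= sigma_m}


# Hypothetical supports of three candidate mediators (made-up numbers).
candidates = {
    frozenset({"D"}): 0.030,
    frozenset({"G"}): 0.020,
    frozenset({"E"}): 0.013,  # frequent (>= sigma_f) but below sigma_m
}
kept = prune_mediators(candidates, sigma_f=0.012, sigma_s=0.01)
print(sorted("".join(sorted(m)) for m in kept))
```

In this example the itemset {E} is frequent with respect to σf but can never be a mediator, so it is dropped before any candidate rule is formed.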

The main deficiency of IndirectAssociationGen is that it generates many candidate indirect associations. We observe that the supports of most of their mediators are smaller than σm; in other words, these candidates cannot become indirect association rules. To reduce the number of candidate indirect associations, GIAMS-MED uses σm as an additional threshold, so that every candidate indirect association is generated from a mediator whose support is no less than σm. This avoids the cost of unnecessary computation.


Procedure Name: IndirectAssociationGen-Med

Input: d, σs, σf, σd, η^, Card-Stree.

Output: Set of indirect associations IA.

Steps:

1. Let σm = 2σf − σs;
2. Let L_1 be the set of all 1-itemsets in Card-Stree with support no less than σf;
3. Let M_1 be the set of all 1-itemsets in Card-Stree with support no less than σm; //generate mediators of length 1
4. k = 1; C_2 = join(L_1, L_1);
5. foreach itemset X ∈ C_2 and X.count < σs × η^ do //generate IIS
6.     insert X into IIS; //X is a candidate itempair
7. IA = ∅;
8. while (M_k ≠ ∅) do //generate indirect association rules
9.     foreach {a, b} ∈ IIS do
10.        foreach M ∈ M_k do
11.            if dep({a}, M) ≥ σd and dep({b}, M) ≥ σd then
12.                IA = IA ∪ {〈a, b | M〉};
13.    endfor
14.    C_k+1 = join(M_k, M_k); //generate next level candidate mediators
15.    foreach itemset X ∈ C_k+1 and X.count ≥ σm × η^ do
16.        insert X into M_k+1;
17.    k = k + 1;
18. endwhile

Figure 4-15. Description of procedure IndirectAssociationGen-Med in algorithm GIAMS-MED.
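The steps of Figure 4-15 can be sketched in Python as follows. This is a minimal sketch under several assumptions and not the implementation used in the thesis: supports are computed directly from a small list of transactions instead of being read from the Card-Stree, no decay is applied, the dependence measure `dep` is assumed to be the IS measure sup(aM)/sqrt(sup(a)·sup(M)), and the σf check on {a} ∪ M and {b} ∪ M is taken from the prose description rather than from the figure. All function names and the sample data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def dep(a, mediator, transactions):
    """Assumed dependence measure (IS): sup(aM) / sqrt(sup(a) * sup(M))."""
    denom = sqrt(support({a}, transactions) * support(mediator, transactions))
    return support(mediator | {a}, transactions) / denom if denom else 0.0


def indirect_association_gen_med(transactions, sigma_s, sigma_f, sigma_d):
    sigma_m = 2 * sigma_f - sigma_s                                   # step 1
    items = sorted(set().union(*transactions))
    l1 = [i for i in items if support({i}, transactions) >= sigma_f]  # step 2
    m_k = {frozenset({i}) for i in items
           if support({i}, transactions) >= sigma_m}                  # step 3
    iis = [(a, b) for a, b in combinations(l1, 2)
           if support({a, b}, transactions) < sigma_s]                # steps 4-6
    ia = []                                                           # step 7
    while m_k:                                                        # step 8
        for a, b in iis:
            for m in sorted(m_k, key=sorted):
                if a in m or b in m:
                    continue
                # sigma_f check taken from the prose; dep check from step 11
                if (support(m | {a}, transactions) >= sigma_f
                        and support(m | {b}, transactions) >= sigma_f
                        and dep(a, m, transactions) >= sigma_d
                        and dep(b, m, transactions) >= sigma_d):
                    ia.append((a, b, m))
        k = len(next(iter(m_k)))
        # step 14: join mediators of length k into candidates of length k + 1
        candidates = {x | y for x in m_k for y in m_k if len(x | y) == k + 1}
        m_k = {c for c in candidates
               if support(c, transactions) >= sigma_m}                # steps 15-16
    return ia


# Made-up transactions: A and B never co-occur, but both co-occur with D.
data = [frozenset(t) for t in
        ["AD", "AD", "AD", "BD", "BD", "BD", "ADG", "BDG", "C", "CE"]]
rules = indirect_association_gen_med(data, sigma_s=0.1, sigma_f=0.3, sigma_d=0.5)
print(rules)
```

With these made-up settings, σm = 0.5, so only D qualifies as a mediator of length 1, and the single itempair in IIS is (A, B); the sketch reports A and B as indirectly associated via D.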

The description of procedure IndirectAssociationGen-Med is shown in Figure 4-15.

Figure 4-16 and Figure 4-17 illustrate the process of IndirectAssociationGen-Med procedure.

First, the procedure generates all frequent 1-itemsets and forms candidate 2-itemsets from those whose support is no less than σf. Any 2-itemset whose count is less than the indirect itempair threshold (σs) is inserted into the candidate set IIS.

After generating all frequent 1-itemsets, the procedure also selects the 1-itemsets whose support is no less than σm as mediators of length 1, and then generates all possible mediators level by level, deriving the candidate k-mediators from the (k−1)-mediators.

Figure 4-16. An example of generating mediators and indirect item pairs.

Finally, after IIS and the mediators have been generated, a candidate indirect association rule is formed by joining an itempair in IIS with a mediator. The rule is output if it satisfies the dependence condition (σd) and the mediator support threshold (σf). For example, as Figure 4-17 shows, the itempair {A, B} in IIS and the mediator D form the candidate indirect association rule 〈A, B | D〉. If both {A, D} and {B, D} have supports no less than the mediator support threshold (σf) and dependences no less than σd, then we obtain the indirect association between A and B via D.

[Figure content: L1 holds items A, B, C, D, E, F, and G; IIS holds the itempair AB; the mediator arrays hold G and D (length 1) and DG (length 2).]

Figure 4-17. An illustration of generating indirect association rules from mediators and IIS.

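[Figure content: the itempair A, B from IIS is paired with each mediator, G and D from M₁ and DG from M₂, giving the candidates 〈A, B | G〉, 〈A, B | D〉, and 〈A, B | DG〉.]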

Chapter 5

Theoretical Analyses

In this chapter, we analyze some properties of the two proposed algorithms. First, we prove that the support error of any frequent itemset discovered by the two algorithms is within the user-specified error threshold ε. Then we provide a theoretical performance comparison between GIAMS-MED and GIAMS-IND in indirect association generation.

5.1 Support Error Bound Analysis

In this section, we will show that the pruning technique used in procedure Decay&Pruning always guarantees a bounded error within the user specified threshold.

To facilitate the discussion, we introduce some new notation. Let the true support of an itemset X, denoted Tsup(X), be the fraction of transactions seen so far that contain X, and let the estimated support of X, denoted Esup(X), be the support maintained by the algorithm. We will show that the difference between Tsup(X) and Esup(X) is no more than the support error threshold ε.

Figure 5-1. The description of maximal possible error and pruning threshold.

Theorem 2. Tsup(X) − Esup(X) ≤ ε.

Proof: Consider a generated frequent itemset X. Let sbid be the identifier of the starting block of the current window, xbid be the smallest identifier of a block in which itemset X appears and from which X has remained in the current Card-Stree, and cbid be the identifier of the current block. Since the count information of X within blocks xbid to cbid is maintained, the maximum counting error equals the part dropped in blocks sbid to xbid−1. This concept is illustrated in Figure 5-1. Let η_{xbid−1} denote the decayed accumulated number of transactions when processing block xbid−1. This value continues to decay while blocks xbid to cbid are processed. Then the difference between the true count of itemset X, Tcount(X), and the estimated count of itemset X, Ecount(X), is no more than ε × η_{xbid−1} × d^(cbid−xbid+1). Thus, we have

Tcount(X) − Ecount(X) ≤ ε × η_{xbid−1} × d^(cbid−xbid+1) (6)

Dividing (6) by the current decayed transaction size η_{cbid}, we obtain

Tsup(X) − Esup(X) ≤ ε × η_{xbid−1} × d^(cbid−xbid+1) / η_{cbid} (7)


Since the minimum and maximum values of xbid are 1 and cbid, respectively, it follows that the difference between Tsup(X) and Esup(X) is no more than the error threshold ε when we prune the itemsets whose count is less than ε × η_{xbid}.
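The bound can be illustrated numerically. The sketch below assumes that the decayed transaction size follows the recurrence η_b = d × η_{b−1} + s, where s is the stride, and that the window starts at block 1; both are assumptions made here for illustration, as are the function names. Under this recurrence η_{cbid} is never smaller than η_{xbid−1} × d^(cbid−xbid+1), so the ratio that multiplies ε stays at or below 1.

```python
def decayed_sizes(num_blocks: int, stride: int, d: float) -> list:
    """eta[b] for b = 0..num_blocks, assuming eta[b] = d * eta[b-1] + stride."""
    eta = [0.0]
    for _ in range(num_blocks):
        eta.append(d * eta[-1] + stride)
    return eta


def max_support_error(eta: list, xbid: int, cbid: int, d: float, eps: float) -> float:
    """Bound on Tsup(X) - Esup(X): eps * eta[xbid-1] * d**(cbid-xbid+1) / eta[cbid]."""
    return eps * eta[xbid - 1] * d ** (cbid - xbid + 1) / eta[cbid]


eps, d, stride, cbid = 0.001, 0.9, 10000, 8
eta = decayed_sizes(cbid, stride, d)
bounds = [max_support_error(eta, xbid, cbid, d, eps) for xbid in range(1, cbid + 1)]
print(max(bounds) <= eps)
```

The largest bound occurs for xbid = cbid, that is, for an itemset that entered the tree only in the current block.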

5.2 Performance Comparison

In this section we compare the performance of GIAMS-IND and GIAMS-MED from a theoretical viewpoint. Since the two algorithms differ only in the second process, indirect association generation, we focus on this process. It suffices to show how many candidate mediators can be pruned by IndirectAssociationGen-Med as compared with IndirectAssociationGen.

Assume that the number of frequent 2-itemsets is n. Then procedure IndirectAssociationGen requires C(n, 2) = n(n−1)/2 set joins. Recall that in IndirectAssociationGen-Med we divide the frequent 2-itemsets into two parts according to σm. One part contains those with support no less than σf but less than σm; let its size be n₁. The other consists of those whose support is no less than σm; let its size be n₂. So n = n₁ + n₂.

The cost of procedure IndirectAssociationGen-Med is

Therefore, the difference between IndirectAssociationGen and IndirectAssociationGen-Med will be

That is, the performance improvement of IndirectAssociationGen-Med over IndirectAssociationGen depends on the gap between σf and σm: the larger the gap, the greater the number n₁.
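The two cost expressions are not reproduced in the text above. As a hedged illustration only, assume that IndirectAssociationGen-Med joins just the n₂ itemsets whose support is no less than σm, so that its cost is C(n₂, 2); this assumption follows the join(M_k, M_k) step of Figure 4-15 but is not confirmed by the text. The saving is then C(n, 2) − C(n₂, 2) = n₁(n₁ + 2n₂ − 1)/2, which grows with n₁. The function name `join_counts` and the sample sizes are hypothetical.

```python
from math import comb


def join_counts(n1: int, n2: int) -> tuple:
    """Return (joins without sigma_m pruning, joins with it, saving)."""
    n = n1 + n2
    cost_ind = comb(n, 2)    # stated in the text: n(n-1)/2 joins
    cost_med = comb(n2, 2)   # assumption: only itemsets with support >= sigma_m are joined
    return cost_ind, cost_med, cost_ind - cost_med


print(join_counts(n1=60, n2=40))  # (4950, 780, 4170)
```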

Chapter 6 Experimental Results

To evaluate the performance and effectiveness of the proposed algorithms, GIAMS-IND and GIAMS-MED, we conducted comprehensive experiments on synthetic data as well as real data, considering three commonly used window models: the landmark, time-fading, and sliding window models. The evaluation covered three aspects: execution time, memory usage, and pattern accuracy.

All experiments were done on an AMD X3-425 (2.7 GHz) PC with 3GB of main memory, running the Windows XP operating system. All algorithms were implemented in Visual C++ 2008.

6.1 Evaluation on Synthetic Data

The synthetic dataset T5.I5.N0.1K.D1000K, generated using the program in [2], was used in our experiments. In each of the following sections, we specialize the generic data stream model to one of three common data stream models: the landmark window model, the time-fading window model, and the sliding window model.

6.1.1 Landmark window model

Mediator support threshold: We first examine the effect of varying mediator support thresholds. In this experiment, the mediator support condition σf was set from 0.01 to 0.018.

The other parameter settings are shown as follows.

ω s d σs σd

proportionally to the transaction size. The memory usage shows a similar trend: the lower σf is, the more memory is consumed.

Figure 6-1. Execution time and memory usage for running process 1, with varying transaction sizes and σf values.

Stride: In this experiment, we examine the effect of varying strides (block size). The stride value was set from 10000 to 80000. The other parameter settings are shown as follows.

ω d σs σf σd ε

∞ 1 0.01 0.01 0.1 0.001

The results are shown in Figure 6-2. We observe two noticeable phenomena. First, the execution time decreases as the stride increases. This is because a larger stride raises the chance that a block contains similar transactions; more transactions can then be merged together, which reduces the cost of subset generation. Second, a larger stride also helps reduce the memory usage, because with a smaller stride fewer itemsets fall below the pruning threshold. Finally, both the execution time and the memory usage increase as more transactions are processed.

T5.I5.N0.1K.D1000K

Figure 6-2. Execution time and memory usage for process 1, with varying transaction sizes and strides.

Accuracy: Because our algorithms use a pruning technique to reduce the memory usage, errors may occur in the maintained frequent itemsets and the discovered indirect associations.

First, we check the difference between the true support and estimated support, which is measured by the following formula called ASE (Average Support Error):

The stride value was set from 10000 to 80000 and the mediator support condition σf from 0.001 to 0.018.

The other parameters are shown as follows.

ω d σs σd ε

∞ 1 0.01 0.1 0.001

The results are depicted in Figure 6-3. All ASEs are zero, which is indeed less than the user-specified error ε = 0.001. This is consistent with our derivation in Theorem 2.

[Plot: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold, 0.01 to 0.018.]

Figure 6-3. Average support error of generated frequent itemsets.

The accuracy of discovered indirect association rules was measured by inspecting how many rules are missed, i.e., recall, which is defined as follows:

where IA_true denotes the set of true indirect associations. Figure 6-4 shows the results. All recalls are 100%.
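The ASE and recall formulas are not reproduced above. The Python sketch below shows one common way to compute such measures and is an assumption, not the thesis's definition: ASE is taken as the mean absolute difference between true and estimated supports over the generated frequent itemsets, and recall as the fraction of true indirect associations that were discovered. The function names and sample values are hypothetical.

```python
def average_support_error(true_sup: dict, est_sup: dict) -> float:
    """Mean of |Tsup(X) - Esup(X)| over the itemsets in est_sup (assumed form)."""
    if not est_sup:
        return 0.0
    return sum(abs(true_sup[x] - est_sup[x]) for x in est_sup) / len(est_sup)


def recall(found: set, true_rules: set) -> float:
    """Fraction of the true indirect associations that were discovered."""
    if not true_rules:
        return 1.0
    return len(found & true_rules) / len(true_rules)


# Made-up supports and rules, for illustration only.
true_sup = {"AB": 0.020, "CD": 0.015}
est_sup = {"AB": 0.020, "CD": 0.014}
ase = average_support_error(true_sup, est_sup)

ia_true = {("A", "B", "D"), ("A", "C", "G"), ("B", "E", "D"), ("C", "F", "G")}
ia_found = {("A", "B", "D"), ("A", "C", "G")}
print(recall(ia_found, ia_true))  # 0.5
```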

[Plot: x-axis: mediator support threshold, 0.01 to 0.018.]

Figure 6-4. Recall of discovered indirect associations with different strides (block sizes) and σf values.

Performance of process 2 for rule generation: In this experiment, we compare the performance of the two algorithms in implementing the process for rule generation. The parameters are shown as follows.

ω s d σs σd ε

∞ 10000 1 0.01 0.1 0.001

The results are presented in Figure 6-5. First, let us look at the memory usage. There is no significant difference; GIAMS-IND and GIAMS-MED consume approximately the same amount. Next we examine the execution time. Clearly, GIAMS-MED is much faster than GIAMS-IND. The reason is that, as shown in Figure 6-6, GIAMS-IND generates many more candidate rules than GIAMS-MED.

[Plot: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold, 0.01 to 0.018; y-axes: Time (sec), 0 to 3, and Memory (MB), 0 to 25; legend: Mem. GIAMS-IND, Mem. GIAMS-MED, Time GIAMS-IND.]

Figure 6-5. Execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.

[Plot: T5.I5.D1000K; x-axis: mediator support threshold, 0.01 to 0.018; y-axis: number of candidate indirect association rules, 0 to 200000; series: GIAMS-IND, GIAMS-MED.]

Figure 6-6. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.

6.1.2 Time-fading window model

Decay rate: We first compare the performance of our algorithms executed in this model with varying decay rates. The parameters were set as follows:

ω s d σs σf σd ε

∞ 1 0.7-0.9 0.01 0.012 0.1 0.001

As the experimental results in Figure 6-7 show, the memory usage increases as the decay rate gets larger. This is because the support count of an itemset then declines more slowly, so the itemset stays longer in the search tree. The opposite holds for the execution time. When the decay rate is smaller, an itemset becomes outdated more quickly, which means that more itemsets are added into and deleted from the search tree within a shorter period.

Figure 6-7. Execution time and memory usage for running process 1, with varying transaction sizes and decay rates

Mediator support threshold: In this experiment, we changed σf (mediator support threshold) from 0.01 to 0.018. The other parameters were set as follows.

ω s d σs σd ε

∞ 1 0.9 0.01 0.1 0.001

As the results in Figure 6-8 show, the execution time and memory usage increase proportionally to the transaction size, but the effect of varying the minimum support σf is not significant. Compared with the other data stream models, the time-fading window model spends more time on transaction insertion. The reason is that the time-fading window model processes one transaction at a time.

Figure 6-8. Execution time and memory usage for process 1, with varying transaction sizes and σf values.

Accuracy: In this experiment we compared the ratio of average support error with respect to different decay rates. The parameters were set as follows.

ω s d σs σf σd ε

∞ 1 0.9-0.7 0.01 0.01-0.018 0.1 0.001

As shown in Figure 6-9, higher decay rates lead to larger errors. The mediator support threshold also affects the error, but its influence is far smaller than that of the decay rate. We also examined the recall of the discovered indirect associations. As shown in Figure 6-10, almost all recalls are 100%: because we keep all 2-itemsets, most of the indirect association rules can be discovered successfully.

[Plot: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold, 0.01 to 0.018; y-axis: ASE (×10⁻⁷), 0 to 4.5; series: d = 0.9, d = 0.8, d = 0.7.]

Figure 6-9. Average support error of generated frequent itemsets.

[Plot: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold, 0.01 to 0.018; y-axis: Recall (%), 0 to 100; series: d = 0.9, d = 0.8, d = 0.7.]

Figure 6-10. Recall of discovered indirect associations with different decay rates and σf values.

Performance of rule generation: We varied σf from 0.01 to 0.018. The other parameters are shown as follows:

ω s d σs σd ε

∞ 1 0.9 0.01 0.1 0.001

As shown in Figure 6-11, GIAMS-MED is much faster than GIAMS-IND. The reason is that, as shown in Figure 6-12, the number of candidate rules generated by GIAMS-MED is smaller than that generated by GIAMS-IND. Both methods consume a similar amount of memory because most of the memory is used for maintaining the Card-Stree.

[Plot: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold, 0.01 to 0.018; y-axes: Time (sec), 0.1 to 100, and Memory (MB), 9.6 to 10.6; legend: Mem. GIAMS-IND, Mem. GIAMS-MED, Time GIAMS-IND.]

Figure 6-11. Execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.

[Plot: T5.I5.D1000K; x-axis: mediator support threshold, 0.01 to 0.018; y-axis: number of candidate rules, 0 to 1200000; series: GIAMS-IND, GIAMS-MED.]

Figure 6-12. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.

6.1.3 Sliding window model

Stride: We first evaluated the effect of varying strides and observed the difference between both algorithms. The parameters are shown as follows.

ω s d σs σf σd ε

80000 10000-80000 1 0.01 0.012 0.1 0

From Figure 6-13 we can see that larger strides result in less execution time because fewer blocks have to be processed. A peculiar phenomenon is that an unusual peak occurs when the transaction size is 160K. We conjecture that this is because the transactions in that case are longer than in the other cases. The difference in memory usage with respect to varying s is not significant, since the memory usage depends more on the window size.

T5.I5.N0.1K.D1000K

Figure 6-13. Execution time and memory usage for running process 1, with varying transaction sizes and strides.

Window size: This experiment evaluates the effect of the window size, which was varied from 10000 to 80000. The other parameters are shown as follows.

s d σs σf σd ε

10000 1 0.01 0.012 0.1 0

Intuitively, the more information we would like to observe, the more memory is required. Thus, as shown in Figure 6-14, a larger window size leads to larger memory usage. The effect of varying the window size on the execution time is less regular. In general, the larger the window is, the more execution time is required. However, the case ω = s = 10000 does not conform to this trend, because the pruning scheme cannot be applied when s = ω.

T5.I5.N0.1K.D1000K

Figure 6-14. Execution time and memory usage for running process 1, with varying transaction sizes and window sizes.

Performance of rule generation: We varied σf from 0.01 to 0.018. The other parameters are shown as follows.

ω s d σs σd ε

80000 10000 1 0.01 0.1 0.001

As shown in Figure 6-15, GIAMS-MED performs better than GIAMS-IND in all cases, and its curve is similar to that in Figure 6-16. This once again shows that most of the execution time is spent on joining and inspecting candidate rules. In addition, the execution times of both algorithms decrease as σf increases. This is because when σf is larger, fewer frequent itemsets are generated, and so fewer candidate rules are produced.

[Plot: T5.I5.N0.1K.D1000K; x-axis: mediator support threshold, 0.01 to 0.018; y-axes: Time (sec), 0 to 0.25, and Memory (MB), 0 to 12; legend: Mem. GIAMS-IND, Mem. GIAMS-MED, Time GIAMS-IND.]

Figure 6-15. Execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.

[Plot: T5.I5.D1000K; x-axis: 0.01 to 0.018; y-axis: number of candidate MSS-pairs, 0 to 200000; series: GIAMS-IND, GIAMS-MED.]

Figure 6-16. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.

6.2 Evaluation on Real Data

In this section, we present the experimental results on a real dataset constructed from the web log of news pages on msn.com for the entire day of September 28, 1999. A more detailed description of this dataset can be found in [4].

6.2.1 Landmark window model

Mediator support threshold: We increased σf from 0.01 to 0.018 and the other parameters are shown as follows:

ω s d σs σd ε

∞ 10000 1 0.01 0.1 0.001

The results are depicted in Figure 6-17. Since the average length of transactions in this real dataset is shorter than that in the synthetic data, the insertion time is shorter than for the synthetic data. Similar to the experimental results for synthetic data, larger mediator support thresholds lead to faster execution and lower memory usage, though the memory usage is smaller overall.

msnbc

Figure 6-17. Execution time and memory usage for running process 1, with varying transaction sizes and σf values.

Stride: We varied the stride from 10000 to 80000. The other parameters are shown as follows:

ω d σs σf σd ε

80000 1 0.01 0.012 0.1 0.001

The results are shown in Figure 6-18. The stride is a critical factor in the effectiveness of the pruning phase, as revealed in (7). A larger stride makes itemsets easier to prune, so the execution time and memory usage are larger when the stride is smaller.

msnbc

Figure 6-18. Execution time and memory usage for process 1, with varying transaction sizes and strides (Block Size).

Accuracy: The parameters in this experiment are shown as follows.

ω s d σs σf σd ε

∞ 10000-80000 1 0.01 0.01-0.018 0.1 0.001

The results in Figure 6-19 show that the ASEs are zero in all cases. This is because almost all transactions in this dataset contain fewer than five items, so our model keeps most of the itemsets and never prunes them. The recalls in Figure 6-20 are also 100% in all cases, showing that our algorithms achieve equally good results on real data in this experiment.

[Plot: msnbc; x-axis: mediator support threshold, 0.01 to 0.018.]

Figure 6-19. Average support error of generated frequent itemsets.

[Plot: x-axis: mediator support threshold, 0.01 to 0.018.]

Figure 6-20. Recalls of discovered indirect associations with different strides and σf values.

Performance of rule generation: Next we observe the performance of the rule generation procedure. The parameters in this experiment are shown as follows.

ω s d σs σf σd ε

∞ 10000 1 0.01 0.01-0.018 0.1 0.001

As shown in Figure 6-21, GIAMS-MED is faster than GIAMS-IND, but the gap between them is smaller than that observed on the synthetic data because the difference in the number of candidate associations, as shown in Figure 6-22, is smaller.

[Plot: msnbc; x-axis: mediator support threshold, 0.01 to 0.018; y-axes: Time (sec), 0 to 0.035, and Memory (MB), 0 to 7; legend: Mem. GIAMS-IND, Mem. GIAMS-MED, Time GIAMS-IND.]

Figure 6-21. Execution time and memory usage comparison for algorithms GIAMS-IND and GIAMS-MED, running process 2 with varying σf values.

[Plot: msnbc; x-axis: mediator support threshold, 0.01 to 0.018; y-axis: number of candidate rules, 0 to 1400; series: GIAMS-IND, GIAMS-MED.]

Figure 6-22. The number of candidate rules generated by GIAMS-IND and GIAMS-MED with varying σf values.

6.2.2 Time-fading window model

Decay rate: The parameters in this experiment are shown as follows.

ω s d σs σf σd ε

∞ 1 0.7-0.9 0.01 0.012 0.1 0.001

It can be seen from the experimental results in Figure 6-23 that the execution time decreases as the decay rate increases. The performance is slower than in the other models since the cost of dividing all transactions into subsets is rather high. The memory usage in general decreases as the decay rate decreases.
