
Chapter 4 Incremental SPAM (IncSPAM): Mining Sequential Patterns

4.8 Weight of Customer-Sequence

Fig 4-14. The lexicographic sequence tree after the sixth transaction comes in



Fig 4-15. The transactions of a customer with no recent records in a data stream

In the IncSPAM algorithm, each customer maintains a sliding window that keeps the latest N transactions, and the system mines sequential patterns from all customer-sequences. However, some customers may have no recent transactions in the data stream. Customer-sequences that contain only out-of-date transactions cause a false-positive problem in our algorithm: the supports of some patterns generated by the system are over-counted.

Figure 4-15 shows an example of such out-of-date transactions in a data stream. A customer-sequence consisting of out-of-date transactions is less important than the other customer-sequences.

A concept of weight can be used to judge the importance of customers. Each customer-sequence c has its own weight wc, 0 ≤ wc ≤ 1. The weight wc decays whenever an incoming transaction does not belong to c; when a transaction of c arrives, wc is reset to one. A decay function computes the weights of the customer-sequences each time a new transaction comes in:

wc = 1 × d^p

The decay-rate d, defined by the user, determines how quickly a customer-sequence decays. The decay-period p of a customer-sequence c is the number of transactions between the incoming transaction and the latest transaction of c:

p = (TID of the incoming transaction) − (TID of the latest transaction of c)
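The formula above can be sketched as a small helper. This is an illustrative function with made-up names, not code from the thesis:

```python
# Weight of a customer-sequence c when a new transaction arrives:
# w_c = 1 * d ** p, where p is the decay-period (difference of TIDs).

def customer_weight(incoming_tid, latest_tid, d=0.9):
    p = incoming_tid - latest_tid   # decay-period of customer-sequence c
    return 1.0 * d ** p

# A customer whose latest transaction has TID = 6, with incoming TID = 7:
print(customer_weight(7, 6))   # 0.9
```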

In our proposed algorithm, the concept of the decay-rate d is adopted from [26]. d is defined as:

d = b^(−1/h)    (b > 1, h ≥ 1, 1/b ≤ d < 1)

Decay-base b: the amount of weight reduction per decay-unit

Decay-base-life h: the number of decay-units that reduces the current weight to 1/b of its value
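The decay-rate definition can be written as a short sketch. The helper name is hypothetical; the relationship d^h = 1/b follows directly from the definition of the decay-base-life:

```python
# Decay-rate from decay-base b and decay-base-life h: d = b ** (-1/h),
# with b > 1 and h >= 1, which guarantees 1/b <= d < 1.

def decay_rate(b, h):
    assert b > 1 and h >= 1
    return b ** (-1.0 / h)

# After h decay-units, a weight of 1 has dropped to d ** h = 1/b:
d = decay_rate(b=2, h=10)
print(round(d ** 10, 10))   # 0.5
```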

Figure 4-16 shows an example of calculating the weights of customers. Assume the incoming transaction has TID = 7 and the decay-rate is d = 0.9. The latest transaction of each customer is indicated by an arrow. Let us take customer #1 as an example.

The latest transaction of customer #1 is the transaction with TID = 6, so the decay-period is p = 7 − 6 = 1 and the weight of customer #1 is d^1 = 0.9.

Fig 4-16. An example of calculating the weights of customers

In IncSPAM, we do not need to calculate the decay-period when a new transaction comes in. The weight of the customer that the incoming transaction belongs to is set to one; the weights of all other customers simply decay by the decay-rate d. Figure 4-17 shows an example when a new transaction with TID = 8 comes in: the weight of customer #2 is set to 1 and the other weights decay by a factor of 0.9.

Fig 4-17. When a new transaction with TID = 8 comes in
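The constant-time update just described can be sketched as follows. Variable names are illustrative, not thesis code:

```python
# One pass over the weights per incoming transaction: the customer of the
# incoming transaction is reset to 1, every other customer decays by d.

def update_weights(weights, incoming_customer, d=0.9):
    for c in weights:
        weights[c] = 1.0 if c == incoming_customer else weights[c] * d
    return weights

# Three customers; a new transaction of customer #2 arrives:
w = {1: 0.9, 2: 0.81, 3: 1.0}
update_weights(w, incoming_customer=2, d=0.9)
# customer #2 is reset to 1; the others decay: 0.9*0.9 and 1.0*0.9
```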

Now the support of a sequence ρ is no longer just the number of non-zero positions in the ρ-idx. Instead, the support of ρ is the summation of the weights of the customer-sequences that contain ρ. We take the same example as in Section 4.7: the CBASWs and the lexicographic sequence tree of Figure 4-11 become those of Figure 4-18, assuming a decay-rate of 0.9. In Figure 4-18 we can see that the support of the tree node <(b)> is 1.9, not 2.

Fig 4-18. The lexicographic sequence tree when the third transaction comes in (with the concept of customer weight)
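The weighted support described above can be sketched with plain lists standing in for the ρ-idx and the weight table (illustrative only, not the thesis data structures):

```python
# Weighted support of a sequence rho: sum the weights of the
# customer-sequences whose position in the rho-idx is non-zero.

def weighted_support(rho_idx, weights):
    return sum(w for pos, w in zip(rho_idx, weights) if pos != 0)

# <(b)> occurs in customers #1 and #2, whose weights are 1.0 and 0.9:
print(weighted_support([1, 1, 0], [1.0, 0.9, 0.81]))   # 1.9, not 2
```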

Updating the support of an existing node is easier than counting the support of a new candidate. Whenever a new transaction comes in, only one customer-sequence is affected; the weights of all other customer-sequences simply decay by the decay-rate, so we do not have to sum up all the weights one by one. The cases of updating support are listed below:

(Case 1) The incoming transaction belongs to a new customer: for a sequence ρ, the original support decays by the decay-rate. Then we check whether ρ exists in the new customer-sequence; if it does, the decayed support is incremented by one, and otherwise it stays unchanged. Figure 4-19 shows an example.


Fig 4-19. An example of support updating in IncSPAM (Case 1)

(Case 2) The incoming transaction belongs to an existing customer: for a sequence ρ, the previous position value of this customer in the ρ-idx has to be checked. Assume the modified customer-sequence is c. If the previous ρ-idx[c] is zero, the original support decays by the decay-rate and is then incremented by one or zero depending on whether ρ exists in c. If the previous ρ-idx[c] is not zero, the previous weight of customer c is first subtracted from the support, the result decays by the decay-rate, and finally the support is incremented by one or zero by the same criterion. Figure 4-20 shows an example.


Fig 4-20. An example of support updating in Incremental SPAM (Case 2)

The weights of the customer-sequences do not change the overall process of maintaining the lexicographic sequence tree; IncSPAM needs to consider weights only when counting support at each tree node. The function UpdateSupport of Figure 4-8 and Figure 4-9 is changed to the version in Figure 4-21.

UpdateSupport (c, n)
 1: if customer-sequence c is new then
 2:     decay the support of n;
 3:     if the sequence of n is in c then
 4:         the support of n + 1;
 5:     else
 6:         the support of n + 0;
 7: else
 8:     if ρ-idx[c] is 0 then    // assume the sequence in n is ρ
 9:         decay the support of n;
10:         if the sequence of n is in c then
11:             the support of n + 1;
12:         else
13:             the support of n + 0;
14:     else
15:         the support of n − previous weight of c;
16:         decay the support of n;
17:         if the sequence of n is in c then
18:             the support of n + 1;
19:         else
20:             the support of n + 0;

Fig 4-21. The pseudo code of function UpdateSupport
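The three branches of UpdateSupport can be condensed into a short Python sketch. The Node class and the seq_in_customer flag are hypothetical stand-ins for the thesis's tree node and the "sequence of n is in c" check:

```python
# A sketch of UpdateSupport (Figure 4-21), following the Case 1/2 text.

class Node:
    def __init__(self, sequence, support=0.0):
        self.sequence = sequence
        self.support = support

def update_support(node, is_new_customer, prev_idx_value, prev_weight,
                   seq_in_customer, d=0.9):
    if not is_new_customer and prev_idx_value != 0:
        node.support -= prev_weight   # remove the stale contribution of c
    node.support *= d                 # the remaining weights all decay by d
    if seq_in_customer:
        node.support += 1.0           # c's weight is reset to one

# Case 1: <(b)> has support 1.0; a new customer containing (b) arrives:
n = Node("<(b)>", support=1.0)
update_support(n, is_new_customer=True, prev_idx_value=0,
               prev_weight=0.0, seq_in_customer=True)
print(n.support)   # 1.9, matching the example of Figure 4-18
```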

Chapter 5

Performance Measurement

5.1 Performance Measurement of New-Moment

We performed many performance measurements to compare New-Moment with Moment.

The Moment program (MomentFP) was provided by its authors. All experiments were done on a 1.3 GHz Intel Celeron PC with 512 MB of memory running Windows XP.

New-Moment was implemented in C++ STL and compiled with Visual C++ .NET compiler.

All testing data was generated by the synthetic data generator of Agrawal et al. [1]. To test the scalability of New-Moment and Moment, we use two sets of parameters, Value1 and Value2, to generate datasets. The parameters of the testing data are listed in Table 5-1. The dataset generated by Value1 (T10I8D200K) contains patterns of typical length, while the dataset generated by Value2 (T15I12D200K) contains longer patterns.

Parameter                                Value1   Value2
Average items per transaction (T)        10       15
Number of transactions (D)               200K     200K
Number of items (N)                      1000     1000
Average length of maximal pattern (I)    8        12

Table 5-1. Parameters of testing data for New-Moment

Our testing method is to execute New-Moment and Moment on the same dataset and compare their performance. The performance measurements include memory usage, loading time of the first window, and average time of window sliding. After the first window is filled, New-Moment and Moment receive 100 further transactions, generating 100 consecutive sliding windows; both algorithms record the execution time of each window. The average time of window sliding is reported over these 100 consecutive sliding windows.

We use different minimum supports, different window sizes, and different numbers of item types to test the two algorithms.

5.1.1 Different Minimum Support

In the first experiment, we compare the memory usage and execution time of New-Moment and Moment under different minimum supports. The minimum support is varied from 1% to 0.1%, the sliding window size is fixed at 100,000 transactions, and the number of item types is fixed at 1000. The results for the two datasets (T10I8D200K and T15I12D200K) are listed below.

(1) T10I8D200K


Fig 5-1. Memory usage with different minimum support (T10I8D200K) (New-Moment and Moment)

The first measurement concerns the memory usage of New-Moment and Moment. Figure 5-1 shows the memory usage in KB. We can observe that Moment uses more than 120 MB of memory, while New-Moment uses only about 15 MB. When the minimum support drops to 0.05%, New-Moment uses just 50 MB, whereas Moment runs out of memory (more than 512 MB).

There are far fewer tree nodes in New-CET than in CET: New-Moment only maintains bit-vectors of 1-itemsets and closed frequent itemsets in the current window. The experiment shows that New-CET is more compact than CET.


Fig 5-2. Loading time of the first window with different minimum support (T10I8D200K) (New-Moment and Moment)

The second measurement concerns the loading time of the first window; Figure 5-2 shows the result. In the first window, both New-Moment and Moment need to build a lexicographic tree. We can observe that New-Moment is slightly faster than Moment. The reason is that generating candidates and counting their supports with bit-vectors is more efficient than with an independent sliding-window structure (MomentFP uses an FP-tree [5]).
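The bit-vector support counting that New-Moment relies on can be sketched as follows. Python integers stand in for the bit-vectors here; this is an illustration of the idea, not the thesis implementation:

```python
# One bit per transaction in the window; the support of an itemset is the
# popcount of the AND of its items' bit-vectors.

window = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}]   # 3 transactions

def bit_vector(item, window):
    bv = 0
    for i, txn in enumerate(window):
        if item in txn:
            bv |= 1 << i          # set bit i if item occurs in transaction i
    return bv

bv_a, bv_b = bit_vector("a", window), bit_vector("b", window)
support_ab = bin(bv_a & bv_b).count("1")   # popcount of the bitwise AND
print(support_ab)   # 2: {a, b} occurs in transactions 0 and 2
```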


Fig 5-3. Average time of window sliding with different minimum support (T10I8D200K) (New-Moment and Moment)

The third measurement concerns the average time of window sliding; Figure 5-3 shows the result. In this experiment New-Moment is slightly slower than Moment because New-Moment does not use the tid-sum as an additional key to speed up the left-check step. However, the difference is small: both algorithms finish each sliding step well within a second, so the difference is negligible.

(2) T15I12D200K

The patterns in this dataset are longer than those in the previous dataset. We again test the memory usage, loading time of the first window, and average time of window sliding. From the measurements listed below, we can observe that New-Moment scales better than Moment.

The first measurement is memory usage in KB; Figure 5-4 shows the result. We can observe that New-Moment still uses less memory than Moment. Comparing Figure 5-1 and Figure 5-4, we can also observe that New-Moment scales better in memory usage: on T10I8D200K Moment uses about 120 MB, while on T15I12D200K it uses about 200 MB, whereas New-Moment stays under 100 MB on both datasets.


Fig 5-4. Memory usage with different minimum support (T15I12D200K) (New-Moment and Moment)

The second measurement is about the loading time of the first window. Figure 5-5 shows the result. We can observe that New-Moment is still faster than Moment.


Fig 5-5. Loading time of the first window with different minimum support (T15I12D200K) (New-Moment and Moment)

The third measurement is the average time of window sliding; Figure 5-6 shows the result. When the minimum support is below 0.3%, the average window sliding time of New-Moment is lower than that of Moment.


Fig 5-6. Average time of window sliding with different minimum support (T15I12D200K) (New-Moment and Moment)

By testing two different datasets with various minimum supports, we can observe that New-Moment scales better than Moment. Although New-Moment is slightly slower than Moment in window sliding, both algorithms handle a window slide within one second. On the more complex dataset with low minimum support, New-Moment can even outperform Moment. New-Moment not only uses less memory than Moment but is also about as fast in loading the first window and in window sliding.

5.1.2 Different Sliding Window Size

Sliding window size determines the length of each bit-vector. In this experiment, we compare New-Moment and Moment under different sliding window sizes; the experiment shows that using bit-vectors of items instead of an independent sliding-window structure is an efficient strategy. The sliding window size is varied from 10,000 to 100,000 transactions. The minimum support is fixed at 0.1%, the dataset is T10I8D200K, and the number of item types is fixed at 1000. We again measure memory usage, loading time of the first window, and average time of window sliding.

Figure 5-7 shows the first measurement, memory usage in KB. We can observe that both New-Moment and Moment are affected linearly by the sliding window size. New-Moment still outperforms Moment in memory usage; furthermore, Moment's memory usage increases faster than New-Moment's as the window size grows.


Fig 5-7. Memory usage with different sliding window size (New-Moment and Moment)

Figure 5-8 shows the result of the second measurement, the loading time of the first window. Although each bit-vector becomes larger as the sliding window size increases, New-Moment is still slightly faster than Moment in loading the first window. The reason is that the processing time of a bitwise AND between bit-vectors is largely unaffected by the bit-vector length.


Fig 5-8. Loading time of the first window with different sliding window size (New-Moment and Moment)

Figure 5-9 shows the result of the third measurement, the average time of window sliding. The window sliding times of New-Moment and Moment are almost the same. From the experiments with different window sizes, we can conclude that New-Moment outperforms Moment in memory usage while retaining comparable execution time.


Fig 5-9. Average time of window sliding with different sliding window size (New-Moment and Moment)

5.1.3 Different Number of Items

New-Moment maintains bit-vectors of all items instead of an independent sliding-window structure; the more item types there are, the more bit-vectors must be maintained. The goal of this experiment is to show that New-Moment outperforms Moment in memory usage even with a large number of items. The number of item types ranges from 1000 to 10,000, the minimum support is 0.1%, the sliding window size is 100,000 transactions, and the testing dataset is T10I8D200K. We again measure memory usage, loading time of the first window, and average window sliding time.

Figure 5-10 shows the memory usage in KB. Moment runs out of memory (more than 512 MB) when the number of items exceeds 3000. The memory usage of New-Moment grows linearly with the number of items; this result shows that New-Moment's memory usage does not blow up when the number of items is large.


Fig 5-10. Memory usage with different number of items (New-Moment and Moment)

Next we test the execution time of both algorithms. Figure 5-11 shows the loading time of the first window, and Figure 5-12 shows the average time of window sliding. Although the loading time grows with the number of items, even at 9000 items loading the first window is only executed once, and the average time of window sliding is still less than 1 second. This means that New-Moment remains efficient with a large number of items.


Fig 5-11. Loading time of the first window with different number of items (New-Moment and Moment)


Fig 5-12. Average time of window sliding with different number of items (New-Moment and Moment)

5.2 Performance Measurement of IncSPAM

The sequence dataset in transaction form is generated by the IBM data generator [2]. Our program is written with the C++ standard library (STL) and compiled with gcc 4.0.3 on Linux 9.0. The testing computer has a 2.16 GHz CPU and 2 GB of main memory. Table 5-2 shows the parameters used to generate the testing data.

Parameter                                          Value
Average number of transactions per customer (C)    30
Average number of items per transaction (T)        2~3
Number of different items (N)                      1000

Table 5-2. Parameters of testing data for IncSPAM

The performance measurements include memory usage and average time of window sliding. Memory usage is observed with a system tool to capture the real memory variation. We run all transactions generated with the parameters above and record the time of handling each transaction; the average time of window sliding is taken over the entire dataset. All experiments are performed with decay-rate d = 0.999. To test the scalability of IncSPAM, we vary the minimum support, the window size of a customer-sequence, and the number of customers.

5.2.1 Different Minimum Support

The number of sequential patterns increases as the minimum support is lowered. We test the memory usage and execution time of IncSPAM under different minimum supports to evaluate its scalability. In this experiment we use an absolute minimum support S: if the number of customers that support a sequence ρ is more than S, ρ is a sequential pattern. We vary S from 3 to 10. The total number of customers is 1000, and the window size of each customer is 10 transactions. Figure 5-13 shows the memory usage in MB.


Fig 5-13. Memory usage with different minimum support (IncSPAM)

The memory usage is about 200 MB, which is reasonable for mining sequential patterns. We can see that the memory usage of IncSPAM increases rapidly as the minimum support is lowered. To show that the lexicographic sequence tree of IncSPAM does not generate redundant tree nodes, we compare the number of tree nodes in the lexicographic sequence tree with the memory used by IncSPAM. This experiment demonstrates that IncSPAM is memory-efficient.

Figure 5-14 shows the relationship between the maximum number of tree nodes and memory usage. From this graph we can observe that the relationship is linear, which means the memory usage grows only because the number of sequential patterns increases. IncSPAM does not build additional structures when the minimum support becomes small and is therefore memory-efficient.


Fig 5-14. Relationship between maximum number of tree nodes and memory usage (IncSPAM)

Figure 5-15 shows the average window sliding time. The result shows that the average sliding time is below 1 second. IncSPAM uses the CBASW and the characteristics of incremental mining to speed up the processing of an incoming transaction; the experiment demonstrates the efficiency of IncSPAM.


Fig 5-15. Average time of window sliding with different minimum support (IncSPAM)

5.2.2 Different Sliding Window Size

The sliding window size controls the number of transactions maintained for each customer. In this experiment, we test the memory usage and average sliding time under different sliding window sizes, ranging from 10 to 25 transactions. The minimum support is fixed at 10 customer-sequences. Figure 5-16 shows the memory usage and Figure 5-17 the average sliding time for different window sizes.


Fig 5-16. Memory usage with different window size (IncSPAM)


Fig 5-17. Average sliding time with different window size (IncSPAM)

When the number of transactions maintained for each customer increases, the memory usage and average sliding time also grow. For current applications and the IBM synthetic dataset, maintaining about 15 transactions per customer is reasonable. IncSPAM can be applied to general applications and is efficient in memory usage and in handling real-time transactions.

5.2.3 Different Number of Customers

IncSPAM can dynamically add a new customer to the summary data structure. In previous experiments we fix the number of customers for observing performance conveniently. The
