
Chapter 4 Incremental SPAM (IncSPAM): Mining Sequential Patterns

4.7 The Entire Process of Incremental SPAM (IncSPAM)

Finally, we present the complete IncSPAM algorithm for mining sequential patterns.

Figure 4-6 shows the main function of IncSPAM.

IncSPAM (S, d, N)

1: foreach incoming transaction from the data stream do
2:     find out which customer c the incoming transaction belongs to;
3:     update the CBASW of this customer by the incoming transaction;
4:     store all the frequent 1-sequences to F;
5:     MaintainTree(c, F);

Fig 4-6. Main function of Incremental SPAM
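The main loop in Figure 4-6 can be sketched in Python as follows. This is only an illustration under stated assumptions: a plain `deque` stands in for the CBASW structure, `window_size` is a hypothetical parameter for the number of transactions kept per customer, the support of an item is assumed to be the number of customers whose window contains it, and `maintain_tree` is a caller-supplied callback rather than the function of Figure 4-8.

```python
from collections import defaultdict, deque


def frequent_items(windows, min_support):
    """Items that occur in the windows of at least min_support customers."""
    counts = defaultdict(int)
    for window in windows.values():
        seen = set()
        for transaction in window:
            seen.update(transaction)
        for item in seen:
            counts[item] += 1
    return {item for item, n in counts.items() if n >= min_support}


def inc_spam(stream, min_support, window_size, maintain_tree):
    """Process (customer, transaction) pairs one at a time.

    The deque is a simplified stand-in for a CBASW (assumption):
    appending to a full window drops its oldest transaction.
    """
    windows = {}
    for customer, transaction in stream:                    # line 1-2
        window = windows.setdefault(customer, deque(maxlen=window_size))
        window.append(frozenset(transaction))                # line 3
        frequent = frequent_items(windows, min_support)      # line 4
        maintain_tree(customer, frequent)                    # line 5
    return windows
```

Recomputing the frequent items from scratch on every transaction keeps the sketch short; an incremental counter would avoid the repeated scan.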

The CBASW of each customer is modified in lines 1 to 4. After the CBASWs have been modified, function MaintainTree is called. MaintainTree maintains the sequential patterns dynamically in a lexicographic sequence tree. Incremental mining of sequential patterns involves three cases. Assume that a new transaction ω arrives and that ω belongs to customer c. The lexicographic sequence tree T is updated to T′:

A pattern which is frequent in T is still frequent in T′: We only need to update its ρ-idx and support.

A pattern which is not in T appears in T′: A new pattern is generated because of the incoming transaction. By the Apriori property [1], the prefix of the new pattern must also be frequent, so we only need to generate candidates from the leaf nodes of T. There are two ways to reduce the number of generated candidates: (1) We only append items of the incoming transaction to the leaf nodes, because every new pattern must end with one of these items. (2) The incoming transaction belongs to one specific customer c, so the generated candidates must begin with items in the customer-sequence of c. Figure 4-7 shows an example after sliding the CBASW of customer #3. The incoming transaction is TID = 7.

[Figure 4-7: the sliding window of each customer after the incoming transaction (b, c, d). The appended items are b, c, and d. The sub-trees of items a, b, c, and d need to generate candidates; the others do not.]

Fig 4-7. Reducing the generated candidates

A pattern which is in T does not exist in T′: The pattern becomes infrequent because of window sliding. We directly delete the node and its sub-tree.
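The two candidate-reduction rules of the second case can be sketched as a small helper. The function name `candidate_extensions` and the sample data are hypothetical, and the sketch assumes that the sub-trees are identified by the items of the depth-1 tree nodes.

```python
def candidate_extensions(root_items, customer_sequence, incoming):
    """Apply both reduction rules to one incoming transaction.

    root_items:        items labeling the depth-1 nodes of the tree
    customer_sequence: transactions in the window of customer c,
                       including the incoming one
    incoming:          items of the incoming transaction
    Returns (sub-trees that must generate candidates, items to append).
    """
    # Rule (2): candidates must begin with an item of customer c.
    items_of_customer = set().union(*customer_sequence)
    roots = sorted(r for r in root_items if r in items_of_customer)
    # Rule (1): only items of the incoming transaction are appended.
    appended = sorted(set(incoming))
    return roots, appended


# Hypothetical window for one customer; the last transaction is the new one.
roots, appended = candidate_extensions(
    ["a", "b", "c", "d", "e"],
    [("a",), ("b", "c", "d")],
    ("b", "c", "d"),
)
```

With this made-up window, only the sub-trees of a, b, c, and d are expanded, and only b, c, and d are appended; the sub-tree of e is skipped entirely.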

MaintainTree (c, F)

1: foreach tree node n whose item i is in F do
2:     if i exists in the customer-sequence of c then
3:         Generate(c, n);
4:     else // i does not exist in the customer-sequence of c
5:         Update(c, n);

Fig 4-8. The pseudo code of function MaintainTree

Figure 4-8 shows the pseudo code of MaintainTree. Function Generate, shown in Figure 4-9, uses S-step and I-step to generate all possible children of each tree node, following the principles described above. If a child does not exist in the lexicographic sequence tree, Generate creates a new tree node for it. If the child is already in the tree, Generate only updates its index set and support. Function Update, shown in Figure 4-10, is simpler than Generate: it does not generate children, and only visits each tree node to update its index set and support. The index set and the support are updated by function UpdateSupport.
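The dispatch performed by MaintainTree can be sketched as follows. This is a minimal illustration, not the thesis implementation: the tree is assumed to expose its depth-1 nodes as a dict keyed by item, and `generate` and `update` are hypothetical callbacks standing in for the functions of Figures 4-9 and 4-10.

```python
def maintain_tree(root_children, frequent, customer_items, generate, update):
    """Dispatch each frequent depth-1 node to generate() or update().

    root_children:  dict mapping an item to its depth-1 tree node (assumed layout)
    frequent:       set F of frequent 1-sequences (items)
    customer_items: items in the customer-sequence of customer c
    """
    for item, node in root_children.items():
        if item not in frequent:
            continue                 # the pseudo code only visits items in F
        if item in customer_items:
            generate(node)           # new candidates may start with this item
        else:
            update(node)             # only index sets and supports can change


log = []
maintain_tree(
    {"a": "node-a", "b": "node-b", "c": "node-c"},
    frequent={"a", "b"},
    customer_items={"a"},
    generate=lambda n: log.append(("generate", n)),
    update=lambda n: log.append(("update", n)),
)
```

In this usage example the node for a is passed to `generate`, the node for b to `update`, and the node for c is not visited because c is not in F.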

Generate (c, n)

1: foreach existing child n′ of n do
2:     UpdateSupport(c, n′);
3:     if the support of n′ < S then
4:         eliminate n′ and its sub-tree;
5: generate candidates of n by S-step and I-step;
6: foreach generated candidate x of n do
7:     count the support of x;
8:     if the support of x ≥ S then
9:         x is a child of n;
10: foreach child n′ of n do
11:     Generate(c, n′);

Fig 4-9. The pseudo code of function Generate
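The structure of Generate can be illustrated with the sketch below. It makes several simplifying assumptions that are not part of the thesis: patterns are tuples of sorted item tuples, tree nodes are plain dicts, and the support is recounted by scanning every customer window instead of using index sets and UpdateSupport. All function names are hypothetical.

```python
def s_step(pattern, item):
    """Sequence extension: append a new itemset containing only item."""
    return pattern + ((item,),)


def i_step(pattern, item):
    """Itemset extension: add item to the last itemset, keeping it sorted."""
    last = pattern[-1]
    if item <= last[-1]:
        return None                  # would break the lexicographic order
    return pattern[:-1] + (last + (item,),)


def contains(sequence, pattern):
    """True if pattern occurs in sequence, matching itemsets in order."""
    pos = 0
    for itemset in pattern:
        needed = set(itemset)
        while pos < len(sequence) and not needed <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(windows, pattern):
    """Number of customers whose window contains the pattern (naive scan)."""
    return sum(contains(seq, pattern) for seq in windows.values())


def generate(node, items, windows, min_support):
    """node is {"pattern": ..., "children": {pattern: node}}; min_support >= 1."""
    pattern, children = node["pattern"], node["children"]
    for child in list(children):                     # lines 1-4
        if support(windows, child) < min_support:
            del children[child]                      # drops the whole sub-tree
    for item in items:                               # lines 5-9
        for cand in (s_step(pattern, item), i_step(pattern, item)):
            if cand is None or cand in children:
                continue
            if support(windows, cand) >= min_support:
                children[cand] = {"pattern": cand, "children": {}}
    for child in children.values():                  # lines 10-11
        generate(child, items, windows, min_support)
```

The recursion ends because every extension makes the pattern longer, and a pattern longer than every customer window has support 0, which is below any threshold of at least 1.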

Update (c, n)

Fig 4-10. The pseudo code of function Update

We use the previous example to show the process of IncSPAM. Assume that IncSPAM has received three transactions. Figure 4-11 shows the CBASWs and the lexicographic sequence tree. We mark the sequential patterns with squares. Each tree node maintains an index set to record its support. In Figure 4-11, only the 1-sequence <(b)> is frequent, so the tree contains no longer sequential patterns.

Fig 4-11. The lexicographic sequence tree when the third transaction comes in

When the fourth transaction (a, b, c) comes in, the CBASW of customer 2 is modified and the 1-sequences <(a)> and <(c)> become new sequential patterns. The extension methods, S-step and I-step, then generate longer candidates. IncSPAM checks the support of each candidate using its index set and keeps the sequential patterns in the lexicographic sequence tree.

Fig 4-12. The lexicographic sequence tree after the fourth transaction comes in

When the fifth transaction (a, b) comes in, IncSPAM updates the CBASWs and the index sets in the lexicographic sequence tree. IncSPAM then generates new candidates to check whether there are new sequential patterns. Figure 4-13 shows the lexicographic sequence tree and the CBASWs after the fifth transaction comes in. In the figure, the tree nodes linked by dotted arrows are the candidates that IncSPAM needs to check. The fifth transaction belongs to customer 3, so only the sub-trees of items that exist in the customer-sequence of customer 3 need to generate candidates. Figure 4-13 shows that the sub-trees of items a and b need to generate new candidates. The new candidates <(a)(a)>, <(a)(b)>, and <(b, c)(b)> turn out not to be frequent, so IncSPAM does not keep these tree nodes.

Fig 4-13. The lexicographic sequence tree after the fifth transaction comes in

Figure 4-14 shows the result after the sixth transaction comes in. IncSPAM finds that the existing tree node <(b)(b)> becomes infrequent. In this case IncSPAM directly deletes the tree node <(b)(b)> and its sub-tree <(b)(b, c)>.

Fig 4-14. The lexicographic sequence tree after the sixth transaction comes in