
Chapter 3 New-Moment: Mining Closed Frequent Itemsets

3.2 Our Proposed Algorithm: New-Moment Algorithm

3.2.6 Appending the Incoming Transaction in window sliding

Appending the incoming transaction is the second step of window sliding. For each item contained in the incoming transaction, the rightmost bit of its bit-vector is set. After the bit-vectors have been updated, New-Moment modifies the New-CET. Only the sub-trees of the items in the incoming transaction need to be checked.
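The bit-vector update of this step can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function names `set_rightmost_bits` and `support` are hypothetical, bit-vectors are assumed to be stored as integers, and the values are the ones recoverable from Figure 3-12 (window size 4, incoming transaction {a, c, d}, bit-vectors already left-shifted by the first step of window sliding).

```python
def set_rightmost_bits(bit_vectors, transaction):
    """Set the rightmost bit of every item contained in the incoming transaction.

    Assumes the bit-vectors were already left-shifted by the first step of
    window sliding, so the rightmost bit of every vector is currently 0.
    """
    return {
        item: (bits | 1) if item in transaction else bits
        for item, bits in bit_vectors.items()
    }


def support(bits):
    """Support of an item = number of 1-bits in its bit-vector."""
    return bin(bits).count("1")


# Bit-vectors after the left shift, as listed in Figure 3-12.
shifted = {"a": 0b1110, "b": 0b1110, "c": 0b0110, "d": 0b0000}
appended = set_rightmost_bits(shifted, {"a", "c", "d"})

for item, bits in appended.items():
    print(item, format(bits, "04b"), support(bits))
```

Only the items of the incoming transaction change, which is why only their sub-trees in the New-CET need to be revisited.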

The function that traverses the New-CET to add a new transaction, called Append, works in the same way as function Build. The only difference is that the supports of the closed frequent itemsets already in the hash table must be updated. Figure 3-11 shows the pseudo code for appending the incoming transaction after the rightmost bit of each 1-itemset bit-vector has been set.

Append (nI, N, S)
    if support(nI) ≥ S · N then
        if leftcheck(nI) = false then
            foreach frequent sibling nK of nI do
                generate a new child nIK for nI;
                bitwise AND BITI and BITK to obtain BITIK;
            foreach child nI′ of nI do
                Append(nI′, N, S);
            if ∄ a child nI′ of nI such that support(nI′) = support(nI) then
                if nI is closed frequent itemset in previous sliding window then
                    update the support of nI;
                    update nI in the hash table;
                else
                    retain nI as a closed frequent itemset;
                    insert nI into the hash table;

Fig 3-11. Pseudo code of appending the incoming transaction in window sliding

Figure 3-12 shows the New-CET after appending the incoming transaction. This is also the New-CET in the second sliding window.

[Figure 3-12 content: transaction table, bit-vectors, and New-CET nodes for Window #2]

TID   Itemsets
1     c, d
2     a, b
3     a, b, c
4     a, b, c
5     a, c, d

Minsup = 2, Window Size = 4

Bit-vectors before appending: (a): <1110> (b): <1110> (c): <0110> (d): <0000>

Bit-vectors after appending: (a): <1111> (b): <1110> (c): <0111> (d): <0001>

New-CET nodes recoverable from the figure: (a, b): 3, (a, c): 3, (a, b, c): 2

Fig 3-12. New-CET after appending the incoming transaction (Window #2)

Chapter 4

Incremental SPAM (IncSPAM): Mining Sequential Patterns

Sequential pattern mining is more complicated than mining frequent itemsets, especially in a stream environment. Previous research offers no general processing model for handling a data stream whose unit is the transaction. Incremental SPAM (IncSPAM) provides a suitable sliding window model for a data stream. It receives transactions from the data stream and uses a new bit-vector concept, the Customer Bit-Vector Array with Sliding Window (CBASW), to store the item information of each customer. IncSPAM then uses a lexicographic sequence tree to maintain the sequential patterns in the current window. To speed up maintenance, IncSPAM uses index sets to store, for each tree node, the first occurring positions in all customer-sequences. Whenever a new transaction arrives, the CBASWs and the lexicographic tree are modified incrementally. Each transaction can be analyzed in a few seconds. Finally, a weight function is adopted in this sliding window model. The weight function judges the importance of a sequence and ensures the correctness of the sliding window model.

4.1 A New Concept of Sliding Window for Sequences

The original sliding window model keeps the latest transactions in a data stream. In sequential pattern mining, the transactions in a data stream belong to many customers. IncSPAM keeps the latest N transactions of each customer in the data stream, where N is called the window size. Each customer maintains its own sliding window. Figure 4-1 shows an example of this sliding window model.

[Figure 4-1: a Sequence Database in which the latest N transactions of each customer in the data stream are maintained]

Fig 4-1. An example for the new concept of sliding window

In Figure 4-1, the mining system has received 7 transactions. Assume that the window size of each customer is 2 (each customer keeps its latest 2 transactions in the data stream). Three transactions belong to customer #1: the transactions with TID = 1, 3, and 6. Only the transactions with TID = 3 and 6 are stored in the sliding window of customer #1 (marked by the two-way arrow). Likewise, the sliding windows of customer #2 and customer #3 are also displayed in Figure 4-1. This example is used throughout the introduction of IncSPAM.
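The per-customer window can be sketched with one bounded queue per customer. This is only an illustration: `build_windows` is a hypothetical name, and while the text states that TIDs 1, 3, and 6 belong to customer #1, the customer assignment of the other four TIDs below is an assumption made to complete the example.

```python
from collections import defaultdict, deque


def build_windows(stream, window_size):
    """Keep only the latest `window_size` transactions of each customer."""
    windows = defaultdict(lambda: deque(maxlen=window_size))
    for tid, cid in stream:
        windows[cid].append(tid)  # the oldest TID is dropped automatically
    return windows


# (TID, CID) pairs. TIDs 1, 3, 6 -> customer 1 is stated in the text;
# the remaining assignments are assumed for illustration.
stream = [(1, 1), (2, 2), (3, 1), (4, 2), (5, 3), (6, 1), (7, 3)]
windows = build_windows(stream, window_size=2)

print(list(windows[1]))  # [3, 6]
```

A `deque` with `maxlen` drops the oldest entry on its own, which mirrors the way each customer's window slides independently of the others.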

4.2 Customer Bit-Vector Array with Sliding Window (CBASW)

IncSPAM also uses bit-vectors to store the information of the sliding window. The concept of the bit-vector is almost the same as in section 3.2.1. The difference is that each customer has its own bit-vectors for all items, which store the information of that customer's sliding window. Table 4-1 shows the bit-vectors of each customer in Figure 4-1. All bit-vectors of one customer can be collected into a single data structure for that customer.

Item   Customer #1   Customer #2   Customer #3
a      00            01            10
b      11            11            11
c      11            01            01
d      11            00            01

Table 4-1. Bit-vectors of all items for all customers

Definition of Customer Bit-Vector Array with Sliding Window (CBASW): For each customer-sequence c, we keep the latest N transactions. N is called window size. Each bit-vector of item i contains N bits to represent the occurrences of i in the latest N transactions.

Figure 4-2 shows an example of CBASW. Each bar with a customer id is a CBASW which contains all bit-vectors of all items for a customer.

[Figure 4-2 panels: Sliding Window for Each Customer; Customer Bit-Vector Array with Sliding Window]

Fig 4-2. An example of CBASW
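As a rough sketch, a CBASW can be modeled as a dictionary from item to an N-bit integer. The helper name `build_cbasw` is hypothetical, and the transaction contents below are not given in the surviving text: they are read back from the bit patterns of Table 4-1, so treat them as an assumption.

```python
def build_cbasw(window, items, window_size):
    """Build one customer's CBASW: an N-bit vector per item.

    `window` lists the customer's latest transactions, oldest first.
    The newest transaction maps to the rightmost bit.
    """
    assert len(window) <= window_size
    cbasw = {}
    for item in items:
        bits = 0
        for transaction in window:
            bits = (bits << 1) | (1 if item in transaction else 0)
        cbasw[item] = bits
    return cbasw


ITEMS = "abcd"
N = 2
# Window contents per customer, reconstructed from Table 4-1 (assumption).
windows = {
    1: [{"b", "c", "d"}, {"b", "c", "d"}],
    2: [{"b"}, {"a", "b", "c"}],
    3: [{"a", "b"}, {"b", "c", "d"}],
}
cbasws = {cid: build_cbasw(w, ITEMS, N) for cid, w in windows.items()}

for item in ITEMS:
    print(item, [format(cbasws[cid][item], "02b") for cid in (1, 2, 3)])
```

Each printed row should match the corresponding row of Table 4-1.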

A lexicographical tree is needed in the mining process of Incremental SPAM. Although CBASWs keep all information of items (1-itemsets) for all customers, they are not efficient enough to be used to build and modify a lexicographical tree. In the next section the concept of Index Set will be introduced to speed up the process.

4.3 Index Set ρ-idx

Building and modifying a lexicographical tree for mining sequential patterns is more complicated than for mining frequent itemsets. The number of candidate sequences is huge. The index set of a sequence stores only the first positions of that sequence in all customer-sequences. These positions use less memory than bit-vectors.

Definition of Index Set: For a sequence ρ, the first occurring position in a customer-sequence c of ρ is recorded as ρ-posc. If ρ is not in c, ρ-posc = 0. The collection of these ρ-pos values in the order of customer id is called an index set ρ-idx. For convenience, ρ-posc can be represented as ρ-idx[c].

Taking the CBASWs in Figure 4-2 as an example, the index sets of the 1-sequences (items) are listed in Figure 4-3. Each number in the array is the first position of a sequence ρ in one customer-sequence. Counting the non-zero positions in an index set gives the support of the sequence. Take the 1-sequence <(a)> as an example. The number of non-zero positions in <(a)>-idx is 2, which means the sequence <(a)> exists in two customer-sequences (CID = 2 and CID = 3). So the support of <(a)> is 2. The support of each sequence is recorded after its index set.


Fig 4-3. An example of index sets
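The index sets of the 1-sequences can be derived directly from the bit-vectors of Table 4-1. The sketch below assumes bit-vectors are stored as integers and that positions are counted from 1 starting at the leftmost (oldest) bit; the function names are hypothetical.

```python
def first_position(bits, window_size):
    """1-based position of the leftmost 1-bit, or 0 if the vector is all zeros."""
    if bits == 0:
        return 0
    return window_size - bits.bit_length() + 1


def index_set(bit_vectors, window_size):
    """Index set of a 1-sequence: its first position in every customer-sequence."""
    return [first_position(bits, window_size) for bits in bit_vectors]


def support(idx):
    """Support = number of customers with a non-zero first position."""
    return sum(1 for pos in idx if pos != 0)


N = 2
# Bit-vectors of Table 4-1, ordered by customer id (1, 2, 3).
table_4_1 = {
    "a": [0b00, 0b01, 0b10],
    "b": [0b11, 0b11, 0b11],
    "c": [0b11, 0b01, 0b01],
    "d": [0b11, 0b00, 0b01],
}
index_sets = {item: index_set(vecs, N) for item, vecs in table_4_1.items()}

for item, idx in index_sets.items():
    print(f"<({item})>-idx = {idx}, support = {support(idx)}")
```

For <(a)> this yields the index set [0, 2, 1] with support 2, in agreement with the example in the text.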

4.4 Maintaining Information of Items in Window Sliding with CBASW and ρ-idx

Whenever a new transaction arrives from the data stream, its CID is checked to find out which CBASW needs to be modified. Each bit-vector in this CBASW is modified according to the incoming transaction. As in section 3.2.2, if the number of transactions of the customer exceeds the window size, window sliding is performed.

The window sliding process is the same as in section 3.2.2. Each bit-vector is left-shifted by one bit to eliminate the oldest transaction, and its rightmost bit is set according to the incoming transaction. The bit-vectors of the items in the incoming transaction set their rightmost bits to one; the others set their rightmost bits to zero. Figure 4-4 shows an example of this sliding process.

In Figure 4-4, we assume that the system has received 5 transactions and only the CBASW of customer #1 is observed. The first CBASW shows the situation before the new transaction with TID = 6 arrives. The second CBASW shows the result after left-shifting each bit-vector by one bit. The third CBASW shows the final result after the rightmost bits have been set according to the incoming transaction.

[Figure 4-4 panels: the CBASW of customer #1 before the transaction with TID = 6 comes; after left-shifting each bit-vector; after the transaction with TID = 6 comes]

Fig 4-4. An example of window sliding in a CBASW

The ρ-idx of each item (1-sequence) is maintained according to the CBASWs. Whenever window sliding is performed on the CBASW of a customer c, each ρ-idx[c] is decreased by one.

If a ρ-idx[c] becomes zero after the decrease, the bit-vector of ρ is checked to find the new first occurring position. If ρ no longer exists in this customer-sequence, ρ-idx[c] is set to zero.

In the first CBASW of Figure 4-4, <(a)>-idx[1] is 1. After window sliding, <(a)>-idx[1] is decreased to 0. In the third CBASW of Figure 4-4, <(a)>-idx[1] is 0 because sequence <(a)> does not exist in the customer-sequence of customer #1. The support of sequence <(a)> also decreases to 2.
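The sliding step and the index maintenance for one customer might look like the sketch below. Several things here are assumptions rather than statements from the text: the function names, the integer encoding of bit-vectors, the pre-slide window of customer #1 (chosen so that <(a)>-idx[1] is 1 before TID = 6 and the result matches Table 4-1), and the handling of an item whose index is already 0, which is simply re-scanned instead of being decreased.

```python
def first_position(bits, window_size):
    """1-based position of the leftmost 1-bit, or 0 if the vector is all zeros."""
    if bits == 0:
        return 0
    return window_size - bits.bit_length() + 1


def slide_window(cbasw, idx, incoming, window_size):
    """Slide one customer's CBASW and update that customer's index entries in place."""
    mask = (1 << window_size) - 1
    for item in cbasw:
        # Left-shift drops the oldest transaction; the new rightmost bit
        # records whether the incoming transaction contains the item.
        cbasw[item] = ((cbasw[item] << 1) & mask) | (1 if item in incoming else 0)
        if idx[item] > 0:
            idx[item] -= 1  # every surviving position moves one step left
        if idx[item] == 0:
            # First occurrence was lost (or the item was absent): re-scan.
            idx[item] = first_position(cbasw[item], window_size)


N = 2
# Assumed state of customer #1 before TID = 6 arrives.
cbasw_1 = {"a": 0b10, "b": 0b11, "c": 0b01, "d": 0b01}
idx_1 = {"a": 1, "b": 1, "c": 2, "d": 2}

slide_window(cbasw_1, idx_1, incoming={"b", "c", "d"}, window_size=N)
print({k: format(v, "02b") for k, v in cbasw_1.items()})
print(idx_1)
```

After the slide, item a has an all-zero bit-vector and index 0, so customer #1 no longer contributes to the support of <(a)>.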

4.5 Building of Lexicographical Sequence Tree

Section 4.4 introduced the algorithm for maintaining the information of items in a data stream. After the items of the incoming transaction are processed, these items are used to generate all possible sequences and check their supports. The SPAM algorithm [11] presents the concept of a lexicographical sequence tree to achieve this goal.

Assume that there is a lexicographical ordering ≤ of the items I in the data stream. If item i occurs before item j in the ordering, we denote this by i ≤_I j. This ordering can be extended to sequences by defining s_a ≤ s_b if s_a is a subsequence of s_b. If s_a is not a subsequence of s_b, then there is no relationship between them in this ordering.

The root of the tree is labeled with ∅. Recursively, if n is a node in the tree, then n’s children are all nodes n′ such that n ≤ n′ and ∀m ∈ T: n′ ≤ m ⇒ n ≤ m. Figure 4-5 shows an example of a lexicographic sequence tree, modified from [11]. It shows the sub-tree of the sequence tree for two items a and b down to the fourth level.

Level 1: <(a)>, <(b)>

Level 2: <(a)(a)>, <(a)(b)>, <(a, b)>

Level 3: <(a)(a)(a)>, <(a)(a)(b)>, <(a)(a, b)>, <(a)(b)(a)>, <(a)(b)(b)>, <(a, b)(a)>, <(a, b)(b)>

Level 4: <(a)(a)(a, b)>, <(a)(a, b)(a)>, <(a)(a, b)(b)>, <(a)(b)(a, b)>, <(a, b)(a)(a)>, <(a, b)(a)(b)>, <(a, b)(a, b)>

Fig 4-5. A lexicographic sequence tree example

Each sequence in the sequence tree can be considered as either a sequence-extended sequence or an itemset-extended sequence. A sequence-extended sequence is generated by adding a 1-itemset to the end of its parent’s sequence in the lexicographical tree, like <(a)(a)(a)> and <(a)(a)(b)> generated from <(a)(a)> in Figure 4-5. An itemset-extended sequence is generated by adding an item to the last itemset of its parent’s sequence in the lexicographical tree. The added item must be greater in the ordering than every item in the last itemset, like <(a)(a, b)> generated from <(a)(a)> in Figure 4-5.

If we generate sequences by traversing the tree, then each node in the tree can generate sequence-extended children and itemset-extended children. We refer to the process of generating sequence-extended sequences as the sequence-extension step (S-step) and to the process of generating itemset-extended sequences as the itemset-extension step (I-step). Thus each node n in the tree has two sets: S_n, the set of candidate items that are considered for possible S-step extensions of node n, and I_n, the set of candidate items that are considered for possible I-step extensions.
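The two extension steps can be illustrated with a small sketch. Sequences are represented here as tuples of itemsets and itemsets as sorted tuples of items; this representation and the function names `s_extend` and `i_extend` are assumptions made for the example.

```python
def s_extend(sequence, item):
    """S-step: append a new 1-itemset to the end of the sequence."""
    return sequence + ((item,),)


def i_extend(sequence, item):
    """I-step: add an item to the last itemset.

    The item must be greater than every item already in the last itemset;
    otherwise no itemset-extended sequence exists and None is returned.
    """
    last = sequence[-1]
    if item <= max(last):
        return None
    return sequence[:-1] + (last + (item,),)


parent = (("a",), ("a",))      # <(a)(a)>
print(s_extend(parent, "a"))   # <(a)(a)(a)>
print(s_extend(parent, "b"))   # <(a)(a)(b)>
print(i_extend(parent, "b"))   # <(a)(a, b)>
print(i_extend(parent, "a"))   # None: a is not greater than a
```

The three non-empty results are exactly the children of <(a)(a)> shown at the third level of Figure 4-5.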

4.6 Counting Support with ρ-idx

For a sequence ρ, ρ-idx stores the first positions of ρ in all customer-sequences. In the lexicographical tree of Incremental SPAM, each node, representing ρ, uses ρ-idx to count the support of sequence ρ. We introduce the method in two different steps.

4.6.1 Counting Support in S-step

Assume there are a sequence α and an appended 1-itemset β. An S-extended sequence γ is generated by α and β. Our goal is to use α-idx and β-idx to generate γ-idx and count the support of γ.

First, α-idx[c] and β-idx[c] are checked for each customer c. If either α-idx[c] or β-idx[c] is zero, γ-idx[c] is set to zero, which means γ cannot exist in the customer-sequence of c. If α-idx[c] and β-idx[c] are both non-zero for some c, γ may exist in this customer-sequence, and the corresponding position values have to be compared. There are two cases for α-idx[c] and β-idx[c]:

(Case 1) α-idx[c] < β-idx[c]: α appears before β, so γ exists in the customer-sequence of c. γ-idx[c] is set to β-idx[c].

(Case 2) α-idx[c] ≥ β-idx[c]: The index sets alone do not tell whether γ exists, so the CBASW of customer c must be checked further. We denote the bit-vector of item i in the CBASW of customer c as CBASWc(i).

A left-shift operation by α-idx[c] bits is performed on CBASWc(β). If the result is a non-zero bit-vector, γ exists in the customer-sequence of c. Let h be the position of the first non-zero bit in the result. The first position of γ, γ-idx[c], is set to α-idx[c] + h.

Otherwise γ does not exist in the customer-sequence of c and γ-idx[c] is set to zero.

Finally the support of γ can be counted by the number of non-zero positions.
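The S-step rules can be sketched as one function per customer position. The sketch assumes N-bit integer bit-vectors with position 1 at the leftmost bit, and a left shift that discards bits falling off the left end; the function name `s_step_position` is hypothetical. The data are the bit-vectors and index sets derived from Table 4-1.

```python
def s_step_position(alpha_pos, beta_pos, beta_bits, window_size):
    """Return γ-idx[c] for the S-extension of α by the 1-itemset β."""
    if alpha_pos == 0 or beta_pos == 0:
        return 0                          # α or β is missing for this customer
    if alpha_pos < beta_pos:
        return beta_pos                   # Case 1: β first occurs after α
    # Case 2: look for an occurrence of β strictly after position alpha_pos.
    mask = (1 << window_size) - 1
    shifted = (beta_bits << alpha_pos) & mask
    if shifted == 0:
        return 0
    h = window_size - shifted.bit_length() + 1   # first non-zero bit
    return alpha_pos + h


def s_step_index_set(alpha_idx, beta_idx, beta_vectors, window_size):
    return [
        s_step_position(a, b, bits, window_size)
        for a, b, bits in zip(alpha_idx, beta_idx, beta_vectors)
    ]


N = 2
a_idx, b_idx, c_idx = [0, 2, 1], [1, 1, 1], [1, 2, 2]
b_bits, c_bits = [0b11, 0b11, 0b11], [0b11, 0b01, 0b01]

ab_idx = s_step_index_set(a_idx, b_idx, b_bits, N)   # <(a)(b)>
bc_idx = s_step_index_set(b_idx, c_idx, c_bits, N)   # <(b)(c)>
print(ab_idx, sum(p != 0 for p in ab_idx))
print(bc_idx, sum(p != 0 for p in bc_idx))
```

In this data <(a)(b)> occurs only for customer #3, while <(b)(c)> occurs for all three customers, so their supports are 1 and 3.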

4.6.2 Counting Support in I-step

Assume a sequence α, an appended item T, and an I-extended sequence γ generated by α and T. Our goal is to use α-idx and T-idx to generate γ-idx and count the support of γ.

As in the S-step, α-idx[c] and T-idx[c] are checked for each customer c. If either α-idx[c] or T-idx[c] is zero, γ-idx[c] is set to zero. If α-idx[c] and T-idx[c] are both non-zero for some c, the corresponding position values have to be compared. There are also two cases for α-idx[c] and T-idx[c] in the I-step:

(Case 1) α-idx[c] = T-idx[c]: The last itemset of α and item T are in the same transaction.

(Case 2) α-idx[c] ≠ T-idx[c]: A further check in the CBASW of c is performed. Assume X is the last itemset of α. A bit-vector BITX is obtained by the bitwise AND of CBASWc(x1), CBASWc(x2), …, CBASWc(xk), where xi is an item contained in X. Then BITX is left-shifted by (α-idx[c] – 1) bits. If the resulting bit-vector is non-zero, γ exists in this customer-sequence. Let h be the position of the first non-zero bit in the result. The first position of γ, γ-idx[c], is set to (α-idx[c] – 1) + h. Otherwise γ does not exist in the customer-sequence of c and γ-idx[c] is set to zero.

As in section 4.6.1, the support of γ can be counted by the number of non-zero positions.
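A sketch of the I-step check follows, with two assumptions that go beyond the text and should be read as such. First, in Case 1 the function returns α-idx[c] as the position of γ, which is inferred from both items first occurring in the same transaction. Second, in Case 2 the bit-vector of T is included in the bitwise AND together with the items of X, since γ's last itemset contains T as well. The function name and the integer encoding are also hypothetical.

```python
from functools import reduce


def i_step_position(alpha_pos, t_pos, itemset_bits, window_size):
    """Return γ-idx[c] for the I-extension of α by item T.

    `itemset_bits` holds the bit-vectors of the items of X (the last itemset
    of α) AND of T itself (assumption: T is part of the AND).
    """
    if alpha_pos == 0 or t_pos == 0:
        return 0
    if alpha_pos == t_pos:
        return alpha_pos                  # Case 1 (inferred result)
    # Case 2: search for a transaction at or after alpha_pos holding all items.
    mask = (1 << window_size) - 1
    bit_x = reduce(lambda x, y: x & y, itemset_bits)
    shifted = (bit_x << (alpha_pos - 1)) & mask
    if shifted == 0:
        return 0
    h = window_size - shifted.bit_length() + 1   # first non-zero bit
    return (alpha_pos - 1) + h


N = 2
b_idx, c_idx = [1, 1, 1], [1, 2, 2]            # from Table 4-1
b_bits, c_bits = [0b11, 0b11, 0b11], [0b11, 0b01, 0b01]

bc_itemset_idx = [                              # <(b, c)>
    i_step_position(b_idx[k], c_idx[k], [b_bits[k], c_bits[k]], N)
    for k in range(3)
]
print(bc_itemset_idx, sum(p != 0 for p in bc_itemset_idx))
```

With the Table 4-1 data, <(b, c)> first occurs at position 1 for customer #1 and at position 2 for customers #2 and #3, giving a support of 3.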

4.7 The Entire Process of Incremental SPAM (IncSPAM)

Finally, we introduce the entire IncSPAM algorithm for mining sequential patterns. Figure 4-6 shows the main function of IncSPAM.

IncSPAM (S, d, N)
1: foreach incoming transaction from the data stream do
2:     find out which customer c the incoming transaction belongs to;
3:     update the CBASW of this customer by the incoming transaction;
4:     store all the frequent 1-sequences to F;
5:     MaintainTree(c, F);

Fig 4-6. Main function of Incremental SPAM

The CBASW of each customer is modified from line 1 to line 4. After the modification of the CBASWs is finished, function MaintainTree is called. Function MaintainTree maintains the sequential patterns dynamically in a lexicographic sequence tree. Incremental mining of sequential patterns involves three cases. Assume that a new transaction ω comes in and that ω belongs to customer c. The lexicographical tree T is updated to T′:

A pattern which is frequent in T is still frequent in T′: We only need to update its ρ-idx and support.

A pattern which is not in T appears in T′: A new pattern is generated because of the incoming transaction. By the Apriori [1] property, the prefix of the new pattern must also be frequent, so we only need to generate candidates from the leaf nodes of T. There are two ways to reduce the number of candidates to be generated: (1) We only consider the items in the incoming transaction for appending to the leaf nodes, because the new patterns must end with these items. (2) The incoming transaction belongs to one specific customer c, so the generated candidates must begin with items in the customer-sequence of c. Figure 4-7 shows an example after sliding the CBASW of customer #3. The incoming transaction is TID = 7.

[Figure 4-7: Sliding Window of Each Customer, with the incoming transaction (b, c, d). The appended items are items b, c, and d. The sub-trees of items a, b, c, and d need to generate candidates; the others do not.]

Fig 4-7. Reducing the generated candidates

A pattern which is in T does not exist in T′: The pattern becomes infrequent because of window sliding. We directly delete the node and its sub-tree.

MaintainTree (c, F)
    foreach tree node n whose representing item i is in F do
        if i exists in the customer-sequence of c then
            Generate(c, n);
        else // i does not exist in c
            Update(c, n);

Fig 4-8. The pseudo code of function MaintainTree

Figure 4-8 shows the pseudo code of MaintainTree. Function Generate, as shown in Figure 4-9, uses S-step and I-step to generate all possible children with the principles mentioned above for each tree node. If the child does not exist in the lexicographical tree, Generate creates a new tree node for this child. If the child is in the lexicographical tree, Generate only updates the index set and support of this child. Function Update, as shown in Figure 4-10, is simpler than Generate. Update does not need to generate children. Update only checks each tree node to update its index set and support. The process of updating the index set and the support is in Function UpdateSupport.

Generate (c, n)
    foreach existing child n′ of n do
        UpdateSupport(c, n′);
        if the support of n′ < S then
            eliminate n′ and its sub-tree;
    generate candidates of n by S-step and I-step;
    foreach generated candidate x of n do
        count the support of x;
        if the support of x ≥ S then
            x is a child of n;
    foreach child n′ of n do
        Generate(c, n′);

Fig 4-9. The pseudo code of function Generate

Update (c, n)

Fig 4-10. The pseudo code of function Update

We use the previous example to show the process of IncSPAM. Assume three transactions have been received by IncSPAM. Figure 4-11 shows the CBASWs and the lexicographic sequence tree. We mark the sequential patterns with squares. Each tree node maintains an index set to record its support. In Figure 4-11, only 1-sequence <(b)> is frequent so the tree does not have longer sequential patterns.

Fig 4-11. The lexicographic sequence tree when the third transaction comes in

When the fourth transaction (a, b, c) comes in, the CBASW of customer 2 is modified and the 1-sequences <(a)> and <(c)> become new sequential patterns. Longer candidates are then generated by the two extension methods, S-step and I-step. IncSPAM checks the support of each candidate using its index set and keeps the sequential patterns in the lexicographic sequence tree.

Fig 4-12. The lexicographic sequence tree after the fourth transaction comes in

When the fifth transaction (a, b) comes in, IncSPAM updates the CBASWs and the index sets in the lexicographic sequence tree. Then IncSPAM needs to generate new candidates to find out whether there are new sequential patterns. Figure 4-13 shows the lexicographic sequence tree and the CBASWs after the fifth transaction comes in. In the figure, the tree nodes linked by dotted arrows are the candidates IncSPAM needs to check. The fifth transaction belongs to customer 3, so only the sub-trees of items that exist in customer-sequence 3 need to generate candidates. Figure 4-13 shows that the sub-trees of items a and b need to generate new candidates. The new candidates <(a)(a)>, <(a)(b)>, and <(b, c)(b)> turn out not to be frequent, so IncSPAM does not keep these tree nodes.

Fig 4-13. The lexicographic sequence tree after the fifth transaction comes in

Figure 4-14 shows the result after the sixth transaction comes in. IncSPAM finds that the existing tree node <(b)(b)> becomes infrequent. In this case IncSPAM directly deletes the tree node <(b)(b)> and its sub-tree <(b)(b, c)>.

Fig 4-14. The lexicographic sequence tree after the sixth transaction comes in

4.8 Weight of Customer-Sequence

[Figure 4-15: a timeline of the data stream from the system start to the current transaction. The marked transactions are the latest transactions of a customer, but the customer has no recent transactions in the data stream.]

Fig 4-15. The transactions of a customer with no recent records in a data stream

In the IncSPAM algorithm, each customer maintains a sliding window to keep its latest N transactions, and the system mines sequential patterns from all customer-sequences. However, some customers may have had no transactions in the data stream recently. Customer-sequences with such out-of-date transactions cause a false positive problem in our algorithm: the supports of some patterns generated by the system are over-counted.

Figure 4-15 shows an example of these out-of-date transactions in a data stream. The customer-sequence with these out-of-date transactions is less important than other customer-sequences.

A concept of weight can be used to judge the importance of customers. Each customer-sequence c has its own weight wc, 0 ≤ wc ≤ 1. Each weight wc is decayed if the incoming transaction does not belong to c. When a transaction of c comes, the weight wc is set
