Construction and Maintenance of Summary Data structure

Chapter 2 Online Mining of Frequent Itemsets in Data Streams

2.3 The Proposed Algorithm: DSM-FI

2.3.1 Construction and Maintenance of Summary Data structure

In this section, we describe the algorithm which constructs and maintains the in-memory summary data structure called SFI-forest (Summary Frequent Itemset forest).

Definition 2-6 A summary frequent itemset forest (SFI-forest) is a summary data structure and is defined as follows.

1. SFI-forest consists of a frequent item list (FI-list), and a set of summary frequent itemset trees (SFI-trees) of item-prefixes, denoted by item-prefix.SFI-trees.

2. Each node in the item-prefix.SFI-tree consists of four fields: item-id, item-id.esup, item-id.window-id, and item-id.node-link. The first field, item-id, is the identifier of the inserted item. The second field, item-id.esup, registers the number of transactions represented by the portion of the path reaching the node with that item-id. The third field, item-id.window-id, is set to the identifier of the current window when the node is created. The final field, item-id.node-link, links the node to the next node with the same item-id in the same SFI-tree, or is null if there is none.

3. Each entry in the FI-list consists of four fields: item-id, item-id.esup, item-id.window-id, and item-id.head-link. The item-id registers which item identifier the entry represents; item-id.esup records the number of transactions containing the item with that item-id; item-id.window-id is set to the identifier of the current window when the entry is created; and item-id.head-link points to the root node of the item-id.SFI-tree. Note that each entry with item-id in the FI-list is an item-prefix and is also the root node of the item-id.SFI-tree.

4. Each item-prefix.SFI-tree has a specific opposite frequent item list (OFI-list) with respect to the item-prefix, denoted by item-prefix.OFI-list. The item-prefix.OFI-list is composed of four fields: item-id, item-id.esup, item-id.window-id, and item-id.head-link. The item-prefix.OFI-list operates the same as the FI-list except that

Figure 2-2 outlines the SFI-forest construction of the proposed DSM-FI algorithm. First, the DSM-FI algorithm reads a transaction T from the current window B_N. Then, DSM-FI projects this transaction T into several sub-transactions and inserts these sub-transactions into the SFI-forest. The details of this projection are as follows. A transaction T with m items, such as (x_1 x_2 … x_m), in the current window is projected by inserting m item-prefix sub-transactions into the SFI-forest. In other words, the transaction T = (x_1 x_2 … x_m) is converted into m sub-transactions: (x_1 x_2 … x_m), (x_2 x_3 … x_m), …, (x_{m-1} x_m), and (x_m).

These m sub-transactions are called item-prefix transactions, since the first item of each sub-transaction is an item-prefix of the original transaction T. This step, called transaction projection, is denoted by TP(T) = {x_1|T, x_2|T, …, x_i|T, …, x_m|T}, where x_i|T = (x_i x_{i+1} … x_m), ∀i = 1, 2, …, m. The projecting cost of a transaction of length m for constructing the summary data structure SFI-forest is (m² + m)/2, i.e., m + (m−1) + … + 2 + 1. Recall that the decomposing cost of a transaction with m items in the BTS algorithm for constructing the summary data structure is 2^m − 2. The construction cost of the summary data structure therefore grows quadratically in m for our algorithm but exponentially in m for the BTS algorithm.
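The projection step and the two cost figures can be illustrated with a minimal Python sketch. The function names `project`, `projection_cost`, and `decomposition_cost` are hypothetical and chosen only for this illustration; they are not part of DSM-FI itself.

```python
def project(transaction):
    """Return the m item-prefix transactions x_i|T of a transaction T."""
    items = tuple(transaction)
    # x_i|T is the suffix of T that starts at item x_i.
    return [items[i:] for i in range(len(items))]


def projection_cost(m):
    """Number of items written by TP(T) for |T| = m: m + (m-1) + ... + 1."""
    return m * (m + 1) // 2


def decomposition_cost(m):
    """The 2^m - 2 figure quoted in the text for the BTS algorithm."""
    return 2 ** m - 2


subs = project("acdf")
print(["".join(s) for s in subs])      # ['acdf', 'cdf', 'df', 'f']
print(sum(len(s) for s in subs))       # 10, equal to projection_cost(4)
print(projection_cost(10), decomposition_cost(10))  # 55 1022
```

For a transaction of ten items, the projection writes 55 items in total, whereas the quoted decomposition figure is 1022.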

After the transaction projection of the incoming transaction T, the DSM-FI algorithm inserts T into the FI-list and then removes T from the current window in main memory. The items of the item-prefix transactions are then inserted into the corresponding item-prefix.SFI-trees as branches, and the estimated supports in the corresponding item-prefix.OFI-lists are updated. If an itemset shares a prefix with an itemset already in the SFI-tree, the new itemset shares the prefix of the branch representing that itemset. In addition, an estimated support counter is associated with each node in the tree. The counter is updated when an item-prefix transaction causes the insertion of a new branch. Figure 2-3 shows the subroutines of SFI-forest construction and maintenance.
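The prefix sharing described above can be sketched as an ordinary prefix-tree insertion. This is a simplified illustration under stated assumptions: the class `Node` and the function `insert_path` are hypothetical names, children are kept in a dictionary, and the node-link field is omitted. The counting rule follows lines 9 to 11 of Subroutine SFI-tree-maintenance in Figure 2-3: an existing child has its estimated support incremented, and a missing child is created with support one.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    item_id: str
    esup: int        # estimated support
    window_id: int   # window in which the node was created
    children: Dict[str, "Node"] = field(default_factory=dict)


def insert_path(root, items, window_id):
    """Insert items as a branch below root, sharing any existing prefix."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            # New branch: create the node with estimated support 1.
            child = Node(item, 1, window_id)
            node.children[item] = child
        else:
            # Shared prefix: only the counter changes.
            child.esup += 1
        node = child
    return root


# Item-prefix transactions with prefix c taken from the window of
# Example 2-1: <cdf>, <cef>, <cdef>, <cef>. The root mirrors the FI-list
# entry for c, which occurs in four transactions of that window.
root = Node("c", 4, 1)
for suffix in ["df", "ef", "def", "ef"]:
    insert_path(root, suffix, window_id=1)

print(root.children["e"].esup)                 # 2
print(root.children["e"].children["f"].esup)   # 2
print(sorted(root.children["d"].children))     # ['e', 'f']
```

The two <cef> transactions share one branch c-e-f, so that branch carries a count of two instead of being stored twice.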

Example 2-1. Let W_j be a window with landmark identifier j that contains six transactions: <acdf>, <abe>, <df>, <cef>, <acdef>, and <cef>, where a, b, c, d, e, and f are items in the data stream. The SFI-forest constructed by the DSM-FI algorithm for the first two transactions, <acdf> and <abe>, is described as follows. Note that each node of the form (id: id.esup: id.wid) is composed of three fields: item-id, estimated support, and window-id. For example, (a: 2: j) indicates that, from basic window W_j to the current basic window W_N (1 ≤ j ≤ N), item a appeared twice.

Algorithm SFI-forest construction

Input: A data stream, DS = [B1, B2, …, BN) with landmark 1, a user-specified minimum support threshold s∈(0, 1), and a maximum support error threshold ε ∈ (0, s).

Output: An SFI-forest generated so far.

1: FI-list = {}; /* initialize the FI-list to empty */
2: foreach window B_j do /* j = 1, 2, …, N */
3:   foreach transaction T = (x_1 x_2 … x_m) ∈ B_j do
       /* m ≥ 1 and j is the current window identifier */
4:     foreach item x_i ∈ T do /* the maintenance of FI-list */
5:       if x_i ∉ FI-list then
6:         create a new entry of form (x_i, 1, j, head-link) into the FI-list;
           /* the entry form is (item-id, item-id.esup, window-id, head-link) */
7:       else /* the entry already exists in the FI-list */
8:         x_i.esup = x_i.esup + 1;
           /* increment the estimated support of item-id x_i by one */
9:       end if
10:    end for
11:    call TP(T, j);
       /* project the transaction with each item-prefix x_i for constructing the x_i.SFI-tree */
12:  end for
13:  call SFI-forest-pruning(SFI-forest, ε, N); /* Step 3 of DSM-FI algorithm */
14: end for

Figure 2-2. Algorithm SFI-forest Construction
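The FI-list maintenance in lines 4 to 10 of Figure 2-2 can be sketched in Python as follows. This is an illustrative sketch only: the FI-list is modeled as a dictionary keyed by item-id, the function name `update_fi_list` is hypothetical, and the head-link field, the call to TP, and the pruning step are omitted.

```python
def update_fi_list(fi_list, transaction, window_id):
    """Apply lines 4-10 of Figure 2-2 to one transaction."""
    for item in transaction:
        entry = fi_list.get(item)
        if entry is None:
            # Lines 5-6: first occurrence, create (item-id, 1, j).
            fi_list[item] = {"esup": 1, "window_id": window_id}
        else:
            # Lines 7-8: the entry exists, increment its estimated support.
            entry["esup"] += 1
    return fi_list


# The six transactions of window W_j in Example 2-1 (window-id taken as 1).
window = ["acdf", "abe", "df", "cef", "acdef", "cef"]
fi_list = {}
for t in window:
    update_fi_list(fi_list, t, window_id=1)

print({item: e["esup"] for item, e in fi_list.items()})
# {'a': 3, 'c': 4, 'd': 3, 'f': 5, 'b': 1, 'e': 4}
```

After only the first two transactions, the entry for item a would hold an estimated support of 2, which matches the node (a: 2: j) mentioned in Example 2-1.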

Subroutine TP /* Step 2 of DSM-FI algorithm: construct and maintain the SFI-forest */

Input: A transaction T = (x₁x₂… x_m) and the current window-id j;

Output: x_i.SFI-tree, ∀i = 1, 2, …, m;

1: foreach item x_i, ∀i = 1, 2, …, m, do
2:   SFI-tree-maintenance([x_i|X], x_i.SFI-tree, j);
     /* X = x_1, x_2, …, x_m is the original incoming transaction T */
     /* [x_i|X] is an item-prefix transaction with the item-prefix x_i */
3: end for

Subroutine SFI-tree-maintenance /* Step 2 of DSM-FI algorithm */

Input: An item-prefix transaction (x_i x_{i+1} … x_m), the current window-id j, and x_i.SFI-tree, where i = 1, 2, …, m;

Output: A modified x_i.SFI-tree, where i = 1, 2, …, m;

1: foreach item x_l do /* l = i+1, i+2, …, m */
2:   if x_l ∉ x_i.OFI-list then /* x_i.OFI-list maintenance */
3:     create a new entry of form (x_l, 1, j, head-link) into the x_i.OFI-list;
       /* the entry form is (item-id, item-id.esup, item-id.window-id, item-id.head-link) */
4:   else /* the entry already exists in the x_i.OFI-list */
5:     x_l.esup = x_l.esup + 1;
       /* increment the estimated support of item-id x_l by one */
6:   end if
7: end for
8: foreach item x_i, ∀i = 1, 2, …, m, do /* x_i.SFI-tree maintenance */
9:   if SFI-tree has a child node with item-id y such that y.item-id = x_i.item-id then
10:    y.esup = y.esup + 1; /* increment y's estimated support by one */
11:  else create a new node of the form (x_i, 1, j, node-link);
       /* initialize the estimated support of the new node to one, link its parent link to SFI-tree, and link its node-link to the nodes with the same item-id via the node-link structure */
12:  end if
13: end for
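Lines 1 to 7 of Subroutine SFI-tree-maintenance, which maintain the OFI-list of an item-prefix, can be sketched as below. The sketch assumes a dictionary per OFI-list and omits the head-link field; the name `update_ofi_list` is hypothetical. Only the items that follow the item-prefix are counted, matching l = i+1, …, m in the pseudocode.

```python
def update_ofi_list(ofi_list, prefix_transaction, window_id):
    """Count every item that follows the item-prefix x_i (lines 1-7)."""
    for item in prefix_transaction[1:]:   # x_l for l = i+1, ..., m
        entry = ofi_list.get(item)
        if entry is None:
            # Lines 2-3: create the entry (x_l, 1, j).
            ofi_list[item] = {"esup": 1, "window_id": window_id}
        else:
            # Lines 4-5: increment the estimated support of x_l.
            entry["esup"] += 1
    return ofi_list


# Item-prefix transactions with prefix a from the window of Example 2-1.
a_ofi = {}
for t in ["acdf", "abe", "acdef"]:
    update_ofi_list(a_ofi, t, window_id=1)

print({item: e["esup"] for item, e in a_ofi.items()})
# {'c': 2, 'd': 2, 'f': 2, 'b': 1, 'e': 2}
```

Each count in a.OFI-list is the number of transactions in which the item occurs after a, so the list summarizes the support of the 2-itemsets that start with a.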

Subroutine SFI-forest-pruning /* Step 3 of DSM-FI algorithm: prune the infrequent information from the SFI-forest */

Input: An SFI-forest, a user-specified maximum support error threshold ε, and the current window identifier N;

Output: An SFI-forest which contains the set of all significant and frequent itemsets.

1: foreach entry x_i (i = 1, 2, …, d) ∈ FI-list, where d = |FI-list| do

Figure 2-3. Subroutines of SFI-forest construction algorithm

Figure 2-4. SFI-forest construction after processing the first transaction <acdf>

Figure 2-5. SFI-forest construction after processing the second transaction <abe>

(a) First transaction <acdf>: The DSM-FI algorithm reads the first transaction and calls Transaction-Projection(<acdf>). DSM-FI then updates the FI-list and inserts the four item-prefix transactions <acdf>, <cdf>, <df>, and <f> into [a.SFI-tree, a.OFI-list], [c.SFI-tree, c.OFI-list], [d.SFI-tree, d.OFI-list], and [f.SFI-tree, f.OFI-list], respectively.

The result is shown in Figure 2-4. In the following steps, the head-links of each item-prefix.OFI-list are omitted for conciseness.

(b) Second transaction <abe>: The DSM-FI algorithm reads the second transaction and calls Transaction-Projection(<abe>). DSM-FI then updates the FI-list and inserts the three item-prefix transactions <abe>, <be>, and <e> into [a.SFI-tree, a.OFI-list], [b.SFI-tree, b.OFI-list], and [e.SFI-tree, e.OFI-list], respectively. The result is shown in Figure 2-5. After all the transactions of window W_j have been processed, the SFI-forest generated so far is shown in Figure 2-6.
