The Closed Itemsets Maintaining (CIM) Algorithm

Chapter 2 Incremental Mining Algorithms for Association Rules

2.5 The Closed Itemsets Maintaining (CIM) Algorithm

We develop a novel incremental mining algorithm, called Closed Itemsets Maintaining (CIM), to find FCID+ efficiently. It consists mainly of two subroutines, CO_generation and CP_generation, and uses an in-memory data structure called the Closed Maintenance Tree (CMT) to support both. The CIM algorithm proceeds in three steps. First, the CO_generation subroutine updates the itemsets in the CMT against d to obtain CO. Second, the CP_generation subroutine generates candidate itemsets for those itemsets of FCID+ which have not been determined in the CO_generation subroutine. Finally, the CIM algorithm reprocesses these candidate itemsets against D and checks their closure property, so that FCID+ can be found from the CMT. The CMT data structure and the CO_generation and CP_generation subroutines are described in Sections 2.5.1 to 2.5.3.

The CIM algorithm(CMT, D, d, minsup)
Parameters:
    CMT: A closed maintenance tree;
    D: An original database;
    d: A set of newly inserted transactions;
    minsup: A minimum support threshold.
Begin
    Set FFJCISet = φ; /* FFJCISet is a set used to store the itemsets of FFJCI. */
    Set Cand = φ; /* Cand is a set used to store candidate itemsets for FCID+. */
    CO_generation subroutine(CMT, d, minsup, FFJCISet, Cand);
    Set F1dD+ = φ; /* F1dD+ is a set used to store the frequent 1-itemsets in both d and D⁺. */
    Set mincountD+ = minsup * (|D| + |d|);
    Obtain_frequent_items(CMT, mincountD+, F1dD+); /* Obtain F1dD+ from CMT. */
    CP_generation subroutine(CMT, d, minsup, FFJCISet, Cand, F1dD+, CMT.root);
    Reprocess_Cand(CMT, Cand, D); /* Reprocess obtained candidate k-itemsets (k ≥ 2) in CMT against D. */
    Check_Closure_Cand(CMT, Cand); /* Check closure property for all candidate itemsets in CMT. */
    Remove_NCI(CMT, mincountD+); /* Remove the closed itemsets in CMT whose support counts are less than mincountD+. */
    Output_FCI(CMT); /* Output FCID+ for D⁺. */
End.

Figure 2-3: The CIM algorithm

Theorem 2-5: The CIM algorithm can correctly obtain FCID+.

Proof: As mentioned above, an incremental mining algorithm can solve the problem of maintaining association rules in two steps: updating CO against d, and reprocessing CP against D to find FCID+. By Theorem 2-2 and Corollary 2-1, since the CIM algorithm maintains CO and the candidate itemsets for CP in the CMT by means of the CO_generation and CP_generation subroutines, it can correctly obtain FCID+ from the CMT.

2.5.1 The Closed Maintenance Tree (CMT)

A Closed Maintenance Tree (CMT) is a tree structure similar to a prefix tree [1]. It is constructed as follows. For each itemset x, a corresponding node vx is built in the CMT. Each node maintains its itemset together with the support count, denoted as (itemset, support count). For each pair of nodes vx and vy corresponding to itemsets x and y, there is a directed edge from vx to vy if x is a parent of y. Here x is said to be a parent of y if y can be obtained by adding a new item to x; conversely, y is said to be a child of x. Therefore, in the constructed CMT an itemset has exactly one parent but may have more than one child. Note that the itemsets in a CMT are usually maintained in lexical order and that, to save storage space, each node stores only the suffix of its itemset relative to the itemset in its parent node. There are three types of nodes in a CMT:

• Closed nodes: nodes that represent the itemsets in FCID;

• Prefix-unclosed nodes: nodes that represent the common prefixes of closed nodes;

• Infrequent nodes: nodes that represent the infrequent 1-itemsets in D.

In particular, prefix-unclosed nodes are used to improve the search performance of the CMT, and infrequent nodes are used to reduce useless item combinations in the CP_generation subroutine.

Table 2-1: A transactional database

TID   Items
100   A, C, D
200   B, C, E
300   A, B, C, E
400   B, E

Example 2-1: Given the transactional database shown in Table 2-1, Figure 2-4 shows an example of a CMT based on minsup = 0.5. The prefix-unclosed node (B, 3) and the closed node (CE, 2) together stand for the closed itemset (BCE, 2); likewise, (B, 3) and (E, 3) stand for the closed itemset (BE, 3). The CMT maintains only one infrequent node, (D, 1).
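The closed itemsets named in Example 2-1 can be checked independently of the CMT. The following is a minimal brute-force sketch, not part of the CIM algorithm: it enumerates every itemset of Table 2-1, which is only feasible for toy data. The function names support and frequent_closed_itemsets are hypothetical, and itemsets are written as strings of single-letter items purely for readability.

```python
from itertools import combinations

# Table 2-1 as TID -> set of items.
D = {100: {"A", "C", "D"}, 200: {"B", "C", "E"},
     300: {"A", "B", "C", "E"}, 400: {"B", "E"}}

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions.values() if itemset <= t)

def frequent_closed_itemsets(transactions, minsup):
    """Enumerate all itemsets; keep frequent ones with no equal-support superset."""
    items = sorted(set().union(*transactions.values()))
    mincount = minsup * len(transactions)
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = support(set(combo), transactions)
            if c >= mincount:
                counts[frozenset(combo)] = c
    # A frequent itemset is closed if no proper superset has the same support.
    return {"".join(sorted(x)): c for x, c in counts.items()
            if not any(x < y and c == cy for y, cy in counts.items())}

fci = frequent_closed_itemsets(D, 0.5)
print(fci)
```

With minsup = 0.5 this yields C, AC, BE and BCE with support counts 3, 2, 3 and 2, which are the itemsets represented by the closed nodes of the CMT in Example 2-1.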

Figure 2-4: A closed maintenance tree (CMT)
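As an illustration of the suffix storage described above, the following sketch builds the tree of Example 2-1 and rebuilds the full closed itemsets by walking from the root. It is a sketch under assumptions, not the data structure used in the original implementation: the class CMTNode, its fields, and the helper closed_itemsets are hypothetical names, and items are single letters so that an itemset can be written as a string.

```python
from dataclasses import dataclass, field

@dataclass
class CMTNode:
    suffix: str   # items added relative to the itemset of the parent node
    count: int    # support count of the full itemset
    kind: str     # "closed", "prefix-unclosed", "infrequent" (or "root")
    children: list = field(default_factory=list)

    def add(self, suffix, count, kind):
        """Attach a child node and return it."""
        child = CMTNode(suffix, count, kind)
        self.children.append(child)
        return child

def closed_itemsets(node, prefix=""):
    """Depth-first walk that rebuilds full itemsets from the stored suffixes."""
    found = []
    for child in node.children:
        itemset = prefix + child.suffix
        if child.kind == "closed":
            found.append((itemset, child.count))
        found.extend(closed_itemsets(child, itemset))
    return found

# Nodes of Example 2-1, children kept in lexical order.
root = CMTNode("", 0, "root")
root.add("AC", 2, "closed")
b = root.add("B", 3, "prefix-unclosed")
b.add("CE", 2, "closed")
b.add("E", 3, "closed")
root.add("C", 3, "closed")
root.add("D", 1, "infrequent")

print(closed_itemsets(root))
# [('AC', 2), ('BCE', 2), ('BE', 3), ('C', 3)]
```

The prefix-unclosed node B contributes no itemset of its own; it only supplies the shared prefix for BCE and BE, which is the role the text assigns to it.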

2.5.2 The CO_generation Subroutine of the CIM Algorithm

The CO_generation subroutine is responsible for processing FCID against d to find FFJCI and FNJCI, thus obtaining CO. Of the two, finding FNJCI is the main concern, because most itemsets in NId are irrelevant and useless. To reduce useless item combinations of NId, the CO_generation subroutine adopts a branch-wise processing strategy to process a given CMT against d, as follows. The subroutine works from the leftmost branch to the rightmost branch of the CMT. If a branch consists of only one item x maintained in an infrequent node vx, the subroutine updates x’s support count against d and, if that count is not less than minsup*|D⁺|, keeps x in a set used to store candidate itemsets for FCID+. The detailed usage of this candidate set is described in Section 2.5.3.

Otherwise, for each of the other branches, which consist of closed nodes, the CO_generation subroutine uses the items belonging to the branch, i.e., the items of the maximal itemset in the branch, as seeds to mine the closed itemsets in d with a closed itemsets mining approach (such as the CHARM algorithm). Moreover, a checking mechanism is used to avoid duplicate item combinations which have already been considered by a processed branch. Since the subroutine considers only the items of one branch at a time, useless item combinations belonging to NId are effectively reduced, which greatly improves its performance. After all branches have been processed, the subroutine updates the found itemsets against the CMT to obtain CO. Assume y is an itemset in the CMT, z is one of the itemsets found in d, and x = y ∩ z. The subroutine finds FFJCI and FNJCI by updating x with the support count y.count + z.count, i.e., the sum of the support counts of y and z. The updated CMT thus contains the entire CO.

[Figure 2-4 node labels: root; closed nodes (AC, 2), (CE, 2), (E, 3) and (C, 3); prefix-unclosed node (B, 3); infrequent node (D, 1).]
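The branch-wise processing strategy can be sketched as follows, assuming the newly inserted transactions of Table 2-2. This is an illustrative simplification: a brute-force enumeration stands in for a real closed itemsets miner such as CHARM, the duplicate-checking mechanism is replaced by listing only the branches AC and BCE by hand, and the names closed_itemsets and mine_branch are hypothetical.

```python
from itertools import combinations

# Newly inserted transactions d (Table 2-2).
d = {500: {"B", "C", "D"}, 600: {"C", "D"}}

def closed_itemsets(transactions):
    """Brute-force closed itemsets (with counts) of a list of item sets."""
    items = sorted(set().union(*transactions))
    counts = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = sum(1 for t in transactions if set(combo) <= t)
            if c > 0:
                counts[frozenset(combo)] = c
    # Closed: no proper superset occurs with the same count.
    return {x: c for x, c in counts.items()
            if not any(x < y and c == cy for y, cy in counts.items())}

def mine_branch(branch_items, d, T):
    """Restrict d to the items of one branch, then mine closed itemsets into T."""
    projected = [t & branch_items for t in d.values()]
    projected = [t for t in projected if t]   # drop transactions with no seed item
    for itemset, count in closed_itemsets(projected).items():
        T["".join(sorted(itemset))] = count

T = {}
for branch in ({"A", "C"}, {"B", "C", "E"}):  # maximal itemsets AC and BCE
    mine_branch(branch, d, T)
print(T)
```

Because each call sees only the seed items of one branch, item D never enters a combination here, which is the reduction of useless combinations that the strategy aims for. On this data the sketch collects C with count 2 and BC with count 1.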

CO_generation subroutine(CMT, d, minsup, FFJCISet, Cand)
Parameters:
    CMT: The closed maintenance tree;
    d: The newly inserted transactions;
    minsup: The minimum support threshold;
    FFJCISet: The set used to store the itemsets of FFJCI;
    Cand: The set used to store candidate itemsets for FCID+.
Begin
    Set T = φ; /* T is a set used to store the mining results by the branch-wise processing strategy. */
    for each item ai that appears only in d, do /* Insert each new item ai in CMT. */
        insert ai with ai.count = 0 into CMT;
    for each branch bi ∈ CMT, do
        if bi consists of only one infrequent item x, then
            update x.count against d; /* x.count denotes x’s support count. */
            if x.count ≥ minsup*|D⁺|, then insert x with x.count into Cand;
        else if bi ≠ null and bi is not contained by a processed branch bj, then
            Closed_itemset_mining(bi, d, T); /* Execute a closed itemsets mining algorithm and store mining results into T. */
    y = CMT.get_first_CI(); /* Fetch the first closed itemset by lexical order in CMT. */
    z = T.get_first_CI(); /* Fetch the first closed itemset by lexical order in T. */
    while y ≠ null and z ≠ null, do
        if y = z, then
            y.count = y.count + z.count;
            if z.count ≥ minsup*|d|, then
                insert y with y.count into FFJCISet;
            y = CMT.get_next_CI(y); /* Fetch the next closed itemset by lexical order in CMT. */
            z = T.get_next_CI(z); /* Fetch the next closed itemset by lexical order in T. */
        else if y ∩ z = y, then
            y.count = y.count + z.count;
            if z.count ≥ minsup*|d|, then
                insert y with y.count into FFJCISet;
            y = CMT.get_next_CI(y);
        else if y ∩ z = z, then
            if z.count ≥ minsup*|d|, then
                insert z with (y.count + z.count) into FFJCISet;
            z.count = y.count + z.count;
            insert z with z.count into CMT;
            z = T.get_next_CI(z);
        else if y ∩ z = x and x ≠ null, then /* x ⊂ y and x ⊂ z. */
            if CMT.exist(x) = false, then
                x.count = y.count + z.count;
                insert x with x.count into CMT;
                if z.count ≥ minsup*|d|, then
                    insert x with x.count into FFJCISet;
                y = CMT.get_next_CI(y);
            else if (y.count + z.count) > x.count, then
                x.count = y.count + z.count;
                if z.count ≥ minsup*|d|, then
                    insert x with x.count into FFJCISet;
                y = CMT.get_next_CI(y);
End.

Figure 2-5: The CO_generation subroutine

Theorem 2-6: The CO_generation subroutine can correctly obtain CO.

Proof: For a branch of the given CMT, by using the items of the branch as seeds to process d, the CO_generation subroutine finds the closed itemsets in d which are subsets of one of the frequent closed itemsets in the branch. After all branches have been processed, it is easily seen that these closed itemsets found in d can be used to obtain the entire FFJCI ∪ FNJCI by updating them against the frequent closed itemsets in the CMT. The updated CMT thus contains the entire FCID ∪ FFJCI ∪ FNJCI.

Table 2-2: The newly inserted transactions

TID   Items
500   B, C, D
600   C, D

Figure 2-6: An example of branch-wise processing strategy in the CO_generation subroutine

Figure 2-7: An example of updating process in the CO_generation subroutine


Example 2-2: When the new transactions shown in Table 2-2 are inserted into Table 2-1, the CO_generation subroutine first considers the leftmost branch, {AC}, in Figure 2-4 and uses {A} and {C} as seeds to mine the closed itemsets in d. Then the branches with maximal itemsets {BCE}, {BE}, {C} and {D} are processed in turn. The mining results are shown in Figure 2-6, where the branches with {BE} and {C} can be ignored because the related item combinations have already been processed by the branch with {BCE}. After all branches have been processed, the CO_generation subroutine updates the mining results against the CMT. The updated CMT is shown in Figure 2-7, where the itemsets {B}, {C} and {BC} belong to FFJCI, and the itemset {D} is a candidate itemset for FCID+.
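The updating step of Example 2-2 can be reproduced with a short sketch. It is a simplified all-pairs variant of the rule x = y ∩ z with count y.count + z.count, not the lexical-order merge of Figure 2-5: for each x it keeps the largest sum over all pairs (y, z). The contents of T are an assumption, obtained by mining the branches AC and BCE against Table 2-2 rather than read from Figure 2-6, and the function name update_cmt is hypothetical.

```python
# Frequent closed itemsets of D (Example 2-1) and closed itemsets mined from d.
cmt = {"AC": 2, "BCE": 2, "BE": 3, "C": 3}
T = {"BC": 1, "C": 2}          # assumed mining results for branches AC and BCE
minsup, D_size, d_size = 0.5, 4, 2

def update_cmt(cmt, T, minsup, d_size):
    """All-pairs update: count(x) = max over (y, z) with y ∩ z = x of y.count + z.count."""
    new_counts, ffjci = {}, set()
    for y, y_count in cmt.items():
        for z, z_count in T.items():
            x = "".join(sorted(set(y) & set(z)))
            if not x:
                continue                       # empty intersection: nothing to update
            new_counts[x] = max(new_counts.get(x, 0), y_count + z_count)
            if z_count >= minsup * d_size:     # z is frequent in d
                ffjci.add(x)
    updated = dict(cmt)
    updated.update(new_counts)
    return updated, ffjci

updated, ffjci = update_cmt(cmt, T, minsup, d_size)
print(updated)
print(sorted(ffjci))

# Infrequent node D: count 1 in Table 2-1 plus its occurrences in Table 2-2.
d_transactions = [{"B", "C", "D"}, {"C", "D"}]
D_count = 1 + sum(1 for t in d_transactions if "D" in t)
d_is_candidate = D_count >= minsup * (D_size + d_size)
print(D_count, d_is_candidate)
```

On this data the sketch assigns the counts B = 4, BC = 3 and C = 5, marks exactly {B}, {BC} and {C} as FFJCI, and finds that D reaches the threshold minsup*|D⁺| = 3, in agreement with the example.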

2.5.3 The CP_generation Subroutine of the CIM Algorithm

According to Corollary 2-1, the CP_generation subroutine could find FId and then remove the itemsets in FId which are covered by FFJCI, taking the remainder as candidates for CP (i.e., {FId – cover(FFJCI, FId)}). However, this indirect approach may incur an excessive computation cost when FId is large, and may generate many candidate itemsets irrelevant to FCID+. The CP_generation subroutine therefore adopts a more direct and efficient method of candidate generation. Let F1dD+ denote the frequent 1-itemsets in both d and D⁺, and let Cand1 denote the 1-itemsets which are infrequent in D but frequent in D⁺. Both can be easily obtained from the updated CMT after the CO_generation subroutine. The CP_generation subroutine combines the found itemsets of FFJCI and Cand1 with those of F1dD+ to directly generate k-itemsets (k ≥ 2) as candidates for FCID+, as follows. It searches the CMT in a depth-first, left-to-right manner. When it meets an itemset x of FFJCI in the CMT, it combines x with an itemset of F1dD+ to form a new itemset x’. If x’ is not covered by FFJCI (i.e., x’ is not a subset of an itemset in FFJCI) and is frequent in d, then x’ is a new candidate itemset and a corresponding node vx’ is built in the CMT. Similarly, when it meets an itemset y of Cand1 or a new candidate generated earlier, it performs the same combine-and-test step to generate a new candidate itemset y’ and builds a corresponding node vy’ in the CMT. These FFJCI-based and Cand-based candidate generations continue until no new candidate itemsets are generated.

CP_generation subroutine(CMT, d, minsup, FFJCISet, Cand, F1dD+, x)
Parameters:
    CMT: The closed maintenance tree;
    d: The newly inserted transactions;
    minsup: The minimum support threshold;
    FFJCISet: The set used to store the itemsets of FFJCI;
    Cand: The set used to store candidate itemsets for FCID+;
    F1dD+: The set used to store frequent 1-itemsets in both d and D⁺;
    x: A variable.
Begin
    if x = CMT.root, then
        for each child ci of x, do
            CP_generation subroutine(CMT, d, minsup, FFJCISet, Cand, F1dD+, ci);
    else if x ⊆ FFJCISet or x ⊆ Cand, then
        for each zi ∈ F1dD+ and the lexical order of zi is after that of the first item of x, do
            x’ = combine(x, zi); /* Attempt to generate new candidate itemsets for FCID+. */
            if x’ ≠ null, then
                if cover(FFJCISet, x’) ≠ null, then continue; /* If x’ is covered by FFJCISet. */
                update x’.count against d;
                if x’.count ≥ minsup*|d|, then
                    insert x’ with x’.count into CMT and Cand;
        for each child ci of x, do
            CP_generation subroutine(CMT, d, minsup, FFJCISet, Cand, F1dD+, ci);
End.

Figure 2-8: The CP_generation subroutine
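The combine-and-test step of Figure 2-8 can be sketched without the tree, using a simple worklist in place of the depth-first traversal of the CMT. This is an illustrative simplification under stated assumptions: the inputs are the values of the running example (Tables 2-1 and 2-2), items are single letters, and generate_candidates, combine and covered are hypothetical stand-ins for the subroutine and its helpers.

```python
d = [{"B", "C", "D"}, {"C", "D"}]   # Table 2-2
ffjci = ["B", "BC", "C"]            # itemsets of FFJCI after CO_generation
cand1 = ["D"]                       # infrequent in D, frequent in D+
f1 = ["B", "C", "D"]                # F1dD+
minsup = 0.5

def combine(x, z):
    """Return x extended by item z in lexical order, or None if z is already in x."""
    return None if z in x else "".join(sorted(x + z))

def covered(itemset, ffjci):
    """True if `itemset` is a subset of some itemset in FFJCI."""
    return any(set(itemset) <= set(f) for f in ffjci)

def generate_candidates(d, ffjci, cand1, f1, minsup):
    candidates = {}
    worklist = list(ffjci) + list(cand1)
    while worklist:
        x = worklist.pop(0)
        for z in f1:
            if z <= x[0]:            # z must come after the first item of x
                continue
            new = combine(x, z)
            if new is None or new in candidates or covered(new, ffjci):
                continue
            count = sum(1 for t in d if set(new) <= t)   # support count in d only
            if count >= minsup * len(d):
                candidates[new] = count
                worklist.append(new)  # new candidates are extended in turn
    return candidates

candidates = generate_candidates(d, ffjci, cand1, f1, minsup)
print(candidates)
```

The combination of B with C is skipped because BC is covered by FFJCI, so only BD, BCD and CD are produced, with counts 1, 1 and 2 in d. The counts against D are deliberately left for the later reprocessing step.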

Theorem 2-7: The CP_generation subroutine can correctly generate candidate itemsets for the itemsets of FCID+ which have not been determined in the CO_generation subroutine.

Proof: Clearly, only the itemsets of FId which are enumerated from F1dD+ can possibly be contained in FCID+, so the number of itemsets of {FId – cover(FFJCI, FId)} that need to be considered for FCID+ can be further reduced. Since the entire F1dD+ can be obtained by collecting the 1-itemsets covered by FFJCI and the itemsets of Cand1, the CP_generation subroutine can directly, and without loss of information, generate the candidate itemsets for the itemsets of FCID+ which have not been determined in the CO_generation subroutine, by combining FFJCI with F1dD+ and Cand with F1dD+, respectively. Of the two, the FFJCI-based candidate generation avoids the item combinations which are already covered by the found itemsets of FFJCI.

Figure 2-9: An example of CP_generation subroutine

Example 2-3: Continuing from Example 2-2, after the CO_generation subroutine we have FFJCI = {B, BC, C}, Cand1 = {D} and F1dD+ = {B, C, D}. As shown in Figure 2-9, the CP_generation subroutine generates candidate itemsets as follows. It first combines {B} of FFJCI with the itemsets of F1dD+ to form valid candidate itemsets, which generates the candidate itemset {BD}. Then {BC} and {C} of FFJCI are processed in the same way to generate the candidate itemsets {BCD} and {CD}, respectively.
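The remaining steps of Figure 2-3 (Reprocess_Cand, Check_Closure_Cand and Remove_NCI) can be traced on the running example with the sketch below. It is not taken from the original text: the counts in cmt were obtained by applying the update rule of Section 2.5.2 to Tables 2-1 and 2-2 rather than read from Figure 2-7, the candidate counts in d were computed from Table 2-2, and the closure test is applied to every itemset for simplicity instead of only to the candidates.

```python
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]  # Table 2-1

# Assumed state of the CMT after CO_generation (counts already refer to D+).
cmt = {"AC": 2, "B": 4, "BC": 3, "BCE": 2, "BE": 3, "C": 5}
cand_dplus = {"D": 3}                       # Cand1: already counted against D+
cand_d = {"BD": 1, "BCD": 1, "CD": 2}       # k-itemsets counted against d only
mincount = 0.5 * (4 + 2)                    # minsup * (|D| + |d|)

# Reprocess_Cand: add each candidate's support count in the original database D.
for itemset, count in cand_d.items():
    cmt[itemset] = count + sum(1 for t in D if set(itemset) <= t)
cmt.update(cand_dplus)

# Check_Closure_Cand and Remove_NCI: keep itemsets that are frequent in D+
# and have no proper superset with the same support count.
fci = {x: c for x, c in cmt.items()
       if c >= mincount
       and not any(set(x) < set(y) and c == cy for y, cy in cmt.items())}
print(fci)
```

On this data only CD survives among the candidates: BD and BCD stay at count 1, and D is not closed because CD has the same count of 3. The sketch therefore reports B, BC, BE, C and CD as the frequent closed itemsets of D⁺.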

2.6 The CIM Algorithm with Pre-large Concept: CIM-P Algorithm

From the document: A Study of Incremental Mining and Multidimensional Real-Time Mining (pages 37-47)