Chapter 6 Multidimensional Online Mining Algorithms for Generation of
6.6 LNOM: Algorithm Design and Implementation
The NOM approach needs to calculate the appearing counts and the non-appearing upper-bound counts of the candidate itemsets derived from matched tuples. A straightforward way to find these values is to process the matched tuples one after another for each candidate itemset. Assume k is the number of matched tuples, m is the average number of itemsets in the k matched tuples, and n is the number of candidate itemsets generated from the k matched tuples. The computation cost is then O(knm) when the candidate itemsets are processed one by one. This cost grows as the number of itemsets kept in EMPR and the number of candidate itemsets to be considered increase. In fact, in the NOM approach, many candidate itemsets with the same subsets can be processed at the same time. For example, in Tuple 4 of Example 6-6, the appearing count of the candidate itemset {C} and the upper-bound counts of the candidate itemsets {AC}, {BC} and {ABC} can be calculated at the same time because they have the same subset {C}. On the other hand, many itemsets kept in the matched tuples are useless for calculating the counts of candidates since they are not subsets of any candidate, and they can be omitted. For example, in Example 6-6, the itemsets {D}, {F}, {AF} and {BF} kept in the matched tuples are not subsets of the candidate itemsets and can be omitted. We thus use appropriate data structures and design efficient algorithms to improve the performance of the NOM approach.
First, the problem of calculating the appearing and upper-bound counts of candidate itemsets in a matched tuple is conceptually modeled by a graph and converted into a directed-minimum-spanning-tree problem. The spanning-tree-count-calculating (STCC) algorithm is then proposed to find the directed minimum spanning tree. The lattice data structure [2][41] is utilized to organize and maintain all candidate itemsets such that the candidate itemsets with the same proper subsets can be considered at the same time. Consequently, by the STCC algorithm, the proposed lattice-based NOM (LNOM) approach requires only one scan of the itemsets for each matched tuple in Phase 1.
In addition, the hashing technique is used to filter out some of the itemsets kept in the matched tuples that are useless for calculating the counts of candidates. The NOM approach first hashes the set of candidate itemsets into a given hash table as soon as they are collected. Each bucket of the hash table holds an integer recording how many candidate itemsets have been hashed into it. When an itemset of a matched tuple is selected, the NOM approach calculates its hash value and finds its corresponding bucket. If the value stored in the target bucket is equal to 0, the itemset must be useless since it is not a candidate itemset, and it can be omitted directly. The computational time can thus be further reduced.
6.6.1 The Proposed Lattice-based NOM (LNOM) Approach
The problem of calculating the appearing and upper-bound counts of candidate itemsets in a matched tuple t can be conceptually modeled by a graph. Let G = (V, E) be a directed graph, where V is the set of vertices representing all candidate itemsets and E is the set of directed edges representing a-proper-subset-of relationships between pairs of candidate itemsets. For each edge (u, v) ∈ E, a weight w(u, v) specifies the possible upper-bound count of the candidate itemset v estimated from the candidate itemset u. Given a new vertex r representing the pseudo starting vertex, we make a new graph G’ = (V’, E’), where V’ = V ∪ {r}, E’ = E ∪ {(r, u): u ∈ V}. For each edge (r, u), if u appears in t, the appearing count of u is assigned as the weight
w(r, u). For the case that u does not appear in t, meaning it is collected from the other matched tuple(s), then w(r, u) = 0 if there exists one item contained in u but not contained in t and w(r, u) = t.trans*s−1 otherwise, where s is the initial minimum support for deriving EMPR. The following lemmas formally show the above concepts.
Lemma 6-11: G’ is an acyclic and connected graph.
Proof: It is obvious that the a-proper-subset-of relation on a set is transitive and anti-symmetric. G’ is thus acyclic. Next, we prove G’ is a connected graph by contradiction. If G’ is not a connected graph, there exists a vertex u which is not reachable from the pseudo starting vertex r. This contradicts the definition of G’. Thus,
G’ is an acyclic and connected graph.
Lemma 6-12: Let k be the number of items contained in a candidate itemset x. The vertex u_x has 2^k − 1 incoming edges in G’.
Proof: If x is a candidate k-itemset, it appears in the frequent pattern set of at least one tuple. Since x is large in that tuple, all its proper subsets except φ are also large and appear in that tuple. There are 2^k − 2 proper subsets of x other than φ. In addition, the incoming edge (r, u_x) links the two vertices r and u_x. The vertex u_x thus has 2^k − 1 incoming edges in G’.
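The count in Lemma 6-12 can be checked by direct enumeration. The sketch below assumes, as the proof argues, that every non-empty proper subset of x is itself a candidate; the function name `incoming_edge_count` is introduced here for illustration only.

```python
from itertools import combinations


def incoming_edge_count(itemset):
    """Count the incoming edges of the vertex for `itemset` in G'.

    Assumes every non-empty proper subset of `itemset` is a candidate
    (as in the proof of Lemma 6-12); one extra edge comes from the root r.
    """
    items = sorted(itemset)
    k = len(items)
    proper_subsets = [
        frozenset(c) for size in range(1, k) for c in combinations(items, size)
    ]
    return len(proper_subsets) + 1  # +1 for the edge (r, u_x)


print(incoming_edge_count(frozenset("ABC")))  # 7, i.e. 2^3 - 1
```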
Lemma 6-13: For a matched tuple t in EMPR, if there exists one item contained
in a candidate itemset u but not contained in t, then the upper-bound count of u is 0.
Proof: According to the concept of the negative border, all single items which are not large must be put into the negative 1-itemsets. Since all the large and negative itemsets for a block of data are stored in a corresponding tuple, if there exists one item contained in a candidate itemset but not contained in the tuple, this item does not appear in the corresponding block of data. The count of the item is thus 0 in this tuple, so the count of every itemset containing the item is also 0. This completes the proof.
Lemma 6-14: For a matched tuple t in EMPR, if a candidate itemset u does not
appear in t, then the maximum possible upper-bound count of u is t.trans*s−1.
Proof: Since u does not appear in t, it is not a frequent itemset. The support of u in t must thus be less than the minimum support s. Therefore, the count of u in t must be less than t.trans*s. The maximum possible upper-bound count of u is thus
t.trans*s−1.
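The weight rule for the edges leaving r, together with Lemmas 6-13 and 6-14, can be sketched as a small function. This is an illustrative sketch, not the thesis implementation: the name `root_edge_weight`, the dictionary representation of a tuple, and the block size of 1000 transactions are assumptions. The supports are those quoted for Tuple 4 in the running example, and s = 5% is read off the edge labels of Figure 6-2.

```python
from fractions import Fraction


def root_edge_weight(u, tuple_supports, trans, s):
    """Weight w(r, u) of candidate itemset u for one matched tuple t.

    tuple_supports: dict {frozenset itemset kept in t: support fraction}
    trans:          t.trans, the number of transactions behind t
    s:              initial minimum support used to derive the EMPR
    """
    if u in tuple_supports:
        return trans * tuple_supports[u]       # u appears in t: appearing count
    items_in_t = set().union(*tuple_supports)  # every item kept somewhere in t
    if not u <= items_in_t:
        return 0                               # Lemma 6-13
    return trans * s - 1                       # Lemma 6-14


# Tuple 4 of the running example; trans = 1000 is a made-up block size.
t4 = {
    frozenset("C"): Fraction(2, 100),
    frozenset("D"): Fraction(3, 100),
    frozenset("AB"): Fraction(3, 100),
    frozenset("B"): Fraction(6, 100),
    frozenset("A"): Fraction(8, 100),
}
s = Fraction(5, 100)
print(root_edge_weight(frozenset("C"), t4, 1000, s))   # 20
print(root_edge_weight(frozenset("AC"), t4, 1000, s))  # 49
print(root_edge_weight(frozenset("AE"), t4, 1000, s))  # 0
```

Exact fractions are used only to keep the arithmetic free of floating-point rounding.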
Example 6-9: For the EMPR given in Table 6-3 and the mining request in
Example 6-6, the graph model for Tuple 4 is generated as shown in Figure 6-2.
[Figure 6-2 graphic: a directed graph on the pseudo starting vertex r and the candidate vertices A, B, C, AB, AC, BC and ABC; its edges are labeled with the weights t.trans*8%, t.trans*6%, t.trans*3%, t.trans*2% and t.trans*5%−1.]
Figure 6-2: The graph model of candidate itemsets for Tuple 4 in Table 6-4
For each vertex other than r in G’, the smallest weight on all its incoming edges is its tight upper-bound count. The count-calculation problem can thus be treated as the directed-minimum-spanning-tree problem [30], which seeks a rooted directed spanning tree T = (V’, S’) of G’ such that S’ is a subset of E’ and $\sum_{(u,v) \in S'} w(u, v)$ is a minimum. The spanning-tree-count-calculating (STCC) algorithm shown in Figure 6-3 is thus proposed based on the above concept for efficiently finding the counts of all candidate itemsets in a tuple. The STCC algorithm first selects an itemset appearing in t and with the smallest support. It then estimates the upper-bound count of each itemset reachable from the selected one in the graph, and thus avoids recalculating the counts of these traversed vertices in the future. This requires only one scan of the itemsets in t if they have been sorted according to their supports.
The spanning-tree-count-calculating (STCC) algorithm:
INPUT: The graph of candidate itemsets G’ = (V’, E’) derived from the EMPR, and a matched tuple t in EMPR.
OUTPUT: The minimum spanning tree of candidate itemsets T = (V’, S’).
STEP 1: Set ProcessedSet = φ, where ProcessedSet is a set used to keep the vertices in G’ which have been traversed.
STEP 2: Among the itemsets appearing in t that have not yet been selected, select the itemset x with the smallest support t.s_x.
STEP 3: If x ∈ V’ (i.e., x is a candidate itemset), set Count_x^appearing = t.trans * t.s_x, ProcessedSet = ProcessedSet ∪ {x}, and do STEP 4; otherwise (i.e., x is not a candidate itemset), do nothing and go to STEP 5.
STEP 4: For each y reachable from x with y ∉ ProcessedSet, set Count_y^UB = min(t.trans * s − 1, t.trans * t.s_x) and ProcessedSet = ProcessedSet ∪ {y}.
STEP 5: Repeat STEPs 2 to 4 until all the itemsets appearing in t are processed.
STEP 6: If |ProcessedSet| ≠ |V’| (i.e., some candidate itemsets do not appear in the underlying dataset of t), set Count_x^UB = 0 for each remaining itemset x ∈ V’.
Figure 6-3: The STCC algorithm
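The steps of Figure 6-3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the original implementation: the function name `stcc`, the dictionary representation of a tuple, the block size of 1000 transactions, and the tie-breaking rule (larger itemsets first among equal supports) are ours. The reachable vertices of STEP 4 are taken to be the candidate proper supersets of x, and s = 5% is read off the edge labels of Figure 6-2.

```python
from fractions import Fraction


def stcc(candidates, tuple_supports, trans, s):
    """Return (appearing, upper_bound) count dicts for one matched tuple t."""
    candidates = set(candidates)
    appearing, upper_bound = {}, {}
    processed = set()                                      # STEP 1
    # STEPs 2 and 5: visit the itemsets of t by ascending support.
    # Tie-break (an assumption): larger itemsets first.
    ordered = sorted(tuple_supports.items(), key=lambda kv: (kv[1], -len(kv[0])))
    for x, sx in ordered:
        if x not in candidates:                            # STEP 3: not a candidate
            continue
        appearing[x] = trans * sx
        processed.add(x)
        for y in candidates:                               # STEP 4
            if x < y and y not in processed:               # proper superset of x
                upper_bound[y] = min(trans * s - 1, trans * sx)
                processed.add(y)
    for y in candidates - processed:                       # STEP 6
        upper_bound[y] = 0
    return appearing, upper_bound


cands = [frozenset(c) for c in ("A", "B", "C", "AB", "AC", "BC", "ABC")]
# Tuple 4 of the running example; trans = 1000 is a made-up block size.
t4 = {
    frozenset("C"): Fraction(2, 100),
    frozenset("D"): Fraction(3, 100),
    frozenset("AB"): Fraction(3, 100),
    frozenset("B"): Fraction(6, 100),
    frozenset("A"): Fraction(8, 100),
}
appearing, upper_bound = stcc(cands, t4, 1000, Fraction(5, 100))
print(sorted("".join(sorted(k)) for k in upper_bound))  # ['ABC', 'AC', 'BC']
```

In this run {C} is visited first and fixes the upper-bound counts of {AC}, {BC} and {ABC} at 20 (t.trans*2%), while {D} is skipped because it is not a candidate.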
Example 6-10: Continuing Example 6-9, the negative itemset {C} with 2% will be
first selected by the proposed STCC algorithm to calculate the appearing count of itself and the upper-bound counts of {AC}, {BC} and {ABC}. Then, the itemsets {D}
with 3%, {AB} with 3%, {B} with 6% and {A} with 8% are selected in turn. Among them, the support information of {D} is useless because it is not a candidate itemset.
Figure 6-4 shows the directed minimum spanning tree found from Figure 6-2.
[Figure 6-4 graphic: the vertices r, A, B, C, AB, AC, BC and ABC with the same edge labels as in Figure 6-2.]
Figure 6-4: The directed minimum spanning tree found from Figure 6-2
The STCC algorithm mentioned above can be efficiently implemented by the lattice data structure [2][41], which organizes all candidate itemsets in a systematic way. The lattice is constructed as follows. For each candidate itemset x, a corresponding vertex u_x associated with a pair of values (Count_x^appearing, Count_x^UB) is built in the lattice. For any pair of vertices u_x and u_y corresponding to candidate itemsets x and y, there is a directed edge from u_x to u_y if x is a parent of y. An itemset x is said to be a parent of an itemset y if y can be obtained by adding an item to x, and inversely, y is said to be a child of x. Therefore, a candidate itemset may have more than one parent and more than one child in the constructed lattice.
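One possible in-memory form of such a lattice is sketched below. The class and function names (`LatticeNode`, `build_lattice`, `descendants`) and the explicit empty-set vertex are assumptions made for illustration; only the parent/child rule and the pair of counts per vertex come from the description above.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class LatticeNode:
    itemset: frozenset
    count_appearing: float = 0      # accumulated appearing count
    count_ub: float = 0             # accumulated upper-bound count
    parents: list = field(default_factory=list, repr=False)
    children: list = field(default_factory=list, repr=False)


def build_lattice(candidates):
    """Link every candidate to its parents (the itemsets with one item fewer)."""
    nodes = {frozenset(c): LatticeNode(frozenset(c)) for c in candidates}
    nodes.setdefault(frozenset(), LatticeNode(frozenset()))  # bottom vertex
    for itemset, node in nodes.items():
        for item in itemset:
            parent = nodes.get(itemset - {item})
            if parent is not None:
                parent.children.append(node)
                node.parents.append(parent)
    return nodes


def descendants(node):
    """All proper supersets of `node` reachable through child edges."""
    found = {}
    stack = list(node.children)
    while stack:
        child = stack.pop()
        if child.itemset not in found:
            found[child.itemset] = child
            stack.extend(child.children)
    return list(found.values())


nodes = build_lattice(["A", "B", "C", "AB", "AC", "BC", "ABC"])
supersets_of_c = {"".join(sorted(n.itemset)) for n in descendants(nodes[frozenset("C")])}
print(sorted(supersets_of_c))  # ['ABC', 'AC', 'BC']
```

Following child edges from a vertex is what lets one selected itemset update all candidates sharing it as a proper subset in a single pass.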
Example 6-11: Consider the candidate itemsets illustrated in Example 6-6. The
lattice to represent the candidate itemsets is illustrated in Figure 6-5, where the vertex labeled “Null” denotes the greatest lower bound of the lattice.
Figure 6-5: The lattice to represent the candidate itemsets illustrated in Example 6-6
The lattice structure is used to efficiently find the appearing and upper-bound counts of candidate itemsets in each tuple and to accumulate these values when the tuples are processed one by one. By the connected edges in the lattice structure, the proposed lattice-based NOM approach (called LNOM) can not only restrict the number of candidate itemsets to be examined, but also easily consider candidate itemsets with the same proper subsets at the same time. The detailed LNOM algorithm will be described in Section 6.6.3.
6.6.2 Using the Hashing Technique to Reduce Computation Cost Further
Many itemsets kept in matched tuples, especially negative itemsets, may be useless for calculating the counts of candidate itemsets. For example, the itemsets {D}, {F}, {AF} and {BF} kept in the matched tuples in Example 6-6 are not subsets of the candidate itemsets and can be omitted. Negative itemsets are formed by excluding frequent itemsets from the candidates which are generated in a level-wise way [27][85]. In other words, a negative itemset is a candidate itemset without enough support. In general, the set of candidate itemsets generated level by level is much larger than the set of frequent itemsets found, especially in the early stage of candidate generation [5][67]. The number of negative itemsets useless for calculating the counts of candidate itemsets may thus be large. In this section, we utilize the hashing technique [67] to filter out some of the useless itemsets that would otherwise be considered in Phase 1. We take the direct hashing function as an example to explain the idea. Let x = {a1, a2, …, an} denote an itemset consisting of n items (from a1 to an), order(ai) denote the serial number of the item ai among the entire set of items, and size(HT) denote the size of a given hash table HT. A direct hashing function for n-dimensional keys can be defined as follows:
h(x) = (order(a1) * order(a2) * …* order(an)) mod size(HT).
The hashing function is order-independent; that is, it generates the same hash value for all permutations of the items in an itemset. Each bucket of the hash table consists of only an integer representing how many candidate itemsets have been hashed into this bucket; 0 denotes that no candidate itemsets have been hashed into it. When initially obtaining the set of candidate itemsets, the NOM approach calculates their hash values, finds the corresponding hash buckets, and adds one to the value of the corresponding bucket for each candidate.
Example 6-12: For the candidate itemsets {A}, {B}, {C}, {AB}, {AC}, {BC}
and {ABC} obtained in Example 6-6, the LNOM approach will hash them into a given hash table HT. Without loss of generality, assume order(A) = 1, order(B) = 2 and order(C) = 3. Also assume the size of the hash table is 7. The hash values of these candidate itemsets will first be calculated. Take the itemset {AB} as an example. Its hash value is (order(A) * order(B)) mod 7, which is 2. The value in Bucket 2 is then increased by one. The other candidate itemsets are hashed in a similar way. The
resulting hash table is shown in Figure 6-6.
HT
Bucket number   0    1      2            3            4    5    6
Bucket value    0    1      2            2            0    0    2
Itemsets        -    {A}    {B}, {AB}    {C}, {AC}    -    -    {BC}, {ABC}
Figure 6-6: The hash table derived from the candidate itemsets illustrated in Example 6-6
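The bucket values of Figure 6-6 can be reproduced with a few lines of Python. The function names `direct_hash` and `build_hash_table` are illustrative; the item order and the table size of 7 are those assumed in Example 6-12.

```python
from math import prod


def direct_hash(itemset, order, table_size):
    """h(x) = (order(a1) * order(a2) * ... * order(an)) mod size(HT)."""
    return prod(order[item] for item in itemset) % table_size


def build_hash_table(candidates, order, table_size):
    """Each bucket counts how many candidate itemsets hash into it."""
    table = [0] * table_size
    for c in candidates:
        table[direct_hash(c, order, table_size)] += 1
    return table


order = {"A": 1, "B": 2, "C": 3}
candidates = [frozenset(c) for c in ("A", "B", "C", "AB", "AC", "BC", "ABC")]
table = build_hash_table(candidates, order, 7)
print(table)  # [0, 1, 2, 2, 0, 0, 2]
```

Because the hash is a product of serial numbers, it does not depend on the order in which the items of an itemset are listed.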
After a hash table is constructed from all the candidate itemsets, it can then be used to filter out some of the useless itemsets in a tuple. Tuples are processed one by one. When an itemset of a matched tuple is selected, the NOM approach calculates its hash value and finds its corresponding bucket. If the value stored in the target bucket is equal to 0, the itemset must be useless since it is not a candidate itemset, and it can be omitted directly. Otherwise, the itemset may be, but is not necessarily, a candidate itemset, and the candidate itemsets must be rescanned to determine whether it is one.
Furthermore, once an itemset has been confirmed to be a candidate, the value in its bucket is decreased by one. The next itemset of the same tuple is then checked against the modified hash table, which raises the probability that a useless itemset is filtered out. After a tuple is processed, the hash table is restored to its original state for use with the next tuple. This is illustrated by the following example.
Example 6-13: Continuing Example 6-12, after the hash table in Figure 6-6 has
been constructed, it can be used to filter out some useless itemsets in matched tuples.
For example, when Tuple 4 in Example 6-6 is checked, the itemset {C} with 2% support is selected first since it has the smallest support value among all the itemsets appearing in the tuple. The hash value of {C} is calculated as 3 and the value in Bucket 3 is 2, not 0. The itemset {C} is thus checked against the candidate itemsets and is found to be a candidate. It is then used to calculate the counts of the candidate {C} and its supersets in the lattice, namely {C}, {AC}, {BC} and {ABC}. As a result, the value in Bucket 3 is decreased by 2 due to {C} and {AC}, and the value in Bucket 6 is decreased to 0 due to {BC} and {ABC}. Bucket 6 in the modified hash table can then filter out the itemsets {F} and {AF} in Tuple 4 since its value has become zero. After that, the hash table is restored to the original one in Figure 6-6 for processing another matched tuple.
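The per-tuple filtering of Example 6-13 can be sketched as follows. Several details are assumptions rather than statements from the text: the function name `filter_tuple`, the serial numbers order(D) = 4 and order(F) = 6 (chosen to agree with the bucket numbers quoted in the example), and the positions of {F} and {AF} in the processing order, since their supports are not given. As in Example 6-13, a bucket is decremented once for the confirmed candidate and once for each candidate superset whose count it settles, and a working copy of the table stands in for the restore step.

```python
from math import prod


def direct_hash(itemset, order, size):
    return prod(order[i] for i in itemset) % size


def filter_tuple(itemsets, candidates, base_table, order):
    """Split the itemsets of one matched tuple into (useful, skipped).

    `itemsets` must already be sorted by ascending support.
    """
    table = list(base_table)       # working copy: base_table stays intact
    size = len(table)
    settled = set()                # candidates whose counts are already fixed
    useful, skipped = [], []
    for x in itemsets:
        if table[direct_hash(x, order, size)] == 0:
            skipped.append(x)      # bucket is empty: filtered without a rescan
            continue
        if x not in candidates:
            skipped.append(x)      # hash collision, rejected by the rescan
            continue
        useful.append(x)
        for y in candidates:       # x itself and the supersets it settles
            if x <= y and y not in settled:
                settled.add(y)
                table[direct_hash(y, order, size)] -= 1
    return useful, skipped


order = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 6}   # D and F: assumed values
candidates = {frozenset(c) for c in ("A", "B", "C", "AB", "AC", "BC", "ABC")}
base_table = [0, 1, 2, 2, 0, 0, 2]                 # Figure 6-6
tuple4 = [frozenset(c) for c in ("C", "D", "AB", "F", "AF", "B", "A")]
useful, skipped = filter_tuple(tuple4, candidates, base_table, order)
print(["".join(sorted(x)) for x in useful])   # ['C', 'AB', 'B', 'A']
print(["".join(sorted(x)) for x in skipped])  # ['D', 'F', 'AF']
```

Here {D} is rejected by the original table (Bucket 4 is 0), whereas {F} and {AF} are rejected only because processing {C} has already emptied Bucket 6.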
6.6.3 The LNOM Algorithm with a Direct Hashing Function
In Phase 1, by one scan of a given EMPR, the LNOM approach first collects the itemsets in the matched tuples satisfying the query support as candidates, constructs a corresponding lattice for considering candidate itemsets with the same proper subsets at the same time, and hashes them into a given hash table for filtering out some of the useless itemsets in matched tuples. The LNOM approach then processes matched tuples one by one, selects the itemsets in the order of ascending support values for each matched tuple, and checks whether they are useful for calculating the counts of candidates according to the values of their hash buckets. If the corresponding target bucket value is 0, the itemset is omitted. Otherwise, for each itemset x, the LNOM approach verifies whether x is a candidate by checking the set of candidate itemsets. If x is a candidate, the LNOM approach accumulates Count_x^appearing and each Count_y^UB in the lattice, where y denotes a proper superset of x (y is a descendant of x). This procedure is then repeated until all the matched tuples have been processed. After that, the LNOM approach can generate the candidate itemsets with appearing counts and upper-bound counts corresponding to the given mining request.
Example 6-14: Consider the mining request in Example 6-6. The LNOM
approach will construct the lattice shown in Figure 6-5 and the hash table shown in Figure 6-6. It then processes the first matched tuple and filters out (D, 2%) using the hash table. The remaining itemsets (ABC, 6%), (BC, 6%), (AC, 7%), (AB, 8%), (C, 9%), (A, 10%) and (B, 11%) are then processed in turn to update the counts of the corresponding itemsets in the lattice. After that, the LNOM approach processes the second matched tuple. Only the four itemsets (C, 2%), (AB, 3%), (B, 6%) and (A, 8%) need to be processed after the hash-table checking. (C, 2%) is selected first, and is used to update not only the appearing count of {C} but also the upper-bound counts of its proper supersets ({AC}, {BC} and {ABC}). The updated lattice after processing all the matched tuples is shown in Figure 6-7.
Figure 6-7: The updated lattice after processing all matched tuples
Next, Phase 2 proceeds to prune candidates in a level-wise way. Candidate 1-itemsets are handled first. If the upper-bound support of a candidate 1-itemset is less than the query support, it and its proper supersets are removed from the lattice. If a candidate 1-itemset appears in all the matched tuples and its upper-bound support is larger than or equal to the query support, then it is put into the set of final frequent itemsets and removed from the lattice. This procedure is repeated level by level until all the candidate itemsets have been processed. After Phase 2, the remaining candidate itemsets in the lattice have enough upper-bound support but do not appear in at least one matched tuple. The LNOM approach thus re-processes them against the underlying blocks of data for the matched tuples in which they do not appear to get their actual supports. After all the frequent itemsets are found, the association rules can then be easily generated from them. The detailed algorithm of the LNOM approach with a direct hashing function is stated in Figure 6-8.
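The level-wise pruning of Phase 2 can be sketched as below. This is only an illustration under assumptions: the function name `phase2_prune` and its input format are ours, the upper-bound support is taken to be (appearing count + upper-bound count) / Match_Trans, and all the counts in the usage example are made up.

```python
def phase2_prune(candidates, match_trans, query_support, appears_everywhere):
    """Level-wise pruning of the candidates accumulated in Phase 1.

    candidates:         dict {frozenset: (count_appearing, count_ub)}
    appears_everywhere: set of candidates found in every matched tuple
    Returns (frequent, to_rescan).
    """
    remaining = dict(candidates)
    frequent = set()
    max_size = max((len(c) for c in remaining), default=0)
    for level in range(1, max_size + 1):
        for x in [c for c in remaining if len(c) == level]:
            appearing, ub = remaining[x]
            # Assumed definition of the upper-bound support.
            ub_support = (appearing + ub) / match_trans
            if ub_support < query_support:
                # Remove x together with all its supersets.
                for y in [c for c in remaining if x <= c]:
                    del remaining[y]
            elif x in appears_everywhere:
                frequent.add(x)        # its count is exact, so it is frequent
                del remaining[x]
    return frequent, set(remaining)


# Hypothetical accumulated counts over 1000 matched transactions.
cands = {
    frozenset("A"): (100, 0),
    frozenset("B"): (40, 0),
    frozenset("C"): (30, 40),
    frozenset("AB"): (0, 60),
    frozenset("AC"): (0, 60),
}
frequent, to_rescan = phase2_prune(cands, 1000, 0.05, {frozenset("A"), frozenset("B")})
print(sorted("".join(sorted(x)) for x in frequent))   # ['A']
print(sorted("".join(sorted(x)) for x in to_rescan))  # ['AC', 'C']
```

In this made-up run {B} falls below the query support and takes {AB} with it, {A} is accepted outright, and {C} and {AC} are left for re-processing against the underlying data blocks.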
The LNOM approach with a direct hashing function:
INPUT: An EMPR based on an initial minimum support s, and a mining request q with a set of contexts cx_q, a minimum support s_q (s_q ≥ s) and a minimum confidence conf_q.
OUTPUT: A set of association rules satisfying the mining request q.
Phase 1: Generation of candidate itemsets:
STEP 1: Set C = φ and Match_Trans = 0, where C is a lattice used to maintain the set of candidate itemsets and Match_Trans is a variable used to keep the total number of transactions in the matched tuples which have been processed.
STEP 2: Initialize two equal-sized hash tables HT1 and HT2 with all the bucket values being zero.
STEP 3: For each tuple t in EMPR, do the following substeps: