Chapter 5 Indirect Associations Mining from Streaming Data
5.2 Algorithm MIA-LM
5.2.1 Algorithm description
The MIA-LM algorithm is described in Figure 5-2, and its supporting procedures are described in Figures 5-3, 5-4, 5-5, and 5-6. Four user-specified parameters must be given: σs, σf, σd, and ε. The MIA-LM algorithm first divides DS into blocks denoted as {B1, B2, ..., Bn}. Each block Bi consists of a set of transactions Bi = {t1, t2, ..., tk}, where k is set to 1/ε as suggested in [21]. Moreover, the total number of transactions N seen so far in the transaction data stream DS is defined as N = |B1| + |B2| + ... + |Bl|, l ≤ n. The remaining four phases of the proposed algorithm (Lines 4~8) are transaction projection, candidate IIS updating, ISFI-forest pruning, and indirect association mining.
Algorithm Name: MIA-LM(DS, σs, σf, σd, ε)
Input: Transaction data stream DS, an itempair support threshold σs, a mediator support threshold σf, a mediator dependence threshold σd, and a support error threshold ε.
Output: Indirect Association Patterns IA.
Step:
1: Determine the block size as 1/ε;
2: Divide DS into blocks; each block contains 1/ε transactions;
3: foreach Bi do
4: TransactionProjection(Bi, i);
5: UpdateIIS(ISFI-forest, candidateIIS-list, σs, N);
6: PruneISFI-forest(ISFI-forest, i);
7: if user query request = true then
8: IndirectAssociationMining(ISFI-forest, candidateIIS-list, σf, σd, σs, N);
9: endfor
Figure 5-2 The MIA-LM algorithm.
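The block division in Lines 1~2 of Figure 5-2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the name split_into_blocks is hypothetical, DS is modeled as a plain iterable, and 1/ε is assumed to be (close to) an integer.

```python
from itertools import islice


def split_into_blocks(stream, epsilon):
    """Yield (block_id, block) pairs; each block holds 1/epsilon transactions.

    Assumption: 1/epsilon is (close to) an integer, so rounding is safe.
    The last block may be shorter if the stream ends early.
    """
    block_size = max(1, round(1 / epsilon))
    iterator = iter(stream)
    block_id = 0
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        block_id += 1  # block ids start at 1, as in B1, B2, ..., Bn
        yield block_id, block
```

With ε = 0.2, a stream of twelve transactions would be split into two full blocks of five transactions and one partial block of two.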
Function Name: TransactionProjection(Bi, i)
Input: Transactions in block Bi = {T1, T2, ..., Th, ..., Tk} and the current block-id i.
Output: An updated ISFI-forest.
Step:
procedure TransactionProjection(Bi, i)
1: foreach transaction t = { x1, x2, …, xm} in Bi do
Figure 5-3 The descriptions of transaction projection function.
Function Name: UpdateIIS (ISFI-forest, candidateIIS-list, σs, N)
Input: An ISFI-forest, a candidateIIS-list, an itempair support threshold σs and the current transaction size N.
Output: An updated candidateIIS-list.
Step:
1: procedure UpdateIIS (ISFI-forest, candidateIIS-list, σs, N)
2: foreach 2-itemset X in ISFI-forest do
3: if candidateIIS-list has an entry Y && Y.items = X.items then
Figure 5-4 The descriptions of UpdateIIS function.
Function Name: PruneISFI-forest (ISFI-forest, i)
Input: An ISFI-forest and the current block id i.
Output: An ISFI-forest which contains the set of all frequent itemsets.
Step:
1: procedure PruneISFI-forest (ISFI-forest, i)
2: foreach itemset X in ISFI-forest except 1-itemsets do
3: if X.count < i - X.block_id + 1 then
4: prune X;
5: endprocedure
Figure 5-5 The descriptions of Prune ISFI-forest function.
Function Name: IndirectAssociationMining(ISFI-forest, candidateIIS-list, σf, σd, σs, N)
Input: An ISFI-forest, a candidateIIS-list, a mediator support threshold σf, a mediator dependence threshold σd, an itempair support threshold σs and the current transaction size N.
Output: Indirect Association Patterns IA.
Step:
procedure IndirectAssociationMining(ISFI-forest, candidateIIS-list, σf, σd, σs, N)
1: IA = ∅
2: MSS-list = ∅
3: foreach itempair {x, y} in candidateIIS-list do
4: if count of itempair {x, y} < σs*N then
Figure 5-6 The descriptions of indirect association mining function.
In the transaction projection phase, each transaction is loaded from the current block and projected to a set of item-suffix transactions. That is, a transaction t = {x1, x2, ..., xm} is projected to m item-suffix transactions: t1 = {x1, x2, ..., xm}, t2 = {x2, x3, ..., xm}, ..., tm-1 = {xm-1, xm} and tm = {xm}. These item-suffix transactions are then inserted into the corresponding IS-tree and FI-list according to the first item of each item-suffix transaction.
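The projection step can be illustrated with a short Python sketch. The function names are hypothetical, items are assumed to be kept in lexicographic order, and the grouping by first item only stands in for the insertion into the IS-tree and FI-list, whose internal structure is not modeled here.

```python
from collections import defaultdict


def project_transaction(transaction):
    """Return the item-suffix transactions of one transaction.

    Assumption: items are kept in lexicographic order.
    """
    items = sorted(transaction)
    return [items[i:] for i in range(len(items))]


def group_by_first_item(block):
    """Group every item-suffix transaction of a block by its first item.

    Each group corresponds to the input of one item's IS-tree.
    """
    groups = defaultdict(list)
    for transaction in block:
        for suffix in project_transaction(transaction):
            groups[suffix[0]].append(suffix)
    return dict(groups)
```

For a transaction {A, B, D}, project_transaction returns the three suffixes {A, B, D}, {B, D} and {D}, one per item.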
The main task of the candidate IIS updating phase is to maintain the candidate Indirect Itempair Set (candidate IIS), which is used for deriving indirect associations in the last phase. Each 2-itemset in the ISFI-forest whose count is less than σs*N is first put into candidateIIS-list. If a 2-itemset already exists in candidateIIS-list, its count is updated. Because deleting existing 2-itemsets from candidateIIS-list in this phase may lead to false positive indirect itempairs, MIA-LM postpones the deletion until the fourth phase.
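A minimal sketch of this update rule is given below. It assumes, purely for illustration, that the 2-itemset counts have already been read from the ISFI-forest into a dictionary; the function name and the dictionary representation are not part of the original algorithm description.

```python
def update_candidate_iis(pair_counts, candidate_iis, sigma_s, n_seen):
    """Update the candidate IIS from the current 2-itemset counts.

    pair_counts:   dict mapping a 2-itemset (frozenset) to its count
    candidate_iis: dict mapping a 2-itemset to its last known count
    Entries are added when count < sigma_s * n_seen, updated when they
    already exist, and never deleted in this phase.
    """
    threshold = sigma_s * n_seen
    for pair, count in pair_counts.items():
        if pair in candidate_iis or count < threshold:
            candidate_iis[pair] = count
    return candidate_iis
```

Note that an itempair whose count later rises above σs*N stays in the list with its updated count; it is filtered out only when indirect associations are mined.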
Memory usage is very important in a data stream mining environment. The purpose of the ISFI-forest pruning phase is to prune itemsets whose counts are less than the threshold ε·k·(currentblockid - X.block_id + 1), where currentblockid denotes the current block and X.block_id denotes the block in which itemset X was inserted. Since the block size k is chosen to be the inverse of the support error threshold ε, the threshold equals (currentblockid - X.block_id + 1). If the count of an itemset is less than this threshold, the itemset is unlikely to be frequent in subsequent blocks and is considered an infrequent itemset. In this case, the itemset is pruned to save memory space.
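The pruning test of Figure 5-5 can be sketched as follows. The flat dictionary used here is an assumption made for brevity; it stands in for the ISFI-forest, and the function name is hypothetical.

```python
def prune_itemsets(itemsets, current_block_id):
    """Return the itemsets that survive pruning.

    itemsets: dict mapping an itemset (frozenset) to (count, block_id),
              where block_id is the block in which it was inserted.
    1-itemsets are always kept; any other itemset X is pruned when
    X.count < current_block_id - X.block_id + 1.
    """
    kept = {}
    for itemset, (count, block_id) in itemsets.items():
        threshold = current_block_id - block_id + 1
        if len(itemset) == 1 or count >= threshold:
            kept[itemset] = (count, block_id)
    return kept
```

For instance, a 2-itemset inserted in block 1 with count 1 falls below the threshold of 2 when block 2 is processed, while a 1-itemset with the same count is retained.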
In the indirect association mining phase, the proposed approach generates the indirect association patterns and returns them to the user by using the ISFI-forest and candidate IIS, as HI-mine does. For each itempair {x, y} in candidate IIS, it first uses the ISFI-forest to generate the corresponding frequent itemsets. If the mediator dependences of these frequent itemsets are larger than the mediator dependence threshold σd, they are put into the mediator support sets, denoted as MSS(x) and MSS(y). If an itemset Z appears in both MSS(x) and MSS(y), then an indirect association {x, y} | Z is generated.
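The final combination step can be sketched as follows, assuming the mediator support sets have already been computed and are passed in as a dictionary. The function name, the tuple output format, and the precomputed mss argument are illustrative assumptions; the check against σs*N reflects the deletion that is deferred to this phase.

```python
def mine_indirect_associations(candidate_iis, mss, sigma_s, n_seen):
    """Combine candidate itempairs with mediator support sets.

    candidate_iis: dict mapping frozenset({x, y}) to its count
    mss:           dict mapping an item to a set of mediator itemsets
    Returns a list of (x, y, Z) tuples, one per pattern {x, y} | Z.
    """
    patterns = []
    for pair, count in candidate_iis.items():
        if count >= sigma_s * n_seen:
            continue  # the itempair is now frequent, so it is skipped
        x, y = sorted(pair)
        # A mediator must appear in both MSS(x) and MSS(y).
        for mediator in mss.get(x, set()) & mss.get(y, set()):
            patterns.append((x, y, mediator))
    return patterns
```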
5.2.2 An Example
Assume that we have a transaction data stream and the support error threshold is set at ε = 20%. The transaction data stream is divided into blocks, each of which contains five (= 1/0.2) transactions. The first two blocks are shown in Figure 5-7.
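Block    TID     Items
Block1   TID1    A, B, D
Block1   TID2    B, C, D
Block1   TID3    C, D
Block1   TID4    B, D, E
Block1   TID5    B, D
Block2   TID6    A, C
Block2   TID7    A, D, E
Block2   TID8    B, C
Block2   TID9    C, D, E
Block2   TID10   A, D, E
Block3   ...
(Transactions are listed in order of arrival time.)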
Figure 5-7 The first two blocks of transaction data stream.
The first transaction TID1 = {A, B, D} in Block1 is projected to three item-suffix transactions: {A, B, D}, {B, D} and {D}. These item-suffix transactions are inserted into the corresponding IS-tree and FI-list according to the first item of each item-suffix transaction. The updated ISFI-forest is shown in Figure 5-8.
[Figure: FI-list with columns Item, Count and Block-id, alongside the IS-tree.]
Figure 5-8 The updated ISFI-forest after inserting TID1.
In the candidate IIS updating phase, after the transactions in the current block have been processed, the proposed algorithm generates the candidate indirect itempair set (candidate IIS) from the ISFI-forest. Take the IS-tree of item A as an example: two 2-itemsets with their counts can be generated, <{A, B}:1> and <{A, D}:1>. Assume the itempair support σs is set at 0.25. Since the number of transactions N processed so far is five, the threshold that decides whether an itempair is kept is 1.25 (= 0.25*5). Because the counts of these two 2-itemsets are smaller than 1.25, they are put into candidate IIS. In the same way, other possible candidate indirect itempairs of items B, C, D and E can also be generated.
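The arithmetic of this step can be checked with a small Python sketch over the five transactions of Block1. Brute-force pair counting is used here as an assumption in place of the ISFI-forest, and the names are illustrative only.

```python
from collections import Counter
from itertools import combinations

# The five transactions of Block1 (Figure 5-7).
BLOCK1 = [
    {"A", "B", "D"},
    {"B", "C", "D"},
    {"C", "D"},
    {"B", "D", "E"},
    {"B", "D"},
]


def count_itempairs(transactions):
    """Count every 2-itemset by brute force (stand-in for the IS-trees)."""
    counts = Counter()
    for transaction in transactions:
        for pair in combinations(sorted(transaction), 2):
            counts[pair] += 1
    return counts


sigma_s = 0.25
pair_counts = count_itempairs(BLOCK1)
threshold = sigma_s * len(BLOCK1)  # 0.25 * 5
candidate_iis = {p: c for p, c in pair_counts.items() if c < threshold}
```

Under these assumptions, the itempairs {A, B} and {A, D} both have count 1 and enter candidate IIS, whereas {B, D}, with count 4, does not.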
In the ISFI-forest pruning phase, itemsets other than 1-itemsets are pruned from the ISFI-forest if their counts are less than (currentblockid - X.block_id + 1). Take the IS-tree of item A as an example. Two 2-itemsets with their counts and block ids can be generated as {A, B}:1:1 and {A, D}:1:1. Since currentblockid = 1, an itemset will be pruned if its count is less than 1 (= 1 - 1 + 1). In this example, itemsets {A, B} and {A, D} are pruned from the IS-tree of item A. By repeating the transaction projection, candidate IIS updating and ISFI-forest pruning phases, Block2 can also be processed. The resulting candidate IIS and ISFI-forest are shown in Figure 5-9.
[Figure: FI-list with columns Item, Count and Block-id (including the entry A:4:1), alongside the IS-tree.]
Figure 5-9 The candidate IIS and ISFI-forest after processing Block1 and Block2.
Assume that the user issues a query request after Block2 has been processed. The indirect association mining function then derives indirect associations by checking each itempair in candidate IIS against the ISFI-forest. Take itempair {B, E}:1 in candidate IIS as an example. The function first finds the mediator support sets of items B and E, i.e., MSS(B) and MSS(E). Suppose σf and σd are set at 0.3 and 0.5, respectively. From the IS-tree of item B, two candidate itemsets {B, D} and {B, C} can be generated. The count of itemset {B, D} is larger than 3 (= N*σf = 10*0.3), and the value of IS({B, D}) is 0.632 (= 0.4/(0.5*0.8)^(1/2)), which is also larger than 0.5. Thus, itemset {D} is put into MSS(B). In the same way, we can derive MSS(E), which contains itemset {D}. Thus, from MSS(B) and MSS(E), we obtain the indirect association {B, E} | {D}, which is then output to the user.
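The numbers in this example can be reproduced with the sketch below, which recomputes supports by brute force over the ten transactions of Figure 5-7 instead of traversing the ISFI-forest. The function names are hypothetical, mediators are restricted to single items as in the example, and the IS formula is taken from the computation shown above: the support of the pair divided by the square root of the product of the individual supports.

```python
from math import sqrt

# The ten transactions of Block1 and Block2 (Figure 5-7).
STREAM = [
    {"A", "B", "D"}, {"B", "C", "D"}, {"C", "D"}, {"B", "D", "E"}, {"B", "D"},
    {"A", "C"}, {"A", "D", "E"}, {"B", "C"}, {"C", "D", "E"}, {"A", "D", "E"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def is_measure(x, y, transactions):
    """IS dependence: sup({x, y}) / sqrt(sup({x}) * sup({y}))."""
    n = len(transactions)
    sup_xy = count({x, y}, transactions) / n
    sup_x = count({x}, transactions) / n
    sup_y = count({y}, transactions) / n
    return sup_xy / sqrt(sup_x * sup_y)


def mediator_support_set(item, transactions, sigma_f, sigma_d):
    """Single-item mediators m of item: count({item, m}) > sigma_f * N
    and IS(item, m) > sigma_d (single-item restriction is an assumption)."""
    n = len(transactions)
    others = set().union(*transactions) - {item}
    return {
        m for m in others
        if count({item, m}, transactions) > sigma_f * n
        and is_measure(item, m, transactions) > sigma_d
    }


mss_b = mediator_support_set("B", STREAM, 0.3, 0.5)
mss_e = mediator_support_set("E", STREAM, 0.3, 0.5)
common_mediators = mss_b & mss_e
```

Here the supports of B, D and {B, D} are 0.5, 0.8 and 0.4, which gives the IS value of about 0.632 quoted in the text, and D is the only mediator shared by B and E.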