Hi-mine and Hi-Mine* - Indirect Association Mining

Chapter 3 Related Work

3.2 Indirect Association Mining

3.2.2 Hi-mine and Hi-Mine*

In [27], Wan and An proposed a more efficient approach, called HI-mine, which reduces the costs incurred by the INDIRECT algorithm in indirect association mining. The structure used in the HI-mine algorithm, called HI-struct, is a dynamic projection of a transaction database. Table 3-1 shows an example transaction database TDB, and Figure 3-2 shows the initial HI-struct of TDB, which maintains all indexes of the frequent-item projections with the same first item. The corresponding HI-struct projected on frequent item D is depicted in Figure 3-3. Because the HI-struct avoids the cost of generating a large number of candidates, it makes mining indirect association rules more efficient.

Table 3-1 An example transaction database TDB

IDs List of item

Figure 3-2 The initial HI-struct of TDB

Figure 3-3 The corresponding HI-struct of TDB projected on item D.

The HI-mine* algorithm [28], also proposed by Wan and An, is an enhancement of the HI-mine algorithm. HI-mine* adopts a more compact data structure called the Super Compact Transaction Database (STDB), on which several optimization strategies are introduced, including a single database scan, direct frequent-item projection, and dynamic infrequent-item pruning. The STDB is a compressed form of the transaction data that is intended to be small enough to fit into memory.

Any transactions that differ only in the last item are combined into a new transaction in STDB, which, like a transaction in CTDB, is composed of two parts, head and body. Unlike in CTDB, however, the body is further divided into two parts, front and back: the front stores the prefix items of the compressed transactions (all items except the last), and the back stores the last items along with their counts.

Let I be the set of items. Given a transaction database D and three thresholds: σ_s, σ_f and σ_d, as in Definition 1, we describe below the definitions of Indirect Itempair Set and Mediator Support Set that will be used in the HI-mine* algorithm.

Definition 2

Indirect Itempair Set: The indirect itempair set (IIS) of D is defined as:

Definition 3

Mediator Support Set: The mediator support set (MSS) of an item x is defined as:

The HI-mine* algorithm consists of three phases. In the first phase, it transforms the original transaction database into an STDB through four intermediate steps: transferring the transaction database into a CT-tree, transferring the CT-tree into the CTDB (Compact Transaction Database), transferring the CTDB into another CT-tree, and finally transferring that CT-tree into the STDB. In the second phase, it builds an HI-struct from the STDB and then dynamically adjusts and mines the HI-struct to compute the IIS and MSSs. The HI-struct consists of a header table of the STDB, which stores the frequent items and the links pointing to the transactions in the STDB where each item appears, and a dynamically changing table that stores the resulting projection of the STDB on that item.

In the third phase, the complete set of indirect associations is generated from IIS and MSSs.

Figure 3-4 illustrates the corresponding CT-tree, CTDB, STDB, and HI-struct projected on item D of the transaction database in Table 3-1.

(a) CT-Tree of Table 3-1 TDB (b) The compact transaction database

(c) The CT-tree of CTDB

(e) HI-struct projected on item D of the TDB in Table 3-1

Figure 3-4 The corresponding CT-tree, CTDB, STDB, and HI-struct projected on item D of the TDB in Table 3-1.

Chapter 4 Indirect Associations Mining from Static Data

In the study of HI-mine* we observed some shortcomings:

1) The construction of STDB requires many data conversion steps, and the cost for data conversion is proportional to the transaction length, i.e., the longer the transaction length is, the more time the conversion will require.

2) In the stage of building STDB, each transaction needs to be re-sorted in descending order of item frequency to sustain the compression ratio. In a streaming data environment, however, data arrive continuously and without bound, so we cannot count item frequencies in advance to reorder the transactions.

3) The process of mining MSSs requires a large number of itemset comparisons, each of which is quite time consuming. Although the comparisons are inevitable, their number can be significantly reduced.

4.1 Algorithm EMIA

4.1.1 Basic Concept

Our proposed approach, the EMIA (Efficient Mining of Indirect Association) algorithm, is a modification of the HI-mine* algorithm. We use a data structure similar to the STDB used in HI-mine*, with the intention of adopting the same concept of data compression.

The data structures used in the EMIA algorithm include (shown in Figure 4-1):

• CTT: The CTT (Compact Transactions Table) is used for merging identical transactions and combining similar transactions. Each row has four parts: the left part, which contains the set of items shared by the merged transactions; the right part, which lists the differing last items with their count values; the count, which is the sum of the count values in the right part; and the index, a unique value that identifies the row.

• HT: The HT (Hash Table) stores the hashed value of the left part of each transaction as the key, together with a link, i.e., a pointer to the corresponding row in the CTT.

• Item-list: The Item-list keeps every item together with its frequency and links to the corresponding rows in the CTT.

• MSS-list: The MSS-list is used for storing candidate mediators.

• IIS-list: The IIS-list is used for storing indirect itempair sets.

[Figure 4-1 panels: (a) Hash Table (Key, Link); (b) Compact Transactions Table (Index, Count, Left, Right); (c) Item-list (Item, Count, Links); (d) IIS-list; (e) MSS-list (mediator item, {itemset})]

Figure 4-1 The data structures used in the EMIA algorithm.
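The five structures listed above can be sketched as plain Python types. This is only an illustrative layout under our own assumptions: the class and field names (CTTRow, ItemEntry, EMIAState, and so on) are hypothetical, and a built-in dict stands in for the hash table HT.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple


@dataclass
class CTTRow:
    """One row of the Compact Transactions Table (hypothetical layout)."""
    index: int                                   # unique row identifier
    left: Tuple[str, ...]                        # shared items (all but the last)
    right: Dict[str, int] = field(default_factory=dict)  # last item -> count
    count: int = 0                               # sum of the counts in `right`


@dataclass
class ItemEntry:
    """One entry of the Item-list."""
    count: int = 0                               # item frequency
    links: Set[int] = field(default_factory=set)  # indexes of CTT rows with the item


@dataclass
class EMIAState:
    """All structures of Figure 4-1 gathered in one place (an assumption)."""
    ctt: List[CTTRow] = field(default_factory=list)
    ht: Dict[Tuple[str, ...], int] = field(default_factory=dict)  # left part -> row index
    item_list: Dict[str, ItemEntry] = field(default_factory=dict)
    mss_list: Dict[str, Set[FrozenSet[str]]] = field(default_factory=dict)
    iis_list: Set[FrozenSet[str]] = field(default_factory=set)
```

Keeping the left part as a tuple makes it hashable, so it can serve directly as the HT key.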

We also address the shortcomings of HI-mine* as follows.

1) We do not re-sort transactions by item frequency; instead, we sort the items of each transaction in alphabetical order.

2) We no longer need the four conversion stages through the CT-tree, CTDB and STDB to generate the Frequent-Item Projection Table. Instead, we use the HT to quickly determine whether a left part already exists. If it is found, we can quickly update the data in the CTT.

3) While building the CTT, we also add the pointer links to the CTT rows into another new structure, namely the Item-list, to reduce the number of comparisons.

4.1.2 Algorithm Detail

The EMIA algorithm is described in Figure 4-2. It takes three user-specified parameters, σ_s, σ_f, and σ_d. The algorithm consists of three main phases: compact transactions table construction, MSSs and IIS construction, and indirect association generation. In what follows, we describe the process of each phase.

Algorithm Name: EMIA(D, σ_s, σ_f, σ_d)

Input: Transaction data D, an itempair support threshold σs, a mediator support threshold σf and a mediator dependence threshold σd

Output: Indirect Association Patterns IA.

Step:

1: N ← |D|

2: CompactTransactionsTableConstruction (D, CTT, HT, Item-list)

3: MSSsandIISConstrucation (Item-list, σ_f, σ_d, σ_s, N)

4: IndirectAssociationGeneration (IIS-list, MSS-list)

Figure 4-2 The EMIA algorithm

Procedure Name: CompactTransactionsTableConstruction (D, CTT, HT, Item-list)

Input: Transaction data D, a compact transactions table CTT, a hash table HT and an Item-list

Output: Updated CTT and Item-list

Step:

Figure 4-3 The descriptions of procedure TrasactionProjection

Procedure Name: MSSsandIISConstrucation (Item-list, σ_s, σ_f, σ_d, N, MSS-list, IIS-list)

Input: Item-list, an itempair support threshold σ_s, a mediator support threshold σ_f, a mediator dependence threshold σ_d, the number of transactions N, a mediator support set list MSS-list, and an indirect itempair set list IIS-list

Output: Mediator support set list MSS-list and indirect itempair set list IIS-list

Step:

Figure 4-4 The description of procedure MSSsandIISConstrucation

Procedure Name: FindMSSsandIIS (HS, chs, L, σ_f, σ_d, σ_s, N, Item-list, MSS-list, IIS-list)

Input: Item-list, an itempair support threshold σ_s, a mediator support threshold σ_f, a mediator dependence threshold σ_d, the number of transactions N, an MSS-list, and an IIS-list

Output: Updated mediator support set list MSS-list and indirect itempair set list IIS-list

Step:

Figure 4-5 The description of procedure FindMSSsandIIS

Procedure Name: IndirectAssociationGeneration (MSS-list, IIS-list)

Input: Mediator support set list MSS-list, indirect itempair set list IIS-list

Output: Indirect Association Patterns IA.

Step:

1: procedure IndirectAssociationMining (MSS-list, IIS-list)

2: foreach itempair {x, y} in IIS-list do

Figure 4-6 The description of procedure IndirectAssociationGeneration

In the compact transactions table construction phase, each transaction is split into two parts, left and right: the right part corresponds to the last element of the transaction and the left part stores the other elements. That is, if transaction t = {x₁, x₂, ..., x_m}, we split t into left = {x₁, x₂, ..., x_{m-1}} and right = {x_m}. Next, we search the CTT to see whether any transaction with the same left part as t is already maintained there. To speed up the search, we adopt hashing and store the index pointing to the corresponding row of the CTT in the HT, with the key storing the left part of t and the link storing the index. If the search fails, we insert the left part and right part of t into the CTT; otherwise, we update the corresponding count field in the CTT, which denotes the number of transactions having the same left part, and the count of the item stored in the right field. At the same time, each item in transaction t, along with the index in the CTT, is inserted into the Item-list if it is a newly observed item; otherwise, its count is updated.
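The construction phase described above can be sketched in Python as follows. This is a minimal sketch, not the procedure of Figure 4-3: the function name build_ctt and the dictionary layout are our own assumptions, a built-in dict plays the role of the HT, and a transaction with a single item is assumed to update only the Item-list, as in the worked example of Section 4.1.3.

```python
def build_ctt(transactions):
    """Project transactions into CTT, HT and Item-list (illustrative sketch)."""
    ctt = {}        # row index -> {"left": tuple, "right": {item: count}, "count": int}
    ht = {}         # left part (tuple) -> row index; a dict stands in for the hash table
    item_list = {}  # item -> {"count": int, "links": set of row indexes}

    for t in transactions:
        items = sorted(set(t))           # alphabetical order, no frequency re-sorting
        if not items:
            continue
        left, last = tuple(items[:-1]), items[-1]

        index = None
        if left:                         # assumption: one-item transactions skip the CTT
            index = ht.get(left)
            if index is None:            # left part not seen before: create a new row
                index = len(ctt) + 1
                ht[left] = index
                ctt[index] = {"left": left, "right": {}, "count": 0}
            row = ctt[index]
            row["right"][last] = row["right"].get(last, 0) + 1
            row["count"] += 1

        for item in items:               # update frequency and links of every item
            entry = item_list.setdefault(item, {"count": 0, "links": set()})
            entry["count"] += 1
            if index is not None:
                entry["links"].add(index)

    return ctt, ht, item_list
```

Because the lookup of the left part is a single hash probe, identical prefixes are merged without scanning the CTT.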

After every transaction in D has been transformed and inserted into the CTT, we remove infrequent items from the CTT and the Item-list. Only frequent items are needed to construct the MSSs and IIS, so removing the infrequent items further improves performance.

The core phase, MSSs and IIS construction, is responsible for generating candidate MSSs and IIS using a divide-and-conquer strategy. In this phase, we process each item in the Item-list, forming the subset of CTT rows in which the item appears in either the left or the right field. Assume there are n frequent items in the Item-list = {i₁, i₂, ..., i_n}. At the first level, the CTT is partitioned into n subsets by following the links field of the Item-list. At each further level, the subsets are partitioned into smaller subsets until no subset has a support of at least σ_f.

For example, if we obtain i₁'s subset of the CTT from the Item-list, we continue to partition it as follows: (1) the rows containing itemset {i₁, i₂}; (2) the rows containing itemset {i₁, i₃}; (3) the rows containing itemset {i₁, i₄}; ...; and finally the rows containing itemset {i₁, i_n}.

In order to find the IIS and MSSs that contain itemset {a₁, a₂}, a subset S₁₂ is created. In S₁₂, we first calculate the support of itemset {a₁, a₂} and check whether the support is greater than or equal to σ_f and whether the dependency of {a₁, a₂} is greater than σ_d. If so, then a₁ and a₂ are candidate mediators, and we add a₂ to MSS(a₁) and a₁ to MSS(a₂). Second, we partition S₁₂ into smaller subsets S₁₂₃, S₁₂₄, ..., S_12n to find other mediators, until there is no subset with a support of at least σ_f. On the other hand, if the support of {a₁, a₂} is less than both σ_f and σ_s, then itemset {a₁, a₂} is a candidate IIS and we add {a₁, a₂} to the IIS-list.
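The recursive partitioning can be illustrated with the sketch below. It rests on several assumptions of ours and is not the procedure of Figure 4-5: it scans plain transactions instead of CTT projections, it uses the IS measure as the dependence, it applies the dependence test between each item and the remaining items of a frequent itemset, and it treats a dependence equal to σ_d as passing. The name find_mss_and_iis is hypothetical.

```python
from math import sqrt


def find_mss_and_iis(transactions, sigma_s, sigma_f, sigma_d):
    """Depth-first search for candidate mediators and itempairs (sketch)."""
    db = [frozenset(t) for t in transactions]
    n = len(db)

    def sup(itemset):
        # Relative support: fraction of transactions containing the itemset.
        return sum(1 for t in db if itemset <= t) / n

    # Keep only frequent items, in alphabetical order.
    items = sorted(i for i in set().union(*db) if sup(frozenset([i])) >= sigma_f)
    mss = {i: set() for i in items}   # item -> set of mediator itemsets
    iis = set()                       # candidate indirect itempairs

    def extend(prefix, start):
        for k in range(start, len(items)):
            itemset = prefix | {items[k]}
            s = sup(itemset)
            if s >= sigma_f:
                # Assumption: test each item against the rest of the itemset.
                for x in itemset:
                    mediator = itemset - {x}
                    dep = s / sqrt(sup(frozenset([x])) * sup(mediator))
                    if dep >= sigma_d:
                        mss[x].add(mediator)
                extend(itemset, k + 1)        # partition further
            elif len(itemset) == 2 and s < sigma_s:
                iis.add(itemset)              # below both thresholds

    for k, item in enumerate(items):
        extend(frozenset([item]), k + 1)
    return mss, iis
```

The recursion stops on a branch as soon as the support drops below σ_f, which mirrors the stopping condition of the partitioning.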

In the indirect association generation phase, we use each candidate IIS from the IIS-list to find mediators. For example, assume that itempair {a₁, a₂} is an IIS. Then we can find its mediators from MSS(a₁) and MSS(a₂). If MSS(a₁) and MSS(a₂) both contain {{a_m}, {a_m, a_n}}, then the two indirect associations discovered for itempair {a₁, a₂} are <a₁, a₂ | {a_m}> and <a₁, a₂ | {a_m, a_n}>.

4.1.3 An Example

In this section, we illustrate the EMIA algorithm using the example in Figure 4-7.

Suppose σ_s = σ_f = σ_d = 0.5, where σ_s, σ_f and σ_d are itempair support threshold, mediator support threshold and mediator dependence threshold, respectively.

IDs    List of items
001    A, B, C
002    A, B, C
003    B, C
004    B, D
005    A, B, D, E
006    E

Figure 4-7 An example transaction database

Transaction Projection Phase

In this phase, we project each transaction to three data structures: HT, CTT and Item-list.

The first transaction is {A, B, C}. Below are the steps for projecting this transaction.

1) The transaction is split into left={A, B} and right={C} and added into CTT.

2) Add key {A, B} and the pointer 1 to HT.

3) Add each item in this transaction and the corresponding pointer link in CTT to Item-list.

Figure 4-8 shows the result after processing the first transaction.

Figure 4-8 The structure, HT, CTT and Item-list after processing the first transaction.

The second transaction is the same as the first one, so we only have to update the count in CTT and Item-list as shown in Figure 4-9.

Figure 4-9 The resulting structures after second transaction {A, B, C} added.

Next, we process the third transaction. Since the left part {B} is not found, a new tuple is created in HT and CTT to accommodate this transaction.

The process continues until we reach a transaction that contains only one item, i.e., the sixth transaction. For it, all we need to do is update the count of item E in the Item-list. The resulting structure after this phase is shown in Figure 4-10.

Figure 4-10 The resulting structures after the transaction projection phase.

MSSs and IIS Construction

The first step of this phase is to delete infrequent items from the Item-list and CTT. As shown in Figure 4-11, D and E are infrequent items (their supports are less than σ_f), so we delete them from the Item-list and CTT.

Figure 4-11 After infrequent items are removed.
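The pruning step can be checked with a small sketch. It is an illustration only: it counts items over the raw transactions of Figure 4-7 rather than over the Item-list and CTT, and the helper names frequent_item_counts and prune are hypothetical.

```python
from collections import Counter


def frequent_item_counts(transactions, sigma_f):
    """Return {item: count} for the items whose support is at least sigma_f."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c / n >= sigma_f}


def prune(transactions, keep):
    """Drop every item that is not in `keep` from each transaction."""
    return [[item for item in t if item in keep] for t in transactions]


tdb = [["A", "B", "C"], ["A", "B", "C"], ["B", "C"],
       ["B", "D"], ["A", "B", "D", "E"], ["E"]]
frequent = frequent_item_counts(tdb, sigma_f=0.5)   # D and E fall below 0.5
pruned = prune(tdb, frequent)
```

With σ_f = 0.5 and N = 6, an item needs at least three occurrences to survive, which is why D and E (two occurrences each) are removed.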

We use a divide-and-conquer strategy to project the CTT into smaller tables with respect to the rows that contain the same frequent item. Consider Figure 4-11 for example. The tuples that contain the first item A in the Item-list are r₁ and r₃. The projection of CTT corresponding to item A is shown in Figure 4-12(a).

There are three frequent items A, B, and C in the Item-list, and we want to divide the projection of A further into smaller projections with respect to the other frequent items in order to compute the MSSs and IIS. In this example, we divide the projection of A by items B and C. The projection of {A, B} consists of r₁ and r₃ (shown in Figure 4-12(b)). Since the support is 0.5 (= count/N = 3/6), which passes the minimum support threshold 0.5, and the IS measure IS(A, B) is approximately 0.775 (= 3/(3×5)^(1/2)), which passes the minimum dependence threshold 0.5 as well, we add {A} to MSS(B) and {B} to MSS(A) (shown in Figure 4-13(b)).
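The dependence value used here can be reproduced with a short calculation. The sketch assumes the IS measure in its usual form, the joint count divided by the square root of the product of the individual counts; the function name is_measure is our own.

```python
from math import sqrt


def is_measure(count_xy, count_x, count_y):
    """IS measure from occurrence counts (same value as with relative supports)."""
    return count_xy / sqrt(count_x * count_y)


# {A, B} occurs in 3 transactions, A in 3 and B in 5 (Figure 4-7).
is_ab = is_measure(3, 3, 5)
# The same value results from relative supports, since N cancels out.
is_ab_from_supports = is_measure(3 / 6, 3 / 6, 5 / 6)
```

Because N cancels in the ratio, the measure can be computed from raw counts without dividing by the number of transactions first.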

Figure 4-12 The projections of CTT corresponding to item A, itemset AB, ABC, and AC

Then, the algorithm applies the same process recursively to the projection of {A, B} to determine whether larger itemsets belong to an MSS. Figure 4-12(c) shows the projection of {A, B, C}.

Since the support of {A, B, C} is smaller than σ_f, we stop the recursion and go back to process the projection of {A, C}. The support of {A, C} is 0.33 (= 2/6), which is less than both σ_f and σ_s. Therefore, {A, C} is a candidate IIS and is added into the IIS-list (shown in Figure 4-13(a)).

The division of {A} is now complete.

Figure 4-13 The resulting MSS-list and IIS-list

Repeating the above steps with {B} and {B, C} (shown in Figure 4-14), we find that {B} and {C} are mediators as well, i.e., {B} is added to MSS(C) and {C} to MSS(B). The final results after this phase are shown in Figure 4-15.

Figure 4-14 The subset of item B (CTT′, the projection of B)

Figure 4-15 The result of IIS-list and MSS-list

Indirect Association Mining Phase

The last phase of the EMIA algorithm generates the set of mediators for each indirect itempair in the IIS. For example, the set of mediators for itempair {A, C} in the IIS-list is computed by intersecting MSS(A) and MSS(C), which results in {{B}}. In this way, one indirect association is discovered in the example database: <A, C | {B}>.
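The intersection step of this phase can be sketched as follows. The function name generate_indirect_associations and the tuple representation of a pattern are assumptions made for illustration; the MSS-list and IIS-list values are those obtained in the example.

```python
def generate_indirect_associations(iis_list, mss_list):
    """Pair every indirect itempair with the mediators common to both items."""
    patterns = []
    for pair in sorted(iis_list, key=sorted):
        x, y = sorted(pair)
        common = mss_list.get(x, set()) & mss_list.get(y, set())
        for mediator in sorted(common, key=sorted):
            patterns.append((x, y, mediator))
    return patterns


mss_list = {
    "A": {frozenset({"B"})},
    "B": {frozenset({"A"}), frozenset({"C"})},
    "C": {frozenset({"B"})},
}
iis_list = {frozenset({"A", "C"})}
patterns = generate_indirect_associations(iis_list, mss_list)
```

Storing each mediator as a frozenset lets the intersection be a plain set operation on MSS(x) and MSS(y).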

Chapter 5 Indirect Associations Mining from Streaming Data

To the best of our knowledge, no research work has been conducted on mining indirect associations over data streams. In this chapter, we introduce two algorithms, MIA-LM and EMIA-LM, for mining indirect associations over data streams. The MIA-LM algorithm will be detailed in section 5.2 and illustrated with examples in section 5.3. The EMIA-LM algorithm is modified from EMIA and will be explained in sections 5.3 and 5.4.

5.1 Preliminary Description

Since streaming data are unbounded while memory capacity and computing power are limited, the most critical issue in designing an efficient stream mining algorithm is how to limit the amount of data kept. One of the most effective ways to control the amount of data is data pruning, and the user may use an error rate to control the degree of pruning.

Definition 4

A transaction data stream DS is an ordered pair (T, Δ), where a transaction T = <tid, I_t> is a set of items I_t with identifier tid and I_t ⊆ I, where I = {i₁, i₂, …, i_m} is a set of items, and Δ is a sequence of positive real time intervals.

In this chapter, given a transaction data stream DS and parameters, including an itempair support σ_s ∈ (0, 1), a mediator support σ_f ∈ (σ_s, 1), a mediator dependence σ_d ∈ (0, 1) and a support error threshold ε ∈ (0, σ_s), we attempt to design an algorithm for mining indirect associations over a transaction data stream with an acceptable error rate ε.

5.2 Algorithm MIA-LM

In this section, we describe the second proposed approach, namely MIA-LM (Mining Indirect Association over a Landmark Model), a hybrid of DSM-FI and HI-mine algorithms. The data structures (shown in Figure 5-1) used in the MIA-LM algorithm include:

• ISFI-forest: An ISFI-forest consists of a dynamic FI-list (Frequent Item list) and a set of dynamic IS-trees (Item-Suffix trees).

• FI-list: The information of each item is stored in the FI-list, including item-id, item-count, block-id and head-link. The item-id records the identifier of the inserted item, the item-count records the number of occurrences of the item in transactions, the block-id is the identifier of the block in which the item was inserted, and the head-link points to the root node of the corresponding IS-tree.

• IS-tree: An IS-tree consists of a dynamic header_table and a prefix tree structure. The header_table of an IS-tree is composed of item-id, item-count, block-id, and head-link. The item-id records the identifier of the inserted item, the item-count is the number of occurrences of the item in transactions, the block-id is the identifier of the block in which the item was inserted, and the head-link points to the first node with the same item in the corresponding prefix tree. Each node of the prefix tree is composed of item-id, item-count, block-id, and node-link. The first three components are the same as in the header_table, and the fourth component is a pointer to the next node with the same item.

• candidateIIS-list: If the support of a 2-itemset (itempair) is less than the itempair support σ_s, it is stored in the candidateIIS-list regardless of whether its subsets are frequent. Two attributes of each itempair, item-id and item-count, are recorded in the candidateIIS-list.

Figure 5-1 The data structures ISFI-forest, FI-list, IS-tree, and candidate IIS
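The fields listed above can be laid out as Python dataclasses. This is a sketch under our own assumptions: the class names are hypothetical, the children dictionary of a tree node is added only so that the prefix tree can be traversed, and the head-link of an FI-list entry is modeled as a reference to the whole IS-tree, from which the root is reachable.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class ISTreeNode:
    item_id: str
    item_count: int
    block_id: int
    node_link: Optional["ISTreeNode"] = None   # next node holding the same item
    children: Dict[str, "ISTreeNode"] = field(default_factory=dict)  # assumption


@dataclass
class HeaderEntry:
    item_id: str
    item_count: int
    block_id: int
    head_link: Optional[ISTreeNode] = None     # first node with this item


@dataclass
class ISTree:
    root: ISTreeNode
    header_table: Dict[str, HeaderEntry] = field(default_factory=dict)


@dataclass
class FIEntry:
    item_id: str
    item_count: int
    block_id: int
    head_link: Optional[ISTree] = None         # IS-tree rooted at this item


@dataclass
class ISFIForest:
    fi_list: Dict[str, FIEntry] = field(default_factory=dict)


# candidateIIS-list: itempair -> item-count
CandidateIISList = Dict[FrozenSet[str], int]
```

Recording the block-id with every count is what later allows entries inserted in old blocks to be identified during pruning.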

5.2.1 Algorithm description

The MIA-LM algorithm is described in Figure 5-2, and the other procedures are described in Figure 5-3, Figure 5-4, Figure 5-5, and Figure 5-6. Four user-specified parameters must be given: σ_s, σ_f, σ_d, and ε. The MIA-LM algorithm first divides DS into blocks denoted {B₁, B₂, ..., B_n}. Each block B_i consists of a set of transactions B_i = {t₁, t₂, ..., t_k}, where k is set to 1/ε as suggested in [21]. Moreover, the total number of transactions N seen so far in the transaction data stream DS is defined as N = |B₁| + |B₂| + ... + |B_l|, l ≤ n. The other four phases of the proposed algorithm (Lines 4~8) are transaction projection, candidate IIS updating, ISFI-forest pruning, and indirect association mining.
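The block division can be sketched with a generator. The sketch makes two assumptions: the support error threshold is passed as epsilon, and the block size 1/epsilon is rounded up to an integer; the name blocks is hypothetical.

```python
from itertools import islice
from math import ceil


def blocks(stream, epsilon):
    """Yield (block_id, transactions) with ceil(1/epsilon) transactions per block."""
    size = ceil(1 / epsilon)          # assumption: round 1/epsilon up
    iterator = iter(stream)
    block_id = 0
    while True:
        block = list(islice(iterator, size))
        if not block:                 # stream exhausted
            return
        block_id += 1
        yield block_id, block


# Seven transactions with epsilon = 0.25 give blocks of four transactions.
stream = [["A", "B"], ["B"], ["A", "C"], ["C"], ["B", "C"], ["A"], ["E"]]
seen = 0
sizes = []
for block_id, block in blocks(stream, epsilon=0.25):
    seen += len(block)                # N = |B_1| + ... + |B_l|
    sizes.append((block_id, len(block)))
```

Reading the stream lazily through an iterator means that only one block is held in memory at a time, which suits an unbounded input.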

Algorithm Name: MIA-LM(DS, σ_s, σ_f, σ_d, ε)

Input: Transaction data stream DS, an itempair support threshold σ_s, a mediator support threshold σ_f, a mediator dependence threshold σ_d and a support error threshold ε.

Output: Indirect Association Patterns IA.

Step:

1: Determine the size of block by using 1/ε;

2: Divide DS into blocks; each block contains 1/ε transactions;

3: foreach B_i do

4: TrasactionProjection(Bi, i);

5: UpdateIIS(ISFI-forest, candidateIIS-list, σs, N);

6: PruneISFI-forest(ISFI-forest, i);

7: if user query request = true then

8: IndirectAssociationMining(ISFI-forest, candidateIIS-list, σf, σd, σs, N);

9: endfor

Figure 5-2 The MIA-LM algorithm.

Function Name: TrasactionProjection(Bi, i)

Input: Transactions in block B_i = { T₁, T₂, …, T_h, ..., T_k} and the current block-id i.

Output: An updated ISFI-forest.

Step:

procedure TrasactionProjection(Bi, i)

1: foreach transaction t = { x₁, x₂, …, xm} in B_i do

Figure 5-3 The descriptions of transaction projection function.

Function Name: UpdateIIS (ISFI-forest, candidateIIS-list, σs, N)

Input: An ISFI-forest, an itempair support σs and the current transaction size N.

Output: An updated candidateIIS-list.

Step:

1: procedure UpdateIIS (ISFI-forest, candidateIIS-list, σs, N)

2: foreach 2-itemset X in ISFI-forest do

3: if candidateIIS-list has an entry Y && X.items = Y.items then

Figure 5-4 The descriptions of UpdateIIS function.
