Summary - Background and Related Work - Frequent Pattern Mining Based on Heterogeneous Constraints

CHAPTER 2 Background and Related Work

2.3 Summary

In this section, we summarize the research work on constraint-based association mining.

As shown in Table 2.5, most work is devoted to aggregation constraints, and none has considered incorporating a cardinality constraint.

Figure 2.12. A classification of aggregation constraints [20].

Table 2.5. A summary of research work on constraint-based association mining.

Type of Constraints

Paper Type of algorithms IC AC CC

Srikant et al, 1997 [25] Apriori-based ●

Ng et al, 1998 [18] Apriori-based ●

Bayardo et al, 2000 [5] Apriori-based ●

Grahne et al, 2000 [10] Apriori-based ●

Pei and Han, 2000 [19] FP Growth-based ●

Pei et al, 2004 [20] FP Growth-based ●

Song and Qin, 2004 [24] FP Growth-based ●

Lu et al, 2005 [17] ECLAT-based ●

J.T. Lee et al, 2006 [16] FP Growth-based ●

Bonchi and Lucchese, 2007 [6] Apriori-based ●

※ IC: Item constraint; AC: Aggregation constraint; CC: Cardinality constraint.

CHAPTER 3 Mining Frequent Patterns with Heterogeneous Constraints

In this chapter, we describe our proposed MCApriori and MCFPTree algorithms for mining frequent patterns that satisfy a set of user-specified constraints that consists of up to three types of constraints, i.e., item, aggregation, and cardinality constraints.

3.1 Problem Statement

Let I = {i₁, i₂, …, i_m} be a set of items, where each item is associated with a value attribute, such as cost, profit, or price. Let D be a transaction database consisting of a set of transactions, where a transaction T = 〈tid, I_t〉 is a set of items I_t ⊆ I with identifier tid. An itemset S, S ⊆ I, is contained in a transaction T if S ⊆ I_t. The support sup(S) of an itemset S in a transaction database D is the fraction of transactions in D containing S. Given a support threshold ξ (0 ≤ ξ ≤ 1), an itemset S is frequent provided sup(S) ≥ ξ.

A constraint C is a predicate on the powerset of I, C: 2^I → {True, False}. An itemset S satisfies C if and only if C(S) = True. The complete set of itemsets satisfying constraint C is SAT_C(I) = {S | S ⊆ I ∧ C(S) = True}. In our research, we consider three different kinds of constraints: an item constraint CB, a set of aggregation constraints SC, and a cardinality constraint CL. The problem of concern is to discover the set F of itemsets that satisfy all constraints and the minimum support threshold, i.e.,

F = {S | S ⊆ I ∧ sup(S) ≥ ξ ∧ CB(S) = True ∧ CL(S) = True ∧ C(S) = True for every C ∈ SC}.

Example 3.1. Consider the transaction database in Table 3.1, and the profit of each item in Table 3.2. Suppose that the user-specified constraints and support threshold are as shown in Table 3.3. Then the set of frequent itemsets that satisfy the specified constraints is {pd, pdo, pdb, pdob}.

Table 3.1. A transaction database.

TID List of Items

Table 3.2. Profit of each item in Table 3.1.

Item Profit

Table 3.3. Heterogeneous constraints settings.

Constraint Value

Aggregation avg(S) ≥ 30 and sum(S) ≥ 50

Item (p ∧ d)

Cardinality Card(S) ≤ 7

Support threshold ξ = 3
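The problem statement can be made concrete with a brute-force reference sketch that tests every itemset against the item, aggregation, and cardinality constraints and the support threshold. This is not one of the proposed algorithms; it only illustrates the definition of F. Since the contents of Tables 3.1 and 3.2 are not reproduced here, the transactions and profit values below are hypothetical, chosen to be consistent with Tables 3.5 and 3.6, and the support threshold is treated as an absolute count as in Table 3.3. The function and parameter names are likewise assumptions of this sketch.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def constrained_frequent_itemsets(transactions, value, item_ok,
                                  agg_constraints, max_card, min_count):
    """Brute force: test every itemset against all constraints and the support threshold."""
    items = sorted(set().union(*transactions))
    found = []
    for k in range(1, min(max_card, len(items)) + 1):  # cardinality constraint
        for combo in combinations(items, k):
            s = frozenset(combo)
            vals = [value[i] for i in s]
            if (item_ok(s)                                         # item constraint
                    and all(c(vals) for c in agg_constraints)      # aggregation constraints
                    and support_count(s, transactions) >= min_count):
                found.append(s)
    return found

# Hypothetical data (assumption): not the thesis's Tables 3.1 and 3.2.
transactions = [frozenset(t) for t in ("bdop", "bdop", "bdp", "bdop", "ace")]
value = {"p": 45, "o": 40, "b": 30, "d": 15, "c": 10, "e": 8, "a": 5}

result = constrained_frequent_itemsets(
    transactions, value,
    item_ok=lambda s: {"p", "d"} <= s,                  # item constraint p AND d
    agg_constraints=[lambda v: sum(v) / len(v) >= 30,   # avg(S) >= 30
                     lambda v: sum(v) >= 50],           # sum(S) >= 50
    max_card=7, min_count=3)
print(sorted("".join(sorted(s)) for s in result))
```

The exhaustive enumeration is exponential in the number of items, which is the cost that the constraint-driven candidate generation described in the following sections is meant to avoid.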

3.2 Algorithm MCApriori

The first proposed algorithm, named MCApriori (Multi-Constrained Apriori), is an Apriori-like algorithm that employs the anti-monotonicity property to prune a large number of uninteresting itemsets. The main differences between MCApriori and Apriori are:

(1) Instead of performing a bottom-up, level-wise traversal starting from singleton items, our algorithm exploits the item constraint to form an initial set of candidates that serves as the base for generating new candidates;

(2) A different approach to generating and pruning new candidates is employed.

Our MCApriori algorithm consists of five main phases: (1) Item sorting phase; (2) Initial candidate generation phase; (3) Constraint checking phase; (4) Support counting phase; and (5) New candidate generation phase.

In what follows, we will describe each phase in an individual subsection.

3.2.1 Item Sorting Phase

One of the design issues for MCApriori is the exploitation of aggregation constraints to prune uninteresting candidates. Given the Apriori framework, the natural focus is on anti-monotone constraints. Adopting plain anti-monotone aggregation constraints is straightforward. The real problem lies with convertible anti-monotone constraints: we have to order the items according to their values to convert such aggregation constraints into anti-monotone ones, and we have to decide whether that ordering is ascending or descending.

For the above reasons, the first step in this phase is to select the order for sorting the items. If most of the aggregation constraints are convertible with respect to ascending order, we choose the ascending order; otherwise, we choose the descending order.

Example 3.2. Continuing with Example 3.1, let us consider the aggregation constraints in Table 3.3, i.e., avg(S) ≥ 30 and sum(S) ≥ 50. Since only the first constraint is convertible anti-monotone, and it is so with respect to descending order, we sort the items in descending order of their profit values, as shown in Table 3.4. The sorted list is prepared for later use in aggregation constraint checking and new candidate generation.

Table 3.4. Descending order list L.

Item Profit
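The order selection and sorting step can be sketched as follows. The sketch handles only avg constraints as convertible (avg(S) ≥ v with respect to descending order, avg(S) ≤ v with respect to ascending order) and reads "most" as a simple majority between the two directions; both choices, along with the function names and the profit values, are assumptions made for illustration rather than the thesis's exact procedure.

```python
def convertible_order(kind, op):
    """Order under which the constraint `kind(S) op v` is convertible
    anti-monotone, or None if this sketch does not treat it as convertible."""
    if kind == "avg" and op == ">=":
        return "desc"
    if kind == "avg" and op == "<=":
        return "asc"
    return None

def choose_order(constraints):
    """Pick ascending only if more constraints are convertible w.r.t. ascending."""
    asc = sum(1 for kind, op, _ in constraints if convertible_order(kind, op) == "asc")
    desc = sum(1 for kind, op, _ in constraints if convertible_order(kind, op) == "desc")
    return "asc" if asc > desc else "desc"

def sorted_item_list(value, order):
    """Sort the items by their value attribute in the chosen direction."""
    return sorted(value, key=lambda item: value[item], reverse=(order == "desc"))

# Hypothetical profit values (assumption): Table 3.2 is not reproduced here.
value = {"a": 5, "b": 30, "c": 10, "d": 15, "e": 8, "o": 40, "p": 45, "r": 2}
constraints = [("avg", ">=", 30), ("sum", ">=", 50)]

order = choose_order(constraints)
L = sorted_item_list(value, order)
print(order, L)
```

With these hypothetical values the resulting list has the same item order as the list L used later in Section 3.2.6.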

3.2.2 Initial Candidate Generation Phase

The initial candidate generation phase is the most critical step in our MCApriori algorithm. MCApriori exploits the item constraint directly to construct an initial set of candidate itemsets, with the intention of reducing the overhead of generating many intermediate candidates.

We consider two different cases: (1) The item constraint CB does not contain any negative item, and (2) CB contains some negative items.

(1) Case 1: CB contains no negative item

In this case, we exploit the disjuncts of the item constraint CB to generate an initial set of candidate itemsets. Figure 3.1 describes the procedure for decomposing the item constraint into initial candidate itemsets.

Input: Item Constraint CB;

Figure 3.1. Generation of initial candidates from item constraint containing no negative item.

For example, if CB = (a∧b)∨(c∧d), then the initial set of candidates is {ab, cd}.

(2) Case 2: CB contains some negative items

This case is far more complicated than the first case. The intuition is to avoid generating any candidate that contains all positive items of a disjunct Di in CB together with some of the negative items of Di, while still allowing the construction of cross-disjunct candidates that contain a proper subset of the negative items in Di and all positive items of another disjunct Dj.

In light of this, we first exploit the disjuncts of the item constraint CB to generate an initial set of candidate itemsets composed of only positive items, and move all negative items to a negative set. Then we generate the powerset of the negative set. Finally, we perform a cross union between the initial candidate set and this powerset, and prune those new initial candidate itemsets that contradict the item constraint CB. Figure 3.2 shows the detailed steps for generating the set of initial itemsets that can satisfy the item constraint CB.

Example 3.3. Consider an item constraint CB = (a ∧ b) ∨ (b ∧ ∼d) ∨ (c ∧ ∼p ∧ ∼r).

Below we describe the process for executing the algorithm in Figure 3.2 on this example.

Steps 1-8: Decompose the item constraint CB to get the initial candidate set {ab, b, c} and the negative set {d, p, r}.

Step 10: Compute the powerset of the negative set (without the empty set): {d, p, r, dp, dr, pr, dpr}.

Steps 12-18: Perform a cross union between the initial candidate set and the powerset of the negative set, and prune those new initial candidate itemsets that do not satisfy CB.

Step 19: The resulting set of initial candidates is {ab, b, c, abp, abr, abpr, bp, br, bpr, cd}.
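Both cases of initial candidate generation can be sketched in a few lines. Because the steps of Figure 3.2 are not reproduced here, the pruning rule below is an interpretation of the intuition stated in the text: a cross-union candidate is dropped when it contains all positive items of some disjunct together with at least one negative item of that same disjunct. The representation of a disjunct as a pair (positive items, negative items) and the function names are assumptions of this sketch.

```python
from itertools import combinations

def nonempty_subsets(items):
    """All non-empty subsets of `items`, as frozensets."""
    items = sorted(items)
    return [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)]

def initial_candidates(disjuncts):
    """disjuncts: list of (positive_items, negative_items) pairs of strings."""
    positives = {frozenset(pos) for pos, _ in disjuncts}
    negatives = set().union(*(neg for _, neg in disjuncts))
    candidates = set(positives)              # Case 1 result when no negatives exist
    for base in positives:
        for extra in nonempty_subsets(negatives):
            cand = base | extra              # cross union
            # Assumed pruning rule: all positives of a disjunct plus any of its negatives.
            if any(set(pos) <= cand and cand & set(neg) for pos, neg in disjuncts):
                continue
            candidates.add(cand)
    return candidates

# Case 1: CB = (a AND b) OR (c AND d)
case1 = initial_candidates([("ab", ""), ("cd", "")])

# Case 2 (Example 3.3): CB = (a AND b) OR (b AND NOT d) OR (c AND NOT p AND NOT r)
case2 = initial_candidates([("ab", ""), ("b", "d"), ("c", "pr")])
print(sorted("".join(sorted(c)) for c in case2))
```

Under this reading, the sketch yields the ten initial candidates listed in Step 19 for Example 3.3. Enumerating the powerset of the negative set is exponential in the number of negative items, so the approach presumes that item constraints contain few negated items.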

3.2.3 Constraint Checking Phase

In this phase, the generated candidate itemsets are inspected against all aggregation constraints, which are classified into two groups: those that are convertible anti-monotone and those that are not. From the definition of the anti-monotone property, we derive the following rules for candidates:

Figure 3.2. Generation of the initial candidates from item constraint containing negative items.

(1) If a candidate violates any one of the convertible anti-monotone constraints, then this candidate can be pruned because all of its supersets will violate the constraint.

(2) If a candidate satisfies all convertible anti-monotone constraints but violates any one of the constraints that are not convertible anti-monotone, then this candidate is marked as an unsatisfied pattern and is kept for the generation of new candidates.

(3) If a candidate satisfies all aggregation constraints, then this candidate remains in the candidate set, ready for the next phase of support counting.

The algorithm for this phase is described in Figure 3.3.

Input: Candidate itemsets, and set of aggregation constraints CS

Output: Updated candidate itemsets.

Steps:

1. Let CS = CAM ∪ C̄AM; /* CAM = convertible anti-monotone; C̄AM = not convertible anti-monotone. */

2. for each candidate itemset do

Figure 3.3. Algorithm description of the constraint checking phase.
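The three rules of the constraint checking phase can be sketched as a single pass over the candidates. This is a minimal illustration and not a transcription of Figure 3.3; the function name, the representation of constraints as predicates over a list of item values, and the profit values are assumptions.

```python
def check_candidates(candidates, value, cam, other):
    """Split candidates according to the three rules.

    cam   : convertible anti-monotone constraints (predicates on a list of values)
    other : the remaining aggregation constraints
    Returns (satisfied, unsatisfied); candidates violating `cam` are dropped.
    """
    satisfied, unsatisfied = [], []
    for cand in candidates:
        vals = [value[item] for item in cand]
        if not all(c(vals) for c in cam):
            continue                      # rule (1): prune, supersets cannot recover
        if all(c(vals) for c in other):
            satisfied.append(cand)        # rule (3): goes on to support counting
        else:
            unsatisfied.append(cand)      # rule (2): kept only for extension
    return satisfied, unsatisfied

# Hypothetical profit values (assumption): Table 3.2 is not reproduced here.
value = {"p": 45, "o": 40, "b": 30, "d": 15, "c": 10}
cam = [lambda v: sum(v) / len(v) >= 30]   # avg(S) >= 30
other = [lambda v: sum(v) >= 50]          # sum(S) >= 50

satisfied, unsatisfied = check_candidates(["pd", "p", "pdc"], value, cam, other)
print(satisfied, unsatisfied)
```

Keeping the rule (2) candidates in a separate list reflects that they may still grow into itemsets that satisfy a constraint such as sum(S) ≥ 50, even though they are not themselves reported.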

3.2.4 Support Counting Phase

We adopt the vertical counting technique proposed by Zaki [28] to accomplish this task. A tid-list (bitmap vector) is created for each item that appears in at least one candidate pattern. The support of a candidate itemset can then be readily obtained by intersecting the bitmaps of the items forming that itemset. All candidate itemsets whose supports are no less than the threshold and whose lengths satisfy the cardinality constraint are then reported as qualified frequent itemsets.
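The vertical counting idea can be sketched with Python integers serving as bitmaps, where bit j is set when the item occurs in the j-th transaction. This is an illustrative sketch rather than the thesis's implementation: the function names are assumptions, and for brevity bitmaps are built for all items instead of only those occurring in a candidate. The transactions are those of the reduced database in Table 3.5.

```python
def build_tid_bitmaps(transactions):
    """Map each item to an integer whose bit j is 1 iff transaction j contains it."""
    bitmaps = {}
    for pos, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << pos)
    return bitmaps

def support_count(itemset, bitmaps, n_transactions):
    """Intersect the bitmaps of all items and count the surviving bits."""
    acc = (1 << n_transactions) - 1       # start with every transaction selected
    for item in itemset:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")

transactions = ["bdop", "bdop", "bdp", "bdop"]   # rows of Table 3.5
bitmaps = build_tid_bitmaps(transactions)
print(support_count("pd", bitmaps, len(transactions)),
      support_count("pdo", bitmaps, len(transactions)))
```

Once the bitmaps exist, counting a candidate costs one bitwise AND per item, so the database does not need to be rescanned for each candidate.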

3.2.5 New Candidates Generation and Checking Phase

The new candidate generation method of our MCApriori algorithm is best explained using the framework of the prefix set-enumeration tree [23]. Consider the example prefix set-enumeration tree in Figure 3.4. The rule for enumerating all prefix-preserving supersets of any node in the tree is to perform a tail extension [4] recursively, appending an item whose order is higher than that of the highest-ordered item in the itemset. For example, starting from {AB}, we can enumerate {ABC} and {ABD} as its children (prefix-preserving supersets) in the enumeration tree.

Our idea is as follows. Given a candidate k-itemset, with items ordered by the ordering we choose in Phase 1, we can perform a tail extension through the sorted item list L to generate all of its prefix-preserving supersets of length k+1. Then we check each superset against all aggregation constraints using the same process described in Phase 3.

In this way, once an itemset violates a convertible anti-monotone constraint, we stop further generating its next-level prefix-preserving supersets, leading to an efficient pruning of its branches in the enumeration tree.

Figure 3.4. A complete prefix set-enumeration tree over four items.

The above idea can be improved further if we adopt an important property stated in the following lemma.

Lemma 3.1. Consider a convertible anti-monotone aggregation constraint C_ag with respect to an order R over a set of items I. Let χ be a frequent itemset satisfying C_ag and a₁, a₂, ..., a_m be a set of frequent items listed in the order of R. If itemset χ∪{a_i} (1 ≤ i < m) violates C_ag, then for any k, i < k ≤ m, itemset χ∪{a_k} also violates C_ag. □

In short, Lemma 3.1 states that if an itemset χ∪{a_i} violates a convertible anti-monotone constraint, so does χ∪{a_j} for any j > i. So instead of first generating all extensions and checking them afterwards, we check the constraint during the course of tail extension, i.e., for each generated χ∪{a_i}, and stop extending χ as soon as a violation is found. This idea is illustrated in Figure 3.4. The detailed description of the above process is shown in Figure 3.5.
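Tail extension with the early stop justified by Lemma 3.1 can be sketched as below. The sketch assumes that a candidate is stored as a tuple whose last element is the most recently appended item, and that extension resumes in the sorted list L right after that item while skipping items already present. These representation choices, the function name, and the profit values are assumptions, not details taken from Figure 3.5.

```python
def tail_extend(candidate, sorted_items, value, cam):
    """Generate the next-level extensions of `candidate`.

    candidate    : tuple of items; the last element was appended most recently
    sorted_items : item list L in the order chosen in the item sorting phase
    cam          : convertible anti-monotone constraints (predicates on values)
    """
    start = sorted_items.index(candidate[-1]) + 1
    extensions = []
    for item in sorted_items[start:]:
        if item in candidate:
            continue                          # already part of the itemset
        new = candidate + (item,)
        vals = [value[i] for i in new]
        if not all(c(vals) for c in cam):
            break                             # Lemma 3.1: later items violate too
        extensions.append(new)
    return extensions

# Hypothetical profit values (assumption), sorted in descending order.
value = {"p": 45, "o": 40, "b": 30, "d": 15, "c": 10, "e": 8, "a": 5, "r": 2}
L = ["p", "o", "b", "d", "c", "e", "a", "r"]
cam = [lambda v: sum(v) / len(v) >= 30]       # avg(S) >= 30

print(tail_extend(("p", "d", "o"), L, value, cam))
print(tail_extend(("p", "d", "b"), L, value, cam))
```

The `break` is what distinguishes this from generate-then-check: once an appended item pushes the average below the bound, every later item in the descending list is no larger, so none of the remaining extensions needs to be built.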

Unfortunately, the procedure in Figure 3.5 will miss some candidate itemsets that may eventually turn out to be frequent itemsets satisfying the constraints. This happens because the approach only considers the prefix-preserving supersets of a candidate, and thus may miss its non-prefix-preserving supersets.

Input: Set of candidate itemsets, set of aggregation constraints CS, and sorted list L;

Figure 3.5. The procedure for new candidate generation.


For example, consider a candidate itemset {BC} in the enumeration tree in Figure 3.4, and assume one of its supersets, say {BCA}, is frequent and satisfies the constraints. Since A is ordered before B and C, we will never generate {BCA} by performing the tail extension.

To remedy this problem, we split the execution of this phase into two cases: the first execution, and all follow-up executions, i.e., all executions after the first one.

The process for the follow-up case is the same as that described in Figure 3.5, so we will focus on the first-time case. The idea is that rather than employing the tail extension, we perform a full extension between the initial set of candidate itemsets and the sorted item list L. That is, a candidate χ is extended by appending an item in L not belonging to χ. For example, if χ = {bc} and L = {a, b, c, d}, then the result is {abc, bcd}.
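The full extension used in the first execution is simple enough to state directly. The following sketch reproduces the {bc} example; the function name is an assumption, and constraint checking on the generated supersets is left out.

```python
def full_extend(candidate, sorted_items):
    """Extend `candidate` by every item of the sorted list that it lacks."""
    base = frozenset(candidate)
    return [base | {item} for item in sorted_items if item not in base]

extended = full_extend({"b", "c"}, ["a", "b", "c", "d"])
print(sorted("".join(sorted(s)) for s in extended))
```

Unlike tail extension, this step also appends items ordered before those of the candidate, which is how a superset such as {abc} of {bc} becomes reachable.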

3.2.6 An Example

Consider Example 3.1. We describe the process for executing MCApriori on this example.

Phase 1: Choose the descending order of values and sort the items accordingly.

Phase 2: Exploit the item constraint CB to obtain the set of initial candidates, {pd}.

Phase 3: Check the candidate {pd}; it satisfies all of the aggregation constraints.

Phase 4: The candidate {pd} satisfies the support threshold.

Phase 5: We perform the first-time execution between the candidate {pd} and L = {p, o, b, d, c, e, a, r}, obtaining new candidates {pdo, pdb}, which satisfy all of the aggregation constraints. Since {pdc} does not satisfy the convertible anti-monotone constraint, we stop further generating {pde, pda, pdr}.

Repeating Phases 4 and 5, we finally obtain the complete set of frequent itemsets that satisfy the constraints, {pd, pdo, pdb, pdob}.

Figure 3.6 shows the complete process for generating all frequent patterns.

Figure 3.6. An illustration of running the MCApriori algorithm on Example 3.1.

3.3 Algorithm MCFPTree

In this section, we introduce another proposed algorithm, named MCFPTree (Multi-Constrained Frequent Pattern Tree mining), which exploits the item constraint to construct the FP-Tree and conditional FP-Tree structures to discover the frequent patterns that satisfy the constraints. Compared with MCApriori, the MCFPTree algorithm is a relatively efficient method for mining constrained frequent patterns because MCFPTree does not have to generate candidate itemsets except in the first phase, for initial candidate generation.

The MCFPTree algorithm also consists of five main phases: (1) Initial candidate construction phase; (2) Support counting and database reduction phase; (3) FP-Tree construction phase; (4) Frequent pattern generation phase; and (5) Constraint checking phase. In what follows, we will first detail each phase of MCFPTree, and then give an example to illustrate our MCFPTree algorithm.

3.3.1 Detailed Description of MCFPTree

(1) Initial candidate construction phase

The initial candidate construction phase is the same as that of MCApriori, which has been introduced in Section 3.2.2.

(2) Support counting and database reduction phase

In this phase, we scan the database to count the support of each item, during which we also reduce the transaction database according to the following rule: A transaction is pruned if it does not contain any initial candidate itemset. Finally, we prune all infrequent items.

(3) FP-Tree construction phase

The task of this phase is to construct the FP-tree by scanning the transaction database. The FP-tree structure and the steps for building it follow those used in FP-Growth [12].

(4) Frequent pattern generation phase

In this phase, we traverse the FP-Tree to find all frequent itemsets with support greater than or equal to ξ. Again, the approach used in FP-Growth is adopted. That is, we construct the conditional pattern base of each frequent 1-itemset, then construct its conditional FP-tree, and perform mining recursively on that tree to generate all of the frequent itemsets.

(5) Constraint checking phase

In this phase, we check each of the frequent itemsets against the item constraint, aggregation constraints, and the cardinality constraint to generate the set of satisfied frequent itemsets.
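The support counting and database reduction phase (phase 2 above) can be sketched as follows. The text does not say whether item supports are counted before or after transactions are pruned; this sketch counts them over the retained transactions, which is an assumption, as are the function name and the example database (Table 3.1 is not reproduced here). The support threshold is an absolute count, as in Table 3.3.

```python
from collections import Counter

def reduce_database(transactions, initial_candidates, min_count):
    """Drop transactions containing no initial candidate, then drop infrequent items.

    transactions : dict mapping tid -> set of items
    """
    kept = {tid: items for tid, items in transactions.items()
            if any(cand <= items for cand in initial_candidates)}
    counts = Counter(item for items in kept.values() for item in items)
    frequent = {item for item, n in counts.items() if n >= min_count}
    return {tid: frozenset(items & frequent) for tid, items in kept.items()}

# Hypothetical database (assumption), built so that the outcome matches Table 3.5.
db = {
    10: frozenset("abdop"),
    20: frozenset("ace"),
    30: frozenset("bdop"),
    40: frozenset("bcdp"),
    50: frozenset("bdopr"),
}
reduced = reduce_database(db, [frozenset("pd")], min_count=3)
for tid in sorted(reduced):
    print(tid, sorted(reduced[tid]))
```

Pruning transactions first is safe for the final answer because every itemset that satisfies the item constraint must contain some initial candidate, so transactions without one cannot contribute to its support.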

3.3.2 An Example

Consider Example 3.1 again. Below we illustrate the process for executing MCFPTree on this example.

Phase 1: Exploit the item constraint CB to obtain initial candidates {pd}.

Phase 2: Scan the transaction database to find all frequent 1-itemsets and perform transaction reduction where appropriate. The resulting database is shown in Table 3.5.

Phase 3: Scan the reduced database to construct the FP-tree; the resulting FP-tree is shown in Figure 3.7.

Phase 4: This phase constructs the conditional FP-tree of each frequent 1-itemset to generate all frequent patterns, which is depicted in Figure 3.8.

Phase 5: Finally, all frequent itemsets are checked against all constraints. The resulting set of satisfied frequent patterns is shown in Table 3.6.

Table 3.5. The reduced transaction database.

TID List of Items

10 b, d, o, p

30 b, d, o, p

40 b, d, p

50 b, d, o, p

Figure 3.7. The FP-Tree generated from the reduced database in Table 3.5.

Figure 3.8. The conditional-base FP-Tree.

Table 3.6. The frequent itemsets that satisfy all sets of constraints.

Itemset Support sum avg Card

pd 4 60 30 2

pdb 4 90 30 3

pdo 3 100 33.3 3

pdob 3 130 32.5 4

CHAPTER 4 Performance Evaluations

In this chapter, we evaluate the performance of the proposed algorithms for mining frequent patterns with multiple constraints, MCApriori and MCFPTree. A real dataset, Accidents [9], and two synthetic datasets, T10I4D100K and T40I10D100K, generated by the IBM generator [2], are used in this evaluation.

All experiments were performed on an ASUS server with two 1.8 GHz Intel Xeon CPUs, 4 GB of main memory, and a 450 GB hard disk, running Windows Server 2003. For comparison with our algorithms, we also implemented two methods: the Apriori+ algorithm and the FP-Growth+ algorithm. Here, Apriori+ refers to applying the Apriori algorithm, and FP-Growth+ to applying the FP-Growth algorithm, each followed by post-processing of the frequent patterns.

In Section 4.1, we compare the performance of MCApriori against the Apriori+ algorithm, while Section 4.2 shows the performance of MCFPTree against the FP-Growth+ algorithm. All comparisons were investigated from three aspects: support threshold, size of item constraints, and number of aggregation constraints.

4.1 Performance of MCApriori

4.1.1 Performance on Synthetic Datasets

The effectiveness and efficiency of the proposed algorithms were first evaluated over the

1000. Detailed parameter settings for these two IBM synthetic datasets are shown in Table 4.1.

Table 4.1. Parameter setting for synthetic database generation.

Parameters T10I4D100K T40I10D100K

|D| Number of original transactions 100K 100K

|T| Average size of transactions 10 40

In what follows, we describe the results for each of the four aspects of concern: the support threshold, size of item constraints, number of aggregation constraints, and scalability.

(1) Support threshold

In this experiment various support thresholds are considered, ranging from 0.1% to 10%.

The other constraints consist of two aggregation constraints, one item constraint composed of five disjuncts, and a cardinality constraint. Table 4.2 shows the detailed settings.

The results in Figure 4.1 and Figure 4.2 demonstrate that our MCApriori algorithm is significantly faster than Apriori+ on datasets T10I4D100K and T40I10D100K, especially for relatively large average sizes of transactions and frequent itemsets.

Table 4.2. Constraint settings for experiments regarding the effect of varying support threshold.

Constraint Name Value

Aggregation avg(S) ≥500 and sum(S) ≥3000

Item (900∩851)∪(701∩800)∪(800)∪(700∩9)∪(750∩9∩3)

Cardinality Card(S) ≤7

Figure 4.1. Performance of MCApriori against Apriori+ on T10I4D100K.

Figure 4.2. Performance of MCApriori against Apriori+ on T40I10D100K.

(2) Size of item constraints

In this experiment, we consider the effect of the size of item constraint, which is measured by the number of disjuncts, ranging from 1 to 6. The other settings of aggregation constraints, cardinality constraint, and support threshold are shown in Table 4.3.

Table 4.3. Constraint settings for experiments regarding the effect of item constraint size.

Constraint Name Value

Aggregation avg(S) ≥ 500 and sum(S) ≥ 3000

Support Threshold ξ = 0.002 for T10I4D100K ξ = 0.075 for T40I10D100K

Cardinality Card(S) ≤ 7

From the results in Figure 4.3 and Figure 4.4 we observe that

(1) The effect of item constraint size becomes negligible for larger support thresholds.

(2) The size of the item constraint has almost no effect on the performance of Apriori+ because the cost of post-processing frequent itemsets is negligible compared to that of executing Apriori.

Figure 4.3. Performance of MCApriori against Apriori+ with varying # of disjuncts on T10I4D100K.

Figure 4.4. Performance of MCApriori against Apriori+ with varying # of disjuncts on T40I10D100K.

(3) Number of aggregation constraints

In this experiment, we consider the effect of varying number of aggregation constraints, which is set from 1 to 6. The other constraint settings, including item constraint composed of three disjuncts, the support thresholds for datasets T10I4D100K and T40I10D100K, and a cardinality constraint, are detailed in Table 4.4.

Table 4.4. Constraint settings for experiments regarding the effect of varying number of aggregation constraints.

Constraint Name Value

Item Constraint (701)∪(765)∪(36)

Support Threshold ξ = 0.002 for T10I4D100K ξ = 0.075 for T40I10D100K

Cardinality Card(S) ≤ 7

The results in Figure 4.5 and Figure 4.6 demonstrate that

(1) The factor of size of items has almost no effect on the performance of Apriori+.

(2) As observed for item constraint size, varying the number of aggregation constraints has almost no effect on the performance of Apriori+.

Figure 4.5. Performance of MCApriori against Apriori+ with varying # of aggregations on T10I4D100K.

Figure 4.6. Performance of MCApriori against Apriori+ with varying # of aggregations on T40I10D100K.

(4) Scalability study

In this experiment, we evaluate the scalability of the proposed MCApriori algorithm from two aspects, the number of items and database size. The synthetic dataset T10I4D100K is used in this evaluation, and we use four support thresholds, ranging from 0.25% to 1%. First, we examine the scalability of MCApriori with respect to the number of different items, ranging from 0.25K to 1.25K; the other constraint settings are shown in Table 4.5.

Table 4.5. Constraint setting for experiments regarding the effect of number of items.

Constraint Name Value

Item Constraint (221∩222)∪(241∩195)∪(199∩239)

Aggregation Constraint avg(S) ≥ 215 and sum(S) ≥ 600

Cardinality Card(S) ≤ 7

The results in Figure 4.7 demonstrate that
