
An improved data mining approach using predictive itemsets

Tzung-Pei Hong a,*, Chyan-Yuan Horng b, Chih-Hung Wu c, Shyue-Liang Wang d

a Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
b Institute of Information Engineering, I-Shou University, Kaohsiung 840, Taiwan, ROC
c Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
d Department of Information Management, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC

* Corresponding author. E-mail addresses: tphong@nuk.edu.tw (T.-P. Hong), hcy@ms4.url.com.tw (C.-Y. Horng), johnw@nuk.edu.tw (C.-H. Wu), slwang@nuk.edu.tw (S.-L. Wang).

Abstract

In this paper, we present a mining algorithm that improves the efficiency of finding large itemsets. Based on the concept of prediction proposed in the (n, p) algorithm, our method considers the data dependency in the given transactions to predict promising and non-promising candidate itemsets. It estimates for each level a different support threshold, derived from a data dependency parameter, and uses it to decide directly whether an item should be included in a promising candidate itemset. In this way, the efficiency of finding large itemsets is maintained by reducing both the number of scans of the input dataset and the number of candidate itemsets. Experimental results show that our method is more efficient than the apriori and the (n, p) algorithms when the minimum support value is small.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Data mining; Association rule; Predictive itemset; Data dependency; Predicting minimum support

1. Introduction

Years of effort in data mining have produced a variety of efficient techniques (Chen, Han, & Yu, 1996). Depending on the types of datasets processed, mining approaches may be classified as working on transaction datasets, temporal datasets, relational datasets, or multimedia datasets, among others. On the other hand, depending on the classes of knowledge derived, mining approaches may be classified as finding association rules, classification rules, clustering rules, or sequential patterns (Agrawal & Srikant, 1995), etc. Among these techniques, finding association rules from transaction datasets is usually an essential task (Agrawal, Imielinski, & Swami, 1993b; Agrawal & Srikant, 1994; Agrawal, Srikant, & Vu, 1997; Ezeife, 2002; Han & Fu, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1997; Srikant & Agrawal, 1995, 1996; Wojciechowski & Zakrzewicz, 2002).

Many algorithms for mining association rules from transactions have been proposed, most of which are executed in a level-wise process. That is, itemsets containing single items are processed first, then itemsets with two items are processed. The process is repeated, adding one more item at each level, until some criteria are met. The famous apriori mining algorithm was proposed by Agrawal et al. (1993a, 1993b). The apriori iterates two phases, the phase of candidate generation and the phase of verification. Possible large itemsets are produced in the first phase and verified in the second phase by scanning the input dataset. Since itemsets are processed level by level and the dataset has to be scanned at each level, the verification phase dominates the performance. Han, Pei, and Yin (2000) then proposed the Frequent-Pattern-tree (FP-tree) structure for efficiently mining association rules without generating candidate itemsets. The FP-tree is used to compress a database into a tree structure which stores only the large items. Several other algorithms based on the FP-tree structure have also been proposed. For example, Qiu, Lan, and Xie (2004) proposed the QFP-growth mining approach to mine association rules. Zaiane and Mohammed (2003) proposed the COFI-tree structure to replace the conditional FP-tree. Ezeife (2002) constructed a generalized FP-tree, which stores all the large and non-large items, for incremental mining without rescanning databases. Koh and Shieh (2004) also adjusted FP-trees based on two support thresholds, but with a more complex adjusting procedure that spends more computation time than the one proposed in this paper.

Denwattana and Getta (2001) proposed an algorithm (referred to as the (n, p) algorithm) to reduce the number of scans of the input dataset when finding large itemsets. The (n, p) algorithm also iterates two phases, the phase of prediction and the phase of verification. Unlike the apriori, the (n, p) algorithm predicts large itemsets for p levels in the first phase and verifies all these p levels of itemsets in the second phase. A heuristic estimation method is presented to predict the possibly large itemsets. If the prediction is valid, the approach is efficient in finding the actually large itemsets.

In this paper, we propose a mining algorithm to improve the efficiency of finding large itemsets. Our approach is based on the concept of prediction presented in the (n, p) algorithm and considers the data dependency among transactions. As the (n, p) algorithm does, our method iterates the same two phases, but it uses a new estimation method to predict promising and non-promising candidate itemsets flexibly. The estimation mechanism computes for each level a different support threshold derived from a data dependency parameter and determines directly, from the support values of the items, whether an item should be included in a promising candidate itemset. Since the new estimation mechanism reduces the number of candidate itemsets to be verified, and the prediction concept of the (n, p) algorithm reduces the number of scans of the input dataset, the performance of finding large itemsets can be improved.

The rest of this paper is organized as follows. Section 2 presents the related works on finding large itemsets; the apriori algorithm and the (n, p) algorithm are reviewed. In Section 3, we describe our motivation and the theoretical foundation of our method. A detailed description of our algorithm is given in Section 4, together with a simple example. Experimental results and a comparison of the performance of the apriori, the (n, p) algorithm, and our method are shown in Section 5. Conclusions are given in Section 6.

2. Related works

One application of data mining is to induce association rules from transaction data, such that the presence of certain items in a transaction implies the presence of certain other items. Below we briefly review the apriori and the (n, p) algorithms.

2.1. The apriori algorithm

Agrawal and Srikant (1994) and Agrawal et al. (1993b, 1997) proposed the famous apriori mining algorithm, based on the concept of large itemsets, to find association rules in transaction data. The apriori iterates two phases, the phase of candidate generation and the phase of verification, at each level. At the ith level, i > 1, itemsets consisting of i items are processed. In candidate generation, all possible large i-itemsets are produced by combining the unrepeated elements of the (i − 1)-itemsets. In the verification phase, the input dataset is scanned, and if the number of times an i-itemset appears in the transactions is larger than a pre-defined threshold (called the minimum support, or minsup), the itemset is considered large. After that, these two phases iterate for the (i + 1)th level until all large itemsets are found.

Suppose that we have a dataset containing six transactions, as shown in Table 1, which has two features, transaction identification (ID) and transaction description (Items). There are eight items in the dataset. Assume that minsup = 30%. The transaction dataset is first scanned to count the candidate 1-itemsets. Since the counts of the items a (4), b (5), c (6), d (3), and e (4) are larger than 6 * 30% = 1.8, these items are put into the set of large 1-itemsets. Candidate 2-itemsets are then formed from these large 1-itemsets by taking any two items in the large 1-itemsets and counting whether their occurrences are larger than or equal to 1.8. Therefore, ab, ac, ae, bc, bd, be, cd, ce, and de form the set of large 2-itemsets. In a similar way, abc, abe, ace, bce, bcd, bde, and cde form the set of large 3-itemsets, and abce and bcde form the set of large 4-itemsets.
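As a concrete illustration of this level-wise process, the following is a minimal Python sketch of the generic apriori idea applied to the Table 1 transactions (our own illustration, not the authors' implementation; the candidate join is simplified to unions of large itemsets, without the usual prune step). It reproduces the large itemsets listed above:

# Transactions from Table 1; minsup = 30% of 6 transactions = 1.8 occurrences.
transactions = ["abc", "bcdef", "abceg", "acd", "bcde", "abce"]
minsup_count = 0.30 * len(transactions)

def count(itemset):
    # Number of transactions containing every item of the candidate itemset.
    return sum(1 for t in transactions if set(itemset) <= set(t))

# Level 1: one dataset scan gives the large 1-itemsets {a, b, c, d, e}.
items = sorted({i for t in transactions for i in t})
large = [frozenset(i) for i in items if count(i) >= minsup_count]

level = 1
while large:
    print(level, sorted("".join(sorted(s)) for s in large))
    # Candidate generation (simplified join: unions of two large itemsets
    # with level + 1 items), then verification by another dataset scan.
    candidates = {a | b for a in large for b in large if len(a | b) == level + 1}
    large = [c for c in candidates if count(c) >= minsup_count]
    level += 1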

2.2. Denwattana and Getta’s approach

In Denwattana and Getta (2001), the (n, p) algorithm tries to reduce the number of dataset scans and improve the efficiency of finding large itemsets by "guessing" which itemsets should be considered. The approach partitions candidate itemsets into two parts, positive candidate itemsets and negative candidate itemsets, where the former contains itemsets guessed to be large and the latter contains itemsets guessed to be small. Initially, the (n, p) algorithm scans the dataset to find the large 1-itemsets. Two parameters, called the n-element transaction threshold tt and the frequency threshold tf, are used to judge whether an item may compose a positive candidate itemset. According to the n-element transaction threshold tt, only the transactions with item numbers (lengths) less than or equal to tt are considered. The frequency of each item appearing in transactions with j items, j ≤ tt, is computed. If the appearing frequency of an item is larger than or equal to tf, the item may be used to compose a positive candidate itemset. Then, two phases, the prediction process and the verification process, are iterated. In the prediction process, the positive candidate 2-itemsets C2+, each of which has both items satisfying the above criteria, are formed. The remaining candidate 2-itemsets not in C2+ form C2−. C3+ and C3− are then formed from C2+ alone in a similar way. The positive candidate 2-itemsets which are subsets of itemsets in C3+ are then removed from C2+. The same process is repeated until p levels of positive and negative candidate itemsets are formed. After that, the verification process checks, by scanning the dataset once, whether the itemsets in the positive candidate sets are actually large and whether the itemsets in the negative candidate sets are actually small. The incorrectly guessed itemsets are then expanded and processed by scanning the dataset again.

Table 1
A sample dataset of transactions

ID   Items
1    abc
2    bcdef
3    abceg
4    acd
5    bcde
6    abce
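To make the positive-item selection rule concrete before the worked example below, here is a small Python sketch using the Table 1 transactions (our own illustration, not the authors' code). It counts item occurrences per transaction length and marks an item as positive when its count reaches the tf fraction of the transactions of some length not exceeding tt:

from collections import Counter, defaultdict

# Transactions of Table 1 and the thresholds used in the running example.
transactions = ["abc", "bcdef", "abceg", "acd", "bcde", "abce"]
tt, tf = 5, 0.80   # n-element transaction threshold and frequency threshold

# For every transaction length <= tt, count item occurrences and the number
# of transactions of that length (cf. Table 2).
counts = defaultdict(Counter)
num_tx = Counter()
for t in transactions:
    if len(t) <= tt:
        num_tx[len(t)] += 1
        counts[len(t)].update(set(t))

# An item is "positive" if, for some length, its occurrence count reaches
# tf of the transactions of that length.
positive = {item
            for length, c in counts.items()
            for item, occ in c.items()
            if occ >= tf * num_tx[length]}
print(sorted(positive))   # -> ['a', 'b', 'c', 'e'] for this dataset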

Consider the dataset shown in Table 1 again. Suppose that minsup = 30%, the n-element transaction threshold tt = 5, and (n, p) = (5, 3). The numbers of occurrences of each item in transactions with different lengths are shown in Table 2.

Assume the frequency threshold tf = 80%. The occurrence number of an item in transactions with 3 items must be larger than or equal to 1.6 (2 * 80%) for the item to be considered positive. Thus, items a and c are positive. Similarly, b, c, and e are positive for transactions with 4 items, and b, c, and e are positive for transactions with 5 items. The items which can be used to compose positive candidate itemsets are thus the union {a, b, c, e} of the above three sets. We obtain C2+ = {bc, be, ab, ce, ac, ae} and C2− = {bd, bf, bg, cd, cf, cg, de, ef, eg, ad, af, ag, df, dg, fg}. Then, C3+ and C3− are formed from C2+ as C3+ = {bce, abc, abe, ace} and C3− = ∅. C4+ and C4− are formed from C3+ as C4+ = {abce} and C4− = ∅. The elements in C2+ which are subsets of those in C3+ are removed; thus, we obtain C2+ = ∅. Similarly, the itemsets in C3+ which are subsets of those in C4+ are removed; therefore, C3+ = ∅.

In the verification process, the dataset is scanned to check all the itemsets in Cj+ and Cj−, j = 2, 3, 4. After the scan, the set of large itemsets in the positive candidate sets is {abce}. The set of small itemsets in the negative candidate sets is {bf, bg, cf, cg, ef, eg, ad, af, ag, df, dg, fg}. The itemsets de, cd, and bd in C2− are incorrectly predicted. The incorrectly predicted itemsets are further processed. The subsets of the incorrectly predicted itemsets in the positive candidate sets are first generated; these subsets are then pruned using the large itemsets already known. Since C4+ is correctly predicted, the result is ∅. The supersets of the incorrectly predicted itemsets in the negative candidate sets are also generated, as {bcd, bde, cde, bcde}. The dataset is then scanned again to check these itemsets. The itemsets bcd, bde, cde, and bcde are found to be large in the second scan. All the large itemsets are then collected as

L2 = {bd, cd, de, ae, ac, ce, ab, be, bc},
L3 = {cde, bde, bcd, ace, abe, abc, bce},
L4 = {bcde, abce}.

The same process goes on for finding L5 to L7. C5 is first generated from L4 as {abcde} and checked in the same manner. After all the large itemsets are found, the association rules can be derived from them.

3. Motivation

3.1. Observation

The apriori algorithm is straightforward and simple but needs to scan the dataset repeatedly while finding the large itemsets. Many approaches have been proposed to reduce the number of dataset scans, and the (n, p) algorithm is one of them. It scans the input dataset once in the beginning and twice in each verification process. Since p levels of itemsets are processed in an iteration, the number of dataset scans is smaller than that in the apriori. In the above example, the apriori scans the dataset 10 times, while the (n, p) algorithm scans it five times. In Denwattana and Getta (2001), it is claimed that the number of dataset scans can be reduced with proper parameters n and p.

Unfortunately, in some cases, the overall performance of the (n, p) algorithm is degraded by the verification process. In Ci+ and Ci−, i ≥ 2, all itemsets with lower supports are rejected from Ci+ and collected into Ci−. When generating C(i+1)+ and C(i+1)−, all elements in Ci+ and Ci− are taken into account, respectively. If a large number of itemsets have supports lower than the pre-defined minsup and are included in Ci−, the (n, p) algorithm also has to compute all combinations of such itemsets and delete them all in the verification process. This affects the overall performance of the (n, p) algorithm. On the other hand, the apriori considers only the itemsets with supports higher than minsup. For example, suppose the (n, p) algorithm guesses that 200 out of 1000 2-itemsets are large and includes them in C2+, with the remaining 800 itemsets in C2−, and that 70% of the elements in C2+ and 99% of the elements in C2− are not actually large. It then takes about (800 × 99%)^(2(k−1)) computations to generate Ck, k = 2, 3, …, 2 + p, and most elements of Ck are removed in the verification phase. In the apriori, if 200 × 120% 2-itemsets are actually large (the (n, p) algorithm may make a wrong guess), only (200 × 120%)^2 computations are needed in generating C3, and far fewer than (200 × 120%)^2 itemsets are left after verifying their supports by scanning the dataset. Though the number of dataset scans in the (n, p) algorithm is smaller than that in the apriori, the overall performance of the (n, p) algorithm may not be good enough. Experimental results are available in Section 5. Therefore, the performance of the (n, p) algorithm can be further improved. One direction of improvement is to employ better data structures, so that the generation of C(i+1)+ and C(i+1)− can be done in a more efficient manner, as in Bykowski and Rigotti (2001), Bastide, Taouil, Pasquier, Stumme, and Lakhal (2000), Koh and Shieh (2004), Pei, Han, and Mao (2000) and Zaki and Hsiao (2002). The other direction is to filter out the itemsets with lower supports and reduce the number of elements to be included in Ci+ and Ci−. This paper adopts the latter approach from a probabilistic point of view.

Table 2
The number of occurrences of each item in transactions of different lengths

Item                      Length = 3   Length = 4   Length = 5
a                         2            1            1
b                         1            2            2
c                         2            2            2
d                         1            1            1
e                         0            2            2
f                         0            0            1
g                         0            0            1
Number of transactions    2            2            2

3.2. Theoretical foundation

From a probabilistic viewpoint, items usually need larger support values to be covered in large itemsets with more items. For example, if minsup is 30%, then an item with a support just greater than 30% forms a large 1-itemset. Both items in a 2-itemset must, however, have supports a little larger than 0.3 for this 2-itemset to be large with a high probability. At one extreme, if total dependency relations exist in the transactions, an appearance of one item will certainly imply the appearance of another. In this case, the support thresholds for an item to appear in large itemsets with different numbers of items are the same. At the other extreme, if the items are totally independent, then the support thresholds for an item to appear in large itemsets with different numbers of items should be set at different values. In this case, the support threshold for an item to appear in a large itemset with r items can be easily derived as follows.

Since all the r items in a large r-itemset must be large 1-itemsets, all the supports of the r items must be larger than or equal to the pre-defined minsup α. Since the items are assumed totally independent, the support of the r-itemset is s1 · s2 · … · sr, where si is the actual support of the ith item in the itemset. If this r-itemset is large, its support must be larger than or equal to α. Thus

s1 · s2 · … · sr ≥ α.

If the predictive support threshold for an item to appear in a large r-itemset is αr, then

s1 · s2 · … · sr ≥ αr · αr · … · αr ≥ α.

Thus

αr^r ≥ α and αr ≥ α^(1/r).

Therefore, if the items are totally independent, the support threshold of an item should be expected to be α^(1/r) for it to be included in a large r-itemset. Since transactions are seldom totally dependent or totally independent, a data dependency parameter w, ranging between 0 and 1, is then used to calculate the predictive support threshold of an item for appearing in a large r-itemset as w·α + (1 − w)·α^(1/r). A larger w value represents a stronger item relationship existing in the transactions; w = 1 means total dependency of transaction items, and w = 0 means total independency. The proposed approach thus uses different predictive support thresholds for an item to be included in promising itemsets with different numbers of items.

4. Our method

4.1. Our algorithm

The proposed mining algorithm aims at efficiently finding any p levels of large itemsets by scanning the dataset twice, except for the first level. The support of each item obtained from the first dataset scan is used directly to predict whether the item will appear in a promising itemset. The proposed method uses a higher predicting minsup for an item to be included in a promising itemset with more items; itemsets with different numbers of items thus have different predicting minsups for an item. A predicting minsup is calculated as a weighted average of the possible minsups for totally dependent data and for totally independent data, with a data dependency parameter, ranging between 0 and 1, used as the weight.

A mining process similar to that proposed in Denwattana and Getta (2001) can then be adopted to find the p levels of large itemsets. The dataset is first scanned to get the support of each item. If the support of an item is larger than or equal to the pre-defined minsup value, it is put in the set of large 1-itemsets L1. After the large 1-itemsets have been found, any further p levels of large itemsets can be obtained by scanning the dataset twice. Candidate 2-itemsets (C2) are formed by combining the items in the large 1-itemsets. In the meantime, the predicting minsup for an item to be included in large 2-itemsets is estimated according to the given data dependency parameter. If the support of an item in L1 is smaller than the predicting minsup, any candidate 2-itemset including this item will, with high probability, not be large. On the contrary, any candidate 2-itemset in which the supports of all items are larger than or equal to the predicting minsup has a high probability of being large. The candidate 2-itemsets can then be partitioned into two parts, C2+ and C2−, according to whether the supports of all the items in an itemset are larger than or equal to the predicting minsup.

After the promising candidate 2-itemsets (C2+) are generated, candidate 3-itemsets (C3) are formed by combining them. Similar to the process for finding candidate 2-itemsets, a new predicting minsup for an item to be included in promising 3-itemsets is calculated. Candidate 3-itemsets can then be partitioned into two parts, C3+ and C3−, by comparing the supports of the items included in C3 with the predicting minsup. The same procedure is repeated until the p levels of itemsets are processed. Therefore, no dataset scan except the one for the first level has been done so far. The dataset is then scanned to find the actually large itemsets among the promising candidate itemsets and among the non-promising candidate itemsets. The itemsets incorrectly predicted in the p levels are further processed by scanning the dataset once more. Therefore, a total of two dataset scans are needed to get the p levels of large itemsets. After that, another processing phase of p levels is done. This processing is repeated until no large itemsets are found in a phase. Let ci be the number of occurrences of item ai in the input dataset of transactions, and let the support of each item ai be ci/n. The details of our algorithm are presented in Fig. 1.

Fig. 1. The proposed mining algorithm.

INPUT: A set of n transactions with m items, the minsup α, a dependency parameter w, and a level number p.
OUTPUT: Large itemsets.

STEP 1: Check whether the support αi of each item ai is larger than or equal to α. If αi ≥ α, put ai in the set of large 1-itemsets L1.
STEP 2: Set r = 1, where r is the number of items in the itemsets currently being processed.
STEP 3: Set r' = 1, where r' is the number of items at the end of the last iteration.
STEP 4: Set Pr = Lr, where Pr is the set of items predicted to be included in r-itemsets.
STEP 5: Generate the candidate set Cr+1 from Lr in a way similar to that in the apriori.
STEP 6: Set r = r + 1.
STEP 7: Check whether the support αi of each item ai in P(r−1) is larger than or equal to the predicting minsup α' = wα + (1 − w)α^(1/r) for being included in Pr. If αi ≥ α', put ai in Pr.
STEP 8: Form the promising candidate itemsets Cr+ by choosing from Cr the itemsets with every item existing in Pr.
STEP 9: Set the non-promising candidate itemsets Cr− = Cr − Cr+.
STEP 10: Set r = r + 1.
STEP 11: Generate the candidate set Cr from C(r−1)+ in the way the apriori does.
STEP 12: Check whether the support αi of each item ai in P(r−1) is larger than or equal to the predicting minsup α' = wα + (1 − w)α^(1/r); if αi ≥ α', put ai in Pr. Form the promising candidate itemsets Cr+ by choosing from Cr the itemsets with every item existing in Pr (as in STEP 8).
STEP 13: Set the non-promising candidate itemsets Cr− = Cr − Cr+.
STEP 14: Remove the itemsets in C(r−1)+ which are subsets of any itemset in Cr+.
STEP 15: Repeat STEP 10 to STEP 14 until r = r' + p.
STEP 16: Scan the dataset to check whether the promising candidate itemsets C(r'+1)+ to Cr+ are actually large and whether the non-promising candidate itemsets C(r'+1)− to Cr− are actually small. Put the actually large itemsets in the corresponding sets L(r'+1) to Lr.
STEP 17: Find all the proper subsets with r' + 1 to i items for each itemset which is not large in Ci+, r' + 1 ≤ i ≤ r; keep the proper subsets which are not among the existing large itemsets; denote them as NC+.
STEP 18: Find all the proper supersets with i to r items for each itemset which is large in Ci−, r' + 1 ≤ i ≤ r; the supersets must also have all their sub-itemsets of r' items existing in Lr' and cannot include any sub-itemset among the non-large itemsets in Ci+ and Ci− checked in STEP 16; denote them as NC−.
STEP 19: Scan the dataset to check whether the itemsets in NC+ and NC− are large; add the large itemsets to the corresponding sets L(r'+1) to Lr.
STEP 20: If Lr is not null, set r' = r' + p and go to STEP 4 for another iteration; otherwise, do STEP 21.
STEP 21: Add the non-redundant subsets of the large itemsets to the corresponding sets L2 to Lr.

4.2. An example

In this section, the dataset in Table 1 is used to illustrate our method. Assume that minsup α = 30% and w = 0.5 in this example. The following steps are performed:

STEP 1: The support of each item is compared with the minsup α. Since the supports of {a}, {b}, {c}, {d}, and {e} are larger than or equal to 0.30, they are put in L1.
STEP 2: r = 1.
STEP 3: r' = 1.
STEP 4: P1 is the same as L1, which is {a, b, c, d, e}.
STEP 5: The candidate set C2 is formed from L1 as C2 = {ae, be, ce, de, ad, bd, cd, ac, bc, ab}.
STEP 6: r = r + 1 = 2.
STEP 7: The predicting minsup value for each item is α' = 0.5 × 0.30 + (1 − 0.5) × 0.30^(1/2) = 0.424. The support of each item in P1 is then compared with 0.424. Since the supports of {a}, {b}, {c}, {d}, and {e} are larger than 0.424, P2 is {a, b, c, d, e}.

STEP 8: The itemsets in C2 with every item existing in P2 are chosen to form the promising candidate itemsets C2+. Thus C2+ = {ae, be, ce, de, ad, bd, cd, ac, bc, ab}.
STEP 9: The non-promising candidate itemsets are found as C2− = C2 − C2+ = ∅.
STEP 10: r = r + 1 = 3.
STEP 11: The candidate set C3 is formed from C2+ as C3 = {abe, ace, ade, bce, bde, cde, abd, acd, bcd, abc}.
STEP 12: The predicting minsup is α' = 0.5 × 0.30 + (1 − 0.5) × 0.30^(1/3) = 0.485. The support of each item in P2 is then compared with 0.485. Since the supports of {a}, {b}, {c}, {d}, and {e} are larger than 0.485, P3 is {a, b, c, d, e}. Thus, we have C3+ = C3.
STEP 13: The non-promising candidate itemsets are found as C3− = C3 − C3+ = ∅.
STEP 14: Since all itemsets in C2+ are subsets of itemsets in C3+, they are removed; C2+ = ∅.
STEP 15: Since r (= 3) < r' + p (= 4), STEP 10 to STEP 14 are repeated. r = 3 + 1 = 4. The candidate set C4 is then formed from C3+ as C4 = {abce, abde, acde, bcde, abcd}. The predicting minsup is α' = 0.5 × 0.30 + (1 − 0.5) × 0.30^(1/4) = 0.520. The support of each item in P3 is then compared with 0.520. Since the supports of {a}, {b}, {c}, and {e} are larger than 0.520, P4 is {a, b, c, e}. C4+ is thus formed as C4+ = {abce}. The non-promising candidate itemsets are found as C4− = C4 − C4+ = {abde, acde, bcde, abcd}. The itemsets in C3+ which are subsets of itemsets in C4+ are removed from C3+. Thus, we obtain C3+ = {ade, bde, cde, abd, acd, bcd}.
STEP 16: The dataset is scanned to check whether the promising candidate itemsets C2+ to C4+ are actually large and whether the non-promising candidate itemsets C2− to C4− are actually small. The itemsets ad in C2+, ade, abd, and acd in C3+, and bcde in C4− are incorrectly predicted. After deleting ad, ade, abd, and acd from C2+ and C3+, the remaining elements of C2+, C3+, and C4+ are put into L2, L3, and L4, respectively. Also, bcde is put into L4.
STEP 17: The proper subsets of the itemsets incorrectly predicted in Ci+ are generated. Since {ad, ade, abd, acd} is incorrectly predicted in this example, the proper subsets not among the existing large itemsets are {ad}. Thus NC+ = {ad}.
STEP 18: The proper supersets of the itemsets incorrectly predicted in Ci− are generated. Since only {bcde} is incorrectly predicted in this example, its proper supersets within the processed levels that are not among the existing large itemsets form the empty set. Thus NC− = ∅.
STEP 19: The dataset is scanned to find the large itemsets in NC+ and NC−. Since {ad} turns out not to be large after verifying against the dataset, the large itemsets L2 to L4 are found as L2 = {ab, bc, ac, cd, bd, de, ce, be, ae}, L3 = {abc, bcd, cde, bde, bce, ace, abe}, and L4 = {bcde, abce}.
STEP 20: Since L4 is not null, the next iteration is executed. STEP 4 to STEP 19 are then repeated for L5 to L7. We then have C5 = {abcde} and L5 = L6 = L7 = ∅.
STEP 21: The non-redundant subsets of the found large itemsets are added to the corresponding sets L2 to L4. The final large itemsets L2 to L4 are as follows:

L2 = {ab, bc, ac, cd, bd, de, ce, be, ae},
L3 = {abc, bcd, cde, bde, bce, ace, abe},
L4 = {bcde, abce}.
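To summarize the prediction phase in code, the following simplified Python sketch (our own illustration, not the authors' Java implementation) shows how candidate 2-itemsets are split into promising and non-promising parts using the per-item predicting minsup; the verification and correction steps of Fig. 1 are omitted:

from itertools import combinations

# Transactions of Table 1 and the parameters of the example (alpha = 30%, w = 0.5).
transactions = ["abc", "bcdef", "abceg", "acd", "bcde", "abce"]
n = len(transactions)
alpha, w = 0.30, 0.5

# Item supports from the single initial dataset scan.
items = {i for t in transactions for i in t}
support = {i: sum(i in t for t in transactions) / n for i in items}
L1 = sorted(i for i, s in support.items() if s >= alpha)

def predictive_minsup(r):
    # Per-item threshold for being included in a promising r-itemset.
    return w * alpha + (1 - w) * alpha ** (1.0 / r)

# Level 2: candidates come from L1; the split into C2+ / C2- uses only the
# predicted threshold, with no further dataset scan.
C2 = [set(c) for c in combinations(L1, 2)]
P2 = {i for i in L1 if support[i] >= predictive_minsup(2)}
C2_plus = [c for c in C2 if c <= P2]        # promising candidate 2-itemsets
C2_minus = [c for c in C2 if not c <= P2]   # non-promising candidate 2-itemsets

print(sorted("".join(sorted(c)) for c in C2_plus))   # all ten pairs
print(C2_minus)                                      # [] -> C2- is empty
# This matches STEP 8 and STEP 9 of the example: with w = 0.5 every item of
# L1 exceeds the predicted threshold 0.424, so C2+ = C2 and C2- is empty.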

5. Experiments and results

In order to demonstrate the performance of our method, several experiments are performed. All three methods, the apriori, the (n, p) algorithm, and our method, are implemented in Java using the same data structures and representation. The experiments are performed on a personal computer with a Pentium-IV 2.0 GHz CPU running Windows. We collect different types of datasets using the data generator provided by IBM. The following parameters indicate the contents of a dataset:

• nt: the number of transactions in dataset

• lt: average transaction length in the dataset

• ni: the number of items in the dataset

• np: the number of patterns found in the dataset

• lp: average length of pattern

• cp: correlation between consecutive patterns

• f: average confidence in a rule

• vc: variation in the confidence

For convenience, we use a 9-tuple (D, nt, lt, ni, np, lp, cp, f, vc) to describe the parameters used in generating a dataset D.

5.1. Experiment I

First of all, we test whether our algorithm performs better than the apriori does. The dataset D1 is generated with the parameters (D1, nt, 10, 25, 1000, 4, 0.15, 0.25, 0.1). We run the two programs on datasets of different sizes, nt = 10 K, 20 K, 50 K, and 100 K, and compare the performance by the speedup, computed as (tp − to)/to, where tp is the running time of the apriori and to is the running time of ours. In this experiment, the minsup is set at 15%, and different combinations of the data dependency parameter w and the transaction threshold t are tested.


The experimental result is shown in Fig. 2. It seems that our method is scalable and provides a better computational performance than the apriori with a lower w.

5.2. Experiment II

Next, the dataset D1 is used to test the performance of our method with different values of minsup. We set nt = 10,000. The speedup is illustrated in Fig. 3. Like the (n, p) algorithm, our algorithm has a better performance when minsup is low. This is because the apriori exactly filters out itemsets with supports lower than minsup at each stage and so reduces the number of elements to be considered in generating candidate itemsets. In the (n, p) algorithm and ours, we have to guess which itemsets should be considered in the prediction phase and then verify the predicted ones. The generation of proper subsets and supersets after the verification phase takes a considerable time.

5.3. Experiment III

According to experiments I and II, we find that the performance of our method is sensitive to the contents of the datasets, i.e., the relationships among the data in the datasets. We generated five different datasets, D2, D3, D4, D5, and D6, by changing cp and f and keeping the rest of the parameters the same as in D1. The parameters (cp, f) in D2, D3, D4, D5, and D6 are (0.05, 0.75), (0.05, 0.25), (0.15, 0.75), (0.25, 0.75), and (0.25, 0.25), respectively. Fig. 4 presents the results with 10,000 data in each dataset.

The data dependency parameter w seems to have an effect on the performance of the proposed algorithm, but not on the final large itemsets. A larger w value represents a stronger item relationship existing in the transaction datasets. If the relationships of the data items in transactions are known to be very strong, w may be set at a value close to 1. If the data items in transactions are known to be independent, w may be set at a value close to 0. If the relationships of the data items in transactions are unknown, w may be set at 0.5.

Consider the dataset of Table 1 again. Here are two extreme cases, with w = 0 and w = 1, respectively. Assuming w = 0 and minsup = 30%, we obtain L1 = {a, b, c, d, e} and

C2+ = {ae, be, ce, ac, bc, ab}, C3+ = C4+ = ∅,
C2− = {cd, bd, ad, de}, C3− = {bce, ace, abe, abc}, C4− = ∅.

The itemsets that are incorrectly predicted are cd, bd, de, bce, ace, abe, and abc. L3 and L4 are generated from the supersets of NC− = {cd, bd, de} and NC− = {bce, ace, abe, abc}, respectively. If w = 1, we obtain L1 = {a, b, c, d, e} and

C2+ = C3+ = ∅, C4+ = {abce, abde, acde, bcde, abcd},
C2− = C3− = C4− = ∅.

The itemsets that are incorrectly predicted are abde, acde, and abcd. L2 and L3 are generated from the subsets of NC+ = {abde, acde, abcd}. In both cases, many computations are spent in combining the items from NC+ (NC−).

Fig. 2. Speedup in experimental result I with minsup = 15% (speedup vs. data size for several (w, t) settings).

Fig. 3. Speedup in experimental result II with different minsup (speedup vs. minsup for (w, t) = (0.5, 3) and (0.25, 3)).

[Fig. 4: speedup vs. the data dependency parameter w for datasets D1–D6.]

Generally, w = 0.5 has a better performance than w = 0 or w = 1.

5.4. Experiment IV

Finally, we test whether the number of levels of transactions considered in an iteration affects the performance. The dataset D2 is used for this test. It seems that the best performance appears when three levels are considered in an iteration (see Fig. 5). Also, it is found that the data in D2 are of low dependency (cp = 0.05). This is because the generation of C(i+1)+ from Ci+ has O(m^2) computational complexity, where m is the number of elements in Ci+; the more levels of transactions are considered in an iteration, the larger m may become.

Fig. 5. Experimental result on different levels of transactions.

6. Discussions and conclusions

From the experimental results, different values of the data dependency parameter lead to the same large itemsets, but to different predictive effects. When w = 1, the non-promising candidate sets are predicted very well, but the promising candidate sets are predicted badly, and vice versa for w = 0. By default, we set w = 0.5. If the data dependency relationships in the transactions can be well utilized, our method can improve the overall performance of finding large itemsets. In our experiments, both the (n, p) algorithm and ours suffer from the inefficiency of generating C(i+1)+ from Ci+. When there are many items in the dataset, e.g., the 25 items in D1–D6, and more levels of transactions are considered, more computation is needed in both algorithms. However, our method provides a more accurate approach for predicting itemsets and obtains a better performance than the (n, p) algorithm, especially when p > 2.

In this paper, we have presented a mining algorithm that combines the advantages of the apriori and the (n, p) algorithm in finding large itemsets. As the (n, p) algorithm does, our algorithm reduces the number of dataset scans for finding p levels of large itemsets. A new parameter that considers data dependency is included in our method to filter out early the itemsets that are likely to have lower supports, and thus improves the computational efficiency.

We also conclude that the three algorithms are competitive with each other, each gaining the best performance on different types of datasets. More studies are needed on how to tune the parameters, such as n, p, and the transaction threshold in the (n, p) algorithm and w and t in ours, before the mining task is performed.

Acknowledgements

The authors would also like to thank Mr. Tsung-Te in the Department of Information Management, Shu-Te University, Taiwan, for his help in conducting the experiments.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993a). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914–925.

Agrawal, R., Imielinski, T., & Swami, A. (1993b). Mining association rules between sets of items in large databases. In The ACM SIGMOD conference (pp. 207–216). Washington DC, USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In The international conference on very large data bases (pp. 487–499).

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In The 11th IEEE international conference on data engineering (pp. 3–14).

Agrawal, R., Srikant, R., & Vu, Q. (1997). Mining association rules with item constraints. In The third international conference on knowledge discovery and data mining (pp. 67–73). Newport Beach, California.

Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent patterns with counting inference. ACM SIGKDD Explorations, 2(2), 66–75.

Bykowski, A., & Rigotti, C. (2001). A condensed representation to find frequent patterns. In The ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems. Santa Barbara, California, USA.

Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866–883.

Denwattana, N., & Getta, J. R. (2001). A parameterised algorithm for mining association rules. In The 12th Australasian database conference (pp. 45–51).

Ezeife, C. I. (2002). Mining incremental association rules with generalized FP-tree. In The 15th conference of the Canadian society for computational studies of intelligence on advances in artificial intelligence (pp. 147–160).

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In The 21st international conference on very large data bases (pp. 420–431). Zurich, Switzerland.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In The 2000 ACM SIGMOD international conference on management of data (pp. 1–12).

IBM, The Intelligent Information Systems Research (Quest) Group, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html.

Koh, J. L., & Shieh, S. F. (2004). An efficient approach for maintaining association rules based on adjusting FP-tree structures. In The 9th international conference on database systems for advanced applications (pp. 417–424).

Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In The AAAI workshop on knowledge discovery in databases (pp. 181–192).


Park, J. S., Chen, M. S., & Yu, P. S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Trans-actions on Knowledge and Data Engineering, 9(5), 812–825.

Pei, J., Han, J., & Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In The 2000 ACM SIGMOD workshop on research issues in data mining and knowledge discovery (DMKD'00). Dallas, TX, USA.

Qiu, Y., Lan, Y. J., & Xie, Q. S. (2004). An improved algorithm of mining from FP-tree. In The third international conference on machine learning and cybernetics (pp. 26–29).

Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In The 21st international conference on very large data bases (pp. 407–419), Zurich, Switzerland.

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. In The 1996 ACM SIGMOD international conference on management of data (pp. 1–12). Montreal, Canada.

Wojciechowski, M., & Zakrzewicz, M. (2002). Dataset filtering techniques in constraint-based frequent pattern mining. In Pattern detection and discovery. London, UK.

Zaiane, O. R., & Mohammed, E. H. (2003). COFI-tree mining: A new approach to pattern growth with reduced candidacy generation. In The IEEE international conference on data mining.

Zaki, M. J., & Hsiao, C. J. (2002). CHARM: An efficient algorithm for closed itemset mining. In The second SIAM international conference on data mining. Arlington.
