
An improved data mining approach using predictive itemsets

Tzung-Pei Hong a,*, Chyan-Yuan Horng b, Chih-Hung Wu c, Shyue-Liang Wang d

a Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
b Institute of Information Engineering, I-Shou University, Kaohsiung 840, Taiwan, ROC
c Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC
d Department of Information Management, National University of Kaohsiung, Kaohsiung 811, Taiwan, ROC

* Corresponding author. E-mail addresses: tphong@nuk.edu.tw (T.-P. Hong), hcy@ms4.url.com.tw (C.-Y. Horng), johnw@nuk.edu.tw (C.-H. Wu), slwang@nuk.edu.tw (S.-L. Wang).

Abstract

In this paper, we present a mining algorithm that improves the efficiency of finding large itemsets. Based on the concept of prediction proposed in the (n, p) algorithm, our method considers the data dependency in the given transactions to predict promising and non-promising candidate itemsets. It estimates for each level a different support threshold, derived from a data dependency parameter, and uses it to decide directly whether an item should be included in a promising candidate itemset. In this way, the efficiency of finding large itemsets is maintained by reducing both the number of scans of the input dataset and the number of candidate itemsets. Experimental results show that our method is more efficient than the apriori and the (n, p) algorithms when the minimum support value is small.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Data mining; Association rule; Predictive itemset; Data dependency; Predicting minimum support

1. Introduction

Years of effort in data mining have produced a variety of efficient techniques (Chen, Han, & Yu, 1996). Depending on the types of datasets processed, mining approaches may be classified as working on transaction datasets, temporal datasets, relational datasets, or multimedia datasets, among others. On the other hand, depending on the classes of knowledge derived, mining approaches may be classified as finding association rules, classification rules, clustering rules, or sequential patterns (Agrawal & Srikant, 1995), etc. Among these techniques, finding association rules from transaction datasets is usually an essential task (Agrawal, Imielinski, & Swami, 1993b; Agrawal & Srikant, 1994; Agrawal, Srikant, & Vu, 1997; Ezeife, 2002; Han & Fu, 1995; Mannila, Toivonen, & Verkamo, 1994; Park, Chen, & Yu, 1997; Srikant & Agrawal, 1995, 1996; Wojciechowski & Zakrzewicz, 2002).

Many algorithms for mining association rules from transactions have been proposed, most of which are executed in a level-wise process. That is, itemsets containing single items are processed first, then itemsets with two items are processed. The process is repeated, adding one more item at each level, until some criteria are met. The famous apriori mining algorithm was proposed by Agrawal et al. (1993a, 1993b). The apriori iterates two phases, the phase of candidate generation and the phase of verification. Possible large itemsets are produced in the first phase and verified in the second phase by scanning the input dataset. Since itemsets are processed level by level and the dataset has to be scanned at each level, the verification phase dominates the performance. Han, Pei, and Yin (2000) then proposed the Frequent-Pattern-tree (FP-tree) structure for efficiently mining association rules without generating candidate itemsets. The FP-tree is used to compress a database into a tree structure which stores only the large items. Several other algorithms based on the FP-tree structure have also been proposed. For example, Qiu, Lan, and Xie (2004) proposed the QFP-growth mining approach to mine association rules. Zaiane and Mohammed (2003) proposed the COFI-tree structure to replace the conditional FP-tree. Ezeife (2002) constructed a generalized FP-tree, which stores all the large and non-large items, for incremental mining without rescanning databases. Koh and Shieh (2004) also adjusted FP-trees based on two support thresholds, but with a more complex adjusting procedure that spends more computation time than the one proposed in this paper.

Denwattana and Getta (2001) proposed an algorithm (referred to as the (n, p) algorithm) to reduce the number of scans of the input dataset when finding large itemsets. The (n, p) algorithm also iterates two phases, the phase of prediction and the phase of verification. Unlike the apriori, the (n, p) algorithm predicts large itemsets for p levels in the first phase and verifies all these p levels of itemsets in the second phase. A heuristic estimation method is presented to predict the possibly large itemsets. If the prediction is valid, the approach is efficient in finding the actually large itemsets.

In this paper, we propose a mining algorithm to improve the efficiency of finding large itemsets. Our approach is based on the concept of prediction presented in the (n, p) algorithm and considers the data dependency among transactions. As the (n, p) algorithm does, our method iterates the same two phases, but it uses a new estimation method to predict promising and non-promising candidate itemsets flexibly. The estimation mechanism computes for each level a different support threshold derived from a data dependency parameter and determines directly, from the support values of the items, whether an item should be included in a promising candidate itemset. Since the new estimation mechanism reduces the number of candidate itemsets to be verified, and the prediction concept of the (n, p) algorithm reduces the number of scans of the input dataset, the performance of finding large itemsets can be improved.

The rest of this paper is organized as follows. Section 2 presents the related works on finding large itemsets; the apriori algorithm and the (n, p) algorithm are reviewed. In Section 3, we describe our motivation and the theoretical foundation of our method. A detailed description of our algorithm is given in Section 4, together with a simple example. Experimental results and a comparison of the performance of the apriori, the (n, p) algorithm, and our method are shown in Section 5. Conclusions are given in Section 6.

2. Related works

One application of data mining is to induce association rules from transaction data, such that the presence of certain items in a transaction implies the presence of certain other items. Below we briefly review the apriori and the (n, p) algorithms.

2.1. The apriori algorithm

Agrawal and Srikant (1994) and Agrawal et al. (1993b, 1997) proposed the famous apriori mining algorithm, based on the concept of large itemsets, to find association rules in transaction data. The apriori iterates two phases, the phase of candidate generation and the phase of verification, at each level. At the ith level, i > 1, itemsets consisting of i items are processed. In candidate generation, all possible large i-itemsets are produced by combining the unrepeated elements of the (i − 1)-itemsets. In the verification phase, the input dataset is scanned, and if the number of times an i-itemset appears in the transactions is larger than a pre-defined threshold (called the minimum support, or minsup), the itemset is considered large. After that, these two phases iterate for the (i + 1)th level until all large itemsets are found.

Suppose that we have a dataset containing six transactions, as shown in Table 1, which has two features, transaction identification (ID) and transaction description (Items). There are eight items in the dataset. Assume that minsup = 30%. The transaction dataset is first scanned to count the candidate 1-itemsets. Since the counts of the items a (4), b (5), c (6), d (3), and e (4) are larger than 6 * 30% = 1.8, these items are put into the set of large 1-itemsets. Candidate 2-itemsets are then formed from these large 1-itemsets by taking any two items in the large 1-itemsets and counting whether their occurrences are larger than or equal to 1.8. Therefore, ab, ac, ae, bc, bd, be, cd, ce, and de form the set of large 2-itemsets. In a similar way, abc, abe, ace, bce, bcd, bde, and cde form the set of large 3-itemsets, and abce and bcde form the set of large 4-itemsets.
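As a concrete illustration of this level-wise process, the following is a minimal Python sketch of the generic apriori idea applied to the Table 1 transactions (our own illustration, not the authors' implementation; the candidate join is simplified to unions of large itemsets, without the usual prune step). It reproduces the large itemsets listed above:

# Transactions from Table 1; minsup = 30% of 6 transactions = 1.8 occurrences.
transactions = ["abc", "bcdef", "abceg", "acd", "bcde", "abce"]
minsup_count = 0.30 * len(transactions)

def count(itemset):
    # Number of transactions containing every item of the candidate itemset.
    return sum(1 for t in transactions if set(itemset) <= set(t))

# Level 1: one dataset scan gives the large 1-itemsets {a, b, c, d, e}.
items = sorted({i for t in transactions for i in t})
large = [frozenset(i) for i in items if count(i) >= minsup_count]

level = 1
while large:
    print(level, sorted("".join(sorted(s)) for s in large))
    # Candidate generation (simplified join: unions of two large itemsets
    # with level + 1 items), then verification by another dataset scan.
    candidates = {a | b for a in large for b in large if len(a | b) == level + 1}
    large = [c for c in candidates if count(c) >= minsup_count]
    level += 1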

2.2. Denwattana and Getta’s approach

In Denwattana and Getta (2001), the (n, p) algorithm tries to reduce the number of dataset scans and improve the efficiency of finding large itemsets by "guessing" which itemsets should be considered. The approach partitions candidate itemsets into two parts, positive candidate itemsets and negative candidate itemsets, where the former contains itemsets guessed to be large and the latter contains itemsets guessed to be small. Initially, the (n, p) algorithm scans the dataset to find the large 1-itemsets. Two parameters, called the n-element transaction threshold tt and the frequency threshold tf, are used to judge whether an item may compose a positive candidate itemset. According to the n-element transaction threshold tt, only the transactions with item numbers (lengths) less than or equal to tt are considered. The frequency of each item appearing in transactions with j items, j ≤ tt, is computed. If the appearing frequency of an item is larger than or equal to tf, the item may be used to compose a positive candidate itemset. Then, two phases, the prediction process and the verification process, are iterated. In the prediction process, the positive candidate 2-itemsets C2+, each of which has both items satisfying the above criteria, are formed. The remaining candidate 2-itemsets not in C2+ form C2−. C3+ and C3− are then formed from C2+ alone in a similar way. The positive candidate 2-itemsets which are subsets of itemsets in C3+ are then removed from C2+. The same process is repeated until p levels of positive and negative candidate itemsets are formed. After that, the verification process checks, by scanning the dataset once, whether the itemsets in the positive candidate sets are actually large and whether the itemsets in the negative candidate sets are actually small. The incorrectly guessed itemsets are then expanded and processed by scanning the dataset again.

Table 1
A sample dataset of transactions

ID   Items
1    abc
2    bcdef
3    abceg
4    acd
5    bcde
6    abce
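To make the positive-item selection rule concrete before the worked example below, here is a small Python sketch using the Table 1 transactions (our own illustration, not the authors' code). It counts item occurrences per transaction length and marks an item as positive when its count reaches the tf fraction of the transactions of some length not exceeding tt:

from collections import Counter, defaultdict

# Transactions of Table 1 and the thresholds used in the running example.
transactions = ["abc", "bcdef", "abceg", "acd", "bcde", "abce"]
tt, tf = 5, 0.80   # n-element transaction threshold and frequency threshold

# For every transaction length <= tt, count item occurrences and the number
# of transactions of that length (cf. Table 2).
counts = defaultdict(Counter)
num_tx = Counter()
for t in transactions:
    if len(t) <= tt:
        num_tx[len(t)] += 1
        counts[len(t)].update(set(t))

# An item is "positive" if, for some length, its occurrence count reaches
# tf of the transactions of that length.
positive = {item
            for length, c in counts.items()
            for item, occ in c.items()
            if occ >= tf * num_tx[length]}
print(sorted(positive))   # -> ['a', 'b', 'c', 'e'] for this dataset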

Consider the dataset shown in Table 1 again. Suppose that minsup = 30%, the n-element transaction threshold tt = 5, and (n, p) = (5, 3). The numbers of occurrences of each item in transactions with different lengths are shown in Table 2.

Assume the frequency threshold tf = 80%. The occurrence number of an item in transactions with 3 items must be larger than or equal to 1.6 (2 * 80%) for the item to be considered positive. Thus, items a and c are positive. Similarly, b, c, and e are positive for transactions with 4 items, and b, c, and e are positive for transactions with 5 items. The items which can be used to compose positive candidate itemsets are thus the union {a, b, c, e} of the above three sets. We obtain C2+ = {bc, be, ab, ce, ac, ae} and C2− = {bd, bf, bg, cd, cf, cg, de, ef, eg, ad, af, ag, df, dg, fg}. Then, C3+ and C3− are formed from C2+ as C3+ = {bce, abc, abe, ace} and C3− = ∅. C4+ and C4− are formed from C3+ as C4+ = {abce} and C4− = ∅. The elements in C2+ which are subsets of those in C3+ are removed; thus, we obtain C2+ = ∅. Similarly, the itemsets in C3+ which are subsets of those in C4+ are removed; therefore, C3+ = ∅.

In the verification process, the dataset is scanned to check all the itemsets in Cj+ and Cj−, j = 2, 3, 4. After the scan, the set of large itemsets in the positive candidate sets is {abce}. The set of small itemsets in the negative candidate sets is {bf, bg, cf, cg, ef, eg, ad, af, ag, df, dg, fg}. The itemsets de, cd, and bd in C2− are incorrectly predicted. The incorrectly predicted itemsets are further processed. The subsets of the incorrectly predicted itemsets in the positive candidate sets are first generated; these subsets are then pruned using the large itemsets already known. Since C4+ is correctly predicted, the result is ∅. The supersets of the incorrectly predicted itemsets in the negative candidate sets are also generated, as {bcd, bde, cde, bcde}. The dataset is then scanned again to check these itemsets. The itemsets bcd, bde, cde, and bcde are found to be large in the second scan. All the large itemsets are then collected as

L2 = {bd, cd, de, ae, ac, ce, ab, be, bc},
L3 = {cde, bde, bcd, ace, abe, abc, bce},
L4 = {bcde, abce}.

The same process goes on for finding L5 to L7. C5 is first generated from L4 as {abcde} and checked in the same manner. After all the large itemsets are found, the association rules can be derived from them.

3. Motivation

3.1. Observation

The apriori algorithm is straightforward and simple but needs to scan the dataset repeatedly while finding the large itemsets. Many approaches have been proposed to reduce the number of dataset scans, and the (n, p) algorithm is one of them. It scans the input dataset once in the beginning and twice in each verification process. Since p levels of itemsets are processed in an iteration, the number of dataset scans is smaller than that in the apriori. In the above example, the apriori scans the dataset 10 times, while the (n, p) algorithm scans it five times. In Denwattana and Getta (2001), it is claimed that the number of dataset scans can be reduced with proper parameters n and p.

Unfortunately, in some cases, the overall performance of the (n, p) algorithm is degraded by the verification process. In Ci+ and Ci−, i ≥ 2, all itemsets with lower supports are rejected from Ci+ and collected into Ci−. When generating C(i+1)+ and C(i+1)−, all elements in Ci+ and Ci− are taken into account, respectively. If a large number of itemsets have supports lower than the pre-defined minsup and are included in Ci−, the (n, p) algorithm also has to compute all combinations of such itemsets and delete them all in the verification process. This affects the overall performance of the (n, p) algorithm. On the other hand, the apriori considers only the itemsets with supports higher than minsup. For example, suppose the (n, p) algorithm guesses that 200 out of 1000 2-itemsets are large and includes them in C2+, with the remaining 800 itemsets in C2−, and that 70% of the elements in C2+ and 99% of the elements in C2− are not actually large. It then takes about (800 × 99%)^(2(k−1)) computations to generate Ck, k = 2, 3, …, 2 + p, and most elements of Ck are removed in the verification phase. In the apriori, if 200 × 120% 2-itemsets are actually large (the (n, p) algorithm may make a wrong guess), only (200 × 120%)^2 computations are needed in generating C3, and far fewer than (200 × 120%)^2 itemsets are left after verifying their supports by scanning the dataset. Though the number of dataset scans in the (n, p) algorithm is smaller than that in the apriori, the overall performance of the (n, p) algorithm may not be good enough. Experimental results are available in Section 5. Therefore, the performance of the (n, p) algorithm can be further improved. One direction of improvement is to employ better data structures, so that the generation of C(i+1)+ and C(i+1)− can be done in a more efficient manner, as in Bykowski and Rigotti (2001), Bastide, Taouil, Pasquier, Stumme, and Lakhal (2000), Koh and Shieh (2004), Pei, Han, and Mao (2000) and Zaki and Hsiao (2002). The other direction is to filter out the itemsets with lower supports and reduce the number of elements to be included in Ci+ and Ci−. This paper adopts the latter approach from a probabilistic point of view.

Table 2
The number of occurrences of each item in transactions of different lengths

Item                      Length = 3   Length = 4   Length = 5
a                         2            1            1
b                         1            2            2
c                         2            2            2
d                         1            1            1
e                         0            2            2
f                         0            0            1
g                         0            0            1
Number of transactions    2            2            2

3.2. Theoretical foundation

From a probabilistic viewpoint, items usually need larger support values to be covered in large itemsets with more items. For example, if minsup is 30%, then an item with a support just greater than 30% forms a large 1-itemset. Both items in a 2-itemset must, however, have supports a little larger than 0.3 for this 2-itemset to be large with a high probability. At one extreme, if total dependency relations exist in the transactions, an appearance of one item will certainly imply the appearance of another. In this case, the support thresholds for an item to appear in large itemsets with different numbers of items are the same. At the other extreme, if the items are totally independent, then the support thresholds for an item to appear in large itemsets with different numbers of items should be set at different values. In this case, the support threshold for an item to appear in a large itemset with r items can be easily derived as follows.

Since all the r items in a large r-itemset must be large 1-itemsets, all the supports of the r items must be larger than or equal to the pre-defined minsup α. Since the items are assumed totally independent, the support of the r-itemset is s1 · s2 · … · sr, where si is the actual support of the ith item in the itemset. If this r-itemset is large, its support must be larger than or equal to α. Thus

s1 · s2 · … · sr ≥ α.

If the predictive support threshold for an item to appear in a large r-itemset is αr, then

s1 · s2 · … · sr ≥ αr · αr · … · αr ≥ α.

Thus

αr^r ≥ α and αr ≥ α^(1/r).

Therefore, if the items are totally independent, the support threshold of an item should be expected to be α^(1/r) for it to be included in a large r-itemset. Since transactions are seldom totally dependent or totally independent, a data dependency parameter w, ranging between 0 and 1, is then used to calculate the predictive support threshold of an item for appearing in a large r-itemset as w·α + (1 − w)·α^(1/r). A larger w value represents a stronger item relationship existing in the transactions; w = 1 means total dependency of transaction items, and w = 0 means total independency. The proposed approach thus uses different predictive support thresholds for an item to be included in promising itemsets with different numbers of items.

4. Our method

4.1. Our algorithm

The proposed mining algorithm aims at efficiently finding any p levels of large itemsets by scanning the dataset twice, except for the first level. The support of each item obtained from the first dataset scan is used directly to predict whether the item will appear in a promising itemset. The proposed method uses a higher predicting minsup for an item to be included in a promising itemset with more items; itemsets with different numbers of items thus have different predicting minsups for an item. A predicting minsup is calculated as a weighted average of the possible minsups for totally dependent data and for totally independent data, with a data dependency parameter, ranging between 0 and 1, used as the weight.

A mining process similar to that proposed in Denwattana and Getta (2001) can then be adopted to find the p levels of large itemsets. The dataset is first scanned to get the support of each item. If the support of an item is larger than or equal to the pre-defined minsup value, it is put in the set of large 1-itemsets L1. After the large 1-itemsets have been found, any further p levels of large itemsets can be obtained by scanning the dataset twice. Candidate 2-itemsets (C2) are formed by combining the items in the large 1-itemsets. In the meantime, the predicting minsup for an item to be included in large 2-itemsets is estimated according to the given data dependency parameter. If the support of an item in L1 is smaller than the predicting minsup, any candidate 2-itemset including this item will, with high probability, not be large. On the contrary, any candidate 2-itemset in which the supports of all items are larger than or equal to the predicting minsup has a high probability of being large. The candidate 2-itemsets can then be partitioned into two parts, C2+ and C2−, according to whether the supports of all the items in an itemset are larger than or equal to the predicting minsup.

After the promising candidate 2-itemsets (C2+) are generated, candidate 3-itemsets (C3) are formed by combining them. Similar to the process for finding candidate 2-itemsets, a new predicting minsup for an item to be included in promising 3-itemsets is calculated. Candidate 3-itemsets can then be partitioned into two parts, C3+ and C3−, by comparing the supports of the items included in C3 with the predicting minsup. The same procedure is repeated until the p levels of itemsets are processed. Therefore, no dataset scan except the one for the first level has been done so far. The dataset is then scanned to find the actually large itemsets among the promising candidate itemsets and among the non-promising candidate itemsets. The itemsets incorrectly predicted in the p levels are further processed by scanning the dataset once more. Therefore, a total of two dataset scans are needed to get the p levels of large itemsets. After that, another processing phase of p levels is done. This processing is repeated until no large itemsets are found in a phase. Let ci be the number of occurrences of item ai in the input dataset of transactions, and let the support of each item ai be ci/n. The details of our algorithm are presented in Fig. 1.

Fig. 1. The proposed mining algorithm.

INPUT: A set of n transactions with m items, the minsup α, a dependency parameter w, and a level number p.
OUTPUT: Large itemsets.

STEP 1: Check whether the support αi of each item ai is larger than or equal to α. If αi ≥ α, put ai in the set of large 1-itemsets L1.
STEP 2: Set r = 1, where r is the number of items in the itemsets currently being processed.
STEP 3: Set r' = 1, where r' is the number of items at the end of the last iteration.
STEP 4: Set Pr = Lr, where Pr is the set of items predicted to be included in r-itemsets.
STEP 5: Generate the candidate set Cr+1 from Lr in a way similar to that in the apriori.
STEP 6: Set r = r + 1.
STEP 7: Check whether the support αi of each item ai in P(r−1) is larger than or equal to the predicting minsup α' = wα + (1 − w)α^(1/r) for being included in Pr. If αi ≥ α', put ai in Pr.
STEP 8: Form the promising candidate itemsets Cr+ by choosing from Cr the itemsets with every item existing in Pr.
STEP 9: Set the non-promising candidate itemsets Cr− = Cr − Cr+.
STEP 10: Set r = r + 1.
STEP 11: Generate the candidate set Cr from C(r−1)+ in the way the apriori does.
STEP 12: Check whether the support αi of each item ai in P(r−1) is larger than or equal to the predicting minsup α' = wα + (1 − w)α^(1/r); if αi ≥ α', put ai in Pr. Form the promising candidate itemsets Cr+ by choosing from Cr the itemsets with every item existing in Pr (as in STEP 8).
STEP 13: Set the non-promising candidate itemsets Cr− = Cr − Cr+.
STEP 14: Remove the itemsets in C(r−1)+ which are subsets of any itemset in Cr+.
STEP 15: Repeat STEP 10 to STEP 14 until r = r' + p.
STEP 16: Scan the dataset to check whether the promising candidate itemsets C(r'+1)+ to Cr+ are actually large and whether the non-promising candidate itemsets C(r'+1)− to Cr− are actually small. Put the actually large itemsets in the corresponding sets L(r'+1) to Lr.
STEP 17: Find all the proper subsets with r' + 1 to i items for each itemset which is not large in Ci+, r' + 1 ≤ i ≤ r; keep the proper subsets which are not among the existing large itemsets; denote them as NC+.
STEP 18: Find all the proper supersets with i to r items for each itemset which is large in Ci−, r' + 1 ≤ i ≤ r; the supersets must also have all their sub-itemsets of r' items existing in Lr' and cannot include any sub-itemset among the non-large itemsets in Ci+ and Ci− checked in STEP 16; denote them as NC−.
STEP 19: Scan the dataset to check whether the itemsets in NC+ and NC− are large; add the large itemsets to the corresponding sets L(r'+1) to Lr.
STEP 20: If Lr is not null, set r' = r' + p and go to STEP 4 for another iteration; otherwise, do STEP 21.
STEP 21: Add the non-redundant subsets of the large itemsets to the corresponding sets L2 to Lr.

4.2. An example

In this section, the dataset in Table 1 is used to illustrate our method. Assume that minsup α = 30% and w = 0.5 in this example. The following steps are performed:

STEP 1: The support of each item is compared with the minsup α. Since the supports of {a}, {b}, {c}, {d}, and {e} are larger than or equal to 0.30, they are put in L1.
STEP 2: r = 1.
STEP 3: r' = 1.
STEP 4: P1 is the same as L1, which is {a, b, c, d, e}.
STEP 5: The candidate set C2 is formed from L1 as C2 = {ae, be, ce, de, ad, bd, cd, ac, bc, ab}.
STEP 6: r = r + 1 = 2.
STEP 7: The predicting minsup value for each item is α' = 0.5 × 0.30 + (1 − 0.5) × 0.30^(1/2) = 0.424. The support of each item in P1 is then compared with 0.424. Since the supports of {a}, {b}, {c}, {d}, and {e} are larger than 0.424, P2 is {a, b, c, d, e}.

STEP 8: The itemsets in C2 with every item existing in P2 are chosen to form the promising candidate itemsets C2+. Thus C2+ = {ae, be, ce, de, ad, bd, cd, ac, bc, ab}.
STEP 9: The non-promising candidate itemsets are found as C2− = C2 − C2+ = ∅.
STEP 10: r = r + 1 = 3.
STEP 11: The candidate set C3 is formed from C2+ as C3 = {abe, ace, ade, bce, bde, cde, abd, acd, bcd, abc}.
STEP 12: The predicting minsup is α' = 0.5 × 0.30 + (1 − 0.5) × 0.30^(1/3) = 0.485. The support of each item in P2 is then compared with 0.485. Since the supports of {a}, {b}, {c}, {d}, and {e} are larger than 0.485, P3 is {a, b, c, d, e}. Thus, we have C3+ = C3.
STEP 13: The non-promising candidate itemsets are found as C3− = C3 − C3+ = ∅.
STEP 14: Since all itemsets in C2+ are subsets of itemsets in C3+, they are removed; C2+ = ∅.
STEP 15: Since r (= 3) < r' + p (= 4), STEP 10 to STEP 14 are repeated. r = 3 + 1 = 4. The candidate set C4 is then formed from C3+ as C4 = {abce, abde, acde, bcde, abcd}. The predicting minsup is α' = 0.5 × 0.30 + (1 − 0.5) × 0.30^(1/4) = 0.520. The support of each item in P3 is then compared with 0.520. Since the supports of {a}, {b}, {c}, and {e} are larger than 0.520, P4 is {a, b, c, e}. C4+ is thus formed as C4+ = {abce}. The non-promising candidate itemsets are found as C4− = C4 − C4+ = {abde, acde, bcde, abcd}. The itemsets in C3+ which are subsets of itemsets in C4+ are removed from C3+. Thus, we obtain C3+ = {ade, bde, cde, abd, acd, bcd}.
STEP 16: The dataset is scanned to check whether the promising candidate itemsets C2+ to C4+ are actually large and whether the non-promising candidate itemsets C2− to C4− are actually small. The itemsets ad in C2+, ade, abd, and acd in C3+, and bcde in C4− are incorrectly predicted. After deleting ad, ade, abd, and acd from C2+ and C3+, the remaining elements of C2+, C3+, and C4+ are put into L2, L3, and L4, respectively. Also, bcde is put into L4.
STEP 17: The proper subsets of the itemsets incorrectly predicted in Ci+ are generated. Since {ad, ade, abd, acd} is incorrectly predicted in this example, the proper subsets not among the existing large itemsets are {ad}. Thus NC+ = {ad}.
STEP 18: The proper supersets of the itemsets incorrectly predicted in Ci− are generated. Since only {bcde} is incorrectly predicted in this example, its proper supersets within the processed levels that are not among the existing large itemsets form the empty set. Thus NC− = ∅.
STEP 19: The dataset is scanned to find the large itemsets in NC+ and NC−. Since {ad} turns out not to be large after verifying against the dataset, the large itemsets L2 to L4 are found as L2 = {ab, bc, ac, cd, bd, de, ce, be, ae}, L3 = {abc, bcd, cde, bde, bce, ace, abe}, and L4 = {bcde, abce}.
STEP 20: Since L4 is not null, the next iteration is executed. STEP 4 to STEP 19 are then repeated for L5 to L7. We then have C5 = {abcde} and L5 = L6 = L7 = ∅.
STEP 21: The non-redundant subsets of the found large itemsets are added to the corresponding sets L2 to L4. The final large itemsets L2 to L4 are as follows:

L2 = {ab, bc, ac, cd, bd, de, ce, be, ae},
L3 = {abc, bcd, cde, bde, bce, ace, abe},
L4 = {bcde, abce}.
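To summarize the prediction phase in code, the following simplified Python sketch (our own illustration, not the authors' Java implementation) shows how candidate 2-itemsets are split into promising and non-promising parts using the per-item predicting minsup; the verification and correction steps of Fig. 1 are omitted:

from itertools import combinations

# Transactions of Table 1 and the parameters of the example (alpha = 30%, w = 0.5).
transactions = ["abc", "bcdef", "abceg", "acd", "bcde", "abce"]
n = len(transactions)
alpha, w = 0.30, 0.5

# Item supports from the single initial dataset scan.
items = {i for t in transactions for i in t}
support = {i: sum(i in t for t in transactions) / n for i in items}
L1 = sorted(i for i, s in support.items() if s >= alpha)

def predictive_minsup(r):
    # Per-item threshold for being included in a promising r-itemset.
    return w * alpha + (1 - w) * alpha ** (1.0 / r)

# Level 2: candidates come from L1; the split into C2+ / C2- uses only the
# predicted threshold, with no further dataset scan.
C2 = [set(c) for c in combinations(L1, 2)]
P2 = {i for i in L1 if support[i] >= predictive_minsup(2)}
C2_plus = [c for c in C2 if c <= P2]        # promising candidate 2-itemsets
C2_minus = [c for c in C2 if not c <= P2]   # non-promising candidate 2-itemsets

print(sorted("".join(sorted(c)) for c in C2_plus))   # all ten pairs
print(C2_minus)                                      # [] -> C2- is empty
# This matches STEP 8 and STEP 9 of the example: with w = 0.5 every item of
# L1 exceeds the predicted threshold 0.424, so C2+ = C2 and C2- is empty.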

5. Experiments and results

In order to demonstrate the performance of our method, several experiments are performed. All three methods, the apriori, the (n, p) algorithm, and our method, are implemented in Java using the same data structures and representation. The experiments are performed on a personal computer with a Pentium-IV 2.0 GHz CPU running Windows. We collect different types of datasets using the data generator provided by IBM. The following parameters indicate the contents of a dataset:

• nt: the number of transactions in dataset

• lt: average transaction length in the dataset

• ni: the number of items in the dataset

• np: the number of patterns found in the dataset

• lp: average length of pattern

• cp: correlation between consecutive patterns

• f: average confidence in a rule

• vc: variation in the confidence

For convenience, we use a 9-tuple (D, nt, lt, ni, np, lp, cp, f, vc) to describe the parameters used in generating a dataset D.

5.1. Experiment I

First of all, we test whether our algorithm performs better than the apriori does. The dataset D1 is generated with the parameters (D1, nt, 10, 25, 1000, 4, 0.15, 0.25, 0.1). We run the two programs on datasets of different sizes, nt = 10 K, 20 K, 50 K, and 100 K, and compare the performance by the speedup, computed as (tp − to)/to, where tp is the running time of the apriori and to is the running time of ours. In this experiment, the minsup is set at 15%, and different combinations of the data dependency parameter w and the transaction threshold t are tested.


The experimental result is shown in Fig. 2. It seems that our method is scalable and provides a better computational performance than the apriori with a lower w.

5.2. Experiment II

Next, the dataset D1 is used to test the performance of our method with different values of minsup. We set nt = 10,000. The speedup is illustrated in Fig. 3. Like the (n, p) algorithm, our algorithm has a better performance when minsup is low. This is because the apriori exactly filters out itemsets with supports lower than minsup at each stage and so reduces the number of elements to be considered in generating candidate itemsets. In the (n, p) algorithm and ours, we have to guess which itemsets should be considered in the prediction phase and then verify the predicted ones. The generation of proper subsets and supersets after the verification phase takes a considerable time.

5.3. Experiment III

According to experiments I and II, we find that the performance of our method is sensitive to the contents of the datasets, i.e., the relationships among the data in the datasets. We generated five different datasets, D2, D3, D4, D5, and D6, by changing cp and f and keeping the rest of the parameters the same as in D1. The parameters (cp, f) in D2, D3, D4, D5, and D6 are (0.05, 0.75), (0.05, 0.25), (0.15, 0.75), (0.25, 0.75), and (0.25, 0.25), respectively. Fig. 4 presents the results with 10,000 data in each dataset.

The data dependency parameter w seems to have an effect on the performance of the proposed algorithm, but not on the final large itemsets. A larger w value represents a stronger item relationship existing in the transaction datasets. If the relationships of the data items in transactions are known to be very strong, w may be set at a value close to 1. If the data items in transactions are known to be independent, w may be set at a value close to 0. If the relationships of the data items in transactions are unknown, w may be set at 0.5.

Consider the dataset of Table 1 again. Here are two extreme cases, with w = 0 and w = 1, respectively. Assuming w = 0 and minsup = 30%, we obtain L1 = {a, b, c, d, e} and

C2+ = {ae, be, ce, ac, bc, ab}, C3+ = C4+ = ∅,
C2− = {cd, bd, ad, de}, C3− = {bce, ace, abe, abc}, C4− = ∅.

The itemsets that are incorrectly predicted are cd, bd, de, bce, ace, abe, and abc. L3 and L4 are generated from the supersets of NC− = {cd, bd, de} and NC− = {bce, ace, abe, abc}, respectively. If w = 1, we obtain L1 = {a, b, c, d, e} and

C2+ = C3+ = ∅, C4+ = {abce, abde, acde, bcde, abcd},
C2− = C3− = C4− = ∅.

The itemsets that are incorrectly predicted are abde, acde, and abcd. L2 and L3 are generated from the subsets of NC+ = {abde, acde, abcd}. In both cases, many computations are spent in combining the items from NC+ (NC−).

Fig. 2. Speedup in experimental result I with minsup = 15% (speedup vs. data size for several (w, t) settings).

Fig. 3. Speedup in experimental result II with different minsup (speedup vs. minsup for (w, t) = (0.5, 3) and (0.25, 3)).

[Fig. 4: speedup vs. the data dependency parameter w for datasets D1–D6.]

Generally, w = 0.5 has a better performance than w = 0 or w = 1.

5.4. Experiment IV

Finally, we test whether the number of levels of transactions considered in an iteration affects the performance. The dataset D2 is used for this test. It seems that the best performance appears when three levels are considered in an iteration (see Fig. 5). Also, it is found that the data in D2 are of low dependency (cp = 0.05). This is because the generation of C(i+1)+ from Ci+ has O(m^2) computational complexity, where m is the number of elements in Ci+; the more levels of transactions are considered in an iteration, the larger m may become.

Fig. 5. Experimental result on different levels of transactions.

6. Discussions and conclusions

From the experimental results, different values of the data dependency parameter lead to the same large itemsets, but to different predictive effects. When w = 1, the non-promising candidate sets are predicted very well, but the promising candidate sets are predicted badly, and vice versa for w = 0. By default, we set w = 0.5. If the data dependency relationships in the transactions can be well utilized, our method can improve the overall performance of finding large itemsets. In our experiments, both the (n, p) algorithm and ours suffer from the inefficiency of generating C(i+1)+ from Ci+. When there are many items in the dataset, e.g., the 25 items in D1–D6, and more levels of transactions are considered, more computation is needed in both algorithms. However, our method provides a more accurate approach for predicting itemsets and obtains a better performance than the (n, p) algorithm, especially when p > 2.

In this paper, we have presented a mining algorithm that combines the advantages of the apriori and the (n, p) algorithm in finding large itemsets. As the (n, p) algorithm does, our algorithm reduces the number of dataset scans for finding p levels of large itemsets. A new parameter that considers data dependency is included in our method to filter out early the itemsets that are likely to have lower supports, and thus improves the computational efficiency.

We also conclude that the three algorithms are competitive with each other, each gaining the best performance on different types of datasets. More studies are needed on how to tune the parameters, such as n, p, and the transaction threshold in the (n, p) algorithm and w and t in ours, before the mining task is performed.

Acknowledgements

The authors would also like to thank Mr. Tsung-Te in the Department of Information Management, Shu-Te University, Taiwan, for his help in conducting the experiments.

References

Agrawal, R., Imielinski, T., & Swami, A. (1993a). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914–925.

Agrawal, R., Imielinski, T., & Swami, A. (1993b). Mining association rules between sets of items in large databases. In The ACM SIGMOD conference (pp. 207–216). Washington DC, USA.

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In The international conference on very large data bases (pp. 487–499).

Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In The 11th IEEE international conference on data engineering (pp. 3–14).

Agrawal, R., Srikant, R., & Vu, Q. (1997). Mining association rules with item constraints. In The third international conference on knowledge discovery and data mining (pp. 67–73). Newport Beach, California.

Bastide, Y., Taouil, R., Pasquier, N., Stumme, G., & Lakhal, L. (2000). Mining frequent patterns with counting inference. ACM SIGKDD Explorations, 2(2), 66–75.

Bykowski, A., & Rigotti, C. (2001). A condensed representation to find frequent patterns. In The ACM SIGACT-SIGMOD-SIGART symposium on principles of database systems. Santa Barbara, California, USA.

Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866–883.

Denwattana, N., & Getta, J. R. (2001). A parameterised algorithm for mining association rules. In The 12th Australasian database conference (pp. 45–51).

Ezeife, C. I. (2002). Mining incremental association rules with generalized FP-tree. In The 15th conference of the Canadian society for computational studies of intelligence on advances in artificial intelligence (pp. 147–160).

Han, J., & Fu, Y. (1995). Discovery of multiple-level association rules from large databases. In The 21st international conference on very large data bases (pp. 420–431). Zurich, Switzerland.

Han, J., Pei, J., & Yin, Y. (2000). Mining frequent patterns without candidate generation. In The 2000 ACM SIGMOD international conference on management of data (pp. 1–12).

IBM, The Intelligent Information Systems Research (Quest) Group, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html.

Koh, J. L., & Shieh, S. F. (2004). An efficient approach for maintaining association rules based on adjusting FP-tree structures. In The 9th international conference on database systems for advanced applications (pp. 417–424).

Mannila, H., Toivonen, H., & Verkamo, A. I. (1994). Efficient algorithms for discovering association rules. In The AAAI workshop on knowledge discovery in databases (pp. 181–192).


Park, J. S., Chen, M. S., & Yu, P. S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Trans-actions on Knowledge and Data Engineering, 9(5), 812–825.

Pei, J., Han, J., & Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In The 2000 ACM SIGMOD workshop on research issues in data mining and knowledge discovery (DMKD'00). Dallas, TX, USA.

Qiu, Y., Lan, Y. J., & Xie, Q. S. (2004). An improved algorithm of mining from FP-tree. In The third international conference on machine learning and cybernetics (pp. 26–29).

Srikant, R., & Agrawal, R. (1995). Mining generalized association rules. In The 21st international conference on very large data bases (pp. 407–419), Zurich, Switzerland.

Srikant, R., & Agrawal, R. (1996). Mining quantitative association rules in large relational tables. In The 1996 ACM SIGMOD international conference on management of data (pp. 1–12). Montreal, Canada.

Wojciechowski, M., & Zakrzewicz, M. (2002). Dataset filtering techniques in constraint-based frequent pattern mining. In Pattern detection and discovery. London, UK.

Zaiane, O. R., & Mohammed, E. H. (2003). COFI-tree mining: A new approach to pattern growth with reduced candidacy generation. In The IEEE international conference on data mining.

Zaki, M. J., & Hsiao, C. J. (2002). CHARM: An efficient algorithm for closed itemset mining. In The second SIAM international conference on data mining. Arlington.
