
An Improved Data Mining Approach Using Predictive Itemsets

Tzung-Pei Hong1*, Chyan-Yuan Horng2, Chih-Hung Wu3, Shyue-Liang Wang4

1Department of Electrical Engineering, National University of Kaohsiung

Kaohsiung, 811, Taiwan, R.O.C. tphong@nuk.edu.tw

2Institute of Information Engineering, I-Shou University

Kaohsiung, 840, Taiwan, R.O.C. hcy@ms4.url.com.tw

3Department of Information Management, Shu-Te University

Kaohsiung, 824, Taiwan, R.O.C. johnw@mail.stu.edu.tw

4Department of Computer Science, New York Institute of Technology

New York, USA slwang@isu.edu.tw

ABSTRACT

In this paper, we present a mining algorithm to improve the efficiency of finding large itemsets. Based on the concept of prediction proposed in the (n, p) algorithm, our method considers the data dependency in the given transactions to predict promising and non-promising candidate itemsets. For each level, it estimates a different support threshold derived from a data dependency parameter and directly determines whether an item should be included in a promising candidate itemset. In this way, we improve the efficiency of finding large itemsets by reducing both the number of dataset scans and the number of candidate itemsets. Experimental results show that our method is more efficient than the apriori and the (n, p) algorithms when the minimum support value is small.

Keywords: data mining, association rule, predictive itemset, data dependency, predicting minimum support.

* Corresponding author


1. Introduction

Years of effort in data mining have produced a variety of efficient techniques. Depending

on the types of datasets processed, mining approaches may be classified as working on

transaction datasets, temporal datasets, relational datasets, or multimedia datasets, among

others. On the other hand, depending on the classes of knowledge derived, mining approaches

may be classified as finding association rules, classification rules, clustering rules, or

sequential patterns [13], etc. Among these techniques, finding association rules from

transaction datasets is usually an essential task [3][5][6][12][14][16][17][18][19][20].

Many algorithms for mining association rules from transactions have been proposed, most of which are executed in level-wise processes. That is, itemsets containing single items are processed first, and then itemsets with two items are processed. The process is repeated, adding one more item each time, until some criteria are met. The famous apriori mining algorithm was proposed by Agrawal et al. [15][16]. The apriori iterates two phases, the phase of candidate generation and the phase of verification. Possible large itemsets are produced in the first phase and verified in the second phase by scanning the input dataset. Since itemsets are processed level by level and the dataset has to be scanned at each level, the verification phase dominates the performance. In [10], Denwattana and Getta proposed the (n, p) algorithm, which reduces the number of scans of the dataset for finding large itemsets. The (n, p) algorithm also iterates two phases, the phase of

prediction and the phase of verification. Unlike the apriori, the (n, p) algorithm predicts large

itemsets for p levels in the first phase and verifies all these p-level itemsets in the second

phase. A heuristic estimation method is presented to predict the possibly large itemsets. If the prediction is valid, the approach is efficient in finding the actually large itemsets.

In this paper, we propose a mining algorithm to improve the efficiency of finding large

itemsets. Our approach is based on the concept of prediction presented in the (n, p) algorithm

and considers the data dependency among transactions. As the (n, p) algorithm does, our

method iterates the same two phases but uses a new estimation method to predict promising

and non-promising candidate itemsets flexibly. The estimation mechanism computes for each

level a different support threshold derived from a data dependency parameter and determines

whether an item should be included in a promising candidate itemset directly by the support

values of items. Since we reduce the number of candidate itemsets to be verified by the new

estimation mechanism and the number of scans of the input dataset by the concept of

prediction of the (n, p) algorithm, the performance of finding large itemsets can be improved.

The rest of this paper is organized as follows. Section 2 presents the related works of

finding large itemsets. The apriori algorithm and the (n, p) algorithm are reviewed. In Section

3, we describe our motivation and the theoretical foundation of our method. The details of the proposed algorithm are described in Section 4. Experimental results and the performance comparison of the apriori, the (n, p) algorithm, and our method are shown in Section 5. Conclusions are given in Section 6.

2. Related Works

One application of data mining is to induce association rules from transaction data, such

that the presence of certain items in a transaction will imply the presence of certain other

items. Below we briefly review the apriori and the (n, p) algorithms.

2.1. The Apriori Algorithm

Agrawal et al. [12][14][16] proposed the famous apriori mining algorithm, which finds association rules in transaction data based on the concept of large itemsets. The apriori iterates two phases, the phase of candidate generation and the phase of verification, at each level. At the i-th level, i>1, itemsets consisting of i items are processed. In candidate generation, all possible large i-itemsets are produced by combining the unrepeated elements of the large (i-1)-itemsets. In the verification phase, the input dataset is scanned, and if the number of times an i-itemset appears in the transactions is larger than a pre-defined threshold (called the minimum support, or minsup), the itemset is considered large. These two phases are then iterated for the next level until no more large itemsets are found. Take the dataset in Table 1 as an example. It has two features, transaction identification (ID) and transaction description (Items). There are

eight items in the dataset. Assume that minsup=30%. The transaction dataset is first scanned

to count the candidate 1-itemsets. Since the counts of the items a(4), b(5), c(6), d(3), and e(4)

are larger than 6*30%=1.8, they are thus put into the set of large 1-itemsets. Candidate

2-itemsets are then formed from these large 1-itemsets by taking any two items in the large

1-itemsets and counting whether their occurrences are larger than or equal to 1.8. Therefore, ab, ac,

ae, bc, bd, be, cd, ce, and de then form the set of large 2-itemsets. In a similar way, abc, abe,

ace, bce, bcd, bde, and cde form the set of large 3-itemsets, and abce and bcde form the set of

large 4-itemsets.

Table 1. A sample dataset of transactions

ID   Items
1    a b c
2    b c d e f
3    a b c e g
4    a c d
5    b c d e
6    a b c e
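To make the level-wise process concrete, the following Python fragment is an illustrative sketch only (the experiments in this paper were implemented in Java). It runs an apriori-style candidate-generation and verification loop on the Table 1 transactions and reproduces the large itemsets listed above.

```python
# Illustrative apriori-style sketch on the Table 1 transactions.
transactions = [set("abc"), set("bcdef"), set("abceg"), set("acd"),
                set("bcde"), set("abce")]
minsup = 0.30
min_count = minsup * len(transactions)        # 6*30% = 1.8

def support_count(itemset):
    # Verification: count transactions containing every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
large = [frozenset([i]) for i in items if support_count({i}) >= min_count]

level = 1
while large:
    print("L%d:" % level, sorted("".join(sorted(s)) for s in large))
    # Candidate generation: join large itemsets that share all but one item,
    # then verify the candidates with one scan over the dataset.
    candidates = {a | b for a in large for b in large if len(a | b) == level + 1}
    large = [c for c in candidates if support_count(c) >= min_count]
    level += 1
```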

2.2. Denwattana and Getta’s Approach

In [10], the (n, p) algorithm tries to reduce the number of scanning datasets and

improves the efficiency of finding large itemsets by “guessing” what items should be

considered. The approach partitions candidate itemsets into two parts: positive candidate itemsets and negative candidate itemsets. The former contains itemsets guessed to be large and the latter contains itemsets guessed to be small. Initially, the (n, p) algorithm scans

the dataset to find large itemsets of 1-items. Two parameters, called n-element transaction

threshold tt and frequency threshold tf, are used to judge whether an item could compose a

positive candidate itemset. According to the n-element transaction threshold tt, only the

transactions with item numbers (lengths) less than or equal to tt are considered. The

frequency of each item appearing in transactions with j items, j ≤ tt, is computed. If the

appearing frequency of an item is larger than or equal to tf, the item could be used to compose

a positive candidate itemset. Then, two phases, the prediction process and the verification process, are iterated. In the prediction process, the positive candidate 2-itemsets C2⁺, each of which has its two items satisfying the above criteria, are formed. The remaining candidate 2-itemsets not in C2⁺ form C2⁻. C3⁺ and C3⁻ are then formed from only C2⁺ in a similar way. The positive candidate 2-itemsets which are subsets of the itemsets in C3⁺ are then removed from C2⁺. The same process is repeated until p levels of positive and negative candidate itemsets are formed. After that, the verification process checks whether the itemsets in the positive candidate itemsets are actually large and whether the itemsets in the negative candidate itemsets are actually small by scanning the dataset once. The itemsets incorrectly guessed are then expanded and processed by scanning the dataset again.
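The following sketch reflects our reading of the description above (not Denwattana and Getta's code): it selects the items allowed to compose positive candidate itemsets using the transaction threshold tt and the frequency threshold tf. The values tt=5 and tf=80% are assumptions chosen so that the sketch reproduces the {a, b, c, e} selection of the example below.

```python
from collections import Counter, defaultdict

def positive_items(transactions, tt, tf):
    # Group the transactions by length, keeping only lengths <= tt.
    by_length = defaultdict(list)
    for t in transactions:
        if len(t) <= tt:
            by_length[len(t)].append(t)
    positive = set()
    for group in by_length.values():
        counts = Counter(item for t in group for item in t)
        threshold = tf * len(group)          # e.g. 80% of 2 transactions = 1.6
        # An item frequent enough within any length class may compose a
        # positive candidate itemset; the final result is the union.
        positive |= {item for item, c in counts.items() if c >= threshold}
    return positive

transactions = [set("abc"), set("bcdef"), set("abceg"), set("acd"),
                set("bcde"), set("abce")]
print(sorted(positive_items(transactions, tt=5, tf=0.8)))   # ['a', 'b', 'c', 'e']
```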

Consider the dataset shown in Table 1 again. Suppose that minsup=30% and that the n-element transaction threshold tt covers all the transactions. The occurrences of each item in transactions with different lengths are shown in Table 2.

Table 2. The number of occurrences of each item in transactions of different lengths

Item                     Length = 3   Length = 4   Length = 5
a                        2            1            1
b                        1            2            2
c                        2            2            2
d                        1            1            1
e                        0            2            2
f                        0            0            1
g                        0            0            1
Number of transactions   2            2            2

Assume the frequency threshold tf =80%. The occurrence number of each item in

transactions with 3 items must be larger than or equal to 1.6 (2*80%) for this item to be

considered as positive. Thus, items a and c are positive. Similarly, b, c, and e are positive for

transactions with 4 items, and b, c, and e are positive for transactions with 5 items. The items

which can be used to compose positive candidate itemsets are thus the union {a, b, c, e} of the above three sets. We obtain C2⁺ = {bc, be, ab, ce, ac, ae} and C2⁻ = {bd, bf, bg, cd, cf, cg, de, ef, eg, ad, af, ag, df, dg, fg}. Then, C3⁺ and C3⁻ are formed from C2⁺ as C3⁺ = {bce, abc, abe, ace} and C3⁻ = ∅. C4⁺ and C4⁻ are formed from C3⁺ as C4⁺ = {abce} and C4⁻ = ∅. The elements in C2⁺ which are subsets of those in C3⁺ are removed. Thus, we obtain C2⁺ = ∅.



Similarly, the itemsets in C3 which are subsets of those in  4

C are removed. Therefore,

 3

C =.

In the verification process, the dataset is scanned to check all the itemsets in Cj⁺ and Cj⁻, j=2,3,4. After the scan, the set of large itemsets in the positive candidate sets is {abce}. The set of small itemsets in the negative candidate sets is {bf, bg, cf, cg, ef, eg, ad, af, ag, df, dg, fg}. The itemsets de, cd, and bd in C2⁻ are incorrectly predicted. The incorrectly predicted itemsets are further processed. The subsets of the incorrectly predicted itemsets in the positive candidate sets are first generated. These subsets are then pruned using the large itemsets already known. Since C4⁺ is correctly predicted, the result is ∅. The supersets of the incorrectly predicted itemsets in the negative candidate sets are also generated as {bcd, bde, cde, bcde}. The dataset is then scanned again to check these itemsets. The itemsets bcd, bde, cde, and bcde are therefore found to be large in the second scan. All the large itemsets are then collected as:

L2={bd, cd, de, ae, ac, ce, ab, be, bc}

L3={cde, bde, bcd, ace, abe, abc, bce}

L4={bcde, abce}

The same process goes on for finding L5 to L7. C5 is first generated from L4 as {abcde} and checked in the same manner. After all the large itemsets are found, the association rules can be derived from them.
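As a rough illustration of the correction step just described, the sketch below generates candidate supersets of the 2-itemsets that were wrongly predicted small by taking unions of those itemsets; under this simplification it reproduces the supersets {bcd, bde, cde, bcde} checked in the second scan, while the actual (n, p) algorithm applies additional pruning rules.

```python
from itertools import combinations

def candidate_supersets(wrong):
    # Unions of two or more wrongly predicted itemsets form the supersets
    # to be checked in the second dataset scan (a simplification).
    wrong = [frozenset(s) for s in wrong]
    supersets = set()
    for k in range(2, len(wrong) + 1):
        for combo in combinations(wrong, k):
            union = frozenset().union(*combo)
            if union not in wrong:
                supersets.add(union)
    return supersets

wrong = ["bd", "cd", "de"]          # predicted small but actually large
print(sorted("".join(sorted(s)) for s in candidate_supersets(wrong)))
# ['bcd', 'bcde', 'bde', 'cde']
```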

3. Motivation

3.1. Observation

The apriori algorithm is straightforward and simple but needs to repeatedly scan the

dataset while finding the large itemsets. There have been many approaches proposed to

reduce the number of dataset scans. The (n, p) algorithm is one such approach. It

scans the input dataset once in the beginning and twice in each verification process. Since p

levels of transactions are processed in an iteration, the number of scanning datasets is less

than that in the apriori. In the above example, the apriori scans the dataset 10 times; while the

(n, p) algorithm scans 5 times. In [10], it is claimed that the number of scanning the dataset

can be reduced with proper parameters n and p.

Unfortunately, in some cases, the overall performance of the (n, p) algorithm is degraded because of the verification process. In Ci⁺ and Ci⁻, i ≥ 2, all itemsets with lower supports are rejected from Ci⁺ and collected into Ci⁻. When generating the candidate itemsets of level i+1, the elements in both Ci⁺ and Ci⁻ have to be taken into account. If a large number of itemsets with supports lower than the pre-defined minsup are included in these candidate sets, the (n, p) algorithm also has to compute all combinations of such itemsets and then delete them all in the verification process. This affects the overall performance of the (n, p) algorithm. On the other hand, in

using the apriori, only the itemsets with supports higher than minsup are considered. For example, suppose that 99% of the 800 itemsets in C2⁻ and 70% of the itemsets in C2⁺ are not large enough. It then takes about (800*99%)^(2(k-1)) computations in generating Ck, k = 3, ..., 2+p, and most of the elements in Ck are removed in the verification phase. In the apriori, if 2001*20% of the 2-itemsets are actually large (the (n, p) algorithm may make a wrong guess about them), only (2001*20%)^2 computations are needed in generating C3, and far fewer than (2001*20%)^2 itemsets are left after verifying their supports by scanning the dataset. Though the number of dataset scans in the (n, p) algorithm is smaller than that in the apriori, the overall performance of the (n, p) algorithm may not be good enough. Experimental results are available in Section 5.

Therefore, the performance of the (n, p) algorithm can be further improved. One direction of improvement is to employ better data structures, so that the generation of Ci+1⁺ and Ci+1⁻ can be done in a more efficient manner, such as in [1][2][7][11][21]. The other direction is to filter out the itemsets with lower supports and reduce the number of elements to be included in Ci⁺ and Ci⁻. This paper adopts the latter approach from a probabilistic point of view.

3.2. Theoretical Foundation

Usually, from a probabilistic viewpoint, items must have greater support values to be covered in large itemsets with more items. For example, if minsup is 30%, then an item with its support just greater than 30% is a large 1-itemset. Both items in a large 2-itemset must, however, usually have supports higher than 30%, and how much higher depends on the dependency among the items in the transactions. At one extreme, if total dependency relations exist in the transactions, an appearance of one item will certainly imply the appearance of another. In this case, the support thresholds for an item to appear in large itemsets with different numbers of items are the same. At the other extreme, if the items are totally independent, then the support thresholds for an item to appear in large itemsets with different numbers of items should be set at different values. In this case, the support threshold for an item to appear in a large itemset with r items can be easily derived as below.

Since all the r items in a large r-itemset must be large 1-itemsets, all the supports of the r items must be larger than or equal to the predefined minsup α. Since the items are assumed totally independent, the support of the r-itemset is s1*s2*...*sr, where si is the actual support of the i-th item in the itemset. If this r-itemset is large, its support must be larger than or equal to α. Thus:

s1*s2*...*sr ≥ α.

If the predictive support threshold for an item to appear in a large r-itemset is α_r, then the condition becomes:

α_r*α_r*...*α_r = (α_r)^r ≥ α.

Thus:

α_r ≥ α^(1/r).

Therefore, if the items are totally independent, the support threshold of an item should be expected to be α^(1/r) for being included in a large r-itemset. Since the transactions are seldom totally dependent or totally independent, a data dependency parameter w, ranging between 0 and 1, is then used to calculate the predictive support threshold of an item for appearing in a large r-itemset as wα + (1-w)α^(1/r). A larger w value represents a stronger item relationship existing in the transactions. w=1 means total dependency among the transaction items and w=0 means total independence. The proposed approach thus uses different predictive support thresholds for an item to be included in promising itemsets with different numbers of items.
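A quick numeric check of this weighted threshold (a small sketch, using the values that appear in the example of Section 4.2): with α=0.30 and w=0.5, the predicting minsup grows with the itemset size r.

```python
def predicting_minsup(alpha, w, r):
    # Weighted average of the totally-dependent threshold (alpha) and the
    # totally-independent threshold (alpha**(1/r)).
    return w * alpha + (1 - w) * alpha ** (1 / r)

alpha, w = 0.30, 0.5
for r in (2, 3, 4):
    print(r, round(predicting_minsup(alpha, w, r), 3))
# prints: 2 0.424 / 3 0.485 / 4 0.52
```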

4. Our Method

4.1. Our Algorithm

The proposed mining algorithm aims at efficiently finding any p levels of large itemsets

by scanning a dataset twice except for the first level. The support of each item from the first

dataset scan is directly judged to predict whether an item will appear in an itemset. The

proposed method uses a higher predicting minsup for each item to be included in a promising

itemset of more items. Itemsets with different numbers of items then have different predicting

minsups for an item. A predicting minsup is calculated as a weighted average of the possible

minsups for totally dependent data and for totally independent data. A data dependency

parameter, ranging between 0 and 1, is used as the weight.

A mining process similar to that proposed in [10] can then be adopted to find the p

levels of large itemsets. The dataset is first scanned to get the support of each item. If the support of an item is larger than or equal to the minsup, the item is put into the set of large 1-itemsets L1. After the large 1-itemsets have been found, any further p levels of large itemsets can

be obtained by scanning the dataset twice. Candidate 2-itemsets (C2) are formed by combining the items in the large 1-itemsets. In the meantime, the predicting minsup for an item to be included in large 2-itemsets is estimated according to the given data dependency parameter. If the support of an item in L1 is smaller than the predicting minsup, any candidate 2-itemset including this item is unlikely to be large. On the contrary, any candidate 2-itemset with the supports of all its items larger than or equal to the predicting minsup has a high possibility of being large. The candidate 2-itemsets can then be partitioned into two parts, C2⁺ and C2⁻, according to whether the supports of all the items in an itemset are larger than or equal to the predicting minsup.

After promising candidate 2-itemsets (C2) are generated, candidate 3-itemsets (C3) are

formed by the combining them. Similar to the process for finding candidate 2-itemsets, a new

predicting minsup for each item to be included in promising 3-itemsets is calculated.

Candidate 3-itemsets can then be partitioned into two parts: C3 and  3

C by comparing the

supports of the items included in C3 with the predicting minsup. The same procedure is

repeated until the p levels of itemsets are processed. Therefore, no dataset scan except for the

first level has been done until now. A dataset is then scanned to get the actually large itemsets

in the promising candidate itemsets and in the non-promising candidate itemsets. The

(14)

once more. Therefore, a total of two dataset scans are needed in getting the p level large

itemsets. After that, another processing phase of p levels is done. This processing is repeated

until no large itemsets are found in a phase. Let ci be the number of occurrence of each item

ai appearing in the input dataset of transactions and the support (i) of each item ai be ci /n.


INPUT: A set of n transactions with m items, the minsup α, a dependency parameter w, and a level number p.

OUTPUT: Large itemsets.

STEP 1: Check whether the support σi of each item ai is larger than or equal to α. If σi ≥ α, put ai in the set of large 1-itemsets L1.

STEP 2: Set r=1, where r is the number of items in the itemsets currently being processed.

STEP 3: Set r'=1, where r' is the number of items at the end of the last iteration.

STEP 4: Set Pr = Lr, where Pr is the set of items predicted to be included in large r-itemsets.

STEP 5: Generate the candidate set Cr+1 from Lr in a way similar to that in the apriori.

STEP 6: Set r=r+1.

STEP 7: Check whether the support σi of each item ai in Pr-1 is larger than or equal to the predicting minsup α' = wα + (1-w)α^(1/r). If σi ≥ α', put ai in Pr.

STEP 8: Form the promising candidate itemsets Cr⁺ by choosing from Cr the itemsets with each item existing in Pr.

STEP 9: Set the non-promising candidate itemsets Cr⁻ = Cr − Cr⁺.

STEP 10: Set r=r+1.

STEP 11: Generate the candidate set Cr from Cr-1⁺ in the way the apriori does.

STEP 12: Check whether the support σi of each item ai in Pr-1 is larger than or equal to the predicting minsup α' = wα + (1-w)α^(1/r). If σi ≥ α', put ai in Pr. Then form the promising candidate itemsets Cr⁺ from Cr as in STEP 8.

STEP 13: Set the non-promising candidate itemsets Cr⁻ = Cr − Cr⁺.

STEP 14: Remove the itemsets in Cr-1⁺ which are subsets of any itemset in Cr⁺.

STEP 15: Repeat STEP 10 to STEP 14 until r = r' + p.

STEP 16: Scan the dataset to check whether the promising candidate itemsets Cr'+1⁺ to Cr⁺ are actually large and whether the non-promising candidate itemsets Cr'+1⁻ to Cr⁻ are actually small. Put the actually large itemsets in the corresponding sets Lr'+1 to Lr.

STEP 17: Find all the proper subsets with r'+1 to i items for each itemset which is not large in Ci⁺, r'+1 ≤ i ≤ r; keep the proper subsets which are not among the existing large itemsets; denote them as NC+.

STEP 18: Find all the proper supersets with i to r items for each itemset which is large in Ci⁻, r'+1 ≤ i ≤ r; the supersets must also have all their sub-itemsets of r' items existing in Lr' and cannot include any sub-itemset among the non-large itemsets in Ci⁺ and Ci⁻ checked in STEP 16; denote them as NC-.

STEP 19: Scan the dataset to check whether the itemsets in NC+ and NC- are large; add the large itemsets to the corresponding sets Lr'+1 to Lr.

STEP 20: If Lr is not null, set r' = r' + p and go to STEP 4 for another iteration; otherwise do STEP 21.

STEP 21: Add the non-redundant subsets of the large itemsets to the corresponding sets L2 to Lr.
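The sketch below illustrates one prediction level of the procedure (STEPs 5 to 9) under our reading of the algorithm; it is not the authors' Java implementation. Given the item supports of Table 1, it splits the candidate 2-itemsets into the promising set C2⁺ and the non-promising set C2⁻ using the predicting minsup.

```python
def split_candidates(prev_itemsets, item_support, alpha, w, r):
    # Apriori-style join: unions of two (r-1)-itemsets that have r items.
    candidates = {a | b for a in prev_itemsets for b in prev_itemsets
                  if len(a | b) == r}
    # Pr: items whose support reaches the predicting minsup for level r.
    alpha_r = w * alpha + (1 - w) * alpha ** (1 / r)
    p_r = {i for i, s in item_support.items() if s >= alpha_r}
    promising = {c for c in candidates if c <= p_r}          # Cr+
    return promising, candidates - promising                 # Cr-

# Item supports from the Table 1 dataset (6 transactions).
item_support = {"a": 4/6, "b": 5/6, "c": 6/6, "d": 3/6, "e": 4/6}
L1 = [frozenset(i) for i in item_support]
c2_pos, c2_neg = split_candidates(L1, item_support, alpha=0.30, w=0.5, r=2)
print(len(c2_pos), len(c2_neg))     # 10 0, matching STEPs 8-9 of the example
```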


4.2. An Example

In this section, the dataset in Table 1 is used to describe our method. Assume that minsup =

30% and w=0.5 in this example. The following processes are performed.

STEP 1: The support of each item is compared with the minsup α. Since the supports of

{a}, {b}, {c}, {d}, and {e} are larger than or equal to 0.30, they are put in L1.

STEP 2: r=1.

STEP 3: r’= 1.

STEP 4: P1 is the same as L1, which is {a, b, c, d, e}.

STEP 5: The candidate set C2 is formed from L1 as C2 = {ae, be, ce, de, ad, bd, cd, ac, bc,

ab}.

STEP 6: r = r + 1 = 2.

STEP 7: The predicting minsup value for each item is α' = 0.5*0.30 + (1-0.5)*0.30^(1/2) = 0.424.

The support of each item in P1 is then compared with 0.424. Since the supports

of {a}, {b}, {c}, {d}, and {e} are larger than 0.424, P2 is {a, b, c, d, e}.

STEP 8: The itemsets in C2 with each item existing in P2 are chosen to form the promising candidate itemsets C2⁺. Thus C2⁺ = C2.

STEP 9: The non-promising candidate itemsets C2⁻ = C2 − C2⁺ = ∅.

STEP 10: r = r + 1 = 3.

STEP 11: The candidate set C3 is formed from C2⁺ as C3 = {abe, ace, ade, bce, bde, cde, abd, acd, bcd, abc}.

STEP 12: The predicting minsup α' = 0.5*0.30 + (1-0.5)*0.30^(1/3) = 0.485. The support of each item in P2 is then compared with 0.485. Since the supports of {a}, {b}, {c}, {d}, and {e} are larger than 0.485, P3 is {a, b, c, d, e}. Thus, we have C3⁺ = C3.

STEP 13: The non-promising candidate itemsets C3⁻ = C3 − C3⁺ = ∅.

STEP 14: Since all the itemsets in C2⁺ are subsets of itemsets in C3⁺, they are removed from C2⁺. Thus C2⁺ = ∅.

STEP 15: Since r (=3) < r' + p (=4), STEP 10 to STEP 14 are repeated. r = 3+1 = 4. The candidate set C4 is then formed from C3⁺ as C4 = {abce, abde, acde, bcde, abcd}. The predicting minsup value α' = 0.5*0.30 + (1-0.5)*0.30^(1/4) = 0.520. The support of each item in P3 is then compared with 0.520. Since the supports of {a}, {b}, {c}, and {e} are larger than 0.520, P4 is {a, b, c, e}. C4⁺ is thus formed as C4⁺ = {abce}. The non-promising candidate itemsets C4⁻ are found as C4⁻ = C4 − C4⁺ = {abde, acde, bcde, abcd}. The itemsets in C3⁺ which are subsets of itemsets in C4⁺ are removed from C3⁺. Thus, we obtain C3⁺ = {ade, bde, cde, abd, acd, bcd}.

STEP 16: The dataset is scanned to check whether the promising candidate itemsets C2⁺ to C4⁺ are actually large and whether the non-promising candidate itemsets C2⁻ to C4⁻ are actually small. The itemsets ad in C2, ade, abd, and acd in C3⁺, and bcde in C4⁻ are incorrectly predicted. By deleting ad, ade, abd, and acd from C2⁺ and C3⁺, the rest of the elements in C2⁺, C3⁺, and C4⁺ are put into L2, L3, and L4, respectively. Also, bcde is put into L4.

STEP 17: The proper subsets of the itemsets incorrectly predicted in Ci⁺ are generated. Since {ad, ade, abd, acd} are incorrectly predicted in this example, their proper subsets not among the existing large itemsets are {ad}. Thus NC+ = {ad}.

STEP 18: The proper supersets of the itemsets incorrectly predicted in Ci⁻ are generated. Since only {bcde} is incorrectly predicted in this example, it has no proper supersets within the levels considered that are not among the existing large itemsets. Thus NC- = ∅.

STEP 19: The dataset is scanned to find the large itemsets from NC+ and NC-. Since {ad}

is not large by verifying the dataset, the large itemsets L2 to L4 are then found as

L2={ab, bc, ac, cd, bd, de, ce, be, ae}, L3={abc, bcd, cde, bde, bce, ace, abe},

and L4={bcde, abce}.

STEP 20: Since L4 is not null, the next iteration is executed. STEP 4 to STEP 19 are then

repeated for L5 to L7. Then, we have C5 = {abcde} and L5 = L6 = L7 = ∅.

STEP 21: The non-redundant subsets of the large itemsets are added to the corresponding sets L2 to L4. The final large itemsets L2 to L4 are then found as follows:

L2={ab, bc, ac, cd, bd, de, ce, be, ae},

L3={abc, bcd, cde, bde, bce, ace, abe}

L4={bcde, abce}

5. Experiments and Results

In order to demonstrate the performance of our method, several experiments are

performed. All the three methods, the apriori, the (n, p) algorithm, and our method, are

implemented in Java using the same data structure and representation. The experiments are

performed on a personal computer with a Pentium-IV 2.0 GHz CPU and Windows™. We

collect different types of datasets using the data generator provided by [4]. The following

parameters indicate the contents of a dataset:

- nt: the number of transactions in the dataset
- lt: the average transaction length in the dataset
- ni: the number of items in the dataset
- np: the number of patterns found in the dataset
- lp: the average length of a pattern
- cp: the correlation between consecutive patterns
- f: the average confidence in a rule
- vc: the variation in the confidence

These parameters are used in generating a dataset D.

5.1. Experiment I

First of all, we test if our algorithm performs better than the apriori does. The dataset D1

is generated with the parameters (D1, nt, 10, 25, 1000, 4, 0.15, 0.25, 0.1). We run the two programs on datasets of different sizes, nt = 10K, 20K, 50K, and 100K, and compare the performance by the speedup, which is computed as (tp − to)/to, where tp is the running time of the apriori and to is the running time of ours. In this experiment, the minsup is set at 15% and

different combinations of the data dependency parameter w and transaction threshold t are

tested. The experimental result is shown in Figure 2. It seems that our method is scalable and

provides a better computational performance than the apriori with a lower w.
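As a hypothetical numeric illustration of the speedup measure defined above: if the apriori needed 12 seconds on a dataset and our method needed 4 seconds, the speedup would be (12 − 4)/4 = 2.0.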

Figure 2. Speedup in experimental result I with minsup=15% (curves for (w,t) = (0.25,3), (0.25,5), (0.5,3), and (0.5,5))


5.2. Experiment II

Next, the dataset D1 is used to test the performance of our method with different values

of minsup. We set nt=10000. The speedup is illustrated in Figure 3. Like the (n, p) algorithm,

our algorithm has a better performance when minsup is low. This is because the apriori exactly filters out itemsets with supports lower than minsup at each stage and reduces the

number of elements to be considered in generating candidate itemsets. In the (n, p) algorithm

and ours, we have to guess what itemsets should be considered in the prediction phase and

then to verify the predicted ones. The generation of proper subsets and supersets after the

verification phase takes a considerable time.


Figure 3. Speedup in experimental result II with different minsup

5.3. Experiment III

According to experiments I and II, we find that the performance of our method is affected by the characteristics of the datasets. We therefore generated five different datasets, D2, D3, D4, D5, and D6, by changing cp and f and keeping the rest of the parameters the same as those in D1. The parameters (cp, f) in D2, D3, D4, D5, and D6 are (0.05, 0.75), (0.05, 0.25), (0.15, 0.75), (0.25, 0.75), and (0.25, 0.25), respectively. Figure 4 presents the results with 10000 transactions in each dataset.


Figure 4. Experimental result on different datasets

The data dependency parameter w seems to have an effect on the performance of the

proposed algorithm, but not on the final large itemsets. A larger w value represents a stronger

item relationship existing in transaction datasets. If the relationships of data items in

transactions have been known to be very strong, w may be set at a value close to 1. If the

relationships of data items in transactions have been known to be independent, w may be set

at a value close to 0. If the relationships of data items in transactions are unknown, w may be set at 0.5 by default. Consider the two extreme cases for the example in Table 1. Assume w=0 and minsup=30%; we then obtain L1 = {a, b, c, d, e},

C2⁺ = {ae, be, ce, ac, bc, ab}, C3⁺ = C4⁺ = ∅,

C2⁻ = {cd, bd, ad, de}, C3⁻ = {bce, ace, abe, abc}, C4⁻ = ∅.

The itemsets that are incorrectly predicted are cd, bd, de, bce, ace, abe, and abc. L3 and L4 are generated from the supersets of NC- = {cd, bd, de} and NC- = {bce, ace, abe, abc}, respectively. If w = 1, we obtain L1 = {a, b, c, d, e},

C2⁺ = C3⁺ = ∅, C4⁺ = {abce, abde, acde, bcde, abcd},

C2⁻ = C3⁻ = C4⁻ = ∅.

The itemsets that are incorrectly predicted are abde, acde, and abcd. L2 and L3 are generated from the subsets of NC+ = {abde, acde, abcd}. In both cases, many computations are spent in combining the items from NC+ (NC-). Generally, w = 0.5 has a better performance than w = 0 or w = 1.

5.4. Experiment IV

In this experiment, we examine how the number of levels considered in one iteration affects the performance. The dataset D2 is used for the test. It seems that the best performance appears when 3 levels are considered in an iteration. Also, note that the data in D2 are of low dependency (cp=0.05). This is because the generation of the candidate sets at level i+1 from those at level i is of O(m^2) computational complexity, where m is the number of elements at level i. The more levels of transactions are considered in an iteration, the larger m may become.


Figure 5. Experimental result on different levels of transactions

6. Discussions and Conclusions

From the experimental results, different values of data dependency will cause the same

large itemsets, but different predictive effects. When w=1, the non-promising candidate sets

are predicted very well, but the promising candidate sets are predicted badly; and vice versa

for w=0. By default, we set w=0.5. If the data dependency relationships in transactions can be estimated in advance, a more suitable w can be chosen. For both the (n, p) algorithm and ours, much of the computation lies in generating the candidate sets of level i+1 from those of level i. When there are many items in the dataset, e.g., 25 items in

D1~D6, and more levels of transactions are considered, more computation is needed in both

algorithms. However, our method provides a more accurate approach for predicting itemsets

and obtains a better performance than the (n, p) algorithm, especially when p>2.

In this paper, we have presented a mining algorithm that combines the advantages of the

apriori and the (n, p) algorithm in finding large itemsets. As the (n, p) algorithm does, our

algorithm reduces the number of dataset scans for finding p levels of large itemsets. A new parameter that considers data dependency is included in our method to filter out early the itemsets that possibly have lower supports, and thus improves the computational

efficiency.

We also conclude that the three algorithms are competitive with one another, each gaining the best performance on different types of datasets. More studies are needed on how to tune the parameters, such as n, p, and the transaction threshold in the (n, p) algorithm and w and t in ours,

before the mining task is performed.

Acknowledgement

The authors would like to thank the anonymous reviewers for their comments. The authors would also like to thank Mr. Tsung-Te Wu for his help in conducting the experiments.

References

[1] Y. Bastide, R. Taouil, N. Pasquier, G. Stumme and L. Lakhal, "Mining frequent patterns with counting inference," ACM SIGKDD Explorations, Vol. 2, No. 2, pp. 66-75, 2000.

[2] A. Bykowski and C. Rigotti, "A condensed representation to find frequent patterns," The ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Santa Barbara, California, USA, 2001.

[3] H. Mannila, H. Toivonen and A. I. Verkamo, "Efficient algorithms for discovering association rules," The AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, 1994.

[4] IBM, The Intelligent Information Systems Research (Quest) Group, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html

[5] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," The Twenty-first International Conference on Very Large Data Bases, pp. 420-431, Zurich, Switzerland, 1995.

[6] Vol. 9, No. 5, pp. 812-825, 1997.

[7] M. Kryszkiewicz and M. Gajek, "Why to apply generalized disjunction-free generators representation of frequent patterns?" The 13th International Symposium on Methodologies for Intelligent Systems, Lyon, France, pp. 382-392, 2002.

[8] L. Shen, H. Shen and L. Cheng, "New algorithms for efficient mining of association rules," The Seventh Symposium on the Frontiers of Massively Parallel Computation, pp. 234-241, 1999.

[9] M. S. Chen, J. Han and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.

[10] N. Denwattana and J. R. Getta, "A parameterised algorithm for mining association rules," The Twelfth Australasian Database Conference, pp. 45-51, 2001.

[11] J. Pei, J. Han and R. Mao, "CLOSET: an efficient algorithm for mining frequent closed itemsets," The 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, USA, 2000.

[12] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," The Twentieth International Conference on Very Large Data Bases, pp. 487-499, 1994.

[13] R. Agrawal and R. Srikant, "Mining sequential patterns," The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.

[14] The Third International Conference on Knowledge Discovery and Data Mining, pp. 67-73, Newport Beach, California, 1997.

[15] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.

[16] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993.

[17] R. Srikant and R. Agrawal, "Mining generalized association rules," The Twenty-first International Conference on Very Large Data Bases, pp. 407-419, Zurich, Switzerland, 1995.

[18] R. Srikant and R. Agrawal, "Mining quantitative association rules in large relational tables," The 1996 ACM SIGMOD International Conference on Management of Data, pp. 1-12, Montreal, Canada, 1996.

[19] T. Fukuda, Y. Morimoto, S. Morishita and T. Tokuyama, "Mining optimized association rules for numeric attributes," The ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 182-191, 1996.

[20] M. Wojciechowski and M. Zakrzewicz, "Dataset filtering techniques in constraint-based frequent pattern mining."

[21] M. J. Zaki and C. J. Hsiao, "CHARM: an efficient algorithm for closed itemset mining," The 2002 SIAM International Conference on Data Mining, 2002.
