
SETM*-Lmax: An Efficient Set-Based Approach to Find Maximal Large Itemsets


Ye-In Chang and Yu-Ming Hsieh
Dept. of Computer Science and Engineering, National Sun Yat-Sen University, Kaohsiung, Taiwan, Republic of China
E-mail: changyi@cse.nsysu.edu.tw; Tel: 886-7-5252000 (ext. 4334); Fax: 886-7-5254301

Abstract

Discovery of association rules is an important problem in the area of data mining. An association rule means that the presence of some items in a transaction will imply the presence of other items in the same transaction. For this problem, how to efficiently count large itemsets is the major work, where a large itemset is a set of items appearing in a sufficient number of transactions. In this paper, we propose an efficient SETM*-Lmax algorithm to find maximal large itemsets, based on a high-level set-based approach. The advantage of a set-based approach, such as the SETM algorithm, is that it is simple and stable over the range of parameter values. In the SETM*-Lmax algorithm, we use a forward approach to find all maximal large itemsets from the large itemsets Lk, keeping each w-itemset that is not included in the w-subsets of any j-itemset, where 1 ≤ k ≤ MaxK, 1 ≤ w < j ≤ MaxK, LMaxK ≠ ∅ and LMaxK+1 = ∅. We conduct several experiments using different synthetic relational databases. The simulation results show that the proposed forward approach (SETM*-Lmax) to find all maximal large itemsets requires shorter time than the backward approach proposed by Agrawal.

Keywords: association rules, data mining, knowledge discovery, relational databases, transactions.

1 Introduction

One of the important data mining tasks, mining association rules in transactional or relational databases, has recently attracted a lot of attention in database communities [1, 2, 5, 6, 8, 11, 12, 13, 16, 17, 19, 21]. The task is to discover the important associations among items, such that the presence of some items in a transaction implies the presence of other items in the same transaction [7]. For example, one may find, from a large set of transaction data, an association rule such as: if a customer buys (one brand of) milk, he/she usually buys (another brand of) bread in the same transaction. Since mining association rules may require repeatedly scanning a large transaction database to find different association patterns, the amount of processing can be huge, and performance improvement is an essential concern [7].

Previous approaches to mining association rules can be classified into two categories: low-level and high-level approaches, where a low-level approach retrieves one tuple from the relational database at a time, and a high-level approach is a set-based approach. For example, the Apriori/AprioriTID [2], DHP [17] and Boolean [23] algorithms are based on the low-level approach, while the SETM algorithm [14] is based on the high-level approach. A set-based (i.e., high-level) approach allows a clear expression of what needs to be done, as opposed to specifying exactly how the operations are carried out, as in a low-level approach. The declarative nature of this approach allows consideration of a variety of ways to optimize the required operations. This means that the ample experience gained in optimizing relational queries can be applied directly. Eventually, it should be possible to integrate rule discovery completely with the database system; this would facilitate the use of the large amounts of data currently stored in relational databases, and the relational query optimizer could then determine the most efficient way to obtain the desired results. Finally, the set-based approach has a small number of well-defined, simple concepts and operations, which allows easy extensibility to handle additional kinds of mining, e.g., relating association rules to customer classes.

In [14], based on a high-level approach, Houtsma and Swami proposed the SETM algorithm, which uses SQL to generate the frequent itemsets. Algorithm SETM is simple and stable over the range of parameter values. Moreover, it is easily parallelized. But the disadvantage of the SETM algorithm is similar to that of the AIS algorithm [1]: it generates too many invalid candidate itemsets. In this paper, we design an efficient algorithm for mining association rules based on a high-level set-oriented approach. We propose the SETM*-Lmax algorithm to find maximal large itemsets. We conduct several experiments using different synthetic relational databases. The simulation results show that the proposed forward approach (SETM*-Lmax) to find all maximal large itemsets requires shorter time than the backward approach proposed by Agrawal.

The rest of the paper is organized as follows. In Section 2, we describe the background. In Section 3, we give a brief survey. In Section 4, we present the proposed SETM*-Lmax algorithm. In Section 5, we study the performance of the proposed algorithm. Finally, Section 6 gives the conclusion.

2 Background

Let I = {i1, i2, ..., im} be a set of m distinct items [15]. A transaction T is defined as any subset of items in I. A database D is a set of transactions. A set of items is called an itemset. The number of items in an itemset is called the length of the itemset; itemsets of some length k are referred to as k-itemsets.

A transaction T is said to support an itemset X ⊆ I if it contains all items of X, i.e., X ⊆ T. The fraction of the transactions in D that support X is called the support of X, denoted support(X). An itemset is large if its support is above some user-defined minimum support threshold [2].

An association rule has the form R: X ⇒ Y, where X and Y are two non-empty and non-intersecting itemsets [15]. The support for rule R is defined as support(X ∪ Y); the rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y. A confidence factor, defined as support(X ∪ Y) / support(X), is used to evaluate the strength of such association rules: the rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. Consequently, the confidence of a rule indicates how often it can be expected to apply, while its support indicates how trustworthy the rule is. The problem of association rule mining is to discover all rules that have support and confidence greater than some user-defined minimum support and minimum confidence thresholds, respectively.

In [1, 2, 17, 18, 20], the problem of mining association rules is decomposed into the following two steps:

1. Discover the large itemsets, i.e., the itemsets that have transaction support above a predetermined minimum support s.

2. Use the large itemsets to generate the association rules for the database.

The general idea is that if, say, ABCD and AB are large itemsets, then we can determine whether the rule AB ⇒ CD holds by computing the ratio support(ABCD) / support(AB); the rule holds only if this ratio is at least the minimum confidence. Note that the rule is guaranteed to have minimum support, because ABCD is a large itemset.
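As a concrete illustration of these definitions (our sketch, not part of the original paper), the following Python fragment computes support and confidence over the small four-transaction database used later as Example 1 (Figure 2); the function name support is ours:

  def support(itemset, db):
      # Fraction of transactions that contain every item of the itemset.
      return sum(itemset <= t for t in db) / len(db)

  # The transaction database of Figure 2 (Example 1).
  db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]

  X, Y = {'B'}, {'E'}
  sup = support(X | Y, db)                     # support(X ∪ Y)
  conf = support(X | Y, db) / support(X, db)   # confidence of X ⇒ Y
  print(sup, conf)  # 0.75 1.0: {B, E} occurs in 3 of 4 transactions,
                    # and every transaction containing B also contains E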

It is noted that the overall performance of mining association rules is determined by the first step. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner. Efficient counting of large itemsets is thus the focus of most prior work.

3 The Apriori Algorithm

Agrawal and Srikant proposed an algorithm, called Apriori [2], for generating large itemsets. Apriori constructs a candidate set of large itemsets, counts the number of occurrences of each candidate itemset, and then determines the large itemsets based on a predetermined minimum support. In the Apriori algorithm, the candidate k-itemsets are generated by joining the large (k-1)-itemsets with themselves. Then, the database is scanned to compute the count of the candidate k-itemsets. The large k-itemsets consist of only the candidate k-itemsets with sufficient support. This process is repeated until no new candidate itemset is generated. It is noted that in the Apriori algorithm, each iteration requires a pass of scanning the database, which incurs a severe performance penalty [23]. Figure 1 shows the Apriori algorithm, and Table 1 summarizes the variables used in the algorithm [2].

k-itemset: An itemset with k items.
Lk: Set of large k-itemsets (those with minimum support). Each member of this set has two fields: (i) itemset and (ii) support count.
Ck: Set of candidate k-itemsets (potentially large itemsets). Each member of this set has two fields: (i) itemset and (ii) support count.

Table 1: Variables used in the Apriori algorithm

procedure Apriori;
begin
  L1 := large 1-itemsets;
  k := 1;
  repeat
    k := k + 1;
    Ck := apriori-gen(Lk-1);  (* New candidates *)
    forall transactions t ∈ D do
    begin
      Ct := subset(Ck, t);  (* Candidates contained in t *)
      forall candidates c ∈ Ct do
        c.count := c.count + 1;
    end;
    Lk := {c ∈ Ck | c.count ≥ minimum support};
  until Lk = ∅;
  Answer := ∪k Lk;
end;

Figure 1: The Apriori algorithm

Consider the example transaction database given in Figure 2; Figure 3 shows the process. In the first iteration, Apriori simply scans all the transactions to count the number of occurrences of each item and generates the candidate 1-itemsets, C1. Assuming that the minimum transaction support required is two (i.e., s = 50%), the set of large 1-itemsets, L1, composed of the candidate 1-itemsets with the minimum support required, can then be determined.

Database D
TID   Items
100   A C D
200   B C E
300   A B C E
400   B E

Figure 2: A transaction database (Example 1)

To discover the set of large 2-itemsets, in view of the fact that any subset of a large itemset must also have minimum support, Apriori uses L1 * L1 to generate a candidate set of itemsets C2, where * is an operation for concatenation in this case. Next, the four transactions in D are scanned and the support of each candidate itemset in C2 is counted. The set of large 2-itemsets, L2, is then determined based on the support of each candidate 2-itemset in C2.

The set of candidate itemsets, C3, is generated from L2 as follows [7]. From L2, two large 2-itemsets with the same first item, such as {BC} and {BE}, are identified first. Then, Apriori tests whether the 2-itemset {CE}, which consists of their second items, constitutes a large 2-itemset or not. Since {CE} is a large itemset by itself, we know that all the subsets of {BCE} are large, and {BCE} becomes a candidate 3-itemset. There is no other candidate 3-itemset from L2. Apriori then scans all the transactions and discovers the large 3-itemsets L3. Since there is no candidate 4-itemset to be constituted from L3, Apriori ends the process of discovering large itemsets.

C1 (after scan of D): {A} 2, {B} 3, {C} 3, {D} 1, {E} 3
L1: {A} 2, {B} 3, {C} 3, {E} 3
C2: {AB}, {AC}, {AE}, {BC}, {BE}, {CE}
C2 (after scan of D): {AB} 1, {AC} 2, {AE} 1, {BC} 2, {BE} 3, {CE} 2
L2: {AC} 2, {BC} 2, {BE} 3, {CE} 2
C3: {BCE}
C3 (after scan of D): {BCE} 2
L3: {BCE} 2

Figure 3: Generation of candidate itemsets and large itemsets

In the Apriori algorithm as shown in Figure 1, the apriori-gen function takes an argument Lk-1 and returns a superset of the set of all large k-itemsets. Before exiting the apriori-gen function, a prune step is executed, which deletes all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1.
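To make the join and prune steps concrete, here is a minimal Python sketch of an apriori-gen-style candidate generation (our illustration, representing itemsets as sorted tuples; not the paper's code):

  from itertools import combinations

  def apriori_gen(L_prev, k):
      # Join: merge two large (k-1)-itemsets sharing their first k-2 items.
      # Prune: drop any candidate with a (k-1)-subset that is not large.
      L_prev = set(L_prev)
      candidates = set()
      for p in L_prev:
          for q in L_prev:
              if p[:-1] == q[:-1] and p[-1] < q[-1]:
                  c = p + (q[-1],)
                  if all(s in L_prev for s in combinations(c, k - 1)):
                      candidates.add(c)
      return candidates

  L2 = {('A', 'C'), ('B', 'C'), ('B', 'E'), ('C', 'E')}
  print(apriori_gen(L2, 3))  # {('B', 'C', 'E')}, as in Figure 3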

4 The SETM*-Lmax Algorithm

In this section, we present the SETM*-Lmax algorithm to find all maximal large itemsets (denoted as Lmax) from the large itemsets Lk; a w-itemset is kept only if it is not included in the w-subsets of any j-itemset, where 1 ≤ k ≤ MaxK, 1 ≤ w < j ≤ MaxK, LMaxK ≠ ∅ and LMaxK+1 = ∅.

4.1 An Example

For the sample inputs shown in Figures 2 and 4, Figures 5 and 6 show the resulting large itemsets (Lk), respectively. For the large itemsets shown in Figures 5 and 6, Figures 7 and 8 show the corresponding maximal large itemsets (Lmax), respectively. For example, in Figure 5, since BC ∈ L2 is contained in BCE ∈ L3, BC is removed from the result.

TID   Items
1     A B C D E I
2     A B C E F
3     A C D F
4     A B C D E
5     B C D H
6     D E F
7     A C D G
8     A B C D E H
9     B C E G I
10    E F G H
11    A G H
12    B F H

Figure 4: A transaction database (Example 2)

L1: {A} 2, {B} 3, {C} 3, {E} 3
L2: {AC} 2, {BC} 2, {BE} 3, {CE} 2
L3: {BCE} 2

Figure 5: Example 1: large itemsets (s = 50%)

L1: A 7, B 7, C 8, D 7, E 7, F 5, G 4, H 5
L2: AB 4, AC 6, AD 5, AE 4, BC 6, BD 4, BE 5, BH 3, CD 6, CE 5, DE 4, EF 3
L3: ABC 4, ABD 3, ABE 4, ACD 5, ACE 4, ADE 3, BCD 4, BCE 5, BDE 3, CDE 3
L4: ABCD 3, ABCE 4, ABDE 3, ACDE 3, BCDE 3
L5: ABCDE 3

Figure 6: Example 2: large itemsets (s = 25%)

L2: {AC} 2
L3: {BCE} 2

Figure 7: Maximal large itemsets for Example 1

L2: BH 3, EF 3
L5: ABCDE 3

Figure 8: Maximal large itemsets for Example 2

4.2 The Algorithm

Table 2 shows the variables used in the SETM*-Lmax algorithm. The complete algorithms are shown in Figures 9, 10, 11, 12 and 13.

R'k: A database of candidate k-itemsets (i.e., a candidate DB)
Lk: Large k-itemsets
Rk: A database of large k-itemsets (i.e., a filtered DB)
MaxK: The maximal length of the itemsets
i: A loop index (1 ≤ i ≤ MaxK - 1)
j: A loop index (1 ≤ j ≤ i + 1)

Table 2: Variables used in the SETM*-Lmax algorithm

In procedure SETM*-Lmax, as shown in Figure 9, the first step is to generate all large itemsets Lk based on the SETM* algorithm, 1 ≤ k ≤ MaxK, and the second step is to delete every element w ∈ Lk that appears among the k-subsets of some itemset in Lk+1, for every k except k = MaxK. Note that the SETM algorithm constructs Rk based on Rk-1 and the original database SALES; for this reason, the SETM algorithm generates and counts too many candidate itemsets. To reduce the size of the candidate database R'k, we use a new strategy to construct R'k in procedure gen-CDB (shown in Figure 12), which joins the filtered database Rk-1 with itself instead of with SALES.

In Agrawal's algorithm [3] for finding sequential patterns, a backward approach is proposed to process step 2, as shown in Figures 14 and 15, where procedure comb(j, i) (shown in Figure 15) is a function that computes the binomial coefficient C(j, i) (the number of ways to choose i positions out of j) and stores all the possible combinations in a two-dimensional array COMBD. (Note that in Figure 15, COMBD[k][1] ... COMBD[k][i] denote the k-th combination pattern of size i.) For the example of comb(6, 4), the contents of COMBD are shown in Figure 16, in addition to the returned value C(6, 4) = 15.

Take Figure 5 as an example: the backward approach deletes BC, BE and CE from L2 in the first iteration by checking L3, and deletes A, B, C and E from L1 in the second iteration by checking L2 and L3. Our step 2, in contrast, is a forward approach: it deletes A, B, C and E from L1 in the first iteration by checking L2, and deletes BC, BE and CE from L2 in the second iteration by checking L3. Checking only Li+1 suffices because, by the downward-closure property of large itemsets, any itemset contained in some larger large itemset is also contained in a large itemset one level up. Figures 17 and 18 show simplified interpretations of the backward and forward approaches, respectively.
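A minimal Python sketch of the forward approach of Figure 18 (our illustration, not the paper's code; L[i] holds the large (i+1)-itemsets as frozensets):

  def forward_lmax(L):
      # Delete from L[i] every itemset contained in some itemset of L[i+1];
      # the survivors over all levels are the maximal large itemsets.
      for i in range(len(L) - 1):
          L[i] = {x for x in L[i] if not any(x < y for y in L[i + 1])}
      return [sorted(x) for level in L for x in level]

  L = [{frozenset(s) for s in 'ABCE'},
       {frozenset(s) for s in ('AC', 'BC', 'BE', 'CE')},
       {frozenset('BCE')}]
  print(forward_lmax(L))  # [['A', 'C'], ['B', 'C', 'E']], matching Figure 7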

procedure SETM*-Lmax;
begin
  (* Step 1: Finding all large itemsets *)
  (* the sort operation is optional *)
  k := 1;
  Lk := gen-Litemset(Sales, minsup);
  Rk := filter-DB(Sales, Lk);
  repeat
    k := k + 1;
    R'k := gen-CDB(Rk-1, Rk-1);
    Lk := gen-Litemset(R'k, minsup);
    Rk := filter-DB(R'k, Lk);
  until Rk = ∅;
  MaxK := k - 1;

  (* Step 2: Deleting items forwards *)
  for i := 1 to MaxK - 1 do
    for j := 1 to (i + 1) do
      del-Litemset(Li, Li+1, i, j);
  Answer := ∪k Lk;
end;

Figure 9: The SETM*-Lmax procedure

In our forward approach, for the examples shown in Figures 5 and 6, Tables 3 and 4 show the changes of the values of the variables for MaxK = 3 and MaxK = 5, respectively. For example, take Figure 5: when i = 2, we first remove BC from L2 by following the else part of procedure del-Litemset, since B (= L2.item1 = L3.item1) and C (= L2.item2 = L3.item2) appear at the first and second positions of L3, where j = 1 and (i + j - 1) = 2. Next, we remove CE from L2 by following the else part of procedure del-Litemset, since C (= L2.item1 = L3.item2) and E (= L2.item2 = L3.item3) appear at the second and third positions of L3, where j = 2 and (i + j - 1) = 3. Finally, we remove BE from L2 by following the then part of procedure del-Litemset, since B (= L2.item1 = L3.item1) and E (= L2.item2 = L3.item3) appear at the first and third positions of L3, where (j - 2) = 1 and j = (i + 1) = 3. Therefore, only the itemset AC remains in L2. The change of L2 is shown in Figure 19.
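In other words, as j runs from 1 to i + 1, del-Litemset matches each of the i + 1 different i-subsets of an (i+1)-itemset: j = 1 matches positions 1..i (dropping the last item), j = 2 matches positions 2..i+1 (dropping the first), and j > 2 drops the item at position j - 1. A small Python sketch of this indexing (ours, for illustration):

  def w_subset_for_j(itemset, j):
      # Return the i-subset of an (i+1)-itemset matched at step j,
      # with 1-based j as in procedure del-Litemset.
      i = len(itemset) - 1
      if j > 2:
          return itemset[:j - 2] + itemset[j - 1:]  # then part: drop position j - 1
      return itemset[j - 1:j - 1 + i]               # else part: contiguous run

  print([w_subset_for_j(('B', 'C', 'E'), j) for j in (1, 2, 3)])
  # [('B', 'C'), ('C', 'E'), ('B', 'E')] -- the deletion order of Figure 19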

procedure gen-Litemset(R'k, minsup);
begin
  insert into Lk
  select p.item1, ..., p.itemk, COUNT(*)
  from R'k p
  group by p.item1, ..., p.itemk
  having COUNT(*) ≥ :minsup;
end;

Figure 10: The gen-Litemset procedure

procedure filter-DB(R'k, Lk);
begin
  insert into Rk
  select p.tid, p.item1, ..., p.itemk
  from R'k p, Lk q
  where p.item1 = q.item1 AND ... AND p.itemk = q.itemk;
end;

Figure 11: The filter-DB procedure

procedure gen-CDB(Rk-1, Rk-1);
begin
  insert into R'k
  select p.tid, p.item1, ..., p.itemk-1, q.itemk-1
  from Rk-1 p, Rk-1 q
  where p.tid = q.tid AND p.item1 = q.item1 AND ... AND
        p.itemk-2 = q.itemk-2 AND p.itemk-1 < q.itemk-1;
end;

Figure 12: The gen-CDB procedure
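Since all three procedures are plain SQL, one round of the pipeline can be run directly on a relational engine. The following is our minimal sketch assuming SQLite (table names such as RC2, standing for R'2, are ours), exercising gen-Litemset, filter-DB and gen-CDB for k = 2 on the data of Figure 2:

  import sqlite3

  con = sqlite3.connect(":memory:")
  cur = con.cursor()
  cur.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
  cur.executemany("INSERT INTO sales VALUES (?, ?)",
                  [(100, 'A'), (100, 'C'), (100, 'D'), (200, 'B'), (200, 'C'),
                   (200, 'E'), (300, 'A'), (300, 'B'), (300, 'C'), (300, 'E'),
                   (400, 'B'), (400, 'E')])

  # gen-Litemset and filter-DB for k = 1 (minimum support = 2 transactions, i.e. 50%)
  cur.execute("""CREATE TABLE L1 AS SELECT item AS item1, COUNT(*) AS support
                 FROM sales GROUP BY item HAVING COUNT(*) >= 2""")
  cur.execute("""CREATE TABLE R1 AS SELECT p.tid, p.item AS item1
                 FROM sales p JOIN L1 q ON p.item = q.item1""")

  # gen-CDB: build the candidate DB R'2 from R1 alone, not from the raw sales table
  cur.execute("""CREATE TABLE RC2 AS SELECT p.tid, p.item1, q.item1 AS item2
                 FROM R1 p JOIN R1 q ON p.tid = q.tid AND p.item1 < q.item1""")

  # gen-Litemset for k = 2
  cur.execute("""CREATE TABLE L2 AS SELECT item1, item2, COUNT(*) AS support
                 FROM RC2 GROUP BY item1, item2 HAVING COUNT(*) >= 2""")
  print(cur.execute("SELECT * FROM L2 ORDER BY item1, item2").fetchall())
  # [('A', 'C', 2), ('B', 'C', 2), ('B', 'E', 3), ('C', 'E', 2)], as in Figure 5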

loop   i   j   i+j-1   j-2   j-1   i+1
1      1   1   1       -     -     -
2      1   2   2       -     -     -
3      2   1   2       -     -     -
4      2   2   3       -     -     -
5      2   3   -       1     2     3

Table 3: Changes of the values of the variables for MaxK = 3

procedure del-Litemset(Li, Li+1, i, j);
begin
  if (j > 2) then
  begin
    delete from Li p
    where exists
      (select * from Li+1 q
       where p.item1 = q.item1 AND ... AND p.itemj-2 = q.itemj-2 AND
             p.itemj-1 = q.itemj AND ... AND p.itemi = q.itemi+1)
  end
  else
  begin
    delete from Li p
    where exists
      (select * from Li+1 q
       where p.item1 = q.itemj AND ... AND p.itemi = q.itemi+j-1)
  end;
end;

Figure 13: The del-Litemset procedure

(* Step 2: Deleting items backwards *)
for i := (MaxK - 1) downto 1 do
  for j := (i + 1) to MaxK do
  begin
    (* comb(j, i) is a function to compute C(j, i) and generate COMBD *)
    loop_times := comb(j, i);
    for k := 1 to loop_times do
    begin
      delete from Li p
      where exists
        (select * from Lj q
         where p.item1 = q.itemCOMBD[k][1] AND ... AND
               p.itemi = q.itemCOMBD[k][i]);
    end;
  end;

Figure 14: Step 2 in Agrawal's algorithm (denoted as BLmax)

function comb(j, i): integer;
begin
  (* compute C(j, i) *)
  x := 1; y := 1;
  for k := j downto (i + 1) do
    x := x * k;
  for k := (j - i) downto 1 do
    y := y * k;
  total_times := x div y;

  (* generate COMBD *)
  for k := 1 to i do
    COMBD[1][k] := k;
  for k := 2 to total_times do
  begin
    for x := 1 to i do
      COMBD[k][x] := COMBD[k-1][x];
    p := i;
    COMBD[k][p] := COMBD[k][p] + 1;
    while (COMBD[k][i] > j) do
    begin
      p := p - 1;
      COMBD[k][p] := COMBD[k][p] + 1;
      for x := (p + 1) to i do
        COMBD[k][x] := COMBD[k][x-1] + 1;
    end;
  end;

  (* return value *)
  comb := total_times;
end;

Figure 15: The comb function
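For comparison, the same combination patterns can be generated with a standard library routine; a short Python check (ours) against Figure 16:

  from itertools import combinations

  combd = list(combinations(range(1, 7), 4))  # all 4-of-6 position patterns
  print(len(combd))                     # 15, i.e., C(6, 4)
  print(combd[0], combd[3], combd[-1])  # (1, 2, 3, 4) (1, 2, 4, 5) (3, 4, 5, 6)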

k    COMBD[k][1..4]
1    1 2 3 4
2    1 2 3 5
3    1 2 3 6
4    1 2 4 5
5    1 2 4 6
6    1 2 5 6
7    1 3 4 5
8    1 3 4 6
9    1 3 5 6
10   1 4 5 6
11   2 3 4 5
12   2 3 4 6
13   2 3 5 6
14   2 4 5 6
15   3 4 5 6

Figure 16: The contents of COMBD computed by function comb(6, 4) (1 ≤ i ≤ 4)

(* Deleting items backwards *)
for i := (k - 1) downto 1 do
  Delete all itemsets in Li contained in some subsets of Lj, j > i;

Figure 17: A backward approach

(* Deleting items forwards *)
for i := 1 to (k - 1) do
  Delete all itemsets in Li contained in some subsets of Li+1;

Figure 18: A forward approach

i   j   Deleted Item   Resulting L2
2   1   BC             AC, BE, CE
2   2   CE             AC, BE
2   3   BE             AC

Figure 19: Change of L2 (i = 2)

loop   i   j   i+j-1   j-2   j-1   i+1
1      1   1   1       -     -     -
2      1   2   2       -     -     -
3      2   1   2       -     -     -
4      2   2   3       -     -     -
5      2   3   -       1     2     3
6      3   1   3       -     -     -
7      3   2   4       -     -     -
8      3   3   -       1     2     4
9      3   4   -       2     3     4
10     4   1   4       -     -     -
11     4   2   5       -     -     -
12     4   3   -       1     2     5
13     4   4   -       2     3     5
14     4   5   -       3     4     5

Table 4: Changes of the values of the variables for MaxK = 5

5 Performance

In this section, we study the performance of the proposed SETM*-Lmax algorithm by simulation. Our experiments were performed on a Pentium III server with one CPU with a clock rate of 450 MHz and 128 MB of main memory, running Windows NT 2000; the programs were coded in Delphi. The data resided in a Delphi relational database stored on a local 8 GB IDE 3.5" drive.

5.1 Generation of Synthetic Data

We generated synthetic transactions to evaluate the performance of the algorithms over a large range of data characteristics. The synthetic data simulate a customer buying pattern in a retail environment. The parameters used in the generation of the synthetic data are shown in Table 5.

|D|: Number of transactions
|T|: Average size of transactions
|MT|: Maximum size of the transactions
|I|: Average size of the maximal potentially large itemsets
|MI|: Maximum size of the potentially large itemsets
|L|: Number of maximal potentially large itemsets
N: Number of items

Table 5: Parameters

The length of a transaction is determined by a Poisson distribution with mean μ equal to |T|; the size of a transaction is between 1 and |MT|. The transaction is repeatedly assigned items from a set F of potentially large itemsets, as long as the length of the transaction does not exceed the generated length [2, 17, 19, 24]. The length of an itemset in F is determined according to a Poisson distribution with mean μ equal to |I|; the size of each potentially large itemset is between 1 and |MI|. Items in the first itemset are chosen randomly from the set of items. To model the phenomenon that large itemsets often have common items, some fraction of the items in subsequent itemsets is chosen from the previous itemset generated. We use an exponentially distributed random variable with mean equal to the correlation level to decide this fraction for each itemset; the remaining items are picked at random. In the datasets used in the experiments, the correlation level was set to 0.5.

Each itemset in F has an associated weight that determines the probability that this itemset will be picked. The weight is picked from an exponential distribution with mean equal to 1, and the weights are normalized such that their sum equals 1. For example, suppose the number of large itemsets is 5. According to the exponential distribution with mean equal to 1, the probabilities for the 5 itemsets with IDs 1, 2, 3, 4 and 5 are 0.43, 0.26, 0.16, 0.1 and 0.05, respectively, after the normalization process. These probabilities are then accumulated, so that each itemset ID corresponds to a range, as shown in Table 6. For each transaction, we generate a random real number between 0 and 1 to determine the ID of the potentially large itemset to use.

To model the phenomenon that all the items in a large itemset are not always bought together, we assign each itemset in F a corruption level c. When adding an itemset to a transaction, we keep dropping an item from the itemset as long as a uniformly distributed random number (between 0 and 1) is less than c. The corruption level for an itemset is fixed and is obtained from a normal distribution with mean = 0.5 and variance = 0.1. Each transaction is stored in a file system in the form <transaction identifier, item>.
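The following condensed Python sketch (our illustration, not the authors' generator) reproduces the main ingredients under the stated distributions: Poisson transaction lengths, itemset weights drawn from an exponential distribution and normalized, and per-itemset corruption. The correlation between successive itemsets in F is omitted for brevity, and the clamping and retry limits are ours, added so the sketch always terminates:

  import math, random

  def poisson(lam):
      # Knuth's method for a Poisson variate with mean lam
      L, k, p = math.exp(-lam), 0, 1.0
      while p > L:
          k += 1
          p *= random.random()
      return k - 1

  def gen_transactions(n_tx, T, MT, F):
      weights = [random.expovariate(1.0) for _ in F]   # exponential, mean 1
      probs = [w / sum(weights) for w in weights]      # normalized to sum to 1
      # corruption levels, normal(0.5, var 0.1), clamped so dropping terminates
      corr = [min(max(random.gauss(0.5, math.sqrt(0.1)), 0.0), 0.9) for _ in F]
      txs = []
      for _ in range(n_tx):
          length = min(max(poisson(T), 1), MT)         # transaction size in [1, MT]
          t, tries = set(), 0
          while len(t) < length and tries < 50:
              tries += 1
              i = random.choices(range(len(F)), weights=probs)[0]
              kept = list(F[i])
              while kept and random.random() < corr[i]:  # keep dropping items
                  kept.pop(random.randrange(len(kept)))
              if len(t) + len(kept) <= length:
                  t.update(kept)
          txs.append(sorted(t))
      return txs

  print(gen_transactions(5, 5, 10, [list('ABC'), list('CDE'), list('EF')]))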

Itemset ID   Range
1            0 - 0.43
2            0.44 - 0.69
3            0.70 - 0.85
4            0.86 - 0.95
5            0.96 - 1

Table 6: The probabilities of itemsets after normalization

Several different data sets were used for the performance comparison. Table 7 shows the name and parameter settings of each data set. For all data sets, N was set to 1,000 and |L| was set to 2,000.

Case   Name                    |T|   |MT|   |I|   |MI|   |D|   Size
1      T5.MT10.I2.MI4.D20K     5     10     2     4      20K   1.5 MB
2      T10.MT15.I6.MI10.D5K    10    15     6     10     5K    0.8 MB

Table 7: Parameter values for the synthetic datasets

5.2 Experiments

In this section, we compare the performance of our SETM*-Lmax algorithm, based on the forward approach (denoted as FLmax), with the backward approach described in Agrawal's algorithm [3] (denoted as BLmax). For the synthetic dataset T5.MT10.I2.MI4.D20K (Case 1), Figure 20 compares the execution time of the forward and backward approaches; the detailed numbers are shown in Table 8. The result shows that our forward approach requires shorter time than the backward approach. Obviously, as the value of the minimum support is decreased, the execution time of both approaches increases. Figure 21 shows another simulation result, for the synthetic dataset T10.MT15.I6.MI10.D5K (Case 2), which also shows that the forward approach requires shorter time than the backward approach.

[Figure 20: A comparison of execution time (seconds) between BLmax (the backward approach) and FLmax (the forward approach) over minimum support 0.50%-2.00% (T5.MT10.I2.MI4.D20K: Case 1)]

Minimum support (%)   0.5    0.75   1      1.5   2
BLmax (seconds)       18.3   7.98   4.65   2.3   1.7
FLmax (seconds)       9.24   3.45   2.11   1.1   0.75

Table 8: A comparison of execution time based on different values of the minimum support (T5.MT10.I2.MI4.D20K: Case 1)

[Figure 21: A comparison of execution time (seconds) between BLmax (the backward approach) and FLmax (the forward approach) over minimum support 0.50%-2.00% (T10.MT15.I6.MI10.D5K: Case 2)]

6 Conclusion

Discovery of association rules is an important problem in the area of data mining. Since the amount of data processed in mining association rules tends to be huge, it is important to devise efficient algorithms to conduct mining on such data [7]. In order to benefit from the experience with relational databases, a set-oriented approach to mining data is needed [14]; in such an approach, the data mining operations are expressed in terms of relational or set-oriented operations. In this paper, to find large itemsets of a specific size in a relational database, we have proposed the SETM*-Lmax algorithm to find all maximal large itemsets from the large itemsets Lk, and we have studied its performance. The simulation results have shown that the proposed forward approach (SETM*-Lmax) to find all maximal large itemsets requires shorter time than the backward approach. In the future, we plan to extend this work to the related problems of mining multiple-level association rules, mining sequential patterns, and mining path traversal patterns.

References

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules Between Sets of Items in Large Databases," Proc. 1993 ACM SIGMOD Int'l Conf. Management of Data, pp. 207-216, May 1993.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases, pp. 490-501, Sept. 1994.
[3] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th IEEE Int'l Conf. Data Engineering, pp. 3-14, March 1995.
[4] R. Agrawal and K. Shim, "Developing Tightly-Coupled Applications on IBM DB2/CS Relational Database System: Methodology and Experience," IBM Research Report, 1995.
[5] R. Agrawal, C. C. Aggarwal, and V. V. V. Prasad, "A Tree Projection Algorithm for Generation of Frequent Item Sets," Journal of Parallel and Distributed Computing, Vol. 61, No. 3, pp. 350-371, March 2001.
[6] F. Berzal, J. Cubero, N. Marin, and J. Serrano, "TBAR: An Efficient Method for Association Rule Mining in Relational Databases," Data and Knowledge Engineering, Vol. 37, No. 1, pp. 47-64, April 2001.
[7] M.-S. Chen, J. Han, and P. S. Yu, "Data Mining: An Overview from a Database Perspective," IEEE Trans. on Knowledge and Data Engineering, Vol. 8, No. 5, pp. 866-882, Dec. 1996.
[8] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. Ullman, and C. Yang, "Finding Interesting Associations without Support Pruning," IEEE Transactions on Knowledge and Data Engineering, Vol. 13, No. 1, pp. 64-78, Jan. 2001.
[9] Y. Fu, "Data Mining," IEEE Potentials, pp. 18-20, 1997.
[10] V. Ganti, J. Gehrke, and R. Ramakrishnan, "Mining Very Large Databases," IEEE Computer, Vol. 32, No. 8, pp. 38-45, 1999.
[11] J. Han and Y. Fu, "Mining of Multiple-Level Association Rules from Large Databases," IEEE Trans. on Knowledge and Data Engineering, Vol. 11, No. 5, pp. 798-805, September/October 1999.
[12] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. 2000 ACM SIGMOD Conf. on Management of Data, pp. 1-11, May 2000.
[13] J. Han and J. Pei, "Mining Frequent Patterns by Pattern-Growth: Methodology and Implications," ACM SIGKDD Explorations (Special Issue on Scalable Data Mining Algorithms), Vol. 2, No. 4, pp. 14-20, December 2000.
[14] M. Houtsma and A. Swami, "Set-Oriented Mining for Association Rules in Relational Databases," Proc. 11th IEEE Int'l Conf. Data Engineering, pp. 25-33, 1995.
[15] Dao-I Lin and Zvi M. Kedem, "Pincer Search: An Efficient Algorithm for Discovering the Maximum Frequent Set," IEEE Trans. on Knowledge and Data Engineering, Vol. 14, No. 3, pp. 553-565, May/June 2002.
[16] H. Mannila, H. Toivonen, and A. Inkeri Verkamo, "Efficient Algorithms for Discovering Association Rules," Proc. AAAI Workshop on Knowledge Discovery in Databases, pp. 181-192, July 1994.
[17] J.-S. Park, M.-S. Chen, and P. S. Yu, "An Effective Hash-Based Algorithm for Mining Association Rules," Proc. 1995 ACM SIGMOD Int'l Conf. Management of Data, pp. 175-186, May 1995.
[18] G. Piatetsky-Shapiro, "Discovery, Analysis, and Presentation of Strong Rules," in G. Piatetsky-Shapiro and W. J. Frawley, eds., Knowledge Discovery in Databases, AAAI/MIT Press, pp. 229-238, 1991.
[19] A. Savasere, E. Omiecinski, and S. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 432-444, Sept. 1995.
[20] S. Sarawagi, S. Thomas, and R. Agrawal, "Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications," Proc. 1998 ACM SIGMOD Int'l Conf. Management of Data, pp. 343-354, 1998.
[21] R. Srikant and R. Agrawal, "Mining Generalized Association Rules," Proc. 21st Int'l Conf. Very Large Data Bases, pp. 407-419, Sept. 1995.
[22] D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal, "Query Flocks: A Generalization of Association-Rule Mining," Proc. ACM SIGMOD Int'l Conf. on Management of Data, pp. 1-12, June 1998.
[23] S.-Y. Wur and Y. Leu, "An Effective Boolean Algorithm for Mining Association Rules in Large Databases," Proc. 6th Int'l Conf. Database Systems for Advanced Applications, pp. 179-186, April 1999.
[24] S.-J. Yen and A. Chen, "An Efficient Approach to Discovering Knowledge from Large Databases," Proc. 4th Int'l Conf. Parallel and Distributed Information Systems, pp. 8-18, 1996.
