
DOI 10.1007/s00778-007-0078-6

REGULAR PAPER

Mining top-k frequent patterns in the presence of the memory constraint

Kun-Ta Chuang · Jiun-Long Huang ·

Ming-Syan Chen

Received: 16 January 2006 / Revised: 11 March 2007 / Accepted: 8 August 2007 / Published online: 7 November 2007 © Springer-Verlag 2007

Abstract We explore in this paper a practicably interesting mining task to retrieve top-k (closed) itemsets in the presence of the memory constraint. Specifically, as opposed to most previous works that concentrate on improving the mining efficiency or on reducing the memory size by best effort, we first attempt to specify the available upper memory size that can be utilized by mining frequent itemsets. To comply with the upper bound of the memory consumption, two efficient algorithms, called MTK and MTK_Close, are devised for mining frequent itemsets and closed itemsets, respectively, without specifying the subtle minimum support. Instead, users only need to give a more human-understandable parameter, namely the desired number of frequent (closed) itemsets k. In practice, it is quite challenging to constrain the memory consumption while also efficiently retrieving top-k itemsets. To effectively achieve this, MTK and MTK_Close are devised as level-wise search algorithms, where the number of candidates being generated-and-tested in each database scan is limited. A novel search approach, called the δ-stair search, is utilized in MTK and MTK_Close to effectively assign the available memory for testing candidate itemsets of various itemset-lengths, which leads to a small number of required database scans. As demonstrated in the empirical study on real data and synthetic data, instead of only providing the flexibility of striking a compromise between the execution efficiency and the memory consumption, MTK and MTK_Close can both achieve high efficiency and have a constrained memory bound, showing their prominent advantage as practical algorithms for mining frequent patterns.

K.-T. Chuang (✉) · M.-S. Chen
Department of Electrical Engineering,
National Taiwan University, Taipei, Taiwan, ROC
e-mail: doug@arbor.ee.ntu.edu.tw

M.-S. Chen
e-mail: mschen@cc.ee.ntu.edu.tw

J.-L. Huang
Department of Computer Science,
National Chiao Tung University, Hsinchu, Taiwan, ROC
e-mail: jlhuang@cs.nctu.edu.tw

1 Introduction

The discovery of frequent relationships in a huge database has been known to be useful in selective marketing, decision analysis, and business management [14]. A popular area of application is market basket analysis, which studies the buying behaviors of customers by searching for sets of items that are frequently purchased together. Specifically, let I = {x1, x2, . . . , xm} be a set of items. A set X ⊆ I with m = |X| is called an m-itemset or simply an itemset. Formally, an itemset X is referred to as a frequent itemset or a large itemset if the support of X, i.e., the fraction of transactions in the database that contain X, is larger than the minimum support threshold, indicating that the presence of itemset X is significant in the database.

However, it is reported that discovering frequent itemsets suffers from two inherent obstacles, namely, (1) the subtle determination of the minimum support [22]; and (2) the unbounded memory consumption [11]. Specifically, without specific knowledge, a critical problem, "What is the appropriate minimum support?", is usually left to users in previous works. Note that setting the minimum support is quite subtle, since a small minimum support may result in an extremely large set of frequent itemsets at the cost of execution efficiency. Oppositely, setting a large minimum support may only generate a few itemsets, which cannot provide enough information for marketing decisions.


In order to obtain a desired result, users in general need to tune the minimum support over a wide range. This is very time-consuming and is indeed a serious problem for the applicability of mining frequent itemsets. Furthermore, another issue faced in practice is the large memory consumption. A large memory, which may not be affordable in most personal computers nowadays, is in general required during the mining process, especially when the minimum support is small or the database size is large. This can result in a serious "out of memory" system crash, making users shy away from executing frequent itemset mining. Note that users may tolerate mining frequent itemsets off-line. For example, frequent itemsets can be discovered every night as long as users are able to make their marketing decisions in the morning. In contrast, a system crash due to the "out of memory" error is repulsive in a commercial mining system.

To remedy the first problem, recent research advances in data mining call for the need to discover top-k frequent patterns without the minimum support specification [6,22]. The top-k frequent patterns refer to the k most frequent itemsets in the database. As opposed to specifying the subtle minimum support, users only need to give the desired count of frequent itemsets, which is indeed a more human-understandable parameter. For example, to make marketing decisions, users may be interested in fewer than 10,000 frequent itemsets. Hence they can easily set the number of frequent itemsets k equal to 10,000 and mine the top-k frequent itemsets. More specifically, instead of mining top-k frequent itemsets, the work in [22] aimed to discover top-k closed itemsets whose lengths exceed a specified threshold. Under such a specific constraint, the FP-tree can be constructed with several pruning strategies, such as omitting transactions whose lengths are less than the specified itemset-length. Moreover, the work in [6] initially constructs a complete FP-tree in memory, and then retrieves the k most frequent l-itemsets, where l lies in a range specified by users. In addition, a recent work in [1] studied a post-processing approach to determine the k patterns that best approximate all frequent itemsets discovered. However, its concept inherently deviates far from discovering top-k frequent patterns, since its objective is to approximately describe the set of frequent itemsets, and the minimum support still needs to be specified in advance.

The second problem of mining frequent patterns, i.e., the unbounded memory consumption, has been discussed in the direction of reducing the required memory by means of compressed structures or skillful search approaches [4,10,17,19]. Recently, the issue has also received a great deal of attention in mining data streams [7,15,23,25]. Since a large memory consumption is prohibitive in streaming environments, we have to discover frequent patterns within an estimated memory upper bound at the cost of the resulting precision [3]. For example, the solution in [25] empirically derived its memory upper bound of O(s13), where s is the specified minimum support. Formally, the same as traditional algorithms such as Apriori [2] and FPGrowth [12], the applicability of these approaches is valid based on the premise that the required memory can be unconditionally provided by the system. However, this is improbable, and the "out of memory" system crash is still likely to happen when the minimum support is small or the data distribution is quite dense. Recent studies in frequent-pattern mining have pointed out that most previous works were optimized for efficiency at the cost of the memory space, and thus their scalability needs further justification [11]. In practice, a desirable research direction is to allow the available memory upper bound, say 100 or 200 MB, to be specified by system designers. Mining frequent patterns under the specified bound of the memory consumption is referred to as "memory-constraint frequent-pattern mining" in this paper. Despite its great applicability, how to realize memory-constraint frequent-pattern mining has, however, not been fully explored thus far.

To enable the better feasibility of mining frequent patterns, we examine in this paper the problem of discovering top-k frequent patterns, coupled with the need for memory-constraint mining. The goal is desirable but quite challenging. Note that previous works on mining top-k frequent patterns [6,22] need to be executed by initially building a complete FP-tree in memory. It is clear that the memory problem will be worse than in traditional frequent itemset mining, since the size of the in-memory FP-tree is solely proportional to the entire database size.1 Although we can implement the disk-based FPGrowth algorithm [13] to ensure the complete FP-tree can be constructed, it has been reported that the disk-based implementation is much less efficient than the memory-based implementation, since the I/O swap will drastically degrade the mining performance [10]. Furthermore, previous works on mining top-k frequent itemsets only concentrate on mining special itemsets such as closed itemsets [22] or itemsets with a specified long itemset-length [6], because some heuristic strategies to reduce the search space can be applied. For example, as mentioned above, the FP-tree can be constructed by omitting transactions whose lengths are less than the specified itemset-length, which helps to reduce the size of the FP-tree and makes the search more efficient [22]. However, those pruning techniques are no longer valid in the general model of mining pure top-k frequent itemsets. Mining top-k frequent itemsets without any constraint on item types or itemset-lengths is referred to as mining pure top-k frequent patterns in this paper. The naive extensions of previous solutions to discover pure top-k frequent itemsets

1 One may suggest applying sampling prior to mining top-k frequent itemsets at the cost of the resulting precision. However, for obtaining a consistent mining result, the space to store the complete FP-tree is still unbounded, since all itemset combinations remain in the tree (note that what changes after the unbiased sampling is the frequency of itemsets rather than the tree structure).


will not only lead to inefficiency but also face a more serious memory bottleneck. Note that determining the minimum itemset-length incurs another inconvenience for users, which conflicts with the purpose of releasing users from the determination of subtle parameters. In addition, in many real applications such as retail applications, mining pure top-k frequent itemsets is equally or more important than mining top-k itemsets with long itemset-lengths (users may not be interested in long itemsets, since they usually attempt to cross-sell two or three products as opposed to one hundred products [14]).

Actually, as we can imagine, even though the required memory space is not affordable in a PC nowadays, the memory issue will become insignificant in a server-level machine in the near future. In addition, the minimum support may be determined by a domain expert without much effort. However, in our consideration, the data mining functionality is not a patent owned by a few people. It is worth providing an easily deployed solution to mine association rules everywhere and at any time, in such a way that the visibility and usability of the mining capability can be broadened to more users with a PC in hand.

As a consequence, we propose in this paper efficient solutions, called MTK (standing for Memory-constraint Top-K frequent-pattern mining) and MTK_Close, to discover pure top-k frequent patterns and top-k closed patterns, respectively, in the presence of the memory constraint. Since our goal is to release users from the burden of setting subtle parameters and to provide better flexibility in various applications, the itemset-length constraint is not imposed on our model. Note that FP-tree based solutions intrinsically cannot be memory-constraint frequent-pattern mining approaches, since the size of an in-memory FP-tree is proportional to the database size. As such, we devise MTK and MTK_Close as level-wise algorithms, analogous to Apriori [2], DHP [19], and DIC [4]. In practice, level-wise algorithms generate a potentially huge set of candidate itemsets which may not fit in memory, leading to a large memory requirement. To remedy this, we devise an efficient search approach in MTK and MTK_Close, called the δ-stair search, to limit the number of candidates which are generated-and-tested in each database scan. Specifically, the δ-stair search assigns the available memory to concurrently generate candidates with consecutive itemset-lengths. Using the δ-stair search leads to a small number of database scans required to retrieve the set of top-k frequent itemsets, ensuring the memory usage can be constrained without compromising the execution efficiency. More importantly, the MTK algorithm even requires a smaller number of database scans than traditional approaches with an unbounded memory usage. This is attributed to the fact that the δ-stair search can effectively utilize the memory to test candidates which are highly likely to be included in the top-k frequent itemsets. In addition, the high

efficiency also comes from the fact that the MTK and MTK_Close algorithms are designed to fully integrate many skillful techniques proposed in the literature, such as the scan-reduction technique [2,20] and the hash-indexing technique [19] (readers can refer to [9] for a detailed survey and comparison of these optimizations). As such, the MTK and MTK_Close algorithms can not only comply with the memory constraint but also retrieve top-k frequent/closed itemsets with high efficiency.

The contribution of this paper can be summarized as follows: (1) While previous works on mining frequent patterns mostly concentrate on improving the mining efficiency or on reducing the memory size by best effort, we further investigate in this paper the important issue of mining frequent/closed itemsets in the presence of an explicit memory constraint. (2) While previous works on mining top-k frequent patterns aimed to discover special top-k patterns, we propose the MTK algorithm to mine pure top-k frequent itemsets, and devise its extension to mine top-k closed itemsets, to provide better flexibility of mining frequent patterns for various applications. (3) We complement our analytical and algorithmic results by a thorough empirical study on real data and synthetic data, and show that MTK and MTK_Close can retrieve top-k itemsets and top-k closed itemsets with high efficiency even though the memory usage is constrained. The results demonstrate that, instead of only providing the flexibility of striking a compromise between the execution efficiency and the memory consumption, MTK and MTK_Close can both achieve high efficiency and have a constrained memory bound, showing their prominent advantages as practical algorithms for mining frequent patterns.

This paper is organized as follows. Section 2 introduces the problem description and gives a baseline approach to discover frequent itemsets with the memory constraint. In Sect. 3, we give the design of the δ-stair search to retrieve top-k frequent itemsets. The implementation of the MTK and MTK_Close algorithms is presented in Sect. 4. Section 5 shows the experimental results. Finally, this paper concludes with Sect. 6.

2 Memory-constraint frequent-pattern mining

In Sect. 2.1, we formally specify the problem we study in this paper. In Sect. 2.2, we introduce a baseline approach, referred to as the Naive algorithm, to discover frequent patterns in the presence of the memory constraint.

2.1 Problem description

We first introduce the notation used hereafter. For ease of exposition, in the sequel, pure top-k frequent itemsets will simply be denoted by top-k frequent itemsets, to distinguish them from top-k closed itemsets without ambiguity. Suppose that sup(X) denotes the support2 of itemset X in the database D. We give several necessary definitions as follows:

Definition 1 (top-k frequent itemsets) Given the desired number of frequent itemsets k, an itemset X is a top-k frequent itemset in D if there are fewer than k itemsets3 whose supports are larger than sup(X). Let Tk denote the set of all top-k frequent itemsets. The minimum support to retrieve Tk will be

supmin(Tk) = min { sup(X) | X ∈ Tk }.

Definition 2 (closed itemsets) An itemset X is referred to as a closed itemset if there exists no itemset X′ such that (1) sup(X′) = sup(X); and (2) X ⊂ X′ [26].

Definition 3 (top-k closed itemsets) Given the desired number of closed itemsets k, an itemset X is a top-k closed itemset in D if there are fewer than k closed itemsets whose supports are larger than sup(X). Let TCk denote the set of all top-k closed itemsets. The minimum support to retrieve TCk will be

supmin(TCk) = min { sup(X) | X ∈ TCk }.

Furthermore, let an itemset containing j items be referred to as a j-itemset. We then have Definition 4 below:

Definition 4 An itemset, denoted by Xj,m, is the mth most frequent j-itemset if and only if there are (m − 1) j-itemsets whose supports exceed sup(Xj,m). In addition, an itemset, denoted by Xcj,m, is the mth most frequent closed j-itemset if and only if there are (m − 1) closed j-itemsets whose supports exceed sup(Xcj,m).

Example 2.1 Using the transaction database shown in Table 1, we illustrate the top ten frequent itemsets and top ten closed itemsets in Table 2 to clarify the notation used. As can be seen, the minimum support to retrieve the top ten frequent itemsets is equal to five, i.e., supmin(T10) = 5, because there are ten itemsets with support larger than or equal to 5. Moreover, the minimum support to retrieve the top ten closed itemsets is equal to four [supmin(TC10) = 4], where we will retrieve 11 closed itemsets so as to be independent of the order of items. In addition, itemset {A} is not a closed itemset since one of its supersets, {AF}, has the same support. In this example, {D} is the fourth most frequent 1-item, which is denoted as

2 Without loss of generality, the support is considered as the absolute occurrence frequency in this paper.

3 Note that there may be more than k itemsets satisfying this definition, since itemsets may have the same support. Definition 1 thereby avoids the situation that the mining result depends on the order of items.

Table 1 An example transaction database

TID     Items
100     A B D F
200     A B F
300     A D F
400     B C E D
500     B C D E F
600     A B F
700     A B F
800     A B D F
900     A B C D F
1,000   A B C E F

Table 2 The illustrative example of top-k frequent/closed itemsets

Top-ten frequent itemsets, supmin(T10) = 5
Itemset   Sup.   Itemset   Sup.
A         8      A F       8
B         9      B D       5
D         6      B F       8
F         9      D F       5
A B       7      A B F     7

Top-ten closed itemsets, supmin(TC10) = 4
Itemset   Sup.   Itemset   Sup.
B         9      B F       8
D         6      D F       5
F         9      A B F     7
A F       8      A D F     4
B C       4      B D F     4
B D       5

Examples of the mth most frequent/closed itemsets
X1,4 = {D}           sup(X1,4) = 6
X2,1 = {AF}, {BF}    sup(X2,1) = 8
Xc3,1 = {ABF}        sup(Xc3,1) = 7

X1,4, because three 1-itemsets {A}, {B} and {F} have supports larger than sup(X1,4). In other words, sup(X1,4) = 6 will be the minimum support to retrieve the top four 1-items. In addition, X2,1 corresponds to either itemset {AF} or {BF}, because they have the same support and there is no 2-itemset whose support exceeds theirs. Furthermore, the first most frequent closed 3-itemset, denoted by Xc3,1, is {ABF}, and sup(Xc3,1) = 7. □
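To make the definitions concrete, the toy database of Table 1 can be mined by brute force. The sketch below (Python; helper names such as `sup_min_topk` are illustrative, not part of the paper's algorithms) enumerates every occurring itemset, applies Definitions 1-3, and reproduces supmin(T10) = 5 and supmin(TC10) = 4:

```python
from itertools import combinations

# The ten transactions of Table 1.
db = ["ABDF", "ABF", "ADF", "BCED", "BCDEF",
      "ABF", "ABF", "ABDF", "ABCDF", "ABCEF"]
transactions = [frozenset(t) for t in db]

def sup(itemset):
    # Support = absolute occurrence frequency (cf. footnote 2).
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
all_itemsets = [frozenset(c)
                for n in range(1, len(items) + 1)
                for c in combinations(items, n)
                if sup(frozenset(c)) > 0]

def sup_min_topk(itemsets, k):
    # Definitions 1/3: the k-th largest support; ties keep every itemset
    # reaching this support in the top-k answer.
    supports = sorted((sup(x) for x in itemsets), reverse=True)
    return supports[k - 1]

# Definition 2: X is closed iff no proper superset has the same support.
closed = [x for x in all_itemsets
          if not any(x < y and sup(x) == sup(y) for y in all_itemsets)]

print(sup_min_topk(all_itemsets, 10))   # supmin(T10)  = 5
print(sup_min_topk(closed, 10))         # supmin(TC10) = 4
```

Exactly ten itemsets have support at least 5, while eleven closed itemsets have support at least 4, matching the tie-handling remark of footnote 3.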

Note that closed itemsets have been deemed the condensed representation of frequent itemsets, because a closed itemset is an itemset that covers all of its sub-itemsets with the same support [22]. In some applications, mining top-k


Fig. 1 The illustrative support distribution plot (axes: support, from 0 past supmin(Tk); one horizontal line per itemset length 1, 2, 3, . . . , n, on which points such as sup(X1,1), sup(X1,2), sup(X1,4), sup(X3,1), and sup(X3,2) are marked)

mining top-k itemsets. We consider in this paper an approach which is equally applicable to mining top-k frequent itemsets and top-k closed itemsets, depending on the application need. In the following, we first describe the support distribution plot, which will be frequently exploited hereafter to illustrate our model of retrieving top-k frequent itemsets.

The support distribution plot: The support distribution plot consists of various parallel lines, where the ith line presents the range of supports of all i-itemsets, and each i-itemset can be plotted on the ith line with respect to its support. An illustration of the support distribution plot is shown in Fig. 1, where we can identify the position of itemset Xi,m, ∀m, on the ith line according to its support sup(Xi,m). As can be seen, itemsets whose supports lie in the shaded region comprise the top-k frequent itemsets. Furthermore, according to the downward closure property [2], the line with respect to i-itemsets will be shorter than the line with respect to j-itemsets, where i > j. □

We then describe the concept of retrieving frequent patterns in the presence of the memory constraint. For ease of presentation, we discuss the issue of the memory constraint in the case of frequent itemsets. The discussion of closed itemsets is similar, and we thus defer the details to Sect. 4. Note that the upper memory consumption of depth-first algorithms such as the FPGrowth algorithm [12] is proportional to the database size, which inherently cannot be limited below a user-specified memory size. We resort to level-wise search algorithms to realize memory-constraint frequent-pattern mining. Specifically, it is clear that the memory consumption of level-wise search algorithms is solely proportional to the number of itemsets residing in memory [24], including the candidate itemsets and the stored itemsets.4

4 We assume that discovered frequent itemsets will be stored in memory for further use.

Moreover,

following Definition 1, mining top-k frequent itemsets can be viewed as mining frequent itemsets with the minimum support equal to supmin(Tk), if we assume that supmin(Tk) can be known in advance. Although it is infeasible to make such an assumption, it helps to clarify important concepts of the considered model. As such, Remark 1 below shows that we can limit the memory consumption of level-wise search algorithms by constraining the number of candidates tested in each database scan.

Remark 1 Suppose that the available memory size is specified as M. M can be equivalently transformed into an upper number of itemsets concurrently residing in memory. Let the corresponding upper number of itemsets in memory be denoted by Mc. As such, the memory consumption will be limited below M if at most Mc candidates are concurrently generated-and-tested in each database scan.5

In essence, the memory size occupied by each i-itemset is proportional to the corresponding itemset-length i. For simplicity, here we assume that all candidate itemsets occupy the same memory regardless of itemset-length. The discussion of this implementation issue will be deferred to Sect. 4. Clearly, Remark 1 states that a level-wise search algorithm is able to limit its memory consumption if we can bound the number of candidate itemsets being tested in one database scan. For example, suppose that Mc = 300,000, meaning that at most 300,000 candidate itemsets can be generated-and-tested in one database scan. Assuming we have 1,000 frequent 1-items, we can only select 775 1-items to generate candidate 2-itemsets, since the binomial coefficient C(775, 2) = 299,925, which is bounded below Mc. These 299,925 candidate 2-itemsets will be tested in one database scan, and the remaining C(1000, 2) − C(775, 2) = 199,575 candidate 2-itemsets will be generated-and-tested in the next database scan.
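The arithmetic of this example can be reproduced with a short sketch (Python; the variable names are illustrative). For 2-itemsets generated from n 1-items, the candidate bound is simply the binomial coefficient C(n, 2):

```python
from math import comb

Mc = 300_000            # upper number of candidate itemsets per scan
num_frequent_1 = 1000   # assume 1,000 frequent 1-items

# Largest n with C(n, 2) <= Mc: only the n most frequent 1-items
# may generate their candidate 2-itemsets in the coming scan.
n = max(i for i in range(2, num_frequent_1 + 1) if comb(i, 2) <= Mc)

print(n)                                      # 775
print(comb(n, 2))                             # 299925
print(comb(num_frequent_1, 2) - comb(n, 2))   # 199575 deferred to next scan
```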

The concept of this approach deviates far from that of previous level-wise search algorithms, where all candidate (i + 1)-itemsets are generated-and-tested in one database scan after all frequent i-itemsets have been found. Actually, readers may easily point out a straightforward solution to constrain the number of generated candidates: (1) arbitrarily combine frequent i-itemsets to generate their corresponding candidate (i + 1)-itemsets until the number of candidates reaches the upper number of candidates Mc; (2) test these Mc candidates in one database scan and identify the contained frequent itemsets; (3) return to Step 1 unless no more candidate (i + 1)-itemsets can be generated; (4) increase i by 1 and return to Step 1.

5 It is reasonable to assume that Mc is much larger than the desired number of frequent itemsets k. Therefore, without loss of generality, we simply assume Mc only indicates the upper number of candidate itemsets which will be concurrently generated-and-tested in each database scan.


However, it is clear that the procedure of candidate generation becomes difficult, since those early-generated candidates must be systematically recorded to avoid duplicate generation. Moreover, this approach may incur extra database scans, since the available memory may not be fully utilized in some database scans (for example, the latest scan to test candidate i-itemsets may only occupy a small amount of memory). This conflicts with the spirit of previous works of reducing the number of database scans. To realize memory-constraint frequent-pattern mining without compromising the execution efficiency, the number of database scans is required to be as small as possible. We therefore devise a baseline algorithm, called the Naive algorithm, to be an efficient memory-constraint frequent-pattern mining approach.

2.2 Algorithm Naive: the baseline method to discover frequent patterns in the presence of the memory constraint

In order to efficiently generate candidates in the presence of the memory constraint, we resort to the recent advanced technique presented in [8]. Specifically, the technique in [8] can estimate a tight upper bound on the number of candidate itemsets. In our model, this technique can be further utilized to select an appropriate set of frequent i-itemsets in such a way that we can guarantee that their candidate (i + 1)-itemsets can be fully generated in the available memory. Formally, given a set of j-itemsets Fj, the upper bound of candidate (j + i)-itemsets generated from Fj can be estimated according to Theorem 1 below:

Theorem 1 Given N and j, there exists a unique representation, called the j-canonical representation, of the form

N = C(mj, j) + C(mj−1, j − 1) + · · · + C(mr, r),

where C(·, ·) denotes the binomial coefficient, r ≥ 1, mj ≥ mj−1 ≥ · · · ≥ mr, and mv ≥ v for v = r, r + 1, . . . , j. Therefore, assuming we have N j-itemsets, the tight upper bound of candidate (j + i)-itemsets generated from these N j-itemsets will be equal to

C̄j,i(N) = C(mj, j + i) + C(mj−1, j − 1 + i) + · · · + C(ms+1, s + 1 + i),

where i ≥ 1 and s is the smallest integer such that ms < s + i. If no such integer exists, s will be equal to r − 1 [8].

Theorem 1 gives the tight upper bound of candidate (j + i)-itemsets which will be generated from a set of N j-itemsets. An illustrative example, quoted from [8], is shown below to clarify the concept of Theorem 1.

Example 2.2 Suppose that there are 13 3-itemsets in L3, which are

{{3,2,1}, {4,2,1}, {4,3,1}, {4,3,2}, {5,2,1}, {5,3,1}, {5,3,2}, {5,4,1}, {5,4,2}, {5,4,3}, {6,2,1}, {6,3,1}, {6,3,2}}.

The 3-canonical representation of 13 is C(5, 3) + C(3, 2) = 13, and hence the upper bound of candidate 4-itemsets is C̄3,1(13) = C(5, 4) + C(3, 3) = 6. The upper bound of candidate 5-itemsets is C̄3,2(13) = C(5, 5) = 1. This is tight indeed, since the candidates C4 generated from L3 will be C4 = {{4,3,2,1}, {5,3,2,1}, {5,4,2,1}, {5,4,3,1}, {5,4,3,2}, {6,3,2,1}}, and C5 is {{5,4,3,2,1}}. □
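Theorem 1 lends itself to a direct implementation. The sketch below (Python; `canonical` and `candidate_upper_bound` are hypothetical helper names, not from the paper) computes the j-canonical representation greedily and reproduces the numbers of Example 2.2:

```python
from math import comb

def canonical(N, j):
    # j-canonical representation of N (Theorem 1): greedily pick the
    # largest m with C(m, v) <= remaining N, for v = j, j-1, ...
    rep, v = [], j
    while N > 0:
        m = v
        while comb(m + 1, v) <= N:
            m += 1
        rep.append((m, v))
        N -= comb(m, v)
        v -= 1
    return rep

def candidate_upper_bound(N, j, i):
    # Tight upper bound C̄j,i(N) on candidate (j+i)-itemsets generated
    # from N j-itemsets: sum C(m_v, v + i), stopping once m_v < v + i.
    total = 0
    for m, v in canonical(N, j):
        if m < v + i:
            break
        total += comb(m, v + i)
    return total

print(canonical(13, 3))                 # [(5, 3), (3, 2)]
print(candidate_upper_bound(13, 3, 1))  # 6 candidate 4-itemsets
print(candidate_upper_bound(13, 3, 2))  # 1 candidate 5-itemset
```

For instance, canonical(13, 3) yields [(5, 3), (3, 2)], i.e., 13 = C(5, 3) + C(3, 2), and the resulting bounds are 6 candidate 4-itemsets and 1 candidate 5-itemset, as in the example.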

In light of Theorem 1, we devise a naive extension of level-wise search algorithms, called the Naive algorithm, to discover top-k frequent itemsets with the memory constraint:

Naive algorithm: We illustrate the idea of the Naive algorithm in Fig. 2, where Fig. 2a shows the process of database scans from the perspective of the support distribution plot, and Fig. 2b shows the perspective of the candidates generated in the available memory. Assuming supmin(Tk) can be known in advance, we can initially obtain the set of 1-items whose supports exceed supmin(Tk) after the first database scan. Suppose that Li denotes the set of i-itemsets whose supports exceed supmin(Tk), and |Li| denotes the number of itemsets in Li. We then select the n1 most frequent items of L1, i.e., {X1,1, X1,2, . . . , X1,n1}, to generate their candidate 2-itemsets in the second database scan, where

n1 = max { n | C̄1,1(n) ≤ Mc }.

For example, n1 = 775 if Mc is specified as 300,000 [∵ C̄1,1(775) = 299,925]. Therefore, only the candidate 2-itemsets from the 775 most frequent 1-items will be generated in memory and tested in the second database scan. Formally, Lemma 1 tells us that all 2-itemsets whose supports exceed sup(X1,775) will be retrieved:

Lemma 1 Given the n most frequent i-itemsets {Xi,1, Xi,2, . . . , Xi,n}, the set of (i + j)-itemsets whose supports exceed sup(Xi,n) will be included in the candidates generated from {Xi,1, Xi,2, . . . , Xi,n}, for j ≥ 1.

Note that Lemma 1 is a direct result of the downward closure property. In this case, all 2-itemsets whose supports exceed sup(X1,775) will be retrieved after the second database scan. Afterward, if C̄1,1(|L1|) − C̄1,1(n1) < Mc, the remaining candidate 2-itemsets will be generated-and-tested in the third database scan. To better utilize the available memory, partial candidate 3-itemsets from the n2 most frequent 2-itemsets will also be generated and tested in the third database scan, where

n2 = max { n | C̄2,1(n) ≤ 2Mc − C̄1,1(|L1|) }.


Fig. 2 The illustration of mining frequent itemsets under the memory constraint: a the process of database scans from the perspective of the support distribution plot (scans 1-4 sweep the support axis down toward supmin(Tk)); b the process of database scans from the perspective of the candidate generation (the first scan counts 1-items; later scans test candidate 2-itemsets, then the remaining candidate 2-itemsets together with candidate 3-itemsets, and so on)

For example, suppose |L1| = 1,000. We will generate and test C̄1,1(1000) − C̄1,1(775) = 199,575 candidate 2-itemsets, and at most 2 × 300,000 − 499,500 = 100,500 candidate 3-itemsets in the third database scan, where the candidate 3-itemsets are generated from the 3,629 most frequent 2-itemsets, since n2 = 3,629 [∵ C̄2,1(3629) = 100,481 and C̄2,1(3630) = 100,540]. Accordingly, we retrieve all 3-itemsets whose supports exceed sup(X2,3629) after the third scan.
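These figures can be reproduced with a small sketch (Python; `cbar_2_1` is a hypothetical helper hard-coding C̄2,1 of Theorem 1 for the case j = 2, i = 1):

```python
from math import comb

def cbar_2_1(N):
    # C̄2,1(N): tight upper bound of candidate 3-itemsets generated
    # from N 2-itemsets. 2-canonical rep: N = C(m, 2) + C(m1, 1), so
    # the bound is C(m, 3) + C(m1, 2) (the last term only if m1 >= 2).
    m = 2
    while comb(m + 1, 2) <= N:
        m += 1
    m1 = N - comb(m, 2)
    return comb(m, 3) + (comb(m1, 2) if m1 >= 2 else 0)

Mc, L1 = 300_000, 1000
budget = 2 * Mc - comb(L1, 2)   # 100,500 candidates left for 3-itemsets
n2 = 2
while cbar_2_1(n2 + 1) <= budget:
    n2 += 1
print(n2, cbar_2_1(n2), cbar_2_1(n2 + 1))   # 3629 100481 100540
```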

Explicitly, at most Mc candidate itemsets, possibly including candidate itemsets of various lengths, will be generated-and-tested in one database scan until no further candidates can be generated. Following the procedure of traditional level-wise search algorithms, except for the strategy of the candidate generation, we will finally retrieve the top-k frequent itemsets. In case Mc is large enough, we may directly generate candidate i-itemsets from Li−2 or Li−3, as long as the candidate count is below Mc [Theorem 1 is able to determine the tight upper bound of candidate (i + j)-itemsets generated from frequent i-itemsets, where j > 1]. This can be achieved by a technique similar to the scan-reduction technique discussed in [5]. □

Note that the feasibility of the Naive algorithm relies on a process to avoid the generation of duplicate candidates. To realize this, first recall the standard candidate generation procedure addressed in [16]:

C̄i = { X ∪ X′ | X, X′ ∈ Li−1, |X ∩ X′| = i − 2 },
Ci = { X ∈ C̄i | X contains i members of Li−1 }.

In light of the property that the Naive algorithm always tests the set of candidates by joining higher-frequency itemsets, we can effectively avoid candidate regeneration by rewriting the two-step candidate generation procedure, as shown in Lemma 2 below:

Lemma 2 Suppose that, following the procedure of the Naive algorithm, we have identified all frequent i-itemsets with supports exceeding sup(Xi−1,n1) after the wth database scan. Let Fi−1,n1 = {Xi−1,1, Xi−1,2, . . . , Xi−1,n1}, and Fi−1,n2 = {Fi−1,n1, F}, where

F = {Xi−1,n1+1, Xi−1,n1+2, . . . , Xi−1,n2}

and n2 ≥ n1. While we expect to identify all i-itemsets with supports exceeding sup(Xi−1,n2) in the next scan, the set of candidate i-itemsets Ci that will be generated by the Naive algorithm in the (w + 1)th scan is:

C′i = { X ∪ X′ | X ∈ Fi−1,n2, X′ ∈ F, |X ∩ X′| = i − 2 }.
Ci = { X ∈ C′i | X contains i members of Fi−1,n2 }.

As such, the Naive algorithm can effectively test necessary candidates without regenerating candidates which have been generated before.

Rationale: Note that the Naive algorithm has the property that it always generates-and-tests the set of candidates from higher-frequency itemsets. According to Lemma 1, the set of candidate i-itemsets generated from Fi−1,n1 need not be examined again because they have been generated in previous scans. Following the first step of the candidate generation in [16], the superset of candidate i-itemsets generated from Fi−1,n2 is

{ X ∪ X′ | X, X′ ∈ Fi−1,n2, |X ∩ X′| = i − 2 },

which is equivalent to

{ X ∪ X′ | X ∈ Fi−1,n2, X′ ∈ F, |X ∩ X′| = i − 2 } ∪ { X ∪ X′ | X, X′ ∈ Fi−1,n1, |X ∩ X′| = i − 2 }.

Since all validated candidates in

{ X ∪ X′ | X, X′ ∈ Fi−1,n1, |X ∩ X′| = i − 2 }

have been generated before, only the set of itemsets in

{ X ∪ X′ | X ∈ Fi−1,n2, X′ ∈ F, |X ∩ X′| = i − 2 },

i.e., C′i, needs to be examined to generate validated candidates in the (w + 1)th scan. Based on the foregoing, the candidate generation procedure in Lemma 2 can effectively avoid regenerating candidates which have been tested before. □

Lemma 2 can be clearly illustrated by the following example. Suppose that we have tested all candidate 3-itemsets from Fi−1,n1 = {{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 5}} in previous scans, i.e., candidate 3-itemsets {1, 2, 3} and {1, 2, 4} have been tested before. While we expect to test all candidates from Fi−1,n2 = {Fi−1,n1, {3, 4}, {1, 5}}, we have

C′3 = { X ∪ X′ | X ∈ Fi−1,n2, X′ ∈ {{3, 4}, {1, 5}}, |X ∩ X′| = i − 2 } = {{1, 3, 4}, {2, 3, 4}, {3, 4, 5}},

and C3 = {{1, 3, 4}, {2, 3, 4}}. Finally, we generate the necessary candidates without regenerating those candidate 3-itemsets which have been tested before, i.e., {1, 2, 3} and {1, 2, 4}. Since Fi−1,n2 and F can be identified without extra overhead, we can effectively avoid the duplicate candidate generation in the Naive algorithm.
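The restricted join of Lemma 2 can be sketched as follows: one side of the join ranges only over the newly admitted itemsets F, so candidates built purely from the already-tested Fi−1,n1 are never regenerated. The names are ours, and the code follows the lemma's set-builder form literally.

```python
# A sketch of Lemma 2's duplicate-avoiding candidate generation.
from itertools import combinations

def gen_new_candidates(F_all, F_new, i):
    """Join F_all with F_new only, then prune against F_all (Lemma 2)."""
    joined = {x | y for x in F_all for y in F_new
              if x != y and len(x & y) == i - 2}
    return {c for c in joined
            if all(frozenset(s) in F_all for s in combinations(c, i - 1))}

# The paper's example: F_{i-1,n1} already tested, {3,4} and {1,5} newly added.
F_n1 = {frozenset(s) for s in [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 5)]}
F_new = {frozenset(s) for s in [(3, 4), (1, 5)]}
C3 = gen_new_candidates(F_n1 | F_new, F_new, 3)

# The previously tested candidates {1,2,3} and {1,2,4} are not regenerated,
# while {1,3,4} and {2,3,4} from the example are produced.
assert frozenset({1, 2, 3}) not in C3 and frozenset({1, 2, 4}) not in C3
assert frozenset({1, 3, 4}) in C3 and frozenset({2, 3, 4}) in C3
```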

For the interest of space, we omit the formal presentation of this naive approach because it is merely devised for comparison purposes. It is worth mentioning that the Naive algorithm conveys the important concept that the problem of mining top-k frequent itemsets can be equivalently viewed as a jigsaw puzzle-like problem as follows:

Remark 2 Consider the view of the support distribution plot such as Fig. 2a. Imagine that, following the procedure of the Naive algorithm, we can fill up a right-most region of the support distribution plot, which consists of identified frequent itemsets, after each database scan.6 For example, the four database scans in Fig. 2a correspond to four right-most regions in the shadowed region. From this point of view, the problem of efficiently mining top-k frequent itemsets can be translated into the problem: "how can the area with support exceeding supmin(Tk) in the support distribution plot be separated into disjoint regions such that the number of regions is as small as possible?"

Clearly, Remark 2 gives an important perspective for analyzing the problem of mining top-k frequent itemsets. This will be used for further devising the efficient solution to mine top-k frequent itemsets. Note that the Naive approach must be executed under the assumption that supmin(Tk) is known prior to the mining process. However, it is difficult, even impossible, to know supmin(Tk) in advance. We thus still need to devise an efficient mining solution that achieves the same goal without such an assumption. Before presenting the details of the feasible approach, we give Remark 3 below to demonstrate that the Naive approach is one of the most efficient level-wise search algorithms to retrieve top-k frequent itemsets under the memory constraint, if supmin(Tk) can be known in advance.

6 In practice, some itemsets are generated-and-tested in previous database scans since itemsets with supports larger than supmin(Tk) will be maintained after each database scan. For simplicity, we convey the concept without considering such a slight difference; it is a matter of the implementation.

Remark 3 Let Cm denote the set of candidate m-itemsets generated from the set of (m − 1)-itemsets whose supports exceed supmin(Tk). Without loss of generality, Cm will be the set of candidate m-itemsets which should be generated-and-tested to obtain the set of m-itemsets belonging to the top-k frequent itemsets by applying level-wise search approaches. Therefore, while also utilizing several advanced pruning techniques of level-wise algorithms, such as the hash-pruning technique [5], the Naive algorithm will retrieve the top-k frequent itemsets with the minimum number of database scans in the presence of the memory constraint.

In Sect.5, the Naive algorithm will be used to evaluate the efficiency of solutions to mine top-k frequent/closed itemsets for comparison purposes.

3 Memory-constraint top-k frequent-pattern mining

3.1 Principles to search top-k frequent patterns

We present the idea of efficiently retrieving top-k frequent itemsets under the memory constraint without the assumption that supmin(Tk) is known in advance. First, necessary properties of top-k frequent/closed itemsets are presented.

Lemma 3 The mth most frequent j-itemset, Xj,m, is included in the candidates generated from the set of i-itemsets whose supports are larger than or equal to sup(Xj,m), where j > i.

In essence, Lemma 3 is a direct result of the downward closure property. According to Lemma 3, we also have Lemmas 4, 5 and 6.

Lemma 4 The set of (j + i)-itemsets belonging to the top-k frequent itemsets will be a subset of the candidates generated from the set of j-itemsets belonging to the top-k frequent itemsets, where i ≥ 1.

Rationale: Suppose that Cj+i denotes the set of candidate (j + i)-itemsets generated from the set of frequent (j + i − 1)-itemsets, denoted by Lj+i−1, whose supports all exceed supmin(Tk) in this case. That is, Cj+i = Lj+i−1 ∗ Lj+i−1. Let C1j+i be the set of candidate (j + i)-itemsets directly generated from Cj+i−1, i.e., C1j+i = Cj+i−1 ∗ Cj+i−1. Since Cj+i−1 ⊇ Lj+i−1, we have Cj+i−1 ∗ Cj+i−1 ⊇ Lj+i−1 ∗ Lj+i−1, showing that C1j+i ⊇ Cj+i ⊇ Lj+i. Recursively, we have Cij+i ⊇ · · · ⊇ C2j+i ⊇ C1j+i ⊇ Cj+i, where Cij+i is the set of candidate (j + i)-itemsets directly generated from frequent j-itemsets, thus indicating Cij+i ⊇ Lj+i and leading to Lemma 4. Note that the power of this property has been fully utilized in the scan-reduction technique to reduce the number of database scans [2,20]. □

Lemma 5 Suppose that we have made sure that all i-itemsets whose supports exceed sup(Xi,vi) have been discovered. This implies that we have already tested all candidate i-itemsets generated from the (i − m)-itemsets whose supports exceed sup(Xi,vi), where m ≥ 1.

Rationale: According to Lemma 1, all candidate i-itemsets generated from (i − 1)-itemsets with supports exceeding sup(Xi,vi) must be tested (or be pruned) if we want to obtain the set of frequent i-itemsets whose supports exceed sup(Xi,vi). In addition, the set of candidate i-itemsets generated from the set of (i − 1)-itemsets whose supports exceed sup(Xi,vi) is a subset of the candidate i-itemsets directly generated from frequent (i − m)-itemsets whose supports exceed sup(Xi,vi), for m > 1. Hence, it is clear that if we have obtained all i-itemsets whose supports exceed sup(Xi,vi), all candidate i-itemsets generated from (i − 1)-itemsets whose supports exceed sup(Xi,vi) have also been tested, even though in practice candidate i-itemsets are directly generated from frequent (i − m)-itemsets for m > 1. Recursively, we can derive that all candidate i-itemsets generated from (i − m)-itemsets whose supports exceed sup(Xi,vi) have been tested. □

Lemma 6 Suppose that after the wth database scan, we have retrieved a set of itemsets Rw = {X1,1, X1,2, . . . , X1,m1, X2,1, . . . , X2,m2, . . . , Xt,mt, . . .}, where t ≥ 1 and Xi,1, . . . , Xi,mi denotes the set of all i-itemsets whose supports exceed sup(Xi,mi). Let supk(w) be equal to the support of the kth most frequent itemset in Rw (if |Rw| < k, supk(w) = 0). Accordingly, we have supk(w) ≤ supk(z) for z > w, and supk(w) ≤ supmin(Tk).

Rationale: Lemma 6 can be proved by contradiction. Suppose that supk(w) > supmin(Tk). Note that there are k itemsets whose supports exceed supmin(Tk). Thus we would have fewer than k itemsets whose supports exceed supk(w), which conflicts with the definition of supk(w). As such, supk(w) will be smaller than or equal to supmin(Tk). In addition, since Rw ⊆ Rz for z > w, it is clear that supk(w) ≤ supk(z). □

In light of Lemma 6, one may obtain the top-k frequent itemsets by initially setting the minimum support equal to zero, and then raising the minimum support to supk(w) after the wth database scan. Therefore, without the assumption that supmin(Tk) is known in advance, the problem of efficiently mining top-k frequent itemsets can be viewed as the problem of finding supk(w) = supmin(Tk), where the number of database scans, w, is as small as possible.
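The progressive threshold supk(w) of Lemma 6 can be maintained incrementally by keeping the k largest supports seen so far in a min-heap, whose root is the current supk(w). This is a minimal sketch under our own data layout, not the paper's implementation; it illustrates the monotonicity supk(w) ≤ supk(z) for z > w.

```python
# Maintain sup_k(w): the support of the k-th most frequent itemset seen so far.
import heapq

def make_topk(k):
    return {"k": k, "heap": []}          # min-heap of the k largest supports

def observe(state, support):
    """Record one discovered itemset's support; return the current sup_k(w)."""
    h, k = state["heap"], state["k"]
    if len(h) < k:
        heapq.heappush(h, support)
    elif support > h[0]:
        heapq.heapreplace(h, support)    # evict the smallest of the k kept
    return h[0] if len(h) == k else 0    # sup_k(w) = 0 while |R_w| < k

state = make_topk(3)
sups = [0.10, 0.40, 0.25, 0.30, 0.05]
thresholds = [observe(state, s) for s in sups]
print(thresholds)  # [0, 0, 0.1, 0.25, 0.25] -- non-decreasing, per Lemma 6
```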


Fig. 3 The illustration of the horizontal first search approach

As a consequence, analogous to the process in the Naive algorithm, we can initially discover the set of high-support itemsets in each level, and then progressively discover itemsets with relatively small supports by generating and testing candidates, except those candidates tested in previous scans. In other words, we prioritize filling up the right part of the support distribution plot (recall that Remark 2 states that mining top-k frequent itemsets can be viewed as a jigsaw puzzle-like problem). As such, the two alternative approaches below are devised to fill up the right part of the support distribution plot, i.e., to retrieve the top-k frequent itemsets, without assuming that supmin(Tk) is known in advance.

Horizontal first search approach: The first approach is called the horizontal first search approach, whose perspective on the support distribution plot is shown in Fig. 3. The basic concept is that, in the wth database scan, we prioritize discovering itemsets in the sibling level whose supports exceed supk(w − 1). Specifically, suppose that we have tested candidate (i + 1)-itemsets generated from i-itemsets whose supports exceed s1 after the wth database scan. In the (w + 1)th database scan, we will generate all candidate (i + 1)-itemsets from i-itemsets whose supports exceed s2, excluding those candidates generated in previous scans, where supk(w − 1) ≤ s2 < s1. Surely, the memory space must be guaranteed to ensure that the generated candidates can be maintained in memory. While we still have remaining memory, we can concurrently generate partial candidate (i + 2)-itemsets from the identified most frequent (i + 1)-itemsets.

Note that the horizontal first search approach can fully utilize the merit of level-wise algorithms to effectively prune unnecessary candidates. For example, a candidate 3-itemset {A, B, C} need not be generated in the wth database scan if the support of one of {A, B}, {A, C}, {B, C} is smaller than supk(w − 1). However, one drawback of this approach is that we may identify many itemsets which are not included in the top-k frequent itemsets. The reason is that many top-k itemsets have long itemset-lengths, and thus supk(w) cannot effectively and quickly approach supmin(Tk) when w is small. As shown in Fig. 3, it is clear that many itemsets with supports smaller than supmin(Tk) will also be discovered in the wth database scan when supk(w − 1) is not close to supmin(Tk). Generating those unnecessary itemsets will lead to extra database scans, thus degrading the execution efficiency. □

Fig. 4 The illustration of the vertical first search approach

Vertical first search approach: The second approach is called the vertical first search approach, whose perspective on the support distribution plot is shown in Fig. 4. The basic concept is that, in the wth database scan, we prioritize discovering longer itemsets whose supports exceed supk(w − 1). This can be achieved by directly generating candidate i-itemsets from frequent j-itemsets, where j may be smaller than i − 1. For example, let

n1 = max{ n | C1,1(n) + C1,2(n) ≤ Mc }.

We may select the n1 most frequent 1-items, i.e., X1,1, X1,2, . . . , X1,n1, and then test whether there are i-itemsets, 2 ≤ i ≤ 3, whose supports exceed sup(X1,n1). In practice, the vertical first search approach is beneficial for efficiently obtaining the top-k frequent itemsets when k is small [or, supmin(Tk) is high]. Moreover, another merit of the vertical first search approach is that the identified itemsets will mostly belong to the top-k frequent itemsets, as compared to the case in the horizontal first search approach. However, the drawback of this approach is that a lot of candidates, which indeed could be pruned by the level-wise search, will also be generated-and-tested. Thus, when k is large, this approach will suffer from an extremely large number of database scans. □

3.2 The δ-stair search

Apparently, it is still required to devise a solution that more effectively makes supk(w) approach supmin(Tk) while also fully utilizing the merit of the level-wise candidate pruning. In other words, we try to integrate the merits of the horizontal first search approach and the vertical first search approach while diminishing the side-effects of these two approaches. To achieve this, we propose a novel search approach, called the δ-stair search, in this paper. The basic concept behind the δ-stair search is to equally share the Mc candidates among δ different itemset lengths and then to gradually search frequent itemsets upward or downward. Specifically, the δ-stair search consists of two distinct steps, namely the upward δ-stair search step and the downward δ-stair search step. We formally describe the principles of the upward δ-stair search step and the downward δ-stair search step, respectively.

Upward δ-stair search step: Suppose that in the wth database scan, we have concurrently generated-and-tested candidate itemsets whose lengths are between u and u + δ − 1. In the (w + 1)th database scan, the upward δ-stair search will concurrently generate-and-test candidate itemsets whose lengths are between u + 1 and u + δ. Furthermore, assume that before the (w + 1)th database scan, we have discovered the set of itemsets Rw = {X1,1, X1,2, . . . , X1,v1, X2,1, . . . , X2,v2, . . . , Xt,vt, . . .} in which each itemset has a support exceeding supk(w), where {Xi,1, . . . , Xi,vi} denotes the set of i-itemsets whose supports exceed sup(Xi,vi). In the (w + 1)th database scan, we examine candidate (j + 1)-itemsets generated from {Xj,1, Xj,2, . . . , Xj,nj}, excluding candidate (j + 1)-itemsets from {Xj,1, Xj,2, . . . , Xj,mj}, where u ≤ j ≤ u + δ − 1, and

nj = max{ n | Cj,1(n) − γj+1 ≤ Mc/δ, sup(Xj,n) ≥ supk(w) },
mj = min{ n | sup(Xj,n) ≥ sup(Xj+1,vj+1) }.

Here γj+1 denotes the number of candidate (j + 1)-itemsets which have been tested before the (w + 1)th database scan. As such, we will retrieve all (j + 1)-itemsets whose supports exceed sup(Xj,nj) after the (w + 1)th database scan, where u ≤ j ≤ u + δ − 1.

Example 2.3 Consider the illustration shown in Fig. 5a. In this case, δ is set to 2, meaning that candidate itemsets of two levels will be concurrently generated-and-tested in each database scan. Specifically, we will obtain the set of 1-items after the first database scan (if the count of distinct 1-items exceeds k, we only select the k most frequent 1-items). Afterward, we generate Mc/2 candidate 2-itemsets from the set of the h1 most frequent 1-items and generate Mc/2 candidate 3-itemsets directly from the set of the h1 most frequent 1-items, where C1,1(h1) ≤ Mc/2 and C1,2(h1) ≤ Mc/2, respectively. Therefore, we will obtain 2-itemsets whose supports exceed sup(X1,h1) and

Fig. 5 The illustration of the δ-stair search with δ = 2: a upward search step, b downward search step

3-itemsets whose supports exceed sup(X1,h1) after the second database scan. For example, suppose that Mc = 300,000. We select the 548 most frequent 1-items to generate candidate 2-itemsets [h1 = 548, because C1,1(548) = 149,878] and select the 97 most frequent 1-items to directly generate candidate 3-itemsets [h1 = 97, because C1,2(97) = 147,440]. The first and the second database scans are referred to as the initial step in this paper.

Note that after the second database scan, we will retrieve all 2-itemsets, denoted by {X2,1, . . . , X2,v2}, whose supports exceed sup(X1,h1), and retrieve the set of 3-itemsets, denoted by {X3,1, . . . , X3,v3}, whose supports exceed sup(X1,h1). In addition, let X2,m2 be the 2-itemset such that all 3-itemsets with supports exceeding sup(X2,m2) have been discovered in the second database scan, i.e., m2 = max{ n | sup(X2,n) ≤ sup(X3,v3) }. In the third database scan, we will upwardly search the top-k frequent itemsets. Note that, according to Lemma 4, we have already tested all candidate 3-itemsets generated from {X2,1, . . . , X2,m2} in the second database scan. As such, in the third database scan, Mc/2 candidate 3-itemsets are generated from the set {X2,1, X2,2, . . . , X2,n2}, excluding candidate 3-itemsets from {X2,1, X2,2, . . . , X2,m2}, where n2 = max{ n | C2,1(n) ≤ Mc/2 }. Moreover, since we did not test any 4-itemsets in previous database scans, in the third database scan we will also generate and test Mc/2 candidate 4-itemsets, which are concurrently generated from the set {X3,1, X3,2, . . . , X3,n3}, where n3 = max{ n | C3,1(n) ≤ Mc/2 }. Finally, the third database scan is executed to test these generated candidates.

The execution of the third database scan is called an upward δ-stair search step. Same as the execution of the third database scan, the execution of the fourth database scan is also an upward δ-stair search step, as illustrated in Fig. 5a. □
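The budget split in Example 2.3 can be verified numerically, assuming that C1,i(n), the number of candidate (1 + i)-itemsets built directly from n frequent 1-items, equals C(n, 1 + i); this matches both figures quoted in the example. Variable names are ours.

```python
# Check Example 2.3: with Mc = 300,000, each of the two levels gets Mc/2.
from math import comb

Mc = 300_000
half = Mc // 2

# Largest h1 whose candidate 2-itemsets C(h1, 2) fit the Mc/2 budget.
h1_pairs = max(n for n in range(2, 2000) if comb(n, 2) <= half)
# Largest h1 whose direct candidate 3-itemsets C(h1, 3) fit the Mc/2 budget.
h1_triples = max(n for n in range(3, 2000) if comb(n, 3) <= half)

print(h1_pairs, comb(h1_pairs, 2))      # 548 1-items -> 149,878 candidates
print(h1_triples, comb(h1_triples, 3))  # 97 1-items  -> 147,440 candidates
```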

Downward δ-stair search step: Suppose that in the wth database scan, we have concurrently generated-and-tested candidate itemsets whose lengths are between u and u + δ − 1. The downward search will generate candidates corresponding to the two different cases below:

(a) If the wth database scan is an upward δ-stair search, in the (w + 1)th database scan, the downward δ-stair search will concurrently generate-and-test candidate itemsets whose lengths are between u and u + δ − 1.
(b) If the wth database scan is a downward δ-stair search, in the (w + 1)th database scan, the downward δ-stair search will concurrently generate-and-test candidate itemsets whose lengths are between u − 1 and u + δ − 2.

Furthermore, assume that before the (w + 1)th database scan, we have discovered the set of itemsets Rw = {X1,1, X1,2, . . . , X1,v1, X2,1, . . . , X2,v2, . . . , Xt,vt, . . .} in which each itemset has a support exceeding supk(w), where {Xi,1, . . . , Xi,vi} denotes the set of i-itemsets whose supports exceed sup(Xi,vi). In the (w + 1)th database scan, the downward δ-stair search step will generate candidate (j + 1)-itemsets from {Xj,1, Xj,2, . . . , Xj,nj}, excluding candidate (j + 1)-itemsets from {Xj,1, Xj,2, . . . , Xj,mj}, where j lies in the range corresponding to case (a) or case (b), and

nj = max{ n | Cj,1(n) − γj+1 ≤ Mc/δ, sup(Xj,n) ≥ supk(w) },
mj = min{ n | sup(Xj,n) ≥ sup(Xj+1,vj+1) }.

Here γj+1 denotes the number of candidate (j + 1)-itemsets which have been tested before the (w + 1)th database scan. As such, after the (w + 1)th database scan, we will retrieve all (j + 1)-itemsets whose supports exceed sup(Xj,nj) with the itemset-lengths corresponding to case (a) or case (b) above.

Example 2.4 Consider the illustration shown in Fig. 5b. In this example, candidate 5-itemsets generated from {X4,1, X4,2, . . . , X4,m4} have been tested after the fourth database scan and no 5-itemsets whose supports exceed sup(X4,m4) were found. As such, the upward search will fail to find any 6-itemsets belonging to the top-k frequent itemsets; thus the fifth database scan is turned into the downward δ-stair search step.

In the fifth database scan, Mc/2 candidate 5-itemsets, excluding candidates generated from {X4,1, X4,2, . . . , X4,m4}, are generated from the set {X4,1, X4,2, . . . , X4,n4}, where

n4 = max{ n | C4,1(n) − γ5 ≤ Mc/2 },

and γ5 denotes the number of candidate 5-itemsets which have been tested in previous database scans. Moreover, assume that candidate 4-itemsets generated from {X3,1, X3,2, . . . , X3,m3} have been tested in previous database scans. We also generate Mc/2 candidate 4-itemsets from {X3,1, X3,2, . . . , X3,n3}, excluding candidate 4-itemsets generated from {X3,1, X3,2, . . . , X3,m3}, where

n3 = max{ n | C3,1(n) − γ4 ≤ Mc/2 },

and γ4 denotes the number of candidate 4-itemsets which have been tested in previous scans.

Suppose that after the fifth database scan, all 1-items, 2-itemsets, 3-itemsets and 4-itemsets whose supports exceed supk(5) are found. In other words, we do not need to "downward" search itemsets which will not belong to the top-k frequent itemsets. As such, the sixth database scan returns to being an upward δ-stair search step. Finally, the process of mining the top-k frequent itemsets ends after the sixth database scan since no 5-itemsets and 6-itemsets whose supports exceed supk(6) were found. Accordingly, the top-k frequent itemsets, i.e., those itemsets whose supports exceed supk(6), are discovered. In this case, supmin(Tk) is equal to supk(6). □

We then formally describe when to shift from the upward δ-stair search step to the downward δ-stair search step, and vice versa.

The upward search step → the downward search step: Suppose that in the wth database scan, which is an upward δ-stair search step, we have generated-and-tested candidate itemsets whose lengths are between i − δ + 1 and i. Moreover, assume that after the wth database scan, candidate i-itemsets generated from {Xi−1,1, Xi−1,2, . . . , Xi−1,vi−1} have been generated and tested, and no i-itemsets with supports larger than sup(Xi−1,vi−1) were found. In such cases, the (w + 1)th database scan will turn into a downward δ-stair search step since there will be no (i + 1)-itemsets whose supports exceed sup(Xi−1,vi−1) (see the example of the fourth scan to the fifth scan in Fig. 5). □

The downward search step → the upward search step: Suppose that in the wth database scan, which is a downward δ-stair search step, we have generated-and-tested itemsets whose lengths are between i and i + δ − 1. Moreover, assume that after the wth database scan, all candidate i-itemsets generated from {Xi−1,1, Xi−1,2, . . . , Xi−1,vi−1} have been tested, where vi−1 = max{ n | sup(Xi−1,n) ≤ supk(w) }. In such cases, the (w + 1)th database scan will turn into an upward δ-stair search step, because no (i − 1)-itemsets with supports below supk(w) will belong to the top-k frequent itemsets (see the example of the fifth scan to the sixth scan in Fig. 5b). □

Accordingly, the upward δ-stair search step and the downward δ-stair search step will be adaptively switched until all top-k frequent itemsets are found. It is worth mentioning that the δ-stair search approach will efficiently retrieve the top-k frequent itemsets with a number of database scans close to the optimal one, which is required by the Naive algorithm described in Sect. 2 [under the premise that supmin(Tk) is known in advance]. This is attributed to the fact that the δ-stair search has both advantages of the horizontal first search approach and the vertical first search approach while diminishing their side-effects.
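The two switching rules above can be condensed into a small decision function. The following is our own abstraction of the criteria in Sect. 3.2, not the paper's pseudo-code: go downward when the top level of the current stair yields no new frequent itemsets (so searching upward would be futile), and return upward once all lower levels are complete with respect to supk(w).

```python
# A hedged sketch of the delta-stair direction-switching rule (names ours).
def next_direction(direction, top_level_found_new, lower_levels_complete):
    """Decide the search direction of the next database scan."""
    if direction == "up" and not top_level_found_new:
        # No i-itemsets found above the threshold -> no (i+1)-itemsets exist.
        return "down"
    if direction == "down" and lower_levels_complete:
        # All remaining low-support itemsets above sup_k(w) were recovered.
        return "up"
    return direction

# The fourth -> fifth scan of Example 2.4 (upward search found nothing new):
print(next_direction("up", top_level_found_new=False,
                     lower_levels_complete=False))   # down
# The fifth -> sixth scan (lower levels complete w.r.t. sup_k(5)):
print(next_direction("down", top_level_found_new=True,
                     lower_levels_complete=True))    # up
```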

4 Algorithm MTK

To efficiently retrieve top-k frequent patterns, we introduce in this section the algorithm MTK (standing for Memory-constraint Top-K mining), which realizes the concept of the δ-stair search. In Sect. 4.1, we give the implementation details of MTK. The extension of MTK to mine top-k closed itemsets will be discussed in Sect. 4.2. Section 4.3 gives illustrative examples.

4.1 Implementation of MTK

Before presenting the details of the implementation, we give the overview of MTK in Fig. 6. In essence, MTK applies the hash pruning technique from algorithm DHP [19], which can effectively reduce unnecessary candidates by utilizing a hash table structure. Specifically, when scanning the database to obtain all 1-items (or the k most frequent 1-items), we also examine all 2-itemsets of each transaction and hash them into the different buckets of the hash table H2, i.e., increasing the corresponding bucket count. The hash table H2 can be further used to reduce the number of candidate 2-itemsets which will be examined in the following database scans if necessary. It is worth mentioning that, as demonstrated in [19], the hash pruning technique is powerful in early stages, particularly when pruning candidate 2-itemsets. Thus, considering both the pruning effect and the other incurred overhead, hash pruning will only be utilized in generating candidate 2-itemsets.

In the second database scan, Mc/δ candidate i-itemsets, 2 ≤ i ≤ 2 + δ − 1, will be directly generated from 1-items. For the case of candidate 2-itemsets, we only need to generate candidate 2-itemsets which belong to a bucket of H2 whose bucket count exceeds supk(1) (interested readers can refer to [19] for the details). After the second database scan, we filter out

Fig. 6 The flowchart of MTK

Fig. 7 The two-level sorted array to maintain top-k frequent itemsets [key 1: support; key 2: itemset length]

itemsets which were discovered in the first and the second database scans and whose supports are smaller than supk(2). Afterward, the third database scan will be executed as either an upward δ-stair search step or a downward δ-stair search step. Then, the upward δ-stair search step and the downward δ-stair search step will be adaptively switched according to the criterion described in Sect. 3.2, until all top-k frequent itemsets are found. In case we need to examine candidate 2-itemsets, H2 will be utilized again to prune the set of candidates.

The detailed pseudo-codes are outlined below. Explicitly, in addition to the hash table H2 for pruning candidate 2-itemsets, several important global variables will also be maintained, including (1) Tk: the repository to maintain top-k frequent itemsets; (2) sizei: the pre-estimated memory overhead to store a candidate i-itemset; (3) TB[x][y][z]: a pre-computed sorted array, in which each array unit consists of two variables, namely TB[x][y][z].bound_cand and TB[x][y][z].itemset_num, to indicate the upper number of candidate (x + y)-itemsets generated from x-itemsets, where

TB[x][y][z].bound_cand = Cx,y(TB[x][y][z].itemset_num).

Specifically, as shown in Fig. 7, Tk is a dynamic array structure to maintain top-k frequent itemsets, sorted by two keys: (1) the support of the itemset; (2) the itemset length. Once we identify a new itemset whose support exceeds the up-to-date minimum support threshold s, it will be inserted into Tk, and the itemsets with the smallest support in Tk will be removed if |Tk| > k. In addition, the memory to store a candidate is proportional to the itemset length. As such, sizei can be approximated as the memory of necessary units multiplied by a factor Φ, where Φ, depending on the implementation, represents the other overhead such as the memory to maintain necessary pointers in a hash-tree [2]. Moreover, to search nopt = max{ n | Cx,y(n) ≤ m } for a given upper number of candidates m, the function to obtain Cx,y(n) originally needs to be iteratively executed until nopt is found.
