
Institute of Computer Science and Engineering

Mining Repeating Patterns with Gap Constraint

Student: Shin-Yi Chiu

Advisor: Prof. Jiun-Long Huang


Mining Repeating Patterns with Gap Constraint

Student: Shin-Yi Chiu

Advisor: Jiun-Long Huang

National Chiao Tung University

Institute of Computer Science and Engineering

Master's Thesis

A Thesis

Submitted to the Institute of Computer Science and Engineering, College of Computer Science,

National Chiao Tung University, in Partial Fulfillment of the Requirements

for the Degree of Master

in

Computer Science

January 2009

Hsinchu, Taiwan, Republic of China


Abstract (in Chinese)

Previous research on repeating pattern mining focused mainly on finding frequently recurring substrings in a long string converted from music. For example, when the stock prices of companies A and B rise, the stock price of company C rises four days later. However, the problem proposed by Tung places too many restrictions on finding repeating patterns in a long sequence of sets, so many potentially frequent patterns cannot be found because the restrictions scatter their support. Therefore, this thesis defines a new kind of pattern that allows gaps between two adjacent sets, and proposes an algorithm, G-Apriori, to find the patterns that allow gaps. The G-Apriori algorithm generates candidates and counts their supports by scanning the database. To avoid scanning the database too many times, GwI-Apriori is proposed to solve this problem. In GwI-Apriori, we design an index list, consisting of a start position and a list of end positions, and use it to record the locations of frequent patterns. With these index lists, GwI-Apriori scans the database only once and uses them to compute the supports of longer patterns. In addition, GwI-Apriori includes pruning strategies to speed up support counting. The experiments are evaluated on real data, and the results show that GwI-Apriori outperforms G-Apriori.


Abstract

Previous studies on mining repeating patterns focus on discovering substrings which appear frequently in a long string converted from music. An example of such a repeating pattern is "if the stock prices of companies A and B both go up on day one, the stock price of company C will go up exactly on the fifth day." However, the problem proposed by Tung imposes too many restrictions on mining repeating patterns from a set sequence, so potentially frequent patterns cannot be found because their frequencies are dispersed. Hence, in this paper we define a new pattern, which allows gaps between two adjacent sets, and propose an algorithm, G-Apriori, to discover the repeating patterns with gap constraint from a set sequence. The G-Apriori algorithm generates candidates and counts the frequencies of these candidates by scanning the database. In order to avoid scanning the database many times, the GwI-Apriori algorithm is proposed. In GwI-Apriori, we design an index list, which contains a start position (S_P) and an end position (E_P) list, for recording the positions of the frequent patterns. Besides, GwI-Apriori also adopts pruning strategies to reduce the search space among the index lists. By using the index lists, GwI-Apriori scans the database only once and computes the frequencies of frequent patterns through the index lists. The experimental results show that GwI-Apriori performs much better than G-Apriori.


Acknowledgements

First of all, I am most grateful to my advisor, Prof. Jiun-Long Huang, for his guidance on my research and on thesis writing throughout my master's studies. I also thank the PhD student 邱釷銓 for setting aside time every week to discuss my work and resolve my questions. In addition, I thank my oral defense committee members, Prof. 戴碧如 of the Department of Computer Science and Information Engineering at National Taiwan University of Science and Technology and Dr. 莊坤達 of Synopsys, for the many valuable comments and opinions they provided during the defense, which made this thesis more complete. I also thank the seniors, classmates, and juniors in my laboratory; their help and encouragement during my master's studies let me quickly regain momentum whenever I lost the will to keep going. I am also grateful to my college friends for patiently listening to my complaints during this period. Finally, I sincerely thank my parents and my brothers for their support and encouragement, which allowed me to concentrate on research without worries. I hope to share the joy of completing this thesis with all of you. Shin-Yi, January 2009


Contents

1 Introduction
2 Preliminaries
  2.1 Related Works
    2.1.1 Repeating Pattern
    2.1.2 Inter-transaction Association
  2.2 Definition
3 The Proposed Algorithms
  3.1 G-Apriori Algorithm
  3.2 GwI-Apriori Algorithm
    3.2.1 Pruning Strategies
4 Correctness
5 Experiment
6 Conclusion


List of Figures

1.1 A phrase excerpted from Brahms Waltz in A flat
1.2 Five extracts from Mozart's Piano Sonata K. 311 and a prototypical melody (excerpted from [Self98])
2.1 Correlative Matrix
2.2 Bit Sequence Representation
2.3 Episodes Class
2.4 Data Convert Process
2.5 Comparison Table
2.6 Illustrative Example I
2.7 Illustrative Example II
3.1 Apriori Based Example
3.2 Set Sequence Data
3.3 The example for GwI-Apriori Algorithm
3.4 Flow Chart for Pruning Strategies
3.5 The example for Pruning Technique Part I
3.6 The example for Pruning Technique Part II
5.1 Gap constraint versus Execution time
5.2 Minimum support versus Execution time


List of Tables

5.1 Stock number and name for companies


Chapter 1

Introduction

In our daily life, patterns appear repeatedly, such as in DNA sequences, music and video, the behavior of a person, and the up-and-down relations of stock prices between companies. We show an example in Fig 1.1. If we can find the patterns in these data, we can use them to describe and forecast the future trend or behavior of the data. For example, in music, we can extract frequently appearing segments from the music data and use these segments to represent the music. Therefore we can achieve both the efficiency and the semantic-richness requirements of content-based music data retrieval. Besides, investors may also be interested in the relations of stock prices among companies, such as "when the stock price of company A rises 10 percent and the stock price of company B rises 5 percent, the stock price of company C will fall 4 percent within the following three days." Investors can make profitable investments once they obtain such information.


Figure 1.1: A phrase excerpted from Brahms Waltz in A flat

The first application of discovering patterns that appear repeatedly is in the biological field [2], where the DNA sequence is converted into a string and the goal is to find the substrings that are tandem repeats in the converted string. In the multimedia area, indexing and searching techniques for multimedia data are a main topic; therefore Chen et al. proposed a new problem, repeating pattern mining, to discover repeating music segments. The repeating pattern mining problem first focused on finding exact frequent patterns in music databases, and [9] was the first work to solve the problem. In this work, the music is converted into a sequence of notes, and a data structure called the correlative matrix, together with associated algorithms, is proposed to discover the repeating patterns efficiently. Fig 1.1 shows an example, where the substring "C6-Ab5-Ab5-C6" appears two times in the string "C6-Ab5-Ab5-C6-C6-Ab5-Ab5-C6-D6-C6-B5-C6-A5-A5-E6." However, music segments with minor differences may be regarded as the same segments and could also be important patterns for indexing, as shown in Fig 1.2, so Hsu et al. proposed the approximate repeating pattern mining problem in [10].

Figure 1.2: Five extracts from Mozart's Piano Sonata K. 311 and a prototypical melody (excerpted from [Self98]).

In [10], a match operator is defined to determine whether a pattern matches a music segment; the music is then divided into non-overlapping segments, and the number of segments satisfying the match operator is summed up. Besides, Liu et al. [14] proposed a new definition for approximate repeating patterns in 2005. The method in [14] applies edit distance to find approximate repeating patterns in music data. In the following year, 2006, Koh et al. proposed a new problem of mining top-k fault-tolerant repeating patterns in [13]. That work used bit strings to represent the string data, and applied "AND" and "Shift" operators on these bit strings to obtain the fault-tolerant repeating patterns. Nevertheless, events may happen at the same time, and we may be interested in finding connections among events which happen at different times. For example, investors may want to know whether the pattern "the stock prices of company A and company B both go up 5%, and on the following day the stock price of company C goes up 5%" appears frequently. Hence Tung proposed related work to solve this problem in [20]. That paper views the events appearing at the same time as a set, connects the sets of different appearing times in order, and finds the relations among all sets.

However, in the real world the frequent patterns may be concealed by noise or delay, so the formulation proposed by Tung in [20] imposes too many restrictions on discovering frequent patterns, and potentially frequent patterns may not be found. Therefore, in our study, we loosen the restrictions and treat such patterns as the same by allowing gaps (delay or noise) between two adjacent sets. For example, consider the patterns < {A},{B} >, < {A},∗,{B} > and < {A},∗,∗,{B} >: the frequency of pattern < {A},{B} > counted from these patterns is 3 when the gap is 2, but 2 when the gap is 1. The algorithm G-Apriori is proposed to solve this problem. In this algorithm the candidates are generated based on the Apriori property, and the frequent patterns, whose frequencies are no less than a user-defined threshold, are obtained from these candidates by scanning the database to count frequencies. Nevertheless, the G-Apriori algorithm takes too much time scanning the database for frequency counting. Hence, we refined G-Apriori and propose the GwI-Apriori algorithm to avoid scanning the database many times. The first stage of GwI-Apriori scans the database and records the index positions of each length-1 frequent pattern. For patterns whose length is larger than 1, we only need to use the index positions recorded in the first stage for counting frequencies. Besides, in order to speed up frequency counting, we also design a pruning technique to reduce redundant comparisons among these index positions.

The remainder of this paper is organized as follows. In Section 2, we present preliminaries, including related work and the definition of our new pattern, the repeating pattern with gap constraint. In Section 3, we propose two methods, G-Apriori and GwI-Apriori, to mine repeating patterns with gap constraint efficiently. In Section 4, we prove correctness. Section 5 shows the experimental results. Finally, we conclude in Section 6.


Chapter 2

Preliminaries

2.1 Related Works

In this section, we review the related work on discovering repeating patterns, and we also discuss inter-transaction associations. Section 2.1.1 provides a brief discussion of the existing work on finding repeating patterns, and the following subsection presents algorithms for finding inter-transaction association rules.

2.1.1 Repeating Pattern

In this subsection, we discuss three types of repeating patterns: exact, approximate, and fault-tolerant, in that order.

Exact Repeating Pattern

[9] was the first work to propose the repeating pattern mining problem in the music field. In this paper, the music is converted into a sequence of notes, and a data structure called the correlative matrix, together with associated algorithms, is proposed to efficiently discover repeating patterns, i.e., shorter sequences of notes appearing more than once in a music object. Consider the phrase with 12 notes from Brahms Waltz in A flat. Its corresponding sequence of notes is "C6-Ab5-Ab5-C6-C6-Ab5-Ab5-C6-Db5-C6-Bb5-C6". The correlative matrix of this sequence is shown in Fig 2.1. The function of the correlative matrix is to preserve the intermediate results of substring matching.

The first step for finding the repeating patterns is to initialize the matrix, where $T_{i,j}$ denotes the element in the $i$-th row and $j$-th column of the matrix $T$. The upper-triangle slots of the matrix are filled based on the following principle: for any two notes $S_i$ and $S_j$ ($i \neq j$ and $i, j > 1$) in the music string $S$, if $S_i = S_j$, we set $T_{i,j} = T_{i-1,j-1} + 1$. If the value stored in $T_{i,j}$ is $n$, it indicates a repeating pattern of length $n$ appearing at positions $(j - n + 1)$ to $j$ in $S$. Fig 2.1 shows the result after all notes are processed.

Figure 2.1: Correlative Matrix

After filling up the correlative matrix, the following step is to find all repeating patterns and their repeating frequencies. In this step, a set called the candidate set (denoted CS) records the repeating patterns and their repeating frequencies. Only four cases can be put into the CS: (1) $T_{i,j} = 1$ and $T_{i+1,j+1} = 0$; (2) $T_{i,j} = 1$ and $T_{i+1,j+1} \neq 0$; (3) $T_{i,j} > 1$ and $T_{i+1,j+1} \neq 0$; and (4) $T_{i,j} > 1$ and $T_{i+1,j+1} = 0$. After finding all candidate sets, two extra steps are performed. The first is the pruning step: for any repeating pattern in CS, if it is a substring of another repeating pattern and they have the same frequency, it is pruned from the CS. The second step computes the real repeating frequency of each repeating pattern based on the formula $f = \frac{1 + \sqrt{1 + 8 \times rep\_count}}{2}$.
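To make the matrix-filling step and the frequency formula concrete, here is a minimal Python sketch (ours, not code from [9]); it assumes the notes are given as a list of strings, uses 0-based indexing, and omits the candidate-set extraction and pruning steps.

```python
import math

def correlative_matrix(notes):
    """Fill the upper triangle of T: T[i][j] (i < j) is the length of the
    common substring of `notes` ending at positions i and j."""
    n = len(notes)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if notes[i] == notes[j]:
                # extend the match that ended at (i-1, j-1)
                T[i][j] = (T[i - 1][j - 1] if i > 0 else 0) + 1
    return T

def repeating_frequency(rep_count):
    """Invert rep_count = f*(f-1)/2 to recover the real frequency f."""
    return (1 + math.sqrt(1 + 8 * rep_count)) / 2

notes = "C6 Ab5 Ab5 C6 C6 Ab5 Ab5 C6 Db5 C6 Bb5 C6".split()
T = correlative_matrix(notes)
print(T[3][7])                 # 4: "C6-Ab5-Ab5-C6" repeats, ending at 3 and 7
print(repeating_frequency(1))  # 2.0: one recorded pair means 2 occurrences
```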

Approximate Repeating Pattern

However, another repeating pattern formulation loosens the condition for finding repeating patterns. [10], proposed by Jia-Lien Hsu et al., was the first to solve this problem. In [10], a match operator is defined to determine whether a pattern matches a music segment. The match operator is stated as follows: given $P = (p_1, p_2, \ldots, p_m)$ and $LL = (s_1, s_2, \ldots, s_n)$, where $n > m$, $long\_leng\_match(P, LL) = 1$ if $p_i = s_{b_i}$ for $i = 1, 2, \ldots, m$, where $1 = b_1 < b_2 < \ldots < b_m = n$; otherwise, $long\_leng\_match(P, LL) = 0$. Based on the match operator, Jia-Lien Hsu also proposed a method to compute the repeating frequency of a pattern P. The main idea of frequency counting for a pattern P is to divide the music into non-overlapping music segments and sum up the number of segments satisfying the match operator. In order to find longer patterns, Jia-Lien Hsu uses the pattern join operator in a level-wise approach.

Besides, Jia-Ling Koh et al. [13] also proposed another definition for approximate repeating patterns, which allows insertion/deletion errors to occur.

In [13], Jia-Ling Koh proposed two notions of counting frequency for a pattern: IFT-contain and DFT-contain. Given a data sequence $DSeq = D_1 D_2 \ldots D_n$ and a pattern $P = P_1 P_2 \ldots P_m$, we say that DSeq FT-contains pattern P at position $i$ with $\varepsilon$ insertion errors iff there exists an integer $1 \leq i \leq n$ such that $D_i = P_1$, $D_{(i+m-1)+\varepsilon} = P_m$, and P is a subsequence of $D_i D_{i+1} \ldots D_{(i+m-1)+\varepsilon}$; DSeq is said to IFT-contain pattern P under fault tolerance $\varepsilon_I$ iff DSeq FT-contains P with $\varepsilon$ insertion errors and $\varepsilon \leq \varepsilon_I$. Similarly, DSeq FT-contains pattern P at position $i$ with $\varepsilon$ deletion errors iff there exists an integer $1 \leq i \leq n$ such that $D_i D_{i+1} \ldots D_{(i+m-1)-\varepsilon}$ is a subsequence of P, where $D_i = P_1$; DSeq DFT-contains pattern P under a fault tolerance $\varepsilon_D$ iff DSeq FT-contains P at some position with $\varepsilon$ deletion errors and $\varepsilon \leq \varepsilon_D$. Consequently, the fault-tolerant frequency of a pattern P in DSeq is the number of different positions in DSeq where DSeq IFT/DFT-contains P. An example is provided as follows.

Example Consider DSeq = ABCDCABA, $\varepsilon_I = 2$ and $\varepsilon_D = 3$, and the patterns P1 = ABCC, P2 = BCDC, P3 = ACAB, P4 = AEF, P5 = BCFC. DSeq FT-contains P1 at position 1 with 1 insertion error, and DSeq also FT-contains P2 at position 2 with 0 insertion errors; hence DSeq IFT-contains P1 and P2. However, DSeq does not IFT-contain P3, since its insertion error is larger than $\varepsilon_I$. With regard to P4 and P5, DSeq DFT-contains both, as they satisfy the DFT-contain definition.
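To illustrate the insertion-error case, the following is a small Python sketch based on our reading of the IFT-contain definition; the helper names ft_contains_insertion and ift_contains are our own, not from [13].

```python
def ft_contains_insertion(dseq, p, i, eps):
    """Does dseq FT-contain p at 1-based position i with eps insertion errors?
    Requires D_i = P_1, D_{(i+m-1)+eps} = P_m, and p a subsequence in between."""
    m = len(p)
    end = (i + m - 1) + eps
    if end > len(dseq):
        return False
    window = dseq[i - 1:end]
    if window[0] != p[0] or window[-1] != p[-1]:
        return False
    it = iter(window)
    return all(sym in it for sym in p)  # subsequence test (consumes `it`)

def ift_contains(dseq, p, i, eps_max):
    return any(ft_contains_insertion(dseq, p, i, e) for e in range(eps_max + 1))

dseq = "ABCDCABA"
print(ift_contains(dseq, "ABCC", 1, 2))  # True: 1 insertion error
print(ift_contains(dseq, "BCDC", 2, 2))  # True: 0 insertion errors
print(any(ift_contains(dseq, "ACAB", i, 2) for i in range(1, 9)))  # False
```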

In order to speed up frequency counting, the bit-sequence representation of data items and the shift and AND operations on bit sequences [5] are incorporated into two algorithms, named TFTRP-Mine and RE-TFTRP-Mine, which were proposed in [13]. Fig 2.2 shows the bit sequence of each data item in the data sequence "ABCDABCACDEEABCCDEACD".

In Fig 2.2, the bit sequence of each data item N is denoted $Appear_N$, and its length equals the length of the data sequence. The bits 1 and 0 indicate whether the data item appears at the i-th position of the data sequence. Therefore, we can obtain the frequency of a data item by accumulating all the nonzero bits in its bit sequence.


Sequence Data: ABCDABCACDEEABCCDEACD

Data Item  Bit Sequence
A          100010010000100000100
B          010001000000010000000
C          001000101000001100010
D          000100000100000010001
E          000000000011000001000

Figure 2.2: Bit Sequence Representation
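The bit-sequence representation maps naturally onto machine integers. The sketch below is our own rendering (not code from [13]); it stores each Appear_N as a Python int whose leftmost bit corresponds to position 1 of the data sequence, so counting 1-bits gives the item's frequency.

```python
def bit_sequences(seq):
    """Build Appear_N for every data item N in seq."""
    n = len(seq)
    appear = {}
    for pos, item in enumerate(seq):
        # leftmost character of seq = highest bit of the integer
        appear[item] = appear.get(item, 0) | (1 << (n - 1 - pos))
    return appear, n

appear, n = bit_sequences("ABCDABCACDEEABCCDEACD")
print(format(appear['A'], f'0{n}b'))  # 100010010000100000100
print(bin(appear['A']).count('1'))    # 5 = frequency of data item A
```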

Based on these bit sequences, recursive functions for computing the appearing bit sequences of a pattern are proposed. The functions are shown as follows:

(1) Recursive function for $Appear^+_P(E)$: Suppose a pattern $P = P_1 P_2 \ldots P_m$ is given. Let $P'$ denote $P_1 P_2 \ldots P_{m-1}$ and $X$ denote $P_m$. $Appear^+_P(E)$ is obtained from the following function for $0 \leq E \leq \varepsilon_I$:

If $|P| = 1$, then $Appear^+_P(0) = Appear_P$ and $Appear^+_P(E) = 0$ for all $1 \leq E \leq \varepsilon_I$;
Else if $E = 0$, then $temp1(E) = Appear^+_{P'}(0)$ and $temp2(E) = L\_shift(Appear_X, |P| - 1)$;
Else $temp1(E) = temp1(E-1) \vee Appear^+_{P'}(E)$ and $temp2(E) = L\_shift(temp2(E-1), 1)$;
$Appear^+_P(E) = temp1(E) \wedge temp2(E)$.

(2) Recursive function for $Appear^-_P(E)$: Suppose a pattern $P = P_1 P_2 \ldots P_m$ is given, where each $P_i$ ($i = 1, \ldots, m$) is a data item. Let $Y$ denote $P_1$, $P''$ denote $P_2 P_3 \ldots P_m$, $Q$ denote $P_2 P_3 \ldots P_{m-1}$, and $X$ denote $P_m$. When the deletion fault tolerance $E$ is given, $Appear^-_P(E)$ is obtained from the following recursive function:

If $|P| \leq E + 1$, then $Appear^-_P(E) = Appear_Y$;
Else $temp_{P''}(E-1) = Appear^-_Q(E-1) \vee (Appear^-_Q(E-1) \wedge L\_shift(Appear_X, |P''| - E, 0))$;
$temp_{P''}(E) = temp_{P''}(E-1) \vee (Appear^-_Q(E) \wedge L\_shift(Appear_X, |P''| - E - 1, 0))$;
$Appear^-_P(E) = Appear_Y \wedge L\_shift(temp_{P''}(E), 1, 0)$.

In the TFTRP-Mine algorithm, all fault-tolerant repeating patterns, denoted FT-RPs, are obtained by using the recursive functions defined in the previous paragraph. But since TFTRP-Mine extracts the top-k non-trivial FT-RPs only after all FT-RPs have been found, the RE-TFTRP-Mine algorithm was designed to improve it. In RE-TFTRP-Mine, the FT-RPs which cannot be among the top-k non-trivial FT-RPs are removed in advance by increasing min_freq during the mining process; hence the FT-RPs whose fault-tolerant frequencies are less than min_freq are not employed in the following mining process. Besides, priorities are assigned to the found FT-RPs: the higher the fault-tolerant frequency of a pattern, the higher its priority. The FT-RPs with higher frequencies are then selected to generate the new candidates.


Ning-Han Liu et al. [14] also proposed another method to find approximate repeating patterns. The first step of this method converts the pitch string of music into an interval string, and then divides the interval string into interval segments according to max_len and min_len constraints, which are used to filter out unimportant music patterns. These segments are regarded as candidate ARPs. Then, for each candidate, the edit distance is adopted to measure the similarity between two music segments. Finally, according to the number of similar music segments and how they overlap each other, it is decided whether a candidate qualifies as an ARP. In order to speed up execution, the method also modifies the R*-tree to remove impossible candidates before computing the edit distances.

2.1.2 Inter-transaction Association

There are several kinds of inter-transaction association mining problems, such as sequential pattern mining [4], frequent episode mining [17] [16], periodic pattern mining [7] [22] [24] [6] and frequent continuity mining [21] [20] [11] [12] [15]. We give a brief introduction to sequential pattern, frequent episode, and periodic pattern mining, and a detailed explanation of frequent continuity mining, which most resembles our work.

Sequential Pattern Mining

The sequential pattern mining problem was first introduced in [4] by Agrawal and Srikant. In order to improve on the algorithm proposed in [4], many methods were designed, such as PrefixSpan [19], SPADE [25], SPAM [5], FreeSpan [8], and so on. Since sequential pattern mining may generate many redundant patterns, which decreases both the effectiveness and the efficiency of mining, the closed pattern mining problem gradually received attention. Famous algorithms for it are CloSpan [23] and BIDE [1].

Frequent Episodes

Different from sequential patterns, the data for frequent episodes is a sequence of event sets where the events are sampled regularly. An episode is defined as a collection of events in a user-defined window interval that appear relatively close to each other in a given partial order [17]. In [17], Mannila et al. defined three classes of episodes: serial, parallel, and combinations of serial and parallel. Serial episodes consider the order of patterns in the sequence, while parallel episodes have no constraints on the relative order of event sets. Fig 2.3 shows the three kinds of episodes.

Moreover, Mannila et al. also proposed an approach, WINEPI, for discovering all frequent serial/parallel episodes in [17].

Figure 2.3: Episodes Class

For finding the exact relations among episodes, Mannila et al. also specified another class of generalized episodes in [16] and designed an algorithm, MINEPI, for discovering frequent episodes based on minimal occurrences of episodes.

Periodic Patterns

A periodic pattern is defined as a pattern that appears at the same time periodically. In the last decades, there have been many studies on finding periodic patterns, and many definitions of periodic patterns have been proposed to fit different situations that conform more closely to real life. For example, in the early days, cyclic association rule mining was first proposed by Banu Özden et al. in [18], followed by the partial periodic patterns defined by Jiawei Han et al. in [7] [6], which loosen the constraint that every point in time must contribute to the periodicity. For example, Bob eats breakfast from 8:00 to 9:00 every day, but does other, irregular things at other times. Moreover, in order to handle periodic patterns that occur asynchronously, asynchronous periodic patterns were designed by Jiong Yang et al. in [24] [22]. Taking the previous example, Bob may eat breakfast from 9:00 to 10:00 instead, which still contributes to the periodic pattern.

Frequent Continuities

The name continuity pattern was coined by Huang in [11] as a substitute for the name inter-transaction association rule defined by Anthony K. H. Tung in [20]. The continuity pattern, also called an inter-transaction association rule, is defined as a pattern that considers the occurring order of each itemset in the pattern. Hence, we can also regard this pattern as a looser variant of the periodic pattern, which is limited to contiguous and disjoint matches. An algorithm, FITI [21], was proposed to solve this problem efficiently. FITI [21] has three stages:

(1) mining and storing frequent intra-transaction itemsets, (2) database transformation, and (3) mining frequent inter-transaction itemsets. Nevertheless, FITI also takes too much time to find the results, so Huang in [11] designed the PROWL algorithm to mine the results efficiently. The central idea of PROWL is to keep both the event sequence and the indices in memory during the mining process.

Figure 2.4: Data Convert Process

Huang also integrated a prune hash table into PROWL to design the ClosedPROWL algorithm [12]. In ClosedPROWL [12], there are three phases for discovering the results. The first phase finds all 1-size closed frequent itemsets, called C.F.E. In the second phase, these C.F.E. are encoded and constructed into an encoded horizontal database. Fig 2.4 shows each step of converting the temporal database into the encoded horizontal database.

In the third phase, a refined PROWL [11] algorithm is utilized to find all closed frequent continuities. The mining process of the refined PROWL is described as follows:

(1) First we find the 1-offset projected window list, denoted PWL, of each encoded eventset P, also called a closed frequent continuity.

(2) Find all eventsets P whose supports surpass the minimum support, and record these eventsets and their PWLs into the Prune Hash Table, denoted PruneHT, through the hash function. Moreover, use the pruning strategy to prune redundant eventsets.

(3) For each eventset X which is not removed after step (2), we connect it with P to generate a new continuity, and then perform steps (1), (2) and (3) recursively to produce longer closed frequent continuities until the length of the pattern is larger than maxwin or its support is smaller than the minimum support.

(4) Find all possible closed frequent continuities, then use the Closed Continuity Checking Table, denoted CCCT, to filter out duplicated closed frequent continuities.

Finally we give the comparison among these patterns. Fig 2.5 shows the table.

Figure 2.5: Comparison Table


2.2 Definition

In this section, we present essential preliminaries.

Definition 1 (Set Sequence Database) Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of elements. Let $S_i$ be a subset of $I$, where $S_i = (s_1, s_2, \ldots, s_n)$ is a set of elements such that $s_k \in I$ for $1 \leq k \leq n$ and each element in $S_i$ is distinct. The set sequence database SD is defined as an ordered list of the sets $S_i$, i.e., SD = < S1, S2, ..., Sn >.

Definition 2 (GC_k-contain instance) Given a set sequence SD = < S1, S2, ..., Sn > and a pattern P = < p1, p2, ..., pm >, where $n \gg m$, we say that SD GC_k-contains P at position $k$ iff there exists an integer $1 \leq k \leq n$ such that $p_1 \subseteq S_{i_0}$, where $i_0 = k$, $p_2 \subseteq S_{i_1}$, $\ldots$, $p_m \subseteq S_{i_{m-1}}$, and $i_j - i_{j-1} \leq GC + 1$ for $1 \leq j \leq m-1$. The GC (abbreviated from gap constraint and denoted $\gamma$) is a user-defined upper bound on the number of gaps between two adjacent sets of P in SD.

Example Consider SD = < {A,B,C,D},{A,C},{A,B,C},{A},{A,C,D,E},{A},{B,C,E,F},{B,D},{A,C},{E} > and GC = 1. We say that SD has two GC_1-contain instances of the pattern P1 = < {D},{C} >, i.e., m1 and m4 in Fig 2.6.


Figure 2.6: Illustrative Example I
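Definition 2 translates directly into code. The following is an unoptimized, backtracking Python check of ours (it is not part of the proposed algorithms); sd and p are lists of sets and positions are 1-based.

```python
def gck_contains(sd, p, k, gc):
    """Does SD GC-contain pattern P at 1-based position k (Definition 2)?"""
    if not p[0] <= sd[k - 1]:
        return False

    def embed(prev, j):              # p[0..j-1] placed, p[j-1] at `prev`
        if j == len(p):
            return True
        # i_j - i_{j-1} <= GC + 1: try every admissible next position
        for i in range(prev + 1, min(prev + gc + 1, len(sd)) + 1):
            if p[j] <= sd[i - 1] and embed(i, j + 1):
                return True
        return False

    return embed(k, 1)

sd = [{'A','B','C','D'}, {'A','C'}, {'A','B','C'}, {'A'}, {'A','C','D','E'},
      {'A'}, {'B','C','E','F'}, {'B','D'}, {'A','C'}, {'E'}]
print(gck_contains(sd, [{'D'}, {'C'}], 1, 1))  # True  (m1 and m4 in Fig 2.6)
print(gck_contains(sd, [{'D'}, {'C'}], 2, 1))  # False (no D at position 2)
```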

Definition 3 (Length and Size of a pattern) Given a pattern P = < p1, p2, ..., pm >, the size of P is defined as the number of sets in P, denoted size(P). The length $l$ of P is defined by $l = \sum_{i=1}^{m} |p_i|$.

Example Given P = < {A,B},{C},{A},{C} >, the size of P is 4 and the length of P is 5.

Definition 4 (kth Position GC_k-Contain Set, abbreviated as KPCS) Given SD = < S1, S2, ..., Sn >, a pattern P = < p1, p2, ..., pm >, where size(SD) ≥ size(P), and the GC, the kth Position GC_k-Contain Set consists of a starting position k and the different ending positions of the GC_k-contain instances of P at position k. We use {k, (n1, n2, ..., nj)} to record all positions, where n_i means an ending position of the pattern P for 1 ≤ i ≤ j. We also use < k, n_i > to denote one instance of a kth Position GC_k-Contain Set, where k is a starting position and n_i is an ending position. For example, given a KPCS {1,(3,4,5,7)}, the instances of this KPCS are < 1,3 >, < 1,4 >, < 1,5 > and < 1,7 >.

Example Given SD = < {A,B,C,D},{A,C},{A,B,C},{A},{A,C,D,E},{A},{B,C,E,F},{B,D},{A,C},{E} > and GC = 1, the pattern P1 = < {D},{C} > has two GC_1-contain instances in SD, i.e., m1 and m4 in Fig 2.6, and its KPCS is {1,(2,3)}. Besides, P1 also has a GC_5-contain instance, i.e., m2, and a GC_8-contain instance, i.e., m3.

Definition 5 (Repeating Pattern with Gap Constraint, abbreviated as RPGC) Given a set sequence SD = < S1, S2, ..., Sn > and all distinct kth Position GC_k-Contain Sets of a pattern P, the frequency of P, denoted freq(P, SD), is the maximum number of non-overlapping instances among all distinct kth Position GC_k-Contain Sets of P. P is called an RPGC iff freq(P, SD) ≥ $\delta$, where $\delta$ is a user-defined minimum support.

Example Consider SD = < {A,B,C,D},{A,C},{A,B,C},{A},{A,C,D,E},{A},{B,C,E,F},{B,D},{A,C},{E} >, GC = 1, and the patterns P1 = < {A,C},{B},{A} > and P2 = < {A,C},{B},{C} >. Then freq(P1, SD) = 2 and freq(P2, SD) = 1, as shown in Fig 2.6. Hence, P1 is an RPGC, but P2 is not.

Finally, we define the repeating pattern with gap constraint problem as follows.

Definition 6 (RPGC discovery problem) Given a set sequence SD and a GC, find all RPGCs P in SD with freq(P, SD) ≥ $\delta$.

In order to give a clear explanation of the correctness of our algorithms, we introduce the following definitions.

Definition 7 (Counting Basis Set, abbreviated as CBS) Given all distinct kth Position GC_k-Contain Sets of a pattern P in SD under GC = m, the counting basis set is defined as a set of instances, i.e., {< k_0,n_0 >, < k_1,n_1 >, ..., < k_e,n_e >}, obtained from KPCSs with different k and satisfying the following conditions:

1) For any pair < k_i,n_i > and < k_j,n_j > with i ≠ j, < k_i,n_i > does not overlap < k_j,n_j >.

2) We select the instances < k_j,n_j > from all distinct kth Position GC_k-Contain Sets, where j runs from 1 to e and n_j − k_j is the minimum value.

Example Given SD = < {A,B,C},{A,C},{A,B},{A,C},{A,C,D,E},{A,C},{B,C,E,F},{B,D},{A,C},{E} >, GC = 1 and the pattern P1 = < {A,C},{A},{C} >, the CBS of P1 consists of < 1,4 > and < 5,7 >, i.e., m1 and m8 in Fig 2.7.



Figure 2.7: Illustrative Example II

Definition 8 (Unit Counting Set, abbreviated as UCS) Given all distinct KPCSs of a pattern P in SD under GC = m, a starting position S and an ending position E, the Unit Counting Set, denoted UCS_P(S, E), is defined as the set of instances obtained from the KPCSs whose ending position equals E and whose starting position is larger than or equal to S. All instances in a Unit Counting Set together contribute 1 to the frequency count, since we only count non-overlapping instances.

Example In Fig 2.7, given SD, S = 1, E = 4 and GC = 1, UCS_P(1, 4) for the pattern P = < {A,C},{A},{C} > consists of < 1,4 > and < 2,4 >, i.e., m1 and m3.

Definition 9 (Frequency Counting Set Group, abbreviated as FCSG) Given the CBS of a pattern P under GC = m, we classify the distinct kth Position GC_k-Contain Sets of P into a group of UCSs according to the CBS of P. The FCSG contains these UCSs, where each UCS ≠ ∅, and freq(P, SD) equals the number of UCSs in the FCSG.

Example Consider SD = < {A,B,C},{A,C},{A,B},{A,C},{A,C,D,E},{A,C},{B,C,E,F},{B,D},{A,C},{E} >, GC = 1 and the pattern P1 = < {A,C},{A},{C} >. The CBS of P1 is < 1,4 > and < 5,7 >, i.e., m1 and m8. The UCSs according to the CBS of P1 are UCS_P1(1,4) = {< 1,4 >, < 2,4 >} and UCS_P1(5,7) = {< 5,7 >}. Because no UCS is ∅, the FCSG contains these UCSs and freq(P1, SD) = 2.
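The frequency of Definition 5 is exactly a maximum set of pairwise non-overlapping intervals, which the CBS selects greedily by earliest ending position (as Lemma 2 proves later). A minimal sketch, assuming non-overlap means that instances share no position:

```python
def frequency(instances):
    """freq(P, SD): maximum number of non-overlapping <start, end> instances,
    chosen greedily by earliest ending position (the CBS of Definition 7)."""
    count, last_end = 0, 0
    for s, e in sorted(instances, key=lambda x: (x[1], x[0])):
        if s > last_end:             # disjoint from all chosen instances
            count += 1
            last_end = e
    return count

# Instances of P1 = <{A,C},{A},{C}> in Fig 2.7, as (start, end) pairs:
print(frequency([(1, 4), (2, 4), (5, 7)]))  # 2 -> CBS = {<1,4>, <5,7>}
```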


Chapter 3

The Proposed Algorithms

In this section, we propose an algorithm, G-Apriori, to find the RPGCs. Besides, we also design an index list which is incorporated into the G-Apriori algorithm to produce a refined method, GwI-Apriori, for efficiency.

3.1 G-Apriori Algorithm

According to the anti-monotonic property, every length-(k−1) sub-pattern of a length-k RPGC must also be an RPGC. Hence, the G-Apriori algorithm employs this property to generate the candidates C_k from L_{k−1}, then finds L_k by scanning the set sequence and counting the supports of C_k, where C_k is the set of length-k candidate RPGCs and L_k is the set of length-k RPGCs. Here, we also call an RPGC a frequent pattern. The process is as follows: 1) scan the set sequence to find all length-1 frequent patterns L1; 2) generate all length-k candidates C_k from the length-(k−1) frequent patterns L_{k−1} by the pattern-grow method; 3) scan the set sequence to count the supports of C_k and find L_k, the patterns in C_k whose supports are no less than the minimum support; 4) go to step 2) until L_k is empty. However, the method of generating candidates for RPGCs differs from the method of generating candidates for association rules [3], because the sets in an RPGC are ordered. Hence, we propose the pattern-grow method to generate all possible candidates C_k from L_{k−1}. Based on the anti-monotonic principle, for every length-l frequent pattern, all its length-(l−1) sub-patterns are frequent. The pattern-grow method is designed as follows. Given two patterns p1 and p2 in L_{k−1}, we delete the first element from p1 to obtain p1' and delete the last element from p2 to obtain p2'. First, we check p1' and p2'. If p1' and p2' are both ∅, two possible length-2 patterns are generated in two ways: appending the set of p2 to the set of p1, and adding the element of p2 into the set of p1. Otherwise, if p1' and p2' are equal and not ∅, a length-k pattern is generated by combining p1 and p2, i.e., by appending the last set of p2 to p1 or by adding the last element of p2 into the last set of p1. The detailed steps of G-Apriori are described in Algorithm 1.

Algorithm 1 G-Apriori Algorithm
Input: a set sequence SD, threshold δ, gap constraint γ
Output: all RPGCs
1: L1 = {i | i ∈ I, freq(i, SD) ≥ δ}
2: for (k = 2; L_{k−1} ≠ ∅; k++) do
3:   C_k = Candidate-Generate(L_{k−1})
4:   for all patterns c in C_k do
5:     count = freq(c, SD, γ)
6:   L_k = {c ∈ C_k | freq(c, SD, γ) ≥ δ}
7: RSPSet = ∪_k L_k

Algorithm 2 Candidate-Generate Algorithm
Input: L_{k−1}
Output: C_k
1: for each pair (cs_i, cs_j) where cs_i and cs_j ∈ L_{k−1} do
2:   cs_i' = delete the first element of cs_i
3:   cs_j' = delete the last element of cs_j
4:   if equal(cs_i', ∅) and equal(cs_j', ∅) then
5:     cs_k1 = append the last set of cs_j to cs_i
6:     cs_k2 = add the last element of cs_j into the last set of cs_i
7:     if length(cs_k1) = k then
8:       add cs_k1 to C_k
9:     if length(cs_k2) = k then
10:      add cs_k2 to C_k
11:  else
12:    if equal(cs_i', cs_j') then
13:      if the last element of cs_i is a set of length 1 then
14:        cs_k = append the last set of cs_j to cs_i
15:      else
16:        cs_k = add the last element of cs_j into the last set of cs_i
17:      if length(cs_k) = k then
18:        add cs_k to C_k
19: return C_k
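A compact Python sketch of the pattern-grow join follows. It is a simplification of ours rather than a literal transcription of Algorithm 2: for every joinable pair it generates both the sequence extension and the itemset extension and lets the length-k filter (and, later, support counting) discard the extras; elements inside a set are assumed lexicographically ordered.

```python
def drop_first(p):
    """Delete the first element of pattern p (drop its first set if emptied)."""
    rest = frozenset(sorted(p[0])[1:])
    return ((rest,) + p[1:]) if rest else p[1:]

def drop_last(p):
    """Delete the last element of pattern p (drop its last set if emptied)."""
    rest = frozenset(sorted(p[-1])[:-1])
    return (p[:-1] + (rest,)) if rest else p[:-1]

def pattern_grow(l_prev, k):
    """Length-k candidates from L_{k-1}; a pattern is a tuple of frozensets."""
    cands = set()
    for p1 in l_prev:
        for p2 in l_prev:
            if drop_first(p1) != drop_last(p2):
                continue
            last = sorted(p2[-1])[-1]              # last element of p2
            s_step = p1 + (p2[-1],)                # append p2's last set
            i_step = p1[:-1] + (p1[-1] | {last},)  # grow p1's last set
            for c in (s_step, i_step):
                if sum(len(s) for s in c) == k:
                    cands.add(c)
    return cands

L3 = {(frozenset('A'), frozenset('BC')),
      (frozenset('BC'), frozenset('D')),
      (frozenset('D'), frozenset('A'), frozenset('B')),
      (frozenset('BCF'),)}
for c in sorted(pattern_grow(L3, 4), key=str):
    print([sorted(s) for s in c])
```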

We give an example of the merging process in the Candidate-Generate step. Let L3 be {< {A},{B,C} >, < {B,C},{D} >, < {D},{A},{B} >, < {B,C,F} >}. For example, merging < {A},{B,C} > with < {B,C},{D} > yields < {A},{B,C},{D} >; over all pairs, the Candidate-Generate step produces C4 = {< {A},{B,C},{D} >, < {A},{B,C,F} >, < {D},{A},{B,C} >}.

Example

Given the set sequence sd = < {B,C},{D},{A},{B,C},{E,G},{A,B,C},{C},{A,F},{A,C},{H} > shown in Fig 3.2, with min-support = 2 and GC = 1, L1 is obtained by scanning the set sequence and checking the frequency of each item, and then L1 is used to generate C2 (line 3 of Algorithm 1).



Figure 3.1: Apriori Based Example

Index Position  1    2   3   4    5    6      7   8    9    10
Sets            B,C  D   A   B,C  E,G  A,B,C  C   A,F  A,C  H

Figure 3.2: Set Sequence Data

After counting the supports, the patterns in C2 whose supports are under the threshold are removed. The whole process terminates when no large pattern is derived; in this example, since L5 is the empty set, the process stops. The frequent patterns are in L_i, where i ∈ {1,2,3,4}. Fig 3.1 shows the process of discovering all L_k.

3.2 GwI-Apriori Algorithm

Since the G-Apriori algorithm takes too much time scanning the database to count the supports of patterns, we propose the GwI-Apriori (abbreviated from Gap with Index Apriori) algorithm to solve this problem. The GwI-Apriori algorithm is also based on G-Apriori for generating candidates, but it scans the database only once, records the position information of L1, and then uses this position information for counting the supports of longer patterns. Moreover, we devise an index list, which consists of an S_P (start position) and an E_P (end position) list, to record the start and end positions where a pattern may appear in the set sequence, i.e., the KPCS of the pattern at the kth position. Note that each pattern has a set of index lists, where each index list stands for the KPCS of the pattern at a distinct S_P. Besides, we also design pruning strategies to speed up execution.

As in the G-Apriori algorithm, we scan the set sequence and construct index lists to record the positions where the patterns in L1 occur. However, we modify Candidate-Generate of G-Apriori into Merge Check, shown in Algorithm 3, for generating the candidates. Merge Check returns the candidates together with their extended types and extended patterns. After finding L1, C2 is generated from L1 by calling Merge Check. For C2, distinct strategies for constructing the index lists are adopted according to the extended type: sequence-extended (S-Step) or itemset-extended (I-Step) [5]. For example, given a pattern s = < {A},{B} >, an S-Step extension of s is < {A},{B},{C} > and an I-Step extension of s is < {A},{B,C} >. When a pattern c_i in L_k can merge with c_j in L_k under γ = n, two different range checks are applied to construct the index lists of the merged pattern, according to the returned extended type and extended pattern. The index lists of the merged pattern are constructed as follows: we take each value EV in the E_P list of each index list of pattern c_i and check which positions of the last element of pattern c_j fall in the corresponding range, where the range is (EV, EV + n + 1] for the S-Step and [EV, EV] for the I-Step; we then construct an index list whose S_P equals the S_P of the currently checked index list of c_i and record the values which satisfied the corresponding range into its E_P list. This process continues until all index lists of pattern c_i are checked. The patterns whose supports are less than δ are removed from C_k; note that we count the frequency while constructing the result index lists. For instance, consider the following L2: pattern < {A},{C} >, whose index lists are {1,(2,3,4)}, {3,(4,5)}, {4,(5)}; pattern < {C},{B} >, where the index lists of pattern < {B} > are {2,(2)}, {3,(3)}, {5,(5)}, {6,(6)}, {10,(10)}; and pattern < {C,D} >, where the index lists of pattern < {D} > are {4,(4)}, {5,(5)}, {9,(9)}. C3 contains < {A},{C},{B} > and < {A},{C,D} >, where the index list {1,(2,3,5,6)} of the pattern < {A},{C},{B} > under γ = 2 is generated by taking each value in the E_P list of {1,(2,3,4)} and doing the S-Step range check against the index lists of pattern < {B} >; hence the number of comparisons is 3×5. Besides, {3,(5,6)} and {4,(6)} are also index lists of the pattern < {A},{C},{B} >. The index lists of the pattern < {A},{C,D} > are {1,(4)}, {3,(4,5)} and {4,(5)} by applying the I-Step range check. The pattern mining process terminates when L_k is ∅.
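The two range checks over index lists can be sketched as follows (a simplified rendering of ours, without the pruning of Algorithm 4). Index lists are modeled as a dict from S_P to its E_P list, and the item being merged in contributes a plain position list.

```python
def merge_s_step(index_lists, pos_list, gc):
    """S-Step: the new item may follow each end position e within (e, e+gc+1]."""
    merged = {}
    for sp, eps in index_lists.items():
        new_eps = sorted({p for e in eps for p in pos_list
                          if e < p <= e + gc + 1})
        if new_eps:
            merged[sp] = new_eps
    return merged

def merge_i_step(index_lists, pos_list):
    """I-Step: the new item must occur in the set where the pattern ends."""
    positions = set(pos_list)
    merged = {}
    for sp, eps in index_lists.items():
        new_eps = [e for e in eps if e in positions]
        if new_eps:
            merged[sp] = new_eps
    return merged

a_c = {1: [2, 3, 4], 3: [4, 5], 4: [5]}  # index lists of <{A},{C}>
print(merge_s_step(a_c, [2, 3, 5, 6, 10], 2))  # {1: [3, 5, 6], 3: [5, 6], 4: [6]}
print(merge_i_step(a_c, [4, 5, 9]))            # {1: [4], 3: [4, 5], 4: [5]}
```

The I-Step output matches the {1,(4)}, {3,(4,5)}, {4,(5)} quoted above; for the S-Step, applying the stated (EV, EV + γ + 1] window literally yields {1,(3,5,6)} for the first index list, slightly differing from the {1,(2,3,5,6)} quoted above.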

An example for the GwI-Apriori Algorithm


Algorithm 3 Merge Check
Input: L_{k−1}
Output: all merged patterns C_k, the corresponding extended types C_k.T and extended patterns C_k.L_E
1: for each pair (cs_i, cs_j) where cs_i and cs_j ∈ L_{k−1} do
2:   cs_i' = delete the first element of cs_i
3:   cs_j' = delete the last element of cs_j
4:   if equal(cs_i', ∅) and equal(cs_j', ∅) then
5:     cs_k1 = append the last set of cs_j to cs_i
6:     cs_k2 = add the last element of cs_j into the last set of cs_i
7:     if length(cs_k1) = k then
8:       add 1 to C_k.T
9:       add cs_k1 to C_k
10:    if length(cs_k2) = k then
11:      add 0 to C_k.T
12:      add cs_k2 to C_k
13:    add the last element of cs_j to C_k.L_E
14:  else
15:    if equal(cs_i', cs_j') then
16:      if the last element of cs_i is a set of length 1 then
17:        add 1 to C_k.T
18:        cs_k = append the last set of cs_j to cs_i
19:      else
20:        add 0 to C_k.T
21:        cs_k = add the last element of cs_j into the last set of cs_i
22:      if length(cs_k) = k then
23:        add cs_k to C_k
24:      add the last element of cs_j to C_k.L_E


Consider again the set sequence sd in Fig 3.2 with min-support = 2 and γ = 1. We first scan the data and construct the index lists for L1, shown in Fig 3.3 (a); for instance, the pattern {A} occurs at positions 3, 6, 8 and 9 of sd. C2 is generated from L1, where for example the pattern < {B,C},{A} > is generated by combining the patterns < {B,C} > and < {C},{A} >. The index lists of the pattern < {B,C},{A} > are generated by taking each index list of pattern < {B,C} >, which are {1,(1)}, {4,(4)} and {6,(6)}, and doing the range check against the index lists of pattern < {A} >, which are {3,(3)}, {6,(6)}, {8,(8)}, {9,(9)}. Fig 3.3 shows the index lists for L1, L2, L3 and L4, respectively, where an underlined S_P means that it contributes to the frequency count.


Figure 3.3: The example for GwI-Apriori Algorithm

3.2.1 Pruning Strategies

Because the comparisons among index lists take much time, we design pruning strategies, based on the order of the index lists of a pattern, to speed up the mining process. The pruning strategies are stated as follows.

(1) Range pruning: the general concept is that for a pattern P1 with S_P = ps1 and E_P = pe1, if we want to extend it with a pattern P2 of size 1 under γ = n, then, depending on the extended type, the position of P2 must lie in the range (pe1, pe1 + n + 1] for the S-Step or [pe1, pe1] for the I-Step. In detail, we remember in next_range the first position of P2 that fell beyond the previously scanned range; for the current E_P value we first check whether its range reaches next_range, and only if it does do we scan the PL from position j. The reason is that a failed check means every position the current range could cover has already been scanned for the previous E_P, and the result positions were already put into the result index list.

(2) Last value pruning: the general concept is that for a pattern P1 with S_P = ps1 and E_P = pe1, if we want to extend it with a pattern P2 of size 1 under γ = n, then the position of P2 must be larger than pe1. In detail, if the first E_P value of the current index list, PL[i].E_P[0], is not smaller than the last value in PL, we stop the comparison: the S_P (and hence every E_P) of the following index lists must also be ≥ the last value in PL, so we do not need to scan the following index lists.

Algorithm 4 shows the pruning strategies integrated into the frequency counting process, where next_range is used for range pruning and cur_L_element is used for last value pruning. Besides, Fig 3.4 shows the flow chart of the pruning strategies.

Figure 3.4: Flow Chart for Pruning Strategies


Algorithm 4 freq_count
Input: extended type T, index lists of pattern c_i (Ind_list), position list of pattern c_j (Pos_list), threshold δ, gap constraint γ
Output: result index lists and count
1: initialize next_range, count, cur_L_element, ptr_L_E, L_E_C = 0 {T = 0 means S-Step, T = 1 means I-Step}
2: for each index list of pattern c_i do
3:   if cur_L_element ≤ Ind_list.E_P[0] then
4:     break
5:   i = 0
6:   j = ptr_L_E
7:   for each E_P of the current Ind_list pair; i++ do
8:     L_E_List = NULL
9:     if T = 0 then
10:      range = Ind_list.E_P[i] + GC + 1
11:      if range ≥ next_range then
12:        for each index position of Pos_list of pattern c_j; j++ do
13:          if c_j.Pos_list[j] ≤ range and c_j.Pos_list[j] > Ind_list.E_P[i] then
14:            if Ind_list.E_P[i] > L_E_C and Ind_list.S_P > L_E_C then
15:              count++
16:              L_E_C = c_j.Pos_list[j]
17:              ptr_L_E = j
18:            add c_j.Pos_list[j] to L_E_List.E_P
19:            if c_j.Pos_list[j+1] = NULL then
20:              cur_L_element = c_j.Pos_list[j]
21:          else if c_j.Pos_list[j] > range then
22:            next_range = c_j.Pos_list[j]
23:            break
24:    else if T = 1 then
25:      j = ptr_L_E
26:      for each index position of Pos_list of pattern c_j; j++ do
27:        if c_j.Pos_list[j] = Ind_list.E_P[i] then
28:          if Ind_list.E_P[i] > L_E_C and Ind_list.S_P > L_E_C then
29:            count++
30:            L_E_C = c_j.Pos_list[j]
31:            ptr_L_E = j
32:          add c_j.Pos_list[j] to L_E_List.E_P
33:          if c_j.Pos_list[j+1] = NULL then
34:            cur_L_element = c_j.Pos_list[j]
35:        else if Ind_list.E_P[i] < c_j.Pos_list[j] then
36:          break
37:        else if c_j.Pos_list[j+1] = NULL and Ind_list.E_P[i] ≠ c_j.Pos_list[j] then
38:          cur_L_element = c_j.Pos_list[j]
39:          break
40:    L_E_List.S_P = Ind_list.S_P
41:    add L_E_List to R_Ind_list
42: return [R_Ind_list, count]


An example of the pruning strategies

Fig 3.5 shows an example of the pruning process for the S-Step, where γ = 4. We start with the first index list of the pattern < {A},{B} > and scan the index list of the pattern < {C} >. First, we take the first value of the E_P list and sequentially scan the values in the position list of pattern < {C} >. When we find the first position value that satisfies the range condition, we record it for later use, and we keep collecting values until a position value no longer satisfies the range condition; the position values that satisfied the range condition are added to a temporary E_P list. Because the position value 10 is the last value in the index list of pattern < {C} >, we record it for later use by the second pruning strategy. Then we check the next value, 3, of the E_P list and apply the first pruning strategy: in this example, 3 + γ + 1 = 8 is not larger than 10, so we skip the scan and move on to the next value in the E_P list. This process ends when all values in the E_P list of this index list have been checked; we then record the S_P and the corresponding satisfied position values into the result index list, and Fig 3.5 (d) shows the result. Next we check the next index list. Similarly, we compare the first value of its E_P list with 10, the last value in the index list of pattern < {C} >. According to the second pruning strategy, since 3 is smaller than 10, we repeat the same process as explained previously; note that when we check the next index list of pattern < {A},{B} >, the first value checked in the index list of pattern < {C} > is the last E_P value that contributed to the last frequency count, and Fig 3.6 (a) shows that this last E_P value equals 3. However, when we reach the index list whose S_P equals 10, we check the first value of its E_P list: if this position value is not smaller than the last value in the index list of pattern < {C} >, we can stop. In Fig 3.6 (d), the first value in the E_P list of this index list, 11, is larger than 10; hence, based on the second pruning strategy, we do not need to check the following index lists.


Figure 3.5: The example for Pruning Technique Part I


Figure 3.6: The example for Pruning Technique Part II


Chapter 4

Correctness

We prove the correctness of the G-Apriori algorithm in the following.

Lemma 1: For each pattern P_i in L2, each length-1 sub-pattern of P_i is in L1.

Theorem 1: For each pattern P_i in L_k under GC = m, each length-(k−1) sub-pattern of P_i is also in L_{k−1} under GC = m.

Proof: Given P_i = < p1, p2, ..., pn >, where p_j ⊆ I for 1 ≤ j ≤ n, consider the frequency counting sets of P_i under GC = m. When we delete the first element of P_i to obtain P_DF = < p_df1, p2, ..., pn > (where p_df1 is p1 without its first element, and is omitted entirely if it becomes empty), we have FCSG(P_i) ⊆ FCSG(P_DF). When we delete the last element of P_i to obtain P_DL = < p1, p2, ..., p_dln >, we likewise have FCSG(P_i) ⊆ FCSG(P_DL). Hence, for each pattern P_i in L_k, each length-(k−1) sub-pattern of P_i is in L_{k−1}.

Lemma 2: Under the conditions for finding the CBS among the KPCSs of a pattern P in SD under GC = m, the number of instances in the CBS is the maximum number of non-overlapping instances.

Proof: Here we need to prove two properties: 1) the greedy choice property and 2) the optimal substructure property.

(1) Let S = {< k_0,n_0 >, < k_1,n_1 >, ..., < k_i,n_i >} be the set of instances obtained from the KPCSs of P. The instances in S are sorted first by ending position; if the ending positions of two instances are the same, they are further sorted by starting position. This implies that instance < k_0,n_0 > has the earliest starting and ending positions. Suppose AS is a subset of S and an optimal solution, and let the instances in AS be ordered in the same way. Suppose the first instance in AS is < k_j,n_j >. If < k_j,n_j > = < k_0,n_0 >, then AS begins with the greedy choice and we are done. If < k_j,n_j > ≠ < k_0,n_0 >, we show that there is another optimal solution BS that begins with the greedy choice < k_0,n_0 >. Let BS = AS − {< k_j,n_j >} ∪ {< k_0,n_0 >}. Because n_0 ≤ n_j, the instances in BS are disjoint, and since BS has the same number of instances as AS, i.e., |AS| = |BS|, BS is also optimal.

(2) Now we prove optimal substructure. If AS is an optimal solution for S, then AS' = AS − {< k_0,n_0 >} is an optimal solution for S' = {< k_p,n_p > ∈ S | k_p ≥ n_0}. Therefore, after each greedy choice we are left with an optimization problem of the same form as the original. By induction on the number of choices, the greedy strategy produces an optimal solution.

Theorem 2: The GwI-Apriori algorithm finds the maximum frequency of a pattern.

Proof: The frequency counting strategy of the GwI-Apriori algorithm for a pattern P is based on the greedy choice. According to Lemma 2, the GwI-Apriori algorithm finds the maximum number of non-overlapping instances of a pattern, which is the maximum frequency of the pattern.


Chapter 5

Experiment

In this section, we present the experimental results of both the G-Apriori and GwI-Apriori algorithms. All programs were implemented in Microsoft Visual C++ 6.0. All experiments were performed on an Intel Pentium 4 CPU at 3.20 GHz with 1 gigabyte of main memory, running Linux. For our experimental evaluation we used real data.

We run our algorithms on real-world stock data to find useful patterns in a set sequence. The stock data were collected for eight companies from the Taiwan Stock Exchange Daily Official List from January 1, 1995 to December 31, 2007 using Perl; the number of trading days is 3388. We discretize the daily stock price movement into five categories: Up-High (UH): ≥ 3.5%; Up-Low (UL): > 0% and < 3.5%; Unbiased (UN): 0%; Down-Low (DL): < 0% and > -3.5%; Down-High (DH): ≤ -3.5%. Hence, we have 40 different elements. The average size of the transactions is 8. Table 5.1 lists the companies.
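As a concrete reading of the discretization, here is a one-function Python sketch; it assumes (mirroring the DH bound) that Up-High means a rise of at least 3.5%.

```python
def discretize(prev_close, close):
    """Map a daily price change to one of the five categories."""
    change = (close - prev_close) / prev_close * 100.0
    if change >= 3.5:
        return 'UH'   # Up-High
    if change > 0:
        return 'UL'   # Up-Low
    if change == 0:
        return 'UN'   # Unbiased
    if change > -3.5:
        return 'DL'   # Down-Low
    return 'DH'       # Down-High

print(discretize(60.0, 62.0))  # 'UL' (+3.33%)
print(discretize(60.0, 58.0))  # 'DL' (-3.33%)
```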

Stock Number  Company Name
2330          TSMC [1]
2308          Delta [2]
2317          Foxconn [3]
2324          Compal [4]
2311          ASE [5]
2321          TECOM [6]
2312          Kinpo [7]
2313          Compeq [8]

Table 5.1: Stock number and name for companies

[1] http://www.tsmc.com/chinese/default.htm
[2] http://www.delta.com.tw/ch/index.asp
[3] http://www.foxconn.com.tw/
[4] http://www.compal.com/index En.htm
[5] http://www.asetwn.com.tw/
[6] http://www1.tecom.com.tw/
[7] http://www.kinpo.com.tw/ChineseT/index.htm
[8] http://www.compeq.com.tw/home.htm


[Six panels, (a) GC=0 through (f) GC=5: execution time (sec) versus minimum support (%) for G-Apriori and GwI-Apriori.]

Figure 5.1: Gap constraint versus Execution time

[Six panels comparing the execution times (sec) of G-Apriori and GwI-Apriori.]

Figure 5.2: Minimum support versus Execution time


The execution times of both algorithms with varying GC are shown in Fig 5.1. From these figures, we can see that as the value of GC increases, GwI-Apriori still runs faster than G-Apriori, but the difference between their execution times narrows. The reason is that more index lists are recorded and need to be compared.

Fig 5.2 shows the execution times of both algorithms with varying minimum support. When the minimum support increases from 15% to 20%, the execution times of both algorithms decrease because the average pattern length becomes shorter. However, GwI-Apriori still performs much better than G-Apriori as the minimum support increases.

[Four panels: (a) Number of patterns for different GC versus minimum support; (b) Time rate for different GC and minimum support; (c) Number of patterns for different minimum support versus GC; (d) Time rate for different minimum support and GC.]

Figure 5.3: Summary Illustration

Fig 5.3 (a) and (c) show the number of frequent patterns for different GC versus minimum support. Besides, in order to clearly show how much faster the GwI-Apriori algorithm runs than the G-Apriori algorithm, we define the rate (T_G-Apriori − T_GwI-Apriori) / T_G-Apriori, where T denotes execution time, and apply this formula to compute the relative difference in execution time between G-Apriori and GwI-Apriori. Fig 5.3 (b) and (d) show this rate.


Chapter 6

Conclusion

In this paper, we propose a new problem: mining repeating patterns with gap constraint from a set sequence. Besides, we propose an algorithm, G-Apriori, to mine the repeating patterns with gap constraint. A refined algorithm, GwI-Apriori, is proposed to avoid scanning the set sequence many times to obtain the frequent patterns. In the GwI-Apriori method, a new data structure is designed to record the start and end positions where a pattern appears; hence we only need to scan the set sequence once, which saves a lot of database-scanning time when finding longer patterns. Besides, pruning strategies are designed to reduce the number of comparisons among the index lists. The experimental results show that GwI-Apriori outperforms the G-Apriori algorithm. In addition, we can discover potential repeating patterns by adopting the gap constraint.

