Chapter 1 Introduction

With the advance of information technology, more and more corporations and organizations have come to store data in databases by means of automated data collection tools. Because the amount of data grows tremendously, extracting interesting and useful knowledge from large databases has become a very important issue. “Data mining” is the field of study that addresses this problem.

1.1 Motivation

Repeating patterns are important sub-patterns of a data sequence because they appear in the sequence repeatedly. In music analysis [1][2][6][7] and biological studies, repeating patterns are usually regarded as representative features of sequence data. Therefore, repeating pattern discovery is of great interest for applications whose data are represented as sequences.

Many approaches have been proposed for mining repeating patterns in recent years. However, only exact matching was considered when matching patterns during the mining process. As a result, some implicit repeating patterns may not be found because insertion or deletion errors occur in their occurrences. For example, suppose two data sequences are given as follows:

Sequence 1 = ACDE……ACEDE……,

Sequence 2 = ACDE……ADE……

In Sequence 1, the pattern “ACEDE” approximately matches “ACDE” with one insertion error. In Sequence 2, the pattern “ADE” approximately matches “ACDE” with one deletion error. Therefore, mining repeating patterns with exact matching will lose the implicit repeating pattern “ACDE”.
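The two approximate matches in this example can be checked with a small sketch. It assumes that the number of errors is measured as the minimum number of single-symbol insertions and deletions (no substitutions) needed to turn one string into the other; the function name `indel_distance` is hypothetical and is not taken from this thesis.

```python
def indel_distance(pattern: str, segment: str) -> int:
    """Minimum number of single-symbol insertions/deletions that turn
    `pattern` into `segment` (substitutions are not allowed)."""
    n = len(segment)
    # prev[j] holds the distance between the current prefix of `pattern`
    # and segment[:j]; for the empty prefix it is j insertions.
    prev = list(range(n + 1))
    for i in range(1, len(pattern) + 1):
        curr = [i] + [0] * n  # i deletions against the empty segment
        for j in range(1, n + 1):
            if pattern[i - 1] == segment[j - 1]:
                curr[j] = prev[j - 1]                    # symbols match
            else:
                curr[j] = 1 + min(prev[j], curr[j - 1])  # delete or insert
        prev = curr
    return prev[n]


print(indel_distance("ACDE", "ACEDE"))  # one insertion
print(indel_distance("ACDE", "ADE"))    # one deletion
```

Under this assumption, both “ACEDE” and “ADE” are at distance one from “ACDE”, which is why a tolerance of one insertion/deletion error would let them count as occurrences of “ACDE”.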

To solve this problem, this thesis focuses on a strategy for mining “fault-tolerant” repeating patterns, FT-RPs for short. In other words, insertion/deletion errors are allowed when counting the frequency with which a pattern appears. In addition, to avoid duplicated information and a flood of short patterns, only “non-trivial” FT-RPs are mined, i.e., those that have no super-pattern FT-RP with the same fault-tolerant frequency, and whose lengths are no less than a given min_len. Moreover, by letting the user give the desired number K of non-trivial FT-RPs, we propose an approach for mining the “top-K non-trivial fault-tolerant repeating patterns with length no less than min_len”, which avoids returning a huge number of non-representative patterns.
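The non-trivial and minimum length requirements can be illustrated with a minimal sketch. It assumes that the frequencies of all candidate patterns are already known and that a super-pattern is a pattern containing the given one as a contiguous substring; the function name and the sample frequencies are hypothetical.

```python
def non_trivial_patterns(freqs: dict, min_len: int = 1) -> dict:
    """Keep patterns of length >= min_len that have no super-pattern
    with the same frequency.

    `freqs` maps each pattern (a string) to its frequency.
    """
    kept = {}
    for pattern, freq in freqs.items():
        if len(pattern) < min_len:
            continue  # too short to be reported
        has_equal_super = any(
            other != pattern and pattern in other and other_freq == freq
            for other, other_freq in freqs.items()
        )
        if not has_equal_super:
            kept[pattern] = freq
    return kept


# Hypothetical frequencies, for illustration only.
freqs = {"A": 3, "AC": 2, "ACD": 2, "ACDE": 2, "DE": 2, "E": 4}
print(non_trivial_patterns(freqs, min_len=2))
```

In this sample, “AC”, “ACD”, and “DE” are dropped because “ACDE” contains them and has the same frequency, so they carry no additional information.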

1.2 Literature Review

The work related to our task falls into four parts: strategies for repeating pattern mining, periodic pattern mining, fault-tolerant mining, and top-K closed pattern mining.

1.2.1 Repeating Patterns Mining

To provide content-based retrieval of music data, thematic feature strings, such as melody, rhythm, and chord, are usually extracted from the original music objects and treated as the metadata of the music data [2]. Given a segment of a music object, the query system must find the corresponding music object. To avoid matching against the whole feature sequence when processing such queries, the concept of repeating patterns was used in [6] to represent the significant content of a music object. That paper proposed a data structure called the correlative matrix to aid the extraction of repeating patterns. However, the main disadvantage of this approach is that a correlative matrix needs much storage space, because the size of the matrix is proportional to the square of the length of the music object. Therefore, the larger the correlative matrix is, the higher the cost of the mining process.
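The quadratic storage cost can be seen in a minimal sketch of such a matrix. The sketch assumes a common formulation in which cell (i, j), with i < j, stores the length of the longest common substring ending at positions i and j; this is an assumption made for illustration and may differ in detail from the construction in [6]. The function names are hypothetical.

```python
def correlative_matrix(seq: str) -> list:
    """Build an n x n matrix whose cell (i, j), i < j, holds the length of
    the longest substring ending at i that also ends at j (0 if none)."""
    n = len(seq)
    matrix = [[0] * n for _ in range(n)]  # n * n cells: quadratic storage
    for i in range(n):
        for j in range(i + 1, n):
            if seq[i] == seq[j]:
                above_left = matrix[i - 1][j - 1] if i > 0 else 0
                matrix[i][j] = above_left + 1
    return matrix


def longest_repeat(seq: str) -> str:
    """Return one longest substring that occurs at least twice."""
    best_len, best_end = 0, 0
    for i, row in enumerate(correlative_matrix(seq)):
        for value in row:
            if value > best_len:
                best_len, best_end = value, i
    return seq[best_end - best_len + 1: best_end + 1]


print(longest_repeat("ABCABC"))
```

For a sequence of length n the matrix always holds n × n cells regardless of how few repeating patterns exist, which is the storage problem described above.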

To solve the problem described above, the same authors developed the String-Join algorithm [1]. Instead of finding all repeating patterns in a music object, only the non-trivial repeating patterns were extracted. In other words, repeating patterns that have the same frequencies as their super-patterns were removed from the result. In the String-Join algorithm, all repeating patterns of length one were found first. For each repeating pattern found, its frequency and the locations where it appears were recorded. Then, two repeating patterns of length one were joined to generate patterns of length two. By repeating this step, all repeating patterns of lengths 1, 2^1, …, 2^k can be obtained. Then, the trivial patterns among them were removed. By combining the non-trivial repeating patterns of length 2^k, patterns whose lengths are not powers of two were generated, and all the non-trivial repeating patterns were found. The worst case of this approach arises when there are many non-trivial repeating patterns whose lengths are very close to the length of the longest repeating pattern. In this situation, the execution efficiency of the String-Join algorithm degrades [1][7][8].
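The length-doubling join step can be sketched as follows. This is only an illustration of joining position lists under stated assumptions, not the String-Join algorithm of [1] itself: it covers the doubling phase only, omits the removal of trivial patterns and the generation of patterns whose lengths are not powers of two, and treats a pattern as repeating when it occurs at least `min_freq` times. All names are hypothetical.

```python
from collections import defaultdict


def repeating_patterns_by_doubling(seq: str, min_freq: int = 2) -> list:
    """Return a list of levels; level k maps every repeating pattern of
    length 2**k to the list of its start positions in `seq`."""
    starts = defaultdict(list)
    for i, symbol in enumerate(seq):
        starts[symbol].append(i)
    level = {p: pos for p, pos in starts.items() if len(pos) >= min_freq}

    levels = []
    length = 1  # length of the patterns in the current level
    while level:
        levels.append(level)
        joined_level = {}
        for left, left_pos in level.items():
            for right, right_pos in level.items():
                right_set = set(right_pos)
                # left+right starts at p if left starts at p and
                # right starts immediately after it.
                joined = [p for p in left_pos if p + length in right_set]
                if len(joined) >= min_freq:
                    joined_level[left + right] = joined
        level = joined_level
        length *= 2
    return levels


for k, level in enumerate(repeating_patterns_by_doubling("ABCDABCD")):
    print(2 ** k, level)
```

Because each join only intersects recorded positions, the data sequence itself is not rescanned when longer patterns are generated.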

In [7], based on the concept of bit strings, a bit index sequence representation was designed to characterize the note sequence of a music object. Similar to the bitmap representation used for transaction databases [3], it allows the frequencies of patterns to be computed from the bit sequences more efficiently. In the mining process, the frequency of a candidate pattern could be obtained by performing shift and AND operations on bit sequences and counting the number of 1s in the resulting bit sequence. This check could be performed quickly. In addition, it avoided scanning the data sequences repeatedly. In this approach, to avoid producing too many temporary bit sequences, candidates were generated by recursively appending one note to a repeating pattern already found. In other words, the candidates were generated in a depth-first manner. At the same time, the frequency of each repeating pattern P found was compared with those of its super-patterns having P as a prefix, in order to remove some of the trivial repeating patterns. Finally, the results were grouped to find the non-trivial patterns globally.
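The shift-and-AND frequency count can be illustrated with a minimal sketch that uses Python integers as bit sequences. The bit layout (bit i is set when the symbol occurs at position i) and the function names are assumptions made for illustration; the exact representation in [7] may differ.

```python
def symbol_bitmaps(seq: str) -> dict:
    """Map each symbol to an integer whose bit i is 1 iff seq[i] == symbol."""
    bitmaps = {}
    for i, symbol in enumerate(seq):
        bitmaps[symbol] = bitmaps.get(symbol, 0) | (1 << i)
    return bitmaps


def pattern_frequency(pattern: str, bitmaps: dict) -> int:
    """Count exact occurrences of `pattern` using shift and AND only."""
    if not pattern:
        return 0
    # Bit i of `acc` stays 1 only while an occurrence can start at position i.
    acc = bitmaps.get(pattern[0], 0)
    for offset, symbol in enumerate(pattern[1:], start=1):
        # Shifting aligns the symbol expected at i + offset with bit i.
        acc &= bitmaps.get(symbol, 0) >> offset
    return bin(acc).count("1")  # number of 1s = frequency


bitmaps = symbol_bitmaps("ACDEXACDE")  # built in a single scan
print(pattern_frequency("ACDE", bitmaps))
print(pattern_frequency("EA", bitmaps))
```

Once the per-symbol bit sequences are built, the frequency of any candidate is obtained from them alone, so the data sequence does not need to be scanned again.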

None of the approaches described above considered fault tolerance when matching patterns in the mining process. Therefore, implicit repeating patterns, which may occur with insertion/deletion errors as discussed in Section 1.1, could not be found.

1.2.2 Periodic Patterns Mining

Periodic pattern discovery is an important mining problem for time series data. Most past studies considered only strategies for finding full periodic patterns. However, finding periodic patterns that are exhibited by only some of the time episodes is more widely applicable, since it is less restrictive than finding full periodic patterns. Therefore, [4] proposed several algorithms for mining partial periodic patterns, each applying an interesting property: the Apriori property, the max-subpattern hit set property, and shared mining of multiple periods. Because the max-subpattern hit set method needed to scan the time series only twice, it offered better performance than the other two.

The above strategies aim to find frequent periodic patterns. However, in some situations, infrequent patterns may interest users more than frequent ones. Therefore, [11] proposed a suitable measure, called information gain, to evaluate the degree of surprise of the occurrences of a pattern. A periodic pattern was called a surprising pattern if its information gain reached a given threshold. Because surprising patterns do not satisfy the downward closure property, the pruning techniques of the Apriori algorithm could no longer be used. Thus, the concept of bounded information gain was used in the proposed InfoMiner algorithm to eliminate unnecessary checking during mining.

1.2.3 Fault-Tolerant Mining

Fault-tolerant data mining can discover more general and useful information from dirty real-world data. The problem of mining fault-tolerant frequent patterns (itemsets) was defined in [10] and solved by the proposed FT-Apriori algorithm. Two thresholds were used to decide whether a pattern is a fault-tolerant frequent pattern: (1) the fault-tolerant support threshold (min_sup^FT), which specifies the minimum number of times that a pattern must appear when fault tolerance is considered, and (2) the fault-tolerant item threshold (min_sup^item), which was used to prune patterns containing non-frequent items. The FT-Apriori algorithm suffered from generating a large number of candidates and scanning the database repeatedly. This problem became worse when the fault tolerance was increased or the support thresholds were decreased.
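How the two thresholds interact can be sketched as follows. The sketch rests on assumptions that are not stated in this section: a fault tolerance parameter `delta` giving the number of items of a pattern that a transaction may miss, and item supports counted only among the transactions that contain the pattern under fault tolerance. It tests a single candidate and is not the FT-Apriori algorithm of [10]; all names are hypothetical.

```python
def ft_contains(transaction: set, pattern: set, delta: int) -> bool:
    """True if the transaction misses at most `delta` items of the pattern."""
    return len(pattern & transaction) >= len(pattern) - delta


def is_ft_frequent(db: list, pattern: set, delta: int,
                   min_sup_ft: int, min_sup_item: int) -> bool:
    """Check one candidate itemset against both thresholds."""
    # Transactions that contain the pattern with fault tolerance.
    body = [t for t in db if ft_contains(t, pattern, delta)]
    if len(body) < min_sup_ft:
        return False  # fails the fault-tolerant support threshold
    # Every item must itself be frequent enough among those transactions.
    return all(
        sum(item in t for t in body) >= min_sup_item
        for item in pattern
    )


# Hypothetical transaction database, for illustration only.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
print(is_ft_frequent(db, {"a", "b", "c"}, delta=1, min_sup_ft=4, min_sup_item=3))
```

With `delta=0` the same itemset is contained in only one transaction, which shows how increasing the fault tolerance enlarges the set of qualifying candidates and, with it, the search effort.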

The FTP-Mine algorithm was proposed in [13] for mining fault-tolerant frequent patterns. During candidate generation, whenever a candidate pattern was judged to be a fault-tolerant frequent pattern, a longer candidate was generated by adding an item, following the descending order of item supports. A data structure called Stable was defined to assist in counting the fault-tolerant support and item support of a candidate pattern. However, when longer patterns were generated, the database had to be scanned to update the data stored in Stable. Although FTP-Mine is faster than FT-Apriori, it still has to scan the database again and again.

To speed up the mining of fault-tolerant frequent patterns, [9] proposed an algorithm named FFT-Mine (Fast Fault Tolerant frequent patterns Mining). In this approach, the information in a transaction database is represented in the form of appearing bit sequences, a bitmap representation similar to that used in [3]. By extending this idea, fault-tolerant appearing bit sequences were designed to represent which transactions in the database contain a candidate pattern under fault tolerance. The FFT-Mine algorithm generated candidate itemsets in a depth-first manner, which provided a systematic way to reduce the number of operations performed on bit sequences to obtain the fault-tolerant appearing bit sequence of a candidate itemset. Whether a candidate is a fault-tolerant frequent itemset could then be judged quickly from its fault-tolerant appearing bit sequence. The whole mining process scanned the database only once. The experimental results showed that the FFT-Mine algorithm was more efficient than the FTP-Mine algorithm.

The three surveyed papers on fault-tolerant mining all considered the problem of mining frequent patterns in transaction data sets. However, their algorithms are not applicable to mining patterns in an ordered data sequence.

1.2.4 Top-K Closed Patterns Mining

When mining frequent patterns, it is difficult for users to set an appropriate minimum support without knowing the distribution of the data in the database. Moreover, if long patterns exist in a database, the mining result may contain many short or tedious patterns with duplicated information. To avoid these problems, [5] proposed the TFP algorithm, which discovers the top-K frequent closed patterns with length no less than min_l without requiring a minimum support to be set. The TFP algorithm works on the FP-tree data structure to achieve this task. The following techniques were used to speed up the mining process. 1) Because only patterns of length at least min_l were output, transactions shorter than min_l were pruned. 2) Only the frequent closed patterns with the top-K frequencies were returned. Therefore, the minimum support maintained by the system was raised dynamically according to some heuristics, so that the FP-tree could be pruned during the mining process. 3) Among the discovered frequent patterns, the pattern with the largest support is the most likely to generate candidate patterns with large supports. Therefore, the frequent patterns were used to generate candidates in descending order of support. 4) To verify efficiently whether a frequent pattern is closed, a hash-based strategy was applied to check the containment relationships among patterns.

To solve similar problems in mining frequent sequential patterns, the TSP algorithm was proposed in [12] to discover the top-K closed sequential patterns with length no less than min_l. It adopted an idea similar to that of the TFP algorithm, raising the minimum support during the mining process. Once K closed sequential patterns whose lengths were not shorter than min_l had been found, the minimum support was reset to the least support among these patterns. The raised minimum support then pruned the search space dramatically and sped up the mining process.
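The idea of raising the minimum support once K qualifying patterns have been found can be sketched with a min-heap. This is only an illustration of the threshold bookkeeping under simplifying assumptions: candidates arrive as a flat stream of (pattern, support) pairs, the closedness check is omitted, and no search tree is pruned, so it is neither the TFP algorithm of [5] nor the TSP algorithm of [12]. All names and sample values are hypothetical.

```python
import heapq


def top_k_with_rising_threshold(candidates, k: int, min_len: int):
    """Return the top-k (support, pattern) pairs with length >= min_len,
    together with the final raised minimum support."""
    heap = []        # min-heap; heap[0] is the weakest pattern kept so far
    min_support = 1  # initial threshold before k patterns are known
    for pattern, support in candidates:
        if len(pattern) < min_len or support < min_support:
            continue  # cannot enter the top-k result
        if len(heap) < k:
            heapq.heappush(heap, (support, pattern))
        else:
            heapq.heappushpop(heap, (support, pattern))
        if len(heap) == k:
            # Raise the threshold to the least support among the k kept.
            min_support = heap[0][0]
    return sorted(heap, reverse=True), min_support


stream = [("AB", 5), ("ABC", 3), ("A", 9), ("CD", 4), ("ABCD", 2), ("BC", 6)]
print(top_k_with_rising_threshold(stream, k=2, min_len=2))
```

In a real miner the raised threshold would be used to cut off whole branches of the search space, which is where the speed-up described above comes from.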

1.3 Our Approach

Drawing on the strategies proposed in the related work, this thesis proposes an efficient way of mining the top-K non-trivial fault-tolerant repeating patterns (FT-RPs for short) with length no less than min_len in data sequences. By extending the idea of appearing bit sequences, fault-tolerant appearing bit sequences are defined to represent the locations where candidate patterns appear in a data sequence with insertion/deletion errors allowed. The fault-tolerant frequency of a candidate pattern can then be counted quickly from its fault-tolerant appearing bit sequence. In this thesis, we derive recursive formulas that obtain the fault-tolerant appearing bit sequence of a pattern systematically, so as to eliminate duplicate computations. Two algorithms, named TFTRP-Mine and RE-TFTRP-Mine, are proposed. The TFTRP-Mine algorithm generates candidate patterns in a depth-first manner. During the mining process, each FT-RP is tentatively checked for non-triviality by comparing its fault-tolerant frequency with those of the patterns generated from it. Finally, the minimum length and non-trivial requirements are checked globally, and the qualified patterns are sorted to obtain the ones with the top-K fault-tolerant frequencies. The RE-TFTRP-Mine algorithm adopts two additional strategies to increase mining efficiency. The first is to prioritize FT-RPs for candidate generation in descending order of fault-tolerant frequency once the minimum length requirement has been met. The second is to raise the minimum frequency dynamically once K FT-RPs have been found. The experimental results show that these two strategies prune the search space dramatically when K is small relative to the total number of FT-RPs.

1.4 Organization

This thesis is organized as follows. Chapter 2 defines the related terms used in this thesis. The representation of appearing bit sequences and the way of obtaining fault-tolerant appearing bit sequences are discussed in Chapter 3. Chapter 4 describes the complete processing steps of the TFTRP-Mine and RE-TFTRP-Mine algorithms. The performance evaluation of the proposed algorithms is presented in Chapter 5. Finally, Chapter 6 concludes this thesis and discusses future work.