Online mining maximal frequent structures in continuous landmark melody streams


Hua-Fu Li a,*, Suh-Yin Lee a, Man-Kwan Shan b

a Department of Computer Science and Information Engineering, National Chiao-Tung University, 1001 Ta Hsueh Road, Hsin-Chu 300, Taiwan

b Department of Computer Science, National Chengchi University, 64, Sec. 2, Zhi-nan Road, Wenshan, Taipei 116, Taiwan

Received 10 January 2004; received in revised form 13 November 2004; available online 14 April 2005

Communicated by E. Backer

Abstract

In this paper, we address the problem of online mining of maximal frequent structures (Type I and Type II melody structures) in unbounded, continuous landmark melody streams. An efficient algorithm, called MMSLMS (Maximal Melody Structures of Landmark Melody Streams), is developed for online incremental mining of maximal frequent melody substructures in one scan of the continuous melody streams. In MMSLMS, a space-efficient scheme, called CMB (Chord-set Memory Border), is proposed to constrain the upper bound of the space requirement of maximal frequent melody structures in such a streaming environment. Theoretical analysis and experimental study show that our algorithm is efficient and scalable for mining the set of all maximal melody structures in a landmark melody stream.

Keywords: Machine learning; Data mining; Landmark melody stream; Maximal melody structure; Online algorithm

1. Introduction

Recently, the database and knowledge discovery communities have focused on a new data model in which data arrives in the form of continuous, rapid, huge, unbounded streams. Such data is often referred to as data streams or streaming data. Many applications generate large volumes of streaming data in real time, such as sensor data generated from sensor networks, transaction flows in retail chains, Web record and click streams in Web applications, performance measurements in network monitoring and traffic management, call records in telecommunications, etc. In such a data stream model, knowledge discovery has two major characteristics (Babcock et al., 2002). First, the volume of a continuous stream over its lifetime could be huge and fast changing. Second, the continuous queries (not just one-shot queries) require timely answers, and the response time is short. Hence, it is not possible to store all the data in main memory or even in secondary storage. This motivates the design of in-memory summary data structures with small memory footprints that can support both one-time and continuous queries. In other words, data stream mining algorithms have to sacrifice the exactness of their analysis results by allowing some counting error.

* Corresponding author. Tel.: +886 35731901; fax: +886 35724176.

E-mail addresses: hﬂ[email protected] (H.-F. Li), sylee@csie.nctu.edu.tw (S.-Y. Lee), [email protected] (M.-K. Shan).

Although several techniques have been developed recently for discovering and analyzing the content of static music data (Bakhmutora et al., 1997; Hsu et al., 2001; Shan and Kuo, 2003; Yoshitaka and Ichikawa, 1999; Zhu et al., 2001), new techniques are needed to analyze and discover the content of streaming music data. Thus, this paper studies a new problem: how to discover the maximal melody structures in a continuous unbounded melody stream. The problem comes from the context of online music-downloading services (such as Kuro at www.music.com.tw), where the streams in question are streams of queries, i.e., music-downloading requests, sent to the server, and we are interested in finding the maximal melody structures requested by most customers during some period of time. In the computation model of music melody streams presented in Fig. 1, the melody stream processor and the summary data structure are the two major components of the music melody streaming environment. The user query processor receives user queries in the form of ⟨Timestamp, Customer-ID, Music-ID⟩, and then transforms the queries into music data (i.e., melody sequences) in the form of ⟨Timestamp, Customer-ID, Music-ID, Melody-Sequence⟩ by retrieving the music database. Note that a buffer can optionally be set for temporary storage of recent music melodies from the music melody streams.

In this paper, we present a novel algorithm MMSLMS (Maximal Melody Structures of Landmark Melody Streams) for mining the set of all maximal melody structures in a landmark melody stream. Moreover, the music melody data and patterns are represented as sets of chord-sets (Type I melody structures) or strings of chord-sets (Type II melody structures). While providing a general framework of music stream mining, algorithm MMSLMS has two major features, namely one scan of music melody streams for online frequency collection, and prefix-tree-based compact pattern representation. With these two important features, MMSLMS is able to work continuously in the unbounded streams for an arbitrarily long time with bounded resources, and to quickly answer users' queries at any time.

2. Preliminaries

2.1. Music terminologies

In this section, we describe several features of music data used in this paper. For the basic terminologies on music, we refer to (Jones, 1974). A chord is a sounding combination of three or more notes at the same time. A note is a single symbol on a musical score, indicating the pitch and duration of what is to be sung or played. A chord-set is a set of chords (Shan and Kuo, 2003).

[Fig. 1: computation model of music melody streams. Components shown: user query streams, user query processor, music database (Music ID, Melody Sequence), melody (sequence) streams, buffer, melody stream processor, summary data structure in main memory, maximal melody structure streams.]

Deﬁnition 1. The type I melody structure is represented as a set of chord-sets. The type II melody structure is represented as a string of chord-sets.

2.2. Problem statement

Let $W = \{i_1, i_2, \ldots, i_n\}$ be a set of chord-sets, called items for simplicity. A melody sequence $S$ with $m$ chord-sets is denoted by $S = \langle x_1 x_2 \cdots x_m \rangle$, where $x_i \in W$, $\forall i = 1, 2, \ldots, m$. A block is a set of melody sequences.

Definition 2. A landmark melody stream $LMS = [B_1, B_2, \ldots, B_N)$ is an infinite sequence of blocks, where each block $B_i$ is associated with a block identifier $i$, and $N$ is the identifier of the "latest" block $B_N$. The current length of LMS, written as $|LMS|$, is $N$. The blocks arrive in some order (implicitly by arrival time or explicitly by timestamp), and may be seen only once.

Definition 3. A set $Y \subseteq W$ is called an item-set, i.e., a set of chord-sets. A $k$-item-set is represented by $(y_1 y_2 \cdots y_k)$. The support of an item-set $Y$, denoted as $r(Y)$, is the number of melody sequences containing $Y$ as a subset in the LMS seen so far. An item-set is frequent if its support is greater than or equal to $minsup \cdot |LMS|$, where minsup is a user-specified minimum support threshold in the range of [0, 1], and $|LMS|$ is the current length of the landmark melody stream LMS.

Definition 4. A string $Z$ is called an item-string, i.e., a string of chord-sets. A $k$-item-string is represented by $\langle z_1 z_2 \cdots z_k \rangle$, where $z_i \in W$, $\forall i = 1, 2, \ldots, k$. The support of an item-string $Z$, denoted as $r(Z)$, is the number of melody sequences containing $Z$ as a substring in the LMS seen so far. An item-string is frequent if its support is greater than or equal to $minsup \cdot |LMS|$, where minsup is a user-specified minimum support threshold in the range of [0, 1], and $|LMS|$ is the current length of the landmark melody stream seen so far.

Definition 5. A frequent item-set (or item-string) is called maximal if it is not a subset (or substring) of any other frequent item-set (or item-string).
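The two notions of support in Definitions 3 and 4 differ only in the containment test: subset containment for item-sets, contiguous substring containment for item-strings. The following Python sketch illustrates this under simplifying assumptions that are not part of the paper: each chord-set is written as a single character, a melody sequence is a Python string, and the helper names (`support_item_set`, `support_item_string`, `is_frequent`) and the `stream_length` parameter are hypothetical.

```python
from typing import Iterable, Sequence


def support_item_set(sequences: Iterable[Sequence[str]], item_set: Iterable[str]) -> int:
    """Number of sequences that contain every item of item_set (order ignored)."""
    wanted = set(item_set)
    return sum(1 for seq in sequences if wanted <= set(seq))


def support_item_string(sequences: Iterable[Sequence[str]], item_string: Sequence[str]) -> int:
    """Number of sequences that contain item_string as a contiguous run."""
    target = tuple(item_string)
    k = len(target)
    count = 0
    for seq in sequences:
        # Slide a window of length k over the sequence.
        if any(tuple(seq[i:i + k]) == target for i in range(len(seq) - k + 1)):
            count += 1
    return count


def is_frequent(support: int, minsup: float, stream_length: int) -> bool:
    """Frequency test: support >= minsup * |LMS| (stream_length is supplied by the caller)."""
    return support >= minsup * stream_length


# Illustrative sequences only; one character stands for one chord-set.
sequences = ["acdef", "abe", "cef", "acdf", "cef", "df"]
ce_as_set = support_item_set(sequences, "ce")
ce_as_string = support_item_string(sequences, "ce")
```

The same pattern can therefore be frequent as an item-set yet infrequent as an item-string, because the sequence acdef contains c and e but not as neighbours.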

In fact, the total number of maximal melody structures is never larger than that of frequent melody structures, since every maximal structure is itself frequent. Hence, maximal melody structures are more suitable for the performance requirements of music stream mining.

Definition 6. (Problem Definition of Online Mining Maximal Melody Structures in Continuous Landmark Melody Streams.) Given a landmark melody stream $LMS = [B_1, B_2, \ldots, B_N)$ and the user-specified minimum support, minsup, in the range of [0, 1], the problem of online mining maximal melody substructures is to discover the set of all maximal melody structures, i.e., maximal item-sets or maximal item-strings, in a single scan of the landmark music stream.

2.3. Main performance requirements of music melody stream mining

The main performance challenges of mining melody streams are:

(1) Online, one-pass algorithm: each sequence in the landmark melody stream is examined once.

(2) Bounded-storage: limited memory for storing crucial, compressed information in summary data structure.

(3) Real-time: per item processing time must be low.

The proposed MMSLMS algorithm possesses all of these characteristics, while none of the previously published methods (Bakhmutora et al., 1997; Hsu et al., 2001; Shan and Kuo, 2003; Yoshitaka and Ichikawa, 1999; Zhu et al., 2001) can claim the same.

3. Online mining maximal frequent structures in landmark melody streams

3.1. Chord-set memory border

In this section, the upper bound on the number of candidate maximal melody structures is discussed, and an efficient algorithm for chord-set memory border construction is proposed.

Theorem 1. Given a set of $k$ frequent chord-sets from a landmark melody stream, an upper bound on the number of maximal frequent melody structures is $C^{k}_{\lceil k/2 \rceil}$.

Proof. Assume that there are $k$ frequent chord-sets, i.e., $k$ frequent items, in the current landmark melody stream. The solution space of mining all frequent item-sets in the worst case is $C^{k}_{1} + C^{k}_{2} + \cdots + C^{k}_{i} + \cdots + C^{k}_{\lceil k/2 \rceil} + \cdots + C^{k}_{k}$, where $C^{k}_{1}$ is the total number of frequent 1-item-sets, $C^{k}_{i}$ is that of frequent $i$-item-sets, and $C^{k}_{k}$ is that of frequent $k$-item-sets. We observe that $C^{k}_{\lceil k/2 \rceil}$ is the maximum among all the binomial coefficients $C^{k}_{i}$, $\forall i = 1, 2, \ldots, k$, in mining all frequent $i$-item-sets. In other words, the number of frequent $\lceil k/2 \rceil$-item-sets is a maximum. We will prove that the number of maximal frequent item-sets cannot be greater than $C^{k}_{\lceil k/2 \rceil}$, i.e., that $C^{k}_{\lceil k/2 \rceil}$ is the upper bound. We prove it by contradiction.

Assume that $C^{k}_{\lceil k/2 \rceil}$ is not the maximum number of maximal frequent item-sets, i.e., a larger upper bound $U$ exists, where $U > C^{k}_{\lceil k/2 \rceil}$. Consider that there are one or more frequent melody structures with length $L$, where $L > \lceil k/2 \rceil$. If $F$ is a frequent melody structure with length $\lceil k/2 \rceil + i$ and it is maximal, where $i = 1, 2, \ldots, k - \lceil k/2 \rceil$, then all of the substructures of $F$ are frequent. This follows from the anti-monotone Apriori heuristic (Agrawal and Srikant, 1994): if any $i$-item-set (or $i$-item-string) is not frequent, no $(i+1)$-item-set (or $(i+1)$-item-string) containing it can be frequent. These substructures are, however, not maximal, by Definition 5: a frequent item-set (or item-string) is called maximal if it is not a subset (or substring) of any other frequent item-set (or item-string). In other words, when one maximal frequent structure with length $L$, where $L > \lceil k/2 \rceil$, is added, at most $L$ frequent melody structures with length $L - 1$ are removed from the current collection of maximal frequent melody structures found so far. Hence, the maximum number of maximal melody structures changes from $U$ to $U'$, where $U' = U + 1 - L$, which is not greater than $C^{k}_{\lceil k/2 \rceil}$. This conflicts with the assumption $U > C^{k}_{\lceil k/2 \rceil}$ and results in a contradiction. Thus the statement is proven to be true. Therefore, we conclude that the maximum number of maximal melody structures is $C^{k}_{\lceil k/2 \rceil}$ in the problem of online mining maximal melody structures in a landmark melody stream. □
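The observation used in the proof, that the central binomial coefficient dominates all others for a given $k$, is easy to check numerically. The short Python sketch below does this for small $k$; the function name `max_maximal_bound` is a hypothetical label for the bound of Theorem 1, and the check illustrates the observation rather than replacing the proof.

```python
from math import comb


def max_maximal_bound(k: int) -> int:
    """Bound of Theorem 1: C(k, ceil(k/2)), with ceil(k/2) computed as (k + 1) // 2."""
    return comb(k, (k + 1) // 2)


def largest_level(k: int) -> int:
    """Largest number of i-item-sets over all levels i = 1..k of the item-set lattice."""
    return max(comb(k, i) for i in range(1, k + 1))


# For every small k, the bound coincides with the widest lattice level.
agreement = all(max_maximal_bound(k) == largest_level(k) for k in range(1, 21))
```

For $k = 5$ the bound evaluates to $C^{5}_{3} = 10$, the value used in Example 1.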

Example 1. Assume that there are five frequent items (i.e., frequent 1-item-sets) a, b, c, d, and e in the landmark melody stream as shown in Fig. 2. Let MF denote the total number of maximal frequent item-sets. At this point, a, b, c, d and e are maximal and $MF = C^{5}_{1}$. Based on the Apriori heuristic, $C^{5}_{2}$ frequent 2-item-sets are discovered in the worst case. In this case, these frequent 2-item-sets are also maximal and the frequent 1-item-sets are not maximal any more. The current MF is $C^{5}_{1} + C^{5}_{2} - C^{5}_{1} = C^{5}_{2}$. Next, $C^{5}_{3}$ frequent 3-item-sets are found in the worst case. These frequent 3-item-sets are maximal, but the subsets of the maximal 3-item-sets, i.e., the frequent 2-item-sets, are not maximal any more. Now, the MF becomes $C^{5}_{2} + C^{5}_{3} - C^{5}_{2} = C^{5}_{3}$. At this time, suppose the frequent 4-item-set abcd exists in this instance and it is also a maximal 4-item-set. The frequent subsets of abcd with length three, i.e., abc, abd, acd and bcd, are not maximal any more. Now, the MF becomes $C^{5}_{3} + 1 - 4 = 7$, i.e., abcd, abe, ace, ade, bce, bde and cde are the maximal frequent item-sets. The new MF is smaller than the upper bound $C^{5}_{\lceil 5/2 \rceil}$. Hence, if one or more frequent item-sets with length $L$, where $L > \lceil 5/2 \rceil$, are added into the collection of maximal frequent item-sets found so far, the value of MF changes and becomes less than $C^{5}_{\lceil 5/2 \rceil}$. Consequently, $C^{5}_{\lceil 5/2 \rceil}$ is the upper bound of the number of maximal melody structures.
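The counting in Example 1 can be reproduced by brute force. The Python sketch below builds the family consisting of all ten 3-item-sets over {a, b, c, d, e} together with the 4-item-set abcd, and keeps the members that have no proper superset in the family. The helper name `maximal_sets` is hypothetical, and the lower lattice levels are left out because they are never maximal in this situation.

```python
from itertools import combinations


def maximal_sets(family):
    """Members of the family that are not a proper subset of another member."""
    fam = set(family)
    return {s for s in fam if not any(s < t for t in fam)}


items = "abcde"
three_item_sets = {frozenset(c) for c in combinations(items, 3)}  # C(5, 3) = 10 sets
family = three_item_sets | {frozenset("abcd")}

maximal = maximal_sets(family)
labels = sorted("".join(sorted(s)) for s in maximal)
```

Adding abcd removes its four 3-item subsets from the maximal collection, so seven maximal item-sets remain, which is below the bound $C^{5}_{3} = 10$.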

The key property of algorithm MMSLMS is derived from recent work (Karp et al., 2003) on finding frequent elements in streaming data. The basic scheme of mining chord-sets from music data streams is generalized from the well-known algorithm (Fisher and Salzberg, 1982) for determining whether a value (majority element) occurs more than n/2 times, i.e., minsup = 0.5, in a data stream of length n.

The method can be extended to an arbitrary value of minsup. The scheme proceeds as follows. At any given time, a superset of at most k = 1/minsup probably frequent chord-sets is maintained. Initially, the set is empty. As a chord-set is read from the melody sequence in the current block, one of the following operations is performed. First, if the chord-set is already in the superset, its count is incremented by one. Second, if the chord-set is not contained in the superset and some entries are free, it is inserted into the superset with a count set to one. Third, if the chord-set is not contained in the superset and the superset is full, the count of each entry in the superset is decremented by one, and the chord-sets whose counts were just one (and therefore drop to zero) are removed. The method thus identifies at most k candidates for having appeared more than n/(k + 1) times, and uses O(1/minsup) memory entries.
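A minimal Python sketch of this counter-based scheme is given below, assuming that chord-sets are hashable values read one at a time and that the table holds at most ceil(1/minsup) entries. The function name `frequent_candidates` is hypothetical. The surviving entries are only candidates: the scheme guarantees that no sufficiently frequent chord-set is missed, not that every survivor is frequent, so exact counts would still have to be verified separately.

```python
import math
from typing import Hashable, Iterable, Set


def frequent_candidates(stream: Iterable[Hashable], minsup: float) -> Set[Hashable]:
    """Return a superset of the elements occurring more than minsup * n times."""
    capacity = math.ceil(1 / minsup)  # at most about 1/minsup table entries
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1            # already tracked: increment its count
        elif len(counts) < capacity:
            counts[x] = 1             # a free entry exists: start tracking x
        else:
            # Table full: decrement every entry and drop those that reach zero.
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return set(counts)


# Majority case (minsup = 0.5): 'a' occurs 5 times in a stream of length 8.
majority = frequent_candidates("abacaada", 0.5)
# General case (minsup = 0.25): 'x' occurs 6 times in a stream of length 10.
quarter = frequent_candidates("xyzxxwxyxx", 0.25)
```

In the first call the table never holds more than two entries, which matches the O(1/minsup) memory bound stated above.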

3.2. The proposed algorithm: MMSLMS

Algorithm MMSLMS has three modules: MMSLMS-buffer, MMSLMS-summary, and MMSLMS-mine. MMSLMS-buffer repeatedly reads a block of melody sequences into available main memory. All compressed and essential information about the maximal melody structures is maintained in the MMSLMS-summary. Finally, MMSLMS-mine finds the maximal melody structures in a depth-first manner in the current MMSLMS-summary. Therefore, the challenges of online mining landmark melody streams lie in the design of a space-efficient representation of the in-memory summary data structure and a fast discovery algorithm for finding maximal melody structures in real time.

3.2.1. MMSLMS-summary

First of all, the in-memory data structure MMSLMS-summary is defined and the construction process of the MMSLMS-summary is discussed. Then we use a running example to illustrate it.

[Fig. 2: lattice of the item-sets over the five frequent items a, b, c, d and e, with levels of $C^{5}_{1}$, $C^{5}_{2}$, $C^{5}_{3}$, $C^{5}_{4}$ and $C^{5}_{5}$ item-sets, from the 1-item-sets a, b, c, d, e down to the 5-item-set abcde.]

Definition 7. An MMSLMS-summary is an extended prefix-tree-based summary data structure defined below.

1. The MMSLMS-summary consists of a CMB (Chord-set Memory Border) and a set of MPI-trees (Maximal Prefix-Item trees of item-suffixes), denoted as MPI-trees(item-suffixes).

2. Each node in the MPI-tree(item-suffix) consists of four fields: item-id, support, block-id and node-link, where item-id is the item identifier of the inserted item, support registers the number of melody sequences represented by the portion of the path reaching the node with the item-id, the value of block-id assigned to a new node is the block identifier of the current block, and node-link links a node to the next node with the same item-id in the same MPI-tree, or is null if there is none.

3. Each entry in the CMB consists of four fields: item-id, support, block-id, and head of node-link (a pointer to the root node of the MPI-tree with the same item-id), abbreviated as head-link, where item-id registers which item identifier the entry represents, support records the number of melody sequences containing the item carrying the item-id, the value of block-id assigned to a new entry is the block identifier of the current block, and head-link points to the root node of the MPI-tree(item-suffix). Notice that each entry with item-id in the CMB is an item-suffix and is also the root node of the MPI-tree(item-id).

4. Each MPI-tree(item-suffix) has a specific CMB-table (Chord-set Memory Border table) with respect to the item-suffix, denoted as CMB-table(item-suffix). The CMB-table(item-suffix) is composed of four fields, namely item-id, support, block-id, and head-link. The CMB-table(item-suffix) operates in the same way as the CMB, except that the field head-link links to the first node carrying the item-id in the MPI-tree(item-suffix). Notice that |CMB-table(item-suffix)| = |CMB| in the worst case, where |CMB| denotes the total number of entries in the CMB.

The construction of the MMSLMS-summary is described as follows. First of all, MMSLMS reads a melody sequence $S$ from the current block. Then, MMSLMS projects the sequence $S$ into many subsequences and inserts these subsequences into the CMB and MPI-trees. In detail, each melody sequence $S = \langle x_1, x_2, \ldots, x_m \rangle$ in the current block is projected by inserting $m$ item-suffix melody subsequences into the MMSLMS-summary. In other words, the melody sequence $S = \langle x_1, x_2, \ldots, x_m \rangle$ is converted into $m$ melody subsequences; that is, $\langle x_1, x_2, \ldots, x_m \rangle$, $\langle x_2, x_3, \ldots, x_m \rangle$, ..., $\langle x_{m-1}, x_m \rangle$, and $\langle x_m \rangle$. The $m$ melody subsequences are called item-suffix sequences, since the first item of each melody subsequence is an item-suffix of the original melody sequence $S$. This step is called sequence projection, and is denoted as Sequence-Projection$(S) = \{x_1|S, x_2|S, \ldots, x_i|S, \ldots, x_m|S\}$, where $x_i|S = \langle x_i, x_{i+1}, \ldots, x_m \rangle$, $\forall i = 1, 2, \ldots, m$. Furthermore, the cost of sequence projection of a melody sequence with length $m$ is $(m^2 + m)/2$, i.e., $m + (m - 1) + \cdots + 2 + 1$.
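Sequence projection amounts to enumerating every suffix of the melody sequence. The Python sketch below shows this for a sequence written as a string of single-character chord-sets, which is a notational assumption; the function names `sequence_projection` and `projection_cost` are hypothetical.

```python
from typing import List


def sequence_projection(sequence: str) -> List[str]:
    """Return the m item-suffix sequences x_i|S = <x_i, ..., x_m> for i = 1..m."""
    return [sequence[i:] for i in range(len(sequence))]


def projection_cost(m: int) -> int:
    """Total number of items written by the projection: m + (m - 1) + ... + 1."""
    return (m * m + m) // 2


suffixes = sequence_projection("acdef")
total_items = sum(len(s) for s in suffixes)
```

For the sequence acdef of length 5, the five suffixes contain 15 items in total, in agreement with $(m^2 + m)/2$.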

After Sequence-Projection(S), the MMSLMS algorithm removes the original melody sequence $S$ from the MMSLMS-buffer. Next, the items in these item-suffix sequences are inserted into the CMB and the MPI-trees(item-suffixes) as branches, and the CMB-tables(item-suffixes) are updated according to the suffixes. If an item-set (or item-string) shares a prefix with an item-set (or item-string) already in the tree, the new item-set (or item-string) shares the corresponding prefix of the branch representing that item-set (or item-string). In addition, a support counter is associated with each node in the tree. The counter is updated whenever an item-suffix sequence is inserted along a path that passes through the node, including when the insertion creates a new branch.
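The prefix-sharing insertion can be sketched as an ordinary trie with a support counter per node. The Python sketch below is a simplified illustration under stated assumptions, not the paper's data structure: it keeps one tree per leading item in a dictionary, stores children in a dictionary, and omits the node-links, the CMB and the CMB-tables entirely. The names `Node` and `insert_suffix` are hypothetical.

```python
from typing import Dict


class Node:
    """Simplified MPI-tree node: item-id, support, block-id and child pointers."""

    def __init__(self, item_id: str, block_id: int) -> None:
        self.item_id = item_id
        self.support = 0
        self.block_id = block_id
        self.children: Dict[str, "Node"] = {}


def insert_suffix(trees: Dict[str, Node], suffix: str, block_id: int) -> None:
    """Insert one item-suffix sequence into the tree rooted at its first item."""
    if not suffix:
        return
    root_item = suffix[0]
    if root_item not in trees:
        trees[root_item] = Node(root_item, block_id)
    node = trees[root_item]
    node.support += 1
    for item in suffix[1:]:
        child = node.children.get(item)
        if child is None:                 # no shared prefix from here on: new branch
            child = Node(item, block_id)
            node.children[item] = child
        child.support += 1                # shared or new, the path gains one sequence
        node = child


# Insert every suffix of every sequence of an illustrative block (block-id 1).
trees: Dict[str, Node] = {}
for sequence in ["acdef", "abe", "cef", "acdf", "cef", "df"]:
    for i in range(len(sequence)):
        insert_suffix(trees, sequence[i:], block_id=1)
```

After these insertions the tree rooted at c has two branches, c-d and c-e, each shared by two item-suffix sequences, so no path is stored twice.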

In order to limit the memory size of the summary data structure MMSLMS-summary, a space pruning technique is performed. Let the minimum support threshold be minsup, in the range of [0, 1], and the current length of the landmark melody stream be $N$. The rule for space pruning is as follows. A melody structure $E$ is deleted if $E.support < minsup \cdot N$. $E$ is called an infrequent melody structure. After pruning all infrequent melody structures from the CMB, CMB-table(item-suffix) and MPI-trees, the MMSLMS-summary contains all information about frequent melody structures of the landmark melody stream generated so far. Example 2 below illustrates the algorithm step by step. Note that the angle brackets ⟨ ⟩ of sequences are omitted for clear presentation.

Example 2. Let a block $B_j$ of the landmark melody stream LMS be ⟨acdef⟩, ⟨abe⟩, ⟨cef⟩, ⟨acdf⟩, ⟨cef⟩ and ⟨df⟩, and the minimum support threshold be 0.5 (i.e., minsup = 0.5), where a, b, c, d, e and f are chord-sets (i.e., items) in the landmark melody stream seen so far. The MMSLMS algorithm constructs the MMSLMS-summary with respect to the incoming block $B_j$ and prunes all infrequent item-sets from the current MMSLMS-summary in the following steps. Note that each node or entry represented as $(f_1 : f_2 : f_3)$ is composed of three fields: item-id, support, and block-id. For example, (a:2:j) indicates that, from block $B_j$, item a appeared twice.

Step 1: MMSLMS reads the current block $B_j$ into main memory for constructing the MMSLMS-summary.

(a) First melody sequence acdef: First of all, the MMSLMS algorithm reads the first melody sequence acdef and calls Sequence-Projection(acdef). Then MMSLMS inserts the item-suffix sequences acdef, cdef, def, ef, and f into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(c), CMB-table(c)], [MPI-tree(d), CMB-table(d)], [MPI-tree(e), CMB-table(e)], and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 3. In the following sub-steps, as demonstrated in Fig. 4 through Fig. 9, the head-links of each CMB-table(item-suffix) are omitted for concise presentation.

(b) Second melody sequence abe: MMSLMS reads the second melody sequence abe and calls Sequence-Projection(abe). Next, MMSLMS inserts the item-suffix sequences abe, be and e into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(b), CMB-table(b)] and [MPI-tree(e), CMB-table(e)], respectively. The result is shown in Fig. 4.

(c) Third melody sequence cef: MMSLMS reads the third melody sequence cef and calls Sequence-Projection(cef). Then, MMSLMS inserts the item-suffix sequences cef, ef and f into the CMB, [MPI-tree(c), CMB-table(c)], [MPI-tree(e), CMB-table(e)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 5.

(d) Fourth melody sequence acdf: MMSLMS reads the fourth melody sequence acdf and calls Sequence-Projection(acdf). Next, MMSLMS inserts the item-suffix sequences acdf, cdf, df and f into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(c), CMB-table(c)], [MPI-tree(d), CMB-table(d)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 6.

(e) Fifth melody sequence cef: MMSLMS reads the fifth melody sequence cef and calls Sequence-Projection(cef). Then, MMSLMS inserts the item-suffix sequences cef, ef and f into the CMB, [MPI-tree(c), CMB-table(c)], [MPI-tree(e), CMB-table(e)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 7.

(f) Sixth melody sequence df: MMSLMS reads the sixth melody sequence df and calls Sequence-Projection(df). Next, MMSLMS inserts the item-suffix sequences df and f into the CMB, [MPI-tree(d), CMB-table(d)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 8.

Fig. 3. MMSLMS-summary construction after inserting first melody sequence acdef in block $B_j$.

Step 2: After processing the current block $B_j$, all infrequent melody structures are pruned by MMSLMS from the current MMSLMS-summary. At this time, MMSLMS deletes the MPI-tree(b) and its corresponding CMB-table(b), and prunes the entry b from the CMB, since item b is an infrequent item; that is, $r(b) < minsup \cdot |LMS|$, where $r(b) = 1$ and $minsup \cdot |LMS| = 0.5 \cdot 6 = 3$. Next, MMSLMS reconstructs the MPI-tree(a) by eliminating the information about the infrequent item b. The result is shown in Fig. 9.
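The item supports behind Step 2 can be recomputed directly from the block of Example 2. The Python sketch below counts, for each chord-set, the number of melody sequences containing it, and then applies the pruning rule with the threshold 0.5 · 6 = 3 used in the example. It covers only the CMB-level item counts, not the MPI-trees or CMB-tables, and the function names `item_supports` and `prune_infrequent` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def item_supports(block: Iterable[str]) -> Dict[str, int]:
    """Support of each item: the number of sequences in the block that contain it."""
    supports: Counter = Counter()
    for sequence in block:
        supports.update(set(sequence))  # count each item once per sequence
    return dict(supports)


def prune_infrequent(supports: Dict[str, int], minsup: float, stream_length: int) -> Dict[str, int]:
    """Keep only the entries whose support is at least minsup * stream_length."""
    threshold = minsup * stream_length
    return {item: s for item, s in supports.items() if s >= threshold}


block = ["acdef", "abe", "cef", "acdf", "cef", "df"]
supports = item_supports(block)
# As in Example 2, the stream length is taken to be the number of sequences seen, i.e. 6.
kept = prune_infrequent(supports, minsup=0.5, stream_length=len(block))
```

Only item b, with support 1, falls below the threshold of 3, so it is the only entry removed.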

The description above is the construction process of the MMSLMS-summary with respect to the incoming block over a landmark melody stream. The MMSLMS-summary construction algorithm is depicted in Fig. 10.

3.2.2. MMSLMS-mine

In this section, the module MMSLMS-mine, which mines maximal melody item-sets and maximal melody item-strings from the current MMSLMS-summary, is discussed (Fig. 11).

Fig. 5. MMSLMS-summary construction after inserting third melody sequence cef.

First of all, given an entry id (from left to right, for example) in the current CMB, MMSLMS-mine generates candidate maximal melody structures by a top-down approach. The top-down method uses the frequent items (i.e., chord-sets) of CMB-table(id) and the item id to generate the candidates. The generating order of these candidates is determined by the size of the item-set, from item-set size 1 + |CMB-table(id)| down to size 2. Note that the generating order ends at 2-item-sets because all frequent entries in the current CMB-table are frequent 1-item-sets. Then MMSLMS-mine checks whether these candidates are frequent or not by traversing the MPI-tree(id). The MPI-tree traversing principle is described as follows. First, MMSLMS-mine generates a candidate maximal melody item-set, a (j + 1)-item-set, containing the item id and all items of the CMB-table(id), where |CMB-table(id)| = j. Second, MMSLMS-mine traverses the MPI-tree via the node-links of the candidate. After that, if the candidate is not a frequent item-set, MMSLMS-mine generates substructure candidates, i.e., j-item-sets. Next, MMSLMS-mine executes the same MPI-tree traversing scheme for item-set counting. The process stops when MMSLMS-mine has found all maximal frequent melody structures in the current MMSLMS-summary. Moreover, MMSLMS-mine stores these maximal melody structures into a temporal pattern list, called MMSLMS-list. Notice that MMSLMS-mine can find the set of frequent 2-item-sets by combining the item-suffix id with the frequent items of the CMB-table(id).

Fig. 7. MMSLMS-summary construction after inserting fifth melody sequence cef.
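The top-down candidate generation for a single CMB entry can be sketched as follows. This Python sketch makes several assumptions that are not taken from the paper: supports are counted by scanning the item-suffix sequences of the entry directly, as a stand-in for traversing MPI-tree(id) via node-links; every melody sequence lists its chord-sets in one fixed order without repeats, so an item-set that contains the entry item is fully represented in the suffixes starting with that item; and the function name `mine_entry` is hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List


def mine_entry(item: str, suffixes: List[str], threshold: float) -> List[FrozenSet[str]]:
    """Maximal frequent item-sets containing `item`, generated from large to small."""
    # Frequent items of the (simplified) CMB-table of this entry.
    counts = {}
    for s in suffixes:
        for x in set(s[1:]):
            counts[x] = counts.get(x, 0) + 1
    table = sorted(x for x, c in counts.items() if c >= threshold)

    found: List[FrozenSet[str]] = []
    # Candidate sizes run from 1 + |table| down to 2 (the entry item plus `size` others).
    for size in range(len(table), 0, -1):
        for combo in combinations(table, size):
            candidate = frozenset((item,) + combo)
            if any(candidate <= m for m in found):
                continue  # substructure of an already accepted maximal item-set
            support = sum(1 for s in suffixes if candidate <= set(s))
            if support >= threshold:
                found.append(candidate)
    # With no frequent co-occurring item, the entry item alone is the result.
    if not found and len(suffixes) >= threshold:
        found.append(frozenset([item]))
    return found


# Item-suffix sequences of three entries for the block of Example 2, threshold 0.5 * 6 = 3.
from_c = mine_entry("c", ["cdef", "cef", "cdf", "cef"], 3)
from_a = mine_entry("a", ["acdef", "abe", "acdf"], 3)
from_d = mine_entry("d", ["def", "df", "df"], 3)
```

Item-sets returned for one entry may still be substructures of item-sets found for an earlier entry, so a final check against the global pattern list remains necessary.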

Example 3. This example illustrates the mining of the maximal melody item-sets from the current MMSLMS-summary in Fig. 9. Let the minimum support threshold be 0.5, i.e., minsup = 0.5.

(1) We start the maximal melody item-set mining scheme from the frequent item a. At this moment, the only frequent item-set is the 1-item-set (a), since the supports of items c, d, e and f in the CMB-table(a) are less than $minsup \cdot |LMS|$, where $|LMS| = |B_j| = 6$.

(2) Next, MMSLMS-mine starts on the second entry c for maximal melody item-set mining. MMSLMS-mine generates a candidate maximal 3-item-set (cef), and traverses the MPI-tree(c) for counting its support. As a result, the candidate (cef) is a maximal frequent item-set, since its support is 3, and it is not a substructure of any other maximal melody structure within the MMSLMS-list. Now, MMSLMS-mine stores the maximal item-set (cef) into the MMSLMS-list.

(3) MMSLMS-mine starts on the third entry d and generates a maximal frequent 2-item-set (df). We store this 2-item-set (df) into the MMSLMS-list because it is not a substructure of any other maximal melody structure within the current MMSLMS-list.

Algorithm 1 (MMSLMS-summary Construction)

Input: A landmark melody stream, LMS = [B1, B2, …, BN), and a user-specified minimum support threshold, minsup.

Output: A current MMSLMS-summary.

    CMB = ∅   /* initialize the CMB to empty */
    foreach block Bj do   /* j = 1, 2, …, N */
        foreach melody sequence S = <x1, x2, …, xm> ∈ Bj do
            foreach item xi ∈ S do   /* the CMB maintenance */
                if xi ∉ CMB then
                    create a new entry (xi, 1, j, head-link) into the CMB;   /* the entry form is (item-id, support, block-id, head-link) */
                else   /* the entry is already in the CMB */
                    xi.support = xi.support + 1;   /* increment the support of item-id xi by one */
                end if
            end for
            call Sequence-Projection(S);   /* project the sequence with every prefix-item xi for the construction of MPI-tree(xi) */
        end for
        call MMSLMS-summary-pruning(MMSLMS-summary, minsup, |LMS|);
    end for

Subroutine Sequence-Projection

Input: A melody sequence S = <x1, x2, …, xm> and the current block-id j;

Output: MPI-trees(xi), ∀i = 1, 2, …, m;

    foreach item xi (i = 1, 2, …, m) do
        MPI-tree-maintenance([xi|X], MPI-tree(xi), j);
        /* X = x1, x2, …, xm is the original melody sequence */
        /* [xi|X] is an item-suffix melody sequence with item-suffix xi */
    end for

Subroutine MPI-tree-maintenance

Input: An item-suffix melody sequence <xi, xi+1, …, xm> (i = 1, 2, …, m), MPI-tree(xi) and the current block-id j;

Output: The modified MPI-tree(xi), where i = 1, 2, …, m;

    T = MPI-tree(xi);   /* start at the top of the tree */
    foreach item x in <xi, xi+1, …, xm> do
        if T has a child Y such that Y.item-id = x.item-id then
            Y.support = Y.support + 1;   /* increment Y's support by one */
        else
            create a new node Y = (item-id, 1, j, node-link);
            /* initialize Y's support to 1, link its parent link to T, and link its node-link to the nodes with the same item-id via the node-link structure */
        end if
        T = Y;   /* continue from Y for the next item */
    end for

(4) On the fourth entry e, since its maximal melody item-set (ef) is a substructure of the previous maximal melody item-set (cef), MMSLMS-mine does not store it into the MMSLMS-list.

(5) Finally, MMSLMS-mine processes the entry f, and generates a maximal frequent 1-item-set (f) directly, since the CMB-table(f) is empty. MMSLMS-mine does not store it into the MMSLMS-list, because it is a substructure of the already generated maximal item-set (cef).

In conclusion, the maximal Type I melody structures determined by algorithm MMSLMS are (a), (cef) and (df). Now, we describe the mining principle of maximal melody item-strings, i.e., maximal Type II melody structures. MMSLMS-mine generates maximal melody item-strings from the current MMSLMS-summary shown in Fig. 9 by a depth-first-search (DFS) approach. Hence, the maximal Type II melody structures determined by algorithm MMSLMS are ⟨a⟩, ⟨c⟩, ⟨d⟩ and ⟨ef⟩. Note that ⟨f⟩ is not a maximal melody item-string, since it is a substring of the existing maximal melody 2-item-string ⟨ef⟩.

Algorithm 2 (MMSLMS-mine)

Input: A current MMSLMS-summary, the current length of the landmark melody stream |LMS|, and a minimum support threshold minsup.

Output: A temporal-pattern-list, MMSLMS-list, of maximal melody structures.

    MMSLMS-list = ∅;
    foreach entry e in the current CMB do
        do
            generate a candidate maximal melody structure E with size |E|;   /* initially |E| = 1 + |CMB-table(e)| */
            count E.support by traversing the MPI-tree(e);
            if E.support ≥ minsup · |LMS| then
                if E ∉ MMSLMS-list and E is not a substructure of any other maximal frequent structure contained in the MMSLMS-list then
                    add E into the MMSLMS-list;
                    remove E's substructures from the MMSLMS-list;
                end if
            else   /* E is not a frequent melody structure */
                enumerate E into melody substructures with size |E| − 1;
            end if
        until MMSLMS-mine finds the set of all maximal frequent structures with respect to the item e;
    end for

Fig. 11. Algorithm of MMSLMS-mine.

Subroutine MMSLMS-summary-pruning

Input: An MMSLMS-summary, a user-specified minimum support threshold, minsup, and the current length of LMS, |LMS|;

Output: An MMSLMS-summary which contains the set of all frequent melody structures.

    foreach entry xi (i = 1, 2, …, d) ∈ CMB, where d = |CMB| do
        if xi.support < minsup · |LMS| then   /* xi is not a frequent item */
            delete those nodes (item-id = xi) via node-link structure;
            merge the fragmented sub-trees;   /* a simple way is to reinsert or to join the remainder sub-trees into the MPI-tree */
            delete MPI-tree(xi);
            delete the entry xi from the CMB;
        end if
    end for
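The Type I and Type II results of Example 3 can be cross-checked by exhaustive enumeration over the block of Example 2. The Python sketch below is a brute-force check, not the MMSLMS-mine procedure: it enumerates every item-set and every substring, counts supports directly over the six sequences, and keeps the maximal frequent patterns. Chord-sets are written as single characters, the threshold is 0.5 · 6 = 3 as in the example, and all function names are hypothetical.

```python
from itertools import combinations

BLOCK = ["acdef", "abe", "cef", "acdf", "cef", "df"]
THRESHOLD = 0.5 * len(BLOCK)  # minsup * |LMS| = 3, as in Example 2


def frequent_item_sets(block, threshold):
    """All item-sets contained (as subsets) in at least `threshold` sequences."""
    items = sorted(set("".join(block)))
    result = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if sum(candidate <= set(s) for s in block) >= threshold:
                result.add(candidate)
    return result


def frequent_item_strings(block, threshold):
    """All substrings contained (contiguously) in at least `threshold` sequences."""
    substrings = {s[i:j] for s in block
                  for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return {z for z in substrings if sum(z in s for s in block) >= threshold}


def maximal(patterns, contained_in):
    """Patterns that are not contained in any other pattern of the collection."""
    return {p for p in patterns
            if not any(p != q and contained_in(p, q) for q in patterns)}


type1 = maximal(frequent_item_sets(BLOCK, THRESHOLD), lambda p, q: p <= q)
type2 = maximal(frequent_item_strings(BLOCK, THRESHOLD), lambda p, q: p in q)
type1_labels = {"".join(sorted(p)) for p in type1}
```

The enumeration is exponential in the number of items, so it is useful only as a correctness check on small examples like this one.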

Based on the algorithm MMSLMS-mine in Fig. 11, we have the following lemma.

Lemma 2. A melody structure is a maximal melody structure if and only if it is generated by algorithm MMSLMS-mine.

Proof. Algorithm MMSLMS-mine is composed of two major steps: frequent melody structure selection (step 1) and maximal melody structure verification (step 2). These steps are performed in sequence. First, in the frequent melody structure selection step, MMSLMS-mine finds frequent melody structures based on the Apriori property: if any i-item-set (or i-item-string) is not frequent, then no (i + 1)-item-set (or (i + 1)-item-string) containing it can be frequent. That means MMSLMS-mine does not miss any frequent melody structure. Next, in step 2, MMSLMS-mine checks each frequent melody structure generated in step 1 against the maximal melody structures in the MMSLMS-list, a temporal pattern list of maximal melody structures. If this frequent melody structure is a substructure (i.e., subset or sub-string) of any other structure within the MMSLMS-list, then it is not a maximal melody structure according to Definition 5; otherwise it is a candidate maximal melody structure before the next execution of step 2. By repeating step 1 and step 2, MMSLMS-mine generates all the maximal melody structures contained in the MMSLMS-list. Hence, we have the lemma: a melody structure is a maximal melody structure if and only if it is generated by algorithm MMSLMS-mine.
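The two steps in the proof can be sketched for item-sets (Type I structures) as a top-down search. This is a simplified illustration under stated assumptions, not the MMSLMS-mine algorithm itself: supports are counted directly over a small in-memory list of sequences rather than read from an MMSLMS-summary, the initial candidates are the sequences themselves, and `mine_maximal` and `min_count` (the absolute count corresponding to minsup · |LMS|) are hypothetical names.

```python
from itertools import combinations

def support(itemset, sequences):
    """Number of sequences that contain every item of `itemset`."""
    return sum(1 for seq in sequences if itemset <= seq)

def mine_maximal(sequences, min_count):
    """Top-down search for maximal frequent item-sets (simplified sketch)."""
    sets = [frozenset(s) for s in sequences]
    maximal = []                      # plays the role of the MMSLMS-list
    candidates = set(sets)            # largest candidates are tried first
    size = max((len(c) for c in candidates), default=0)
    while size > 0:
        for cand in {c for c in candidates if len(c) == size}:
            # Step 2: skip a candidate that is covered by a maximal structure.
            if any(cand <= m for m in maximal):
                continue
            # Step 1: frequency test.
            if support(cand, sets) >= min_count:
                maximal.append(cand)
            elif size > 1:
                # Not frequent: enumerate the substructures of size |E| - 1.
                candidates.update(
                    frozenset(sub) for sub in combinations(cand, size - 1)
                )
        size -= 1
    return maximal

sequences = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
result = mine_maximal(sequences, min_count=2)
print(sorted(sorted(s) for s in result))
# [['a', 'b'], ['a', 'c']]
```

Because candidates are processed from the largest size downwards, a structure added to the list never has a frequent superset added later, so the "remove E's substructures" step of the pseudocode is not needed in this sketch. The enumeration is exponential in the worst case, in line with the space analysis of the paper.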

Space complexity analysis: The space requirement of MMSLMS-summary consists of two parts: the working space needed to create a CMB and the CMB-tables, and the storage space needed to maintain the set of MPI-trees. Assume that the CMB contains k frequent chord-sets $e_1, e_2, \ldots, e_i, \ldots, e_k$ at any time. Based on Theorem 1, we know that there are at most $C^{k}_{\lceil k/2 \rceil}$ maximal frequent chord-sets in the landmark melody stream seen so far. If we construct the MMSLMS-summary for all these maximal frequent melody structures, the maximum height of all the MPI-trees is $\lceil k/2 \rceil$. There are $1 + C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}$ nodes in MPI-tree($e_1$), where the value 1 indicates the root node $e_1$ of MPI-tree($e_1$), and $C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}$ are the internal and leaf nodes of MPI-tree($e_1$). Moreover, there are $1 + C^{k-2}_{1} + C^{k-2}_{2} + \cdots + C^{k-2}_{\lceil k/2 \rceil - 1}$ nodes in MPI-tree($e_2$), ..., $1 + C^{k-i}_{1} + C^{k-i}_{2} + \cdots + C^{k-i}_{\lceil k/2 \rceil - 1}$ nodes in MPI-tree($e_i$), ..., $1 + C^{k-(k-1)}_{1}$ nodes in MPI-tree($e_{k-1}$), and 1 (root) node in MPI-tree($e_k$). Thus, the total number of nodes of MPI-trees in the MMSLMS-summary is

$$
\begin{aligned}
&\left(1 + C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}\right) + \left(1 + C^{k-2}_{1} + C^{k-2}_{2} + \cdots + C^{k-2}_{\lceil k/2 \rceil - 1}\right) + \cdots \\
&\quad + \left(1 + C^{k-i}_{1} + C^{k-i}_{2} + \cdots + C^{k-i}_{\lceil k/2 \rceil - 1}\right) + \cdots + \left(1 + C^{k-(k-1)}_{1}\right) + 1 \\
&= \left(C^{k-1}_{0} + C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}\right) + \left(C^{k-2}_{0} + C^{k-2}_{1} + C^{k-2}_{2} + \cdots + C^{k-2}_{\lceil k/2 \rceil - 1}\right) + \cdots \\
&\quad + \left(C^{k-i}_{0} + C^{k-i}_{1} + C^{k-i}_{2} + \cdots + C^{k-i}_{\lceil k/2 \rceil - 1}\right) + \cdots + \left(C^{k-(k-1)}_{0} + C^{k-(k-1)}_{1}\right) + C^{k-k}_{0}.
\end{aligned}
$$

This number equals $C^{k}_{1} + C^{k}_{2} + \cdots + C^{k}_{\lceil k/2 \rceil}$ based on Pascal's Identity: let x and y be positive integers with $x \ge y$. Then $C^{x+1}_{y} = C^{x}_{y-1} + C^{x}_{y}$.

Moreover, the worst case working space requires at most $(k^2 + k)/2$ entries, which is based on the process of Sequence-Projection. Thus, the space requirement of MMSLMS-summary is $(k^2 + k)/2 + C^{k}_{1} + C^{k}_{2} + \cdots + C^{k}_{\lceil k/2 \rceil}$. Finally, the upper bound of the space requirement is $O(2^k)$. □

The worst case space complexity of algorithm MMSLMS can be analyzed in terms of melody

sequence size as described below. Assume that the average melody sequence size is m, the current length of the landmark melody stream is N, and the minimum support threshold is minsup. The space requirement of algorithm MMSLMS is composed of two parts, working space and storage space. The working space is used to store the CMB and CMB-tables, and the storage space is used to store the MPI-trees. The working space requirement is $m + (m - 1) + (m - 2) + \cdots + 1$ and the storage space requirement is also about $m + (m - 1) + (m - 2) + \cdots + 1$. Hence, the space requirement of MMSLMS for inserting a melody sequence with average size m into MMSLMS-summary is $2[m + (m - 1) + (m - 2) + \cdots + 1] = m^2 + m$, and the space requirement of the stream generated so far in the worst case is $N \cdot (m^2 + m)$. Note that in this analysis we assume that $minsup \cdot N$ is just one, so that every item of an incoming melody sequence is a frequent item, which is the worst case. However, the value of N increases as time progresses. Hence, the pruning mechanism of MMSLMS-summary is deployed to keep the memory requirement below an upper bound.

From the space complexity analysis, it is not surprising to find that the space complexity grows exponentially in the number of frequent items in the CMB, as all frequent item-sets are represented in the data structure. It is also the solution space of the problem.
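The node-count identity used in the space analysis can be checked numerically. The following Python sketch sums the per-tree node counts of MPI-tree(e1), ..., MPI-tree(ek) as stated in the analysis and compares the total with the closed form and with the 2^k bound. The function names are illustrative, and the check covers only small values of k; it is a sanity check, not a proof.

```python
from math import comb

def mpi_tree_nodes(k):
    """Sum of C(k-i, r) for i = 1..k and r = 0..ceil(k/2)-1."""
    height = (k + 1) // 2             # ceil(k / 2) without floating point
    # math.comb(n, r) returns 0 when r > n, which truncates the short trees.
    return sum(comb(k - i, r) for i in range(1, k + 1) for r in range(height))

def closed_form(k):
    """C(k,1) + C(k,2) + ... + C(k, ceil(k/2))."""
    return sum(comb(k, r) for r in range(1, (k + 1) // 2 + 1))

for k in range(1, 13):
    total = mpi_tree_nodes(k)
    assert total == closed_form(k) <= 2 ** k
    print(k, total, 2 ** k)
```

For k = 3, for instance, the three trees contribute 3 + 2 + 1 = 6 nodes, which equals C(3,1) + C(3,2) = 6 and stays below 2^3 = 8.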

Time complexity analysis: From the construction process of MMSLMS-summary, we can see that exactly one scan of a landmark melody stream is required. The cost (denoted by Time-cost(S)) of inserting a melody sequence S into the MMSLMS-summary by sequence projection is $|S| + (|S| - 1) + \cdots + 1 = (|S|^2 + |S|)/2$; that is, $O(|freq(S)|^2)$, where freq(S) is the set of frequent items in the melody sequence S. Note that $|freq(S)| \le |S|$, where $|S|$ denotes the size of the melody sequence S. Because the items within the CMB are frequent items, the cost of inserting a melody sequence S can be stated in terms of the size of the CMB: Time-cost(S) = $O(|S'|^2)$, where $|S'|$ is the number of chord-sets of melody sequence S within the CMB. In the worst case, if the melody sequence S contains all the frequent items within the CMB, Time-cost(S) = $O(|CMB|^2)$.
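The quadratic insertion cost can be made concrete with a short Python sketch. It assumes that sequence projection turns a sequence of size |S| into its |S| suffixes, which is consistent with the cost |S| + (|S| − 1) + ... + 1 given above; the function names are illustrative.

```python
def sequence_projection(sequence):
    """Return every suffix of the sequence (assumed form of the projection)."""
    return [tuple(sequence[i:]) for i in range(len(sequence))]

def projection_cost(sequence):
    """Total number of items written when all projections are inserted."""
    return sum(len(p) for p in sequence_projection(sequence))

s = ["a", "c", "d", "e"]
print(sequence_projection(s))
# [('a', 'c', 'd', 'e'), ('c', 'd', 'e'), ('d', 'e'), ('e',)]
print(projection_cost(s), (len(s) ** 2 + len(s)) // 2)
# 10 10
```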

4. Experimental results

In this section, we first describe the data and experiment set-up used to evaluate the performance of the proposed algorithm, and then report our experimental results.

4.1. Synthetic data and experiment set-up

To evaluate the performance of the MMSLMS algorithm, two experiments are performed. The experiments were carried out on data produced by the IBM synthetic market-basket test data generator proposed by Agrawal and Srikant (1994). Two data streams, denoted by S10.I5.D1000K and S30.I15.D1000K, of 1 million melody sequences each are studied. The first one, S10.I5.D1000K with 1 K unique items, has an average melody sequence size of 10 and an average maximal potentially frequent structure size of 5. The second one, S30.I15.D1000K with 10 K unique items, has an average melody sequence size of 30 and an average maximal potentially frequent structure size of 15. In all experiments, the melody sequences of each dataset are read in sequence to simulate the environment of a landmark melody stream. All the experiments are performed on a 1066-MHz Pentium III processor with 128 megabytes of main memory, running Microsoft Windows XP. In addition, all the programs are written in Microsoft Visual C++ 6.0.

4.2. Experimental results

In the first experiment, two primary factors, memory and execution time, are examined in the online mining of a landmark melody stream, since both should be bounded online as time advances. In Fig. 12(a), the execution time grows smoothly as the dataset size increases. This is because the average execution times of algorithm MMSLMS for datasets S10.I5 and S30.I15 are about 12 s and 25 s per block, respectively, where a block is composed of 50,000 melody sequences. The memory usage in Fig. 12(b) for both synthetic datasets is stable as time progresses, indicating the feasibility of algorithm MMSLMS. Note that the synthetic landmark melody stream is partitioned into blocks of size 50 K.
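As a back-of-the-envelope illustration of what a constant per-block time implies, the short Python sketch below extrapolates the reported averages (12 s and 25 s per 50,000-sequence block) to the full 1-million-sequence streams. The totals it prints are estimates derived from those averages under the assumption that every block takes the average time; they are not measurements from the paper, and the function name is illustrative.

```python
def estimated_total_seconds(num_sequences, block_size, seconds_per_block):
    """Linear extrapolation: number of blocks times average time per block."""
    blocks = num_sequences / block_size
    return blocks * seconds_per_block

for name, per_block in [("S10.I5.D1000K", 12), ("S30.I15.D1000K", 25)]:
    total = estimated_total_seconds(1_000_000, 50_000, per_block)
    print(f"{name}: about {total:.0f} s for 1,000,000 sequences")
```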

In the second experiment, we investigate the scalability and relative error of algorithm MMSLMS with respect to varying minimum supports. The relative error is defined as the difference between the measured support and the actual support, divided by the actual support. In Fig. 13(a), the execution time grows smoothly as the dataset size increases (assuming minsup = 0.01%), indicating linear scalability. Fig. 13(b) shows that the relative error decreases as minsup decreases, i.e., as the size of the CMB increases. Generally, the more frequent items are maintained in the CMB, the more accurate the mining result is.
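The relative error defined above can be written as a one-line function. In this Python sketch the absolute value of the difference is an assumption (the text only says "difference"), and the sample supports are made-up numbers used purely to show the arithmetic.

```python
def relative_error(measured_support, actual_support):
    """|measured - actual| / actual; the absolute value is an assumption."""
    if actual_support == 0:
        raise ValueError("actual support must be non-zero")
    return abs(measured_support - actual_support) / actual_support

# Hypothetical supports, for illustration only.
print(relative_error(95, 100))   # 0.05
```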

5. Conclusions

In this paper, we proposed a single-pass algorithm, MMSLMS, to discover and maintain all maximal melody structures in a landmark model that contains all the melody sequences in a data stream. In the MMSLMS algorithm, an efficient in-memory summary data structure, MMSLMS-summary, is developed to record all maximal frequent structures in the current landmark model. In addition, MMSLMS uses a space-efficient scheme, the Chord-set Memory Border (CMB), to guarantee the upper bound of the space requirement of mining maximal melody sequences in a streaming environment. Theoretical analysis and experimental results with synthetic data show that the MMSLMS algorithm can meet the performance requirements of data stream mining: one scan, bounded space and real time. Further work includes online mining of maximal melody structures in count-based and time-based sliding windows that contain the most recent melody sequences in a data stream.

Fig. 12. Required resources for synthetic datasets: (a) execution time and (b) memory.

Acknowledgements

The authors thank the reviewers for their valuable comments, which improved the quality of the paper. The research is supported by the National Science Council of R.O.C. under grant no. NSC93-2213-E-009-043.

References

Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases, pp. 487–499.

Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002. Models and issues in data stream systems. In: Proceedings of 21st ACM Symposium on Principles of Database Systems, pp. 1–16.

Bakhmutora, V., Gusev, V.U., Titkova, T.N., 1997. The search for adaptations in song melodies. Computer Music Journal 21 (1), 58–67.

Fisher, M.J., Salzberg, S.L., 1982. Finding a majority among n votes: solution to problem 81-5. Journal of Algorithms 3 (4), 362–380.

Hsu, J.L., Liu, C.C., Chen, A.L.P., 2001. Discovering non-trivial repeating patterns in music data. IEEE Transactions on Multimedia 3 (3), 311–325.

Jones, G.T., 1974. Music Theory. Harper & Row, Publishers, New York.

Karp, R.M., Papadimitriou, C.H., Shenker, S., 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28 (1), 51–55.

Shan, M.-K., Kuo, F.-F., 2003. Music style mining and classification by melody. IEICE Transactions on Information and Systems E86-D (4), 655–659.

Yoshitaka, A., Ichikawa, T., 1999. A survey on content-based retrieval for multimedia databases. IEEE Transactions on Knowledge and Data Engineering 11 (1), 81–93.

Zhu, Y., Kankanhalli, M.S., Xu, C., 2001. Pitch tracking and melody slope matching for song retrieval. In: Proceedings of the Second IEEE Pacific Rim Conference on Multimedia: Advances in Multimedia Information, pp. 530–537.