Online mining maximal frequent structures in
continuous landmark melody streams
Hua-Fu Li a,*, Suh-Yin Lee a, Man-Kwan Shan b
a Department of Computer Science and Information Engineering, National Chiao-Tung University, 1001 Ta Hsueh Road, Hsin-Chu 300, Taiwan
b Department of Computer Science, National Chengchi University, 64, Sec. 2, Zhi-nan Road, Wenshan, Taipei 116, Taiwan
Received 10 January 2004; received in revised form 13 November 2004. Available online 14 April 2005
Communicated by E. Backer
Abstract
In this paper, we address the problem of online mining maximal frequent structures (Type I & II melody structures) in unbounded, continuous landmark melody streams. An efficient algorithm, called MMSLMS (Maximal Melody Structures of Landmark Melody Streams), is developed for online incremental mining of maximal frequent melody substructures in one scan of the continuous melody streams. In MMSLMS, a space-efficient scheme, called CMB (Chord-set Memory Border), is proposed to constrain the upper bound of the space requirement of maximal frequent melody structures in such a streaming environment. Theoretical analysis and experimental study show that our algorithm is efficient and scalable for mining the set of all maximal melody structures in a landmark melody stream.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Machine learning; Data mining; Landmark melody stream; Maximal melody structure; Online algorithm
* Corresponding author. Tel.: +886 35731901; fax: +886 35724176.
E-mail addresses: hfl[email protected] (H.-F. Li), sylee@csie.nctu.edu.tw (S.-Y. Lee), [email protected] (M.-K. Shan).
0167-8655/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.patrec.2005.01.016
1. Introduction
Recently, the database and knowledge discovery communities have focused on a new data model, where data arrives in the form of continuous, rapid, huge, unbounded streams. It is often referred to as data streams or streaming data. Many applications generate large amounts of streaming data in real time, such as sensor data generated from sensor networks, transaction flows in retail chains, Web record and click streams in Web applications, performance measurement in network monitoring and traffic management, call records in telecommunications, etc. In such a data stream model, knowledge discovery has two major characteristics (Babcock et al., 2002). First, the volume of a continuous stream over its lifetime could be huge and fast changing. Second, the continuous queries (not just one-shot queries) require timely answers, so the response time must be short. Hence, it is not possible to store all the data in main memory or even in secondary storage. This motivates the design of in-memory summary data structures with small memory footprints that can support both one-time and continuous queries. In other words, data stream mining algorithms have to sacrifice the exactness of their analysis results by allowing some counting error.
Although several techniques have been developed recently for discovering and analyzing the content of static music data (Bakhmutora et al., 1997; Hsu et al., 2001; Shan and Kuo, 2003; Yoshitaka and Ichikawa, 1999; Zhu et al., 2001), new techniques are needed to analyze and discover the content of streaming music data. Thus, this paper studies the new problem of discovering the maximal melody structures in a continuous unbounded melody stream. The problem comes from the context of online music-downloading services (such as Kuro at www.music.com.tw), where the streams in question are streams of queries, i.e., music-downloading requests, sent to the server, and we are interested in finding the maximal melody structures requested by most customers during some period of time. In the computation model of music melody streams presented in Fig. 1, the melody stream processor and the summary data structure are the two major components of the music melody streaming environment. The user query processor receives user queries in the form of ⟨Timestamp, Customer-ID, Music-ID⟩, and then transforms the queries into music data (i.e., melody sequences) in the form of ⟨Timestamp, Customer-ID, Music-ID, Melody-Sequence⟩ by retrieving them from the music database. Note that a buffer can optionally be set up for temporary storage of recent music melodies from the music melody streams.
In this paper, we present a novel algorithm MMSLMS (Maximal Melody Structures of Landmark Melody Streams) for mining the set of all maximal melody structures in a landmark melody stream. Moreover, the music melody data and patterns are represented as sets of chord-sets (Type I Melody structures) or strings of chord-sets (Type II Melody structures). While providing a general framework of music stream mining, algorithm MMSLMS has two major features, namely one scan of music melody streams for online frequency collection, and prefix-tree-based compact pattern representation. With these two important features, MMSLMS is able to work continuously in unbounded streams for an arbitrarily long time with bounded resources, and to quickly answer users' queries at any time.
2. Preliminaries
2.1. Music terminologies
In this section, we describe several features of music data used in this paper. For basic music terminology, we refer to Jones (1974). A chord is a sounding combination of three or more notes at the same time. A note is a single symbol on a musical score, indicating the pitch and duration of what is to be sung or played. A chord-set is a set of chords (Shan and Kuo, 2003).
[Fig. 1. Computation model of music melody streams: user query streams, user query processor, music database (music ID, melody sequence), melody (sequence) streams, melody stream processor, buffer, summary data structure in main memory, and maximal melody structure streams.]
Definition 1. The type I melody structure is represented as a set of chord-sets. The type II melody structure is represented as a string of chord-sets.
2.2. Problem statement
Let $W = \{i_1, i_2, \ldots, i_n\}$ be a set of chord-sets, called items for simplicity. A melody sequence $S$ with $m$ chord-sets is denoted by $S = \langle x_1 x_2 \cdots x_m \rangle$, where $x_i \in W$, $\forall i = 1, 2, \ldots, m$. A block is a set of melody sequences.
Definition 2. A landmark melody stream $LMS = [B_1, B_2, \ldots, B_N)$ is an infinite sequence of blocks, where each block $B_i$ is associated with a block identifier $i$, and $N$ is the identifier of the "latest" block $B_N$. The current length of LMS, written as |LMS|, is $N$. The blocks arrive in some order (implicitly by arrival time or explicitly by timestamp), and may be seen only once.
Definition 3. A set $Y \subseteq W$ is called an item-set, i.e., a set of chord-sets. A k-item-set is represented by $(y_1 y_2 \cdots y_k)$. The support of an item-set Y, denoted as r(Y), is the number of melody sequences containing Y as a subset in the LMS seen so far. An item-set is frequent if its support is greater than or equal to minsup · |LMS|, where minsup is a user-specified minimum support threshold in the range of [0, 1], and |LMS| is the current length of the landmark melody stream LMS.
Definition 4. A string Z is called an item-string, i.e., a string of chord-sets. A k-item-string is represented by $\langle z_1 z_2 \cdots z_k \rangle$, where $z_i \in W$, $\forall i = 1, 2, \ldots, k$. The support of an item-string Z, denoted as r(Z), is the number of melody sequences containing Z as a substring in the LMS seen so far. An item-string is frequent if its support is greater than or equal to minsup · |LMS|, where minsup is a user-specified minimum support threshold in the range of [0, 1], and |LMS| is the current length of the landmark melody stream seen so far.
Definition 5. A frequent item-set (or item-string) is called maximal if it is not a subset (or sub-string) of any other frequent item-set (or item-string).
In fact, the total number of maximal melody structures is smaller than that of frequent melody structures. Hence, maximal melody structures are better suited to the performance requirements of music stream mining.
Definition 6. (Problem Definition of Online Mining Maximal Melody Structures in Continuous Landmark Melody Streams.) Given a landmark melody stream $LMS = [B_1, B_2, \ldots, B_N)$ and the user-specified minimum support, minsup, in the range of [0, 1], the problem of online mining maximal melody substructures is to discover the set of all maximal melody structures, i.e., maximal item-sets or maximal item-strings, in a single scan of the landmark music stream.
2.3. Main performance requirements of music melody stream mining
The main performance challenges of mining melody streams are:
(1) Online, one-pass algorithm: each sequence in the landmark melody stream is examined once.
(2) Bounded-storage: limited memory for storing crucial, compressed information in summary data structure.
(3) Real-time: per item processing time must be low.
The proposed MMSLMS algorithm possesses all of these characteristics, while none of the previously published methods (Bakhmutora et al., 1997; Hsu et al., 2001; Shan and Kuo, 2003; Yoshitaka and Ichikawa, 1999; Zhu et al., 2001) can claim the same.
3. Online mining maximal frequent structures in landmark melody streams
3.1. Chord-set memory border
In this section, the upper bound on the number of candidate maximal melody structures is discussed, and an efficient algorithm for chord-set memory border construction is proposed.
Theorem 1. Given a set of k frequent chord-sets from a landmark melody stream, an upper bound on the number of maximal frequent melody structures is $C^{k}_{\lceil k/2 \rceil}$.
Proof. Assume that there are k frequent chord-sets, i.e., k frequent items, in the current landmark melody stream. The solution space of mining all frequent item-sets in the worst case is $C^{k}_{1} + C^{k}_{2} + \cdots + C^{k}_{i} + \cdots + C^{k}_{\lceil k/2 \rceil} + \cdots + C^{k}_{k}$, where $C^{k}_{1}$ is the total number of frequent 1-item-sets, $C^{k}_{i}$ is that of frequent i-item-sets, and $C^{k}_{k}$ is that of frequent k-item-sets. We observe that $C^{k}_{\lceil k/2 \rceil}$ is the maximum value among all the binomial coefficients $C^{k}_{i}$, $\forall i = 1, 2, \ldots, k$, in mining all frequent i-item-sets. In other words, the number of frequent $\lceil k/2 \rceil$-item-sets is a maximum. We will prove that the number of maximal frequent item-sets cannot be greater than $C^{k}_{\lceil k/2 \rceil}$, i.e., $C^{k}_{\lceil k/2 \rceil}$ is the upper bound. We prove it by contradiction.
Assume that $C^{k}_{\lceil k/2 \rceil}$ is not the maximum number of maximal frequent item-sets, i.e., a larger upper bound $U$ exists, where $U > C^{k}_{\lceil k/2 \rceil}$. Consider that there are one or more frequent melody structures with length $L$, where $L > \lceil k/2 \rceil$. If $F$ is a frequent melody structure with length $\lceil k/2 \rceil + i$ and it is maximal, where $i = 1, 2, \ldots, k - \lceil k/2 \rceil$, then all of the substructures of $F$ are frequent, by the anti-monotone Apriori heuristic (Agrawal and Srikant, 1994): if any i-item-set (or i-item-string) is not frequent, none of its (i + 1)-item supersets (or super-strings) can be frequent. These substructures are, however, not maximal, by Definition 5: a frequent item-set (or item-string) is called maximal if it is not a subset (or substring) of any other frequent item-set (or item-string). In other words, when one maximal frequent structure with length $L$, where $L > \lceil k/2 \rceil$, is added, at most $L$ frequent melody structures with length $L - 1$ are removed from the current collection of maximal frequent melody structures found so far. Hence, the maximum number of maximal melody structures is changed from $U$ to $U'$, where $U' = U + 1 - L$, which is not greater than $C^{k}_{\lceil k/2 \rceil}$. This conflicts with the assumption $U > C^{k}_{\lceil k/2 \rceil}$ and results in a contradiction. Thus the statement is proven to be true. Therefore, we conclude that the maximum number of maximal melody structures is $C^{k}_{\lceil k/2 \rceil}$ in the problem of online mining maximal melody structures in a landmark melody stream. □
Example 1. Assume that there are five frequent items (i.e., frequent 1-item-sets) a, b, c, d, and e in the landmark melody stream as shown in Fig. 2. Let MF denote the total number of maximal frequent item-sets. At this point, a, b, c, d and e are maximal and $MF = C^{5}_{1}$. Based on the Apriori heuristic, $C^{5}_{2}$ frequent 2-item-sets are discovered in the worst case. In this case, these frequent 2-item-sets are also maximal and the frequent 1-item-sets are no longer maximal. The current MF is $C^{5}_{1} + C^{5}_{2} - C^{5}_{1} = C^{5}_{2}$. Next, $C^{5}_{3}$ frequent 3-item-sets are found in the worst case. These frequent 3-item-sets are maximal but the subsets of the maximal 3-item-sets, i.e., the frequent 2-item-sets, are no longer maximal. Now, the MF becomes $C^{5}_{2} + C^{5}_{3} - C^{5}_{2} = C^{5}_{3}$. At this time, suppose the frequent 4-item-set abcd exists in this instance and it is also a maximal 4-item-set. The frequent subsets of abcd with length three, i.e., abc, abd, acd and bcd, are no longer maximal. Now, the MF becomes $C^{5}_{3} + 1 - 4 = 7$, i.e., abcd, abe, ace, ade, bce, bde and cde are the maximal frequent item-sets. The new MF is smaller than the upper bound $C^{5}_{\lceil 5/2 \rceil}$. Hence, if one or more frequent item-sets with length $L$, where $L > \lceil 5/2 \rceil$, are added into the collection of maximal frequent item-sets found so far, the value of MF changes and becomes less than $C^{5}_{\lceil 5/2 \rceil}$. Consequently, $C^{5}_{\lceil 5/2 \rceil}$ is the upper bound of the number of maximal melody structures.
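The counting in Example 1 can be checked by brute force. The following Python sketch is only an illustration under stated assumptions: the helper names `downward_closure` and `maximal_sets` are hypothetical and not part of MMSLMS, and the frequent collection is built by hand as all 3-item-sets over a to e plus abcd, closed under subsets.

```python
from itertools import combinations
from math import comb

def downward_closure(sets):
    """Return every non-empty subset of the given sets (Apriori closure)."""
    closed = set()
    for s in sets:
        for size in range(1, len(s) + 1):
            closed.update(frozenset(c) for c in combinations(sorted(s), size))
    return closed

def maximal_sets(frequent):
    """Keep only the sets that have no proper superset in the collection."""
    return {s for s in frequent if not any(s < t for t in frequent)}

k = 5
bound = comb(k, (k + 1) // 2)  # C(5, ceil(5/2)) = C(5, 3)

# Assumed worst case of Example 1: all 3-item-sets are frequent, plus abcd.
seeds = {frozenset(c) for c in combinations("abcde", 3)} | {frozenset("abcd")}
frequent = downward_closure(seeds)
maximal = maximal_sets(frequent)

print(bound, len(maximal))
print(sorted("".join(sorted(s)) for s in maximal))
```

Adding the 4-item-set abcd removes its four 3-item subsets from the maximal collection, so the count of maximal item-sets stays below the bound computed from the middle binomial coefficient.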
The key property of algorithm MMSLMS is derived from the recent work (Karp et al., 2003) on finding frequent elements in streaming data. The basic scheme of mining chord-sets from music data streams is generalized from the well-known algorithm (Fisher and Salzberg, 1982) for determining whether a value (majority element) occurs more than n/2 times, i.e., minsup = 0.5, in a data stream of length n.
The method can be extended to an arbitrary value of minsup. The scheme proceeds as follows. At any given time, a superset of k probably frequent chord-sets, with k at most 1/minsup, is maintained. Initially, the set is empty. As a chord-set is read from the melody sequence in the current block, one of the following operations is performed. First, if the current chord-set is not contained in the superset and some entries are free, it is inserted into the superset with a count set to one. Second, if the chord-set is already in the superset, its count is incremented by one. However, if the chord-set is not in the superset and the superset is full, the count of each entry in the superset is decremented by one, and the chord-sets whose counts were just one are dropped. The method thus identifies at most k candidates for having appeared more than n/(k + 1) times, and uses O(1/minsup) memory entries.
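A minimal Python sketch of this counting scheme is given below. It is an illustration of the generic frequent-element technique described above, not the CMB implementation itself; the function name `frequent_candidates` and the use of single characters as chord-sets are assumptions made for the example.

```python
def frequent_candidates(stream, k):
    """Keep at most k counters; any item occurring more than n/(k+1) times
    in a stream of length n is guaranteed to survive as a candidate."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1          # already tracked: increment
        elif len(counters) < k:
            counters[item] = 1           # free entry: insert with count one
        else:
            # Superset is full: decrement every entry, drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# 'a' occurs 5 times in a stream of length 9, i.e., more than 9 / (2 + 1).
print(frequent_candidates("abcabaada", k=2))
```

The returned counts are lower bounds, not exact supports, and the candidate set may contain items that are not actually frequent; an exact answer would need a second pass, which a stream does not allow.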
3.2. The proposed algorithm: MMSLMS
Algorithm MMSLMS has three modules: MMSLMS-buffer, MMSLMS-summary, and MMSLMS-mine. MMSLMS-buffer repeatedly reads a block of melody sequences into available main memory. All compressed and essential information about the maximal melody structures is maintained in the MMSLMS-summary. Finally, MMSLMS-mine finds the maximal melody structures in a depth-first manner in the current MMSLMS-summary. Therefore, the challenges of online mining landmark melody streams lie in the design of a space-efficient representation of the in-memory summary data structure and a fast discovery algorithm for finding maximal melody structures in real time.
3.2.1. MMSLMS-summary
First of all, the in-memory data structure MMSLMS-summary is defined and the construction process of MMSLMS-summary is discussed. Then we use a running example to illustrate it.
Definition 7. An MMSLMS-summary is an extended prefix-tree-based summary data structure defined below.
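[Fig. 2. Lattice of the item-sets over the five frequent items a, b, c, d and e, arranged by size: the $C^{5}_{1}$ 1-item-sets, $C^{5}_{2}$ 2-item-sets, $C^{5}_{3}$ 3-item-sets, $C^{5}_{4}$ 4-item-sets, and the $C^{5}_{5}$ 5-item-set abcde.]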
1. MMSLMS-summary consists of a CMB (Chord-set Memory Border) and a set of MPI-trees (Maximal Prefix-Item trees of item-suffixes), denoted as MPI-trees(item-suffixes).
2. Each node in the MPI-tree(item-suffix) consists of four fields: item-id, support, block-id and node-link, where item-id is the item identifier of the inserted item, support registers the number of melody sequences represented by the portion of the path reaching the node with the item-id, the value of block-id assigned to a new node is the block identifier of the current block, and node-link links the node to the next node with the same item-id in the same MPI-tree, or is null if there is none.
3. Each entry in the CMB consists of four fields: item-id, support, block-id, and head of node-link (a pointer to the root node of the MPI-tree with the same item-id), abbreviated as head-link, where item-id registers which item identifier the entry represents, support records the number of melody sequences containing the item carrying the item-id, the value of block-id assigned to a new entry is the block identifier of the current block, and head-link points to the root node of the MPI-tree(item-suffix). Notice that each entry with item-id in the CMB is an item-suffix and it is also the root node of the MPI-tree(item-id).
4. Each MPI-tree(item-suffix) has a specific CMB-table (Chord-set Memory Border table) with respect to the item-suffix, denoted as CMB-table(item-suffix). The CMB-table(item-suffix) is composed of four fields, namely item-id, support, block-id, and head-link. The CMB-table(item-suffix) operates the same as the CMB except that the field head-link links to the first node carrying the item-id in the MPI-tree(item-suffix). Notice that |CMB-table(item-suffix)| = |CMB| in the worst case, where |CMB| denotes the total number of entries in the CMB.
The construction of MMSLMS-summary is described as follows. First of all, MMSLMS reads a melody sequence S from the current block. Then, MMSLMS projects the sequence S into many subsequences and inserts these subsequences into the CMB and MPI-trees. In detail, each melody sequence $S = \langle x_1, x_2, \ldots, x_m \rangle$ in the current block is projected by inserting m item-suffix melody subsequences into the MMSLMS-summary. In other words, the melody sequence $S = \langle x_1, x_2, \ldots, x_m \rangle$ is converted into m melody subsequences; that is, $\langle x_1, x_2, \ldots, x_m \rangle$, $\langle x_2, x_3, \ldots, x_m \rangle$, ..., $\langle x_{m-1}, x_m \rangle$, and $\langle x_m \rangle$. The m melody subsequences are called item-suffix sequences, since the first item of each melody subsequence is an item-suffix of the original melody sequence S. This step is called sequence projection, and is denoted as Sequence-Projection(S) = $\{x_1|S, x_2|S, \ldots, x_i|S, \ldots, x_m|S\}$, where $x_i|S = \langle x_i, x_{i+1}, \ldots, x_m \rangle$, $\forall i = 1, 2, \ldots, m$. Furthermore, the cost of sequence projection of a melody sequence with length m is $(m^2 + m)/2$, i.e., $m + (m - 1) + \cdots + 2 + 1$.
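Sequence projection amounts to enumerating the suffixes of a melody sequence. A minimal Python sketch follows, assuming (for illustration only) that each chord-set is written as a single character so that a melody sequence is a string; the function names are hypothetical.

```python
def sequence_projection(sequence):
    """Return the m item-suffix sequences x_1|S, x_2|S, ..., x_m|S of S."""
    return [sequence[i:] for i in range(len(sequence))]

def projection_cost(m):
    """Total number of items over all suffixes: m + (m - 1) + ... + 1."""
    return (m * m + m) // 2

suffixes = sequence_projection("acdef")
print(suffixes)
print(sum(len(s) for s in suffixes), projection_cost(5))
```

The summed suffix lengths match the closed form $(m^2 + m)/2$ stated above for m = 5.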
After Sequence-Projection(S), the MMSLMS algorithm removes the original melody sequence S from the MMSLMS-buffer. Next, the items in these item-suffix sequences are inserted into the CMB and the MPI-trees(item-suffixes) as branches, and the CMB-tables(item-suffixes) are updated according to the item-suffixes. If an item-set (or item-string) shares a prefix with an item-set (or item-string) already in the tree, the new item-set (or item-string) will share a prefix of the branch representing that item-set (or item-string). In addition, a support counter is associated with each node in the tree. The counter is updated when an item-suffix sequence causes the insertion of a new branch.
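The branch insertion can be sketched with an ordinary prefix tree. The Python code below is a simplified illustration, not the full MMSLMS-summary: node-links, head-links and CMB-tables are omitted, chord-sets are assumed to be single characters, and the names `Node`, `insert_suffix` and `insert_sequence` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    item: str
    support: int = 0
    block_id: int = 0
    children: dict = field(default_factory=dict)

def insert_suffix(roots, suffix, block_id):
    """Insert one item-suffix sequence as a branch of the tree rooted at its
    first item, sharing any prefix that already exists."""
    level = roots
    for item in suffix:
        node = level.get(item)
        if node is None:
            node = Node(item, 0, block_id)   # new node records the current block
            level[item] = node
        node.support += 1                    # one more sequence passes through
        level = node.children

def insert_sequence(roots, sequence, block_id):
    """Project the sequence and insert every item-suffix sequence."""
    for i in range(len(sequence)):
        insert_suffix(roots, sequence[i:], block_id)

roots = {}
for seq in ["acdef", "abe", "cef", "acdf", "cef", "df"]:
    insert_sequence(roots, seq, block_id=1)
print(roots["c"].support, roots["c"].children["e"].children["f"].support)
```

Here `roots` plays the role of the collection of MPI-tree root nodes, one per item-suffix; the sequences are those of the block used in Example 2.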
In order to limit the memory size of the summary data structure MMSLMS-summary, a space pruning technique is performed. Let the minimum support threshold be minsup, in the range of [0, 1], and the current length of the landmark melody stream be N. The rule for space pruning is as follows. A melody structure E is deleted if E.support < minsup · N. E is called an infrequent melody structure. After pruning all infrequent melody structures from the CMB, CMB-table(item-suffix) and MPI-trees, the MMSLMS-summary contains all information about frequent melody structures of the landmark melody stream generated so far. Example 2 below illustrates the algorithm step by step. Note that the angle brackets ⟨ ⟩ of sequences are omitted for clear presentation.
Example 2. Let a block $B_j$ of the landmark melody stream LMS be ⟨acdef⟩, ⟨abe⟩, ⟨cef⟩, ⟨acdf⟩, ⟨cef⟩ and ⟨df⟩, and the minimum support threshold be 0.5 (i.e., minsup = 0.5), where a, b, c, d, e and f are chord-sets (i.e., items) in a landmark melody stream seen so far. The MMSLMS algorithm constructs the MMSLMS-summary with respect to the incoming block $B_j$ and prunes all infrequent item-sets from the current MMSLMS-summary in the following steps. Note that each node or entry represented as ($f_1$:$f_2$:$f_3$) is composed of three fields: item-id, support, and block-id. For example, (a:2:j) indicates that, from block $B_j$, item a appeared twice.
Step 1: MMSLMS reads the current block $B_j$ into main memory for constructing the MMSLMS-summary.
(a) First melody sequence acdef: First of all, the MMSLMS algorithm reads the first melody sequence acdef and calls Sequence-Projection(acdef). Then MMSLMS inserts the item-suffix sequences acdef, cdef, def, ef, and f into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(c), CMB-table(c)], [MPI-tree(d), CMB-table(d)], [MPI-tree(e), CMB-table(e)], and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 3. In the following sub-steps, as demonstrated in Fig. 4 through Fig. 9, the head-links of each CMB-table(item-suffix) are omitted for concise presentation.
Fig. 3. MMSLMS-summary construction after inserting first melody sequence acdef in block Bj.
(b) Second melody sequence abe: MMSLMS reads the second melody sequence abe and calls Sequence-Projection(abe). Next, MMSLMS inserts the item-suffix sequences abe, be and e into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(b), CMB-table(b)] and [MPI-tree(e), CMB-table(e)], respectively. The result is shown in Fig. 4.
(c) Third melody sequence cef: MMSLMS reads the third melody sequence cef and calls Sequence-Projection(cef). Then, MMSLMS inserts the item-suffix sequences cef, ef and f into the CMB, [MPI-tree(c), CMB-table(c)], [MPI-tree(e), CMB-table(e)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 5.
(d) Fourth melody sequence acdf: MMSLMS reads the fourth melody sequence acdf and calls Sequence-Projection(acdf). Next, MMSLMS inserts the item-suffix sequences acdf, cdf, df and f into the CMB, [MPI-tree(a), CMB-table(a)], [MPI-tree(c), CMB-table(c)], [MPI-tree(d), CMB-table(d)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 6.
(e) Fifth melody sequence cef: MMSLMS reads the fifth melody sequence cef and calls Sequence-Projection(cef). Then, MMSLMS inserts the item-suffix sequences cef, ef and f into the CMB, [MPI-tree(c), CMB-table(c)], [MPI-tree(e), CMB-table(e)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 7.
(f) Sixth melody sequence df: MMSLMS reads the sixth melody sequence df and calls Sequence-Projection(df). Next, MMSLMS inserts the item-suffix sequences df and f into the CMB, [MPI-tree(d), CMB-table(d)] and [MPI-tree(f), CMB-table(f)], respectively. The result is shown in Fig. 8.
Step 2: After processing the current block $B_j$, all infrequent melody structures are pruned by MMSLMS from the current MMSLMS-summary. At this time, MMSLMS deletes the MPI-tree(b) and its corresponding CMB-table(b), and prunes the entry b from the CMB, since item b is an infrequent item; that is, r(b) < minsup · |LMS|, where r(b) = 1 and minsup · |LMS| = 0.5 · 6 = 3. Next, MMSLMS reconstructs the MPI-tree(a) by eliminating the information about the infrequent item b. The result is shown in Fig. 9.
The steps above describe the construction process of the MMSLMS-summary with respect to the incoming block over a landmark melody stream. The MMSLMS-summary construction algorithm is depicted in Fig. 10.
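The item supports and the pruning decision of Example 2 can be reproduced with a few lines of Python. This is a sketch that works directly on the block rather than on the CMB and MPI-trees; the helper names are hypothetical and chord-sets are assumed to be single characters.

```python
from collections import Counter

def item_supports(block):
    """Count, for every item, the number of melody sequences containing it."""
    counts = Counter()
    for seq in block:
        counts.update(set(seq))
    return counts

def prune_infrequent(block, minsup):
    """Apply the space pruning rule: drop items with support < minsup * |block|."""
    threshold = minsup * len(block)
    supports = item_supports(block)
    infrequent = {x for x, c in supports.items() if c < threshold}
    pruned = ["".join(x for x in seq if x not in infrequent) for seq in block]
    return supports, infrequent, pruned

block = ["acdef", "abe", "cef", "acdf", "cef", "df"]
supports, infrequent, pruned = prune_infrequent(block, minsup=0.5)
print(dict(supports), infrequent, pruned)
```

With minsup = 0.5 and six sequences the threshold is 3, so only item b (support 1) is removed, and the sequence abe contributes ae to the reconstructed MPI-tree(a).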
3.2.2. MMSLMS-mine
In this section, the module MMSLMS-mine, which mines maximal melody item-sets and maximal melody item-strings from the current MMSLMS-summary, is discussed (Fig. 11).
Fig. 5. MMSLMS-summary construction after inserting third melody sequence cef.
Fig. 7. MMSLMS-summary construction after inserting fifth melody sequence cef.
First of all, given an entry id (from left to right, for example) in the current CMB, MMSLMS-mine generates candidate maximal melody structures by a top-down approach. The top-down method uses the frequent items (i.e., chord-sets) of CMB-table(id) and item id to generate the candidates. The generating order of these candidates is determined by the size of the item-set, from item-set size 1 + |CMB-table(id)| down to size 2. Note that the generating order ends at 2-item-sets because all frequent entries in the current CMB-table are frequent 1-item-sets. Then MMSLMS-mine checks whether these candidates are frequent by traversing the MPI-tree(id). The MPI-tree traversing principle is described as follows. First, MMSLMS-mine generates a candidate maximal melody item-set, a (j + 1)-item-set, containing the item id and all items of the CMB-table(id), where |CMB-table(id)| = j. Second, MMSLMS-mine traverses the MPI-tree via the node-links of the candidate. After that, if the candidate is not a frequent item-set, MMSLMS-mine generates substructure candidates, i.e., j-item-sets. Next, MMSLMS-mine executes the same MPI-tree traversing scheme for item-set counting. The process stops when MMSLMS-mine finds all maximal frequent melody structures from the current MMSLMS-summary. Moreover, MMSLMS-mine stores these maximal melody structures into a temporal pattern list, called MMSLMS-list. Notice that MMSLMS-mine can find the set of frequent 2-item-sets by combining the item-suffix id with the frequent items of the CMB-table(id).
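The top-down candidate generation can be sketched in Python as follows. This is an illustration under several assumptions, not the authors' implementation: supports are counted by scanning the sequences instead of traversing MPI-trees, items inside each sequence are assumed to follow a fixed (alphabetical) order, chord-sets are single characters, and the function name is hypothetical.

```python
from itertools import combinations

def mine_maximal_itemsets(sequences, minsup):
    threshold = minsup * len(sequences)

    def support(itemset):
        return sum(1 for s in sequences if itemset <= set(s))

    items = sorted({x for s in sequences for x in s})
    frequent_items = [x for x in items if support({x}) >= threshold]
    maximal = []                                   # plays the role of the pattern list
    for idx, item in enumerate(frequent_items):
        # Stand-in for CMB-table(item): later items frequent together with item.
        table = [y for y in frequent_items[idx + 1:]
                 if support({item, y}) >= threshold]
        found = []
        for size in range(len(table), 0, -1):      # largest candidates first
            for rest in combinations(table, size):
                cand = frozenset((item,) + rest)
                if any(cand <= m for m in found):
                    continue                       # covered by a larger frequent set
                if support(cand) >= threshold:
                    found.append(cand)
        if not found:
            found = [frozenset([item])]            # only the 1-item-set is frequent
        for cand in found:
            if not any(cand <= m for m in maximal):
                maximal = [m for m in maximal if not m <= cand]
                maximal.append(cand)
    return maximal

block = ["acdef", "abe", "cef", "acdf", "cef", "df"]
result = mine_maximal_itemsets(block, minsup=0.5)
print(sorted("".join(sorted(m)) for m in result))
```

Generating candidates from the largest size downward lets every frequent candidate suppress its own subsets immediately, which is the reason the top-down order is attractive when maximal structures are long.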
Algorithm 1 (MMSLMS-summary Construction)
Input: A landmark melody stream, LMS = [B1, B2, …, BN), and a user-specified minimum support threshold, minsup.
Output: A current MMSLMS-summary.
CMB = ∅  /* initialize the CMB to empty */
foreach block Bj do  /* j = 1, 2, …, N */
    foreach melody sequence S = <x1, x2, …, xm> ∈ Bj do
        foreach item xi ∈ S do  /* the CMB maintenance */
            if xi ∉ CMB then
                create a new entry (xi, 1, j, head-link) into the CMB;  /* the entry form is (item-id, support, block-id, head-link) */
            else  /* the entry is already in the CMB */
                xi.support = xi.support + 1;  /* increment the support of item-id xi by one */
            end if
        end for
        call Sequence-Projection(S);  /* project the sequence with every prefix-item xi for the construction of MPI-tree(xi) */
    end for
    call MMSLMS-summary-pruning(MMSLMS-summary, minsup, |LMS|);
end for

Subroutine Sequence-Projection
Input: A melody sequence S = <x1, x2, …, xm> and the current block-id j;
Output: MPI-trees(xi), ∀i = 1, 2, …, m;
foreach item xi (i = 1, 2, …, m) do
    MPI-tree-maintenance([xi|X], MPI-tree(xi), j);
    /* X = x1, x2, …, xm is the original melody sequence */
    /* [xi|X] is an item-suffix melody sequence with item-suffix xi */
end for

Subroutine MPI-tree-maintenance
Input: An item-suffix melody sequence <xi, xi+1, …, xm> (i = 1, 2, …, m), MPI-tree(xi) and the current block-id j;
Output: The modified MPI-tree(xi), where i = 1, 2, …, m;
foreach item xi do  /* i = 1, 2, …, m */
    if MPI-tree has a child Y such that Y.item-id = xi.item-id then
        Y.support = Y.support + 1;  /* increment Y's support by one */
    else
        create a new node Y = (item-id, 1, j, node-link);
        /* initialize Y's support to 1, link its parent link to the MPI-tree, and link its node-link to the nodes with the same item-id via the node-link structure */
    end if
end for

Subroutine MMSLMS-summary-pruning
Input: An MMSLMS-summary, a user-specified minimum support threshold, minsup, and the current length of LMS, |LMS|;
Output: An MMSLMS-summary which contains the set of all frequent melody structures.
foreach entry xi (i = 1, 2, …, d) ∈ CMB, where d = |CMB| do
    if xi.support < minsup · |LMS| then  /* xi is not a frequent item */
        delete those nodes (item-id = xi) via the node-link structure;
        merge the fragmented sub-trees;  /* a simple way is to reinsert or to join the remaining sub-trees into the MPI-tree */
        delete MPI-tree(xi);
        delete the entry xi from the CMB;
    end if
end for

Algorithm 2 (MMSLMS-mine)
Input: A current MMSLMS-summary, the current length of the landmark melody stream |LMS|, and a minimum support threshold minsup.
Output: A temporal-pattern-list, MMSLMS-list, of maximal melody structures.
MMSLMS-list = ∅;
foreach entry e in the current CMB do
    do
        generate a candidate maximal melody structure E with size |E|;  /* |E| = 1 + |CMB-table(e)| */
        count E.support by traversing the MPI-tree(e);
        if E.support ≥ minsup · |LMS| then
            if E ∉ MMSLMS-list and E is not a substructure of any other maximal frequent structure contained in the MMSLMS-list then
                add E into the MMSLMS-list;
                remove E's substructures from the MMSLMS-list;
            end if
        else  /* E is not a frequent melody structure */
            enumerate E into melody substructures with size |E| − 1;
        end if
    until MMSLMS-mine finds the set of all maximal frequent structures with respect to the item e;
end for
Fig. 11. Algorithm of MMSLMS-mine.

Example 3. This example illustrates the mining of the maximal melody item-sets from the current MMSLMS-summary in Fig. 9. Let the minimum support threshold be 0.5, i.e., minsup = 0.5.
(1) We start the maximal melody item-set mining scheme from the frequent item a. At this moment, the only frequent item-set is the 1-item-set (a), since the supports of items c, d, e and f in the CMB-table(a) are less than minsup · |LMS|, where |LMS| = |Bj| = 6.
(2) Next, MMSLMS-mine starts on the second entry c for maximal melody item-set mining. MMSLMS-mine generates a candidate maximal 3-item-set (cef), and traverses the MPI-tree(c) to count its support. As a result, the candidate (cef) is a maximal frequent item-set, since its support is 3 and it is not a sub-structure of any other maximal melody structure within the MMSLMS-list. Now, MMSLMS-mine stores the maximal item-set (cef) into the MMSLMS-list.
(3) MMSLMS-mine starts on the third entry d and generates a maximal frequent 2-item-set (df). We store this item-set (df) into the MMSLMS-list because it is not a sub-structure of any other maximal melody structure within the current MMSLMS-list.
(4) On the fourth entry e, since its maximal melody item-set (ef) is a sub-structure of the previous maximal melody item-set (cef), MMSLMS-mine does not store it into the MMSLMS-list.
(5) Finally, MMSLMS-mine processes the entry f, and generates a maximal frequent 1-item-set (f) directly, since the CMB-table(f) is empty. MMSLMS-mine does not store it into the MMSLMS-list, because it is a sub-structure of the generated maximal item-set (cef).
In conclusion, the Maximal Type I Melody Structures determined by algorithm MMSLMS are (a), (cef) and (df). Now, we describe the mining principle of maximal melody item-strings, i.e., Maximal Type II Melody Structures, as below. MMSLMS-mine generates maximal melody item-strings from the current MMSLMS-summary as shown in Fig. 9 by a depth-first-search (DFS) approach. Hence, the Maximal Type II Melody Structures determined by algorithm MMSLMS are ⟨a⟩, ⟨c⟩, ⟨d⟩ and ⟨ef⟩. Note that ⟨f⟩ is not a maximal melody item-string since it is a sub-string of the existing maximal melody 2-item-string ⟨ef⟩.
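The Type II result for the example block can be cross-checked by exhaustive enumeration. The Python sketch below is a brute-force check, not the DFS over MPI-trees used by MMSLMS-mine; it assumes single-character chord-sets so that an item-string is a contiguous substring, and the function name is hypothetical.

```python
def maximal_item_strings(sequences, minsup):
    """Enumerate all substrings, keep the frequent ones, then keep those that
    are not a substring of another frequent item-string."""
    threshold = minsup * len(sequences)
    substrings = {s[i:j]
                  for s in sequences
                  for i in range(len(s))
                  for j in range(i + 1, len(s) + 1)}
    frequent = {z for z in substrings
                if sum(1 for s in sequences if z in s) >= threshold}
    return sorted(z for z in frequent
                  if not any(z != y and z in y for y in frequent))

block = ["acdef", "abe", "cef", "acdf", "cef", "df"]
type2 = maximal_item_strings(block, minsup=0.5)
print(type2)
```

Unlike item-sets, item-strings require contiguity, which is why ⟨ce⟩ and ⟨df⟩ fall below the threshold here even though the item-sets (ce) and (df) are frequent.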
Based on the algorithm MMSLMS-mine in Fig. 11, we have the following lemma.
Lemma 2. A melody structure is a maximal melody structure if and only if it is generated by algorithm MMSLMS-mine.
Proof. Algorithm MMSLMS-mine is composed of two major steps: frequent melody structure selection (step 1) and maximal melody structure verification (step 2). These steps are performed in sequence. First, in the step of frequent melody structure selection, MMSLMS-mine finds frequent melody structures based on the Apriori property: if any length-i item-set (or i-item-string) is not frequent, no length-(i + 1) item-set (or (i + 1)-item-string) containing it can ever be frequent. That means MMSLMS-mine does not miss any frequent melody structures. Next, in step 2, MMSLMS-mine checks the frequent melody structures generated from step 1 against the maximal melody structures of the MMSLMS-list, a temporal pattern list of maximal melody structures. If a frequent melody structure is a sub-structure (i.e., subset or sub-string) of any other structure within the MMSLMS-list, then it is not a maximal melody structure according to Definition 5; otherwise it is a candidate maximal melody structure before the next execution of step 2. Repeating step 1 and step 2, MMSLMS-mine can generate all the maximal melody structures contained in the MMSLMS-list. Hence, we have the lemma: a melody structure is a maximal melody structure if and only if it is generated by algorithm MMSLMS-mine.
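The verification step of the proof can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `is_substring` and `maximal_item_strings` and the input list of frequent item-strings are assumptions introduced here, and the sketch covers only item-strings (Type II structures) under contiguous sub-string containment.

```python
def is_substring(short, long):
    """Return True if tuple `short` occurs contiguously inside tuple `long`."""
    n, m = len(short), len(long)
    return any(long[i:i + n] == short for i in range(m - n + 1))


def maximal_item_strings(frequent):
    """Keep only the frequent item-strings that are not contained in another one."""
    maximal = []  # stands in for the MMSLMS-list (assumed flat list here)
    # Visit longer strings first so every possible container is examined first.
    for cand in sorted(set(frequent), key=len, reverse=True):
        if not any(is_substring(cand, kept) for kept in maximal):
            maximal.append(cand)
    return maximal


# Hypothetical frequent item-strings, chosen to mirror the example in the text.
frequent = [("a",), ("c",), ("d",), ("e",), ("f",), ("e", "f")]
result = maximal_item_strings(frequent)
```

With this hypothetical input, ⟨e⟩ and ⟨f⟩ are discarded because both are sub-strings of ⟨ef⟩, leaving ⟨a⟩, ⟨c⟩, ⟨d⟩ and ⟨ef⟩.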
Space complexity analysis: The space requirement of MMSLMS-summary consists of two parts: the working space needed to create a CMB and the CMB-tables, and the storage space needed to maintain the set of MPI-trees. Assume that CMB contains k frequent chord-sets such as $e_1, e_2, \ldots, e_i, \ldots, e_k$ at any time. Based on Theorem 1, we know that there are at most $C^{k}_{\lceil k/2 \rceil}$ maximal frequent chord-sets in the landmark melody stream seen so far. If we construct the MMSLMS-summary for all these maximal frequent melody structures, the maximum height of all the MPI-trees is $\lceil k/2 \rceil$. There are $1 + C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}$ nodes in the MPI-tree($e_1$), where the value 1 indicates the root node $e_1$ of the MPI-tree($e_1$), and $C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}$ are internal and leaf nodes of the MPI-tree($e_1$). Moreover, there are $1 + C^{k-2}_{1} + C^{k-2}_{2} + \cdots + C^{k-2}_{\lceil k/2 \rceil - 1}$ nodes in the MPI-tree($e_2$), ..., $1 + C^{k-i}_{1} + C^{k-i}_{2} + \cdots + C^{k-i}_{\lceil k/2 \rceil - 1}$ nodes in the MPI-tree($e_i$), $1 + C^{k-(k-1)}_{1}$ nodes in the MPI-tree($e_{k-1}$), and 1 (root) node in the MPI-tree($e_k$). Thus, the total number of nodes of MPI-trees in the MMSLMS-summary is

$$
\begin{aligned}
&\bigl(1 + C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}\bigr) + \bigl(1 + C^{k-2}_{1} + C^{k-2}_{2} + \cdots + C^{k-2}_{\lceil k/2 \rceil - 1}\bigr) + \cdots \\
&\quad + \bigl(1 + C^{k-i}_{1} + C^{k-i}_{2} + \cdots + C^{k-i}_{\lceil k/2 \rceil - 1}\bigr) + \cdots + \bigl(1 + C^{k-(k-1)}_{1}\bigr) + 1 \\
&= \bigl(C^{k-1}_{0} + C^{k-1}_{1} + C^{k-1}_{2} + \cdots + C^{k-1}_{\lceil k/2 \rceil - 1}\bigr) + \bigl(C^{k-2}_{0} + C^{k-2}_{1} + C^{k-2}_{2} + \cdots + C^{k-2}_{\lceil k/2 \rceil - 1}\bigr) + \cdots \\
&\quad + \bigl(C^{k-i}_{0} + C^{k-i}_{1} + C^{k-i}_{2} + \cdots + C^{k-i}_{\lceil k/2 \rceil - 1}\bigr) + \cdots + \bigl(C^{k-(k-1)}_{0} + C^{k-(k-1)}_{1}\bigr) + C^{k-k}_{0}.
\end{aligned}
$$

This number equals $C^{k}_{1} + C^{k}_{2} + \cdots + C^{k}_{\lceil k/2 \rceil}$ based on Pascal's Identity: let x and y be positive integers with $x \ge y$. Then $C^{x+1}_{y} = C^{x}_{y-1} + C^{x}_{y}$.
Moreover, the worst case working space requires at most $(k^2 + k)/2$ entries, which is based on the process of Sequence-Projection. Thus, the space requirement of MMSLMS-summary is $(k^2 + k)/2 + C^{k}_{1} + C^{k}_{2} + \cdots + C^{k}_{\lceil k/2 \rceil}$. Finally, the upper bound of space requirement is $O(2^k)$. □

The worst case space complexity of algorithm MMSLMS can be analyzed in terms of melody
sequence size as described below. Assume that the average melody sequence size is m, the current length of the landmark melody stream is N, and the minimum support threshold is minsup. The space requirement of algorithm MMSLMS is composed of two parts, working space and storage space. The working space is used to store the CMB and CMB-tables and the storage space is used to store the MPI-trees. The working space requirement is $m + (m-1) + (m-2) + \cdots + 1$ and the storage space requirement is also about $m + (m-1) + (m-2) + \cdots + 1$. Hence, the space requirement of MMSLMS for inserting a melody sequence with average size m into MMSLMS-summary is $2[m + (m-1) + (m-2) + \cdots + 1] = m^2 + m$, and the space requirement of the stream generated so far in the worst case is $N \cdot (m^2 + m)$. Note that in the analysis, we assume that $\mathit{minsup} \cdot N$ is just one and therefore every item of the incoming melody sequence is a frequent item, which is the worst case. However, we know that the value of N increases as time progresses. Hence, the pruning mechanism of MMSLMS-summary is deployed to keep the memory requirement from exceeding an upper bound.
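The pruning mechanism mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's data structure: the CMB is modelled as a plain dictionary from item to support count, each MPI-tree is flattened to the set of items it contains, and the function name `prune_summary` is hypothetical. The node-link deletion and sub-tree merge of the pruning subroutine are replaced by a simple set removal.

```python
def prune_summary(cmb, mpi_trees, minsup, stream_length):
    """Drop every CMB entry whose support is below minsup * |LMS|.

    cmb        : dict mapping item -> support count (assumed representation)
    mpi_trees  : dict mapping item -> set of items in that item's tree
                 (a flat stand-in for a real MPI-tree)
    Returns the list of pruned items.
    """
    threshold = minsup * stream_length
    infrequent = [item for item, support in cmb.items() if support < threshold]
    for item in infrequent:
        # Remove the item's own tree and its CMB entry.
        mpi_trees.pop(item, None)
        del cmb[item]
        # Remove the item from every remaining tree; a real MPI-tree would
        # follow node-links and merge the fragmented sub-trees here.
        for tree in mpi_trees.values():
            tree.discard(item)
    return infrequent


# Hypothetical summary after 100 melody sequences with minsup = 10%.
cmb = {"a": 50, "b": 3, "c": 12}
mpi_trees = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
pruned = prune_summary(cmb, mpi_trees, minsup=0.1, stream_length=100)
```

In this toy run only item `b` falls below the threshold of 10 occurrences, so it is removed from the CMB and from the remaining trees.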
From the space complexity analysis, it is not surprising that the space complexity grows exponentially in the number of frequent items in the CMB, as all frequent item-sets are represented in the data structure. This is also the size of the solution space of the problem.
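The node-count identity used in the space analysis can be checked numerically. The sketch below sums the worst-case node counts of MPI-tree(e_1), ..., MPI-tree(e_k) and compares the total with the closed form C(k,1) + C(k,2) + ... + C(k,⌈k/2⌉); the function names are assumptions introduced for illustration.

```python
from math import comb


def mpi_tree_nodes(k, i):
    """Worst-case node count of MPI-tree(e_i): sum of C(k-i, j) for j < ceil(k/2)."""
    height = (k + 1) // 2  # ceil(k/2), the maximum tree height
    # comb(n, j) is 0 when j > n, so short trees are handled automatically.
    return sum(comb(k - i, j) for j in range(height))


def total_nodes(k):
    """Total worst-case node count over all k MPI-trees."""
    return sum(mpi_tree_nodes(k, i) for i in range(1, k + 1))


def closed_form(k):
    """C(k,1) + C(k,2) + ... + C(k, ceil(k/2))."""
    height = (k + 1) // 2
    return sum(comb(k, j) for j in range(1, height + 1))


def summary_space(k):
    """Working space (k^2 + k)/2 plus the worst-case number of tree nodes."""
    return (k * k + k) // 2 + closed_form(k)
```

For k = 4, for example, the four trees hold 4, 3, 2 and 1 nodes, and the closed form gives C(4,1) + C(4,2) = 10 as well. Since the closed form is a partial sum of the binomial coefficients of k, it stays below 2^k, which is where the O(2^k) bound comes from.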
Time complexity analysis: From the construction process of MMSLMS-summary, we can see that exactly one scan of a landmark melody stream is required. The cost (denoted by Time-cost(S)) of inserting a melody sequence S into the MMSLMS-summary by sequence projection is $|S| + (|S| - 1) + \cdots + 1 = (|S|^2 + |S|)/2$; that is $O(|freq(S)|^2)$, where freq(S) is the set of frequent items in the melody sequence S. Note that $|freq(S)| \le |S|$, where |S| denotes the size of the melody sequence S. Because the items within the CMB are frequent items, the cost of inserting a melody sequence S can be stated in terms of the size of CMB: Time-cost(S) = $O(|S'|^2)$, where |S'| is the number of chord-sets of melody sequence S within the CMB. In the worst case, if the melody sequence S contains all the frequent items within the CMB, Time-cost(S) = $O(|CMB|^2)$.
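The quadratic insertion cost can be made concrete with a small sketch. It assumes that sequence projection yields one suffix of S per item position, which is consistent with the sum |S| + (|S| − 1) + ... + 1 given above but is an assumption about a definition stated outside this section. The names `sequence_projections` and `insertion_cost` are hypothetical.

```python
def sequence_projections(seq):
    """Return one suffix of `seq` per starting position (assumed projection)."""
    return [seq[i:] for i in range(len(seq))]


def insertion_cost(seq):
    """Count the items visited when every projection is inserted item by item."""
    return sum(len(projection) for projection in sequence_projections(seq))


# Hypothetical melody sequence of five chord-sets.
S = ["a", "c", "d", "e", "f"]
cost = insertion_cost(S)  # 5 + 4 + 3 + 2 + 1
```

For this five-item sequence the cost is 15, matching (|S|² + |S|)/2 with |S| = 5.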
4. Experimental results
In this section, we first describe the data and experiment set-up used to evaluate the performance of the proposed algorithm, and then report our experimental results.
4.1. Synthetic data and experiment set-up
To evaluate the performance of the MMSLMS algorithm, two experiments are performed. The experiments were carried out on data produced by the IBM synthetic market-basket test data generator proposed by Agrawal and Srikant (1994). Two data streams, denoted by S10.I5.D1000K and S30.I15.D1000K, of 1 million melody sequences each are studied. The first one, S10.I5.D1000K with 1 K unique items, has an average melody sequence size of 10 with an average maximal potentially frequent structure size of 5. The second one, S30.I15.D1000K with 10 K unique items, has an average melody sequence size of 30 with an average maximal potentially frequent structure size of 15. In all experiments, the melody sequences of each dataset are read in sequence to simulate the environment of a landmark melody stream. All the experiments are performed on a 1066-MHz Pentium III processor with 128 megabytes of main memory, running Microsoft Windows XP. In addition, all the programs are written in Microsoft Visual C++ 6.0.
4.2. Experimental results
In the first experiment, two primary factors, memory and execution time, are examined in the online mining of a landmark melody stream, since both should be bounded online as time advances. The synthetic landmark melody stream is partitioned into blocks of 50 K (50,000) melody sequences. In Fig. 12(a), the execution time grows smoothly as the dataset size increases: on average, algorithm MMSLMS takes about 12 s per block on dataset S10.I5 and about 25 s per block on dataset S30.I15. The memory usage in Fig. 12(b) for both synthetic datasets is stable as time progresses, indicating the feasibility of algorithm MMSLMS.
In the second experiment, we investigate the scalability and relative error of algorithm MMSLMS with respect to varying minimum supports. The relative error is defined as the difference between the measured support and the actual support, divided by the actual support. In Fig. 13(a), the execution time grows smoothly as the dataset size increases (assuming minsup = 0.01%), indicating linear scalability. Fig. 13(b) shows that the relative error decreases as minsup decreases, i.e., as the size of the CMB increases. Generally, the more frequent items are maintained in the CMB, the more accurate the mining result is.
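Written as a formula, the relative error of a melody structure X reads as follows. The symbols $\mathrm{sup}_{\mathrm{measured}}$ and $\mathrm{sup}_{\mathrm{actual}}$ are introduced here for illustration, and taking the absolute value of the difference is an assumption, since the text does not say whether the error is signed.

$$
\text{relative error}(X) = \frac{\bigl|\,\mathrm{sup}_{\mathrm{measured}}(X) - \mathrm{sup}_{\mathrm{actual}}(X)\,\bigr|}{\mathrm{sup}_{\mathrm{actual}}(X)}
$$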
Fig. 12. Required resources for synthetic datasets: (a) execution time and (b) memory.
5. Conclusions
In this paper, we proposed a single-pass algorithm, MMSLMS, to discover and maintain all maximal melody structures in a landmark model that contains all the melody sequences in a data stream. In the MMSLMS algorithm, an efficient in-memory summary data structure, MMSLMS-summary, is developed to record all maximal frequent structures in the current landmark model. In addition, MMSLMS uses a space-efficient scheme, the Chord-set Memory Border (CMB), to guarantee the upper bound of the space requirements of mining maximal melody sequences in a streaming environment. Theoretical analysis and experimental results with synthetic data show that the MMSLMS algorithm can meet the performance requirements of data stream mining: one-scan, bounded-space and real time. Future work includes online mining of maximal melody structures in count-based and time-based sliding windows that contain the most recent melody sequences in a data stream.
Acknowledgements
The authors thank the reviewers for their valuable comments, which improved the quality of the paper. The research is supported by the National Science Council of R.O.C. under grant no. NSC93-2213-E-009-043.
References
Agrawal, R., Srikant, R., 1994. Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases, pp. 487–499.
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002. Models and issues in data stream systems. In: Proceedings of 21st ACM Symposium on Principles of Database Systems, pp. 1–16.
Bakhmutora, V., Gusev, V.U., Titkova, T.N., 1997. The search for adaptations in song melodies. Computer Music Journal 21 (1), 58–67.
Fisher, M.J., Salzberg, S.L., 1982. Finding a majority among n votes: solution to problem 81-5. Journal of Algorithms 3 (4), 362–380.
Hsu, J.L., Liu, C.C., Chen, A.L.P., 2001. Discovering non-trivial repeating patterns in music data. IEEE Transactions on Multimedia 3 (3), 311–325.
Jones, G.T., 1974. Music Theory. Harper & Row, Publishers, New York.
Karp, R.M., Papadimitrious, C.H., Shanker, S., 2003. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems 28 (1), 51–55.
Shan, M.-K., Kuo, F.-F., 2003. Music style mining and classification by melody. IEICE Transactions on Information and Systems E86-D (4), 655–659.
Yoshitaka, A., Ichikawa, T., 1999. A survey on content-based retrieval for multimedia databases. IEEE Transactions on Knowledge and Data Engineering 11 (1), 81–93.
Zhu, Y., Kankanhalli, M.S., Xu, C., 2001. Pitch tracking and melody slope matching for song retrieval. In: Proceedings of the Second IEEE Pacific Rim Conference on Multimedia: Advances in Multimedia Information, pp. 530–537.