Chapter 4 Polyphonic Repeating Pattern Mining
4.4 Bit-String Approach
Frequency counting is the main performance bottleneck of our proposed algorithms. To reduce this cost, a bit-string approach is developed for counting frequency efficiently.
sd=<{A,E}, {B,D}, {A,B,C}, {B}, {A}, {B,C}, {A,C}>
<{A}> 1 0 1 0 1 0 1
<{B}> 0 1 1 1 0 1 0
<{C}> 0 0 1 0 0 1 1
<{A,C}> 0 0 1 0 0 0 1
<{A},{B},{A}> 1 0 1 0 1 0 0
Figure 4-11. An example of bit-string index
4.4.1 Bit-String Index
A set-sequence pattern sp in a set-sequence data sd can be represented by a bit string bs(sp). The length of bs(sp) is equal to the size of sd. The k-th bit of bs(sp) is 1 when sp has a k-position instance in sd; otherwise, the k-th bit is 0. Consequently, the total number of '1' bits in the bit string is equal to the frequency of the pattern. The formal definition of the bit-string index is given as follows.
Definition 4 (bit-string index) Given a set-sequence pattern sp = <p1, p2, …, pi> and a set-sequence data sd = <q1, q2, …, qj>, where size(sp) ≤ size(sd) and size(sd) = j, the bit string of sp in sd is bs(sp) = b1 b2 … bj, where bk = 1 if sp has a k-position instance in sd; otherwise, bk = 0.
An example of a bit-string index is given in Figure 4-11. The set-sequence pattern sp = <{A}, {B}, {A}> has k-position instances in sd at k = 1, 3, and 5; thus, bs(<{A}, {B}, {A}>) = 1010100.
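The index in Figure 4-11 can be reproduced with a short Python sketch. It assumes, as the figure suggests, that sp has a k-position instance when every set of sp is a subset of the corresponding set of sd over consecutive positions starting at k. The helper names `has_instance`, `bit_string`, and `frequency` are illustrative only and are not part of the proposed algorithms.

```python
def has_instance(sp, sd, k):
    """Return True if pattern sp matches sd starting at 0-based position k.

    Assumed semantics: each set of sp must be a subset of the set of sd
    at the same offset, over consecutive positions.
    """
    if k + len(sp) > len(sd):
        return False
    return all(p <= sd[k + t] for t, p in enumerate(sp))


def bit_string(sp, sd):
    """Build bs(sp): one bit per position of sd, '1' where an instance starts."""
    return "".join("1" if has_instance(sp, sd, k) else "0" for k in range(len(sd)))


def frequency(bs):
    """The frequency of a pattern is the number of '1' bits in its bit string."""
    return bs.count("1")


# Set-sequence data sd from Figure 4-11.
sd = [{"A", "E"}, {"B", "D"}, {"A", "B", "C"}, {"B"}, {"A"}, {"B", "C"}, {"A", "C"}]

bs_aba = bit_string([{"A"}, {"B"}, {"A"}], sd)
print(bs_aba, frequency(bs_aba))  # 1010100 3
```

Scanning sd once per pattern in this way is only needed for the initial index; longer patterns can then reuse the stored bit strings.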
4.4.2 Frequency Counting with Bit-String Operation
Once the bit strings are maintained, the frequency of an extended pattern can be counted efficiently by applying bit-string operations. We denote the bit string of a length-l pattern as $bs(sp^{l})$.
Bit-String Approach used in A-PRPD
In the proposed A-PRPD algorithm, the pattern-extend function is used for finding longer polyphonic repeating patterns. That is, a length-l candidate pattern $c\_sp^{l}$ is generated from two length-(l-1) patterns $sp1^{l-1}$ and $sp2^{l-1}$. As mentioned previously, their bit strings, $bs(sp1^{l-1})$ and $bs(sp2^{l-1})$, are already maintained. The pattern-extend function is performed when $sp1'^{l-1}$ ($sp1^{l-1}$ with its first element deleted) and $sp2'^{l-1}$ ($sp2^{l-1}$ with its last element deleted) are equal. Depending on the cardinality of the first set of $sp1^{l-1}$ (larger than 1 or equal to 1), the size of $sp1'^{l-1}$ and $sp2'^{l-1}$ is equal to $sp1^{l-1}$.size or $sp1^{l-1}$.size−1, respectively. When the cardinality of the first set of $sp1^{l-1}$ is larger than 1, the first set of $sp2^{l-1}$ is included in the first set of $sp1^{l-1}$; otherwise, the first set of $sp2^{l-1}$ is included in the second set of $sp1^{l-1}$. According to the pattern-extend operation, $c\_sp^{l}$ is the pattern in which the last element of $sp2^{l-1}$ is either added to the last set of $sp1^{l-1}$ or appended to $sp1^{l-1}$ as a new set. We discuss these two cases for designing the bit-string operation as follows.
Case 1: The first set of $sp1^{l-1}$ has cardinality 1, so the first set of $sp2^{l-1}$ is included in the second set of $sp1^{l-1}$. The extended pattern $sp^{l}$ then has the following property: the element i used to extend occurs at the size'-th position. We know that $bs(sp2^{l-1})$ records all k-position instances of $sp2^{l-1}$ in sd; for each such instance, the element i appears size' or size'−1 positions later. Because $sp2^{l-1}$ starts one set later than $sp1^{l-1}$ inside $sp^{l}$, $sp^{l}$ has a k-position instance exactly when $sp1^{l-1}$ has a k-position instance and $sp2^{l-1}$ has a (k+1)-position instance. Thus, $bs(sp^{l})$ can be derived by checking, for every k-position instance of $sp1^{l-1}$, whether the element i appears at the required position. This check can be accomplished with two hardware operations, bitwise AND (denoted by '&') and bitwise shift (denoted by LEFT_SHIFT), using the formula $bs(sp1^{l-1})$ & $\mathrm{LEFT\_SHIFT}_1(bs(sp2^{l-1}))$.
Case 2: Since the first set of $sp2^{l-1}$ is included in the first set of $sp1^{l-1}$, it is not necessary to shift $bs(sp2^{l-1})$ to align it with $bs(sp1^{l-1})$. Thus, $bs(sp^{l})$ can be derived by $bs(sp1^{l-1})$ & $bs(sp2^{l-1})$.
To summarize, the bit string of the length-l pattern, $bs(sp^{l})$, can be derived by the following formula.

$$bs(sp^{l}) = bs(sp1^{l-1}) \;\&\; \mathrm{LEFT\_SHIFT}_{i}\big(bs(sp2^{l-1})\big) \tag{8}$$

where i = 1 if the cardinality of the first set of $sp1^{l-1}$ is equal to 1; otherwise, i = 0.
For example, let $sp1^{3}$ = <{A}, {B}, {A}> and $sp2^{3}$ = <{B}, {A,C}>, with bit strings $bs(sp1^{3})$ = 1010100 and $bs(sp2^{3})$ = 0100010. After checking, these two length-3 patterns can be used to generate the candidate $sp^{4}$ = <{A}, {B}, {A,C}>, whose frequency must be checked. The frequency of $sp^{4}$ is calculated by bit-string operations as follows. First, we obtain $\mathrm{LEFT\_SHIFT}_1(bs(sp2^{3}))$ = 1000100. Next, we perform the AND operation on $bs(sp1^{3})$ = 1010100 and $\mathrm{LEFT\_SHIFT}_1(bs(sp2^{3}))$ = 1000100, which gives bs(<{A}, {B}, {A,C}>) = 1000100. Consequently, counting the '1' bits of the bit string gives a frequency of 2 for $sp^{4}$.
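The two cases and the worked example can be sketched in Python as follows. The sketch assumes that a bit string is stored as an integer whose most significant bit corresponds to position 1 of sd, so that LEFT_SHIFT becomes the native shift followed by a mask to the width of sd. The names `left_shift` and `extend_bit_string` are hypothetical, and the shift amount follows the cases above: one position when the first set of the first pattern has cardinality 1, none otherwise.

```python
def left_shift(bits, i, width):
    """Shift a fixed-width bit string left by i, dropping bits that overflow."""
    return (bits << i) & ((1 << width) - 1)


def extend_bit_string(bs_sp1, bs_sp2, first_set_cardinality, width):
    """Derive the bit string of the extended pattern from its two parents.

    Case 1 (cardinality == 1): align sp2 by shifting it one position left.
    Case 2 (cardinality > 1): both patterns start at the same position.
    """
    shift = 1 if first_set_cardinality == 1 else 0
    return bs_sp1 & left_shift(bs_sp2, shift, width)


width = 7                       # size of sd in Figure 4-11
bs_sp1 = int("1010100", 2)      # bs(<{A}, {B}, {A}>)
bs_sp2 = int("0100010", 2)      # bs(<{B}, {A,C}>)

# The first set of sp1 is {A}, so its cardinality is 1 (Case 1).
bs_new = extend_bit_string(bs_sp1, bs_sp2, first_set_cardinality=1, width=width)

print(format(bs_new, f"0{width}b"))   # 1000100
print(bin(bs_new).count("1"))         # 2
```

Storing the index as machine integers is one possible design choice; it keeps both the alignment and the intersection to a single shift and a single AND per candidate.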
Bit-String Approach in D-PRPD
In D-PRPD, a candidate pattern is generated from the pattern in the parent node and a length-1 pattern by the set-extend or sequence-extend operation. For ease of description, we discuss the sequence-extend operation first and then the set-extend operation.
(a) sequence-extend operation
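The sequence-extend operation extends a length-(l-1) pattern $sp^{l-1}$ by appending a size-1 pattern $sp^{1}$ = <{i}>. That is, the size-1 set is attached to the end of the pattern, so $sp^{l}$ has one more set, {i}, than $sp^{l-1}$, and this set appears at the m-th position. If we know which k-position instances of $sp^{l-1}$ in sd are followed by {i} at that position, that is, m−1 positions after k, then the k-position instances of $sp^{l}$ are derived. As we know, $bs(sp^{l-1})$ and $bs(sp^{1})$ record all position instances of $sp^{l-1}$ and $sp^{1}$, respectively. The bit string of the extended pattern is therefore obtained by shifting $bs(sp^{1})$ left by the position of the last set of $sp^{l-1}$ and performing an AND operation with $bs(sp^{l-1})$:

$$bs(sp'^{l}) = bs(sp^{l-1}) \;\&\; \mathrm{LEFT\_SHIFT}_{i}\big(bs(sp^{1})\big) \tag{9}$$

where i is the position of the last set of $sp^{l-1}$.

For example, given sp = <{A}, {B}, {A}> and $sp^{1}$ = <{B}>, their bit strings are $bs(sp^{l-1})$ = 1010100 and bs(<{B}>) = 0111010, and we apply sequence-extend(<{A}, {B}, {A}>, <{B}>) to derive bs(<{A}, {B}, {A}, {B}>). In this case, the position of the last set of sp is 3, so bs(<{B}>) has to be shifted left 3 positions, and we obtain $\mathrm{LEFT\_SHIFT}_3(bs(sp^{1}))$ = 1010000. Then, performing the AND operation between $bs(sp^{l-1})$ = 1010100 and $\mathrm{LEFT\_SHIFT}_3(bs(sp^{1}))$ = 1010000 derives bs(<{A}, {B}, {A}, {B}>) = 1010000. After counting the '1' bits in the bit string, the frequency of $sp'^{l}$ is 2.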
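The sequence-extend example can be checked with a minimal Python sketch. It assumes bit strings are stored as integers with position 1 of sd as the most significant bit, and it shifts bs(<{B}>) by the position of the last set of the parent pattern (3 in the example). The function names are hypothetical.

```python
def left_shift(bits, i, width):
    """Shift a fixed-width bit string left by i, dropping bits that overflow."""
    return (bits << i) & ((1 << width) - 1)


def sequence_extend_bs(bs_prev, bs_item, last_set_pos, width):
    """Bit string after appending a new size-1 set behind the parent pattern.

    The new set sits last_set_pos positions after the start of an instance,
    so the item's bit string is shifted by that amount before the AND.
    """
    return bs_prev & left_shift(bs_item, last_set_pos, width)


width = 7                      # size of sd in Figure 4-11
bs_prev = int("1010100", 2)    # bs(<{A}, {B}, {A}>)
bs_b = int("0111010", 2)       # bs(<{B}>)

bs_new = sequence_extend_bs(bs_prev, bs_b, last_set_pos=3, width=width)

print(format(bs_new, f"0{width}b"))   # 1010000
print(bin(bs_new).count("1"))         # 2
```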
(b) set-extend operation
Since $sp^{l-1}$ = <s1, …, sm-1> and $sp^{1}$ = <{i}>, set-extend($sp^{l-1}$, $sp^{1}$) yields $sp^{l}$ = <s1, …, sm-1 ∪ {i}>. In $sp^{l}$, the last set, which now includes i, is at position m−1, that is, m−2 positions after the first set. The element i is therefore one position earlier than the set {i} in sequence-extend. Thus, for the same reason, we check for every k-position instance recorded in $bs(sp^{l-1})$ whether $sp^{1}$ (i.e., {i}) appears m−2 positions later. Because, compared with the sequence-extend operation, $sp^{1}$ is added to the last set of $sp^{l-1}$ rather than appended after it, $bs(sp^{1})$ needs to be shifted left by one position fewer than in the sequence-extend operation.
In the set-extend operation, the formula is designed as Equation (10).

$$bs(sp'^{l}) = bs(sp^{l-1}) \;\&\; \mathrm{LEFT\_SHIFT}_{i-1}\big(bs(sp^{1})\big) \tag{10}$$

where i is the position of the last set of $sp^{l-1}$.
For example, given sp = <{A}, {B}, {A}> and $sp^{1}$ = <{C}>, their bit strings are $bs(sp^{l-1})$ = 1010100 and bs(<{C}>) = 0010011. We apply set-extend(<{A}, {B}, {A}>, <{C}>) to derive bs(<{A}, {B}, {A,C}>). In this case, the position of the last set of sp is 3, so bs(<{C}>) has to be shifted left 3−1 = 2 positions, and we obtain $\mathrm{LEFT\_SHIFT}_2(bs(sp^{1}))$ = 1001100. Then, performing the AND operation between $bs(sp^{l-1})$ = 1010100 and $\mathrm{LEFT\_SHIFT}_2(bs(sp^{1}))$ = 1001100 derives bs(<{A}, {B}, {A,C}>) = 1000100. After counting the '1' bits in the bit string, the frequency of $sp'^{l}$ is 2.
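As a sanity check on the set-extend formula, the following Python sketch derives bs(<{A}, {B}, {A,C}>) by shift-and-AND and compares it with a direct scan of sd. It takes bs(<{A}, {B}, {A}>) = 1010100 as listed in Figure 4-11, stores bit strings as integers with position 1 as the most significant bit, and assumes that an instance means a subset match over consecutive sets. The names `set_extend_bs` and `scan` are hypothetical.

```python
def left_shift(bits, i, width):
    """Shift a fixed-width bit string left by i, dropping bits that overflow."""
    return (bits << i) & ((1 << width) - 1)


def set_extend_bs(bs_prev, bs_item, last_set_pos, width):
    """Bit string after adding an element to the last set of the parent pattern.

    The element joins the set at position last_set_pos, which lies
    last_set_pos - 1 positions after the start of an instance.
    """
    return bs_prev & left_shift(bs_item, last_set_pos - 1, width)


def scan(sp, sd):
    """Reference bit string computed by scanning sd directly (assumed semantics)."""
    bits = 0
    for k in range(len(sd)):
        hit = k + len(sp) <= len(sd) and all(p <= sd[k + t] for t, p in enumerate(sp))
        bits = (bits << 1) | int(hit)
    return bits


sd = [{"A", "E"}, {"B", "D"}, {"A", "B", "C"}, {"B"}, {"A"}, {"B", "C"}, {"A", "C"}]
width = len(sd)

bs_prev = scan([{"A"}, {"B"}, {"A"}], sd)   # 1010100
bs_c = scan([{"C"}], sd)                    # 0010011

bs_new = set_extend_bs(bs_prev, bs_c, last_set_pos=3, width=width)

# The shortcut agrees with a full scan for the extended pattern.
assert bs_new == scan([{"A"}, {"B"}, {"A", "C"}], sd)
print(format(bs_new, f"0{width}b"), bin(bs_new).count("1"))  # 1000100 2
```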