Chapter 3 Bit Sequence Representation
In our approach, the information in the data sequence is stored in the form of bit sequences. With this representation, the frequencies of candidate patterns can be checked more efficiently during the mining process.
Section 3.1 introduces the appearing bit sequence and the bit index table. Sections 3.2 and 3.3 show how to apply the appearing bit sequences of patterns to compute the “frequency” of candidate patterns with fault tolerance quickly.
3.1 Appearing Bit Sequences
For each kind of data item N in the data sequence, N has a corresponding appearing bit sequence (denoted as $\mathit{Appear}_{N}$). The length of each appearing bit sequence equals the length of the data sequence. The leftmost bit is numbered as bit 1 and the numbering increases toward the rightmost bit. If a data item appears at the i-th position of the data sequence, bit i in the appearing bit sequence of this data item is set to 1; otherwise, it is set to 0. A bit index table stores the appearing bit sequences of all the data items in the data sequence. Take DSeq = ABCDABCACDEEABCCDEAC as an example. Its corresponding bit index table is shown in Table 3.1.
Table 3.1: The bit index table of “ABCDABCACDEEABCCDEAC”
Data Item    Appearing Bit Sequence ($\mathit{Appear}_{N}$)
A            10001001000010000010
B            01000100000001000000
C            00100010100000110001
D            00010000010000001000
E            00000000001100000100
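The construction of the bit index table can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_bit_index` and the choice of representing each appearing bit sequence as a string of '0'/'1' characters are assumptions made here, not part of the original method description.

```python
def build_bit_index(dseq):
    """Map each distinct data item to its appearing bit sequence (a '0'/'1' string)."""
    table = {}
    for item in sorted(set(dseq)):
        # Bit i (1-based, leftmost first) is 1 iff the item occurs at position i.
        table[item] = "".join("1" if ch == item else "0" for ch in dseq)
    return table


DSEQ = "ABCDABCACDEEABCCDEAC"
table = build_bit_index(DSEQ)
for item, bits in table.items():
    # The frequency of a single item is the number of 1 bits in its sequence.
    print(item, bits, bits.count("1"))
```

Each row printed by the loop corresponds to one row of Table 3.1, followed by the number of 1 bits in that row.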
From the bit index table, the number of bits with value 1 in $\mathit{Appear}_{N}$ for a data item N equals freq(N) in DSeq. Therefore, the frequency of a data item can be obtained without scanning the data sequence repeatedly. The idea also applies to longer patterns in a data sequence, as the following example shows.
Example 3.1 The bit index table of “ABCDABCACDEEABCCDEAC” is given in Table 3.1.
(1) Suppose we would like to get $\mathit{Appear}_{AB}$. If bit i in $\mathit{Appear}_{AB}$ has value 1, it denotes a position where “AB” appears. Therefore, “A” must appear at position i and the next position i+1 must contain “B”. The positions where “A” and “B” appear are stored in $\mathit{Appear}_{A}$ and $\mathit{Appear}_{B}$, respectively, so we can use the following steps to get $\mathit{Appear}_{AB}$.
Step 1. Get the appearing bit sequence of the data item “B” from Table 3.1: $\mathit{Appear}_{B}$ = 01000100000001000000.
Step 2. Perform a left shift of one (= |AB| − 1) bit on $\mathit{Appear}_{B}$ (shift bit (i+1) to bit i, where 1 ≤ i ≤ 19, and set bit 20 to 0): $\mathit{l\_shift}(\mathit{Appear}_{B}, 1)$ = 10001000000010000000.
Step 3. Perform an and operation on the result of Step 2 and $\mathit{Appear}_{A}$ to get $\mathit{Appear}_{AB}$. The resultant bit sequence is $\mathit{Appear}_{A} \wedge \mathit{l\_shift}(\mathit{Appear}_{B}, 1)$ = 10001000000010000000.
(2) After getting $\mathit{Appear}_{AB}$, suppose we would like to get $\mathit{Appear}_{ABC}$. When “AB” appears at position i and “C” appears at position i+2, pattern “ABC” appears at position i. Therefore, we can get $\mathit{Appear}_{ABC}$ by performing the following steps.
Step 1. Obtain the appearing bit sequence of the last data item “C” from Table 3.1: $\mathit{Appear}_{C}$ = 00100010100000110001.
Step 2. Perform a left shift of two (= |ABC| − 1) bits on $\mathit{Appear}_{C}$: $\mathit{l\_shift}(\mathit{Appear}_{C}, 2)$ = 10001010000011000100.
Step 3. Perform an and operation on the result of Step 2 and $\mathit{Appear}_{AB}$ to get $\mathit{Appear}_{ABC}$. The resultant bit sequence is $\mathit{Appear}_{AB} \wedge \mathit{l\_shift}(\mathit{Appear}_{C}, 2)$ = 10001000000010000000.
The bits with value 1 in $\mathit{Appear}_{AB}$ mark the positions where “AB” begins to appear. Therefore, the number of these bits equals the frequency of “AB” in DSeq, that is, 3. Similarly, from $\mathit{Appear}_{ABC}$, the frequency of “ABC” in DSeq, 3, can be obtained efficiently.
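The steps of Example 3.1 can be replayed with a small Python sketch. Bit sequences are kept as '0'/'1' strings so that they read like Table 3.1; the helper names `l_shift` and `bit_and` are illustrative choices, and the string representation is an assumption made for readability rather than efficiency.

```python
def l_shift(bits, n, fill="0"):
    """Shift a bit string left by n positions, padding the right end with `fill`."""
    return bits[n:] + fill * min(n, len(bits))


def bit_and(a, b):
    """Bitwise and of two equally long bit strings."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


appear_a = "10001001000010000010"
appear_b = "01000100000001000000"
appear_c = "00100010100000110001"

# Part (1): "B" must sit one position to the right of "A".
appear_ab = bit_and(appear_a, l_shift(appear_b, 1))
# Part (2): "C" must sit two positions to the right of where "AB" starts.
appear_abc = bit_and(appear_ab, l_shift(appear_c, 2))

print(appear_ab, appear_ab.count("1"))
print(appear_abc, appear_abc.count("1"))
```

Counting the 1 characters in each result gives the frequency of the corresponding pattern in DSeq.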
Observing Example 3.1 carefully, we obtain the following rule for getting $\mathit{Appear}_{P}$ for a pattern P. Suppose a pattern $P = P_1P_2 \ldots P_m$ ($m \ge 2$) is given, where $P_i$ ($i = 1, \ldots, m$) is a data item. Let $P' = P_1P_2 \ldots P_{m-1}$ and let $P_m$ be denoted by X. Then $\mathit{Appear}_{P}$ can be deduced from $\mathit{Appear}_{P'}$ and $\mathit{Appear}_{X}$ according to the following recursive formula, where $\mathit{l\_shift}(b, n, c)$ means to shift b left by n bits and to fill the vacated rightmost bits of b with the constant c (c = 0 or 1). If the parameter c is omitted, its default value is 0. When |P| = 1, $\mathit{Appear}_{P}$ is read directly from the bit index table.
$$\mathit{Appear}_{P} = \begin{cases} \mathit{Appear}_{P}, & \text{if } |P| = 1, \\ \mathit{Appear}_{P'} \wedge \mathit{l\_shift}(\mathit{Appear}_{X}, |P| - 1), & \text{otherwise.} \end{cases} \tag{3.1}$$
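A minimal sketch of formula (3.1) follows, this time with each appearing bit sequence packed into a Python integer whose most significant of |DSeq| bits stands for position 1. The recursion is unrolled into a loop over the prefixes of the pattern. The names `bit_index`, `l_shift`, and `appear` and the integer encoding are assumptions of this sketch.

```python
def bit_index(dseq):
    """Bit index table with each appearing bit sequence packed into an integer."""
    n = len(dseq)
    table = {}
    for pos, item in enumerate(dseq):  # pos is 0-based
        # Position 1 of the data sequence maps to the most significant of n bits.
        table[item] = table.get(item, 0) | (1 << (n - 1 - pos))
    return table


def l_shift(b, k, n, c=0):
    """Left-shift an n-bit value by k bits, filling vacated right bits with c."""
    mask = (1 << n) - 1
    fill = (1 << k) - 1 if c else 0
    return ((b << k) | fill) & mask


def appear(pattern, table, n):
    """Formula (3.1), applied prefix by prefix."""
    result = table.get(pattern[0], 0)
    for offset in range(1, len(pattern)):
        # For the prefix of length offset + 1, X = pattern[offset] and |P| - 1 = offset.
        result &= l_shift(table.get(pattern[offset], 0), offset, n)
    return result


DSEQ = "ABCDABCACDEEABCCDEAC"
N = len(DSEQ)
TABLE = bit_index(DSEQ)
bits = appear("ABC", TABLE, N)
print(format(bits, "0{}b".format(N)), bin(bits).count("1"))
```

Packing the bits into integers lets each shift, and, and or act on the whole sequence in one operation, which is the efficiency argument made in this section.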
3.2 Appearing Bit Sequences of Insertion Fault Tolerance
By extending the representation of appearing bit sequences, fault-tolerant appearing bit sequences are designed to represent the appearing positions of a pattern with fault tolerance. Given a fault tolerance δ ($\delta_I$ or $\delta_D$), the fault-tolerant appearing bit sequence of a pattern P in a data sequence, denoted as $\text{FT-}\mathit{Appear}^{+}_{P}(\delta)$ / $\text{FT-}\mathit{Appear}^{-}_{P}(\delta)$, represents the positions where the data sequence IFT-/DFT-contains P under fault tolerance δ. The methods for getting $\text{FT-}\mathit{Appear}^{+}_{P}(\delta_I)$ and $\text{FT-}\mathit{Appear}^{-}_{P}(\delta_D)$ systematically are described in this section and the next section, respectively.
Considering insertion fault tolerance, we define the appearing bit sequence of a pattern P with E insertion errors, denoted as $\mathit{Appear}^{+}_{P}(E)$. The bits with value 1 in $\mathit{Appear}^{+}_{P}(E)$ represent the positions where the data sequence FT-contains P with E insertion errors. According to [Def. 2.4], there are ($\delta_I$ + 1) situations in which a pattern P IFT-appears in DSeq under fault tolerance $\delta_I$; that is, DSeq FT-contains P with 0, 1, 2, …, $\delta_I$ insertion errors. In other words, $\text{FT-}\mathit{Appear}^{+}_{P}(\delta_I)$ can be obtained by performing $\delta_I$ or operations on the ($\delta_I$ + 1) appearing bit sequences $\mathit{Appear}^{+}_{P}(0)$, $\mathit{Appear}^{+}_{P}(1)$, $\mathit{Appear}^{+}_{P}(2)$, …, and $\mathit{Appear}^{+}_{P}(\delta_I)$. Figure 3.1 illustrates the relationship between $\text{FT-}\mathit{Appear}^{+}_{P}(\delta_I)$ and $\mathit{Appear}^{+}_{P}(E)$ when $\delta_I$ = 2 for the example patterns “A”, “AB”, “ABC”, “ABCD”, and “ABCDE”.
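P = A: $\text{FT-}\mathit{Appear}^{+}_{A}(2) = \mathit{Appear}^{+}_{A}(0) \vee \mathit{Appear}^{+}_{A}(1) \vee \mathit{Appear}^{+}_{A}(2)$
P = AB: $\text{FT-}\mathit{Appear}^{+}_{AB}(2) = \mathit{Appear}^{+}_{AB}(0) \vee \mathit{Appear}^{+}_{AB}(1) \vee \mathit{Appear}^{+}_{AB}(2)$
P = ABC: $\text{FT-}\mathit{Appear}^{+}_{ABC}(2) = \mathit{Appear}^{+}_{ABC}(0) \vee \mathit{Appear}^{+}_{ABC}(1) \vee \mathit{Appear}^{+}_{ABC}(2)$
P = ABCD: $\text{FT-}\mathit{Appear}^{+}_{ABCD}(2) = \mathit{Appear}^{+}_{ABCD}(0) \vee \mathit{Appear}^{+}_{ABCD}(1) \vee \mathit{Appear}^{+}_{ABCD}(2)$
P = ABCDE: $\text{FT-}\mathit{Appear}^{+}_{ABCDE}(2) = \mathit{Appear}^{+}_{ABCDE}(0) \vee \mathit{Appear}^{+}_{ABCDE}(1) \vee \mathit{Appear}^{+}_{ABCDE}(2)$
Figure 3.1 Example of the relationship between $\text{FT-}\mathit{Appear}^{+}_{P}(\delta_I)$ and $\mathit{Appear}^{+}_{P}(E)$
In Figure 3.1, note that $\text{FT-}\mathit{Appear}^{+}_{A}(2)$ equals $\mathit{Appear}^{+}_{A}(0)$ directly, because it is not possible to insert 1 or 2 data items in the “middle” of pattern “A” to get a sub-pattern of the data sequence. Therefore, we can conclude the following rule.
[Rule 3.1] Whenever a pattern P with |P| = 1 and an insertion fault tolerance $\delta_I$ are given,
$$\forall\, 1 \le E \le \delta_I,\quad \mathit{Appear}^{+}_{P}(E) = 0 \quad \text{(represented by } |DSeq| \text{ bits of 0s)} \tag{3.2}$$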
The remaining problem is how to get each $\mathit{Appear}^{+}_{P}(E)$, where |P| > 1 and 0 ≤ E ≤ $\delta_I$. Since $\mathit{Appear}^{+}_{P}(0)$ represents the locations where DSeq FT-contains P with zero insertion errors, it carries the same information as $\mathit{Appear}_{P}$. Therefore, $\mathit{Appear}^{+}_{P}(0)$ is obtained in the same way as $\mathit{Appear}_{P}$ (described in the previous section).
When 1 ≤ E ≤ $\delta_I$, $\mathit{Appear}^{+}_{P}(E)$ can also be obtained by performing left shift and and operations on the appearing bit sequences of sub-patterns of P, according to the following lemma.
[Lemma 3.1]
Given a pattern $P = P_1P_2 \ldots P_m$, where $P_i$ ($i = 1, \ldots, m$) is a data item, let P′ denote the sub-pattern $P_1P_2 \ldots P_{m-1}$ of P and X denote $P_m$. DSeq FT-contains pattern P with E insertion errors at position i iff DSeq FT-contains pattern P′ with k insertion errors at position i (0 ≤ k ≤ E) and X appears at position i + (|P| + E) − 1.
Proof. P′ appears in DSeq from position i to (i + |P′| − 1) + k (with k insertion errors), so there exist E − k insertion errors between P′ and X. It implies that X must appear at position i + (|P| + E) − 1, because
$$(i + |P'| - 1 + k) + (E - k) + 1 = i + (|P'| + 1) + E - 1 = i + (|P| + E) - 1 \quad (\because\ |P'| + 1 = |P|). \qquad \#$$
In other words, X must appear at the (|P| + E − 1)-th position to the right of position i. Therefore, the way of getting $\mathit{Appear}^{+}_{P}(E)$ can be expressed as the following recursive formula for 0 < E ≤ $\delta_I$.
$$\mathit{Appear}^{+}_{P}(E) = \begin{cases} 0 \ \text{(represented by } |DSeq| \text{ bits of 0s)}, & \text{if } |P| = 1, \\ \left( \bigvee_{k=0}^{E} \mathit{Appear}^{+}_{P'}(k) \right) \wedge \mathit{l\_shift}(\mathit{Appear}_{X}, |P| + E - 1), & \text{otherwise.} \end{cases} \tag{3.3}$$
To combine Formulas (3.1) and (3.3), a recursive function for getting $\mathit{Appear}^{+}_{P}(E)$, where 0 ≤ E ≤ $\delta_I$, is defined as follows.
[Def. 3.1] (Recursive function for getting $\mathit{Appear}^{+}_{P}(E)$) Suppose a pattern $P = P_1P_2 \ldots P_m$ is given. Let P′ denote the sub-pattern $P_1P_2 \ldots P_{m-1}$ of P and X denote the last data item $P_m$. When an insertion fault tolerance $\delta_I$ is given, $\mathit{Appear}^{+}_{P}(E)$, where 0 ≤ E ≤ $\delta_I$, is obtained from the following recursive function.
If |P| = 1, then
    $\mathit{Appear}^{+}_{P}(0) = \mathit{Appear}_{P}$;
    $\forall\, 1 \le E \le \delta_I$, $\mathit{Appear}^{+}_{P}(E) = 0$ (represented by |DSeq| bits of 0s);
Else
    $\mathit{Appear}^{+}_{P}(E) = \left( \bigvee_{k=0}^{E} \mathit{Appear}^{+}_{P'}(k) \right) \wedge \mathit{l\_shift}(\mathit{Appear}_{X}, |P| + E - 1)$
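The recursive function of [Def. 3.1] translates almost line by line into code. The sketch below assumes the integer encoding in which position 1 is the most significant of |DSeq| bits; the names `appear_item`, `appear_plus`, and `to_bits` are hypothetical, and memoization through `functools.lru_cache` is an implementation convenience added here, not something the definition prescribes.

```python
from functools import lru_cache

DSEQ = "ABCDABCACDEEABCCDEAC"
N = len(DSEQ)
MASK = (1 << N) - 1


def appear_item(item):
    """Appearing bit sequence of one data item; position 1 is the most significant bit."""
    return sum(1 << (N - 1 - i) for i, ch in enumerate(DSEQ) if ch == item)


def l_shift(b, k):
    """Left shift by k bits within N bits, filling with 0."""
    return (b << k) & MASK


@lru_cache(maxsize=None)
def appear_plus(pattern, e):
    """Appearing bit sequence of `pattern` with exactly e insertion errors."""
    if len(pattern) == 1:
        # Base case: a single item cannot contain insertion errors.
        return appear_item(pattern) if e == 0 else 0
    prefix, last = pattern[:-1], pattern[-1]
    merged = 0
    for k in range(e + 1):
        merged |= appear_plus(prefix, k)  # or over k = 0..e
    return merged & l_shift(appear_item(last), len(pattern) + e - 1)


def to_bits(b):
    return format(b, "0{}b".format(N))


print(to_bits(appear_plus("AB", 1)))
print(to_bits(appear_plus("ABC", 1)))
```

Without memoization the same sub-results would be recomputed for every value of k, which is the duplication that motivates the modified function given later in this section.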
Example 3.2 Given the bit index table of “ABCDABCACDEEABCCDEAC” shown in Table 3.1 and assuming $\delta_I$ = 1, the process of getting $\mathit{Appear}^{+}_{AB}(1)$ and $\mathit{Appear}^{+}_{ABC}(1)$ is as follows.
(1) $\mathit{Appear}^{+}_{AB}(1)$
Step 1. Get $\mathit{Appear}_{B}$ = 01000100000001000000 from the bit index table.
Step 2. Perform an or operation on $\mathit{Appear}^{+}_{A}(0)$ and $\mathit{Appear}^{+}_{A}(1)$. According to formula (3.2), $\mathit{Appear}^{+}_{A}(1) = 0$, and $\mathit{Appear}^{+}_{A}(0) = \mathit{Appear}_{A}$. The result of the or operation is assigned to the temporary variable s: $s = \mathit{Appear}^{+}_{A}(0) \vee \mathit{Appear}^{+}_{A}(1)$ = 10001001000010000010.
Step 3. Perform a left shift of two (= |AB| + 1 − 1) bits on $\mathit{Appear}_{B}$: $t = \mathit{l\_shift}(\mathit{Appear}_{B}, 2)$ = 00010000000100000000.
Step 4. Perform an and operation on s and t to get $\mathit{Appear}^{+}_{AB}(1)$. The resultant bit sequence is $s \wedge t$ = 00000000000000000000.
(2) $\mathit{Appear}^{+}_{ABC}(1)$
Step 1. Get $\mathit{Appear}_{C}$ = 00100010100000110001 from the bit index table.
Step 2. Perform an or operation on $\mathit{Appear}^{+}_{AB}(0)$ and $\mathit{Appear}^{+}_{AB}(1)$, and assign the result to s. Since $\mathit{Appear}^{+}_{AB}(0)$ is obtained from formula (3.1) and $\mathit{Appear}^{+}_{AB}(1)$ is known from part (1) of this example, $s = \mathit{Appear}^{+}_{AB}(0) \vee \mathit{Appear}^{+}_{AB}(1)$ = 10001000000010000000.
Step 3. Perform a left shift of three (= |ABC| + 1 − 1) bits on $\mathit{Appear}_{C}$: $t = \mathit{l\_shift}(\mathit{Appear}_{C}, 3)$ = 00010100000110001000.
Step 4. Perform an and operation on s and t to get $\mathit{Appear}^{+}_{ABC}(1)$. The resultant bit sequence is $s \wedge t$ = 00000000000010000000.
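Figure 3.2 illustrates the process of getting $\mathit{Appear}^{+}_{P}(E)$ recursively for a sample pattern “ABCDE”, where $\delta_I$ = 2.
P = A:
$\mathit{Appear}^{+}_{A}(0) = \mathit{Appear}_{A}$
$\mathit{Appear}^{+}_{A}(1) = 000 \ldots 000$
$\mathit{Appear}^{+}_{A}(2) = 000 \ldots 000$
P = AB:
$\mathit{Appear}^{+}_{AB}(0) = \mathit{Appear}^{+}_{A}(0) \wedge \mathit{l\_shift}(\mathit{Appear}_{B}, 1)$
$\mathit{Appear}^{+}_{AB}(1) = (\mathit{Appear}^{+}_{A}(0) \vee \mathit{Appear}^{+}_{A}(1)) \wedge \mathit{l\_shift}(\mathit{Appear}_{B}, 2)$
$\mathit{Appear}^{+}_{AB}(2) = (\mathit{Appear}^{+}_{A}(0) \vee \mathit{Appear}^{+}_{A}(1) \vee \mathit{Appear}^{+}_{A}(2)) \wedge \mathit{l\_shift}(\mathit{Appear}_{B}, 3)$
P = ABC:
$\mathit{Appear}^{+}_{ABC}(0) = \mathit{Appear}^{+}_{AB}(0) \wedge \mathit{l\_shift}(\mathit{Appear}_{C}, 2)$
$\mathit{Appear}^{+}_{ABC}(1) = (\mathit{Appear}^{+}_{AB}(0) \vee \mathit{Appear}^{+}_{AB}(1)) \wedge \mathit{l\_shift}(\mathit{Appear}_{C}, 3)$
$\mathit{Appear}^{+}_{ABC}(2) = (\mathit{Appear}^{+}_{AB}(0) \vee \mathit{Appear}^{+}_{AB}(1) \vee \mathit{Appear}^{+}_{AB}(2)) \wedge \mathit{l\_shift}(\mathit{Appear}_{C}, 4)$
P = ABCD:
$\mathit{Appear}^{+}_{ABCD}(0) = \mathit{Appear}^{+}_{ABC}(0) \wedge \mathit{l\_shift}(\mathit{Appear}_{D}, 3)$
$\mathit{Appear}^{+}_{ABCD}(1) = (\mathit{Appear}^{+}_{ABC}(0) \vee \mathit{Appear}^{+}_{ABC}(1)) \wedge \mathit{l\_shift}(\mathit{Appear}_{D}, 4)$
$\mathit{Appear}^{+}_{ABCD}(2) = (\mathit{Appear}^{+}_{ABC}(0) \vee \mathit{Appear}^{+}_{ABC}(1) \vee \mathit{Appear}^{+}_{ABC}(2)) \wedge \mathit{l\_shift}(\mathit{Appear}_{D}, 5)$
P = ABCDE:
$\mathit{Appear}^{+}_{ABCDE}(0) = \mathit{Appear}^{+}_{ABCD}(0) \wedge \mathit{l\_shift}(\mathit{Appear}_{E}, 4)$
$\mathit{Appear}^{+}_{ABCDE}(1) = (\mathit{Appear}^{+}_{ABCD}(0) \vee \mathit{Appear}^{+}_{ABCD}(1)) \wedge \mathit{l\_shift}(\mathit{Appear}_{E}, 5)$
$\mathit{Appear}^{+}_{ABCDE}(2) = (\mathit{Appear}^{+}_{ABCD}(0) \vee \mathit{Appear}^{+}_{ABCD}(1) \vee \mathit{Appear}^{+}_{ABCD}(2)) \wedge \mathit{l\_shift}(\mathit{Appear}_{E}, 6)$
Figure 3.2 Example of the way of getting $\mathit{Appear}^{+}_{ABCDE}(E)$ recursively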
Finally, $\text{FT-}\mathit{Appear}^{+}_{P}(\delta_I)$ can be obtained by performing $\bigvee_{i=0}^{\delta_I} \mathit{Appear}^{+}_{P}(i)$. $\text{FT-}\mathit{freq}_{DSeq}(P)$ equals the number of bits with value 1 in $\text{FT-}\mathit{Appear}^{+}_{P}(\delta_I)$. Therefore, the insertion fault-tolerant frequency of a pattern P can be counted efficiently to evaluate whether P is an FT-RP or not.
Example 3.3 Following the results of Example 3.1 and Example 3.2, $\text{FT-}\mathit{Appear}^{+}_{ABC}(1) = \mathit{Appear}^{+}_{ABC}(0) \vee \mathit{Appear}^{+}_{ABC}(1)$ = 10001000000010000000; thus, under insertion fault tolerance 1, $\text{FT-}\mathit{freq}_{DSeq}(\text{“ABC”}) = 3$.
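The final or step and the frequency count can be sketched as follows. The two input bit sequences are the values of the appearing bit sequences of “ABC” with 0 and 1 insertion errors taken from Examples 3.1 and 3.2; the function names `ft_appear_plus` and `ft_freq` are illustrative assumptions.

```python
from functools import reduce
from operator import or_


def ft_appear_plus(appear_plus_by_errors):
    """Or together the appearing bit sequences for 0, 1, ..., delta_I insertion errors."""
    return reduce(or_, appear_plus_by_errors, 0)


def ft_freq(ft_bits):
    """Fault-tolerant frequency: the number of 1 bits in the merged sequence."""
    return bin(ft_bits).count("1")


appear_abc_0 = int("10001000000010000000", 2)  # from Example 3.1
appear_abc_1 = int("00000000000010000000", 2)  # from Example 3.2
ft = ft_appear_plus([appear_abc_0, appear_abc_1])
print(format(ft, "020b"), ft_freq(ft))
```

Note that position 13 is set in both inputs, so it is counted only once after the or operation.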
To avoid duplicated or and left shift operations when computing $\mathit{Appear}^{+}_{P}(E)$ for various E, we modify the recursive function for getting $\mathit{Appear}^{+}_{P}(E)$ to expose the recurrence relations between the temporary variables used for computing $\mathit{Appear}^{+}_{P}(E)$ and $\mathit{Appear}^{+}_{P}(E-1)$.
[Def. 3.2] (Modified recursive function for getting $\mathit{Appear}^{+}_{P}(E)$) Given a pattern $P = P_1P_2 \ldots P_m$, let P′ denote the sub-pattern $P_1P_2 \ldots P_{m-1}$ of P and X denote the last data item $P_m$. When an insertion fault tolerance $\delta_I$ is given, $\mathit{Appear}^{+}_{P}(E)$, where 0 ≤ E ≤ $\delta_I$, is obtained from the following recursive function.
If |P| = 1, then
    $\mathit{Appear}^{+}_{P}(0) = \mathit{Appear}_{P}$;
    $\forall\, 1 \le E \le \delta_I$, $\mathit{Appear}^{+}_{P}(E) = 0$ (represented by |DSeq| bits of 0s);
Else
    If E = 0, then
        $\mathit{temp}_1(E) = \mathit{Appear}^{+}_{P'}(0)$;
        $\mathit{temp}_2(E) = \mathit{l\_shift}(\mathit{Appear}_{X}, |P| - 1)$;
    Else
        $\mathit{temp}_1(E) = \mathit{temp}_1(E-1) \vee \mathit{Appear}^{+}_{P'}(E)$;
        $\mathit{temp}_2(E) = \mathit{l\_shift}(\mathit{temp}_2(E-1), 1)$;
    $\mathit{Appear}^{+}_{P}(E) = \mathit{temp}_1(E) \wedge \mathit{temp}_2(E)$
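The incremental scheme of [Def. 3.2] can be sketched as a bottom-up loop: for each prefix of the pattern, `temp1` accumulates one more or and `temp2` one more 1-bit shift per additional error. The function name `appear_plus_all`, the dictionary layout of the result, and the integer encoding (position 1 as the most significant bit) are assumptions of this sketch.

```python
def appear_plus_all(dseq, pattern, delta_i):
    """Return {prefix: [sequence for E = 0, ..., sequence for E = delta_i]}."""
    n = len(dseq)
    mask = (1 << n) - 1

    def appear_item(item):
        return sum(1 << (n - 1 - i) for i, ch in enumerate(dseq) if ch == item)

    # Base case |P| = 1: only E = 0 can be non-zero (Rule 3.1).
    rows = {pattern[0]: [appear_item(pattern[0])] + [0] * delta_i}
    for m in range(2, len(pattern) + 1):
        prefix, parent, last = pattern[:m], pattern[:m - 1], pattern[m - 1]
        row = []
        temp1 = temp2 = 0
        for e in range(delta_i + 1):
            if e == 0:
                temp1 = rows[parent][0]
                temp2 = (appear_item(last) << (m - 1)) & mask
            else:
                temp1 |= rows[parent][e]     # reuse temp1(E-1): one extra or
                temp2 = (temp2 << 1) & mask  # reuse temp2(E-1): one extra 1-bit shift
            row.append(temp1 & temp2)
        rows[prefix] = row
    return rows


rows = appear_plus_all("ABCDABCACDEEABCCDEAC", "ABC", 1)
print([format(b, "020b") for b in rows["ABC"]])
```

Compared with evaluating the recursive function of [Def. 3.1] separately for each E, each additional error level here costs one or, one shift, and one and per prefix.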
Figure 3.3 shows the results after modifying Figure 3.2 based on [Def. 3.2].
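P = A:
$\mathit{Appear}^{+}_{A}(0) = \mathit{Appear}_{A}$
$\mathit{Appear}^{+}_{A}(1) = 000 \ldots 000$
$\mathit{Appear}^{+}_{A}(2) = 000 \ldots 000$
P = AB:
$\mathit{temp}_1(0) = \mathit{Appear}^{+}_{A}(0)$
$\mathit{temp}_2(0) = \mathit{l\_shift}(\mathit{Appear}_{B}, 1)$
$\mathit{Appear}^{+}_{AB}(0) = \mathit{temp}_1(0) \wedge \mathit{temp}_2(0)$
$\mathit{temp}_1(1) = \mathit{temp}_1(0) \vee \mathit{Appear}^{+}_{A}(1)$
$\mathit{temp}_2(1) = \mathit{l\_shift}(\mathit{temp}_2(0), 1)$
$\mathit{Appear}^{+}_{AB}(1) = \mathit{temp}_1(1) \wedge \mathit{temp}_2(1)$
$\mathit{temp}_1(2) = \mathit{temp}_1(1) \vee \mathit{Appear}^{+}_{A}(2)$
$\mathit{temp}_2(2) = \mathit{l\_shift}(\mathit{temp}_2(1), 1)$
$\mathit{Appear}^{+}_{AB}(2) = \mathit{temp}_1(2) \wedge \mathit{temp}_2(2)$
P = ABC:
$\mathit{temp}_1(0) = \mathit{Appear}^{+}_{AB}(0)$
$\mathit{temp}_2(0) = \mathit{l\_shift}(\mathit{Appear}_{C}, 2)$
$\mathit{Appear}^{+}_{ABC}(0) = \mathit{temp}_1(0) \wedge \mathit{temp}_2(0)$
$\mathit{temp}_1(1) = \mathit{temp}_1(0) \vee \mathit{Appear}^{+}_{AB}(1)$
$\mathit{temp}_2(1) = \mathit{l\_shift}(\mathit{temp}_2(0), 1)$
$\mathit{Appear}^{+}_{ABC}(1) = \mathit{temp}_1(1) \wedge \mathit{temp}_2(1)$
$\mathit{temp}_1(2) = \mathit{temp}_1(1) \vee \mathit{Appear}^{+}_{AB}(2)$
$\mathit{temp}_2(2) = \mathit{l\_shift}(\mathit{temp}_2(1), 1)$
$\mathit{Appear}^{+}_{ABC}(2) = \mathit{temp}_1(2) \wedge \mathit{temp}_2(2)$
P = ABCD:
$\mathit{temp}_1(0) = \mathit{Appear}^{+}_{ABC}(0)$
$\mathit{temp}_2(0) = \mathit{l\_shift}(\mathit{Appear}_{D}, 3)$
$\mathit{Appear}^{+}_{ABCD}(0) = \mathit{temp}_1(0) \wedge \mathit{temp}_2(0)$
$\mathit{temp}_1(1) = \mathit{temp}_1(0) \vee \mathit{Appear}^{+}_{ABC}(1)$
$\mathit{temp}_2(1) = \mathit{l\_shift}(\mathit{temp}_2(0), 1)$
$\mathit{Appear}^{+}_{ABCD}(1) = \mathit{temp}_1(1) \wedge \mathit{temp}_2(1)$
$\mathit{temp}_1(2) = \mathit{temp}_1(1) \vee \mathit{Appear}^{+}_{ABC}(2)$
$\mathit{temp}_2(2) = \mathit{l\_shift}(\mathit{temp}_2(1), 1)$
$\mathit{Appear}^{+}_{ABCD}(2) = \mathit{temp}_1(2) \wedge \mathit{temp}_2(2)$
P = ABCDE:
$\mathit{temp}_1(0) = \mathit{Appear}^{+}_{ABCD}(0)$
$\mathit{temp}_2(0) = \mathit{l\_shift}(\mathit{Appear}_{E}, 4)$
$\mathit{Appear}^{+}_{ABCDE}(0) = \mathit{temp}_1(0) \wedge \mathit{temp}_2(0)$
$\mathit{temp}_1(1) = \mathit{temp}_1(0) \vee \mathit{Appear}^{+}_{ABCD}(1)$
$\mathit{temp}_2(1) = \mathit{l\_shift}(\mathit{temp}_2(0), 1)$
$\mathit{Appear}^{+}_{ABCDE}(1) = \mathit{temp}_1(1) \wedge \mathit{temp}_2(1)$
$\mathit{temp}_1(2) = \mathit{temp}_1(1) \vee \mathit{Appear}^{+}_{ABCD}(2)$
$\mathit{temp}_2(2) = \mathit{l\_shift}(\mathit{temp}_2(1), 1)$
$\mathit{Appear}^{+}_{ABCDE}(2) = \mathit{temp}_1(2) \wedge \mathit{temp}_2(2)$
Figure 3.3 Modification of Figure 3.2
3.3 Appearing Bit Sequences of Deletion Fault Tolerance
Similar to the insertion fault tolerance, we define the appearing bit sequence of a pattern P with E deletion errors, denoted as $\mathit{Appear}^{-}_{P}(E)$. The bits with value 1 in $\mathit{Appear}^{-}_{P}(E)$ represent the positions where the data sequence FT-contains P with E deletion errors.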
Suppose a pattern $P = P_1P_2 \ldots P_m$ is given. Let Y denote the first data item $P_1$ and P″ denote the sub-pattern $P_2P_3 \ldots P_m$ of P. Different from $\text{FT-}\mathit{Appear}^{+}_{P}(\delta_I)$, $\text{FT-}\mathit{Appear}^{-}_{P}(\delta_D)$ represents the positions where Y appears and DSeq FT-contains P″ at the following positions with at most $\delta_D$ deletion errors. Therefore, when we find a position j where DSeq FT-contains P″ with 0, 1, 2, …, or $\delta_D$ deletion errors, position (j − 1) is a position where DSeq DFT-contains P under fault tolerance $\delta_D$ if position (j − 1) contains Y. In other words, $\text{FT-}\mathit{Appear}^{-}_{P}(\delta_D)$ can be obtained by performing $\delta_D$ or operations on the ($\delta_D$ + 1) appearing bit sequences $\mathit{Appear}^{-}_{P''}(0)$, $\mathit{Appear}^{-}_{P''}(1)$, …, $\mathit{Appear}^{-}_{P''}(\delta_D - 1)$, and $\mathit{Appear}^{-}_{P''}(\delta_D)$, then performing a left shift operation on the result, and finally performing an and operation with $\mathit{Appear}_{Y}$. Figure 3.4 illustrates the relationship between $\text{FT-}\mathit{Appear}^{-}_{P}(\delta_D)$ and $\mathit{Appear}^{-}_{P''}(E)$ when $\delta_D$ = 2 for the example patterns “A”, “AB”, “ABC”, “ABCD”, and “ABCDE”. Note that if |P| ≤ $\delta_D$ + 1, the rightmost bit is filled with 1 when performing the left shift operation, because that bit is treated as a “don't care” bit in the subsequent and operation. Otherwise, the rightmost bit is filled with 0.
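The composition described above (or, then a one-bit left shift with a fill bit, then and with the appearing bit sequence of Y) can be sketched as below. The sketch assumes that the appearing bit sequences of P″ with 0, 1, ..., $\delta_D$ deletion errors have already been computed by some other means and are passed in as integers; how they are obtained is not covered here. The function name `ft_appear_minus`, its parameters, and the 8-bit toy inputs are purely illustrative and are not taken from the running example.

```python
def ft_appear_minus(appear_y, appear_minus_rest, pattern_len, delta_d, n):
    """Combine precomputed sequences of P'' into the deletion fault-tolerant sequence of P.

    appear_y          -- appearing bit sequence of the first item Y (n-bit integer)
    appear_minus_rest -- sequences of P'' for 0..delta_d deletion errors (assumed given)
    pattern_len       -- |P|
    """
    mask = (1 << n) - 1
    merged = 0
    for bits in appear_minus_rest:
        merged |= bits  # delta_d or operations over delta_d + 1 sequences
    # Fill bit: 1 ("don't care") when |P| <= delta_d + 1, otherwise 0.
    fill = 1 if pattern_len <= delta_d + 1 else 0
    shifted = ((merged << 1) | fill) & mask
    return appear_y & shifted


# Toy 8-bit inputs, chosen only to exercise both fill-bit cases.
appear_y = 0b10010001
rest = [0b01000000, 0b00001000, 0b00000000]
long_case = ft_appear_minus(appear_y, rest, pattern_len=4, delta_d=2, n=8)
short_case = ft_appear_minus(appear_y, rest, pattern_len=3, delta_d=2, n=8)
print(format(long_case, "08b"), format(short_case, "08b"))
```

The two calls differ only in |P|, which switches the fill bit and therefore decides whether an occurrence of Y at the last position can survive the and operation.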