As mentioned before, the pattern occurrence checking in the original ABF algorithm is only performed across two consecutive data blocks. Moreover, only the
Chapter 3 An Efficient Pattern Matching Scheme In LZW Compressed Sequences
first occurrence is reported. To generalize the ABF algorithm to find all pattern occurrences, we need to consider all pattern occurrences across two consecutive data blocks as well as those inside a data block. Our implementation presented in Chapter 3.1 allows detection of all pattern occurrences across two consecutive data blocks. Therefore, the remaining work is to detect all pattern occurrences inside a data block. The generalization is also designed to report the absolute positions of pattern occurrences, which may be desirable in some applications.
To detect all pattern occurrences inside a data block, we add two fields, called pattern inside flag (PIF) and pattern inside pointer (PIP), to every node of the LZW trie T_S. The PIF indicates whether the pattern occurs inside the chunk, and the PIP pointer is used for backtracking to find the positions of all pattern occurrences inside the chunk. For the root node, the PIF is 0 and the PIP pointer points to the node itself, whose node number is also 0. Assume that a new node M is to be added as a child node of node N. The PIP pointer of node M inherits the PIP value of node N if N is not a final node, i.e., a node whose chunk ends with the complete pattern P. To identify final nodes, we let the prefix number of a final node equal m. If node N is a final node, the PIP pointer of node M points to node N. Similarly, the PIF of node M inherits the PIF value of node N unless the PIF of node N is 0 and node M is a final node, in which case we set the PIF of node M to 1. With these additional fields, one can trace back the LZW trie to find all pattern occurrences inside a chunk. The trace-back ends once a node whose PIP pointer points to the root node, i.e., PIP = 0, is reached. Note that although PIF can be replaced by the PIP pointer and the prefix number (PIF = 1 is equivalent to PIP ≠ 0 or prefix number = m), we suggest using PIF to simplify the checking of pattern existence inside a chunk.
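The PIF/PIP update rules and the trace-back can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the names `Node`, `add_child`, `build_chain`, `occurrence_end_depths`, and `naive_prefix_number` are hypothetical, the prefix number is computed by a naive string comparison instead of the bitmap queries, and a single chunk is stored as one root-to-leaf chain purely for illustration.

```python
from dataclasses import dataclass

ROOT = 0  # node number of the root; PIP = 0 ends the trace-back


@dataclass
class Node:
    prefix_number: int  # equals m for a final node (chunk ends with P)
    depth: int
    pif: int = 0
    pip: int = ROOT


def naive_prefix_number(chunk, pattern):
    # Assumption for illustration only: length of the longest prefix of the
    # pattern that is a suffix of the chunk, found by direct comparison.
    for k in range(min(len(chunk), len(pattern)), 0, -1):
        if chunk.endswith(pattern[:k]):
            return k
    return 0


def add_child(trie, parent_id, prefix_number, m):
    # Add node M (the child) below node N (the parent) and fill in PIF and PIP.
    parent = trie[parent_id]
    child_is_final = prefix_number == m
    # PIF is inherited unless the parent's PIF is 0 and the child is final.
    pif = 1 if (parent.pif == 0 and child_is_final) else parent.pif
    # PIP points to the parent if the parent is final, else it is inherited.
    pip = parent_id if parent.prefix_number == m else parent.pip
    trie.append(Node(prefix_number, parent.depth + 1, pif, pip))
    return len(trie) - 1


def build_chain(chunk, pattern):
    # Store every prefix of the chunk as a node on a single path from the root.
    trie = [Node(prefix_number=0, depth=0)]
    node_id = ROOT
    for i in range(1, len(chunk) + 1):
        pn = naive_prefix_number(chunk[:i], pattern)
        node_id = add_child(trie, node_id, pn, len(pattern))
    return trie, node_id


def occurrence_end_depths(trie, node_id, m):
    # Depths at which complete occurrences of P end inside the node's chunk,
    # listed from the last occurrence back to the first.
    ends = []
    node = trie[node_id]
    if node.prefix_number == m:
        ends.append(node.depth)
    n = node.pip
    while n != ROOT:
        ends.append(trie[n].depth)
        n = trie[n].pip
    return ends
```

For P = ababc and the chunk bcababcxababcxx, `build_chain` followed by `occurrence_end_depths` yields the depths 13 and 7, i.e., the ends of the two occurrences inside the chunk.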
Note that, since we allow the prefix number of a node to be equal to m, we need to add to the set of prefix bitmaps an additional prefix bitmap corresponding to prefix number = m. The contents of the bitmap are assigned with the same algorithm described in Chapter 3.1. It is clear that the value of the variable Prefix may equal m too. However, this does not cause any error because the bitmap corresponding to prefix number = m is the same as the bitmap corresponding to prefix number = k, where $p_{m-k+1} \ldots p_m$ is the longest suffix of P which is also a proper prefix of P, i.e., a prefix which is not P itself.
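The value k above is the length of the longest proper prefix of P that is also a suffix of P. As a small illustration, it can be computed with the standard failure function known from Knuth-Morris-Pratt matching; the function name `longest_proper_border` is hypothetical and this sketch is not part of the scheme itself.

```python
def longest_proper_border(pattern):
    # Length k of the longest proper prefix of `pattern` that is also a suffix.
    if not pattern:
        return 0
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back to shorter borders until the next symbol matches.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail[-1]
```

For P = ababc the result is k = 0, whereas a pattern such as abab would give k = 2.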
For convenience, we also allow the suffix number of a node to be equal to m. As a consequence, another bitmap corresponding to suffix number = m is added to the set of suffix bitmaps. Again, the contents of the added suffix bitmap are assigned according to the algorithm described in Chapter 3.1, and the additional suffix bitmap does not cause any error because a1i = 0 for all i, 1 ≤ i ≤ m+1.
To report absolute positions of pattern occurrences, we rely on the depth fields of the nodes in the LZW trie and a global variable COUNT, which stores the number of bytes of the text that have been scanned. Computation of the depth field is simple. The depth of the root node is 0. When node M is added as a child node of node N, the depth of node M equals that of node N plus one. Clearly, with the depth fields, one can compute the position of a node inside a chunk, which, together with the global variable COUNT, can be used to determine the absolute position of any pattern occurrence. The overall generalized algorithm is described below.
A. Pattern Preprocessing
The prefix bitmaps and the suffix bitmaps are computed. Also, the compacted suffix trie ST_P of pattern P and its associated bitmaps are determined.
B. Compressed Text Scanning
When constructing the LZW trie T_S, each node’s node number, label, prefix number, suffix number, internal range, first symbol, depth, PIF, and PIP are computed and stored. The compressed text scanning procedure is described below.
TS l+q then the label is S.Z[l]'s first symbol.)
1.4 If S.Z[l] is an internal node with corresponding string S1
      Set l+q's internal range [i, j] as Q3(S1, α).
1.7 If F = 0 and l+q's prefix number = m, then l+q's PIF ← 1.
    Else, l+q's PIF ← F.
1.8 Set the depth of node l+q as D+1.
1.9 If Px = m, then l+q's PIP ← S.Z[l].
    Else, l+q's PIP ← S.Z[l]’s PIP.
2 Pattern Search
    If Sx ≠ 0
        Check cross-boundary occurrences with the bitwise AND operation for query Q2(Prefix,Sx). Let R = r1 r2 ... rm be the result of the bitwise AND operation.
        for k = 1 to m do
            If rk = 1
                Report the position: COUNT – k + 2
    If F = 1 // Pattern is inside S.Z[l]
        If Px = m
            Report the position: COUNT + D – m + 1
        N ← S.Z[l]’s PIP
        While N ≠ 0
            Report the position: COUNT + Dep(N) – m + 1
            N ← N’s PIP
    Prefix ← Q1(Prefix,I) // Note that the answer of Q1(Prefix,I) is Px, if the result of bitwise AND operation for Q1(Prefix,I) is all-zero
    COUNT ← COUNT + D
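The position arithmetic of the pattern search step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure itself: bitmaps are modelled as strings of '0' and '1' with r1 as the leftmost bit, the prefix and suffix bitmaps are passed in directly instead of being looked up through Q2, and the function names are hypothetical.

```python
def bitwise_and(a, b):
    # Bitwise AND of two equally long bitmaps given as '0'/'1' strings.
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


def cross_boundary_positions(count, prefix_bitmap, suffix_bitmap):
    # For every set bit r_k (k = 1..m, leftmost bit is r_1),
    # report the absolute position COUNT - k + 2.
    r = bitwise_and(prefix_bitmap, suffix_bitmap)
    return [count - k + 2 for k in range(1, len(r) + 1) if r[k - 1] == "1"]


def inside_chunk_positions(count, end_depths, m):
    # For every node depth at which an occurrence ends inside the chunk,
    # report the absolute position COUNT + depth - m + 1.
    return [count + d - m + 1 for d in end_depths]
```

With COUNT = 100, prefix bitmap 01010, and suffix bitmap 00010, `cross_boundary_positions` returns the single position 98. With m = 5 and an occurrence ending at depth 13, `inside_chunk_positions` returns 109.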
The following example illustrates the process to detect all pattern occurrences and report their absolute positions.
Example 3. As in Example 2, let P = ababc. The prefix bitmaps and the suffix bitmaps are shown in Tables 3-5 and 3-6, respectively. Since the suffix trie of pattern P and the bitmaps associated with the explicit nodes are not changed, they are not reproduced here. Assume that part of the compressed text has been processed and the current value of COUNT is 100. Assume further that the last three chunks that have been processed are xxx, xxxx, and xaba, and the current chunk to be processed is Np = bcababcxababcxx.
Table 3-7 shows the contents of the nodes along the path from the root node to node Np on the LZW trie. Note that there are two pattern occurrences inside the current chunk, which can be determined by tracing back the PIP pointers. Table 3-8 shows a brief summary of the results when the last three chunks are processed. The procedure of pattern detection with report of absolute occurrence positions in processing Np is sketched below.
• Reporting absolute positions of cross-boundary pattern occurrences:
  Since Np’s suffix number = 2 ≠ 0
    // Check cross-boundary occurrences with bitmaps:
    Prefix = 3 with corresponding bitmap 01010.
    Np’s suffix number = 2 with corresponding bitmap 00010.
    The result of bitwise AND operation R = 01010 ⊗ 00010 = 00010.
    ⇒ The absolute occurrence position COUNT – 4 + 2 = 98 is reported.
• Reporting absolute positions of inside-chunk pattern occurrences:
  Since Np’s PIF = 1
    Since Np’s PIP = N2 ≠ 0
      The absolute occurrence position COUNT + Dep(N2) – m + 1 = 109 is reported.
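The reported positions can be cross-checked by a brute-force scan of the uncompressed text. The Python sketch below is only a sanity check under stated assumptions: the content of the text before the last three chunks is not given in the example, so it is filled with the symbol x (which cannot create an occurrence of P), and the names `find_all` and `example_text` are hypothetical.

```python
def find_all(text, pattern):
    # 1-based start positions of all (possibly overlapping) occurrences.
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)
        start = text.find(pattern, start + 1)
    return positions


def example_text():
    # The last three processed chunks end at COUNT = 100.
    processed_tail = "xxx" + "xxxx" + "xaba"
    # Assumed filler for the unknown earlier part of the text.
    filler = "x" * (100 - len(processed_tail))
    current_chunk = "bcababcxababcxx"
    return filler + processed_tail + current_chunk
```

Calling `find_all(example_text(), "ababc")` gives the positions 98, 103, and 109: the cross-boundary occurrence at 98 and the two occurrences inside the current chunk, of which 109 is the one reached first through the PIP pointer of Np.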
number Internal range PIF PIP Depth
0 (root) NULL - - - - 0 0 0
Table 3-8. Brief summary of the results when the last three chunks are processed

            The last three chunks that had been processed    Np
S      = …  xxx      xxxx     xaba                           bcababcxababcxx
Depth  =    3        4        4
COUNT  =    92       96       100
Prefix =    0        0        3