Data Structures
Consider a particular goto graph. In our proposed scheme, we classify states according to their number of child states. State P is said to be a branch state, a single-child state, or a leaf state if it has at least two child states, exactly one child state, or no child state, respectively. Moreover, state P is said to be a final state if output(P) ≠ ∅. Note that a leaf state is either a final state or a fork state or both.
As shown in Fig. 5, the data structures for branch, single-child, and leaf states are different. The meanings of the first four bits of the first byte, denoted by b₀b₁b₂b₃, are the same for all data structures. Bit b₀ = 1 iff the state is a final state, and bit b₁ = 1 iff the state is a fork state. Bits b₂b₃ indicate the type of the state and are equal to 00, 01, or 10 if the state is a leaf state, a single-child state, or a branch state, respectively. The remaining four bits of the first byte are unused. The data structure consists of four bytes if b₀ = 1, regardless of the type of the state. In this case, bytes 2, 3, and 4 store the index of matched signatures. In the following, we only describe data structures for non-final states.
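The flag layout of the first byte can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes that b0 is the most significant bit of the byte (the text does not fix the bit order), and the names `Header` and `decode_header` are hypothetical. The type code 11 is taken from Fig. 6.

```python
from typing import NamedTuple

# Type codes carried in bits b2 b3. Code 11 appears only in Fig. 6.
STATE_TYPES = {
    0b00: "leaf",
    0b01: "single-child",
    0b10: "branch",
    0b11: "non-branch, non-leaf explicit",
}


class Header(NamedTuple):
    final: bool  # b0
    fork: bool   # b1
    kind: str    # decoded from b2 b3


def decode_header(first_byte: int) -> Header:
    """Decode the four flag bits of a state's first byte.

    Assumption: b0 is the most significant bit; the low four bits are unused.
    """
    final = bool(first_byte & 0x80)        # b0
    fork = bool(first_byte & 0x40)         # b1
    type_bits = (first_byte >> 4) & 0b11   # b2 b3
    return Header(final, fork, STATE_TYPES[type_bits])
```

Under this assumption, the byte 0x60 decodes to a non-final fork state of type branch.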
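[Figure 5: layout not recoverable from the source. Surviving field labels: Final, Fork, Type (first byte); index of matched signatures: 3 bytes (final state); band values: 3 × (end index − start index + 1) bytes (branch state).]
Figure 5. Data structures for leaf, single-child, and branch states.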
The data structure for a non-final leaf state P consists of eleven bytes. Since state P is not a final state, it must be a fork state. Bytes 2, 3, and 4 store the start state of the goto graph to be traversed by a forked process. Let {min, max} be the mark of the state. Bytes 5 and 6 store the min value, and bytes 7 and 8 store the max value. Bytes 9, 10, and 11 store the failure state f(P). Note that f(P) = 0 means the END state is entered when the failure function is consulted in state P.
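As a rough illustration of the eleven-byte leaf layout, the sketch below unpacks such a record into named fields. The big-endian byte order and the function name `parse_leaf` are assumptions made for this example; the text does not state how multi-byte values are ordered.

```python
def parse_leaf(record: bytes) -> dict:
    """Unpack a non-final leaf state record (eleven bytes).

    Assumption: multi-byte fields are stored big-endian.
    """
    if len(record) != 11:
        raise ValueError("a non-final leaf state occupies exactly 11 bytes")
    return {
        "header": record[0],                               # byte 1: flag bits
        "fork_start": int.from_bytes(record[1:4], "big"),  # bytes 2-4
        "min": int.from_bytes(record[4:6], "big"),         # bytes 5-6
        "max": int.from_bytes(record[6:8], "big"),         # bytes 7-8
        "failure": int.from_bytes(record[8:11], "big"),    # bytes 9-11: f(P)
    }
```

A returned `failure` value of 0 corresponds to the case in which the END state is entered.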
Assume that state P is a single-child state and g(P, σ) = R. We allocate eight or fifteen bytes for state P. The second byte stores the symbol σ. Bytes 3, 4, and 5 store the failure state f(P), and bytes 6, 7, and 8 store state R. The data structure is complete if state P is not a fork state. Otherwise, seven more bytes are needed.
Bytes 9, 10, and 11 store the start state of the goto graph to be traversed by a forked process. Bytes 12 and 13 store the min value and bytes 14 and 15 store the max value of the mark.
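The eight-byte and fifteen-byte variants of the single-child record can be sketched as a small packing routine. This is only an illustration under stated assumptions: b0 is taken to be the most significant bit of the first byte, multi-byte fields are written big-endian, and `pack_single_child` is a hypothetical name.

```python
from typing import Optional, Tuple


def pack_single_child(symbol: int, failure: int, child: int,
                      fork: Optional[Tuple[int, int, int]] = None) -> bytes:
    """Pack a non-final single-child state into 8 or 15 bytes.

    `fork`, if given, is (start state of the forked goto graph, min, max).
    Assumptions: b0 is the MSB of the first byte; fields are big-endian.
    """
    header = 0x10                       # type bits b2 b3 = 01
    if fork is not None:
        header |= 0x40                  # fork bit b1
    out = bytes([header, symbol])       # bytes 1-2
    out += failure.to_bytes(3, "big")   # bytes 3-5: f(P)
    out += child.to_bytes(3, "big")     # bytes 6-8: state R
    if fork is not None:
        start, lo, hi = fork
        out += start.to_bytes(3, "big")  # bytes 9-11
        out += lo.to_bytes(2, "big")     # bytes 12-13: min
        out += hi.to_bytes(2, "big")     # bytes 14-15: max
    return out
```

For example, `pack_single_child(0x61, 0, 7)` yields an eight-byte record, and passing `fork=(3, 1, 4)` extends it to fifteen bytes.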
Finally, assume that state P is a branch state. The data structure adopted is the banded-row format [11]. As an example, consider the sparse vector (0 0 0 5 4 0 0 0 9 0 7 0 0 0 0 0 0 0 0 0). The non-zero values occur between the third and the tenth elements (numbered from 0). Consequently, the vector can be represented as (3 10 5 4 0 0 0 9 0 7), where the first number is the start index and the second number is the end index, followed by eight band values. In our application, a non-zero band value represents the next state number, and the value zero means the failure function is to be consulted. To summarize, the data structure for a non-final branch state P consists of four or eleven bytes followed by the banded-row format. Bytes 2, 3, and 4 store the failure state f(P). If state P is a fork state, then seven more bytes are needed: bytes 5, 6, and 7 store the start state of the goto graph to be traversed by a forked process, bytes 8 and 9 store the min value, and bytes 10 and 11 store the max value of the mark. In the banded-row format, one byte stores the start index and another byte stores the end index. Each band value takes three bytes.
If the input symbol σ falls in the band with a non-zero band value k, then g(P, σ) = k. If the input symbol σ falls outside the band, or falls in the band with a band value of zero, then g(P, σ) = fail.
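The banded-row encoding and the resulting goto lookup can be sketched as follows. The sketch works on plain Python lists rather than packed bytes, and the names `to_banded_row`, `goto`, and `FAIL` are hypothetical; it is meant only to mirror the rules stated above.

```python
FAIL = None  # stands for g(P, σ) = fail


def to_banded_row(row):
    """Encode a sparse row as [start, end, band values...]."""
    nonzero = [i for i, v in enumerate(row) if v != 0]
    if not nonzero:
        raise ValueError("a branch state has at least two non-zero entries")
    start, end = nonzero[0], nonzero[-1]
    return [start, end] + list(row[start:end + 1])


def goto(banded, symbol):
    """Return the next state for `symbol`, or FAIL."""
    start, end = banded[0], banded[1]
    if symbol < start or symbol > end:
        return FAIL                       # outside the band
    value = banded[2 + (symbol - start)]
    return value if value != 0 else FAIL  # zero inside the band also fails


row = [0, 0, 0, 5, 4, 0, 0, 0, 9, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0]
banded = to_banded_row(row)  # [3, 10, 5, 4, 0, 0, 0, 9, 0, 7]
```

With one byte each for the start and end indices and three bytes per band value, the band of this example would occupy 2 + 3 × 8 = 26 bytes.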
The goto graph G₀ is likely to have a large number of states for a large signature set. To make the proposed signature matching system useful, it is therefore necessary to reduce the memory space required by goto graphs. We modify the goto graph G₀ so that its number of states is greatly reduced.
There are many redundancies in the failure function, since many states may fail to the same state (say, the start state of a goto graph). However, the data structures described above store the failure function for every state. State R is said to be a first single-child state if it is a single-child state and its parent state is a branch state.
Moreover, state S is said to be an explicit state if it is the start state, a branch state, a first single-child state, a switching state, a fork state, or a final state. We modify the goto graph G₀ so that it is represented by explicit states only.
Assume that state P is a single-child state and is represented by string S₁. State R is said to be a descendent state of state P if it is represented by S₁S₂, where S₂ is a non-empty string. Furthermore, state R is said to be a descendent explicit state of state P if R is an explicit state and a descendent state of state P. State R is said to be the nearest descendent explicit state (NDES) of state P if state R is a descendent explicit state of state P and there is no other descendent explicit state of state P which is represented by string S₁W₁, where string W₁ is a proper prefix of string S₂. The data structure for the single-child state P includes P.pattern, P.distance, and f(P), where P.pattern and P.distance store, respectively, the identification of the pattern Pₗ and |S₁S₂|.
Only the goto graph G₀ is modified; the original data structures are still needed. The modification makes no difference for branch states and leaf states (or final states). We therefore add an additional data structure, shown in Fig. 6, for the first single-child states of G₀. Bytes 2, 3, and 4 store the failure state f(P). Bytes 5, 6, and 7 store the next explicit state to be entered according to the goto function. If the state is not a fork state, bytes 8 and 9 store P.distance, and bytes 10 onward (if needed) store P.pattern. If it is a fork state, then seven more bytes are needed: bytes 8, 9, and 10 store the start state of the goto graph to be traversed by a forked process, bytes 11 and 12 store the min value, and bytes 13 and 14 store the max value of the mark. In that case, bytes 15 and 16 store P.distance, and bytes 17 onward (if needed) store P.pattern.
[Figure 6: layout of a non-branch, non-leaf explicit state. First byte: Final, Fork, Type = 11. f(P): 3 bytes. R: 3 bytes. fork(P): 3 bytes or empty. min: 2 bytes or empty. max: 2 bytes or empty. distance: 2 bytes. σ: 1 × distance bytes.]
Figure 6. Data structures for non-branch, non-leaf explicit states.
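The variable-length record of Fig. 6 can be read back with a short parser. This is a sketch under assumptions: b0 is taken as the most significant bit of the first byte, multi-byte fields are big-endian, the pattern is taken to occupy exactly `distance` bytes as the figure indicates, and `parse_explicit` is a hypothetical name.

```python
def parse_explicit(record: bytes) -> dict:
    """Parse a non-branch, non-leaf explicit state record (Fig. 6).

    Assumptions: b0 is the MSB of byte 1, fields are big-endian, and the
    pattern field is `distance` bytes long.
    """
    pos = 1  # skip the flag byte

    def take(n: int) -> bytes:
        nonlocal pos
        chunk = record[pos:pos + n]
        if len(chunk) != n:
            raise ValueError("record truncated")
        pos += n
        return chunk

    is_fork = bool(record[0] & 0x40)                # fork bit b1
    state = {
        "failure": int.from_bytes(take(3), "big"),        # f(P)
        "next_explicit": int.from_bytes(take(3), "big"),  # R
    }
    if is_fork:
        state["fork_start"] = int.from_bytes(take(3), "big")
        state["min"] = int.from_bytes(take(2), "big")
        state["max"] = int.from_bytes(take(2), "big")
    state["distance"] = int.from_bytes(take(2), "big")
    state["pattern"] = take(state["distance"])
    return state
```

With this layout the pattern begins at byte 10 of a non-fork record and at byte 17 of a fork record, matching the byte positions given in the text.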
Chapter 6.
Programming Schedule
This section describes the programming schedule in detail. The program consists of six processes, each with its own inputs and outputs. It is organized into three parts: matching machine construction, data compression, and the scanning engine. Processes 1 to 4 form the construction part, which covers the pre-filter, the goto function, the failure function, and the output function. Process 5 handles data compression: it combines the goto, failure, and output functions into the data structures described in Section 5. Process 6 is the scanning part, which scans a file and reports whether any pattern is matched. Each process is described individually below:
Process 1: Signature analysis
Inputs: Signature file
Outputs: NumSignature, eacwp.pattern[NumSignature]
Description:
Since signatures may contain regular-expression operators, each signature is fragmented into segments according to its operators, and we need to know how many segments each signature has. A plain string does not need to be fragmented, so its segment number is one. For each segment, we store not only the actual string but also other information, e.g., its length and the type of the operator following the segment. All of this information is stored under eacwp.pattern.