The algorithm proposed by Aho and Corasick (AC-algorithm) is an exact string-matching algorithm that locates the occurrences of multiple patterns in a text with linear time complexity. The AC-algorithm is therefore used in many applications to filter out specific data quickly. This chapter first briefly explains the AC-algorithm and then presents various implementations of it.
Figure 3.1: AC-trie (legend: goto function, failure function; output labels: (enhappy happy), (happy), (happygo), (happen))
3.1 AC-trie
A prefix tree built according to the AC-algorithm is known as an AC-trie, which consists of goto and failure functions. Fig. 3.1 illustrates an AC-trie built on the keyword set {enhappy, happy, happen, happygo}. In this figure, the circled numbers denote
Chapter 3. AC-Algorithm and Implementations
states, which are alternatively called nodes. The double-circled numbers denote output states, or output nodes, which have non-empty matching outputs. The solid and dashed lines denote goto and failure functions, respectively. State 0 is also known as the initial state or the root node. Every non-initial state has a failure function; the failure functions linked to the initial state are not shown for clarity. In an AC-trie, the depth of a node is also called its level. A property of the failure function is that a state links through it only to a state at a smaller level. For example, the failure function of state 7 at level 7 links to state 12 at level 5.
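The structure described above can be sketched in a few lines of Python. This is only an illustrative construction, not the implementation used in this thesis: the function name `build_ac_trie` and the list-based tables are assumptions, and states are numbered in creation order, which reproduces the numbering of Fig. 3.1 only when the keywords are inserted in the order enhappy, happy, happen, happygo.

```python
from collections import deque

def build_ac_trie(keywords):
    """Return (goto, fail, output, level); states are numbered in creation order."""
    goto = [{}]      # goto[s][ch] -> child state reached from s on character ch
    output = [[]]    # output[s]   -> keywords reported when s becomes active
    level = [0]      # level[s]    -> depth of s in the trie
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                output.append([])
                level.append(level[s] + 1)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].append(word)

    # Breadth-first pass: the failure link of a state points to the state of
    # the longest proper suffix of its string that is also a keyword prefix.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())      # level-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] = output[t] + output[fail[t]]   # inherit suffix matches
            queue.append(t)
    return goto, fail, output, level

goto, fail, output, level = build_ac_trie(["enhappy", "happy", "happen", "happygo"])
print(fail[7], level[7], level[12])   # 12 7 5
print(output[7])                      # ['enhappy', 'happy']
```

Under these assumptions, the sketch reproduces the example in the text (state 7 at level 7 fails to state 12 at level 5), and every failure link leads to a state at a smaller level.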
A matching cycle in the AC-algorithm is defined as the period that begins with inputting a character and ends with producing a matching output. Only one state in an AC-trie is active at any time. In a matching cycle, the goto functions of the active state are checked first. If none of them matches, the active state moves to a new state through the failure function, and the goto functions of the newly activated state are checked in turn. Since every non-initial state is eventually linked to the initial state through the failure functions, and the initial state has goto functions for all characters, a matching goto function is guaranteed to be found in every matching cycle. In the matching process, an input string is processed character by character, and the character being processed is called the inspecting character. A matching cycle leads to a matching output, which is represented by a state number. If a keyword is matched, the matching output is a non-zero state number; otherwise, the matching output is state 0.
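A matching cycle as defined above might be sketched as follows. The helper names `build_ac_trie` and `ac_match` are hypothetical. The goto functions of the initial state for all characters are modeled here by falling back to state 0, which is an implementation choice of this sketch rather than something prescribed by the text.

```python
from collections import deque

def build_ac_trie(keywords):
    """Return (goto, fail, output) with states numbered in creation order."""
    goto, output = [{}], [[]]
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                output.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].append(word)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] = output[t] + output[fail[t]]
            queue.append(t)
    return goto, fail, output

def ac_match(text, goto, fail, output):
    """Run one matching cycle per character; return (states, matching_outputs)."""
    state = 0
    states, matching_outputs = [], []
    for ch in text:                       # ch is the inspecting character
        # Follow failure functions until some goto function matches.
        while state and ch not in goto[state]:
            state = fail[state]
        # The initial state accepts every character (self-loop by default).
        state = goto[state].get(ch, 0)
        states.append(state)
        # Matching output: the state number if a keyword ends here, else 0.
        matching_outputs.append(state if output[state] else 0)
    return states, matching_outputs

goto, fail, output = build_ac_trie(["enhappy", "happy", "happen", "happygo"])
states, outs = ac_match("enhappenhappygo", goto, fail, output)
print([s for s in outs if s])   # [14, 7, 16]
```

With the keyword set of Fig. 3.1, the non-zero matching outputs for the text ‘enhappenhappygo’ are states 14, 7, and 16. The inner `while` loop is the part that may take more than one state transition per character.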
The AC-algorithm can be implemented in various ways, which are described later; the matching operations of these approaches are illustrated together in Fig. 3.2 for comparison. The inspected text is the same in all of these examples, namely ‘enhappenhappygo’. The matching operation of the AC-trie is illustrated in Fig. 3.2(a). In response to the characters ‘enhapp’, the transition traverses states 0, 1, 2, 3, 4, 5, and 6 sequentially. In response to the next character ‘e’, none of the goto functions of state 6 matches, so the state moves to 11 following the failure function of state 6, and the operation continues by matching the goto functions of state 11, which leads to state 13. In response to the following character ‘n’, the state moves to 14, which is an output state, and we have the matching output ‘happen’. State 14 is a terminal state; therefore, when the next character is input, the state moves to 2 according to its failure function, and the matching process continues from state 2. Similarly, in response to the remaining characters, the transition traverses states 3, 4, 5, 6, 7, 12, 15, and 16 sequentially, where the transition from 7 to 12 follows the failure function of state 7. In the remaining matching operation, we have matching outputs at states 7 and 16, which are ‘enhappy happy’ and ‘happygo’, respectively.

Figure 3.2: Examples of matching operations. (a) Matching operation of AC-trie; (b) Matching operation of AC-DFA; (c) Matching operation of AC-NFA; (d) Matching operation of hybrid AC-FA.
With the failure functions, all matched keywords can be found in a single pass over the text by using an AC-trie. Nevertheless, when an AC-trie is implemented in hardware, the complexity increases because more than one state transition through the failure functions is often needed to find a matching goto function. An AC-trie can be implemented as a DFA (AC-DFA) or an NFA (AC-NFA). The DFA and NFA approaches have significantly different features. This thesis proposes a hybrid finite automaton approach that combines the DFA and NFA approaches to implement the AC-algorithm (hybrid AC-FA). The proposed hybrid AC-FA approach has the advantages of both the DFA and NFA approaches. The AC-DFA, AC-NFA, and hybrid AC-FA approaches are described in the following sections.
3.2 AC-DFA
Fig. 3.3 illustrates the DFA converted from the original AC-trie. For clarity, the transitions into states 1 and 8 that are derived from the failure functions are each drawn as a single grouped transition in this figure. Fig. 3.2(b) illustrates the matching operation of the AC-DFA, which is straightforward, so its explanation is omitted. Fig. 3.4 illustrates the block diagram for implementing the AC-DFA. The transition table is generally implemented as a lookup table. The next state NX is determined by the input character IN CHAR, the current state CUR ST, and the transition table. The next state NX is saved in a register and output as the matching result OP. It is also looped back as the current state CUR ST for use in the next matching cycle. However, the transition table of an AC-DFA is typically sparse, and thus it is inefficient in space utilization.
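As a software analogue of the lookup-table scheme described above, the following sketch folds the failure functions into a full transition table, so that each matching cycle is a single table lookup. The names `build_ac_dfa`, `dfa_match`, and `delta` are assumptions made for illustration, and the alphabet is assumed to contain every character that occurs in the keywords. It models only the table and the state register, not the hardware block diagram.

```python
from collections import deque

def build_ac_dfa(keywords, alphabet):
    """Return (delta, output); delta[s][c] is the next state for every c in alphabet."""
    goto, output = [{}], [[]]
    for word in keywords:
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                output.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        output[s].append(word)

    fail = [0] * len(goto)
    delta = [None] * len(goto)
    delta[0] = {c: goto[0].get(c, 0) for c in alphabet}
    queue = deque(goto[0].values())
    while queue:                      # level order: delta[fail[s]] is already built
        s = queue.popleft()
        f = fail[s]
        output[s] = output[s] + output[f]
        # Use the goto function if it exists, else copy the failure state's row.
        delta[s] = {c: goto[s].get(c, delta[f][c]) for c in alphabet}
        for ch, t in goto[s].items():
            fail[t] = delta[f][ch]
            queue.append(t)
    return delta, output

def dfa_match(text, delta):
    """Return the next state NX produced in each matching cycle."""
    cur_st, results = 0, []
    for in_char in text:
        nx = delta[cur_st].get(in_char, 0)   # characters outside the alphabet reset to 0
        results.append(nx)
        cur_st = nx                          # NX is looped back as the current state
    return results

keywords = ["enhappy", "happy", "happen", "happygo"]
alphabet = sorted(set("".join(keywords)))
delta, output = build_ac_dfa(keywords, alphabet)
print(dfa_match("enhappenhappygo", delta))
# [1, 2, 3, 4, 5, 6, 13, 14, 3, 4, 5, 6, 7, 15, 16]
```

The table holds one entry per state and character, whether or not that character appears on any goto function of the state, which is where the space cost noted above comes from.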
Figure 3.3: AC-DFA (grouped transitions labeled "States 1-5, 7-10, 12-16" and "States 1, 3-13, 15-16")