PROPOSED MULTI-TCAM SOLUTIONS - Fast and Scalable Multi-TCAM Packet Classifier for Large Rule Table Lookup

To cope with various QoS requirements and achieve wire-speed switching, we propose a classification co-processor that performs constant-time classification for network processor-based platforms. Most control units break incoming packets into fixed-size segments; hence two classification scenarios exist on such platforms. In the first scenario, the NPU starts classifying a packet as soon as it receives the start-of-packet (SOP) signal; in the second, the NPU does not classify the packet until it has received the entire packet. The lookup operations in these two scenarios differ significantly. In the first scenario the header information may span several segments, so the NPU performs a lookup each time it obtains part of the header. In the second scenario the NPU acquires the whole search key and performs the lookup procedure when it receives the end-of-packet (EOP) signal. We therefore propose two classification co-processor architectures that utilize multiple TCAMs. In the rest of this section, we first present the two multi-TCAM architectures for the different classification scenarios. Then we describe the sorting scheme in multi-TCAM systems. Finally, we show the ambiguous cases that may cause the trace-back and mis-match problems.

3.1. Pipeline and Parallel Multi-TCAM Architectures

Since the whole classification procedure includes input, search, and output steps, the lookup latency is the sum of the delays of all steps. In other words, searching a wider policy table (one with longer search keys) takes more time because of the limited width of the hardware bus. Take five-tuple IPv6 packet classification as an example: a 304-bit search key must be input in five transfers over a 72-bit data bus. To improve the throughput of packet classification, we propose a pipeline architecture that utilizes multiple TCAMs for the first scenario.
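The transfer count quoted above follows from a ceiling division of the key width by the bus width. The short sketch below only illustrates that arithmetic; the function name `bus_transfers` is hypothetical, and the 128-bit case assumes that one IPv6 address is sent alone over a 72-bit bus.

```python
import math

def bus_transfers(key_bits: int, bus_bits: int = 72) -> int:
    """Number of bus transfers needed to load a key of key_bits bits."""
    return math.ceil(key_bits / bus_bits)

# Whole five-tuple IPv6 key over a single 72-bit bus.
print(bus_transfers(304))  # 5

# Assumption: one 128-bit IPv6 address sent alone over a 72-bit bus.
print(bus_transfers(128))  # 2
```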

This architecture, shown in Figure 6, is able to output one result every four clock cycles through pipelined processing.

For the first scenario, the FPGA controller is responsible for inputting search keys, handling the lookup procedure, and finally outputting the lookup results. The TCAMs store the policy table: the three TCAMs store IPv6 source addresses, IPv6 destination addresses, and protocol and port information, respectively. In our design, TCAM-3 also stores the action table. In the first search step, TCAM-1 outputs the associated tag, and the FPGA controller combines this tag with the second search key (the IPv6 destination address) to perform the second search. TCAM-2 then outputs the second tag, and the FPGA controller combines it with the third key (ports and protocol) to perform the third search. Finally, TCAM-3 outputs the lookup result. This lookup procedure can be pipelined, so the proposed classifier can output one result every four clock cycles at full rate.
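The three-step tag chaining can be modelled in software as follows. This is only a minimal sketch under stated assumptions: the table contents, tag values, and function names are hypothetical, each TCAM is modelled as an ordered list in which the first matching entry wins, and the hardware timing is not modelled.

```python
from ipaddress import ip_address, ip_network

# Hypothetical tables. Entries are stored longest prefix first.
TCAM1 = [(ip_network("3ffe:3600:b::/48"), 1),       # (source prefix, tag out)
         (ip_network("3ffe:3600::/32"), 2)]
TCAM2 = [(1, ip_network("3ffe:3600::/32"), 10),     # (tag in, destination prefix, tag out)
         (2, ip_network("3ffe:3600:1::/48"), 20)]
TCAM3 = [(10, 6, 80, "permit"),                     # (tag in, protocol, dst port, action)
         (20, None, None, "deny")]                  # None stands for "don't care"

def stage1(src):
    for prefix, tag in TCAM1:
        if src in prefix:
            return tag
    return None

def stage2(tag1, dst):
    for tag_in, prefix, tag_out in TCAM2:
        if tag_in == tag1 and dst in prefix:
            return tag_out
    return None

def stage3(tag2, proto, dport):
    for tag_in, p, port, action in TCAM3:
        if tag_in == tag2 and p in (None, proto) and port in (None, dport):
            return action
    return None

def classify(src, dst, proto, dport):
    """Chain the three searches: each tag is combined with the next key."""
    tag1 = stage1(ip_address(src))
    tag2 = stage2(tag1, ip_address(dst))
    return stage3(tag2, proto, dport)

print(classify("3ffe:3600:b::1", "3ffe:3600:2::1", 6, 80))   # permit
print(classify("3ffe:3600:c::1", "3ffe:3600:1::5", 17, 53))  # deny
```

In hardware the three stages work on different packets at the same time; the sequential calls here show only the data dependency between stages.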

In the second scenario, the classifier can obtain the whole packet header. Thus, we propose a parallel architecture that employs multiple TCAMs to increase lookup speed. The architecture of the proposed parallel multi-TCAM classification engine is shown in Figure 7. After the NPU acquires all search keys, it inputs them through the FPGA controller. All TCAMs then execute the lookup procedure simultaneously and produce their associated indexes. Afterwards, the classifier combines all associated indexes and sends them to the Binary CAM, which performs a hash operation. Finally, the FPGA controller returns the result to the NPU. The proposed architecture can search all fields simultaneously and obtain a result after four clock cycles.
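The parallel flow can be sketched in the same style. The code below is an illustrative model, not the hardware design: the table contents and names are hypothetical, each TCAM reports the position of its first matching entry, and the Binary CAM is modelled as an exact-match dictionary keyed by the combined indexes.

```python
from ipaddress import ip_address, ip_network

# Hypothetical per-field tables; the list position is the index a TCAM reports.
SRC_TCAM = [ip_network("3ffe:3600:b::/48"), ip_network("3ffe:3600::/32")]
DST_TCAM = [ip_network("3ffe:3600:1::/48"), ip_network("3ffe:3600::/32")]
L4_TCAM = [(6, 80), (None, None)]   # (protocol, dst port); None = don't care

# Binary CAM modelled as an exact-match table over the index triple.
BCAM = {(0, 1, 0): "permit", (1, 0, 1): "deny"}

def first_match(entries, matches):
    """Return the index of the first matching entry, or None if nothing matches."""
    for index, entry in enumerate(entries):
        if matches(entry):
            return index
    return None

def classify(src, dst, proto, dport):
    src, dst = ip_address(src), ip_address(dst)
    # The three searches are independent, so hardware can run them concurrently.
    i = first_match(SRC_TCAM, lambda p: src in p)
    j = first_match(DST_TCAM, lambda p: dst in p)
    k = first_match(L4_TCAM, lambda e: e[0] in (None, proto) and e[1] in (None, dport))
    return BCAM.get((i, j, k))

print(classify("3ffe:3600:b::1", "3ffe:3600:2::1", 6, 80))   # permit
print(classify("3ffe:3600:c::1", "3ffe:3600:1::5", 17, 53))  # deny
```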

[Figure 6 diagram: the Classification Engine (FPGA controller), with input and output data buses, connects over 72-bit buses to TCAM-1 (144-bit, source address), TCAM-2 (144-bit, destination address), and TCAM-3 (144-bit, protocol and ports).]

Figure 6. The pipeline architecture of proposed Classification Engine.

Since a TCAM performs ternary (0, 1, and don't care) searching, multiple entries may match in a TCAM. Although the TCAM always selects the index with the longest prefix length, the highest-priority policy rule does not necessarily have the longest prefix in every field. Therefore, selecting the correct index is critical.

3.2. Sorting in Multi-TCAM Systems

Ternary CAM is adept at performing longest prefix match (LPM) to find the best route for Layer-3 IP lookup. In other words, the data should be sorted before being stored in the TCAM, and the TCAM selects the best-matched rule according to the priority. For Layer-3 lookup, routes are sorted by prefix length, whereas for Layer-4 classification rules are sorted by priority. We employ multiple TCAMs in pipeline and parallel architectures to increase the throughput of packet classification. Since each TCAM stores just one field, such as the source or destination IPv6 address, the entries in each TCAM are sorted by prefix length. For example, consider two rules R₁ and R₂, where R₁ is (3ffe:3600:B::/48, 3ffe:3600::/32), R₂ is (3ffe:3600::/32, 3ffe:3600:1::/48), and R₁ has higher priority than R₂.

If R₁ is placed in front of R₂ according to priority, then packets in the range (3ffe:3600:B::/48, 3ffe:3600:1::/48) always match R₁. Nevertheless, packets within this range should match R₂. A key that matches 3ffe:3600:B::/48 also matches 3ffe:3600::/32. If 3ffe:3600::/32 is placed in front of 3ffe:3600:B::/48, then 3ffe:3600:B::/48 is never selected.

Accordingly, in a multi-TCAM system, the entries in each TCAM should be sorted by prefix length instead of priority.
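The shadowing effect described above can be reproduced with a first-match model of a single TCAM. This is a minimal sketch: `tcam_lookup` is a hypothetical helper, and the two prefixes are the source fields of R₂ and R₁ from the example.

```python
from ipaddress import ip_address, ip_network

def tcam_lookup(entries, addr):
    """First-match lookup: the entry stored earliest among all matches is reported."""
    addr = ip_address(addr)
    for prefix in entries:
        if addr in ip_network(prefix):
            return prefix
    return None

prefixes = ["3ffe:3600::/32", "3ffe:3600:b::/48"]   # source fields of R2 and R1

unsorted_order = prefixes                            # /32 stored ahead of /48
by_length = sorted(prefixes, key=lambda p: ip_network(p).prefixlen, reverse=True)

print(tcam_lookup(unsorted_order, "3ffe:3600:b::1"))  # 3ffe:3600::/32 (the /48 entry is shadowed)
print(tcam_lookup(by_length, "3ffe:3600:b::1"))       # 3ffe:3600:b::/48
print(tcam_lookup(by_length, "3ffe:3600:c::1"))       # 3ffe:3600::/32
```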

[Figure 7 diagram: the Classification Engine (FPGA controller), with input and output data buses, connects over 72-bit buses to TCAM-1 (144-bit, source address), TCAM-2 (144-bit, destination address), and TCAM-3 (144-bit, protocol and ports), and additionally to a Binary CAM.]

Figure 7. The parallel architecture of proposed Classification Engine.

3.3. Trace-back and Mis-match Problems

To provide fast and scalable classification engines, the proposed scheme has to eliminate the factors that prevent pipeline and parallel classification from executing smoothly.

Take the pipeline architecture as an example: if any TCAM selects an incorrect index, the classification procedure fails, and the procedure must go back to the previous TCAM and search again. In this case, the classification procedure may spend extra time on lookups and fail to obtain the index in constant time. We call the problem that causes these extra actions the trace-back problem. On the other hand, in parallel classification all TCAMs must simultaneously select an associated index for the Binary CAM (BCAM) to execute the hash operation. If any TCAM selects an incorrect index, the BCAM cannot output the correct result. We define this situation as the mis-match problem. Consequently, the causes of the trace-back and mis-match problems in multi-TCAM systems should be eliminated to ensure fast packet classification.
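Both problems can be reproduced with a small two-field model. The sketch below is illustrative only: the rules `Ra` and `Rb` are hypothetical (nested source fields, disjoint destination fields), each rule owns one entry per TCAM, and the BCAM is assumed to hold only index pairs that belong to a single rule.

```python
from ipaddress import ip_address, ip_network

# Hypothetical rules as (source prefix, destination prefix).
RULES = {
    "Ra": (ip_network("3ffe:3600:b::/48"), ip_network("3ffe:3600:2::/48")),
    "Rb": (ip_network("3ffe:3600::/32"), ip_network("3ffe:3600:1::/48")),
}

def lpm(field, addr):
    """Per-field longest prefix match; returns the rule that owns the winning entry."""
    hits = [(f[field].prefixlen, name) for name, f in RULES.items() if addr in f[field]]
    return max(hits)[1] if hits else None

def pipeline(src, dst):
    """Return (rule, TCAM-1 searches); more than one search means a trace-back."""
    src, dst = ip_address(src), ip_address(dst)
    order = sorted(RULES, key=lambda n: -RULES[n][0].prefixlen)  # TCAM-1 order
    searches = 0
    for name in order:
        if src in RULES[name][0]:
            searches += 1
            if dst in RULES[name][1]:    # TCAM-2 search keyed by this rule's tag
                return name, searches
    return None, searches

def parallel(src, dst):
    """Combine the per-field winners; None models a BCAM miss (mis-match)."""
    src, dst = ip_address(src), ip_address(dst)
    s, d = lpm(0, src), lpm(1, dst)
    return s if s == d else None

# This packet matches only Rb, yet LPM on the source field picks Ra's entry.
print(pipeline("3ffe:3600:b::1", "3ffe:3600:1::1"))  # ('Rb', 2): one trace-back
print(parallel("3ffe:3600:b::1", "3ffe:3600:1::1"))  # None: mis-match
print(parallel("3ffe:3600:c::1", "3ffe:3600:1::1"))  # Rb
```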

In this work, we define the ambiguous cases between two rules that cause the trace-back and mis-match problems. Ambiguous cases between two rules have been presented in previous articles [19-20]. These articles state that two rules are ambiguous if they overlap, but the relations among policy rules in multi-TCAM systems are not the same. For simplicity, let us use two-field rules (source and destination address fields) as an example. Consider the relations among the six rules shown in Figure 8. Rules R₁ and R₂ are completely disjoint, rules R₃ and R₄ partially overlap, and R₆ is a subset of R₅. Since both the source and destination fields of R₆ have longer prefixes than those of R₅, we say that R₆ is a subset of R₅. In this case, it is clear that R₆ should have a higher priority than R₅; otherwise, R₆ would never have the chance to be matched. We can simply place R₆ ahead of R₅ in all TCAMs. On the other hand, the source field of R₂ is a subset of that of R₁. If the source field is looked up first, R₂ is always selected. If the key actually matches R₁, this case causes the trace-back problem. Since Layer-4 packet classification looks up multiple fields, the priority between two rules in the classification table should be arranged more carefully.

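[Figure 8 diagram: rules drawn as rectangles in a plane whose SRC and DST axes each range up to 2¹²⁸, with rule pairs labeled Fully Overlap, Partial Overlap, and Disjoin.]

Figure 8. Relationships between six two-field rules.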

To sum up, we define two ambiguous cases that make two rules R_i and R_j ambiguous in multi-TCAM environments. The first case arises when one field of R_i is a subset of the corresponding field of R_j and another field of R_j is a subset of the corresponding field of R_i. Such rules are also called conflict rules in the literature [19]. The second case arises when R_i and R_j have one overlapping field and one disjoint field. If the TCAM contains conflict rules, it is difficult for the TCAM to select the best-matched rule according to prefix length alone.
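The two cases can be checked pairwise once every field is expressed as a prefix. The sketch below is one possible reading of the definitions, not the detection algorithm of this work: the helper names are hypothetical, and a field is treated as "overlapping" only when one prefix is strictly nested in the other (equal prefixes are not counted), which is an assumption.

```python
from ipaddress import ip_network

def relation(a, b):
    """Relation of prefix a to prefix b: 'equal', 'subset', 'superset', or 'disjoint'."""
    if a == b:
        return "equal"
    if a.subnet_of(b):
        return "subset"
    if b.subnet_of(a):
        return "superset"
    return "disjoint"   # two prefixes are either nested or disjoint

def ambiguous_case(ri, rj):
    """Return 1 or 2 for the two ambiguous cases, or 0 if the pair is unambiguous."""
    rels = [relation(a, b) for a, b in zip(ri, rj)]
    if "subset" in rels and "superset" in rels:
        return 1    # conflict: each rule is more specific in some field
    if "disjoint" in rels and ("subset" in rels or "superset" in rels):
        return 2    # one nested field and one disjoint field
    return 0

def rule(src, dst):
    return (ip_network(src), ip_network(dst))

r1 = rule("3ffe:3600:b::/48", "3ffe:3600::/32")
r2 = rule("3ffe:3600::/32", "3ffe:3600:1::/48")
r3 = rule("3ffe:3600:b::/48", "3ffe:3600:2::/48")

print(ambiguous_case(r1, r2))  # 1
print(ambiguous_case(r3, r2))  # 2
```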

Searching each field sequentially may run into the second ambiguous case. Since all fields must be matched at the same time, deciding the best selection by LPM within a single TCAM/field is inappropriate. In other words, if the TCAM selects a wrong entry for the first field and then looks up the second field, this causes a trace-back [21] search.

Obviously, TCAM-based classification might not work well if ambiguous rules exist. The ambiguous cases in multi-TCAM environments are defined in the following section, where the detection and resolution algorithms for this problem are also introduced.

3.4. Comparisons of Proposed Multi-TCAM Solutions

To conclude this section, we compare the proposed pipeline and parallel multi-TCAM solutions and summarize their characteristics in Table 1. In the pipeline solution, the NPU performs classification upon receiving the start-of-packet (SOP) signal. In the parallel solution, the NPU does not classify the packet until it receives the end-of-packet (EOP) signal. Thus, the pipeline solution suits classification algorithms that look up one field at each step, whereas the parallel solution suits algorithms that use the whole header information for a lookup. In terms of hardware, both solutions consist of an FPGA controller and three 144-bit-wide TCAMs. In addition, the parallel solution has an extra Binary CAM that performs the hashing function and selects the result. Because of this extra BCAM, the pipeline solution is cheaper than the parallel solution. Regarding classification latency, the pipeline solution performs a three-step lookup and each step takes 4 clock cycles, so the whole lookup procedure takes 12 clock cycles in total. By contrast, the parallel solution takes 4 clock cycles to perform one lookup procedure. Nevertheless, the throughputs of the two solutions are the same.
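The latency and throughput figures can be checked with a small timing model. This sketch assumes that packets arrive back to back, that the pipeline accepts a new packet every four cycles, and that the parallel engine handles one packet at a time; the function names are hypothetical.

```python
STAGE_CYCLES = 4      # clock cycles per TCAM search, as stated in the text
PIPELINE_STAGES = 3   # TCAM-1, TCAM-2, TCAM-3

def pipeline_finish(n):
    """Cycle at which the n-th packet (1-based) leaves the pipeline at full rate."""
    return PIPELINE_STAGES * STAGE_CYCLES + (n - 1) * STAGE_CYCLES

def parallel_finish(n):
    """Cycle at which the n-th packet (1-based) leaves the parallel engine."""
    return n * STAGE_CYCLES

print(pipeline_finish(1), parallel_finish(1))   # 12 4: latency of a single lookup
# Consecutive results are four cycles apart in both designs.
print(pipeline_finish(2) - pipeline_finish(1))  # 4
print(parallel_finish(2) - parallel_finish(1))  # 4
```

The constant 8-cycle offset between the two completion times does not grow with the number of packets, which is why the throughputs coincide.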

Table 1. Comparison of characteristics with pipeline and parallel Multi-TCAMs.

| Items | Pipeline Architecture | Parallel Architecture |
|---|---|---|
| Classification | When receiving SOP | When receiving EOP |
| # of FPGA | 1 | 1 |
| # of TCAM | 3 | 3 |
| # of Binary CAM | 0 | 1 |
| Classification Latency | 12 clock cycles | 4 clock cycles |
| Throughput | 1 per 4 clock cycles | 1 per 4 clock cycles |
| Problem | Trace-back | Mis-match |

Both the pipeline and parallel solutions can improve the throughput and scalability of wide policy table lookup. However, the trace-back and mis-match problems may stall the pipeline and parallel processes. Since the ambiguous cases in multi-TCAM systems cause these two problems, the proposed classification engine can achieve fast and scalable lookup as long as the ambiguous cases are removed. Therefore, we introduce the ambiguous cases and present the solutions in the next section.
