Chapter 4 Hardware Architectures for Test Stimulus Decompressors and Test

4.2 Review of LZSS Encoder Design

4.2.2 Systolic-array-based LZSS Encoder

The systolic array performs the comparison on a cycle-by-cycle basis, so the comparison of one string is distributed over multiple cycles, as shown in Figure 4.7 [18]. The systolic array design starts from a pre-map dependence graph (DG). By choosing a suitable scheduling vector and projection vector, we can trade area overhead against latency. Although this architecture has a lower hardware cost, it suffers from lower throughput.

4.2.3 Comparison

The CAM-based design has the merits of high throughput and low latency. However, it suffers from a large area overhead due to the highly parallel comparison hardware. The systolic-array-based design has a lower area overhead than the CAM-based design, but its throughput is lower and its latency is higher. Because we do not want the test compression circuit to degrade the speed of the circuit, the compactor must not become the system bottleneck. Therefore, we choose the CAM-based hardware architecture, which has higher throughput and lower latency.

4.3 Encoder Hardware Architecture

4.3.1 Proposed Three-stage CAM-based LZSS Encoder

The proposed CAM-based LZSS consists of three main parts, as shown in Figure 4.8. The first part is the sliding dictionary as well as the parallel generation of the match bits. The second part is the parallel string length counter and the comparator tree. The third part is the generation of the codeword.

Figure 4.8 Block diagram of LZSS encoder architecture.

In the first part, the sliding dictionary uses the CAM-based design described in the previous section. It compares each set of the sliding dictionary in parallel. Namely, we compare the string $\{SW_j, SW_{j+1}, \ldots, SW_{j+LA\_BF-1}\}$ with $\{LA_0, LA_1, \ldots, LA_{LA\_BF-1}\}$ for each $j$ from 0 to $SW-1$. Next, we generate the match bits for each comparison set $j$, namely

$$match\_bit_{j,i} = (SW_{j+i} == LA_i) \qquad (4.1)$$

for $j$ from 0 to $SW-1$ and for $i$ from 0 to $LA\_BF-1$.
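The match-bit generation of (4.1) can be sketched in software as follows. This is only an illustrative model, not the hardware itself: the function name `match_bits` is hypothetical, and treating positions beyond the end of the window as mismatches is an assumption, since the text does not say how that boundary is handled.

```python
def match_bits(window, lookahead):
    """Return match[j][i] = 1 if window[j + i] == lookahead[i], else 0.

    Assumption: an index j + i past the end of the window counts as a
    mismatch. In hardware all comparisons happen in parallel; the nested
    loops here only enumerate them.
    """
    sw, la = len(window), len(lookahead)
    return [
        [int(j + i < sw and window[j + i] == lookahead[i]) for i in range(la)]
        for j in range(sw)
    ]


bits = match_bits("abcabc", "abd")
for j, row in enumerate(bits):
    print(j, row)
```

Each row of the result corresponds to one comparison set j, and each column to one look-ahead position i.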

In the second part, the string length counter counts the consecutive '1's generated by the previous part. It is immediately followed by a binary maximum selection tree that selects the j corresponding to the maximum match length.
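A minimal software sketch of this second part is given below. The function names are hypothetical, and the tie-breaking rule (the lower position wins on equal lengths) is an assumption made for the example; the text does not specify it.

```python
def match_length(bits_row):
    """Count the run of consecutive 1s at the start of one row of match bits."""
    n = 0
    for b in bits_row:
        if not b:
            break
        n += 1
    return n


def max_select_tree(lengths):
    """Pairwise (binary-tree) reduction returning (best_length, position).

    Assumes a non-empty list. On a tie the lower position is kept,
    which is an arbitrary choice for this sketch.
    """
    level = [(length, j) for j, length in enumerate(lengths)]
    while len(level) > 1:
        nxt = []
        for k in range(0, len(level), 2):
            pair = level[k:k + 2]
            best = pair[0]
            # each tree node forwards the larger length and its position
            if len(pair) == 2 and pair[1][0] > pair[0][0]:
                best = pair[1]
            nxt.append(best)
        level = nxt
    return level[0]


print(match_length([1, 1, 0, 1]))
print(max_select_tree([2, 0, 3, 3, 1]))
```

Each pass of the `while` loop models one level of comparators in the tree, so the number of passes grows with the logarithm of the number of window positions.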

In the third part, the codeword is generated according to whether there is a string match. If there is a match, the codeword is encoded in the form (1, position, length); if there is no match, it is encoded in the form (0, symbol).

The bit width of the position p is ⌈log2(SW)⌉, and the bit width of the maximum match length l is ⌈log2(LA_BF-1)⌉. The sliding dictionary is also shifted at the clock edge of every clock cycle.
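The codeword layout and field widths can be illustrated with the short sketch below. Several details are assumptions made only for the example: the 8-bit symbol width, the `min_match` threshold, storing the raw match length in the length field, and the parameter values SW = 32 and LA_BF = 16. None of these are given in the text.

```python
from math import ceil, log2


def field_widths(sw, la_bf):
    """Bit widths of the position and length fields, as stated in the text."""
    return ceil(log2(sw)), ceil(log2(la_bf - 1))


def make_codeword(match_len, position, symbol, sw, la_bf,
                  symbol_bits=8, min_match=1):
    """Return the codeword as a bit string.

    Assumptions: symbol_bits and min_match are hypothetical parameters,
    and the length field holds the raw match length.
    """
    p_bits, l_bits = field_widths(sw, la_bf)
    if match_len >= min_match:
        # (1, position, length)
        return "1" + format(position, f"0{p_bits}b") + format(match_len, f"0{l_bits}b")
    # (0, symbol)
    return "0" + format(symbol, f"0{symbol_bits}b")


print(field_widths(32, 16))
print(make_codeword(3, 5, ord("a"), sw=32, la_bf=16))
print(make_codeword(0, 0, ord("a"), sw=32, la_bf=16))
```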

The following case study is shown in Figure 4.9(a)-(d). Every row represents a cycle, and every column represents a sliding window position. First, we perform a fully parallel symbol match in each row. Second, we calculate the length of the matched strings in each column; the accumulated match length is shown in red. Third, we select the maximal match length from each row of the sliding dictionary, that is, the best match in every cycle. Fourth, the best-match strings are encoded from the series of maximal counts, which are marked with black borders.
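The row-and-column procedure of the case study can be mimicked by a small cycle-by-cycle model. The sketch below is one interpretation, not the thesis design: it assumes that column j holds the symbol received j + 1 cycles earlier and that each column has its own accumulating counter that resets on a mismatch. The function name `simulate` and the input string are hypothetical.

```python
def simulate(stream, sw=4):
    """Cycle-by-cycle model returning (counts_per_cycle, best_per_cycle)."""
    counts = [0] * sw              # one accumulating counter per window column
    rows, best = [], []
    for t, sym in enumerate(stream):
        for j in range(sw):
            src = t - 1 - j        # column j holds the symbol seen j + 1 cycles ago
            hit = src >= 0 and stream[src] == sym
            counts[j] = counts[j] + 1 if hit else 0
        rows.append(counts.copy()) # one row of accumulated match lengths
        best.append(max(counts))   # best match of this cycle
    return rows, best


rows, best = simulate("abcabcx")
for t, row in enumerate(rows):
    print(t, row, best[t])
```

In this model a growing run in the sequence of per-cycle maxima marks an ongoing match, and the run ends when the maximum drops back to zero.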

Figure 4.9 Case study panels (a), (b), and (c).

4.3.2 Unfolded Three-stage CAM-based LZSS Encoder

Figure 4.10 Unfolding as in the previous case study.

Unfolding is a transformation technique that can be applied to a DSP program to create a new program describing more than one iteration of the original program [3]. Unfolding trades additional hardware area for a higher data processing rate.
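As a small illustration of unfolding, the sketch below applies it to a run-length counter recurrence. The recurrence and the function names are chosen only for the example and are not taken from the thesis. The J-unfolded version consumes J inputs per iteration and produces the same outputs in fewer iterations.

```python
def counter_original(match_stream):
    """One input per iteration: c[n] = c[n-1] + 1 on a match, else 0."""
    out, c = [], 0
    for m in match_stream:
        c = c + 1 if m else 0
        out.append(c)
    return out


def counter_unfolded(match_stream, J=4):
    """J inputs per iteration; returns (outputs, number_of_iterations)."""
    out, c, iterations = [], 0, 0
    for base in range(0, len(match_stream), J):
        iterations += 1
        # the inner loop stands for J chained copies of the loop body,
        # which in hardware would all evaluate within one clock cycle
        for m in match_stream[base:base + J]:
            c = c + 1 if m else 0
            out.append(c)
    return out, iterations


stream = [1, 1, 0, 1, 1, 1, 1, 0]
print(counter_original(stream))
print(counter_unfolded(stream, J=4))
```

The chained copies also show why unfolding lengthens the combinational path: the J updates of `c` depend on each other within a single iteration.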

Although the design described above is fast, it still cannot meet the requirement of at-speed testing. The critical path of the above design is estimated to be about 2.5 ns, so the clock frequency is about 1/(2.5 ns) = 400 MHz.

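Because we want the encoder to reach a throughput of 10 Gbit/s, which is the transmission speed of the USB 3.1 standard [24], we can apply J-unfolding, where J is estimated as

$$J = \frac{10\ \text{Gbit/s}}{400\ \text{MHz} \times 8\ \text{bit}} = 3.125$$

(the 8 bits processed per cycle are inferred from the stated result of 3.125).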

However, unfolding also lengthens the critical path, so the operating frequency is lower than the original. Therefore, we choose J to be 4.
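The estimate of J can be reproduced with a few lines of arithmetic. The 8-bit symbol width used here is an assumption: it is not stated in the text, but it is the width that yields the quoted value of 3.125 from a 10 Gbit/s target and a 400 MHz clock. The function name is hypothetical.

```python
from math import ceil


def unfold_factor(target_bps, clock_hz, symbol_bits=8):
    """Return (exact ratio, smallest integer J that meets the target).

    Assumption: one symbol of symbol_bits bits is consumed per cycle
    in the original (non-unfolded) design.
    """
    ratio = target_bps / (clock_hz * symbol_bits)
    return ratio, ceil(ratio)


ratio, j = unfold_factor(10e9, 400e6)
print(ratio, j)
```

Rounding up to J = 4 leaves some margin for the lower clock frequency after unfolding: with these assumed numbers, 4 symbols of 8 bits per cycle still reach 10 Gbit/s as long as the clock stays at or above 312.5 MHz.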

Figure 4.11 Original sliding window of LZSS encoder.

Figure 4.12 Unfolded sliding window of LZSS encoder.

Figure 4.13 Parallel comparators of Figure 4.11 and Figure 4.12.

Figure 4.12 shows the architecture of the sliding dictionary and the parallel comparators after 4-unfolding. Because the sliding window elements are shared, we only need 3 extra registers for the sliding window and 3 extra registers for the input buffer. Figure 4.14 shows the parallel comparator of the first part and the string length counter of the second part. Because there is a feedback loop in the string length counter, it cannot be pipelined further to increase the speed, and this path is a candidate for the critical path.
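The register sharing can be pictured as J overlapping views onto one register array, which is why only J - 1 = 3 extra registers are needed for J = 4. The sketch below is illustrative only; the function name and the window size SW = 8 are hypothetical values chosen for the example.

```python
def unfolded_windows(registers, sw, J=4):
    """Return the J overlapping SW-long views seen by the J comparator sets."""
    if len(registers) != sw + J - 1:
        raise ValueError("expected SW + J - 1 shared registers")
    return [registers[k:k + sw] for k in range(J)]


regs = list("abcdefghijk")   # SW = 8 window registers plus J - 1 = 3 extra
views = unfolded_windows(regs, sw=8, J=4)
for k, view in enumerate(views):
    print(k, "".join(view))
```

Without sharing, J separate windows would need J × SW registers instead of SW + J - 1.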

The binary comparison tree of the second part is entirely feed-forward, with no feedback loop, so we can pipeline the comparator tree to shorten the critical path. However, we do not want the area overhead of the pipeline registers to grow too large, and the critical path is already bounded by about 4 stages of comparators plus accumulators, so we choose 2 comparator stages per pipeline stage.
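The effect of grouping 2 comparator levels per pipeline stage can be estimated as follows. This is a rough sketch: the window sizes used are hypothetical, since the text does not give SW at this point, and the function name is an assumption.

```python
from math import ceil, log2


def pipeline_stages(sw, comparators_per_stage=2):
    """Return (tree depth in comparator levels, number of pipeline stages)."""
    depth = ceil(log2(sw))
    return depth, ceil(depth / comparators_per_stage)


for sw in (32, 256):
    print(sw, pipeline_stages(sw))
```

Fewer comparator levels per stage would shorten the path further, but would add more pipeline registers, which is the area trade-off described above.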

Figure 4.14 The 4-unfolded parallel comparison logic for the sliding window.

Figure 4.15 (a) The pipelined maximum selection tree with two comparator stages in one pipeline stage, and (b) the structure of each node in (a).

Figure 4.16 The codeword formation example.

In the third part, we encode the symbols into codewords according to whether there is a match. As shown in Figure 4.16, the third part may generate 1 to 4 codewords in one cycle. We use the processed maximum length and position to form the codeword, and we need the information from the string length counter of the previous part to produce the correct output. The whole architecture of the third part and a single processing element in it are shown in Figure 4.17(a) and (b), respectively.

Table 4.3 Synthesis result of the 4-unfold LZSS encoder.

Figure 4.17 (a)(b) The processing element and the best-match encoder.

Figure 4.18 Proportions of the area in the 4-unfolded LZSS encoder architecture.

Figure 4.19 The encoder design is scalable as the core number grows.
