
Chapter 2  Review of Test Response Compaction Techniques

3.4  LZSS-Based Test Response Compaction for Identical Multiple

3.4.2  Determination of Design Parameters

We need to determine the design parameters of the LZSS encoding engine to maximize the compression ratio CR under reasonable hardware constraints. A large sliding window and lookahead buffer can help match more symbols, and therefore improve CR. But the hardware cost as well as the

Table 3.5 The settings of experiment 2.

Testing configuration
    Test pattern length: 1024 bits
    Number of test patterns: 100

Multicore parameters
    Number of cores (N): 16
    Core error probability parameter (λ): 0.025

LZSS design parameters
    Lookahead buffer size: 2, 3, 5, 9, 17 symbols
    Sliding window size: 64, 128, 256, 512, 1024 symbols

Figure 3.13 The compression ratio result as LA_BF and SW vary.

The simulation result of this experiment shows that the compression ratio saturates at (LA_BF, SW) = (5, 1024) in the LZSS design space, as shown in Figure 3.13. However, a larger lookahead buffer and a larger sliding window both increase the hardware area overhead of the compactor. Therefore, we choose the point at which CR begins to show diminishing returns, which is (LA_BF, SW) = (5, 256). Figure 3.13 also shows that, with this setting of lookahead buffer size and sliding window size, the compression ratio falls below 0.5 when the core error rate parameter λ is below 0.025.

3.5 Summary

In test stimulus decompression, the decompressor reaches a compression ratio of about 50% when 95% of the bits in the test patterns are don’t-care bits.

The test response compactor achieves a compression ratio below 0.5 for 16 cores with the core error rate parameter λ = 0.025. As for the design parameters, the lookahead buffer should be long because long consecutive matches may occur.

Increasing LA_BF can improve the compression ratio, at the small cost of codeword

Chapter 4

Hardware Architectures for Test Stimulus Decompressors and Test Response Compactors

4.1 Hardware Architecture for Test Stimulus Decompressor

4.1.1 LZSS Decoder Architecture

The LZSS decoder is a simple and direct implementation. The decoding cycle consists of three stages: codeword decomposition, symbol read-out, and symbol shift-in.

In the first stage, the translation process, the read-out circuit translates the codeword into the match flag, symbol, index, and match length. The remaining stages of the decoder, namely the read-out and shift-in stages, operate on a constant cycle-by-cycle

Figure 4.1 Hardware architecture for LZSS decoder.

Figure 4.2 The finite state machine (FSM) of the decoder.

Figure 4.3 The timing diagram of LZSS decoder.

Using the property of continuous output to the scan chains, we design the finite state machine (FSM) of the decoder, as shown in Figure 4.2. The FSM contains only three states: initialization, decode, and output. The system begins in the initialization state. While symbols remain to be output, the FSM stays in the output state; otherwise, it returns to the decode state and accepts the next codeword. As a result, output cycles feed the scan chain continuously. This complies well with the conventional scan chain architecture, in which the scan chain shifts in one bit per cycle.
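The three decoding stages and the decode/output behavior of the FSM can be modeled in software. The following is a minimal behavioral sketch, not the implemented hardware. It assumes that a codeword is either the tuple (1, position, length) or (0, symbol), that the sliding window has a fixed size and starts filled with a known symbol, and that position indexes the window with 0 as the oldest entry. The names lzss_decode, sw, and init_symbol are hypothetical.

```python
from collections import deque


def lzss_decode(codewords, sw=8, init_symbol=0):
    """Behavioral model of the decoder: decompose, read out, shift in."""
    # Initialization state: the window starts filled with a known symbol
    # (an assumption of this sketch).
    window = deque([init_symbol] * sw, maxlen=sw)
    output = []
    for codeword in codewords:
        # Decode state, stage 1: codeword decomposition.
        if codeword[0] == 1:
            _, position, length = codeword
            # Stage 2: read out `length` consecutive symbols at once.
            symbols = list(window)[position:position + length]
        else:
            _, symbol = codeword
            symbols = [symbol]
        # Output state, stage 3: emit one symbol per step and shift it
        # into the window, which drops the oldest entry.
        for s in symbols:
            output.append(s)
            window.append(s)
    return output


decoded = lzss_decode([(0, 7), (0, 9), (1, 6, 2), (0, 4)], sw=8)
```

In this example the two literals 7 and 9 occupy window positions 6 and 7, so the match codeword (1, 6, 2) reproduces them.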

Figure 4.4 Symbol combining for shared mux control input.

4.1.2 Speed Limitations and Optimal Design

The stack of large address decoders (large multiplexers) in the read-out circuit is the main source of the critical path delay. Because the critical path contains a feedback loop, unfolding cannot increase the throughput: the critical path delay also grows as the unfolding factor J increases. In other words, the circuit timing is limited by the iteration bound. The only optimization available at the architectural level is to share the selection signals of the decoder multiplexers. As shown in Figure 4.4, for every sliding window position, we fetch the symbol at that position followed by the next LA_BF-1 consecutive symbols. This eliminates the hardware (signal wires) for generating up to LA_BF separate selection signals.

4.1.3 Implementation Results

We use Design Compiler to synthesize our decoder in the TSMC 0.13 μm process.

The synthesis result is shown in Table 4.1. The LZSS-based decoder hardware using this architecture reaches a throughput of 0.278~1.39 GB/s.

We perform the back-end place-and-route of the test stimulus decompressor using SoC Encounter. The decoder hardware after APR is shown in Figure 4.5 and Table 4.2. The die size is about 0.56 mm × 0.55 mm. The design operates at 250 MHz, which gives a symbol processing rate of 250 MB/s.
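The reported rates follow from the clock frequency and the number of symbols produced per cycle. The short calculation below is one reading that is consistent with the quoted figures. It assumes one-byte symbols, between 1 and LA_BF = 5 output symbols per cycle, and the 278 MHz synthesis clock quoted in Section 5.1. The function name is hypothetical.

```python
def throughput_bytes_per_s(clock_hz, symbols_per_cycle, bytes_per_symbol=1):
    """Symbol processing rate in bytes per second."""
    return clock_hz * symbols_per_cycle * bytes_per_symbol


# Synthesis clock (278 MHz): one literal per cycle up to a full 5-symbol match.
synth_low = throughput_bytes_per_s(278_000_000, 1)
synth_high = throughput_bytes_per_s(278_000_000, 5)

# Post-APR clock (250 MHz) at one symbol per cycle.
post_apr = throughput_bytes_per_s(250_000_000, 1)
```

Under these assumptions the range evaluates to 0.278 GB/s and 1.39 GB/s, and the post-APR rate to 250 MB/s.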

Table 4.1 Synthesis result of LZSS decoder.

Figure 4.5 The chip layout of LZSS decoder design core.

Table 4.2 The implementation result of LZSS decoder.

Figure 4.6 Content addressable memory (CAM): (a) the whole CAM architecture and (b) a cell of the CAM.

4.2 Review of LZSS Encoder Design

4.2.1 CAM-based LZSS Encoder

Content-addressable memory (CAM) is a kind of memory that supports a “match” operation. The circuit shown in Figure 4.6(a) is an analog implementation of a CAM. The analog CAM cell shown in Figure 4.6(b) is an ordinary SRAM cell plus three transistors dedicated to the match operation. The CAM-based LZSS codec in related work [15] uses the intrinsic comparison function of the CAM to perform fast comparisons. The hardware performs all comparisons in parallel in a single cycle and therefore has high throughput.

Figure 4.7 Systolic-array-based LZSS encoder: (a) the systolic array architecture; (b) the space-time diagram of (a); (c) the processor element of each circle of (a).

4.2.2 Systolic-array-based LZSS Encoder

The systolic array performs comparison on a cycle-by-cycle basis, which means that the comparison of one string is distributed over multiple cycles, as shown in Figure 4.7 [18]. The systolic array design starts with a pre-map dependence graph (DG). By choosing a proper scheduling vector and projection vector, we can trade area overhead against latency. Although this architecture has a lower hardware cost, it suffers from lower throughput.

4.2.3 Comparison

The CAM-based design has the merits of high throughput and low latency. However, it suffers from a large area overhead due to the highly parallel comparison hardware. The systolic-array-based design has a lower area overhead than the CAM-based design, but its throughput is lower and its latency is higher. Because we do not want the test compression circuit to degrade the speed of the circuit, the compactor must not become the system bottleneck. Therefore, we choose the CAM-based hardware architecture, which has higher throughput and lower latency.

4.3 Encoder Hardware Architecture

4.3.1 Proposed Three-stage CAM-based LZSS Encoder

The proposed CAM-based LZSS encoder consists of three main parts, as shown in Figure 4.8. The first part is the sliding dictionary together with the parallel generation of the match bits. The second part is the parallel string length counter and the comparator tree. The third part is the generation of the codeword.

Figure 4.8 Block diagram of LZSS encoder architecture.

In the first part, the sliding dictionary uses the CAM-based design described in the previous section. It compares each set of the sliding dictionary in parallel. Namely, we compare the string $\{SW_j, SW_{j+1}, \ldots, SW_{j+LA\_BF-1}\}$ with $\{LA_0, LA_1, \ldots, LA_{LA\_BF-1}\}$ for each $j$ from $0$ to $SW-1$. Next, we generate the match bits for each comparison set $j$, namely

$$\mathrm{match\_bit}_{j,i} = \left(SW_{j+i} = LA_i\right) \tag{4.1}$$

for $j$ from $0$ to $SW-1$ and for $i$ from $0$ to $LA\_BF-1$.
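Equation (4.1) can be illustrated with a short software model. This is a sketch of the comparison only, not of the CAM circuit. It assumes that a position j + i falling outside the sliding window counts as a mismatch, which the text does not specify. The function name match_bits is hypothetical.

```python
def match_bits(window, lookahead):
    """Return bits[j][i] = 1 if window[j + i] equals lookahead[i], else 0.

    Assumption of this sketch: positions beyond the end of the window
    are treated as mismatches.
    """
    sw, la_bf = len(window), len(lookahead)
    bits = []
    for j in range(sw):
        row = []
        for i in range(la_bf):
            inside = j + i < sw
            row.append(1 if inside and window[j + i] == lookahead[i] else 0)
        bits.append(row)
    return bits


window = list("abcabcab")
lookahead = list("cab")
bits = match_bits(window, lookahead)
```

Here the lookahead string "cab" occurs at window positions 2 and 5, so only rows 2 and 5 consist entirely of ones. In hardware all rows are produced in the same cycle; the nested loops only model the result.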

In the second part, the string length counter counts the consecutive ‘1’s generated by the previous part. It is immediately followed by a binary maximum selection tree that selects the position j corresponding to the longest match.

In the third part, the codeword is generated according to whether there is a string match. If there is a match, the codeword takes the form ⟨1, position, length⟩; if there is no match, it takes the form ⟨0, symbol⟩.

The bit width of the position p is ⌈log2(SW)⌉, and the bit width of the maximum match length l is ⌈log2(LA_BF-1)⌉. The shift operation of the sliding dictionary is also performed at the clock edge of each clock cycle.
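The second and third parts can be sketched in the same way, taking a matrix of match bits as input. This is an illustrative model under stated assumptions: the minimum match length that qualifies as a match (min_match), the choice of the lowest position among equal-length matches, and the 8-bit symbol width are all assumptions of the sketch, and the function names are hypothetical.

```python
from math import ceil, log2


def run_length(row):
    """String length counter: count the leading consecutive 1s in a row."""
    count = 0
    for bit in row:
        if bit != 1:
            break
        count += 1
    return count


def encode_step(bits, lookahead, min_match=2):
    """Select the longest match and form one codeword."""
    lengths = [run_length(row) for row in bits]
    best_len = max(lengths)
    best_pos = lengths.index(best_len)  # lowest position among ties
    if best_len >= min_match:
        return (1, best_pos, best_len)
    return (0, lookahead[0])


def codeword_bits(sw, la_bf, symbol_bits=8):
    """Codeword widths from the bit-width formulas in the text."""
    pos_bits = ceil(log2(sw))
    len_bits = ceil(log2(la_bf - 1))
    return {"match": 1 + pos_bits + len_bits, "literal": 1 + symbol_bits}


match_cw = encode_step([[0, 0, 0], [1, 1, 0], [1, 1, 1], [1, 0, 1]], ["c", "a", "b"])
literal_cw = encode_step([[0, 0, 0], [0, 1, 1]], ["c", "a", "b"])
widths = codeword_bits(256, 5)
```

For (LA_BF, SW) = (5, 256), the formulas give an 8-bit position and a 2-bit length, so a match codeword is 11 bits wide in this model and a literal codeword is 9 bits wide.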

A case study is shown in Figure 4.9(a)-(d). Every row represents a cycle, and every column represents a sliding window position. First, we perform a fully parallel symbol match in each row. Second, we calculate the length of the matched strings in each column; the accumulated match length is shown in red. Third, we select the maximal match length in each row of the sliding dictionary, that is, the best match in every cycle. Fourth, the best-match strings are encoded by the series of maximal counts, shown with black borders.


4.3.1 Unfolded Three-stage CAM-based LZSS Encoder

Figure 4.10 Unfolding as in the previous case study.

Unfolding is a transformation technique that can be applied to a DSP program to create a new program describing more than one iteration of the original program [3]. Unfolding trades more hardware area for a higher data processing rate.

Although the design described above is fast, it still cannot meet the requirement of at-speed testing. Its critical path is estimated to be about 2.5 ns, so the clock frequency is about 1/2.5 ns = 400 MHz.

Because we want the encoder to reach a throughput of 10 Gbit/s, which is the transmission speed of the USB 3.1 standard [24], we apply J-unfolding, where J is estimated as

$$J = \frac{10\ \text{Gbit/s}}{8\ \text{bit} \times 400\ \text{MHz}} = 3.125 \tag{4.2}$$

However, unfolding also lengthens the critical path, so the operating frequency is lower than that of the original design. Therefore, we choose J to be 4.
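The choice of J can be checked numerically. The sketch below assumes 8-bit symbols and one symbol per cycle before unfolding, which is consistent with the value 3.125 but is not stated explicitly in the text. It also reports the lowest clock frequency at which a 4-unfolded design would still meet the target. The function names are hypothetical.

```python
from math import ceil


def unfold_factor(target_bps, clock_hz, bits_per_symbol=8):
    """Return the exact ratio and the smallest integer unfolding factor."""
    exact = target_bps / (clock_hz * bits_per_symbol)
    return exact, ceil(exact)


def min_clock_hz(target_bps, j, bits_per_symbol=8):
    """Lowest clock frequency that still meets the target with factor j."""
    return target_bps / (j * bits_per_symbol)


exact, j = unfold_factor(10_000_000_000, 400_000_000)
slack_clock = min_clock_hz(10_000_000_000, j)
```

With J = 4, the clock may drop from 400 MHz to 312.5 MHz before the throughput falls below 10 Gbit/s, which leaves some room for the longer critical path after unfolding.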

Figure 4.11 Original sliding window of LZSS encoder.

Figure 4.12 Unfolded sliding window of LZSS encoder.

Figure 4.13 Parallel comparators of Figure 4.11 and Figure 4.12.

Figure 4.12 shows the architecture of the sliding dictionary and the parallel comparators after 4-unfolding. Because the sliding window elements are shared, we need only 3 extra registers for the sliding window and 3 extra registers for the input buffer. Figure 4.14 shows the parallel comparators of the first part and the string length counter of the second part. Because the string length counter contains a feedback loop, it cannot be pipelined further, and this path is a candidate for the critical path.

The binary comparison tree of the second part is entirely feed-forward, with no feedback loop, so we can pipeline the comparator tree to shorten the critical path. However, we do not want the pipeline registers to add too much area, and the critical path is already bounded by about 4 stages of comparators plus accumulators. We therefore choose 2 comparator stages per pipeline stage.
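The maximum selection tree and its pipelining depth can be sketched as a pairwise reduction. This models only the selection result and the number of comparator levels, not the register placement. Keeping the left (lower-position) input on a tie is an assumption of the sketch, and the function names are hypothetical.

```python
from math import ceil


def tree_max(lengths):
    """Reduce match lengths with a binary comparator tree.

    Returns (best_length, best_position, comparator_levels).
    """
    nodes = [(length, pos) for pos, length in enumerate(lengths)]
    levels = 0
    while len(nodes) > 1:
        next_nodes = []
        for k in range(0, len(nodes), 2):
            pair = nodes[k:k + 2]
            winner = pair[0]
            # Take the right input only if it is strictly longer.
            if len(pair) == 2 and pair[1][0] > pair[0][0]:
                winner = pair[1]
            next_nodes.append(winner)
        nodes = next_nodes
        levels += 1
    best_len, best_pos = nodes[0]
    return best_len, best_pos, levels


def pipeline_stages(levels, levels_per_stage=2):
    """Pipeline stages when each stage holds a fixed number of levels."""
    return ceil(levels / levels_per_stage)


small = tree_max([0, 2, 3, 1, 3, 0, 1, 2])
full = tree_max([0] * 256)
stages = pipeline_stages(full[2])
```

For a 256-entry sliding window, the tree has 8 comparator levels, so 2 comparator stages per pipeline stage yields 4 pipeline stages in this model.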

Figure 4.14 The 4-unfolded parallel comparison logic for the sliding window.

Figure 4.15 (a) The pipelined maximum selection tree with two comparator stages in one pipeline stage, and (b) the structure of each node in (a).

Figure 4.16 The codeword formation example.

In the third part, we encode the symbols into codewords according to whether there is a match. As shown in Figure 4.16, the third part may generate 1 to 4 codewords in one cycle. We use the processed maximum length and position to form the codeword, and we need the information from the string length counter in the previous part to produce the correct output. The whole architecture of the third part and the single processing element unit in it are shown in Figure 4.17(a)(b), respectively.

Table 4.3 Synthesis result of the 4-unfold LZSS encoder.

Figure 4.17 (a)(b) The processing element and the best-match encoder.

Figure 4.18 Proportions of the area in the 4-unfolded LZSS encoder architecture.

Figure 4.19 The encoder design is scalable as the core number grows.

4.3.1 Scalability of LZSS Encoder

The LZSS-based test response compactor is a scalable design. The portion of the hardware architecture that grows with the core number N accounts for only about 35%,

(4.3)

As can be seen in Figure 4.19, as the number of cores grows from 8 to 32 cores, the area overhead per core decreases by 63.0%.

4.3.2 Implementation Results

We synthesize the encoder with Design Compiler in the TSMC 0.13 μm process.

The synthesis result, shown in Table 4.3, indicates that the 4-unfolded LZSS encoder architecture can achieve a throughput of 2 GB/s at a clock rate of 500 MHz.

We perform the back-end place-and-route of the test response compactor using SoC Encounter. The encoder hardware after APR is shown in Figure 4.20 and Table 4.4. The die size is about 1.24 mm × 1.24 mm. The design operates at 100 MHz, which gives a symbol processing rate of 400 MB/s.

Figure 4.20 The chip layout of LZSS encoder design core.

Table 4.4 The implementation result of LZSS encoder.

4.4 Summary

The hardware architecture of the LZSS decoder is simple and reaches a throughput of 0.278~1.39 GB/s with low latency. The LZSS encoder is a 4-unfolded, three-stage architecture that scales as the number of identical cores under test grows. It reaches a throughput of 1.6 GB/s at a clock frequency of 400 MHz.

Chapter 5

Conclusion and Future Works

5.1 Main Contributions

In conclusion, this thesis proposes an LZSS-based test compression scheme for identical multicore systems that is both scalable and suitable for diagnosis. The LZSS-based test response compactor exploits the intrinsic correlation between the cores of an identical multicore system. As shown in the simulation, the compression ratio reaches about 50% under the core error rate parameter λ = 0.025. The decoder is implemented in the TSMC 0.13 μm process and reaches a symbol processing rate of 0.278 GB/s at a clock speed of 278 MHz, with an area of 0.55 mm × 0.56 mm. The LZSS encoder for test response compaction is implemented in the TSMC 0.13 μm process and reaches a symbol processing rate of 0.4 GB/s at a clock speed of 100 MHz, with an area of 1.24 mm × 1.24 mm. The encoder hardware architecture is also scalable: when the number of cores grows from 8 to 32, the area overhead per core drops by 63.0%.

5.2 Future Directions

In the first direction, we can use Huffman coding instead of LZSS for test stimulus compression in order to lower the on-chip decoding complexity. Although Huffman encoding is more complex than LZSS encoding, its decoding is simpler, so a Huffman-based test stimulus decompressor can reduce the hardware overhead. Also, because a Huffman code is a prefix code, the encoded bitstream can be parsed into codewords easily during decoding. Therefore, Huffman coding may be the better choice for test stimulus compression.
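The prefix-code property mentioned above can be illustrated with a small parser. This is a sketch with a hypothetical four-symbol code table chosen for illustration; it is not a code derived from any test data in this thesis, and the function name is hypothetical.

```python
def prefix_decode(bitstring, code_table):
    """Parse a bitstring into symbols using a prefix code.

    Because no codeword is a prefix of another, the first codeword that
    matches the accumulated bits is always the correct one.
    """
    inverse = {code: symbol for symbol, code in code_table.items()}
    symbols, current = [], ""
    for bit in bitstring:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bitstream ends inside a codeword")
    return symbols


# Hypothetical code table for illustration only.
table = {"A": "0", "B": "10", "C": "110", "D": "111"}
parsed = prefix_decode("0101100111", table)
```

The parser needs no codeword boundaries or length fields in the bitstream, which is the property that keeps the decoder simple.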

In the second direction, we can perform test-compression-aware ATPG, that is, generate test stimulus patterns whose don’t-care bits make them more compressible by the LZSS algorithm. The don’t-care bits are the bits left unspecified after the ATPG algorithm targets specific faults. We can reorder the scan chain bits to group the don’t-care bits together and improve the compressibility.

References

[1] L.-T. Wang, C.-W. Wu and X. Wen, VLSI Test Principles and Architectures. Elsevier, 2006.

[2] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing for Digital, Memory & Mixed-Signal VLSI Circuits. Kluwer Academic, 2000.

[3] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.

[4] Test and Test Equipment, in International Technology Roadmap for Semiconductors (ITRS-2011),

https://www.dropbox.com/sh/r51qrus06k6ehrc/AABnW0tCHso_jKTGwLKyXE Jxa/2011Chapters/2011Test.pdf?dl=0

[5] nVidia, http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF

[6] T. Han, I. Choi and S. Kang, “Majority-Based Test Access Mechanism for Parallel Testing of Multiple Identical Cores,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no.8, Aug. 2015.

[7] G. Giles, J. Wang, A. Sehgal, K. J. Balakrishnan, and J. Wingfield, “Test Access Mechanism for Multiple Identical Cores,” in IEEE International Test Conference (ITC’08), 2008.

[8] M. Sharma, A. Dutta, W.-T. Cheng, B. Benware and M. Kassab, “A Novel Test Access Mechanism for Failure Diagnosis of Multiple Isolated Identical Cores,” in IEEE International Test Conference (ITC 2011), 2011.

[9] S. Vangal et al., “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” IEEE International Solid-State Circuits Conference (ISSCC), 2007.

[10] J. Howard et al., "A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS," IEEE International Solid-State Circuits Conference (ISSCC), 2010.

[11] TILE-Gx 72 Processor, Mellanox, “TILE-Gx Processor Family,” http://www.mellanox.com/page/products_dyn?product_family=238, Feb. 2013.

[12] B. Jung and W. P. Burleson, “Efficient VLSI for Lempel–Ziv Compression in Wireless Data Communication Networks,” IEEE Transactions on VLSI Systems, vol. 6, pp. 475–483, Sept. 1998.

[13] S. James and S. Thomas, "Data compression via textual substitution," Journal of

[14] S. Mitra and K. S. Kim, "X-Compact: An Efficient Response Compaction Technique," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 3, Mar. 2004.

[15] K.-J. Lin and C.-W. Wu, "A low-power CAM design for LZ data compression," IEEE Transactions on Computers, vol. 49, no. 10, pp. 1139-1145, Oct. 2000.

[16] C. Y. Lee and R.-Y. Yang, "High-throughput Data Compressor Designs Using Content Addressable Memory," IEE Proceedings - Circuits, Devices and Systems, vol. 142, no. 1, pp. 69-73, Feb. 1995.

[17] N. Ranganathan and S. Henriques, “High-Speed VLSI Designs for Lempel-Ziv-Based Data Compression,” IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 40, no. 2, Feb. 1993.

[18] S.-A. Hwang and C.-W. Wu, "Unified VLSI systolic array design for LZ data compression," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 4, pp. 489-499, Aug. 2001.

[19] S. Jones, “100 Mbit/s Adaptive Data Compressor Design Using Selectively Shiftable Content-Addressable Memory,” IEE Proceedings-G, vol. 139, no. 4, pp. 498-502, Aug. 1992.

[20] L. Li, K. Chakrabarty and N. A. Touba, “Test Data Compression Using Dictionaries with Selective Entries and Fixed-Length Indices,” ACM Transactions on Design Automation of Electronic Systems, vol. 8, no. 4, pp. 470-490, Oct. 2003.

[21] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, vol. IT-23, no. 3, May 1977.

[22] R. Kapur, T. W. Williams, and S. Mitra, “Historical Perspective on Scan Compression,” IEEE Design & Test of Computers, vol. 25, issue 2, pp. 115-120, Mar.-Apr. 2008.

[23] M. Bellos and D. Nikolos, "Deterministic Test Vector Compression/Decompression Using an Embedded Processor," in Proceedings 5th European Dependable Computing Conference (EDCC-5), pp. 318-331, Budapest, Hungary, April 2005.

[24] USB 3.1 Specifications, http://www.usb.org/developers/docs/usb_31_052016.zip

[25] B. W. Y. Wei, R. Tarver, J.-S. Kim and K. Ng, “A Single Chip Lempel-Ziv Data Compressor,” in IEEE International Symposium on Circuits and Systems (ISCAS’93), pp. 1953-1955, vol. 3, May 1993.
