Thesis Organization - LZSS Compression Algorithm and Hardware Architecture Design for Test Compression and Diagnosis Mechanisms in Identical Multicore Systems

Chapter 1 Introduction

1.3 Thesis Organization

The thesis consists of five chapters. Chapter 2 reviews related work on test stimulus compression and test response compaction. Chapter 3 introduces the LZSS compression algorithm and its use in test stimulus compression and test response compaction.

Chapter 4 presents the hardware design of the decompressor for on-chip test stimulus decompression and of the compactor for on-chip test response compaction. Finally, Chapter 5 presents the conclusion and future work.

Chapter 2 Review of Test Response Compaction Techniques

2.1 Related Works

2.1.1 Majority-Based Test Access Mechanism for Parallel Testing of Multiple Identical Cores

Several works address test compression for identical multicore systems. In [6], the authors propose a test access mechanism (TAM) for identical multicore systems, as shown in Figure 2.1(a). They assume that the majority value of the outputs of all cores is always correct, and therefore use it as the golden response.

In test response compaction, every scan chain in each core is compared (XORed) with the corresponding majority value. After the XOR stage, the values are ORed together.

However, the method suffers from two problems. First, although the compacted response reveals whether a core is correct, it does not reveal the exact fault location; that is, we cannot diagnose the circuit. Second, the remedy proposed by the author of [6], which re-feeds the scan chains, performs fault simulation again, and routes the responses through different paths for diagnosis, may be unacceptable because it doubles both the test time and the test cost.

The method of [6] has two drawbacks. The first is its non-scalable complexity: the more cores there are, the larger the majority analyzer (MA) becomes, and the hardware overhead grows superlinearly. The second is that diagnosis requires two phases. The first phase generates the comparison result. If all outcomes are correct, we are done. However, if an outcome differs from the majority value, we need to set MA_sel in Figure 2.1(b) to select the output of the faulty core and use the flip-flops shown in Figure 2.2 to obtain the original, uncompacted test response.

Figure 2.1 The system architecture of related work [6]: (a) full system architecture and (b) an example of a 3-input majority analyzer (MA).

2.1.2 A Novel Test Access Mechanism for Failure Diagnosis of Multiple Isolated Identical Cores

The method of [9] resembles that of [6], with two differences. First, [9] provides separate pins for the golden response input. Second, the OR tree is grouped into different sets of partitions, as shown in Figure 2.4, each of which is a different way to perform test response compaction. In [9], each partition is a set of circles, and each circle denotes a combination (e.g., OR) of all the difference bits inside it.

Figure 2.3 The system architecture of related work [9].

The compaction process works as follows. At each output pin of each partition, an OR result of 0 means that all inputs of the OR gate are 0, so the corresponding scan chains are fault-free. Conversely, an OR result of 1 means that at least one input of the OR gate is 1, so at least one of the scan chains is faulty. For circuit diagnosis, we first collect the compacted test response data from the different compaction partitions. In Figure 2.5, the circles denote test responses with value 1, i.e., erroneous outputs. To locate the faults exactly, we infer their positions from the faulty test responses. Cells with a red background are diagnosed as faulty, whereas the cell with a yellow background cannot be diagnosed as either correct or faulty. Moreover, the hardware overhead is non-scalable, because more sets of partitions are needed to generate the “equations” as the circuit grows.

Figure 2.4 Three different combinations of scan cells result in three different partition circuits: (a) the first partition circuit, (b) adding the second partition circuit, and (c) adding the third partition circuit.

Figure 2.5 Compaction scheme for diagnosis.

2.2 Problems of Related Works

In [6], the test response compaction is lossy, so we cannot obtain the exact test response for diagnosis. In identical multicore systems, diagnosing each core is even more difficult, because we cannot obtain the exact result of each core. In addition, test escapes can occur due to aliasing in the compacted syndrome.

The underlying reason is that both methods compact across different scan chains within the same core, rather than across the same scan chain in different cores. They therefore do not exploit the information redundancy among identical cores, which lowers compressibility. Neither method scales well: as the number of cores grows, the hardware area overhead of the compactor grows superlinearly.

Table 2.1 Comparison between related works and proposed approach.

2.3 Summary

The architectures of related works [6] and [9] are both non-scalable. [6] uses a majority analyzer, which grows at least as fast as the number of cores. [9] uses multiple compaction partitions, so more sets of compactors are needed as the numbers of scan chains and cores grow. Moreover, the test response information is hard to recover, which makes diagnosis after compaction impossible. If we can obtain the full test response information, we can improve diagnosis.

Chapter 3 LZSS-Based Test Compression for Identical Multiple Core Systems

3.1 System Description

Traditionally, test response compaction is categorized into space compaction and time compaction [1]. Space compaction compacts data in the space dimension, while time compaction compacts data in the time dimension. In identical multiple core systems, however, multicore test response data have three dimensions: core, space (scan chain), and time, where the core dimension is the additional third dimension. We would like to exploit this third dimension. As shown in Figure 3.1, there are two options. Figure 3.1(a) shows that if we compress all the scan chains within the same core, as in the configurations described in Chapter 2, the correlation between the data of different scan chains is low. This results in poor compressibility.

Figure 3.1 Two possible compaction schemes: (a) compressing different scan chains in the same core vs. (b) compressing the same scan chain across different cores.

3.2 LZSS Compression Algorithms

3.2.1 Introduction to the Algorithm

Common lossless compression algorithms include Huffman coding, arithmetic coding, and LZ-based compression algorithms. These algorithms share the property that decompression fully recovers the original information. However, Huffman coding requires the symbol probability distribution in order to code the information near-optimally. This means that, before a data stream is compressed, we must scan the entire stream to extract the symbol probability distribution.

Also, constructing the Huffman code requires a binary tree data structure, which causes high hardware complexity. Arithmetic coding likewise requires preprocessing to obtain the symbol distribution.

Dictionary-based coding, such as LZ-based compression algorithms and their variants [13][21], is widely used in file compression software and Unix system utilities. Dictionaries may be static or dynamic (adaptive). Using the dictionary, the algorithm replaces repeated sequences with shorter codes. The LZSS algorithm belongs to the family of LZ-based compression algorithms and is similar to the LZ77 (also called LZ1) algorithm, in which the pointer to the referenced sequence and the match length are encoded.

LZSS has several merits. First, it uses a dynamic dictionary, so compression and dictionary update proceed simultaneously; it therefore adapts to changing probability distributions of the symbols. Second, its hardware complexity is low, because the sliding dictionary only shifts, which can easily be realized with shift registers in digital circuits. Due to its near-optimal compression ratio and low hardware complexity, we adopt the LZSS algorithm in this thesis.

Figure 3.2 Pseudocode for LZSS compression algorithm.

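The definition of the compression ratio varies across the literature. We define the compression ratio (CR) as follows:

$$\mathrm{CR} = \frac{\text{size of the compressed data (bits)}}{\text{size of the original data (bits)}} \qquad (3.1)$$

A lower CR therefore means better compression efficiency.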

3.2.2 Compression Process

The pseudocode for the LZSS compression algorithm is shown in Figure 3.2. The algorithm operates on three parts: the sliding window (i.e., the sliding dictionary of previously encoded symbols), the lookahead buffer, and the input stream. Initially, the sliding window is empty. In each subsequent cycle, the encoder searches the sliding window for the best match with the data in the lookahead buffer. As shown in Table 3.1, if there is a match, the matched data in the lookahead buffer are replaced by the codeword (1, p, l), where p is the match position and l is the match length. Otherwise, the first symbol in the lookahead buffer is coded as (0, s), where s is the unmatched symbol. At the end of each encoding cycle, we shift all three parts by the number of symbols just encoded. The encoding process continues until both the input stream and the lookahead buffer are empty.
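The encoding loop described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the pseudocode of Figure 3.2: the function name `lzss_encode`, the greedy longest-match search, the convention that p is an offset from the oldest symbol in the window, and the rule that a match may not extend into the lookahead buffer are all illustrative choices.

```python
from typing import List, Sequence, Tuple, Union

# A codeword is (1, p, l) for a match or (0, s) for an unmatched symbol.
Codeword = Union[Tuple[int, int, int], Tuple[int, int]]


def lzss_encode(data: Sequence[int], sw: int = 256, la_bf: int = 5,
                min_match: int = 2) -> List[Codeword]:
    """Greedy LZSS encoder (illustrative sketch, hypothetical helper).

    sw        -- sliding window size in symbols
    la_bf     -- lookahead buffer size, i.e. the maximum match length
    min_match -- shortest match that is worth coding as (1, p, l)
    """
    out: List[Codeword] = []
    i = 0  # index of the first symbol currently in the lookahead buffer
    while i < len(data):
        window = data[max(0, i - sw):i]        # previously encoded symbols
        max_len = min(la_bf, len(data) - i)    # lookahead buffer contents
        best_len, best_pos = 0, 0
        for p in range(len(window)):
            l = 0
            # Extend the match while it stays inside the window.
            while (l < max_len and p + l < len(window)
                   and window[p + l] == data[i + l]):
                l += 1
            if l > best_len:
                best_len, best_pos = l, p
        if best_len >= min_match:
            out.append((1, best_pos, best_len))  # matched codeword
            i += best_len                        # shift by l symbols
        else:
            out.append((0, data[i]))             # unmatched codeword
            i += 1                               # shift by one symbol
    return out
```

For the input `[1, 2, 3, 1, 2, 3, 1, 2, 3]`, this sketch emits three unmatched codewords followed by two `(1, 0, 3)` matches; a single-symbol match (e.g. the second symbol of `[5, 5]`) is coded as unmatched because of `min_match`.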

In the example of Figure 3.3, yellow blocks represent matched symbols in the lookahead buffer, and blue blocks represent the corresponding matched symbols in the sliding window. In the initial state, the sliding window is empty, so there is no match with the lookahead buffer. As encoding proceeds, the sixth row yields a match of length 3 starting at position 4.

Figure 3.3 The LZSS encoding algorithm.

If only one symbol is matched, we may not encode it as a match, because the matched codeword can be larger than the unmatched one. In other words, such a matched codeword has no coding gain and would expand the data. In this example, we assume that the sliding window size is 256, the lookahead buffer size is 5, and the symbol bit width is 8. Therefore, if only one symbol is matched, encoding it as (1, p, l) costs more bits than (0, s), as shown in Table 3.2. We thus encode a symbol as unmatched whenever the match length is less than 2. Also, because there are only four possible match lengths, from 2 to 5, two bits suffice to represent the match length.

Similarly, match lengths from 2 to 9 require only three bits. This is why we choose 5 and 9 as the LA_BF values in this experiment.
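The coding-gain argument above can be checked with a few lines of arithmetic. The sketch assumes the field layout implied by the text (a 1-bit flag, a ⌈log2 SW⌉-bit position, and a length field that only encodes lengths from 2 to LA_BF); the helper names are hypothetical and the numbers are not taken from Table 3.2.

```python
import math


def matched_bits(sw: int, la_bf: int, min_match: int = 2) -> int:
    """Size of a matched codeword (1, p, l): flag + position + length."""
    pos_bits = math.ceil(math.log2(sw))
    n_lengths = la_bf - min_match + 1          # lengths min_match..la_bf
    len_bits = math.ceil(math.log2(n_lengths)) if n_lengths > 1 else 0
    return 1 + pos_bits + len_bits


def unmatched_bits(b: int) -> int:
    """Size of an unmatched codeword (0, s): flag + b-bit symbol."""
    return 1 + b


def gain(match_len: int, sw: int, la_bf: int, b: int) -> int:
    """Bits saved by one (1, p, l) instead of match_len (0, s) codewords."""
    return match_len * unmatched_bits(b) - matched_bits(sw, la_bf)


# SW = 256, LA_BF = 5, b = 8: a matched codeword takes 1 + 8 + 2 = 11 bits.
for l in range(1, 6):
    print(l, gain(l, sw=256, la_bf=5, b=8))
```

Under these assumptions a one-symbol match loses 2 bits (9 versus 11), while a two-symbol match already saves 7, which is why matches shorter than 2 are coded as unmatched.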


Table 3.2 The compression gain at certain representation

Figure 3.4 Experiment result demonstrating the LZSS compressibility.

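The compression ratio is bounded below by the best case, in which every codeword is a maximal match of LA_BF symbols:

$$\mathrm{CR} \ge \frac{1 + \lceil \log_2 \mathrm{SW} \rceil + \lceil \log_2(\mathrm{LA\_BF} - 1) \rceil}{\mathrm{LA\_BF} \cdot b},$$

which means that the main parameters affecting CR are b and LA_BF. Owing to the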
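As a worked example of this bound, assume that every codeword is a maximal match of LA_BF symbols, that the position field takes ⌈log2 SW⌉ bits, and that the length field encodes lengths 2 to LA_BF. With SW = 256 and LA_BF = 5:

$$\mathrm{CR}_{\min} = \frac{1 + \lceil \log_2 256 \rceil + \lceil \log_2 (5 - 1) \rceil}{5 \cdot b} = \frac{1 + 8 + 2}{5b} = \frac{11}{5b}$$

so b = 8 gives 11/40 = 0.275, and b = 16 (16 cores) gives 11/80 = 0.1375. Doubling SW adds only one bit to the numerator, whereas LA_BF and b scale the denominator directly.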

3.2.3 Decompression Process

The decoding process closely resembles the encoding process and can be divided into three parts: decoding, read-out, and shift-in. A received codeword is either matched (1, p, l) or unmatched (0, s). For an unmatched codeword, the read-out result is simply the symbol in the codeword. For a matched codeword, we go to the sliding window position p specified by the codeword and read l consecutive symbols starting from position p. After read-out, we obtain the decoded output and, at the same time, shift the decoded symbols into the sliding window.
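The decoding steps above can be sketched as a software model. It assumes, for illustration only, that p is an offset from the oldest symbol currently in the sliding window and that a match never overlaps the symbols being decoded; `lzss_decode` is a hypothetical name.

```python
from typing import List, Sequence, Tuple, Union

# A codeword is (1, p, l) for a match or (0, s) for an unmatched symbol.
Codeword = Union[Tuple[int, int, int], Tuple[int, int]]


def lzss_decode(codewords: Sequence[Codeword], sw: int = 256) -> List[int]:
    """LZSS decoder sketch: decode, read out, then shift into the window."""
    out: List[int] = []
    for cw in codewords:
        if cw[0] == 0:
            # Unmatched: the read-out result is the literal symbol itself.
            _, s = cw
            out.append(s)
        else:
            # Matched: read l symbols starting at window position p.
            _, p, l = cw
            window = out[max(0, len(out) - sw):]  # current sliding window
            out.extend(window[p:p + l])
        # Appending to `out` plays the role of shifting the decoded
        # symbols into the sliding window.
    return out
```

For example, `lzss_decode([(0, 1), (0, 2), (0, 3), (1, 0, 3), (1, 0, 3)])` reconstructs `[1, 2, 3, 1, 2, 3, 1, 2, 3]`, since each match copies three symbols from the start of the window.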

3.2.4 LZSS Algorithm Design Parameters

There are three main design parameters of the LZSS algorithm: the sliding window size SW, the lookahead buffer size LA_BF, and the symbol bit width b. The lookahead buffer size LA_BF determines the maximum match length, that is, how many consecutive symbols a single match can cover. The sliding window is the dictionary memory, which holds the recent history of the symbol sequence. If the sliding window is too small to recall past data, we may miss chances to match, which compromises the compression ratio. On the other hand, if there are only a few kinds of repeating patterns, a long sliding window is unnecessary. The symbol bit width b depends on the application: in test stimulus compression, b equals the number of scan chains, and in test response compaction, b equals the number of cores. A matched codeword consists of a 1-bit flag plus the encodings of the position p and the length l, so its size grows only logarithmically as LA_BF and SW increase.
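To make the logarithmic growth concrete, the sketch below tabulates the matched-codeword size over the LA_BF and SW values later swept in Table 3.5. It assumes a ⌈log2 SW⌉-bit position field and a length field that encodes lengths 2 to LA_BF; `codeword_bits` is a hypothetical helper.

```python
import math


def codeword_bits(sw: int, la_bf: int, min_match: int = 2) -> int:
    """Matched codeword size: 1 flag bit + position bits + length bits."""
    pos_bits = math.ceil(math.log2(sw))
    n_len = la_bf - min_match + 1              # number of encodable lengths
    len_bits = math.ceil(math.log2(n_len)) if n_len > 1 else 0
    return 1 + pos_bits + len_bits


sw_values = (64, 128, 256, 512, 1024)
for la_bf in (2, 3, 5, 9, 17):
    print(la_bf, [codeword_bits(sw, la_bf) for sw in sw_values])
```

Each doubling of SW adds one bit, and each step of LA_BF from 3 to 5 to 9 to 17 also adds one bit, so the matched codeword only ranges from 7 to 15 bits across this design space.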

Figure 3.4 shows that, as the sequence to be compressed repeats more times, the compression ratio improves and gradually approaches the compression ratio bound. This also shows that the LZSS compression algorithm is better at compressing “symbol patterns” than at compressing “single symbols” (as Huffman coding or arithmetic coding does).

3.3 LZSS-Based Test Stimulus Compression

3.3.1 Don’t-care Bits Filling

After test patterns are generated by ATPG, they contain some don’t-care bits (X’s), which must be filled before the patterns are applied to the DUT. Filling them properly can improve the compression ratio. We use a naïve approach called adjacency filling, which fills each don’t-care bit according to its neighboring specified bits, as shown in Figure 3.6. Because we use a lossless compression algorithm, the fault coverage is preserved, whereas other approaches might discard unused patterns, which degrades the fault coverage.
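A minimal sketch of one common adjacency-filling rule is shown below. The exact rule of Figure 3.6 is not spelled out in the text, so this version is an assumption: each X copies the nearest preceding specified bit, leading X's copy the first specified bit, and an all-X pattern becomes all 0s. `adjacency_fill` is a hypothetical name.

```python
def adjacency_fill(pattern: str) -> str:
    """Fill 'X' bits with the nearest preceding specified bit (assumed rule)."""
    # Leading X's take the first specified bit; default to '0' if none exists.
    last = next((c for c in pattern if c in "01"), "0")
    filled = []
    for c in pattern:
        if c in "01":
            last = c          # remember the latest specified bit
        filled.append(last)   # X positions repeat it
    return "".join(filled)


print(adjacency_fill("X1XX0X"))
```

Here `"X1XX0X"` becomes `"111100"`. Filling this way produces long runs of identical bits, which tend to yield repeated symbols and hence more LZSS matches.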

Figure 3.5 Proportion of don’t-care bits in 200 test stimulus patterns of the s38417 benchmark circuit

Figure 3.6 Naïve don’t-care bits assignment using adjacency filling.

3.3.2 Simulation Settings and Results

We use one of the ISCAS’89 benchmark circuits, s38417, as our design under test, and we generate test patterns for it. The ATPG results are shown in Table 3.3, including the number of test patterns and the fault coverage (FC).

Figure 3.7 shows that the compression ratio improves as the proportion of don’t-care bits increases. When the proportion of don’t-care bits reaches 95% or above, the compression ratio falls below 0.5; when the proportion is lower, compression is poor.

The compression ratio also worsens as the lookahead buffer size grows. This is because increasing the lookahead buffer size increases not only the maximum number of consecutive matches but also the codeword size. Therefore, the most suitable lookahead buffer size in this case is 5.

Table 3.3 Simulation setting of test stimulus compression.

Figure 3.7 Experiment of test stimulus compression: fill the test stimulus don’t-care bits using adjacency filling.

3.4 LZSS-Based Test Response Compaction for Identical Multiple Cores

3.4.1 Validation of Suitability

We assume that the identical multicore system has N cores, each with a different probability of error. The error probabilities of the cores, p_i, i = 1,…,N, are modeled as i.i.d. exponential random variables with parameter λ, as shown in Figure 3.8. As an example, we assume λ = 0.025 in this experiment; the histogram of the error probabilities is shown in Figure 3.9. To model faulty test responses, we randomly flip each response bit of core i with probability p_i, as shown in Figure 3.10.
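The error model above can be sketched as follows. The text does not state whether λ is the rate or the mean of the exponential distribution; this sketch assumes λ is the mean (so typical p_i are around λ) and clips p_i to 1. `inject_errors` is a hypothetical helper.

```python
import random
from typing import List, Sequence


def inject_errors(golden: Sequence[int], n_cores: int, lam: float,
                  rng: random.Random) -> List[List[int]]:
    """Return one faulty copy of the golden response per core.

    Assumption: p_i ~ Exponential(mean=lam), clipped to at most 1.0.
    """
    responses = []
    for _ in range(n_cores):
        # random.expovariate takes the rate, i.e. 1 / mean.
        p_i = min(1.0, rng.expovariate(1.0 / lam))
        # Flip every bit independently with probability p_i.
        responses.append([bit ^ (rng.random() < p_i) for bit in golden])
    return responses


rng = random.Random(0)
golden = [rng.randint(0, 1) for _ in range(1024)]
faulty = inject_errors(golden, n_cores=64, lam=0.025, rng=rng)
```

Seeding a private `random.Random` instance keeps each trial reproducible, which is convenient when repeating the experiment over many test patterns.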

Figure 3.9 The histogram of core error probability in this experiment.

Figure 3.10 The formation of an LZSS symbol for test response compaction.

Figure 3.11 The histogram of compression ratio result of 100 test patterns, each with length of 1024 bits.

Each LZSS symbol is formed by grouping the test bits of all the cores in the same scan chain at a given time slot, as shown in Figure 3.10. Therefore, the bit width of each LZSS symbol equals the number of cores in the system, which is 64 in this experiment. We then feed these symbols into the LZSS encoding algorithm. We randomly generate 1024-bit test responses and flip every bit of each core with the error probability specified above. We repeat the experiment 100 times. The settings of the LZSS encoding experiment are shown in Table 3.4.
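The symbol formation above amounts to transposing the per-core responses of one scan chain: at each time slot, the N response bits become one N-bit symbol. The sketch below assumes bit i of a symbol holds core i's bit; the ordering and the name `form_symbols` are illustrative assumptions.

```python
from typing import List, Sequence


def form_symbols(core_responses: Sequence[Sequence[int]]) -> List[int]:
    """Pack the bits of all cores at each time slot into one N-bit symbol."""
    n_cores = len(core_responses)
    length = len(core_responses[0])
    symbols = []
    for t in range(length):
        sym = 0
        for i in range(n_cores):
            sym |= (core_responses[i][t] & 1) << i  # core i -> bit i
        symbols.append(sym)
    return symbols


print(form_symbols([[1, 0, 1], [1, 0, 0]]))
```

This prints `[3, 0, 1]`. When all cores are fault-free at a time slot, the symbol is either all 0s or all 1s, so identical cores naturally produce many repeated symbols for the LZSS dictionary to match.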

The result of the compression of test patterns is shown in Figure 3.11. We can see

Table 3.4 The settings of Experiment 1.

Testing configuration
  Test pattern length: 1024 bits
  Number of test patterns: 100

Multicore parameters
  Number of cores (N): 64
  Core error probability parameter (λ): 0.002

LZSS design parameters
  Lookahead buffer size: 8 symbols
  Sliding window size: 192 symbols

Figure 3.12 The compression ratio versus the error rate parameter λ and core count N.

The compression ratio depends on the core error rate parameter λ and the core count N in a subtle way, as shown in Figure 3.12. As the core error parameter decreases from 0.01 to 0.0005, the compression ratio decreases (improves). As the core count N increases from 12 to 64, CR generally first decreases dramatically and then rises gradually. This phenomenon arises from the saturation of the redundancy of LZSS symbols. Recall that the core count N is the symbol bit width in the LZSS compression algorithm. When N is very small (12 in this case), there are too few cores to produce repeated patterns from cores with similar errors. As the core count grows, there are more identical symbols, i.e., identical error outputs. Eventually this effect saturates, and the average codeword size increases with the symbol bit width N, which degrades CR.

3.4.2 Determination of Design Parameters

We need to determine the design parameters of the LZSS encoding engine so as to minimize the compression ratio CR under reasonable hardware constraints. A large sliding window and lookahead buffer can help match more symbols and therefore improve CR. But the hardware cost as well as the

Table 3.5 The settings of Experiment 2.

Testing configuration
  Test pattern length: 1024 bits
  Number of test patterns: 100

Multicore parameters
  Number of cores (N): 16
  Core error probability parameter (λ): 0.025

LZSS design parameters
  Lookahead buffer size: 2, 3, 5, 9, 17 symbols
  Sliding window size: 64, 128, 256, 512, 1024 symbols

Figure 3.13 The compression ratio result as LA_BF and SW varies.

The simulation result shows that the compression ratio saturates at (LA_BF, SW) = (5, 1024) in the LZSS design space, as shown in Figure 3.13. However, a larger lookahead buffer and a larger sliding window both require more hardware in the compactor. Therefore, we choose the point beyond which CR shows diminishing returns, namely (LA_BF, SW) = (5, 256). Figure 3.13 also shows that, with this lookahead buffer size and sliding window size, the compression ratio can fall below 0.5 when the core error rate parameter λ is below 0.025.

3.5 Summary

For test stimulus compression, the decompressor achieves a compression ratio of about 50% when 95% of the bits in the test patterns are don’t-cares.

The test response compactor can achieve a compression ratio of less than 0.5 with 16 cores and a core error rate parameter of λ = 0.025. As for the design parameters, the lookahead buffer should be long because there may be long consecutive matches. Increasing LA_BF can improve the compression ratio, at the small cost of codeword size.

Chapter 4 Hardware Architectures for Test Stimulus Decompressors and Test Response Compactors

4.1 Hardware Architecture for Test Stimulus Decompressor

4.1.1 LZSS Decoder Architecture

The LZSS decoder is simple and straightforward to implement. Each decoding cycle consists of three stages: codeword decomposition, symbol read-out, and symbol shift-in.

In the first stage (translation), the read-out circuit translates the codeword into the match flag, symbol, index, and match length. The remaining stages, read-out and shift-in, operate on a constant cycle-by-cycle
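The first-stage translation can be modeled in software as a parser over a serial codeword stream. The bit layout used here is an assumption for illustration (MSB first; unmatched = `0` followed by a b-bit symbol; matched = `1` followed by a position field and a length code, with length = code + 2), and `decompose_stream` is a hypothetical name; the actual field widths of the hardware in Figure 4.1 are not given in the text.

```python
from typing import Iterator, Tuple


def decompose_stream(bits: str, b: int, pos_bits: int, len_bits: int,
                     min_match: int = 2) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (match_flag, symbol, index, match_length) for each codeword.

    Assumes a well-formed stream; unused fields are reported as 0
    (and the length of an unmatched codeword as 1).
    """
    i = 0
    while i < len(bits):
        flag = int(bits[i])
        i += 1
        if flag == 0:
            symbol = int(bits[i:i + b], 2)       # b-bit literal symbol
            i += b
            yield (0, symbol, 0, 1)
        else:
            index = int(bits[i:i + pos_bits], 2)  # window position p
            i += pos_bits
            code = int(bits[i:i + len_bits], 2) if len_bits else 0
            i += len_bits
            yield (1, 0, index, code + min_match)  # match length l


stream = "0" + "01000001" + "1" + "00000011" + "01"
print(list(decompose_stream(stream, b=8, pos_bits=8, len_bits=2)))
```

With SW = 256 and LA_BF = 5, this stream decodes to an unmatched symbol 65 followed by a match at position 3 of length 3. In hardware, the same split is typically done with fixed-width field slicing rather than a sequential loop.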

Figure 4.1 Hardware architecture for LZSS decoder.

Figure 4.2 The finite state machine (FSM) of the decoder.
