Chapter 2 Aging Effect and Delay Fault Testing
2.2 Delay Fault Testing
2.3.2 Fault injection and detection
A path delay fault occurs when the target transition arrives later than the specified clock period, so that the target register may capture the wrong value. As shown in Figure 2-10, the fault behavior can be observed in both gate-level and RT-level simulation.
Figure 2-10 Fault behaviors in gate-level and RT-level simulation
always @($target path output triggered) begin
    if (path successfully activated) begin
        $deposit ($target register, ~$target register )
    end
end
Figure 2-11 Pseudo code of fault injection testbench
Figure 2-11 shows the pseudo code of the fault injection testbench. When the target path is successfully activated, the value of the target register is bit-flipped, which mimics the behavior of a path delay fault.
For fault detection, observability is obtained by comparing the results in the data memory with the golden ones after program execution. A fault is classified as a detected fault (DT) if its effect propagates to the data memory and the computational results differ from the golden ones. Otherwise, it is an undetected fault (UD): the fault effect is masked during the simulation and the results in the data memory are the same as the golden ones. However, because of this limited observability, checking only the execution results in the data memory might reduce the fault coverage significantly. Some faults are classified as undetected because they have no influence on the computational results in the data memory, even though they actually exist in the circuit. The reason why these faults cannot be observed is that there are no
memory directly. As a result, creating instructions that store the computational results which cannot be observed directly in the data memory is left as future work. With these observation instructions, the fault coverage could be increased greatly.
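The DT/UD classification described above can be sketched in a few lines of Python. This is only an illustration under the assumption that each simulation run leaves a dump of the data memory as an address-to-word mapping; the function name `classify_fault` and the sample addresses and values are hypothetical and not part of the original flow.

```python
def classify_fault(golden_memory, faulty_memory):
    """Return "DT" if any data-memory word differs from the golden run, else "UD".

    Both arguments are dicts mapping an address to the word stored there
    (an assumed dump format, chosen only for this sketch).
    """
    addresses = set(golden_memory) | set(faulty_memory)
    for address in addresses:
        if golden_memory.get(address) != faulty_memory.get(address):
            return "DT"  # the fault effect reached the data memory
    return "UD"  # the fault effect was masked or never stored


# Made-up memory contents for illustration.
golden = {0x1000: 0x0000002A, 0x1004: 0xFFFFFFFF}
masked = dict(golden)                                  # identical to golden
propagated = {0x1000: 0x0000002B, 0x1004: 0xFFFFFFFF}  # one word differs

print(classify_fault(golden, masked))
print(classify_fault(golden, propagated))
```

A fault that corrupts only a register which is never stored would still come out as UD here, which is exactly the observability limitation discussed above.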
Chapter 3 Proposed Methodology for Software-Based Self-Test on Delay Defects
3.1 Proposed Methodology
The objective of the proposed methodology is the early identification of aging defects in processors by a software-based self-test approach. The methodology consists of three main steps: pre-processing, test generation and fault simulation. Each step is explained in the following sections.
3.2 Pre-Processing
The main goal of pre-processing is constraint extraction. Some processor states cannot be reached through any instruction sequence. If the constraints of the processor are identified in advance, we can avoid the unreachable states and generate only patterns that can be converted into instructions. Fully automatic constraint extraction is not yet mature: although some automatic methodologies have been proposed recently, their accuracy is still insufficient. As a result, manual constraint extraction is still adopted, especially in industry, and in this thesis we also extract the constraints manually. However, as the complexity of processors keeps increasing, the importance of automating constraint extraction cannot be overemphasized.
The development of a test program for a complex processor usually follows the divide-and-conquer approach. That is to say, the processor is divided into several sub-modules, and a test program specialized for each of them is developed. The target module that we would like to test is the arithmetic logic unit (ALU). Therefore, identifying the inputs and outputs of the ALU is the step that precedes constraint extraction.
Figure 3-1 Arithmetic logic unit (ALU)
Figure 3-1 shows the ALU module that we are going to test. The ALUOp wire is a 5-bit wire that controls the ALU operation. Table 3-1 shows the relation between the ALU operation signal and the instructions. The ReadData1 and ReadData2 wires are 32-bit wires that carry the operands of the ALU. The Shamt wire is a 5-bit wire that is used for shift operations. The result wire is a 32-bit wire that carries the ALU computation result. EXC_Ov is the overflow flag and ALUStall is the stall signal for long ALU operations such as divide.
The constraints are derived by reading the RTL code and the comments of the processor under test. Reading the source code reveals the function of each input and how it works. We then run some random programs and dump the input signals cycle by cycle. The constraints are confirmed by analyzing the simulation results.
Table 3-1 Mapping of ALU operation signal and instruction
Table 3-2 Constraints of ALU input
The reset, EX_Stall and EX_Flush wires must be zero during processor execution. There is no constraint on the 32-bit operands. The Shamt wire carries a value only while executing the SLL, SRA and SRL instructions; otherwise it is zero. The Operation wire has three constrained values: 5, 6 and 21. According to Table 3-1, 5 and 6 map to the instructions DIV and DIVU. Since DIV and DIVU instructions take 32 cycles to execute, we cannot control them with two successive patterns. Therefore, other methods should be adopted for testing the divider module. The value 21 is reserved for a MIPS32 release 2 instruction.
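The constraints listed above can be expressed as a simple legality check on one ALU input assignment. The following Python sketch assumes the input values are already available as a dictionary keyed by wire name. The Operation codes of SLL, SRA and SRL come from Table 3-1 and are therefore passed in as a parameter rather than hard-coded; the codes used in the usage lines are placeholders, not the real encoding.

```python
# Operation values excluded by the constraints: DIV (5), DIVU (6), reserved (21).
BLOCKED_OPERATIONS = {5, 6, 21}


def is_legal_pattern(fields, shift_operations):
    """Check one ALU input assignment against the constraints described above.

    fields: dict with keys "reset", "EX_Stall", "EX_Flush", "Operation", "Shamt".
    shift_operations: set of Operation codes for SLL, SRA and SRL (from Table 3-1).
    """
    # The three control wires must stay at zero during execution.
    if fields["reset"] or fields["EX_Stall"] or fields["EX_Flush"]:
        return False
    # DIV, DIVU and the reserved value cannot be applied as a two-cycle pattern.
    if fields["Operation"] in BLOCKED_OPERATIONS:
        return False
    # Shamt may be non-zero only for the shift instructions.
    if fields["Operation"] not in shift_operations and fields["Shamt"] != 0:
        return False
    return True


# Placeholder shift codes for illustration only; the real ones are in Table 3-1.
example_shift_codes = {24, 25, 26}
legal = {"reset": 0, "EX_Stall": 0, "EX_Flush": 0, "Operation": 24, "Shamt": 3}
illegal = {"reset": 0, "EX_Stall": 0, "EX_Flush": 0, "Operation": 5, "Shamt": 0}
print(is_legal_pattern(legal, example_shift_codes))
print(is_legal_pattern(illegal, example_shift_codes))
```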
3.3 Test Generation
Figure 3-2 Flowchart of test generation
Figure 3-2 shows the proposed methodology flow for test generation. The test
difference is the process of critical path extraction by static timing analysis. The reason why path delay fault testing needs static timing analysis, and the details of each step in the methodology flow, are explained in the following sections.
3.3.1 Static timing analysis
Static timing analysis (STA) is a fast and reasonable method for computing the circuit timing without simulating the entire circuit with input patterns. Unlike the number of transition delay faults, which is linear in the circuit size, the number of path delay faults is exponential in the circuit size, so it is impractical to test every path delay fault in the circuit. We therefore use static timing analysis to identify the critical paths. A critical path is a path of serially connected combinational gates with maximum delay, which has a higher probability of causing a timing violation. Extracting the critical paths greatly reduces the size of the fault list. We use the slack of the setup time check to find the critical paths in the circuit. If the required signal arrives too late, it may cause a setup time violation. The transitions of the input signals, the operating environment and manufacturing variations all contribute to the signal arrival time, as does the aging defect. The slack is defined as the difference between the required time and the arrival time. A positive slack implies that the arrival time is earlier than the required time; that is, a path with positive slack does not limit the overall delay of the circuit. Conversely, a negative slack implies that the path is too slow and must be sped up if the whole circuit is to work at the desired speed.
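The slack definition and the selection of the lowest-slack paths can be illustrated with a short Python sketch. The path names and timing numbers below are made up for the example; in the actual flow these values would come from the timing report of the STA tool.

```python
def slack(required_time, arrival_time):
    """Setup slack: positive means the signal arrives before it is required."""
    return required_time - arrival_time


def select_critical_paths(paths, count):
    """Return the names of the `count` paths with the smallest slack.

    paths: list of (name, required_time, arrival_time) tuples.
    """
    ranked = sorted(paths, key=lambda path: slack(path[1], path[2]))
    return [name for name, _, _ in ranked[:count]]


# Hypothetical timing data (arbitrary time units).
example_paths = [
    ("p0", 10.0, 7.5),    # slack  2.5
    ("p1", 10.0, 9.75),   # slack  0.25
    ("p2", 10.0, 10.5),   # slack -0.5, a setup violation
]
print(select_critical_paths(example_paths, 2))
```

Sorting by ascending slack puts violating paths first, followed by the paths closest to violating, which matches the idea of keeping only the most vulnerable paths in the fault list.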
3.3.2 Automatic test pattern generation (ATPG)
As stated above, there are two major methods for test program generation: ATPG-aided test generation and simulation-based test generation. ATPG-aided test generation is a deterministic, stable and efficient method with acceptable fault coverage. It takes the viewpoint of circuit analysis and generates high-quality test patterns. Nevertheless, its difficulties are the constraint extraction and the mapping between test patterns and instructions.
Figure 3-3 Critical path example reported by Synopsys PrimeTime
When setting the desired ATPG options, we apply the constraints that were extracted previously. With these constraints, we can ensure that the test patterns can be converted into instructions successfully. There are two types of constraint option in TetraMAX: add_atpg_constraints and add_atpg_primitives. The add_atpg_constraints command defines constraints on nets that must be satisfied during pattern generation. In this command, we specify a name to identify the constraint, the constraint value (0, 1, Z), and the place in the design where the constraint applies. The add_atpg_primitives command creates a primitive that is added to the design and has its inputs connected to the specified nets. Constraining the output of the added primitive forces the pattern generation algorithm to conform to the specified logical conditions at the connection points. In this command, we specify a name for the added primitive, its logical function, and its input connections.
3.3.3 Pattern-to-instruction converter
After the test patterns are derived, the next step is to generate the test program. The test program consists of three main sections: operand preparation, two-vector and result store. In the rest of this section, an example of converting patterns into instructions is shown and illustrated in detail. Figure 3-4 is a pattern generated by Synopsys TetraMAX. First, we determine the value of each ALU input according to its position in the pattern. This gives the values that should be applied to the ALU inputs. In this step, we also check whether the pattern is legal, that is, whether it meets the constraints we set. Figure 3-5 is the result of this test pattern classification.
We can observe that the pattern meets the constraints and can be converted into instructions. Table 3-3 is the mapping table that records the relation between ALU inputs and instructions. The mapping table is derived manually by reading the RTL description of the processor and running some simulations. However, as SoC designs become more and more complex, automatic generation of the mapping table between test patterns and instructions should be developed in the future. According to the mapping table, we convert the patterns into two-vector instructions. Although the conversion between patterns and instructions seems to be finished at this point, there are still some subtle details about operand preparation and register usage, which are discussed below.
{ pattern 1 fast_sequential } { vector }
vector("_default_WFT_") := [ 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1 1 0 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 0 1 0
1 0 1 1 0 0 0 0 0 0 0 ];
{ capture }
vector("_default_WFT_") := [ 0 0 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 1 1 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1
1 0 0 0 0 0 0 0 0 ];
Figure 3-4 A test pattern generated by Synopsys TetraMAX
Figure 3-5 Classification of pattern information
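The classification step, which slices the flat bit vector of a pattern into the values of the individual ALU inputs, can be sketched in Python as follows. The field order and the short bit string used here are purely hypothetical; the real positions depend on the pin order of the generated patterns and are not reproduced in this sketch.

```python
def split_pattern(bits, layout):
    """Slice a flat list of 0/1 values into named fields.

    bits: list of ints (0 or 1), in the order they appear in the pattern.
    layout: list of (field_name, width) pairs; each field is read MSB first.
    """
    fields = {}
    position = 0
    for name, width in layout:
        chunk = bits[position:position + width]
        if len(chunk) != width:
            raise ValueError(f"pattern too short for field {name}")
        fields[name] = int("".join(str(bit) for bit in chunk), 2)
        position += width
    return fields


# Hypothetical layout and bits, for illustration only.
EXAMPLE_LAYOUT = [("Operation", 5), ("Shamt", 5), ("reset", 1)]
example_bits = [0, 1, 0, 0, 1,  0, 0, 0, 1, 1,  0]
print(split_pattern(example_bits, EXAMPLE_LAYOUT))
```

The resulting dictionary can then be checked against the ALU input constraints before the pattern is converted into instructions.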
Figure 3-6 Example of test program generated by our methodology
Table 3-3 Mapping table for patterns to instructions convertor
Since the test pattern must be executed during two successive cycles, the operands have to be prepared in advance. As shown in Figure 3-6, the operands are prepared before the two-vector part. Each operand needs a combination of two instructions, LUI and ADDI, to reach the target value. Two instructions are needed because one instruction can load at most a 16-bit immediate value, whereas the operands are 32-bit values. LUI loads a 16-bit value into the upper half of the target register, and ADDI adds a 16-bit value to the target register. One detail requires care: LUI loads the upper 16-bit immediate value and sets the lower 16 bits to zero. Therefore, the LUI instruction must be placed before the ADDI instruction, or the wrong value would be loaded into the register. In addition, to ensure that the two vector instructions are executed in two successive cycles, we choose different registers for storing their operands. That is to say, we avoid reusing the registers that are used in the first vector. Although the processor contains a full data forwarding unit, the two instructions might not be executed in two successive cycles if the data dependency is not avoided. As shown in Figure 3-7, the registers used in the second vector are not used in the first vector.
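The split of a 32-bit operand into the two 16-bit immediates can be sketched in Python. One point the sketch makes explicit is that the MIPS ADDI instruction sign-extends its 16-bit immediate, so when bit 15 of the lower half is set, the upper half passed to LUI has to be increased by one to compensate. The helper names below are hypothetical, and the sketch models only the 32-bit wrap-around arithmetic, not the processor itself.

```python
def sign_extend_16(value):
    """Interpret a 16-bit value as a signed integer, as ADDI does."""
    return value - 0x10000 if value & 0x8000 else value


def split_for_lui_addi(target):
    """Return (upper, lower) immediates so that LUI upper followed by
    ADDI lower produces `target` modulo 2**32."""
    target &= 0xFFFFFFFF
    lower = target & 0xFFFF
    # Compensate for the sign extension of the lower immediate.
    upper = ((target - sign_extend_16(lower)) >> 16) & 0xFFFF
    return upper, lower


def emulate_lui_addi(upper, lower):
    """Model the register value after LUI then ADDI, ignoring overflow traps."""
    return ((upper << 16) + sign_extend_16(lower)) & 0xFFFFFFFF


for operand in (0x00001234, 0x12348765, 0xFFFFFFFF):
    upper, lower = split_for_lui_addi(operand)
    print(hex(operand), hex(upper), hex(lower))
```

Because ADDI also raises an exception on signed overflow, a few corner values (for example 0x7FFF8000) would trap with this sequence; using ADDIU, or ORI with an unadjusted upper half, is a common way to avoid that, at the cost of deviating from the LUI and ADDI pair described above.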
sllv $t2, $t1, $t0    # vector 1
srlv $t5, $t4, $t3    # vector 2
Figure 3-7 Example of two vectors from the test program
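The register-usage rule illustrated by Figure 3-7 can be checked mechanically. The Python sketch below assumes the instructions are available as assembly text with registers written as `$name`; the helper names are hypothetical.

```python
def registers_of(instruction):
    """Return the set of register names in an instruction string
    such as 'sllv $t2, $t1, $t0'."""
    return {token.strip(",") for token in instruction.split() if token.startswith("$")}


def vectors_are_independent(first, second):
    """True if the two vector instructions share no register."""
    return registers_of(first).isdisjoint(registers_of(second))


vector_1 = "sllv $t2, $t1, $t0"
vector_2 = "srlv $t5, $t4, $t3"
print(vectors_are_independent(vector_1, vector_2))
```

This check is stricter than a pure read-after-write analysis, since it also rejects pairs that merely read the same register, but it matches the rule stated above that registers used in the first vector are not reused in the second.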
Figure 3-8 Instructions distribution of the test program
Figure 3-8 is a pie chart that shows the instruction distribution of the test program. According to the pie chart, 66% of the instructions in the test program are used for operand preparation. The number of instructions for operand preparation is four times the number of instructions for the two-vector part or for the result store. In my opinion, if an instruction that loads a 32-bit value into a register at once were available, the size of the test program could be reduced greatly.
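These proportions are consistent with a simple per-pattern count. As a rough check, assume that every pattern needs two 32-bit operands for each of its two vectors, that each operand costs two instructions (LUI and ADDI), and that each vector has one result store. These per-pattern counts are assumptions made for this check rather than figures taken from the test program; shift-by-immediate vectors, for instance, would need fewer operands.

```latex
\[
N_{\text{prep}} = 2 \times 2 \times 2 = 8, \qquad
N_{\text{vector}} = 2, \qquad
N_{\text{store}} = 2
\]
\[
\frac{N_{\text{prep}}}{N_{\text{prep}} + N_{\text{vector}} + N_{\text{store}}}
= \frac{8}{12} \approx 66.7\%, \qquad
\frac{N_{\text{prep}}}{N_{\text{vector}}} = \frac{N_{\text{prep}}}{N_{\text{store}}} = 4
\]
```

Under the same assumptions, a single-instruction 32-bit load would halve the preparation cost to four instructions per pattern, shrinking the per-pattern total from twelve instructions to eight.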
3.4 Fault Simulation
After the test program is generated, the next step is fault simulation, which confirms the effectiveness of the generated test program. The fault simulation adopts the simulation-based fault injection method. Figure 3-9 is the flowchart of fault simulation.
Figure 3-9 Flowchart of fault simulation
Chapter 4 Experiment Result
4.1 Experiment Setup
The target processor in our experiments is a MIPS32 processor. Figure 4-1 is the MIPS32 processor architecture from [31]. This processor is an open-source design that can be downloaded from GitHub. The design was created by Grant Ayers and funded by the eXtensible Utah Multicore (XUM) project at the University of Utah from 2010 to 2012.
It is a standalone MIPS32 processor in which all required MIPS32 instructions are implemented, including hardware multiplication and division. It is a bare-metal processor without a memory management unit (MMU) or a floating point unit (FPU). The hardware divider is small and multi-cycle, and it runs asynchronously from the pipeline, which allows some masking of latency.
Figure 4-1 MIPS32 processor architecture
Figure 4-2 Single-issue in-order 5-stage pipeline
This MIPS32 processor architecture is a single-issue, in-order, 5-stage pipeline consisting of the Instruction fetch, Instruction decode, Execute, Memory and Write back stages.
Figure 4-2 illustrates the pipeline stages of the architecture.
In addition, the memory interface is separated from the processor. The original memory design uses a four-way handshake to exchange data. Figure 4-3 explains the mechanism of the four-way handshake. This interface is simple and robust, but it limits the performance of the system: the minimum CPI increases from 1 to between 3 and 4, which is not practical in today's SoC designs. Moreover, this handshake mechanism causes the pipeline stages to stall, and the stalled pipeline stages lead to untestable faults. In order to prevent over-testing of the functionally untestable faults, we modify the design to bring the CPI close to 1. However, accessing the data memory still needs two cycles to exchange the data.
The experiments run on an Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30 GHz with 32 GB RAM.
Figure 4-3 Four-way handshake mechanism
The EDA tools used in the methodology flow are as follows:
• Synopsys Design Compiler: synthesizes the RTL Verilog design to obtain the gate-level circuit. There are 1,885 flip-flops in total in the synthesized design.
• Synopsys PrimeTime: static timing analysis tool for identifying the vulnerable paths with lower slack. The critical path list is used not only for generating the fault list but also for test pattern generation by TetraMAX.
• Synopsys TetraMAX: ATPG tool for generating the test patterns, which are then converted into the assembly test program.
• Cadence NC-Verilog
4.2 Result Statistics
In this section, we present the experimental results and evaluate the quality of the test program generated by our methodology.
4.2.1 Transition delay fault testing
Table 4-1 and Table 4-2 show the results of transition delay fault testing by TetraMAX and by software-based self-test, respectively. Figure 4-4 gives the equations used for the coverage calculation. In Table 4-1, the ATPG-untestable faults account for around 40% of the total faults. There are two main reasons for the ATPG-untestable faults. The first is the lack of design-for-testability (DFT): without DFT insertion, the observability of the circuit decreases, which leads to poor fault coverage. The second is the addition of ATPG constraints: since we need to ensure that the test patterns can be converted into instructions, unreachable states and illegal conditions have to be excluded in advance.
Comparing Table 4-1 and Table 4-2, the fault coverage of transition delay fault testing by software-based self-test is slightly higher than that of testing by TetraMAX. This result implies that the test patterns are completely converted into instructions. In addition, some faults are detected by software-based self-test incidentally.
Table 4-1 Transition delay fault testing by TetraMAX
\[ \text{Fault coverage} = \frac{\text{Detected faults}}{\text{Total faults}} \times 100\% \]
\[ \text{Test coverage} = \frac{\text{Detected faults}}{\text{Total faults} - \text{ATPG untestable faults}} \times 100\% \]
Figure 4-4 Equations of coverage calculation
4.2.2 Path delay fault testing
Besides the transition delay fault, which represents the large delay defect, the path delay fault, which represents the small delay defect, is also an important topic in delay fault testing.
Table 4-2 shows the experimental results under the different monitoring conditions. As mentioned before, the number of path delay faults is exponential in the circuit size, so it is impractical to test every path delay fault in the circuit. In practice, we test only a subset of the path delay faults in the circuit. Since the critical paths with lower slack have a higher probability of exhibiting a path delay fault, we choose the top thousand critical paths for path delay fault testing in our experiment. The second row is the fault coverage of the constrained ATPG and the non-constrained ATPG. The fault coverage of the constrained ATPG is around 40% lower than that of the non-constrained one.
That is to say, whether or not the constraints are applied has a great influence on the fault coverage. The third row is the number of activated paths for the three different methods. As mentioned in subsection 3.4.1, non-robust monitoring uses a looser condition and can activate more paths than the other methods. In contrast, robust monitoring uses a stricter condition and activates fewer paths than the other methods.
Table 4-2 Path delay fault testing by software-based self-test
Figure 4-5 Venn diagram of fault detection
Figure 4-5 is the Venn diagram of the fault detection. The faults detected by the non-robust monitoring include those detected by the robust, robust* and constrained ATPG methods. Likewise, the faults detected by the robust* monitoring include those detected by the robust and constrained ATPG methods. To sum up, the faults that can be detected under a strict monitoring condition can also be detected under a looser one.
4.2.3 Random program evaluation
The major objective of the random program evaluation is to evaluate the quality of the test program generated by our methodology. The process consists of generating random test programs, running fault simulation with our simulator and finally comparing the results. The random program is not totally random. Table 4-3 illustrates the format of the random
instructions in the second step. Second, we randomly choose the instructions according to the MIPS32 instruction set reference. In this step, we generate three different programs with one, two and three successive vectors, respectively. The