
Chapter 5 Experimental Results

5.1 Experimental Environment

We conduct all experiments on an HP xw8400 workstation. The commercial simulators are the Cadence NC-Verilog simulator and the Cadence NC-SC simulator.

Table 5.1 lists the details of the experimental hardware and software platform.

Table 5.1 Experimental environment

Hardware
  CPU: Intel® Xeon® CPU 2.0GHz
  RAM: DDR2-667 ECC FB-DIMM 14GB
Software
  OS: CentOS 5 x86_64 (with Linux 2.6 kernel)
  Cadence NC-Verilog version 6.1
  Cadence NC-SC version 6.1

5.2 Case Study

We apply the proposed verification strategy to the in-house ACARM9 core, which is fully compatible with ARM ISA version 5E [9]. To check whether the output of the RTL is correct, an efficient two-layered cycle-accurate model [11] is used as the golden model.

5.2.1 Simulation Environment

We co-simulate the ACARM9 RTL code and the golden model, and use the constrained-random code generator as the instruction memory. When the generator receives the instruction enable signal from the RTL, it sends instructions in machine code to both the RTL and the model. A comparator collects all outputs of both the RTL and the model, such as external signals and registers, and compares them every cycle during simulation. If the outputs of the RTL and the model mismatch, the comparator dumps detailed information: the signal name or register number, the values, and the cycle number. Figure 5.1 shows the environment.
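The per-cycle comparison described above can be sketched as follows. This is only an illustrative Python sketch, not the comparator used in the experiments: the names `Mismatch` and `compare_cycle`, and the representation of the outputs as dictionaries keyed by signal or register name, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Mismatch:
    """One reported difference (hypothetical record layout)."""
    cycle: int
    name: str          # signal name or register number, e.g. "r3"
    rtl_value: int
    model_value: int


def compare_cycle(cycle: int, rtl: Dict[str, int], model: Dict[str, int]) -> List[Mismatch]:
    """Compare all observed outputs of the RTL and the model for one cycle.

    Assumes both dictionaries expose the same set of names.
    """
    mismatches = []
    for name in sorted(rtl):
        if rtl[name] != model[name]:
            mismatches.append(Mismatch(cycle, name, rtl[name], model[name]))
    return mismatches


if __name__ == "__main__":
    # Example: register r1 differs in cycle 7, so one mismatch is dumped.
    for m in compare_cycle(7, {"r0": 1, "r1": 2}, {"r0": 1, "r1": 3}):
        print(f"cycle {m.cycle}: {m.name} RTL={m.rtl_value:#x} model={m.model_value:#x}")
```

Reporting the name, both values, and the cycle number mirrors the information the comparator is said to dump on a mismatch.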

Figure 5.1 Simulation environment

5.2.2 Script Files and Bugs Found

To test different cases on the ACARM9 core, we write different base patterns with corresponding constraint settings in script files. The ACARM9 core had already passed deterministic verification and run real applications such as Dhrystone, Whetstone, and a JPEG2000 encoder, and it had successfully executed an MP3 program on an FPGA. Nevertheless, using the proposed method we still find two more bugs in the ACARM9 RTL code and two bugs in the model. The four bugs and the script files used to find them are described below.

(1) Bug of Shift Operation in RTL

In the ARM ISA, operand shift operations can be executed in parallel with other operations. The script file of shift operation testing (1), which focuses on this point, is shown in Figure 5.2. The first instruction in the base pattern loads random values from memory into registers. Using this script file, we find one bug in the RTL code and one bug in the model.

.const

Figure 5.2 Script file of shift operation testing (1)

The RTL code produces a wrong result for the logical shift right (lsr) operation with a register-specified shift amount when the shift amount is zero. In the ARM ISA, the shift amount of a shift operation comes from one of two sources: it is either instruction specified or register specified.

The two sources have different special cases for particular shift amounts. The correct outputs of logical shift right with zero shift amounts for the different shift amount sources are listed in Table 5.2. In this case, the correct output is the input operand, unshifted. The RTL code, however, treats it as the special case of logical shift right with an instruction-specified shift amount, in which “lsr #0” encodes “lsr #32”. That special case produces a zero result with bit 31 of the operand as the carry output.

Table 5.2 Logical shift right with zero shift amounts

           Shift amount source     Special case?   Encode   Output
Standard   Instruction specified   Y               lsr#32   Zero out, C = bit 31
RTL        Instruction specified   Y               lsr#32   Zero out, C = bit 31
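The two zero-amount cases can be contrasted with a small reference sketch. The Python below follows the shifter-operand rules of the ARM Architecture Reference Manual [9] as we understand them; the function names and the (result, carry) return convention are assumptions made for illustration and do not come from the RTL or the model.

```python
MASK32 = 0xFFFFFFFF


def lsr_immediate(operand: int, shift_imm: int, carry_in: int):
    """lsr with an instruction-specified 5-bit amount; returns (result, carry)."""
    operand &= MASK32
    if shift_imm == 0:
        # The encoding "lsr #0" means "lsr #32": zero result, C = bit 31.
        return 0, (operand >> 31) & 1
    return operand >> shift_imm, (operand >> (shift_imm - 1)) & 1


def lsr_register(operand: int, rs_value: int, carry_in: int):
    """lsr with a register-specified amount; returns (result, carry)."""
    operand &= MASK32
    amount = rs_value & 0xFF            # only the least significant byte is used
    if amount == 0:
        return operand, carry_in        # no shift at all, carry flag unchanged
    if amount < 32:
        return operand >> amount, (operand >> (amount - 1)) & 1
    if amount == 32:
        return 0, (operand >> 31) & 1
    return 0, 0                         # amount > 32


if __name__ == "__main__":
    value = 0x80000001
    print(lsr_immediate(value, 0, 0))   # special case: behaves as lsr #32
    print(lsr_register(value, 0, 0))    # operand passes through unchanged
```

The RTL bug described above corresponds to taking the `shift_imm == 0` branch of `lsr_immediate` when the zero amount actually came from a register.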

(2) Bug of Shift Operation in Model

The model also has one bug that is found by the script file of shift operation testing (1). The bug occurs when any of the three kinds of shift operation other than rotate right is executed with a register-specified shift amount that is equal to or greater than 32.

In the ARM ISA, only the least significant byte (bit 0 to bit 7) of the shift amount register determines the shift amount. The model, however, takes only the least significant 5 bits (bit 0 to bit 4) of the shift amount register as the shift amount. Figure 5.3 shows the bit ranges of the shift amount register used as the shift amount in the standard and in the model. The correct results in this case are zero for both logical shift left and logical shift right, and a result filled with bit 31 of the operand for arithmetic shift right. The model instead shifts by the remainder of the register contents divided by 32. Table 5.3 lists the results of the three kinds of shift operation in the standard and in the model.

Figure 5.3 Bit ranges of the shift amount register used as the shift amount.

Table 5.3 Results of shift operations in standard and the model

           Shift type               Shift amount = 32                Shift amount > 32
Standard   Logical shift right      Zero out, C = bit 31             Zero out, C = 0
           Logical shift left       Zero out, C = bit 0              Zero out, C = 0
           Arithmetic shift right   Filled with bit 31, C = bit 31   Filled with bit 31, C = bit 31
Model      Logical shift right      No effect                        Operand shifted with shift amount mod 32, C = corresponding bit
           Logical shift left       No effect                        Operand shifted with shift amount mod 32, C = corresponding bit
           Arithmetic shift right   No effect                        Operand shifted with shift amount mod 32, C = corresponding bit

Note: “shift amount” means the least significant byte of the shift amount register; “C” means the carry-out flag.
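The effect of the bit range on register-specified shifts can be reproduced with a short sketch. The Python below is an illustration under our own assumptions, not the code of the model: `shift_by_register` and its `amount_mask` parameter are hypothetical, with `0xFF` standing for the standard behaviour (least significant byte) and `0x1F` for the 5-bit behaviour attributed to the model.

```python
MASK32 = 0xFFFFFFFF


def shift_by_register(kind: str, operand: int, rs_value: int,
                      carry_in: int, amount_mask: int = 0xFF):
    """Register-specified lsl/lsr/asr; returns (result, carry)."""
    operand &= MASK32
    amount = rs_value & amount_mask
    sign = (operand >> 31) & 1
    if amount == 0:
        return operand, carry_in                      # no effect
    if kind == "lsl":
        if amount < 32:
            return (operand << amount) & MASK32, (operand >> (32 - amount)) & 1
        return 0, ((operand & 1) if amount == 32 else 0)
    if kind == "lsr":
        if amount < 32:
            return operand >> amount, (operand >> (amount - 1)) & 1
        return 0, (sign if amount == 32 else 0)
    if kind == "asr":
        if amount < 32:
            signed = operand - (1 << 32) if sign else operand
            return (signed >> amount) & MASK32, (operand >> (amount - 1)) & 1
        return (MASK32 if sign else 0), sign          # filled with bit 31
    raise ValueError(f"unknown shift kind: {kind}")


if __name__ == "__main__":
    for kind in ("lsl", "lsr", "asr"):
        standard = shift_by_register(kind, 0x80000001, 32, 0)
        five_bit = shift_by_register(kind, 0x80000001, 32, 0, amount_mask=0x1F)
        print(kind, [hex(v) for v in standard], [hex(v) for v in five_bit])
```

With a shift amount of exactly 32 the 5-bit mask yields zero, so the operand passes through unchanged, which is the "No effect" column of Table 5.3.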

(3) Bug in Usage of the Program Counter (PC)

In the ARM core architecture, register 15 (r15) holds the Program Counter (PC), and its use as a destination register or as an operand is restricted. We modify the script file of shift operation testing (1) by adding a new parameter that produces r15 with higher probability and by changing tokens in the base pattern. Figure 5.4 shows the resulting script file of shift operand testing (2). One more bug in the model is found using this script file.

In this script file, r15 is used as an operand. The value read from r15 depends on the source of the shift amount: it is the address of the instruction plus 8 bytes for an instruction-specified shift amount, or plus 12 bytes for a register-specified shift amount. The model reads r15 as the address of the instruction plus 8 bytes in both cases. Table 5.4 lists the value of r15 in the standard and in the model for the different shift amount sources.

.const

Figure 5.4 Script file of shift operand testing (2)

Table 5.4 Value of r15 fetched as operand for different shift amount sources

           Instruction specified   Register specified
Standard   PC + 8                  PC + 12
Model      PC + 8                  PC + 8
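The rule in Table 5.4 amounts to a single offset selection. The following Python sketch states it explicitly; the function names are hypothetical and the second function only mimics the behaviour attributed to the model above.

```python
def r15_operand_standard(instruction_address: int, register_specified_shift: bool) -> int:
    """Value read from r15 as an operand, per the standard column of Table 5.4."""
    offset = 12 if register_specified_shift else 8
    return (instruction_address + offset) & 0xFFFFFFFF


def r15_operand_model(instruction_address: int, register_specified_shift: bool) -> int:
    """Behaviour attributed to the buggy model: always address + 8."""
    return (instruction_address + 8) & 0xFFFFFFFF


if __name__ == "__main__":
    address = 0x1000
    for by_register in (False, True):
        print(by_register,
              hex(r15_operand_standard(address, by_register)),
              hex(r15_operand_model(address, by_register)))
```

The two functions differ only for the register-specified case, which is why the bug surfaces only once the script makes r15 likely to appear as an operand of such instructions.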

(4) Bug in Combination of Two Multiplication Instructions

We find one more bug in the RTL code with the script file shown in Figure 5.5. This script file covers all multiplication instructions in ARM ISA version 5E; moreover, random selection is used to produce random combinations of these multiplication instructions.

Figure 5.5 Script file of the multiplication instructions

The bug occurs when the RTL executes a signed or unsigned multiply long (smull or umull) instruction, or a signed or unsigned multiply accumulate long (smlal or umlal) instruction, immediately after an Enhanced DSP 16-bit signed integer multiply (smul<x><y>) instruction. A multiply long instruction produces a 64-bit result, which is split into a higher 32-bit part and a lower 32-bit part that are output in two consecutive cycles. However, the lower 32-bit part of the result of the signed multiply long instruction is replaced by the output of the previous instruction (smul<x><y>).

As described in Section 4.2, we use a 32x32 multiplier in the RTL code and modify it so that multiplication becomes a multi-cycle operation that meets the timing constraint. A simple block diagram of the multiplier in the RTL is shown in Figure 5.6, and the multiplication FSM is shown in Figure 4.5. The multiplier handles multiply long instructions, which have a 64-bit output, as a three-cycle operation, and the other multiplication instructions, which have only a 32-bit output, as a two-cycle operation. When the multiplier handles a multiplication instruction, the controller saves some signals from the decoder in the status registers and controls the whole process. The status registers are divided into several parts that hold the control signals for individual functions such as signed/unsigned, accumulation, DSP operation, and multiply long. When the calculation finishes, the result is saved in the output register, and the controller clears the status registers and returns to the standby state. For multiply long instructions, the multiplier outputs the lower 32-bit part of the result in the second cycle and the higher part in the third cycle; for the other multiplication instructions, it outputs the result in the second cycle only.

Figure 5.6 Block diagram of multiplier in RTL (inputs: Operand A, Operand B, Operand C; block: Multiplier)

However, the controller does not clear the status registers after finishing an smul<x><y> instruction. If a multiplication instruction other than a multiply long instruction is executed after the smul<x><y> instruction, the status registers that hold the control signals of the smul<x><y> instruction are updated and the multiplier works normally. If a multiply long instruction is executed after the smul<x><y> instruction, not all of these status registers are updated. The output register is then locked (the output enable signal is low) in the second cycle of the multiply long instruction, so the lower 32-bit part of the result is replaced by the result of the previous instruction (smul<x><y>). In the third cycle, the controller sets the output enable signal high and the higher 32-bit part of the result is output correctly. Table 5.5 lists the outputs of the smul<x><y> instruction and the multiply long instruction in the RTL and in the standard when the multiply long instruction is executed immediately after smul<x><y>.

Table 5.5 Output of multiplier in standard and RTL

           Multiply                   Output
           instruction     Cycle 1    Cycle 2    Cycle 3    Cycle 4     Cycle 5
Standard   smul<x><y>      None       Result S
           Multiply long                         None       Result lo   Result hi
RTL        smul<x><y>      None       Result S
           Multiply long                         None       Result S    Result hi

Note: suppose that the output of smul<x><y> is “Result S” and the correct outputs of multiply long are “Result lo” and “Result hi”.
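The output sequences in Table 5.5 can be reproduced with a deliberately simplified behavioural sketch. The Python class below is a toy model written for illustration only; it is not derived from the RTL source. The class `ToyMultiplier`, the flag `clear_after_dsp`, and the single `stale_dsp_status` bit standing in for the status registers are all assumptions.

```python
class ToyMultiplier:
    """Toy model of the output register and one stale status bit."""

    def __init__(self, clear_after_dsp: bool):
        self.clear_after_dsp = clear_after_dsp   # False reproduces the bug
        self.stale_dsp_status = False
        self.output_reg = None

    def execute(self, kind: str, result_lo, result_hi=None):
        """Return the per-cycle outputs of one multiplication instruction."""
        outputs = [None]                          # first cycle: no output
        if kind == "smulxy":
            self.output_reg = result_lo
            outputs.append(self.output_reg)
            # Bug: the status is left set when the controller does not clear it.
            self.stale_dsp_status = not self.clear_after_dsp
        elif kind == "mul_long":
            output_enable = not self.stale_dsp_status
            if output_enable:
                self.output_reg = result_lo       # second cycle: lower part
            outputs.append(self.output_reg)       # locked register keeps old value
            self.output_reg = result_hi           # third cycle: higher part
            outputs.append(self.output_reg)
            self.stale_dsp_status = False
        else:
            # Other multiplications overwrite the status and work normally.
            self.output_reg = result_lo
            outputs.append(self.output_reg)
            self.stale_dsp_status = False
        return outputs


if __name__ == "__main__":
    for clear in (True, False):
        mul = ToyMultiplier(clear_after_dsp=clear)
        print(clear, mul.execute("smulxy", "Result S"),
              mul.execute("mul_long", "Result lo", "Result hi"))
```

With `clear_after_dsp=False` the second cycle of the multiply long instruction repeats "Result S", matching the RTL rows of Table 5.5; with `True` the sequence matches the standard rows.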

5.3 Comparison with Pure Random Verification

We compare the performance of the proposed constrained-random code generator with that of a pure random code generator. The pure random generator is also written in SystemC and outputs legal machine codes randomly without any constraint. Each generator in turn is used as the instruction memory in the co-simulation environment to measure performance. We run a 100,000,000-cycle simulation with each generator and compare the simulation times, which are shown in Table 5.6. The proposed generator has about 35% overhead in simulation time compared with the pure random one.

Table 5.6 Simulation time of two generators

Generator            Simulation time (seconds)
Constrained-random   8,314
Pure random          6,156
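The overhead figure follows directly from the two measurements in Table 5.6. A minimal Python check, with a hypothetical helper name, is:

```python
def relative_overhead(measured_seconds: float, baseline_seconds: float) -> float:
    """Extra simulation time relative to the baseline, as a fraction."""
    return (measured_seconds - baseline_seconds) / baseline_seconds


if __name__ == "__main__":
    constrained_random_s = 8314   # Table 5.6
    pure_random_s = 6156          # Table 5.6
    print(f"{relative_overhead(constrained_random_s, pure_random_s):.1%}")
```

The extra 2,158 seconds over a 6,156-second baseline is what the text rounds to about 35%.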

Chapter 6 Conclusions and Future Work

A constrained-random code generator has been presented in this thesis. The proposed verification strategy provides a general-purpose syntax for the user input script file that helps generate test patterns efficiently. The script can define the generated range of every segment of the assembly code, such as the operation code, condition code, operands, and immediate value. By changing a small part of the constraint settings in the script file, the generator can easily output different patterns to cover different corner cases. The script can also define relations between segments to test corner cases such as data hazards.

Moreover, the program flow of the output pattern can be controlled by the base pattern in the script file. The generator can output the codes in the order given in the script file or schedule them randomly. Finally, the generator is applied to the in-house ARM9 core and finds additional bugs even though the core had already passed many verification strategies. However, the proposed strategy cannot yet cover the verification of external interrupt behaviors. Consequently, an advanced method [12] may be combined with our strategy to test the external interrupt behaviors of a processor.

References

[1] Marcin Kazmierczak, “White-box verification techniques in Networking ASIC Design”, Thesis Dissertation, Department of Information Technology, Lund Institute of Technology, Sep 2001.

[2] Janick Bergeron, Writing testbenches: functional verification of HDL models, Kluwer Academic Publishers, Norwell, MA, 2000.

[3] Source: Collett International 2000.

[4] Nathan Kitchen, Andreas Kuehlmann, “Stimulus generation for constrained random simulation,” IEEE/ACM International Conference on Computer-Aided Design, 2007, pp. 258-265.

[5] Prabhat Mishra, Nikil Dutt, “Functional Coverage Driven Test Generation for Validation of Pipelined Processors,” Design, Automation, and Test in Europe, 2005, pp. 678-683.

[6] Ilya Wagner, Valeria Bertacco, Todd Austin, “StressTest: an automatic approach to test generation via activity monitors,” Annual ACM IEEE Design Automation Conference, 2005, pp. 783-788.

[7] Prabhat Mishra, Nikil Dutt, “Specification-driven directed test generation for validation of pipelined processors,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 13, Issue 3, 2008.

[8] Jason C. Chen, Synopsys Inc, Applying Constrained-Random Verification to Microprocessors, Dec 2007 at http://www.edadesignline.com/howto/204800266

[9] ARM, ARM Architecture Reference Manual.

[10] ARM, ARM9E-S Core Revision: r2p1 Technical Reference Manual.

[11] Chien-De Chiang, Juinn-Dar Huang, “Efficient Two-Layered Cycle-Accurate Modeling Technique for Processor Family with Same Instruction Set Architecture.” In Proceedings of International Symposium on VLSI Design, Automation and Test (VLSI-DAT), 2009, pp. 235-238.

[12] Fu-Ching Yang, Wen-Kai Huang, Ing-Jer Huang, “Automatic Verification of External Interrupt Behaviors for Microprocessor Design,” DAC, 2007, June.

[13] S. Fine and A. Ziv. “Coverage directed test generation for functional verification using bayesian networks.” In Proceedings of Design Automation Conference (DAC), 2003, pp. 286–291.

[14] A. Aharon and D. Goodman and M. Levinger and Y. Lichtenstein and Y. Malka and C. Metzger and M. Molcho and G. Shurek. “Test program generation for functional verification of PowerPC processors in IBM.” In Proceedings of Design Automation Conference (DAC) , 1995, pp. 279–285.

[15] J. Miyake and G. Brown and M. Ueda and T. Nishiyama. “Automatic test generation for functional verification of microprocessors.” In Proceedings of Asian Test Symposium (ATS) , 1994, pp. 292–297.

[16] J. Shen and J. Abraham and D. Baker and T. Hurson and M. Kinkade and G. Gervasio and C. Chu and G. Hu. “Functional verification of the equator MAP1000 microprocessor.” In Proceedings of Design Automation Conference (DAC), 1999, pp. 169–174.

[17] P. Mishra and N. Dutt. “Graph-based functional test program generation for pipelined processors.” In Proceedings of Design Automation and Test in Europe (DATE), pp. 182–187, 2004.

[18] H. Iwashita and S. Kowatari and T. Nakata and F. Hirose. “Automatic test pattern generation for pipelined processors.” In Proceedings of International Conference on Computer-Aided Design (ICCAD) , 1994, pp. 580–583.

[19] S. Ur and Y. Yadin. “Micro architecture coverage directed generation of test programs.” In Proceedings of Design Automation Conference (DAC), 1999, pp. 175–180.

[20] Cadence Specman Elite V5.0 For Linux, at http://www.cadence.com/products/functional_ver/specman_elite/index.aspx
