Chapter 5 The Implementation of C-AES coprocessor
5.3 Parameter initialization Engine
The parameter initialization engine contains computation logic and several registers that generate and store the coefficients required by the cipher engine and the key generator. Fig. 5.3 shows the block diagram of this module. Eight 64-bit registers and three 8-bit registers store all the parameters listed in Table 5.3.
[Fig. 5.3 block label: $xtime^i(x)$, $0 \le i \le 7$]
Fig. 5.3 Simple block diagram of parameter initialization engine.
Since the shortest latency shown in Fig. 5.2 is 18 clock cycles, the parameter initialization can take up to 18 cycles without adding to the overall latency. To reduce the hardware cost, we therefore propose a compatible input order of the parameters and an architecture that uses only eight 8-bit matrix multipliers, so that the configuration time exactly matches the 18 cycles. The detailed input order and computation schedule are listed in Fig. 5.4. In addition, $C_{c_0}$, $C_{c_1}$, $C_{c_2}$, and $C_{c_3}$ are calculated by $xtime^i(c_j)$, $0 \le i \le 7$, $0 \le j \le 3$, as described in Section 3.3.
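The repeated xtime operation mentioned above can be sketched in Python. This is a minimal illustration, assuming the usual definition of xtime as multiplication by x modulo an irreducible polynomial m(x) of degree 8. The function names `xtime` and `xtime_powers`, the default polynomial 0x11B, and the sample coefficients {02, 01, 01, 03} are taken from standard AES for illustration only; they are not the coprocessor's configured values or its hardware structure.

```python
def xtime(a, m=0x11B):
    """Multiply the GF(2^8) element a by x, reducing modulo m(x).

    m is a 9-bit integer encoding a degree-8 polynomial; 0x11B is the
    standard AES polynomial, used here only as a sample value.
    """
    a <<= 1
    if a & 0x100:       # degree-8 term present: subtract (XOR) m(x)
        a ^= m
    return a & 0xFF


def xtime_powers(c, m=0x11B):
    """Return [xtime^0(c), xtime^1(c), ..., xtime^7(c)]."""
    powers = [c]
    for _ in range(7):
        powers.append(xtime(powers[-1], m))
    return powers


# Hypothetical coefficients c_0..c_3 (the standard AES MixColumns row).
coeffs = [0x02, 0x01, 0x01, 0x03]
tables = [xtime_powers(c) for c in coeffs]   # one 8-entry list per c_j
```

Each 8-entry list holds the products of one coefficient with the basis elements 1, x, ..., x^7, which is the information needed to express multiplication by that coefficient as an 8×8 bit matrix.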
Table 5.3 The necessary parameters for the cipher engine and the key generator.
δ′
Fig. 5.4 The computation schedule of parameter initialization.
Chapter 6 Verification and Result Comparison
In this chapter, the design methodology and verification based on Intellectual Property (IP) reuse are introduced. Section 6.1 describes the rules in the IP Qualification (IPQ) guidelines, which we follow to implement the synthesizable HDL code of our design in the front end. The chip design flow and the verification at each level are illustrated in Sections 6.2 and 6.3. Finally, the experimental results and comparisons are given in Section 6.4.
6.1 IP-Based Design
IP-based and platform-based designs are increasingly important in the SoC (System-on-Chip) era. Reusable IP shortens the design time needed to cope with the growing complexity of a single chip, and a platform-based design flow makes verification more efficient. Generally speaking, Silicon Intellectual Property (SIP) can be divided into the three types described below. Our proposed design focuses on the soft IP implementation in the front end.
(1) Soft IP is IP designed in the form of synthesizable HDL code.
(2) Firm IP is IP delivered in the form of a gate-level netlist after synthesis.
(3) Hard IP is IP delivered, generally in GDSII format, fully placed, routed, and optimized for power, size, or performance, and mapped to a specific process technology.
6.1.1 IP Qualification Guideline Overview
The general rules proposed in the IP Qualification (IPQ) guidelines are a set of best practices for creating reusable designs for use in an SoC design methodology.
These practices are based on the literature on reuse methodology and on the experience of the Steering Committee members of the IPQ Alliance in developing reusable designs.
Reusable macros that have already been designed and verified make users aware of all need-to-know issues in advance. If the blocks do not conform to this standard for reuse methodology, the effort of integrating pre-existing blocks into a new SoC can become excessively high.
The quality criteria that have to be taken into account come from various sources. The Reuse Methodology Manual (RMM) contains a set of rules and guidelines that help ensure that a design is reusable and technology-independent. IPQ describes the language subsets of VHDL and Verilog that are synthesizable and verifiable with any compliant tool. Further efforts on quality are under way in the Virtual Socket Interface Alliance (VSIA).
6.1.2 Soft IP Design Flow
The standard soft IP design flow is illustrated in Fig. 6.1. IP creators must follow the rules in the IP Qualification guidelines, which are the basis for industry-wide solutions for developing reusable, higher-quality IP. The IPQ guidelines classify the reuse methodology into three categories:
(1) Design guidelines:
The design guidelines include coding rules and design issues. Following these rules ensures that the HDL code of a soft IP is readable, portable, and reusable. In addition, these rules help achieve optimal synthesis and simulation results. The guidelines are categorized as follows:
- HDL (Verilog & VHDL) coding guidelines.
- Design style guidelines.
- Synthesis script guidelines.
(2) Verification guidelines:
The verification guidelines provide a set of rules that IP creators need to follow to improve the verification quality of the IP. The guidelines are categorized as follows:
- Soft IP verification guidelines.
- Coding guidelines in writing testbench codes.
- IP prototyping.
(3) Deliverable guidelines:
The deliverable guidelines ensure that users can obtain all the necessary information about the IP. With the documents and script files provided by the IP creators, users can rebuild the whole design on their own workstations or servers. The guidelines are categorized as follows:
- General deliverables.
- Documentation deliverables.
- Design files deliverables.
- Verification deliverables.
- Hardware related software deliverables.
- IP prototyping deliverables.
The detailed descriptions of these guidelines are given in IP Qualification v1.0.
[Fig. 6.1 content: the soft IP creation flow (design spec., coding, HDL analysis, synthesis script design, synthesis, design for testability, pre-layout verification) with its checks (RTL coding style check with nLint, code coverage analysis with VN, functional simulation, cycle-based simulation, static timing verification, logic synthesis, power synthesis, physical synthesis), followed by IP prototyping, the back-end flow, and the IP package flow (deliverables collection and IP package, to IP integrators), all governed by the IP Qualification design, verification, and deliverable guidelines.]
Fig. 6.1 Soft IP design flow.
6.2 Chip Design Flow
Our chip design flow is shown in Fig. 6.2. The RTL code is designed and simulated with Verilog-XL, and the Synopsys synthesis tool is used to synthesize the design with one scan chain and to create the gate-level netlist.
Fig. 6.2 Cell-based design flow
Next, gate-level simulation is run on the netlist, and its results are compared with those of the RTL simulation to check correctness. We use Apollo for placement and routing, and Calibre to check the DRC and LVS results. Once the post-layout gate-level simulation is correct, NanoSim is used for transistor-level simulation.
6.3 Verification Strategy
Since a single verification strategy cannot adequately handle the complexity of SoC problems, a multilevel verification approach is developed. It uses several functional models to verify a single IP, which increases the verification speed and efficiency at the system level. The implemented functional models and the verification steps are described in the following sections.
6.3.1 Untimed functional model
The first complete model of our proposed design is presented in abstract form as an untimed functional model (UFM), in which all functionality is implemented in MATLAB to verify the correctness of the configurable AES algorithm. It also efficiently produces the test patterns for the subsequent simulation models. The MATLAB software model is shown in Fig. 6.3.
6.3.2 Timing Accurate model
The timing accuracy of a model indicates how closely its behavior over time matches the constraints of the final design. In our proposed design flow, the synthesis tool generates the timing-accurate gate-level netlist from the RTL code, and the gate propagation delays are analyzed under the constraints defined in the specification of the UMC 0.18μm CMOS technology. After synthesis, gate-level simulation at the highest estimated operating frequency is needed to verify the correctness of the synthesis result.
Fig. 6.3 MATLAB software model.
6.3.3 FPGA Prototyping
An FPGA prototype is implemented on the ARM Integrator/Logic Module (LM), which provides a platform for developing digital IPs for AMBA-based SoC designs. The ARM Integrator contains an ARM CPU, an AMBA bus, and an FPGA. Further details about this platform are given in [37], and the system architecture of the C-AES coprocessor on the ARM Integrator is shown in Fig. 6.4. Within the LM, the registers listed in Table 6.1 are mapped to our C-AES coprocessor, so the ARM CPU on the core module can easily control the coprocessor through these registers.
Fig. 6.5 shows a debugging session in the ARM Developer Suite (ADS), in which a test bench of the encryption/decryption loop is run.
Fig. 6.4 The C-AES coprocessor on the ARM Integrator.
Fig. 6.5 The hardware driver running on the ARM ADS.
Table 6.1 Register map of the C-AES coprocessor.
| Address | Size (bits) | Function |
|---|---|---|
| 0xCC00_0000 | 9 | Each bit represents a separate control or response signal of the C-AES coprocessor, such as [RESET, Key Change, CBC, Key Length, RDONE, Wait Buffer, Working, OE] |
| 0xcCC0_0004 | 32 | Represents WDATA or RDATA according to the direction of data transfer. |
6.3.4 Coding Style Rule Check
A programmable rule checker has been integrated into the IP Qualification framework. SpringSoft nLint is used for static lint checking. The lint tool finds errors and warnings in many areas, including naming, synthesis, simulation, and DFT issues. Common syntax errors, such as typing errors, mismatched bus widths, and undeclared objects, can be located quickly. Some logical errors, such as unreachable states, can also be found. The lint tool also flags bad coding style that may lead to poor readability and reusability. Our proposed design passes the lint check with all the rules defined by the IPQ Alliance.
Fig. 6.6 The report of coding style rule check
6.3.5 Code Coverage
Generally speaking, a coverage-driven verification methodology makes the verification flow more complete and efficient, and the coverage report indicates the strengths and weaknesses of our HDL design and test bench. Coverage-driven verification can be performed with several coverage metrics, of which code coverage is a simple example. Investigating the code coverage helps the designer find untested or redundant code at an early stage of development and measures the quality of the stimuli. Coverage therefore provides the information needed to decide when the design is ready for RTL sign-off. A high coverage score gives more confidence that code that passes simulation works correctly. We use Verification Navigator to measure the code coverage, and the report is shown in Fig. 6.7.
Fig. 6.7 The report of code coverage estimated by Verification Navigator.
6.3.6 Design for Testability
For ASIC testing, a scan chain is inserted. In our design flow, the Synopsys DFT Compiler is used to conduct in-depth testability analysis at the Register Transfer Level (RTL) and to implement effective test structures at the hierarchical block level. The fault coverage report shown in Fig. 6.8 is calculated by TetraMax; the fault coverage is 99.98% with 231 test patterns.
Fig. 6.8 The report of fault coverage calculated by TetraMax.
6.3.7 Physical Verification
In physical verification, Automatic Placement and Routing (APR), on-line Design Rule Check (DRC), and Layout Versus Schematic (LVS) are done with Synopsys Astro, and off-line DRC and LVS are verified with Mentor Graphics Calibre. Finally, the design passes post-layout simulation with Verilog-XL.
6.4 Results and Comparisons
The C-AES coprocessor design has been implemented in a UMC 0.18μm CMOS technology and synthesized with a standard-cell library. As shown in Table 6.2, the critical path is only about 3.84ns.
Table 6.2 The comparison between cipher engine and key generator.
| | Cipher Engine | Key Generator |
|---|---|---|
| Gate counts (K) | 38.55 | 21.68 |
| Percentage of area (%) | 47.60 | 26.77 |
| Critical path (ns) | 3.33 | 3.84 |
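As a rough check on the timing figures, the longer of the two critical paths in Table 6.2 bounds the achievable clock frequency. The estimate below is a simple reciprocal that ignores clock uncertainty and other margins, which the text does not specify.

```latex
f_{\max} \approx \frac{1}{t_{\text{crit}}} = \frac{1}{3.84\ \text{ns}} \approx 260\ \text{MHz}
```

This leaves a small margin above the 250MHz clock rate listed for our design in Table 6.4.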
Fig. 6.9 shows the chip layout of the C-AES coprocessor. The whole chip has a size of around 1.72×1.66mm², with a gate count of around 80,986 gates. The I/O interface takes 25.42% of the overall area, because it requires selectors with wide data widths and registers for storing the IV, the initial cipher key, the text, and the parameters described in Sec. 5.3. The key generator module consumes about 26.77% of the area, and the main cipher engine module occupies 47.60%. All these data are summarized in Table 6.3.
[Fig. 6.9 annotation: $1.721 \times 1.655\ \mathrm{mm}^2$]
Fig. 6.9 Chip layout and feature of C-AES coprocessor.
In Table 6.4, we compare our implementation with several recently presented ASICs. In [12], the authors presented a test chip that provides AES encryption and decryption with different block sizes (128, 192, and 256 bits) and key lengths (128, 192, and 256 bits). Here the best performance (with a block size of 128 bits) is shown for comparison. Because its S-Box is implemented with LUTs, the hardware cost is high.
Table 6.3 Area statistics of C-AES coprocessor.
| Component | Gate counts (K) | Percentage (%) |
|---|---|---|
| Cipher Engine | 38.55 | 47.60 |
| - ShiftRows() | 0.35 | 0.43 |
| - 16 8-bit inverters in $GF(2^4)$ | 5.31 | 6.56 |
| - 16 8-bit matrix multipliers | 4.23 | 5.22 |
| - new MixColumns() | 18.32 | 22.62 |
| Key Generator | 21.68 | 26.77 |
| Main controller | 0.17 | 0.21 |
| Input interface | 17.72 | 21.88 |
| - Registers | 6.61 | 8.16 |
| - $xtime^i(x)$ | 1.23 | 1.52 |
| - 4 8-bit matrix multipliers | 1.02 | 1.26 |
| Output interface | 2.87 | 3.54 |
| - Registers | 1.37 | 1.69 |
| Total | 80.99 | 100 |
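The percentages in Table 6.3 follow directly from the gate counts, and the short Python check below recomputes them for the top-level components. The dictionary and variable names are illustrative only; the gate counts are copied from the table.

```python
# Top-level gate counts (in thousands of gates) copied from Table 6.3.
gate_counts_k = {
    "Cipher Engine": 38.55,
    "Key Generator": 21.68,
    "Main controller": 0.17,
    "Input interface": 17.72,
    "Output interface": 2.87,
}

total_k = sum(gate_counts_k.values())

# Share of the total area for each component, in percent.
percent = {name: 100 * g / total_k for name, g in gate_counts_k.items()}

# The I/O interface share quoted in the text is input plus output interface.
io_percent = percent["Input interface"] + percent["Output interface"]

for name, p in percent.items():
    print(f"{name:17s} {p:6.2f} %")
print(f"{'I/O interface':17s} {io_percent:6.2f} %")
```

The five top-level entries sum to the 80.99K total, and the combined input and output interface share matches the 25.42% quoted for the I/O interface.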
Another case is a pipelined design implemented in a 0.35μm CMOS technology [23]. The S-Boxes were implemented based on the work reported in [38] instead of a LUT-based design. However, because it supports keys of 128, 192, and 256 bits in the same way as described in [12], it requires additional memory to store all the necessary round keys in advance. Compared with designs that generate the round keys on the fly, it occupies extensive hardware resources. It should also be noted that a pipelined design has difficulty maintaining the same throughput in a feedback cipher mode such as CBC. For example, the performance of their 4-stage pipeline design is scaled down by a factor of four in the feedback cipher modes.
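The reason a pipeline loses throughput in CBC mode is the data dependency between consecutive blocks, which the following Python sketch illustrates. It is a simplified model under stated assumptions: `toy_encrypt_block` is a placeholder permutation, not AES, and `feedback_throughput` assumes that only one block can occupy the pipeline at a time in a feedback mode. All names are hypothetical.

```python
MASK128 = (1 << 128) - 1


def toy_encrypt_block(x):
    """Placeholder 128-bit permutation standing in for a block cipher (NOT AES)."""
    return (x * 0x9E3779B97F4A7C15 + 1) & MASK128


def cbc_encrypt(blocks, iv, encrypt_block=toy_encrypt_block):
    """CBC chaining: each input depends on the previous ciphertext block."""
    prev = iv
    out = []
    for p in blocks:
        prev = encrypt_block(p ^ prev)   # cannot start before prev is known
        out.append(prev)
    return out


def feedback_throughput(peak_gbps, stages):
    """Throughput when only one of `stages` pipeline slots can be filled."""
    return peak_gbps / stages


ciphertext = cbc_encrypt([5, 5], iv=0)
# Example: the 2.381 Gbps figure listed for [23] with a 4-stage pipeline.
cbc_gbps = feedback_throughput(2.381, 4)
```

Because block i cannot enter the cipher until block i-1 has left it, the remaining pipeline stages stay idle, which is the factor-of-four reduction described above. A non-pipelined, one-round-per-cycle design does not pay this penalty.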
In [19], the authors presented a compact architecture for the Rijndael algorithm, in which the hardware resources are efficiently shared among data encryption, data decryption, and even key expansion. Table 6.4 shows only the result for their 128-bit data path, which uses one clock cycle for each round. Moreover, the S-Box is optimized by introducing the composite field $GF(((2^2)^2)^2)$. Since the data paths of 192- and 256-bit key expansion are not suitable for compact hardware, their key generator supports only the 128-bit key length, and the round keys are generated on the fly. With logic optimization applied to the constant arithmetic components, it has a very small gate count.
In [34], another AES-128 module was implemented. It is very similar in concept to [19] but has a smaller area, mainly because a new common sub-expression elimination (CSE) method is applied to reduce the area cost. In addition, the authors merge the affine transformation and MixColumns() to increase the throughput. In our design, the new combined architecture described in Sec. 3.4.2 provides a more effective method for high throughput.
The design in [35] is also a configurable AES coprocessor in which all the encryption and decryption procedures are the same as in the original AES algorithm, but m(x), the affine matrix, const(x), and the row vector c(x) are all configurable. Their design uses 16 additional 256×16-bit ROMs and 16 64-bit ROMs to store all the alternatives for these parameters. For this reason, their design needs only 3 clock cycles for parameter initialization, while our approach, described in Sec. 5.3, requires at least 18 cycles to receive the input data and compute the parameters. To support the configurability, they extend the composite field arithmetic approach to implement the new data path of the round function, which can be processed with a fixed irreducible polynomial in the composite field $GF((2^4)^2)$. In other words, the MixColumns() transformation is still executed by multipliers in the composite field $GF((2^4)^2)$. Although this solves the difficulty of providing configurability, the overhead is considerable, because 32 additional 8-bit matrix multipliers for the S-Boxes and 64 8-bit field multipliers for MixColumns() are required in the data path of the round function. Our approach, which combines the matrix multipliers in the S-Boxes and MixColumns(), therefore provides a successful solution for high speed and low cost.
Table 6.4 Comparison of AES designs
| | Verbauwhede [12] | Su [23] | Satoh [19] | Hsiao [34] | Su [35] | Ours |
|---|---|---|---|---|---|---|
| Technology | 0.18μm | 0.35μm | 0.11μm | 0.18μm | 0.25μm | 0.18μm |
| Clock Rate | 125MHz | 200MHz | 224MHz | 117MHz | 66MHz | 250MHz |
| Throughput (Gbps) | 1.6 / 1.33 / 1.14 | 2.381 / 2.008 / 1.736 | 2.61 | 1.49 | 0.844 / 0.704 / 0.603 | 3.20 / 2.67 / 2.28 |
| Gate Counts | 173K | 58.430K | 21.337K | 16.917K | 200.5K | 80.99K |
| Throughput/Gate Count (Kbps/gate) | 9.25 / 7.68 / 6.59 | 41.49 / 34.98 / 30.24 | 122.23 | 88.08 | 4.21 / 3.51 / 3.01 | 39.51 / 32.97 / 28.15 |
| On-the-fly Key Generation | No | No | Yes | Yes | Yes | Yes |
| Configurability | No | No | No | No | Yes | Yes |
Although our throughput-to-gate-count ratio is about one third of the best value reported in Table 6.4, our C-AES coprocessor can easily provide a new encryption algorithm by arbitrarily selecting a combination of the parameters, and all the specified key lengths are supported. Considering both the per-gate throughput and the functionality, our design is quite competitive.
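The throughput figures for our design in Table 6.4 can be reproduced with a small Python calculation. The sketch assumes that one 128-bit block is processed in one clock cycle per round, with the 10, 12, and 14 rounds that Rijndael defines for 128-, 192-, and 256-bit keys. The one-cycle-per-round figure is an inference from the reported numbers rather than a statement in the text, and all names are illustrative.

```python
CLOCK_HZ = 250e6          # clock rate of our design, from Table 6.4
BLOCK_BITS = 128          # AES block size
GATES = 80_990            # gate count of our design, from Table 6.4

# Number of rounds per key length, as defined for Rijndael/AES.
ROUNDS = {128: 10, 192: 12, 256: 14}


def throughput_gbps(key_bits, cycles_per_round=1):
    """Throughput in Gbps, assuming cycles_per_round clock cycles per round."""
    cycles_per_block = ROUNDS[key_bits] * cycles_per_round
    return BLOCK_BITS * CLOCK_HZ / cycles_per_block / 1e9


def kbps_per_gate(gbps, gates=GATES):
    """Throughput-to-gate-count ratio in Kbps per gate."""
    return gbps * 1e6 / gates


for key_bits in (128, 192, 256):
    t = throughput_gbps(key_bits)
    print(key_bits, round(t, 2), round(kbps_per_gate(t), 2))
```

Under these assumptions the throughput evaluates to 3.2Gbps for 128-bit keys and roughly 2.67Gbps and 2.29Gbps for 192- and 256-bit keys, in line with the reported figures. The per-gate values in the table may differ in the second decimal place from this calculation, depending on how the throughput was rounded before dividing.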
Chapter 7 Conclusions and Future Work
7.1 Conclusions
A Rijndael algorithm with changeable coefficients is designed in this work, and an on-the-fly key generator is used instead of a memory-based key scheduler to reduce the hardware resources. In addition, the ECB and CBC operating modes are supported. The whole chip has a size of around 1.72×1.66mm², with a gate count of around 80,986 gates. The throughput is about 3.2Gbps for 128-bit keys, 2.67Gbps for 192-bit keys, and 2.29Gbps for 256-bit keys. The goal of this design is to provide customized security for virtual private network (VPN) applications. In a VPN, sessions do not need to be compatible with standard traffic; hence, an enterprise can configure its own coefficients to protect its network. In addition, our design provides a throughput of more than one gigabit per second, so it is suitable for Fast Ethernet or Gigabit Ethernet.
7.2 Future Work
The C-AES coprocessor is also designed to operate in an AMBA system with the AHB specification. It resides in the address map of an ARM-compatible processor and serves as a coprocessor that provides encryption computing. In the future, it may become a building block of a network security processor. To handle data transfer efficiently, future work may lie in developing AHB bus-mastering capability to off-load data movement and encryption operations from the host processor. Then, similar to a DMA device, the coprocessor would perform all data transfer and processing itself, leaving the host processor free to execute more sophisticated flow control or exception handling.
Bibliography
[1] National Institute of Standards and Technology (NIST), Advanced Encryption Standard (AES), National Technical Information Service, Springfield, VA 22161, Nov. 2001.
[2] National Institute of Standards and Technology (NIST), Data Encryption Standard (DES), National Technical Information Service, Springfield, VA 22161, Oct. 1999.
[3] W. Stallings, Cryptography and Network Security: Principles and Practice. 3rd ed., Prentice-Hall Inc., Upper Saddle River, N.J., 2003.
[4] E. Barkan, and E. Biham, “In How Many Ways Can You Write Rijndael?”, Proceedings of ASIACRYPT, Dec. 1-5, 2002, pp. 160-175, Springer-Verlag, 2002.
[5] P. Ferguson and G. Huston, “What is a VPN?—Part I,” The Internet Protocol Journal, vol. 1, pp. 2–19, June 1998. http://www.cisco.com/warp/public/759/.
[6] J. Daemen, and V. Rijmen, “AES Proposal: Rijndael,” AES Algorithm Submission, Sep. 3, 2000.
[7] A. Dandalis, V. K. Prasanna, and J. D. P. Rolim, “A comparative study of performance of AES final candidates using FPGAs,” Cryptographic Hardware and Embedded Systems (CHES) 2000, vol. 1965 of LNCS, pp. 125–140, Springer-Verlag, Aug. 2000.
[8] K. Gaj and P. Chodowiec, “Fast implementation and fair comparison of the final candidates for advanced encryption standard using field programmable gate arrays,” Proc. RSA Security Conf., Cryptographer’s Track, vol. 2020 of LNCS, pp. 84–99, Springer-Verlag, Apr. 2001.
[9] P. Chodowiec, K. Gaj, P. Bellows, and B. Schott, “Experimental testing of the Gigabit IPSec compliant implementations of Rijndael and triple DES using SLAAC-1V FPGA accelerator board,” Proc. Information Security Conf. (ISC), vol. 2200 of LNCS, pp. 220–234, Springer-Verlag, Oct. 2001.
[10] K. U. Jarvinen, M. T. Tommiska, and J. O. Skytta, “A fully pipelined memoryless 17.8 Gbps AES-128 encryptor,” Proc. Int. Symp. Field-Programmable Gate Arrays (FPGA), (Monterey), pp. 207–215, ACM Press, 2003.
[11] K. U. Jarvinen, M. T. Tommiska, and J. O. Skytta, “A fully pipelined memoryless 17.8 Gbps AES-128 encryptor,” Proc. Int. Symp. Field-Programmable Gate Arrays (FPGA), (Monterey), pp. 207–215, ACM Press, 2003.
[12] I. Verbauwhede, P. Schaumont, and H. Kuo, “Design and performance testing of a 2.29-GB/s Rijndael processor,” IEEE Journal of Solid-State Circuits, vol. 38, pp. 569–572, Mar. 2003.
[13] V. Fischer and M. Drutarovsky, “Two methods of Rijndael implementation in reconfigurable hardware,” Cryptographic Hardware and Embedded Systems (CHES) 2001, vol. 2162 of LNCS, pp. 77–92, Springer-Verlag, May 2001.
[14] S. Morioka and A. Satoh, “A 10Gbps full-AES crypto design with a twisted-BDD S-Box architecture,” Proc. IEEE Int. Conf. Computer Design (ICCD), (Freiburg, Germany), pp. 98–103, Sept. 2002.
[15] K. Gaj and P. Chodowiec, “Comparison of the hardware performance of the AES candidates using reconfigurable hardware,” Proc. 3rd AES Conf. (AES3). [Online]. Available: http://csrc.nist.gov/encryption/aes/round2/conf3/aes3papers.html
[16] M. McLoone and J. V. McCanny, “Rijndael FPGA implementation utilizing look-up tables,” IEEE Workshop on Signal Processing Systems, Sept. 2001, pp. 349–360.
[17] M. McLoone and J.V. McCanny, “Apparatus for Selectably Encrypting and