Key Expansion - Algorithm Specification - Implementing an Encryption/Decryption Accelerator on a 32-bit Embedded System Using the Coprocessor Interface

Chapter 2 Algorithm Specification

2.3 Key Expansion

The AES algorithm takes the cipher key K and performs a Key Expansion routine to generate a key schedule. The Key Expansion generates a total of Nb(Nr+1) words: the algorithm requires an initial set of Nb words, and each of the Nr rounds requires Nb words of key data. The expansion of the input key into the key schedule proceeds according to the pseudo code in Fig. 2.10. As the pseudo code shows, different operations are performed depending on i. SubWord() applies SubBytes() to each of the four bytes of a word. RotWord() performs a cyclic shift: the word [a0, a1, a2, a3] becomes [a1, a2, a3, a0]. The round constant word Rcon[i] is [x^(i-1), {00}, {00}, {00}], where x^(i-1) is a power of x (x denotes {02}) in GF(2⁸) with the irreducible polynomial

$m(x) = x^8 + x^4 + x^3 + x + 1$

Each following word, w[i], is the XOR of the previous word, w[i-1], and the word Nk positions earlier, w[i-Nk], where Nk is the key length in words. For words in positions that are a multiple of Nk, a transformation is applied to w[i-1], followed by an XOR with the round constant Rcon[i/Nk], prior to the XOR with the word w[i-Nk]. This transformation consists of RotWord() followed by SubWord().

The Key Expansion routine produces a key array like the one shown in the upper part of Fig. 2.11.

Whenever the AddRoundKey() routine is invoked, the current index i is increased by 4, and the four key words starting at index i are used as the input of AddRoundKey(), as illustrated in the lower part of Fig. 2.11.

KeyExpansion(byte key[4*Nk], word w[Nb*(Nr+1)], Nk)
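The routine described above can be sketched in Python as follows. This is a minimal illustration and not the thesis's pseudo code from Fig. 2.10: the helper names (gmul, sbox, sub_word, rot_word, key_expansion) are chosen here for illustration, the S-box is computed by brute force rather than stored, and the extra SubWord() step for Nk = 8 follows the AES standard (FIPS-197) although the text above does not mention it.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def sbox(b):
    """SubBytes on one byte: multiplicative inverse, then the affine map."""
    inv = next((c for c in range(1, 256) if gmul(b, c) == 1), 0)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [sbox(b) for b in range(256)]


def sub_word(w):
    """Apply the S-box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[x] for x in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Cyclic shift [a0, a1, a2, a3] -> [a1, a2, a3, a0]."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def key_expansion(key):
    """Expand a 16-, 24- or 32-byte key into Nb*(Nr+1) 32-bit words."""
    nb = 4
    nk = len(key) // 4      # key length in words: 4, 6 or 8
    nr = nk + 6             # number of rounds: 10, 12 or 14
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 0x01             # x^(i/Nk - 1), starting at x^0
    for i in range(nk, nb * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gmul(rcon, 0x02)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)   # extra step for 256-bit keys (FIPS-197)
        w.append(temp ^ w[i - nk])
    return w


if __name__ == "__main__":
    words = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
    print(len(words), hex(words[4]))
```

With the 128-bit example key from FIPS-197, the schedule has 44 words, which matches Nb(Nr+1) = 4 × 11.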

Chapter 3 AES Design

In this chapter, we propose our AES architecture. AES was announced in 2001, and since then many hardware and software implementations have been published. This chapter introduces S⁺Core and explains how our AES architecture is realized on it.

This chapter is organized as follows. In Section 3.1, we consider the system requirements and modify the AES architecture to meet them, which leads to the new architecture. In Section 3.2, we give an overview of S⁺Core and of its Coprocessor Interface (CI). In Section 3.3, we present the architecture of the AES core.

3.1 AES System Architecture

At the beginning, we ran the AES encryption procedure on the S⁺Core simulator and found that the S⁺Core compiler produces inefficient code for it, because the instruction set of S⁺Core is limited. We can use the Coprocessor Interface to solve this problem: the coprocessor acts as an accelerator added to the system. We introduce S⁺Core and its Coprocessor Interface in Sec. 3.2. The Coprocessor Interface is the I/O channel of our AES design. Our core receives the data from S⁺Core through the Coprocessor Interface and then encrypts or decrypts it.

Fig. 3.1 shows the block diagram of our AES architecture based on the Coprocessor Interface.

The coprocessor's general registers receive data from S⁺Core only when an MTC or LDC instruction is executed, and S⁺Core reads data from the coprocessor's general registers only when an MFC or STC instruction is executed. Take AES-128 as an example. First, eight MTC or LDC instructions transfer the data and the key from the CPU's general registers or the memory unit. These instructions are discussed in detail in Sec. 3.2.1. The AES coprocessor then starts when the Start signal is asserted. Because AES-128 needs 10 cycles to generate the correct cipher, the Freeze signal is needed to stall the CPU; otherwise the cipher would be wrong. When the Ready signal is asserted, which means the current data is valid, the correct 128-bit cipher is transferred to the coprocessor's output registers. Finally, four MFC or STC instructions transmit the 128-bit cipher to the CPU. Fig. 3.2 shows the waveform of AES-128 encryption.
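The instruction counts in the AES-128 example follow from the 32-bit register width. The short sketch below makes that arithmetic explicit. The function name transfer_counts is hypothetical, and applying the same formula to 192-bit and 256-bit keys is an assumption that the text does not state.

```python
WORD_BITS = 32     # width of one coprocessor general register
BLOCK_BITS = 128   # AES block size


def transfer_counts(key_bits):
    """Return (transfers to the coprocessor, transfers back to the CPU).

    Each MTC/LDC or MFC/STC instruction is assumed to move one 32-bit word.
    """
    to_coprocessor = (BLOCK_BITS + key_bits) // WORD_BITS   # data + key
    from_coprocessor = BLOCK_BITS // WORD_BITS              # cipher only
    return to_coprocessor, from_coprocessor


if __name__ == "__main__":
    print(transfer_counts(128))
```

For a 128-bit key this gives eight transfers in and four transfers out, matching the counts quoted in the text.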

Figure 3.1 Block diagram of AES architecture by CI

[Figure 3.2 waveform signals: CLK, rst, START, DATA & Key, Cipher out, Freeze, Ready]

Figure 3.2 Waveform of AES-128 encryption

To increase the clock rate, we insert a pipeline register into the AES core. Fig. 3.3 shows the block diagram of the core.

Figure 3.3 Pipelined AES core

3.2 Overview of S⁺Core Platform

The S⁺Core™ [12] is Taiwan's first self-defined 32-bit RISC CPU, with an instruction set architecture (ISA) owned by Sunplus. The ISA has a 32/16-bit hybrid instruction mode and parallel conditional execution for high code density, high performance and versatile applications. The micro-architecture includes an AMBA bus for SoC integration, coprocessor and custom engine interfaces for functional flexibility, and SJTAG for efficient debugging and In-Circuit Emulation (ICE).

The user-friendly development environment, including the S⁺Core IDE, a simulator, an optimizing GNU C/C++ compiler and GDB, enables users to develop high-quality applications in a short time.

The most important feature of S⁺Core for this work is its support for optional customer-defined coprocessors, which means we can define new instructions for dedicated functions. Because of that, we can improve the performance of S⁺Core with customer-defined coprocessors. We introduce how to use the Coprocessor Interface in the next section.

3.2.1 Coprocessor Interface (CI)

The S⁺Core can attach coprocessors 1 to 3 for dedicated functions, for example a floating-point device or a DSP device. A coprocessor device plugs into S⁺Core through the “Coprocessor Interface”. Up to three coprocessors may be employed in one design. In this section, we discuss only the coprocessor interface and the coprocessor instructions.

S⁺Core issues the instruction to the coprocessors through the CI in the first stage. Each coprocessor first determines whether the instruction belongs to it, and if so executes the instruction. A coprocessor may contain up to 32 general registers. Each of these registers is up to 32 bits wide. Typically, programs use the general registers for loading and storing the data on which the coprocessor operates. Data is moved to the coprocessor's general registers from the processor's general registers with the MTCz instruction. Data is moved from the coprocessor's general registers to the processor's general registers with the MFCz instruction. Main memory data is loaded into or stored from the coprocessor's general registers with the LDCz and STCz instructions. Fig. 3.4 shows the interaction of the S⁺Core CPU and a coprocessor.

All coprocessor instructions share one main opcode encoding. There are three types of coprocessor instruction: coprocessor register transfer instructions, coprocessor data transfer instructions and coprocessor operation instructions. The Sub-OP field distinguishes the different coprocessor instructions, while CP# specifies the coprocessor number. The coprocessor register transfer instructions are MTC# (move to coprocessor) and MFC# (move from coprocessor). The coprocessor data transfer instructions are LDC# (coprocessor load) and STC# (coprocessor store). Fig. 3.5 shows the coprocessor instruction formats.

[Figure 3.4 labels: Coprocessor, Pipeline signal, data, Freeze]

Figure 3.4 Coprocessor Interface

Instruction   Fields
mtc/mfc       OP | rD | CrA | CP# | Sub-OP
ldc/stc       OP | rD | CrA | CP# | Sub-OP
cop           OP | CrD | CrA | CrB | Cop-Code | CP# | Sub-OP

(Additional field labels in the figure: 0, imm10)

Figure 3.5 Coprocessor instruction formats

3.3 Proposed AES Architecture

Fig. 3.6 shows the architecture of our AES core design. The core is composed of three parts: the Main Function Unit, the Key Unit, and the Control Unit. The Control Unit counts the number of rounds required for each key length, and it generates the data-ready signal and the busy signal that control the AES architecture. The detailed components of the Main Function Unit are discussed in Sec. 3.3.1. In our AES design, encryption can run with different key lengths (128, 192 or 256 bits), and Sec. 3.3.2 discusses how the Key Unit works.

[Figure 3.6 blocks: Main Function Unit, Key Unit, Control Unit; ports: Message, Cipher, Key; bus widths labelled 128 and 256 bits]

Figure 3.6 Block diagram of the AES

3.3.1 Efficient implementation of the Function Unit

For speed, we process a full 128-bit message block in each cycle. The basic architecture unrolls only one full cipher round and iteratively loops the data through this round until the entire encryption is completed. The basic component is shown in Fig. 3.7.

Figure 3.7 Main function Unit for encryption

SubBytes and InvSubBytes Transformation [4]

The multiplicative inversion in GF(2⁸) involved in SubBytes is a hardware-demanding operation: it takes at least 620 gates to implement by repeated multiplications in GF(2⁸) [7]. However, the gate count can be reduced greatly by using composite field arithmetic. In the SubBytes transformation, using substructure sharing, the isomorphic mapping function can be implemented by 12 XOR gates with 4 XOR gates in the critical path. Meanwhile, the combined inverse isomorphic mapping and affine transformation can be implemented by 19 XOR gates, and its critical path also consists of 4 XOR gates. In the composite field GF((2⁴)²), an element can be expressed as $s_h x + s_l$, where $s_h, s_l \in GF(2^4)$ and x is a root of $P_2(x)$. Using the Extended Euclidean algorithm, the multiplicative inverse of $s_h x + s_l$ modulo $P_2(x)$ can be computed as

$(s_h x + s_l)^{-1} = s_h \Theta x + (s_h + s_l)\Theta$   (3.1)

where $\Theta = (s_h^2 \lambda + s_h s_l + s_l^2)^{-1}$. The proof of this equation is as follows.

Proof: The problem of finding the inverse of $S(x) = s_h x + s_l$ modulo $P_2(x)$ is equivalent to finding polynomials $A(x)$ and $B(x)$ that satisfy

$A(x) P_2(x) + B(x) S(x) = 1$   (3.2)

so that $B(x)$ is the inverse of $S(x)$ modulo $P_2(x)$. Such $A(x)$ and $B(x)$ can be found by using the Extended Euclidean Algorithm for one iteration. First, we need to rewrite $P_2(x)$ in the form of

$P_2(x) = Q(x) S(x) + R(x)$   (3.3)

where $Q(x)$ and $R(x)$ are the quotient and remainder polynomials of dividing $P_2(x)$ by $S(x)$, respectively. With $P_2(x) = x^2 + x + \lambda$, long division gives

$Q(x) = s_h^{-1} x + (1 + s_h^{-1} s_l) s_h^{-1}$   (3.4)

$R(x) = \lambda + (1 + s_h^{-1} s_l) s_h^{-1} s_l$   (3.5)

Substituting (3.4) and (3.5) into (3.3) and multiplying both sides of the equation by $s_h^2$, it follows that

$s_h^2 P_2(x) = (s_h x + (s_h + s_l)) S(x) + (s_h^2 \lambda + s_h s_l + s_l^2)$   (3.6)

Multiplying both sides of (3.6) by $\Theta = (s_h^2 \lambda + s_h s_l + s_l^2)^{-1}$ gives

$\Theta s_h^2 P_2(x) = \Theta (s_h x + (s_h + s_l)) S(x) + 1$   (3.7)

Since addition and subtraction are the same in extension fields of GF(2), the first term on the right side of (3.7) can be moved to the left side. Comparing (3.2) and (3.7), it can be observed that

$S^{-1}(x) = s_h \Theta x + (s_h + s_l)\Theta$   (3.8)

According to (3.1), the multiplicative inversion in GF(2⁸) can be carried out in GF((2⁴)²) by the architecture illustrated in Fig. 3.8.

Figure 3.8 Implementation of the SubBytes Transformation
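The inversion formula (3.1) can be checked numerically with a small Python sketch. Two choices below are assumptions made only for this illustration and are not taken from the thesis: GF(2⁴) is built with the field polynomial x⁴ + x + 1, and λ is taken as the first value for which x² + x + λ has no root in GF(2⁴), so that P₂(x) = x² + x + λ is irreducible. All function names are hypothetical.

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4); the field polynomial x^4 + x + 1 is an assumption."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def gf16_inv(a):
    """Brute-force inverse of a nonzero element of GF(2^4)."""
    return next(b for b in range(1, 16) if gf16_mul(a, b) == 1)


# Pick lambda so that x^2 + x + lambda has no root in GF(2^4) (assumption).
LAM = next(l for l in range(1, 16)
           if all(gf16_mul(x, x) ^ x ^ l != 0 for x in range(16)))


def composite_mul(ah, al, bh, bl):
    """(ah*x + al)(bh*x + bl) reduced with x^2 = x + LAM."""
    hh = gf16_mul(ah, bh)
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(hh, LAM) ^ gf16_mul(al, bl)
    return high, low


def composite_inv(sh, sl):
    """Inverse of sh*x + sl computed as s_h*Theta*x + (s_h + s_l)*Theta."""
    theta = gf16_inv(gf16_mul(gf16_mul(sh, sh), LAM)
                     ^ gf16_mul(sh, sl)
                     ^ gf16_mul(sl, sl))
    return gf16_mul(sh, theta), gf16_mul(sh ^ sl, theta)


if __name__ == "__main__":
    ok = all(composite_mul(sh, sl, *composite_inv(sh, sl)) == (0, 1)
             for sh in range(16) for sl in range(16) if sh or sl)
    print("formula holds for all nonzero elements:", ok)
```

The check multiplies every nonzero element by the value produced by the formula and compares the product with 1, so it exercises only one inversion in GF(2⁴) per element, which is the point of the composite field approach.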

The SubBytes transformation can be described by

$S'_{i,j} = M S_{i,j}^{-1} + C$   (3.9)

where M is an 8×8 binary matrix, and C is an 8-bit binary vector with only 4 nonzero bits. The InvSubBytes performs the following operation on each byte of the State:

$S'_{i,j} = \left( M^{-1} (S_{i,j} + C) \right)^{-1}$   (3.10)

From (3.10), the InvSubBytes transformation can be implemented according to the block diagram illustrated in Fig. 3.9.

Figure 3.9 Block diagram of the InvSubBytes transformation
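As a software illustration of (3.9) and (3.10), the sketch below computes SubBytes as inversion followed by the affine map and InvSubBytes as the inverse affine map followed by inversion. It uses the standard AES matrix M expressed as byte rotations and C = {63}, with a brute-force inverse in GF(2⁸) instead of the composite field circuit. The function names are chosen here for illustration only.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def gf_inv(a):
    """Multiplicative inverse in GF(2^8); {00} is mapped to itself."""
    return next((c for c in range(1, 256) if gmul(a, c) == 1), 0)


def rotl8(v, n):
    """Rotate a byte left by n bits."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


C = 0x63  # 8-bit constant vector with four nonzero bits


def mat_m(v):
    """Multiply by the 8x8 binary matrix M of the AES affine map."""
    return v ^ rotl8(v, 1) ^ rotl8(v, 2) ^ rotl8(v, 3) ^ rotl8(v, 4)


def mat_m_inv(v):
    """Multiply by the inverse matrix M^-1."""
    return rotl8(v, 1) ^ rotl8(v, 3) ^ rotl8(v, 6)


def sub_byte(s):
    """(3.9): S' = M * S^-1 + C."""
    return mat_m(gf_inv(s)) ^ C


def inv_sub_byte(s):
    """(3.10): S' = (M^-1 * (S + C))^-1."""
    return gf_inv(mat_m_inv(s ^ C))


if __name__ == "__main__":
    print(hex(sub_byte(0x53)), hex(inv_sub_byte(sub_byte(0x53))))
```

Because both directions share the same inversion, a hardware design can reuse the inverter and switch only the affine stages, which is what Fig. 3.9 relies on.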

MixColumns and InvMixColumns Transformation

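Various architectures have been proposed for the implementation of the MixColumns transformation [6], [8], [9], [10], [11]. Applying substructure sharing both to the computation of a byte and between the computations of the four bytes in a column of the State, an efficient MixColumns implementation architecture can be derived. Particularly, (2.10) can be rewritten as

$S'_0 = \{02\}(S_0 + S_1) + (S_2 + S_3) + S_1$
$S'_1 = \{02\}(S_1 + S_2) + (S_3 + S_0) + S_2$
$S'_2 = \{02\}(S_2 + S_3) + (S_0 + S_1) + S_3$
$S'_3 = \{02\}(S_3 + S_0) + (S_1 + S_2) + S_0$   (3.11)

According to (3.11), the MixColumns transformation can be implemented by the architecture shown in Fig. 3.10. The function of the block “XTime” is to compute constant multiplication by $\{02\}_{16}$. An element of GF(2⁸) can be expressed in polynomial form as $S = s_7 x^7 + s_6 x^6 + s_5 x^5 + s_4 x^4 + s_3 x^3 + s_2 x^2 + s_1 x + s_0$, where $s_7, \ldots, s_0 \in GF(2)$. Multiplying by x and reducing modulo $m(x) = x^8 + x^4 + x^3 + x + 1$ gives

$\{02\}_{16} \cdot S = s_6 x^7 + s_5 x^6 + s_4 x^5 + (s_3 + s_7) x^4 + (s_2 + s_7) x^3 + s_1 x^2 + (s_0 + s_7) x + s_7$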

Therefore, the “XTime” block can be implemented by 3 XOR gates with only one XOR gate in the critical path. As illustrated in Fig. 3.10, the total number of XOR gates for computing one column of the State is 108, and the critical path is 3 XOR gates. The InvMixColumns multiplies the input polynomial by the constant polynomial

$d(x) = c^{-1}(x) = \{0b\} x^3 + \{0d\} x^2 + \{09\} x + \{0e\}$   (3.12)

The InvMixColumns is far more complex and occupies a larger area. A. Satoh et al. [6] proposed an implementation in which InvMixColumns shares logic resources with MixColumns, which yields logic optimizations.

We propose a different method for exploring resource sharing. Recall from (2.9) and (3.12) that $a(x) \bullet d(x) = \{01\}$. If we multiply both sides of the equation by d(x), we obtain $a(x) \bullet d^2(x) = d(x)$, where $d^2(x) = \{04\} x^2 + \{05\}$. Note that two of the coefficients of $d^2(x)$ are equal to {00}. The MixColumns and InvMixColumns can therefore be implemented using shared logic resources, as shown in Fig. 3.11.

[Figure 3.10 dataflow: inputs S0, S1, S2, S3; input XOR stages; four XTime blocks; output XOR stage; outputs S0', S1', S2', S3']

Figure 3.10 Implementation of the MixColumns Transformation
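The structure of Fig. 3.10 can be mirrored in a few lines of Python: each output byte uses one XTime applied to the XOR of two neighbouring input bytes, plus XORs of the remaining bytes. This is a software sketch of that factored form, not the thesis's hardware description, and the function names are illustrative.

```python
def xtime(s):
    """Multiply a byte by {02} in GF(2^8).

    After the shift, bit 7 of the input (s7) is folded back into bits
    4, 3, 1 and 0, which is the reduction by x^8 + x^4 + x^3 + x + 1.
    """
    s7 = s >> 7
    return ((s << 1) & 0xFF) ^ (0x1B if s7 else 0x00)


def mix_column(s0, s1, s2, s3):
    """MixColumns on one column using one XTime per output byte."""
    return (
        xtime(s0 ^ s1) ^ s1 ^ s2 ^ s3,
        xtime(s1 ^ s2) ^ s2 ^ s3 ^ s0,
        xtime(s2 ^ s3) ^ s3 ^ s0 ^ s1,
        xtime(s3 ^ s0) ^ s0 ^ s1 ^ s2,
    )


if __name__ == "__main__":
    print([hex(b) for b in mix_column(0xDB, 0x13, 0x53, 0x45)])
```

Sharing the sums s0 ^ s1, s1 ^ s2, s2 ^ s3 and s3 ^ s0 between outputs is what reduces the XOR count in hardware; in software the same sharing simply avoids the separate multiplications by {03}.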

[Figure 3.11 blocks: a(x), d²(x)]

Figure 3.11 Implementation of MixColumns and InvMixColumns
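The resource-sharing identity a(x) • d²(x) = d(x) can be verified with a short Python sketch that multiplies polynomials with GF(2⁸) coefficients modulo x⁴ + 1. The coefficient lists are ordered from the constant term upward, a(x) is the standard MixColumns polynomial {03}x³ + {01}x² + {01}x + {02}, and all names are hypothetical.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def poly_mul(p, q):
    """Multiply two 4-term polynomials over GF(2^8) modulo x^4 + 1.

    Coefficients are listed as [c0, c1, c2, c3] (constant term first).
    """
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gmul(p[i], q[j])
    return out


A = [0x02, 0x01, 0x01, 0x03]    # a(x), MixColumns
D = [0x0E, 0x09, 0x0D, 0x0B]    # d(x), InvMixColumns, equation (3.12)
D2 = [0x05, 0x00, 0x04, 0x00]   # d^2(x) = {04}x^2 + {05}


def inv_mix_column_shared(col):
    """InvMixColumns as a d^2(x) pre-multiplication followed by MixColumns."""
    return poly_mul(A, poly_mul(D2, col))


if __name__ == "__main__":
    col = [0x8E, 0x4D, 0xA1, 0xBC]
    print([hex(b) for b in inv_mix_column_shared(col)])
```

Since d²(x) has only two nonzero coefficients, the pre-multiplication stage needs far less logic than a direct multiplication by d(x), and the MixColumns stage is reused unchanged.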

3.3.2 Reconfigurable Key Unit

Fig. 3.12 shows our Key Unit, which is composed of two parts: the control logic and the KeyGenerator. The KeyGenerator generates the round key for AES encryption in every round. The counter counts 10 rounds for a 128-bit key, 12 rounds for a 192-bit key and 14 rounds for a 256-bit key. Because only 128 bits of key are needed in each round, registers store the key material for the following round during 192-bit and 256-bit key scheduling. The SBox in the KeyGenerator is the same as that in the Main Function Unit. The control logic chooses the correct round key for the AddRoundKey transformation of the Main Function Unit in every round.

Figure 3.12 Block diagram of Key Unit for Encryption

The Key Unit for decryption is different from that for encryption. First, we generate the keys for all rounds and store them in the STACK. When all the keys needed for decryption are ready, we start to decrypt the cipher. Fig. 3.13 shows our Key Unit for decryption.

Figure 3.13 Block diagram of Key Unit for Decryption
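The two key-unit behaviours described in this section can be sketched in Python for the 128-bit case: round keys are produced one round at a time from the previous round key for encryption, and pushed onto a stack and popped in reverse order for decryption. This is a simplified software model under stated assumptions: it covers AES-128 only (the register handling for 192-bit and 256-bit keys is not modelled), the S-box is computed by brute force, and all names are hypothetical.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


def sbox(b):
    """SubBytes on one byte: inverse in GF(2^8), then the affine map."""
    inv = next((c for c in range(1, 256) if gmul(b, c) == 1), 0)
    rot = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rot(inv, 1) ^ rot(inv, 2) ^ rot(inv, 3) ^ rot(inv, 4) ^ 0x63


SBOX = [sbox(b) for b in range(256)]


def sub_word(w):
    return int.from_bytes(bytes(SBOX[x] for x in w.to_bytes(4, "big")), "big")


def rot_word(w):
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def next_round_key(prev, rcon):
    """Derive the next 128-bit round key from the current one."""
    w0 = prev[0] ^ sub_word(rot_word(prev[3])) ^ (rcon << 24)
    w1 = prev[1] ^ w0
    w2 = prev[2] ^ w1
    w3 = prev[3] ^ w2
    return [w0, w1, w2, w3]


def round_keys_128(key):
    """Yield the 11 round keys of AES-128 one at a time (encryption order)."""
    rk = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    yield rk
    for _ in range(10):
        rk = next_round_key(rk, rcon)
        rcon = gmul(rcon, 0x02)
        yield rk


if __name__ == "__main__":
    key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
    # Encryption: only the current round key has to be held at any time.
    for rk in round_keys_128(key):
        pass
    # Decryption: generate every key first, then pop them in reverse order.
    stack = list(round_keys_128(key))
    first_key_for_decryption = stack.pop()
    print([hex(w) for w in first_key_for_decryption])
```

The generator form keeps only four words of state between rounds, while the decryption path must hold all eleven round keys before the first round can start.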

Chapter 4 Simulation and FPGA Verification

This work presents AES arithmetic in hardware and its design for an embedded system. This chapter shows the hardware implementation results. The hardware implementation results and the design flow are described in Sec. 4.1; RTL synthesis for ASIC uses Synopsys Design Compiler. The FPGA verification is discussed in Sec. 4.2.

4.1 ASIC Implementation

Fig. 4.1 illustrates the entire ASIC design and testing flow with various CAD (Computer Aided Design) tools. The results reported here come from pre-layout gate-level simulation, which cannot estimate the circuit speed precisely. The results of post-layout gate-level simulation are expected to be worse than those reported here.

Tab. 4.1 compares our design with previously published designs. [14] implements the SBox using a look-up table, and [13] uses composite field arithmetic to implement the SBox. Our design is 2-stage pipelined. Its throughput with a 128-bit key is 1.82 Gbps.

Figure 4.1 ASIC design flow

Table 4.1 The AES Core Comparison

                     Kuo [14]   Lai [15]   Horng [13]   Ours
Technology (µm)      0.18       0.25       0.18         0.18
Clock rate (MHz)     154        125        125          150
Gate count           173K       80K        67.9K        47.5K
Throughput (Gbps)    1.6        1.454      1.6          1.82
Pipeline stages      1          6          1            2
Key size             All        128        All          All
Function             E          E/D        E/D          E

(E: encryption, D: decryption)

Tab. 4.2 compares S⁺Core with and without the AES-128 encryption accelerator. It shows the number of cycles needed to encrypt the first data block: 206 cycles with the accelerator versus 3329 cycles without it, a reduction of about 16 times.

Table 4.2 Comparison of S⁺Core with and without the accelerator

S⁺Core (without accelerator)   3329 cycles
S⁺Core (with accelerator)      206 cycles

4.2 FPGA Verification

Figure 4.2 illustrates the FPGA design and testing flow, in contrast to the ASIC design flow. Besides the RTL simulation, we also verified our design on a Field Programmable Gate Array (FPGA). Our design is implemented in S⁺Core, and the operating clock rate is about 33 MHz. Tab. 4.3 shows the hardware utilization of our design.

Figure 4.2 FPGA design flow

Table 4.3 The hardware utilization on S⁺Core

Device                        S⁺Core
Number of Slice Flip-Flops    20801/93184 (22%)
Number of 4-input LUTs        62400/93184 (66%)
Clock rate                    33 MHz

Chapter 5 Conclusions

First, we have proposed an efficient AES design that supports 128-, 192- and 256-bit key lengths. Because our KeyGenerator produces round keys in real time, we do not need to store all round keys, and we need only about 10% of the key storage area of other designs. Second, by implementing the multiplicative inverter in the composite field, the area cost is smaller than that of a look-up table (LUT) implementation. The whole design area can also be reduced by sharing hardware between encryption and decryption. Third, we proposed an AES accelerator for 32-bit embedded processors that extends the instruction set of the processor. Because of that, AES encryption needs fewer than 300 cycles, whereas the processor without the accelerator needs over 3000 cycles; our design is therefore more than 10 times faster. Besides, from the analysis of various instruction schedules, the 2-stage pipelined architecture is suitable and efficient for most schedules. The total gate count is about 47.5K gates, and the maximal throughput is about 1.82 Gbps with the UMC 0.18 μm process.

Bibliography

[1] W. Stallings, Cryptography and Network Security: Principles and Practice. Prentice Hall, 2002.

[2] Recommendation on Key Management, NIST Special Publications Std. 800-57, 2005.

[3] J. Daemen and V. Rijmen, AES Proposal: Rijndael, AES Algorithm Submission, September 3, 1999.

[4] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Trans. on VLSI Systems, vol. 12, no. 9, pp. 957–967, 2004.

[5] C. Paar, “Efficient VLSI architecture for bit-parallel computations in Galois field,” Ph.D. dissertation, Institute for Experimental Mathematics, University of Essen, Essen, Germany, 1994.

[6] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, “A compact Rijndael hardware architecture with S-Box optimization,” in Proc. ASIACRYPT 2001, Gold Coast, Australia, Dec. 2001, pp. 239–254.

[7] M. H. Jing, Y. H. Chen, Y. T. Chang, and C. H. Hsu, “The design of a fast inverse module in AES,” in Proc. Int. Conf. Info-Tech and Info-Net, vol. 3, Beijing, China, Nov. 2001, pp. 298–303.

[8] V. Fischer and M. Drutarovsky, “Two methods of Rijndael implementation in reconfigurable hardware,” in Proc. CHES 2001, Paris, France, May 2001, pp. 77–92.

[9] H. Kuo and I. Verbauwhede, “Architectural optimization for a 1.82 Gbits/sec VLSI implementation of the AES Rijndael algorithm,” in Proc. Cryptographic Hardware and Embedded Systems (CHES) 2001, Paris, France, May 2001, pp. 51–64.

[10] C. C. Lu and S. Y. Tseng, “Integrated design of AES (advanced encryption standard) encrypter and decrypter,” in Proc. IEEE Int. Conf. Application Specific Systems, Architectures Processors, 2002, pp. 277–285.

[11] X. Zhang and K. K. Parhi, “Implementation approaches for the advanced encryption standard algorithm,” IEEE Circuits Syst. Mag., vol. 2, no. 4, pp. 24–46, 2002.

[12] http://w3.sunplus.com/products/S%2Bcore.asp

[13] C. L. Horng, “An AES cipher chip design using on-the fly key scheduler,” Master Thesis, Dept. Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, June 2004.

[14] I. Verbauwhede, P. Schaumont, and H. Kuo, “Design and performance testing of a 2.29Gb/s Rijndael Processor,” IEEE Jour. of Solid-State Circuits, vol. 38, no. 3, pp. 569–572, March 2003.

[15] Y. K. Lai, L. C. Chang, L. F. Chen, C. C. Chou, and C. W. Chiu, “A novel memory less AES cipher architecture for networking applications,” in Proc. IEEE Circuit and Systems Symp, May 2004.

About the Author

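Name: Po-Yuan Yeh (葉博元)
Place of birth: Taipei City
Date of birth: 1982. 11. 09

Education:

1989. 9 ~ 1995. 6 Taipei Municipal Kang-Ning Elementary School
1995. 9 ~ 1998. 6 Taipei Municipal San-Min Junior High School
1998. 9 ~ 2001. 6 The Affiliated Senior High School of National Taiwan Normal University
2001. 9 ~ 2005. 6 B.S., Department of Electrical Engineering, National Chung Cheng University
2005. 9 ~ 2007. 8 System Group, Institute of Electronics, National Chiao Tung University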
