Organization of this thesis - Implementing an Encryption/Decryption Accelerator on a 32-bit Embedded System Using the Coprocessor Interface

Chapter 1 Introduction

1.3 Organization of this thesis

This thesis is organized as follows. Chapter 2 presents the AES algorithm and its basic operations. Chapter 3 introduces the S⁺Core platform and presents the proposed architecture of the AES design. The verification method and simulation results are shown in Chapter 4. The last chapter gives a brief conclusion and discusses future work.

Chapter 2 Algorithm Specification

In this chapter, the Advanced Encryption Standard (AES) algorithm is described.

2.1 Advanced Encryption Standard (AES) Specification

The inputs and outputs of the AES specification are summarized in Table 2.1. For the AES algorithm, the input block and the output block are both 128 bits long, and the number of rounds depends on the key length: 10 rounds for a 128-bit key, 12 rounds for a 192-bit key, and 14 rounds for a 256-bit key.

Table 2.1 AES specification relations

           In/Output Block Size   Key Length   Number of Rounds
AES-128    128 bits               128 bits     10
AES-192    128 bits               192 bits     12
AES-256    128 bits               256 bits     14

The input (the array of bytes in0, in1, … in15) is copied into the State array as illustrated in Fig. 2.1. The Cipher or Inverse Cipher operations are then conducted on this State array, after which its final value is copied to the output (the array of bytes out0, out1, … out15). There are four kinds of transformation:

1. Non-linear byte substitution, called SubBytes().

2. Cyclic shift of each row of the State array by a different offset, called ShiftRows().

3. Mixing of the data within each column of the State array, called MixColumns().

4. Addition of the round key to the State, called AddRoundKey().

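[Figure 2.1: the input bytes are copied into the State array column by column, so the first row holds in0, in4, in8, in12.]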
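As a minimal sketch of this copy (Python is used only for illustration, and the function names are hypothetical, not taken from the thesis), the mapping between the byte arrays and the State can be written as state[r][c] = in[r + 4c]:

```python
def bytes_to_state(block):
    """Copy 16 input bytes into a 4x4 State: state[r][c] = in[r + 4*c]."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte blocks")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Copy the State to the output array: out[r + 4*c] = state[r][c]."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


block = bytes(range(16))      # stands for in0 .. in15
state = bytes_to_state(block)
print(state[0])               # [0, 4, 8, 12], i.e. in0, in4, in8, in12
```

Each column of the State therefore holds one 32-bit word of the block, which is the unit that MixColumns() and AddRoundKey() operate on.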

2.1.1 Basic Galois Field Arithmetic

The basic processing unit of the AES algorithm is the byte, and most operations in the AES round function are based on addition and multiplication in GF(2⁸). Addition in GF(2⁸) is defined as the XOR operation. The product of two 8-bit vectors, however, can be longer than 8 bits and would then fall outside GF(2⁸). Therefore, finite field multiplication is always a modular multiplication: the product is reduced modulo an irreducible polynomial. For AES, the irreducible polynomial is

$$m(x) = x^8 + x^4 + x^3 + x + 1 \tag{2.1}$$
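The two operations can be sketched as follows. This is an illustrative software model, not the hardware used in this work; the names gf_add and gf_mul are hypothetical. The multiplication uses the usual shift-and-add method and reduces by m(x) whenever the intermediate value exceeds 8 bits.

```python
AES_POLY = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def gf_add(a, b):
    """Addition in GF(2^8) is a bitwise XOR."""
    return a ^ b


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing modulo m(x)."""
    result = 0
    while b:
        if b & 1:            # add the current multiple of a
            result ^= a
        a <<= 1              # multiply a by x
        if a & 0x100:        # degree 8 reached: subtract (XOR) m(x)
            a ^= AES_POLY
        b >>= 1
    return result


print(hex(gf_mul(0x57, 0x83)))   # 0xc1
```

The value {57} • {83} = {c1} is the worked example given in FIPS-197.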

2.1.2 Composite Field Arithmetic

Composite field arithmetic can be employed to reduce the hardware complexity.

We call the two pairs {GF(2ⁿ), Q(y)} and {GF((2ⁿ)^m), P(x)} a composite field [5] if

• GF(2ⁿ) is constructed from GF(2) by Q(y);

• GF((2ⁿ)^m) is constructed from GF(2ⁿ) by P(x).

Composite fields will be denoted by GF((2ⁿ)^m), and a composite field GF((2ⁿ)^m) is isomorphic to the field GF(2^k) for k = nm. Additionally, composite fields can be built iteratively from lower order fields. For example, the composite field of GF(2⁸) can be built iteratively from GF(2) using the following irreducible polynomials [6]:

[Equation (2.4): the field polynomials are not legible in the source.]

An isomorphic mapping is needed to map an element in GF(2⁸) to its composite field and vice versa. The 8×8 binary matrix δ is determined by the field polynomials of GF(2⁸) and its composite fields. Such a matrix can be found by the exhaustive-search-based algorithm in [5]. The δ matrix corresponding to p(x) = x⁸ + x⁴ + x³ + x + 1 and the field polynomials in (2.4) is found as below:

1 1 0 0 0 0 1 0

2.2 Encryption and Decryption Procedure

The encryption and decryption procedures are shown in Fig. 2.2. At the beginning of the encryption procedure, the plain block is XORed with the initial round key by the AddRoundKey() procedure. After this initial Round Key addition, the State array goes through 10, 12, or 14 rounds (depending on the key length). Each of the first Nr-1 rounds applies SubBytes(), ShiftRows(), MixColumns(), and AddRoundKey(). The final round differs slightly: the State goes through only SubBytes(), ShiftRows(), and AddRoundKey(), and then the cipher block is output. Decryption is similar to encryption, but applies the inverse transformations in the reverse direction.

Figure 2.2 The Procedure of Encryption and Decryption

Cipher(byte in[4*Nb], byte out[4*Nb], word w[Nb*(Nr+1)])
   ...
   AddRoundKey(state, w[Nr*Nb, (Nr+1)*Nb-1])
   out = state
end

Figure 2.3 Pseudo Code for the Cipher

InvCipher(byte in[4*Nb], byte out[4*Nb], word w[Nb*(Nr+1)]) begin

The Cipher is described in the pseudo code in Fig. 2.3, and the inverse cipher is described in the pseudo code in Fig. 2.4.

2.2.1 SubBytes() and InvSubBytes() Transformation

The SubBytes() transformation is a non-linear byte substitution that operates independently on each byte of the State using a substitution table (S-box). The S-box is invertible and consists of two transformations:

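1. Take the multiplicative inverse in the finite field GF(2⁸); the element {00} is mapped to itself. Tab. 2.2 shows the multiplicative inverse of {xy}₁₆ using Eq. (2.1) as the irreducible polynomial.

2. Apply the following affine transformation (over GF(2)):

$$b'_i = b_i \oplus b_{(i+4) \bmod 8} \oplus b_{(i+5) \bmod 8} \oplus b_{(i+6) \bmod 8} \oplus b_{(i+7) \bmod 8} \oplus c_i \tag{2.6}$$

for 0 ≤ i < 8, where b_i is the i-th bit of the byte b, and c_i is the i-th bit of the byte c with the value {63}₁₆ or {01100011}₂.

In matrix form, the affine transformation element of the S-box can be expressed as:

$$\begin{bmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{bmatrix} = \begin{bmatrix} 1&0&0&0&1&1&1&1 \\ 1&1&0&0&0&1&1&1 \\ 1&1&1&0&0&0&1&1 \\ 1&1&1&1&0&0&0&1 \\ 1&1&1&1&1&0&0&0 \\ 0&1&1&1&1&1&0&0 \\ 0&0&1&1&1&1&1&0 \\ 0&0&0&1&1&1&1&1 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix} \tag{2.7}$$

Fig. 2.5 illustrates the effect of the SubBytes() transformation on the State. The S-Box used in the SubBytes() transformation is presented in hexadecimal form in Tab. 2.3. For example, if s1,1 = {53}, then the substitution value is determined by the intersection of the row with index ‘5’ and the column with index ‘3’ in Tab. 2.3. This results in s’1,1 having a value of {ed}.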

Table 2.2 Multiplicative Inverse table for the byte {xy}₁₆

     0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
0   00 01 8d f6 cb 52 7b d1 e8 4f 29 c0 b0 e1 e5 c7
1   74 b4 aa 4b 99 2b 60 5f 58 3f fd cc ff 40 ee b2
2   3a 6e 5a f1 55 4d a8 c9 c1 0a 98 15 30 44 a2 c2
3   2c 45 92 6c f3 39 66 42 f2 35 20 6f 77 bb 59 19
4   1d fe 37 67 2d 31 f5 69 a7 64 ab 13 54 25 e9 09
5   ed 5c 05 ca 4c 24 87 bf 18 3e 22 f0 51 ec 61 17
6   16 5e af d3 49 a6 36 43 f4 47 91 df 33 93 21 3b
7   79 b7 97 85 10 b5 ba 3c b6 70 d0 06 a1 fa 81 82
8   83 7e 7f 80 96 73 be 56 9b 9e 95 d9 f7 02 b9 a4
9   de 6a 32 6d d8 8a 84 72 2a 14 9f 88 f9 dc 89 9a
a   fb 7c 2e c3 8f b8 65 48 26 c8 12 4a ce e7 d2 62
b   0c e0 1f ef 11 75 78 71 a5 8e 76 3d bd bc 86 57
c   0b 28 2f a3 da d4 e4 0f a9 27 53 04 1b fc ac e6
d   7a 07 ae 63 c5 db e2 ea 94 8b c4 d5 9d f8 90 6b
e   b1 0d d6 eb c6 0e cf ad 08 4e d7 e3 5d 50 1e b3
f   5b 23 38 34 68 46 03 8c dd 9c 7d a0 cd 1a 41 1c

Figure 2.5 SubBytes() applies the S-Box to each byte of the State array

Table 2.3 S-Box, a substitution table for the byte {xy}16

InvSubBytes() is the inverse of the byte substitution transformation, in which the inverse S-box is applied to each byte of the State. This is obtained by applying the inverse of the affine transformation (2.6) followed by taking the multiplicative inverse in GF(2⁸).

The inverse S-box used in the InvSubBytes() transformation is presented in Tab. 2.4.

Table 2.4 Inverse S-Box, a substitution table for the byte {xy}16
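The two S-box steps can be reproduced in software as a cross-check of the tables. The sketch below is an illustration only: the helper names are hypothetical, the inverse is found by exhaustive search rather than by the composite-field circuit of Chapter 3, and the inverse S-box is obtained simply by inverting the forward table.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse by exhaustive search; {00} maps to itself."""
    if a == 0:
        return 0
    return next(y for y in range(1, 256) if gf_mul(a, y) == 1)


def affine(b, c=0x63):
    """Affine step: b'_i = b_i ^ b_(i+4) ^ b_(i+5) ^ b_(i+6) ^ b_(i+7) ^ c_i."""
    out = 0
    for i in range(8):
        bit = (c >> i) & 1
        for k in (0, 4, 5, 6, 7):
            bit ^= (b >> ((i + k) % 8)) & 1
        out |= bit << i
    return out


SBOX = [affine(gf_inv(v)) for v in range(256)]

INV_SBOX = [0] * 256
for value, substituted in enumerate(SBOX):
    INV_SBOX[substituted] = value

print(f"{gf_inv(0x53):02x} {SBOX[0x53]:02x} {INV_SBOX[0xed]:02x}")   # ca ed 53
```

The printed values match the example in the text: the inverse of {53} is {ca} (row 5, column 3 of Tab. 2.2), and the affine transformation maps it to {ed}.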

2.2.2 ShiftRows() and InvShiftRows() Transformation

In the ShiftRows() transformation, the bytes in the last three rows of the State are cyclically shifted over different numbers of bytes(offsets). The first row, r = 0, is not shifted.

Specifically, the ShiftRows() transformation proceeds as follows:

$$s'_{r,c} = s_{r,\,(c + shift(r,Nb)) \bmod Nb} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < 4 \tag{2.8}$$

where the shift value shift(r, Nb) depends on the row number r as follows:

$$shift(1,4) = 1, \quad shift(2,4) = 2, \quad shift(3,4) = 3 \tag{2.9}$$

Fig. 2.6 illustrates the ShiftRows() transformation.

Figure 2.6 ShiftRows() operates on the rows of the State

The InvShiftRows() transformation proceeds as follows:

$$s'_{r,\,(c + shift(r,Nb)) \bmod Nb} = s_{r,c} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < 4 \tag{2.10}$$

Fig. 2.7 illustrates the InvShiftRows() transformation.

Figure 2.7 InvShiftRows() operates on the rows of the State
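Both row transformations reduce to index arithmetic. A minimal sketch, assuming the State is stored as a list of four rows and using shift(r, 4) = r (the function names are illustrative only):

```python
def shift_rows(state):
    """s'[r][c] = s[r][(c + r) mod 4]: row r is rotated left by r bytes."""
    return [[state[r][(c + r) % 4] for c in range(4)] for r in range(4)]


def inv_shift_rows(state):
    """s'[r][(c + r) mod 4] = s[r][c]: row r is rotated right by r bytes."""
    out = [[0] * 4 for _ in range(4)]
    for r in range(4):
        for c in range(4):
            out[r][(c + r) % 4] = state[r][c]
    return out


# Label each byte as 10*r + c so the movement is easy to follow.
state = [[10 * r + c for c in range(4)] for r in range(4)]
for row in shift_rows(state):
    print(row)
```

In hardware this step costs no logic at all, since it is only a rewiring of the byte positions.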

2.2.3 MixColumns() and InvMixColumns() Transformation

The MixColumns() transformation applies a linear operation to each column (a 32-bit word) of the State. The linear operation treats the column as a four-term polynomial over GF(2⁸) and multiplies it by a fixed polynomial a(x) modulo x⁴ + 1. The polynomial a(x) is given by

$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\} \tag{2.11}$$

The polynomial is co-prime to x⁴ + 1 and therefore is invertible. This operation can also be written as a matrix multiplication. Let s'(x) = a(x) ⊗ s(x):

$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix} \quad \text{for } 0 \le c < Nb$$

Fig. 2.8 describes the effect of the MixColumns() transformation on the State. The four elements of a column are processed at the same time; after multiplication by a(x), the results are written to the same column.

Figure 2.8 MixColumns() operates on each column of the State

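The InvMixColumns() transformation multiplies each column by the inverse of the MixColumns() matrix, as follows:

$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 0e & 0b & 0d & 09 \\ 09 & 0e & 0b & 0d \\ 0d & 09 & 0e & 0b \\ 0b & 0d & 09 & 0e \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix} \quad \text{for } 0 \le c < Nb$$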
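The polynomial view of both transformations can be checked with a short sketch. It multiplies two four-term polynomials over GF(2⁸) modulo x⁴ + 1, so that x^(i+j) wraps around to x^((i+j) mod 4). The helper names are hypothetical, coefficient lists are written lowest degree first, and the inverse polynomial is the one whose coefficients appear in the inverse matrix.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def poly_mul_mod(a, s):
    """Return a(x) (*) s(x) modulo x^4 + 1; lists hold coefficients of x^0..x^3."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gf_mul(a[i], s[j])
    return out


A = [0x02, 0x01, 0x01, 0x03]       # a(x) = {03}x^3 + {01}x^2 + {01}x + {02}
A_INV = [0x0E, 0x09, 0x0D, 0x0B]   # {0b}x^3 + {0d}x^2 + {09}x + {0e}

column = [0xDB, 0x13, 0x53, 0x45]  # s0, s1, s2, s3 of one State column
mixed = poly_mul_mod(A, column)
print([f"{b:02x}" for b in mixed])         # ['8e', '4d', 'a1', 'bc']
print(poly_mul_mod(A, A_INV))              # [1, 0, 0, 0]
```

The second line printed confirms that the two polynomials are inverses of each other modulo x⁴ + 1, which is why InvMixColumns() undoes MixColumns().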

2.2.4 AddRoundKey() Transformation

In the AddRoundKey() transformation, a Round Key is added to the State by a simple bitwise XOR operation. Each Round Key consists of Nb words from the key schedule (described in Sec. 2.3). Those Nb words are each added into the columns of the State, such that

$$[s'_{0,c}, s'_{1,c}, s'_{2,c}, s'_{3,c}] = [s_{0,c}, s_{1,c}, s_{2,c}, s_{3,c}] \oplus [w_{round \cdot Nb + c}] \quad \text{for } 0 \le c < Nb \tag{2.14}$$

where [w_i] are the key schedule words described in Sec. 2.3, and round is a value in the range 0 ≤ round ≤ Nr. In the Cipher, the initial Round Key addition occurs when round = 0. The application of the AddRoundKey() transformation to the Nr rounds of the Cipher occurs when 1 ≤ round ≤ Nr.

The action of this transformation is illustrated in Fig. 2.9, where l = round * Nb.

Figure 2.9 AddRoundKey() XORs each column of the State with a word
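Equation (2.14) can be sketched directly. Here the State is a list of four rows and each key schedule word is a list of four bytes; both representations, and the function name, are assumptions made for this illustration.

```python
def add_round_key(state, w, rnd, nb=4):
    """XOR column c of the State with key schedule word w[rnd*nb + c]."""
    return [[state[r][c] ^ w[rnd * nb + c][r] for c in range(nb)]
            for r in range(4)]


state = [[r + 4 * c for c in range(4)] for r in range(4)]
# Toy key schedule: round 0 uses all-zero words, round 1 uses [1, 2, 3, 4].
w = [[0, 0, 0, 0]] * 4 + [[1, 2, 3, 4]] * 4

after = add_round_key(state, w, 1)
print(after[0])   # row 0 of every column is XORed with byte 0 of its word
```

Because XOR is its own inverse, applying the same round key a second time restores the State, so the same logic serves the Cipher and the Inverse Cipher.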

2.3 Key Expansion

The AES algorithm takes the key K and performs a Key Expansion routine to generate a key schedule. The Key Expansion generates a total of Nb(Nr+1) words: the algorithm requires an initial set of Nb words, and each of the Nr rounds requires Nb words of key data. The expansion of the input key into the key schedule proceeds according to the pseudo code in Fig. 2.10. As the pseudo code shows, different operations are performed depending on i. SubWord() applies the SubBytes() substitution to each of the four bytes of a word. RotWord() performs a cyclic shift: the word [a₀, a₁, a₂, a₃] becomes [a₁, a₂, a₃, a₀]. The word Rcon[i] consists of [x^(i-1), {00}, {00}, {00}], where x^(i-1) is a power of x in GF(2⁸) with the irreducible polynomial m(x) = x⁸ + x⁴ + x³ + x + 1. Each following word, w[i], is the XOR of the previous word, w[i-1], with the word Nk positions earlier, w[i-Nk], where Nk is the key length in words. For words in positions that are a multiple of Nk, a transformation, followed by an XOR with the round constant Rcon[i], is applied to w[i-1] prior to the XOR with the word w[i-Nk]. This transformation consists of RotWord() followed by SubWord().

The Key Expansion routine produces a key array like the one in the upper part of Fig. 2.11. Whenever the AddRoundKey() routine is invoked, the current index i is increased by 4, and the four words starting at index i are used as the input of AddRoundKey(), as illustrated in the lower part of Fig. 2.11.

KeyExpansion(byte key[4*Nk], word w[Nb*(Nr+1)], Nk)
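To tie the key schedule to the round structure of Sec. 2.2, the following sketch is a plain software model of AES encryption. It is not the hardware design of this thesis: all function names are illustrative, the S-box is computed by exhaustive search for brevity, the State is kept as a flat list with s[r + 4c] holding byte (r, c), and the extra SubWord() step that FIPS-197 specifies for 256-bit keys is included.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    box = []
    for v in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(v, y) == 1), 0)
        s = 0x63
        for k in range(5):  # inv ^ rotl(inv, 1) ^ ... ^ rotl(inv, 4) ^ {63}
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        box.append(s)
    return box


SBOX = build_sbox()
NB = 4


def key_expansion(key):
    """Expand a 16-, 24- or 32-byte key into Nb*(Nr+1) words of 4 bytes."""
    nk = len(key) // 4
    nr = nk + 6
    w = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 0x01
    for i in range(nk, NB * (nr + 1)):
        temp = list(w[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]          # RotWord
            temp = [SBOX[b] for b in temp]      # SubWord
            temp[0] ^= rcon                     # XOR with the round constant
            rcon = gf_mul(rcon, 0x02)           # next power of x
        elif nk > 6 and i % nk == 4:
            temp = [SBOX[b] for b in temp]      # extra SubWord, 256-bit keys
        w.append([a ^ b for a, b in zip(w[i - nk], temp)])
    return w


def cipher(block, w):
    """Encrypt one 16-byte block with the expanded key w."""
    nr = len(w) // NB - 1

    def add_round_key(s, rnd):
        return [s[r + 4 * c] ^ w[rnd * NB + c][r]
                for c in range(4) for r in range(4)]

    def shift_rows(s):
        return [s[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

    def mix_columns(s):
        out = []
        for c in range(4):
            col = s[4 * c:4 * c + 4]
            out += [gf_mul(2, col[r]) ^ gf_mul(3, col[(r + 1) % 4])
                    ^ col[(r + 2) % 4] ^ col[(r + 3) % 4] for r in range(4)]
        return out

    s = add_round_key(list(block), 0)
    for rnd in range(1, nr + 1):
        s = shift_rows([SBOX[b] for b in s])    # SubBytes, then ShiftRows
        if rnd != nr:                           # final round skips MixColumns
            s = mix_columns(s)
        s = add_round_key(s, rnd)
    return bytes(s)


plain = bytes.fromhex("00112233445566778899aabbccddeeff")
print(cipher(plain, key_expansion(bytes(range(16)))).hex())
```

With the key 00 01 … 0f this is the AES-128 example vector of FIPS-197 Appendix C.1, which makes the sketch convenient as a golden model when checking simulation results.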

Chapter 3 AES Design

In this chapter, we propose our AES architecture. AES was announced in 2001, and since then much research has presented AES implementations in hardware or software. We also introduce the S⁺Core and show how our AES architecture is realized on it.

This chapter is organized as follows. In Sec. 3.1, we consider the system requirements and modify the AES architecture to match them, which gives the new architecture. In Sec. 3.2, we give an overview of S⁺Core and of the Coprocessor Interface (CI). In Sec. 3.3, we present the architecture of the AES core.

3.1 AES System Architecture

At the beginning, we used the S⁺Core simulator to run the AES encryption procedure, and we found that the S⁺Core compiler compiles the program inefficiently because the instruction set of S⁺Core is limited. We can use the Coprocessor Interface to solve this problem: the coprocessor acts as an accelerator added to the system. S⁺Core and its Coprocessor Interface are introduced in Sec. 3.2. The Coprocessor Interface is the I/O device of our AES design: our core gets the data from S⁺Core through the Coprocessor Interface and then encrypts or decrypts it.

Fig. 3.1 shows the block diagram of our AES architecture attached through the Coprocessor Interface.

The coprocessor’s general registers receive data from S⁺Core only when an MTC or LDC instruction is executed, and S⁺Core gets data from the coprocessor’s general registers only when an MFC or STC instruction is executed. Take AES-128 as an example. First, we issue eight MTC or LDC instructions to transfer the data and the key from the CPU’s general registers or the memory unit; these instructions are discussed in detail in Sec. 3.2.1. Then the AES coprocessor starts when the Start signal is asserted. Because AES-128 needs 10 cycles to generate the correct cipher, the Freeze signal is needed to stall the CPU; otherwise the cipher would be wrong. When the Ready signal is asserted, the current data is valid and the 128-bit cipher is transferred to the coprocessor’s output registers. Finally, four MFC or STC instructions transfer the 128-bit cipher to the CPU. Fig. 3.2 shows the waveform of AES-128 encryption.
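The AES-128 sequence described above can be written down as an ordered list of events. The sketch below is only a bookkeeping model: the register numbering, the big-endian packing of bytes into 32-bit words, and the function name are assumptions, and no real S⁺Core instruction encoding is implied.

```python
def aes128_transfer_sequence(data, key):
    """List the events of one AES-128 encryption through the coprocessor."""
    def words(block):
        # Pack 16 bytes into four 32-bit words (byte order is an assumption).
        return [int.from_bytes(block[i:i + 4], "big") for i in range(0, 16, 4)]

    # Eight register writes: four data words, then four key words.
    ops = [("MTC", reg, word)
           for reg, word in enumerate(words(data) + words(key))]
    ops.append(("START",))
    ops += [("FREEZE",)] * 10        # CPU stalled for the 10 AES-128 cycles
    ops.append(("READY",))
    ops += [("MFC", reg) for reg in range(4)]   # read back the 128-bit cipher
    return ops


ops = aes128_transfer_sequence(bytes(range(16)), bytes(range(16, 32)))
for op in ops[:3]:
    print(op)
```

Counting the events shows where the time goes in this model: 12 register transfers surround only 10 cycles of actual encryption.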

Figure 3.1 Block diagram of AES architecture by CI

Figure 3.2 Waveform of AES-128 encryption (signals: CLK, rst, START, DATA & Key, Cipher out, Freeze, Ready)

For speed, we insert a pipeline register in the AES core. Fig. 3.3 shows the block diagram of the core.

Figure 3.3 Pipelined AES core

3.2 Overview of S⁺Core Platform

The S⁺Core™ [12] is Taiwan's first self-defined 32-bit RISC CPU with a Sunplus-owned instruction set architecture (ISA). The ISA has a 32/16-bit hybrid instruction mode and parallel conditional execution for high code density, high performance, and versatile applications. The micro-architecture includes an AMBA bus for SoC integration, a coprocessor and custom engine interface for functional flexibility, and SJTAG for efficient debugging and In-Circuit Emulation (ICE).

The user-friendly development environment, including the S⁺Core IDE, simulator, optimizing GNU C/C++ compiler, and GDB, enables users to develop high-quality applications quickly.

The most important feature of S⁺Core for this work is its optional customer-defined coprocessors, which let us define new instructions for dedicated functions. Because of that, we can improve the performance of S⁺Core with customer-defined coprocessors. The next section introduces how to use the Coprocessor Interface.

3.2.1 Coprocessor Interface (CI)

The S⁺Core can plug in coprocessors 1 to 3 for dedicated functions, for example a floating-point unit or a DSP device. A coprocessor device plugs into S⁺Core through the “Coprocessor Interface”. Up to three coprocessors may be employed in one design. In this section, we discuss only the coprocessor interface and the coprocessor instructions.

S⁺Core issues the instruction to the coprocessors through the CI in the first stage. Each coprocessor first determines whether the instruction belongs to it and, if so, executes it. A coprocessor may contain up to 32 general registers. Each of these registers is up to 32 bits wide. Typically, programs use the general registers for loading and storing the data on which the coprocessor operates. Data is moved to the coprocessor’s general registers from the processor’s general registers with the MTCz instruction. Data is moved from the coprocessor’s general registers to the processor’s general registers with the MFCz instruction. Main memory data is loaded into or stored from the coprocessor’s general registers with the LDCz and STCz instructions. Fig. 3.4 shows the interaction of the S⁺Core CPU and a coprocessor.

All coprocessor instructions share one main opcode encoding. There are three types of coprocessor instruction: coprocessor register transfer instructions, coprocessor data transfer instructions, and coprocessor operation instructions. The Sub-OP field distinguishes the different coprocessor instructions, while CP# specifies the coprocessor number. The coprocessor register transfer instructions are MTC# (move to coprocessor) and MFC# (move from coprocessor). The coprocessor data transfer instructions are LDC# (coprocessor load) and STC# (coprocessor store). Fig. 3.5 shows the coprocessor instruction formats.

Figure 3.4 Coprocessor Interface (signals shown between the CPU and the coprocessor: pipeline signals, data, Freeze)

mtc/mfc OP rD CrA CP# Sub-OP

ldc/stc OP rD CrA CP# Sub-OP

cop OP CrD CrA CrB Cop-Code CP# Sub-OP

0 imm10

Figure 3.5 Coprocessor instructions format

3.3 Proposed AES Architecture

Fig. 3.6 shows the architecture of our AES core design. The core is composed of three parts: the Main Function Unit, the Key Unit, and the Control Unit. The Control Unit counts the number of rounds required by each key length and generates the data-ready and busy signals that control the AES architecture. The components of the Main Function Unit are discussed in detail in Sec. 3.3.1. In our AES design, encryption can run with different key lengths (128, 192, or 256 bits); Sec. 3.3.2 discusses how the Key Unit works.

Figure 3.6 Block diagram of the AES (blocks: Main Function Unit, Key Unit, Control Unit)

3.3.1 Efficient implementation of the Function Unit

For speed, we process a full 128-bit message block in each cycle. The basic architecture unrolls only one full cipher round and iteratively loops the data through this round until the entire encryption is completed. The basic component is shown in Fig. 3.7.

Figure 3.7 Main function Unit for encryption

SubBytes and InvSubBytes Transformation [4]

The multiplicative inversion in GF(2⁸) involved in SubBytes is a hardware-demanding operation: it takes at least 620 gates to implement by repeated multiplications in GF(2⁸) [7]. However, the gate count can be reduced greatly by using composite field arithmetic. In the SubBytes transformation, using substructure sharing, the isomorphic mapping function can be implemented by 12 XOR gates with 4 XOR gates in the critical path. Meanwhile, the combined inverse isomorphic mapping and affine transformation can be implemented by 19 XOR gates, again with 4 XOR gates in the critical path. In the composite field GF((2⁴)²), an element can be expressed as s_h x + s_l, where s_h, s_l ∈ GF(2⁴) and x is a root of P₂(x). Using the Extended Euclidean algorithm, the multiplicative inverse of s_h x + s_l modulo P₂(x) can be computed as

$$(s_h x + s_l)^{-1} = s_h \Theta x + (s_h + s_l)\Theta \tag{3.1}$$

where $\Theta = (s_h^2 \lambda + s_h s_l + s_l^2)^{-1}$. The proof of this equation is as follows:

Proof: The problem of finding the inverse of S(x) = s_h x + s_l modulo P₂(x) = x² + x + λ is equivalent to finding polynomials A(x) and B(x) that satisfy

$$A(x)P_2(x) + B(x)S(x) = 1 \tag{3.2}$$

so that B(x) is the inverse of S(x) modulo P₂(x). Such A(x) and B(x) can be found by using the Extended Euclidean Algorithm for one iteration. First, we need to rewrite P₂(x) in the form of

$$P_2(x) = Q(x)S(x) + R(x) \tag{3.3}$$

where Q(x) and R(x) are the quotient and remainder polynomials of dividing P₂(x) by S(x), respectively. By long division, it can be derived that

$$Q(x) = s_h^{-1}x + (1 + s_h^{-1}s_l)s_h^{-1} \tag{3.4}$$

$$R(x) = \lambda + (1 + s_h^{-1}s_l)s_h^{-1}s_l \tag{3.5}$$

Substituting (3.4) and (3.5) into (3.3) and multiplying both sides of the equation by s_h², it follows that

$$s_h^2 P_2(x) = \big(s_h x + (s_h + s_l)\big)S(x) + (s_h^2\lambda + s_h s_l + s_l^2) \tag{3.6}$$

Multiplying both sides of (3.6) by Θ gives

$$\Theta s_h^2 P_2(x) = \Theta\big(s_h x + (s_h + s_l)\big)S(x) + 1 \tag{3.7}$$

Since addition and subtraction are the same in the extended field of GF(2), the first term on the right side of (3.7) can be moved to the left side. Comparing (3.2) and (3.7), it can be observed that

$$S^{-1}(x) = s_h\Theta x + (s_h + s_l)\Theta \tag{3.8}$$

According to (3.1), the multiplicative inversion in GF(2⁸) can be carried out in GF((2⁴)²) by the architecture illustrated in Fig. 3.8.

Figure 3.8 Implementation of the SubBytes Transformation
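Equation (3.1) can be checked numerically. The sketch below makes two assumptions that are not taken from this thesis: GF(2⁴) is built with the polynomial x⁴ + x + 1, and λ = {1000}₂, a value for which x² + x + λ has no root in GF(2⁴) under that choice. With other field polynomials the constants differ, but the structure of the computation (one GF(2⁴) inversion plus a few multiplications and additions) is the same.

```python
GF16_POLY = 0b10011   # x^4 + x + 1 (assumed field polynomial for GF(2^4))
LAMBDA = 0b1000       # assumed constant; x^2 + x + LAMBDA must be irreducible


def gf16_mul(a, b):
    """Multiply two 4-bit values in GF(2^4)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x10:
            a ^= GF16_POLY
        b >>= 1
    return result


def gf16_inv(a):
    """Inverse in GF(2^4) by search (a must be nonzero)."""
    return next(y for y in range(1, 16) if gf16_mul(a, y) == 1)


def composite_mul(u, v):
    """(uh x + ul)(vh x + vl) modulo x^2 + x + LAMBDA, using x^2 = x + LAMBDA."""
    uh, ul = u
    vh, vl = v
    hh = gf16_mul(uh, vh)
    high = hh ^ gf16_mul(uh, vl) ^ gf16_mul(ul, vh)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(ul, vl)
    return (high, low)


def composite_inv(s):
    """Equation (3.1): (sh x + sl)^-1 = sh*Theta x + (sh + sl)*Theta."""
    sh, sl = s
    theta = gf16_inv(gf16_mul(gf16_mul(sh, sh), LAMBDA)
                     ^ gf16_mul(sh, sl) ^ gf16_mul(sl, sl))
    return (gf16_mul(sh, theta), gf16_mul(sh ^ sl, theta))


# x^2 + x + LAMBDA is irreducible if it has no root in GF(2^4).
assert all(gf16_mul(t, t) ^ t ^ LAMBDA != 0 for t in range(16))

elements = [(h, l) for h in range(16) for l in range(16) if (h, l) != (0, 0)]
ok = all(composite_mul(s, composite_inv(s)) == (0, 1) for s in elements)
print(ok)   # True
```

The exhaustive loop multiplies every nonzero element by the inverse given by (3.1) and confirms that the product is always 1.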

The SubBytes can be described by

$$S'_{i,j} = M S_{i,j}^{-1} + C \tag{3.9}$$

where M is an 8×8 binary matrix, and C is an 8-bit binary vector with only 4 nonzero bits. The InvSubBytes performs the following operation on each byte of the State:

$$S'_{i,j} = \big(M^{-1}(S_{i,j} + C)\big)^{-1} \tag{3.10}$$

From (3.10), the InvSubBytes transformation can be implemented according to the block diagram illustrated in Fig. 3.9.

Figure 3.9 Block diagram of the InvSubBytes transformation

MixColumns and InvMixColumns Transformation

Various architectures have been proposed for the implementation of the MixColumns transformation [6], [8], [9], [10], [11]. Applying substructure sharing both to the computation of a byte and between the computations of the four bytes in a column of the State, an efficient MixColumns implementation architecture can be derived. Particularly, the matrix form of MixColumns in Sec. 2.2.3 can be rewritten in the shared form (3.11).

According to (3.11), the MixColumns transformation can be implemented by the architecture shown in Fig. 3.10. The function of the block “XTime” is to compute the constant multiplication by {02}₁₆. An element of GF(2⁸) can be expressed in polynomial form as S = s₇x⁷ + s₆x⁶ + s₅x⁵ + s₄x⁴ + s₃x³ + s₂x² + s₁x + s₀. Multiplying S by x and reducing modulo m(x) shifts the coefficients up by one position and adds s₇ to the coefficients of x⁴, x³, and x.

Therefore, the “XTime” block can be implemented by 3 XOR gates with only one XOR gate in the critical path. As illustrated in Fig. 3.10, the total number of XOR gates for computing one column of the State is 108, and the critical path is 3 XOR gates. The InvMixColumns multiplies the input polynomial by the constant polynomial

$$d(x) = a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\} \tag{3.12}$$

The InvMixColumns is far more complex and occupies a larger area. A. Satoh et al. [6] proposed an implementation in which InvMixColumns shares logic resources with MixColumns, which yields logic optimizations.

We propose a different method for exploiting resource sharing. Recall (2.11) and (3.12): a(x) • d(x) = {01}. If we multiply both sides of the equation by d(x), we obtain a(x) • d²(x) = d(x), where d²(x) = {04}x² + {05}. Note that two of the coefficients of d²(x) are equal to {00}. The MixColumns and InvMixColumns can be implemented using shared logic resources as shown in Fig. 3.11.

Figure 3.10 Implementation of the MixColumns Transformation
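A software view of the XTime block and of substructure sharing is sketched below. The grouping of terms is one possible sharing, chosen for this illustration, and is not claimed to be identical to (3.11) or to the wiring of Fig. 3.10. It does, however, use four shared byte sums, eight further byte XORs, and four XTime blocks, which adds up to 12 × 8 + 4 × 3 = 108 XOR gates per column.

```python
def xtime(s):
    """Multiply by {02}: shift left, copy s7 into bit 0 and XOR it into
    bits 1, 3 and 4 (three XOR gates in hardware)."""
    s7 = s >> 7
    return ((s << 1) & 0xFF) ^ (s7 * 0x1B)


def mix_column_shared(s0, s1, s2, s3):
    """MixColumns on one column, sharing the sums of adjacent bytes."""
    t01, t12, t23, t30 = s0 ^ s1, s1 ^ s2, s2 ^ s3, s3 ^ s0
    return (xtime(t01) ^ s1 ^ t23,    # {02}s0 + {03}s1 + s2 + s3
            xtime(t12) ^ s2 ^ t30,    # s0 + {02}s1 + {03}s2 + s3
            xtime(t23) ^ s3 ^ t01,    # s0 + s1 + {02}s2 + {03}s3
            xtime(t30) ^ s0 ^ t12)    # {03}s0 + s1 + s2 + {02}s3


result = mix_column_shared(0xDB, 0x13, 0x53, 0x45)
print([f"{b:02x}" for b in result])   # ['8e', '4d', 'a1', 'bc']
```

Each output passes through one shared sum, one XTime block, and one final combination, so the longest path is three XOR gates deep when the last two-input XORs are balanced.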

Figure 3.11 Implementation of MixColumns and InvMixColumns (blocks: a(x), d²(x))
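The identity a(x) • d²(x) = d(x) means that InvMixColumns can be obtained by passing a column through a small d²(x) stage and then through the existing MixColumns logic. The sketch below checks this with polynomial arithmetic modulo x⁴ + 1; the helper names are hypothetical and coefficient lists are written lowest degree first.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def poly_mul_mod(a, s):
    """Product of two four-term polynomials over GF(2^8) modulo x^4 + 1."""
    out = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            out[(i + j) % 4] ^= gf_mul(a[i], s[j])
    return out


A = [0x02, 0x01, 0x01, 0x03]    # a(x), MixColumns
D = [0x0E, 0x09, 0x0D, 0x0B]    # d(x), InvMixColumns
D2 = [0x05, 0x00, 0x04, 0x00]   # d^2(x) = {04}x^2 + {05}


def inv_mix_column(column):
    """InvMixColumns as the d^2(x) stage followed by MixColumns."""
    return poly_mul_mod(A, poly_mul_mod(D2, column))


print(poly_mul_mod(D, D) == D2, poly_mul_mod(A, D2) == D)   # True True
```

Because d²(x) has only two nonzero coefficients, {04} and {05}, the extra stage needs far less logic than a direct multiplication by d(x).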

3.3.2 Reconfigurable Key Unit

Fig. 3.12 shows our Key Unit. It is composed of two parts: the control logic and the KeyGenerator. The KeyGenerator generates the round key for AES encryption in every round.

The counter counts 10 rounds for a 128-bit key, 12 rounds for a 192-bit key, and 14 rounds for a 256-bit key. Because only a 128-bit key is needed in every round, we use registers to hold the round-key material for the next round during 192-bit and 256-bit key scheduling. The S-box in the KeyGenerator is the same as that in the Main Function Unit. The control logic chooses the correct round key for the AddRoundKey transformation of the Main Function Unit in every round.

Figure 3.12 Block diagram of Key Unit for Encryption

The Key Unit for decryption is different from that for encryption. First, we generate the keys for all rounds and store them in the STACK. When all the keys needed for decryption are ready, we start to decrypt the cipher. Fig. 3.13 shows our Key Unit for decryption.

Figure 3.13 Block diagram of Key Unit for Decryption

Chapter 4 Simulation and FPGA Verification

This work presents AES arithmetic in hardware and a design for embedded systems. This chapter shows the hardware implementation results. The hardware implementation results and the design flow are described in Sec. 4.1. RTL synthesis for ASIC uses Synopsys¹ Design Compiler. The FPGA verification is discussed in Sec.
