Design of AES Crypto Engines
3.4 Median Throughput AES Engine for WLAN
The architecture of the median throughput, low cost AES engine is shown in Fig. 3.18. It contains an AES crypto core and an IO buffer. The AES crypto core consists of three major blocks: the control unit, the key expansion unit, and the integrated data process unit. The IO buffer is included because of the limitation of pin allocation.
Figure 3.18: Block diagram of the median throughput AES architecture
The key expansion unit can be implemented with buffer memories or with separate key expansion data-paths [26,32]. Both approaches require additional cost for decryption. In addition, supporting a different key expansion flow for each key length further raises the hardware cost, which makes a full-key-length AES difficult to implement.
On the other hand, the data process unit must be able to perform both encryption and decryption. The most straightforward method is to use two dedicated data process units, one for encryption and the other for decryption [26,32]. In the proposed architecture, the encryption and decryption data-paths are integrated into one data process unit with minimum area overhead.
3.4.1 Data-path Unit
As shown in Fig. 3.19, the integrated data process unit contains four basic transformation modules and controlled multiplexers that switch between encryption and decryption. The different data flows are drawn in Fig. 3.19 with different lines: the black line indicates the data flow for encryption, the gray line indicates the data flow for decryption, and the dotted line is used to support different operation modes.
Figure 3.19: Architecture of integrated data process unit
Figure 3.20: The data flow of encryption and decryption process.
To reduce hardware cost, the decryption path is combined with the encryption data-path. The SubBytes used in encryption is different from that used in decryption; therefore, a total of 32 SubBytes modules are required if they are implemented with LUTs. As a result, composite field based SubBytes is a better method to integrate the encryption and decryption data-paths [20,54], because the multiplicative inversion can be shared between SubBytes and Inv-SubBytes. To further reduce the cost of the integrated data process unit, the data flow of decryption must be modified so that the data-paths of the encryptor and decryptor can be merged.
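The sharing of the multiplicative inversion between SubBytes and Inv-SubBytes can be illustrated with a small software model. This is only a sketch under stated assumptions: the inversion is computed directly in GF(2^8) by exponentiation rather than in a composite field, and the names gf_mul, gf_inv, affine, inv_affine, and sub_byte are chosen here for illustration and do not come from the proposed design.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def gf_inv(a):
    """Shared block: multiplicative inverse computed as a^254 (0 maps to 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x):
    """Affine transformation of SubBytes as defined in FIPS-197."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(x):
    """Inverse affine transformation used by Inv-SubBytes."""
    return rotl8(x, 1) ^ rotl8(x, 3) ^ rotl8(x, 6) ^ 0x05


def sub_byte(x, decrypt=False):
    """Both directions use the same gf_inv; only the affine stage differs."""
    if decrypt:
        return gf_inv(inv_affine(x))
    return affine(gf_inv(x))
```

In this model the decrypt flag plays the role of the multiplexer control: the inversion is instantiated once, and only the two small affine stages are direction-specific.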
To integrate the data-paths of encryption and decryption for hardware resource sharing, the data flow of the decryption process is modified. As shown in Fig. 3.20, the data flow of decryption can be made exactly the same as that of encryption by changing the processing order of 1) inverse SubBytes (InvB) and inverse ShiftRows (InvS), and 2) inverse MixColumns (InvM) and AddRoundKeys (ARK). The first modification can be done without additional overhead since both transformations are byte-oriented. However, swapping InvM and ARK requires an additional Inv-MixColumns transformation in the key expansion unit. This is because InvM is defined over $GF((2^8)^4)$ but ARK is defined over $GF(2)$.
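The claim that InvB and InvS can be swapped at no cost can be checked with a short model. The sketch below assumes a 16-byte state stored column by column (byte index r + 4c for row r and column c), and it uses a placeholder byte substitution instead of the real inverse S-box, since the argument holds for any byte-wise table. All names are illustrative.

```python
def inv_shift_rows(state):
    """Rotate row r of the 4x4 state right by r positions (column-major bytes)."""
    out = [0] * 16
    for c in range(4):
        for r in range(4):
            out[r + 4 * ((c + r) % 4)] = state[r + 4 * c]
    return out


def inv_sub_bytes(state, table):
    """Apply a byte-wise substitution table to every byte of the state."""
    return [table[b] for b in state]


def commutes(state, table):
    """True if InvS followed by InvB equals InvB followed by InvS."""
    first = inv_sub_bytes(inv_shift_rows(state), table)
    second = inv_shift_rows(inv_sub_bytes(state, table))
    return first == second


# Placeholder substitution (NOT the AES inverse S-box); any byte table works.
STAND_IN_TABLE = [(7 * x + 3) % 256 for x in range(256)]
EXAMPLE_STATE = [(11 * i + 5) % 256 for i in range(16)]
```

Because one step only moves bytes and the other only replaces each byte independently of its position, the two orders always give the same state.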
The combined equations of ARK and MixColumns/Inv-MixColumns in encryption and decryption can be expressed as
\[
\begin{aligned}
\text{Encryption:}\quad & \bigl(s(x) \times a(x) \bmod (x^4 + 1)\bigr) + k(x) \\
\text{Decryption:}\quad & \bigl(s(x) + k(x)\bigr) \times a^{-1}(x) \bmod (x^4 + 1)
\end{aligned}
\]
Note that $a(x)$ and $a^{-1}(x)$ are constant polynomials over $GF((2^8)^4)$ defined in FIPS-197, and $s(x)$ and $k(x)$ represent 32-bit data blocks of processed data and round keys, respectively.
The MixColumns transformation can be represented by $s(x) \times a(x) \bmod (x^4 + 1)$ and AddRoundKeys can be expressed by $s(x) + k(x)$. By the distributive law, the equation for decryption can be reformulated as
\[
\bigl(s(x) \times a^{-1}(x) \bmod (x^4 + 1)\bigr) + \bigl(k(x) \times a^{-1}(x) \bmod (x^4 + 1)\bigr).
\]
In this way, the data flow of decryption and encryption would be exactly the same, at the cost of an additional Inv-MixColumns transformation applied to the round keys. To eliminate this additional transformation, the processing order of InvM and ARK in decryption is left unchanged. Note that the AddRoundKeys module shown in gray in Fig. 3.19 is used for decryption. This additional AddRoundKeys module can be reused to support different operation modes.
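The distributive step can be checked numerically on a single 32-bit column. The following sketch models multiplication by $a^{-1}(x)$ modulo $x^4 + 1$ with the coefficients {0E, 0B, 0D, 09} given in FIPS-197. The function names and the example key column are assumptions made for illustration.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


# First row of the circulant matrix that represents a^{-1}(x) in FIPS-197.
INV_A = (0x0E, 0x0B, 0x0D, 0x09)


def inv_mix_column(col):
    """Multiply one 4-byte column by a^{-1}(x) modulo x^4 + 1."""
    return [
        gf_mul(INV_A[0], col[i])
        ^ gf_mul(INV_A[1], col[(i + 1) % 4])
        ^ gf_mul(INV_A[2], col[(i + 2) % 4])
        ^ gf_mul(INV_A[3], col[(i + 3) % 4])
        for i in range(4)
    ]


def add_round_key(col, key):
    """AddRoundKeys on one column: byte-wise XOR."""
    return [c ^ k for c, k in zip(col, key)]


s = [0x8E, 0x4D, 0xA1, 0xBC]  # example data column
k = [0x01, 0x23, 0x45, 0x67]  # arbitrary example round key column

# ARK first, then InvM (the original decryption order).
original_order = inv_mix_column(add_round_key(s, k))
# InvM applied to data and key separately, then XOR (the swapped order).
swapped_order = add_round_key(inv_mix_column(s), inv_mix_column(k))
```

The two results are identical, which also makes the cost of the swapped order visible: the round key column has to pass through its own Inv-MixColumns.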
3.4.2 Key Expansion Unit
The proposed key expansion architecture is shown in Fig. 3.21. It can be applied to both encryption and decryption with different key lengths: 128, 192, and 256 bits.
Figure 3.21: On-the-fly key expansion unit for median throughput
The key generating algorithm for key length 256 in encryption can be modeled by the following equations:
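\[
\begin{aligned}
W'_0 &= \mathrm{Subword}(\mathrm{Rotword}(W_7)) \oplus \mathrm{Rcon} \oplus W_0 \\
W'_1 &= W'_0 \oplus W_1 \\
W'_2 &= W'_1 \oplus W_2 \\
W'_3 &= W'_2 \oplus W_3 \\
W'_4 &= \mathrm{Subword}(W'_3) \oplus W_4 \\
W'_5 &= W'_4 \oplus W_5 \\
W'_6 &= W'_5 \oplus W_6 \\
W'_7 &= W'_6 \oplus W_7
\end{aligned}
\tag{3.19}
\]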
Note that $W'_n$ are the next round key words and $W_n$ are the current round key words, and each 128-bit round key contains four round key words. Subword contains four S-boxes to substitute a word and Rotword cyclically shifts the input left by one byte. Rcon is a constant array defined in FIPS-197 [8]. Because round keys are used in reverse order in the decryption flow, the key expansion unit needs to compute these reversely ordered round keys on the fly. To do so, the last round key is needed first, and the preceding round keys are then generated from it. That is, the unit needs to compute $W_n$ from $W'_n$. The round key expansion process for decryption can be written as follows:
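\[
\begin{aligned}
W_0 &= \mathrm{Subword}(\mathrm{Rotword}(W'_6 \oplus W'_7)) \oplus \mathrm{Rcon} \oplus W'_0 \\
W_1 &= W'_0 \oplus W'_1 \\
W_2 &= W'_1 \oplus W'_2 \\
W_3 &= W'_2 \oplus W'_3 \\
W_4 &= \mathrm{Subword}(W'_3) \oplus W'_4 \\
W_5 &= W'_4 \oplus W'_5 \\
W_6 &= W'_5 \oplus W'_6 \\
W_7 &= W'_6 \oplus W'_7
\end{aligned}
\tag{3.20}
\]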
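Equations (3.19) and (3.20) can be exercised with a small software model. The sketch below is not the hardware data-path: it treats each round key word as a 32-bit integer, builds the S-box arithmetically, and uses function names (next_key_256, prev_key_256) that are chosen here for illustration. Rcon is passed in as a 32-bit word with the constant in the most significant byte.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def build_sbox():
    """Build the AES S-box: inversion in GF(2^8) followed by the affine map."""
    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF

    box = []
    for x in range(256):
        inv = 1
        for _ in range(254):  # x^254 is the inverse of x; 0 maps to 0
            inv = gf_mul(inv, x)
        box.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                   ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return box


SBOX = build_sbox()


def sub_word(w):
    """Subword: substitute each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotword: rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def next_key_256(w, rcon):
    """Eq. (3.19): compute W'_0..W'_7 from the current words W_0..W_7."""
    n = [0] * 8
    n[0] = sub_word(rot_word(w[7])) ^ rcon ^ w[0]
    for i in (1, 2, 3):
        n[i] = n[i - 1] ^ w[i]
    n[4] = sub_word(n[3]) ^ w[4]
    for i in (5, 6, 7):
        n[i] = n[i - 1] ^ w[i]
    return n


def prev_key_256(n, rcon):
    """Eq. (3.20): recover W_0..W_7 from the next words W'_0..W'_7."""
    w = [0] * 8
    for i in (7, 6, 5):
        w[i] = n[i - 1] ^ n[i]
    w[4] = sub_word(n[3]) ^ n[4]
    for i in (3, 2, 1):
        w[i] = n[i - 1] ^ n[i]
    w[0] = sub_word(rot_word(n[6] ^ n[7])) ^ rcon ^ n[0]
    return w
```

Applying prev_key_256 to the output of next_key_256 with the same Rcon word returns the original eight words, which is the property the reverse-order key generation relies on.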
Note that the first round key used in decryption is the same as the last round key used in encryption. If the key length is 256 bits, at most 14 cycles are required to produce the initial round key for the first decryption operation. To speed up the decryption process, the last round key can be stored in a buffer so that subsequent decryption operations can start immediately when the AES crypto core receives ciphertext blocks.
As defined in FIPS-197 [8], the key expansion processes for key lengths 128 and 192 are quite similar. The data flow of AES-128 is the solid line shown in Fig. 3.21, and the dashed line is the additional data flow for AES-192. The complexity of the key expansion unit rises significantly when key length 256 is considered. As shown in Eqs. (3.19) and (3.20), the round key expansion process needs two Subword modules in AES-256. The additional Subword leads to higher hardware cost and also lengthens the critical path. Since only 128 bits are required as a round key in each round, the key expansion process of AES-256 can be divided into two phases. In encryption, {$W'_0$, $W'_1$, $W'_2$, $W'_3$} are computed in the first phase and {$W'_4$, $W'_5$, $W'_6$, $W'_7$} are generated by using the same Subword module in the second phase. In decryption, {$W_4$, $W_5$, $W_6$, $W_7$} are generated in the first phase and {$W_0$, $W_1$, $W_2$, $W_3$} are generated in the second phase. Note that the $W'_6 \oplus W'_7$ required when computing $W_0$ has already been stored in $W_7$ in the first phase. The dotted line in Fig. 3.21 indicates the data flow in AES-256.
Figure 3.22: Implementation results of median throughput AES engine.
Figure 3.23: Die micrograph for median throughput AES engine
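The two-phase AES-256 schedule of Section 3.4.2 can be sketched in software to show that each phase uses the Subword module only once. This is a structural model under stated assumptions: the substitution is a placeholder that counts its own invocations and is not the AES S-box, and all class and function names are hypothetical.

```python
def rot_word(w):
    """Rotword: rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


class CountingSubWord:
    """Models the single shared Subword module and counts how often it is used."""

    def __init__(self):
        self.calls = 0

    def __call__(self, w):
        self.calls += 1
        # Placeholder byte-wise substitution (NOT the AES S-box).
        return int.from_bytes(
            bytes((7 * b + 3) % 256 for b in w.to_bytes(4, "big")), "big")


def encrypt_phase1(w, rcon, sub_word):
    """First phase of encryption: W'_0..W'_3 from the current words."""
    n0 = sub_word(rot_word(w[7])) ^ rcon ^ w[0]
    n1 = n0 ^ w[1]
    n2 = n1 ^ w[2]
    n3 = n2 ^ w[3]
    return [n0, n1, n2, n3]


def encrypt_phase2(w, low_half, sub_word):
    """Second phase of encryption: W'_4..W'_7, reusing the same Subword."""
    n4 = sub_word(low_half[3]) ^ w[4]
    n5 = n4 ^ w[5]
    n6 = n5 ^ w[6]
    n7 = n6 ^ w[7]
    return [n4, n5, n6, n7]


def decrypt_phase1(n, sub_word):
    """First phase of decryption: W_4..W_7 from the next words."""
    w7 = n[6] ^ n[7]
    w6 = n[5] ^ n[6]
    w5 = n[4] ^ n[5]
    w4 = sub_word(n[3]) ^ n[4]
    return [w4, w5, w6, w7]


def decrypt_phase2(n, high_half, rcon, sub_word):
    """Second phase of decryption: W_0..W_3; W_7 already holds W'_6 xor W'_7."""
    w7 = high_half[3]
    w0 = sub_word(rot_word(w7)) ^ rcon ^ n[0]
    w1 = n[0] ^ n[1]
    w2 = n[1] ^ n[2]
    w3 = n[2] ^ n[3]
    return [w0, w1, w2, w3]
```

In each phase only one 128-bit half is produced and Subword is called once, so a single Subword instance is sufficient as long as the two phases run in different cycles.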
3.4.3 Implementation Results
The median throughput, low cost AES engine is implemented in UMC 90 nm CMOS technology, and the implementation results are shown in Fig. 3.22. Synthesis results under different timing constraints are marked with solid circles, and the implementation result after the back-end flow is marked with a hollow circle.
This design is also fabricated in UMC 90 nm technology and the die micrograph is shown in Fig. 3.23. The core area is 0.069 mm^2, where 63% of the fabricated chip is the AES core and 37% is the IO buffer. The average power consumption of this chip is 5.02 mW when operating at 131.8 MHz. Fig. 3.24 shows the Shmoo plot for the maximum operating frequency under different conditions. The maximum operating frequency is 145 MHz at a supply voltage of 1.1 V and 106 MHz at a supply voltage of 0.9 V. The chip has been fully tested with all FIPS-197 test patterns and with random patterns.
Figure 3.24: Shmoo plot for median throughput AES engine
Table 3.3: Comparison between median throughput AES engines
Design          Technology  Frequency  Throughput  Gates   Power  Mbps/K-gates
                            (MHz)      (Gb/s)      (10^3)  (mW)
Proposed        90 nm       131.8      1.69        15.58   5.02   108.5
Gürkaynak [26]  0.25 µm     166        2.12        119     600    17.8
Hodjat [29]     0.18 µm     330        3.84        54      79     48.6
Lin [32]        0.13 µm     333        4.27        40.9    86.2   49.5
In Table 3.3, the proposed design is compared with other designs in terms of throughput and hardware cost. Only designs with measured results are listed in this table. The normalized performance metric, Mbps/K-gate, is also listed to compare the area efficiency of the designs.
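The normalized metric can be reproduced from the measured numbers with a few lines of arithmetic. The sketch below uses the throughput and gate counts reported for the proposed design and for Gürkaynak [26]. The throughput estimate additionally assumes 10 clock cycles per 128-bit block, which is an assumption made here because it matches the reported 1.69 Gb/s at 131.8 MHz, not a figure stated in the text.

```python
def mbps_per_kgate(throughput_gbps, kgates):
    """Normalized performance: throughput in Mb/s divided by gate count in K-gates."""
    return throughput_gbps * 1000.0 / kgates


def throughput_gbps(freq_mhz, cycles_per_block, block_bits=128):
    """Throughput in Gb/s for one block of block_bits every cycles_per_block cycles."""
    return block_bits * freq_mhz / cycles_per_block / 1000.0


# Proposed design: 1.69 Gb/s with 15.58 K-gates.
proposed_metric = round(mbps_per_kgate(1.69, 15.58), 1)

# Gürkaynak [26]: 2.12 Gb/s with 119 K-gates.
gurkaynak_metric = round(mbps_per_kgate(2.12, 119), 1)

# Assumed 10 cycles per block at 131.8 MHz (assumption, see lead-in).
estimated_throughput = round(throughput_gbps(131.8, 10), 2)
```

A higher Mbps/K-gate value means that more throughput is delivered per unit of hardware, which is the sense in which the metric is used for comparison.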
Gürkaynak [26] proposed a full-duplex design which can achieve throughput of up to 4.24 Gb/s in ECB and CBC modes and up to 2.12 Gb/s in OFB and CFB modes. However, the encryptor and the decryptor are not easy to operate in parallel, because at least 768 IO pins are required for both to operate at the same time. Hodjat [29] proposed a design that adopts table look-up based S-boxes to shorten the critical path, which also leads to higher hardware cost. Lin [32] proposed a two-stage pipelined architecture with high throughput. However, the separately designed encryptor and decryptor result in much higher hardware cost. Moreover, the pipelined architecture also limits the implementation of feedback operation modes such as OFB and CFB.
Figure 3.25: Summary of area throughput trade-offs of AES engines.
3.5 Summary
In this chapter, different architectures of AES engines are proposed for different applications.
Fig. 3.25 shows the area and throughput trade-offs for different architectures. The detailed implementation technology and results can be found in previous sections. Note that designs with synthesis results are indicated by solid marks while designs with implementation results are indicated by hollow marks.
For designs with throughput higher than 10 Gb/s, the proposed design achieves the highest throughput when hardware cost is not considered. For designs with median throughput, the proposed AES engine has almost the smallest hardware cost. Satoh's design [20] is smaller than the proposed one because it supports only AES-128 and ECB mode, whereas the proposed design supports AES-128, AES-192, and AES-256 under all modes of operation. Finally, for the low cost AES engine, the proposed design also outperforms the others in terms of equivalent gate count.
Table 3.4: Design summary on different AES architectures
Implementation Technique  High Throughput  Area Efficient   Low Cost
Unrolling                 2 - fully        None             None
Pipelining                Yes              No               No
Data Width                128              128              8
Key Expansion             Off-line         On-the-fly       On-the-fly
SubBytes Implementation   LUT              Composite Field  Composite Field
In this section, we also give a brief summary of design selection under different considerations in Table 3.4. This table lists the suggested implementation techniques for each architecture.