High-Performance Elliptic Curve Cryptographic Processor with Side-Channel Attack Resistance


Student: Jen-Wei Lee (李人偉)

Advisor: Chen-Yi Lee (李鎮宜)

National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

Doctoral Dissertation

A Dissertation

Submitted to the Department of Electronics Engineering and Institute of Electronics,

College of Electrical and Computer Engineering, National Chiao Tung University,

in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy in

Electronics Engineering

June 2013 Hsinchu, Taiwan

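High-Performance Elliptic Curve Cryptographic Processor with Side-Channel Attack Resistance

Student: Jen-Wei Lee  Advisor: Dr. Chen-Yi Lee

National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

Abstract (Chinese version)

Today, electronic communication has brought society great convenience and rapid growth in information exchange, and the corresponding need to protect personal information keeps growing. In information security, conventional symmetric-key cryptosystems can protect data confidentiality by encrypting at the user end, but this is not enough to solve the problems of key distribution, plaintext integrity, and unauthorized use. Asymmetric-key cryptosystems, also called public-key cryptosystems, were developed to meet these needs. In the past few years, elliptic curve cryptography has been proposed as a practical method offering higher security than the conventional RSA algorithm, but a suitable design approach for elliptic curve cryptographic processors has not yet appeared. In this dissertation, we explore cryptographic processor design from a system perspective, from the algorithm at the top level, through the architecture of the hardware arithmetic units, down to the circuit design. To pursue high hardware performance, we adopt design techniques that improve hardware speed, hardware complexity, and energy consumption. In addition, a proper cryptographic processor must also include defenses against side-channel attacks. Avoiding the leakage of key-related information during hardware computation, without an excessive increase in hardware complexity from the defense, is both the design challenge and the goal of our circuit implementation. Accordingly, we propose new design methods, including randomized computation and key-independent hardware scheduling. Besides being suitable for system integration, the design requires no extra parameters or offline computation, so the hardware computation can comply with standardized specifications; another advantage is that its hardware cost for side-channel attack defense is lower than that of previous work. To provide more robust protection, we also propose a new design method for an on-chip true random number generator that supplies sufficient randomness for the randomized computation. With these methods, our elliptic curve cryptographic processor architecture outperforms previous work in both hardware performance and side-channel attack defense. To further demonstrate our contributions, we fabricated chips for various applications in the UMC 90-nm process.

The first is a 0.41 mm² 160-bit elliptic curve cryptographic processor that completes one elliptic curve scalar multiplication in 0.34 ms with 11.7 µJ over $GF(p_{160})$ and in 0.29 ms with 9.3 µJ over $GF(2^{160})$; this hardware performance makes it suitable for mobile communication products. The second is a 521-bit elliptic curve cryptographic processor that completes one elliptic curve scalar multiplication in 3.40 ms over $GF(p_{521})$ and in 2.77 ms over $GF(2^{521})$; through elliptic curve point generation, it halves the amount of public-key data transmitted. This design is the fastest elliptic curve cryptographic processor to date and is suitable for high-speed cloud server applications. The third is a 192-bit elliptic curve cryptographic processor operating at a low voltage of 0.5 V and a low clock frequency of 25 MHz, which completes one elliptic curve scalar multiplication in 10.8 ms with 438 µJ over $GF(p_{192})$ and in 9.2 ms with 437 µJ over $GF(2^{192})$; this low energy consumption makes it suitable for future Internet of Things products. Finally, the security of all these chips was verified by side-channel attack measurements that collected millions of power traces.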

High-Performance Elliptic Curve Cryptographic

Processor with Side-Channel Attack Resistance

Student：Jen-Wei Lee Advisors：Dr. Chen-Yi Lee

Department of Electronics Engineering and Institute of Electronics

National Chiao Tung University

ABSTRACT

Nowadays, the rapid development of network communication in the electronics industry has brought people a quick and convenient life, while the demand for protecting personal private data from disclosure has increased significantly as well. In security, the conventional symmetric-key scheme can perform encryption locally, but the decryption key and ciphertext still need to be sent without disclosure, modification, duplication, forgery, or unauthorized access. The asymmetric-key scheme, also called public-key cryptosystems (PKC), was developed to satisfy these requirements. In recent years, a newer approach, elliptic curve cryptography (ECC), has been adopted in several applications to secure information exchange. However, a suitable ECC processor solution has not appeared so far.

In this dissertation, we investigate the design of a crypto engine from a system view, from the top down, including the algorithm, operation scheduling, processing-element architecture, and circuit-level implementation. To achieve a high-performance accelerator, several techniques for improving hardware speed, hardware complexity, and power consumption are proposed. Besides, to deliver a decent crypto engine, device security, such as the countermeasure against side-channel attacks (SCAs), is also a design target. Together, these design issues pose a big challenge: the device must be implemented without key-dependent processed data and without much overhead for SCA resistance.

Accordingly, we propose a new design method, based on randomized computation and key-independent scheduling, to protect the private data stored in the device from side-channel information leakage. Its feature is that it is suitable for system integration and for use with the standard without any pre-computation. Another advantage is that the overhead of the protected design is lower than that of related previous works. The robustness of the SCA resistance is examined by exploiting an on-chip true-random number generator (TRNG) with sufficient randomness. Moreover, the corresponding hardware architecture is introduced, and our ECC processor outperforms other approaches in both hardware efficiency and protection against SCAs.

To further show our contributions, we apply our research to several standard applications. Fabricated in UMC 90-nm CMOS technology, a 0.41 mm² 160-bit ECC chip achieves 0.34/0.29 ms and 11.7/9.3 µJ for one $GF(p)$/$GF(2^m)$ elliptic curve scalar multiplication (ECSM), which is effective in hardware cost and suitable for mobile devices; a 521-bit ECC chip performs each $GF(p_{521})$ ECSM in 3.40 ms and each $GF(2^{521})$ ECSM in 2.77 ms, and it saves 50% of the public-key data transmission through on-chip elliptic curve point generation (ECPG). This is the fastest design and is also applicable to cloud computing; a 192-bit ECC chip achieves 10.8/9.2 ms and 438/437 µW for $GF(p_{192})$/$GF(2^{192})$ ECSM at a scaled 0.5 V and 25 MHz, which is efficient in power consumption and suitable for Internet of Things (IoT) applications. In addition, the SCA resistance of each design is demonstrated by millions of measurements.

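Acknowledgments

During my doctoral studies, I was fortunate to have the education provided by NCTU, the guidance of many teachers, the help of friends, and the support of my family, which carried me through every challenge and allowed me to complete my doctoral degree. I am deeply grateful to my co-advisors, Prof. Chen-Yi Lee (李鎮宜) and Prof. 張錫嘉, who gave me broad freedom in research and abundant software and hardware resources. They also actively led me to take part in international academic activities related to my research and emphasized the importance of exercise, health, and social activities. Over these years of research, I also learned from them how to deal with people and how to cultivate vision and foresight, so that my doctoral years brought more well-rounded growth. I also thank Prof. 陳榮傑 for patiently helping me build a foundation in mathematical theory while I worked on my research papers. In addition, I thank the members of my oral defense committee, Prof. 吳安宇, Prof. 周世傑, Prof. 謝明得, Prof. 蔡宗漢, Prof. 黃元豪, and Prof. 賴伯承, for taking time from their busy schedules to attend my defense, for their many valuable suggestions, for showing me different aspects of the research, and for inspiring my future research directions. Next, I thank all members of the SI2 laboratory and the Ocean research group for letting me learn and grow here and gain much valuable experience. In my studies, discussions with everyone made the research more complete; in daily life, because of all of you, the road of research challenges was filled with more laughter and unforgettable sweet memories. I thank the NCTU Department of Electronics Engineering and Institute of Electronics for the doctoral scholarship, which let me concentrate fully on finishing my doctoral research as quickly as possible and on continually expanding my professional abilities. I also thank the National Science Council of the Executive Yuan for its subsidy for attending international conferences, which allowed me not only to absorb research information from outstanding scholars around the world but also to raise Taiwan's visibility in international academic research. I thank my seniors 鍾菁哲, 許騰仁, 林建青, 游瑞元, 陳志龍, 林義閔, and 余建螢, and Mr. 涂博銘, for helping me understand the chip fabrication process and verify system behavior. I thank my senior 范銘隆, who has been my best mentor from my first conference trip abroad until my graduation. I am also grateful to the Lord of Heaven (天公伯) and the NCTU Earth God (土地公), who always gave me good luck at critical moments. I sincerely offer my most heartfelt thanks to everyone in my extended family in Taipei and Yilan; thank you for caring for me and accompanying me since childhood. I especially thank my father and mother for their tireless efforts in raising me; my elder brother, for sharing my joys and sorrows and for his financial help; and my second maternal uncle and my youngest maternal aunt, who I know have always cared about me and treated me like their own child. Thanks to the understanding of my whole family and your unreserved devotion and care, I could complete my doctoral studies without worry. I dedicate this dissertation to you as an achievement shared by our whole family. Finally, I sincerely pray that Heaven blesses Taiwan, this beautiful island, and I wish to work together with everyone who loves this land. Go, everyone! Go, Taiwan!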

List of Figures

1.1 A model of network security.

1.2 A model of symmetric-key encryption.

1.3 A simplified model of PKC.

1.4 Research review of ECC hardware implementation.

1.5 Our top-to-down design methods of SCA-resistant ECC processor.

2.1 Security comparison of ECC versus RSA.

2.2 The data flow of each AES mode, where the nonce and initial vector (IV) are an arbitrary number and a secret value, respectively. The functional notation $MSB_{Tlen}$/$LSB_{Tlen}$ denotes the most/least significant Tlen bits of the data, and Tlen/Clen is the bit length of the MIC/ciphertext.

2.3 A message can be securely sent to a specific party based on the MK by using both asymmetric-key and symmetric-key algorithms, without prior knowledge of the encryption and decryption keys.

3.1 Scenario of side-channel attacks on hardware device.

3.2 Power consumption of CMOS circuits with supply voltage $V_{dd}$ and leakage current $I_{leak}$.

3.3 (a) Environment of power measurement. (b) Current running through the chip is recorded by measuring the voltage drop via a resistor in series with the core power and supply power.

3.4 SPA attacks on the unprotected ECC chip using LR-DA binary method of ECSM, where the power traces are recorded by 50.0 mV/div voltage resolution and 2.0 ms/div time base.

3.6 Correlation analysis obtained from an unprotected ECC chip by conducting the DPA attacks.

3.7 Correlation analysis obtained from an unprotected ECC chip by conducting the ZPA attacks.

3.8 Example of the CPA attacks for the LR-DAA ECSM.

3.9 Correlation analysis obtained from an unprotected ECC chip by conducting the CPA attacks.

4.1 Example of randomized Montgomery operations.

4.2 The domain conversion can be achieved in pre/post-process stage, where this overhead of several modular operations can be neglected for overall ECSM.

4.3 Generating a random EC point over DFs.

4.4 Computing a square root over DFs.

5.1 RO-RNG circuit, where the frequency of ring oscillator (RO), $f_1$, is faster than that of sampling clock, $f_2$.

5.2 RO-RNG with jitter amplifier.

5.3 Random normal distribution of clock jitter for the sampled sequence.

5.4 Proposed method to amplify jitter with configurable delay cell.

5.5 On-the-fly generation of control signals based on the LFSR.

5.6 Elementary IHDC.

5.7 On-off switch circuit.

5.8 The transition waveform of CIHDC, where one CIHDC can increase jitter by several tens of picoseconds at rising and falling edge.

5.9 The die photo, where D1 and D2 are the RO-RNG with and without jitter amplifier, respectively.

5.10 3-level IHDC.

5.11 On-off switch circuit of 3-level IHDC.

5.12 Layout of 3-level CIHDC.

6.1 Block diagram of our DF-ECC processor.

6.3 (a) Data path separation of UV comparison and RS calculation. (b) The fully-pipelining scheme of hardware implementation for the proposed radix-4 RMD in Algorithm 7.

6.4 The overall DF modular operations are integrated into a fully-pipelined GFAU.

6.5 The priority-oriented scheduling for (a) conventional RL-DAA ECSM and (b) modified RL-DAA ECSM, where the solid line is the ECPD operation flow and the dash line is the ECPA operation flow.

6.6 Two-level memory hierarchy for heterogeneous two-PE architecture.

6.7 Example of data access sequences MOV GFAU(R reg) to MAS(S reg) and MOV MAS(R reg) to GFAU(S reg) (a) without and (b) with local memory synchronization scheme. The data transitions through MEM for interleaved processing in (a) can be eliminated in (b).

7.1 Detailed data flow for the proposed priority-oriented scheduling of ECSM calculation over DFs.

7.2 SPA attacks on the protected ECC chip using LR-DAA binary method of ECSM, where the power traces are recorded by 50.0 mV/div voltage resolution and 2.0 ms/div time base.

7.3 DPA attacks on protected ECC device processing ECSM with randomized computation, where the random sequence fails NIST SP 800-22 test suite.

7.4 DPA attacks on protected ECC device processing ECSM with randomized computation, where the random sequence passes NIST SP 800-22 test suite.

7.5 Correlation analysis obtained from a protected ECC chip by conducting the ZPA attacks.

7.6 Correlation analysis obtained from a protected ECC chip by conducting the CPA attacks.

7.7 System architecture of soECC-B.

7.8 Chip micrograph of our 521-bit DF-ECC processor, where soECC-B is shown in (b).

7.9 System architecture of soECC-P.

7.10 Shmoo plot for the measurement results of chip soECC-P.

7.11 Chip micrograph of our 160-bit DF-ECC processor, soECC-P.

7.12 System architecture of soECC-S.

7.13 Shmoo plot for the measurement results of chip soECC-S.

7.14 Chip micrograph of our 521-bit DF-ECC processor, soECC-S.

7.15 System architecture of our CE.

7.16 The power consumption of CE chip working at different supply voltage and operation frequency.

7.17 Chip micrograph of our CE cooperating with embedded processor and other components, such as data memory (DM), program memory (PM), sensing interface, and bio-signal processing module.

7.18 Layout view of our 521-bit DF-ECC processor, ECC-DF521.

List of Tables

2.1 Formulas of EC Point Calculation (ECPC) in Affine Coordinates

2.2 Operations for ECPC over DFs in Various Coordinates

4.1 Operations in Randomized Montgomery Domain

4.2 Analysis of Various Division Algorithms

5.1 Functionality of CIHDC

6.1 Implementation Results of $GF(p_{256})$ GFAU and MAS on Xilinx Virtex-II FPGA Device with Comparison

6.2 Architecture for Parallel Computing $GF(p)$ Square Roots

6.3 Architecture for Parallel Computing $GF(2^m)$ Square Roots

7.1 Time Analysis of Proposed Priority-Oriented Scheduling

7.2 Implementation Analysis for Different DF-ECC Designs

7.3 Chip Summary of soECC-B

7.4 Chip Summary of soECC-P

7.5 Chip Summary of soECC-S

7.6 Chip Summary of soECC-G

7.7 Comparison Among Previous Approaches for $GF(p)$

7.8 Comparison Among Previous Approaches for $GF(2^m)$

7.9 Comparison Among Previous Approaches

Glossary

ADD – Addition.

AES – Advanced Encryption Standard.

ALU – Arithmetic Logic Unit.

ASIC – Application-Specific Integrated Circuit.

CBC-MAC – Cipher Block Chaining Message Authentication Code.

CCM – CTR with CBC-MAC.

CE – Crypto Engine.

CIHDC – Configurable Interlaced Hysteresis Delay Cell.

CMAC – Cipher-based Message Authentication Code.

CPA – Collision Power-Analysis.

CTR – Counter.

DES – Data Encryption Standard.

DF – Dual Field or Dual-Field.

DF-ECC – Dual-Field Elliptic Curve Cryptography (or Cryptographic).

DHK – Diffie-Hellman Key.

DLP – Discrete Logarithm Problem.

DPA – Differential Power-Analysis.

DSA – Digital Signature Algorithm.

EC – Elliptic Curve.

ECC – Elliptic Curve Cryptography (or Cryptographic).

ECDLP – Elliptic Curve Discrete Logarithm Problem.

ECIES – Elliptic Curve Integrated Encryption Scheme.

ECPA – Elliptic Curve Point Addition.

ECPC – Elliptic Curve Point Calculation.

ECPD – Elliptic Curve Point Doubling.

ECPG – Elliptic Curve Point Generation.

ECPS – Elliptic Curve Point Subtraction.

ECSM – Elliptic Curve Scalar Multiplication.

ECSP-DSA – Elliptic Curve Signature Primitive Digital Signature Algorithm.

ECSP-NR – Elliptic Curve Signature Primitive Nyberg-Rueppel.

ECSVDP-DH – Elliptic Curve Secret Value Derivation Primitive Diffie-Hellman.

ECSVDP-MQV – Elliptic Curve Secret Value Derivation Primitive Menezes-Qu-Vanstone.

FHE – Fully Homomorphic Encryption.

FPGA – Field-Programmable Gate Array.

$GF(2^m)$ – Notation of “Galois field with characteristic 2 and degree m” or “extension binary field”.

GFAU – Galois Field Arithmetic Unit.

$GF(p)$ – Notation of “Galois field with characteristic p” or “prime field”.

HT – Half Trace.

IBE – ID-based Encryption.

IC – Integrated Circuit.

IEEE – Institute of Electrical and Electronics Engineers.

IHDC – Interlaced Hysteresis Delay Cell.

IoT – Internet of Things.

JS-GFAU – Jacobi Symbol and Galois Field Arithmetic Unit.

KO – Karatsuba-Ofman.

LR-DA – Left-to-Right Double-and-Add.

LR-DAA – Left-to-Right Double-and-Add-Always.

LR-DAS – Left-to-Right Double-and-Add/Subtract.

LS – Lucas Sequence.

m – Notation of “bit size of the operating field length”.

MAS – Multiplier-Adder/Subtractor.

MD – Modular Division.

MK – Master Key.

MM – Modular Multiplication.

MS – Modular Squaring.

n – Notation of “maximum bit size of the operating field length”.

ONB – Optimal Normal Basis.

p – Notation of “prime”.

PKC – Public-Key Cryptosystems.

PRNG – Pseudo-Random Number Generator.

RADD – Randomized Addition.

RFID – Radio-Frequency Identification.

RL-DAA – Right-to-Left Double-and-Add-Always.

RMD – Randomized Montgomery Division.

RMM – Randomized Montgomery Multiplication.

RNG – Random Number Generator.

RNS – Residue Number System.

RO – Ring Oscillator.

RO-RNG – Ring-Oscillator-based Random Number Generator.

RSUB – Randomized Subtraction.

SCA – Side-Channel Attack.

SPA – Simple Power-Analysis.

SUB – Subtraction.

$T_{ADD}$ – Notation of “computation time of ADD”.

$T_{MD}$ – Notation of “computation time of MD”.

$T_{MM}$ – Notation of “computation time of MM”.

TRNG – True-Random Number Generator.

$T_{SUB}$ – Notation of “computation time of SUB”.

Chapter 1 Introduction

A general model of network security is shown in Figure 1.1, where a message is transferred from one party to another across some form of wireless communication or Internet service. The two parties, who are the principals in this transaction, must cooperate for the exchange to take place. A logical information channel is established by defining a communication protocol such as GSM, Wi-Fi, or TCP/IP. A trusted third party may be needed to achieve secure transmission. For example, a third party may be responsible for distributing the secret information to the two principals while keeping it from any opponent, or it may be needed to arbitrate disputes between the two principals concerning the authenticity of a message transmission.

Symmetric-key encryption is a form of cryptosystem in which encryption and decryption are performed with the same key. It is also known as conventional encryption. Symmetric-key encryption transforms a security-related message, the plaintext, into ciphertext using a secret key and an encryption algorithm, such as the stream cipher RC4 [1], the block cipher DES (Data Encryption Standard) [2], or the block cipher AES (Advanced Encryption Standard) [3]. Using the same key and a decryption algorithm, the plaintext is recovered from the ciphertext. Traditional symmetric-key ciphers use substitution and/or transposition techniques. Substitution techniques map plaintext elements into ciphertext elements (each letter retains its position but changes its identity). Transposition techniques systematically transpose the positions of plaintext elements (each letter retains its identity but changes its position). An example model of conventional encryption is shown in Figure 1.2.
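The difference between substitution and transposition can be illustrated with a minimal Python sketch. The Caesar-style shift and the column-read transposition below are toy stand-ins chosen for illustration only; the function names are hypothetical, and neither cipher is RC4, DES, or AES.

```python
import string

def substitute(plaintext: str, shift: int) -> str:
    """Substitution: each letter keeps its position but changes its identity."""
    alphabet = string.ascii_lowercase
    table = str.maketrans(alphabet, alphabet[shift:] + alphabet[:shift])
    return plaintext.translate(table)

def transpose(plaintext: str, columns: int) -> str:
    """Transposition: each letter keeps its identity but changes its position.

    The text is written row by row into `columns` columns and read column by column.
    """
    return "".join(plaintext[c::columns] for c in range(columns))

message = "attackatdawn"
substituted = substitute(message, 3)   # same positions, different letters
transposed = transpose(message, 3)     # same letters, different positions
print(substituted)
print(transposed)
```

The substituted text contains different letters in the original order, while the transposed text is a permutation of the original letters.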

[Figure 1.1 diagram: Sender, Recipient, Information Channel, Message Transformation, Secure Message, Trusted Third Party, Opponent]

Figure 1.1: A model of network security.

[Figure 1.2 diagram: Sender, Recipient, Information Channel, Message Transformation, Secure Message, Sender's Secret Key]

Figure 1.2: A model of symmetric-key encryption.

On the other hand, asymmetric-key encryption was developed to perform encryption and decryption with two different keys, one public and the other private. It is also known as public-key cryptosystems (PKC). In contrast to symmetric-key encryption, asymmetric-key algorithms are based on mathematical functions rather than on substitution and transposition. Although asymmetric-key ciphers can achieve the same encryption function, symmetric-key ciphers will not be abandoned because of the computational overhead of current asymmetric-key encryption schemes. Usually, symmetric-key and asymmetric-key encryption schemes are used together in a security system. For example, a short message such as an encryption key is securely generated over an open channel by asymmetric-key ciphers from the combination of the recipient's public key and the sender's private key, and then a long message such as the plaintext is scrambled by symmetric-key ciphers. In this case, only the intended recipient, who has the correct private key, can unscramble the ciphertext; the decryption key is obtained in a similar way from the combination of the sender's public key and the recipient's private key. In addition to message encryption, owing to the flexibility of the public/private key pair, PKC have profound consequences for confidentiality and authentication. More practical examples can be found in standardized applications such as WPAN, NFC, SSL, and PGP. An example model of PKC is shown in Figure 1.3. Note that PKC are an efficient approach to the key-distribution problem of symmetric-key encryption, where the encryption key is usually assumed to be unknown to the recipient before the communication session is established.
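The combined use of both schemes can be sketched as follows. This is a toy illustration under stated assumptions: the key agreement is classic Diffie-Hellman over integers modulo the Mersenne prime 2^127 - 1 (not an elliptic curve and not secure at this size), the symmetric cipher is an XOR with a SHA-256 counter keystream standing in for a real cipher such as AES, and all names and key values are hypothetical.

```python
import hashlib

# Toy public parameters (assumption: illustrative only, far too weak for real use).
P = 2**127 - 1   # a Mersenne prime
G = 3

def public_key(private: int) -> int:
    return pow(G, private, P)

def shared_key(own_private: int, peer_public: int) -> bytes:
    """Derive the symmetric key from one's own private key and the peer's public key."""
    secret = pow(peer_public, own_private, P)
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Stand-in symmetric cipher: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)

sender_priv, recipient_priv = 0x1234567, 0x7654321
sender_pub, recipient_pub = public_key(sender_priv), public_key(recipient_priv)

# Sender: recipient's public key + own private key -> encryption key.
plaintext = b"a long plaintext message"
ciphertext = xor_stream(shared_key(sender_priv, recipient_pub), plaintext)

# Recipient: sender's public key + own private key -> the same key.
recovered = xor_stream(shared_key(recipient_priv, sender_pub), ciphertext)
print(recovered)
```

Only the short key agreement uses the expensive asymmetric operation; the long message is processed by the cheap symmetric step.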

[Figure 1.3 diagram: Sender, Recipient, Information Channel, Message Transformation, Secure Message, Sender's Private Key, Recipient's Public Key, Recipient's Private Key, Sender's Public Key]

Figure 1.3: A simplified model of PKC.

The traditional practical method for PKC is RSA [4], which was publicly described in 1977 by Ron Rivest, Adi Shamir, and Leonard Adleman. The difficulty of attacking RSA rests on the hard problem of finding the large prime factors of a composite number. To provide sufficient security, the key size is usually chosen to be several thousand bits. This large key size results in high computational complexity and is inconvenient in practical implementations. For these reasons, in 1985, elliptic curve cryptography (ECC) was independently proposed by Victor Miller [5] and Neal Koblitz [6] as an alternative scheme for PKC. Its security is based on the hardness of a different problem, namely the elliptic curve discrete logarithm problem (ECDLP). Currently, the best algorithms known to solve the ECDLP have fully exponential running time, in contrast to the subexponential-time algorithms known for integer factorization. This means that a desired security level can be attained with significantly smaller keys in elliptic curve systems than in RSA. For instance, it is generally accepted that a 160-bit elliptic curve key provides the same level of security as a 1024-bit RSA key. The advantages gained from the smaller key size include speed and efficient use of power, transmission bandwidth, and memory storage.
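To make the ECDLP concrete, the sketch below implements scalar multiplication on a toy curve and recovers the scalar by exhaustive search. The curve $y^2 = x^3 + 2x + 2$ over $GF(17)$ with base point (5, 1) of order 19 is a small textbook example assumed here for illustration; it is not one of the curves used in this dissertation, and the function names are hypothetical.

```python
# Toy curve y^2 = x^3 + 2x + 2 over GF(17); the base point G has prime order 19.
P_MOD, A = 17, 2
G = (5, 1)

def ec_add(p1, p2):
    """Affine point addition; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                   # P + (-P) = infinity
    if p1 == p2:                                      # point doubling
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:                                             # point addition
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k, point):
    """Left-to-right double-and-add: computes k * point."""
    result = None
    for bit in bin(k)[2:]:
        result = ec_add(result, result)
        if bit == "1":
            result = ec_add(result, point)
    return result

def brute_force_ecdlp(target, point):
    """Find k with k * point == target by trying k = 0, 1, 2, ... in turn."""
    acc, k = None, 0
    while acc != target:
        acc, k = ec_add(acc, point), k + 1
    return k

public = scalar_mult(13, G)              # easy direction: a few group operations
print(public, brute_force_ecdlp(public, G))
```

On this 19-element group the search is trivial; for a group of roughly $2^{160}$ elements no known general algorithm does fundamentally better than about $2^{80}$ group operations, which is the source of the small key sizes mentioned above.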

The security protocol based on ECC schemes was specified in the IEEE standard IEEE P1363 [7] in 2000, with an extended version [8] appearing in 2004. ECC is also applied in commercial electronic products using IEEE 802.15.4 [9] and IEEE 802.15.6 [10]. These are the physical-layer standards for the existing low-power, low-cost solutions ZigBee [11], Bluetooth low energy [12], and wireless body area networks [13], which are extensively used in industry, business, and medical treatment, respectively. Moreover, a high-speed crypto engine is indispensable for the ubiquitous applications of computing servers. The hardware accelerator for ECC, the so-called ECC processor, is dedicated to reducing the system delay caused by the high computational complexity of ECC functions. Beyond conventional performance-oriented design methods, a current issue in delivering a decent crypto engine is device security, such as protection against side-channel attacks (SCAs). In [14], a demonstration shows that the key in a circuit device can be easily broken by power measurement, while few suitable solutions for SCA resistance in ECC processors have been found so far.

1.1 Previous Works

1.1.1 Elliptic Curve Cryptographic (ECC) Processor

To date, several ECC hardware implementations have been published [15–28]. To save hardware complexity, a single-finite-field architecture for either the prime field $GF(p)$ [17, 19, 21, 26, 29] or the extension binary field $GF(2^m)$ [15, 25, 27], or a fixed-modulus approach on specific elliptic curves (ECs) [20–22], can be used. However, the applications of IEEE P1363, including digital signatures, are specified to support dual-field (DF) functions on arbitrary ECs. Exploiting carry-save adder trees in word-based multipliers is a common technique for integrating the DF data path [16, 24, 28], but the limited integration of distinct arithmetic units still results in large hardware cost.

In general, a $GF(2^m)$ design is faster than a $GF(p)$ design because addition over $GF(2^m)$ is carry-free. Besides, there are some well-known techniques for high-speed $GF(2^m)$ ECC design. A divide-and-conquer algorithm, Karatsuba-Ofman (KO) multiplication [30], is applied to reduce the number of bit operations. Classical methods for multiplying two $m$-bit polynomials require $O(m^2)$ bit operations; the KO algorithm reduces this to $O(m^{\log_2 3})$. When the polynomial modulus is fixed, the reduction over $GF(2^m)$ is simple [31], and the throughput of the KO multiplier can be raised by adopting a fully pipelined architecture [32, 33]. Another design technique over $GF(2^m)$ with a fixed polynomial modulus is fast squaring [34]. The binary representation of a polynomial $a(z)^2$ is obtained by inserting a zero bit between consecutive bits of the binary representation of $a(z)$. Thus the computational complexity is dominated by the reduction over $GF(2^m)$, which is easily achieved by a combinational circuit using only exclusive-OR gates. In contrast to the standard (polynomial) representation of elements over $GF(2^m)$, the optimal normal basis (ONB) representation [7, 34] benefits squaring, which can be achieved by simple shift operations. However, the computing overhead of converting between the standard and ONB representations is unavoidable.
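The two $GF(2^m)$ techniques above can be sketched in software, with polynomials stored as Python integers (bit $i$ is the coefficient of $z^i$). This is an illustrative model only, not the pipelined hardware of [32, 33]; the function names and the recursion threshold of 8 bits are assumptions, and the modulus is the NIST B-163 pentanomial $z^{163} + z^7 + z^6 + z^3 + 1$, chosen as a familiar fixed modulus.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiplication in GF(2)[z]: O(m^2) bit operations."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def karatsuba(a: int, b: int, bits: int) -> int:
    """Karatsuba-Ofman: three half-size products instead of four."""
    if bits <= 8:                                   # assumed recursion threshold
        return clmul(a, b)
    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = karatsuba(a_lo, b_lo, half)
    hi = karatsuba(a_hi, b_hi, bits - half)
    mid = karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, bits - half)
    # Addition and subtraction are both XOR, so the cross term is mid ^ lo ^ hi.
    return (hi << (2 * half)) ^ ((mid ^ lo ^ hi) << half) ^ lo

def square(a: int) -> int:
    """Fast squaring: insert a zero bit between consecutive bits of a."""
    result, position = 0, 0
    while a:
        result |= (a & 1) << (2 * position)
        a >>= 1
        position += 1
    return result

def reduce_poly(value: int, modulus: int) -> int:
    """Reduce a polynomial modulo a fixed modulus using shifts and XORs only."""
    degree = modulus.bit_length() - 1
    while value.bit_length() > degree:
        value ^= modulus << (value.bit_length() - 1 - degree)
    return value

MODULUS = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
a = (1 << 162) | 0x1234567890ABCDEF
b = (1 << 160) | 0xFEDCBA9876543210F
product = reduce_poly(karatsuba(a, b, 163), MODULUS)
a_squared = reduce_poly(square(a), MODULUS)
```

The squaring routine contains no multiplication at all, which is why its cost is dominated by the reduction step.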

For arithmetic over $GF(p)$, the residue number system (RNS) [35, 36], based on the Chinese remainder theorem, represents a large integer using a set of smaller integers so that computation may be performed more efficiently. This shortens the long delay in the data path of a carry-propagation adder, and multiple multipliers can be implemented in parallel. RNS implementations bear the extra cost of an input converter (binary-to-RNS) that translates numbers from a standard binary format into residues and an output converter (RNS-to-binary) that translates from RNS back to a binary representation. An RNS implementation applied to a $GF(p)$ ECC processor is presented in [23], where a data-flow-graph technique is also used to optimize the ECC functions.
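A minimal sketch of RNS arithmetic follows. The four 8-bit-scale moduli are an assumption chosen only because they are pairwise coprime; the design in [23] uses its own bases, and the function names are hypothetical.

```python
from math import prod

# Pairwise coprime moduli (assumption: illustrative sizes only).
MODULI = (251, 253, 255, 256)
RANGE = prod(MODULI)            # results are unique modulo this product

def to_rns(x: int) -> tuple:
    """Input converter (binary-to-RNS)."""
    return tuple(x % m for m in MODULI)

def rns_add(a: tuple, b: tuple) -> tuple:
    """Each channel is independent: no carry crosses between residues."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))

def rns_mul(a: tuple, b: tuple) -> tuple:
    """Channel-wise products could run on parallel small multipliers."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))

def from_rns(residues: tuple) -> int:
    """Output converter (RNS-to-binary) via the Chinese remainder theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = RANGE // m
        total += r * partial * pow(partial, -1, m)
    return total % RANGE

x, y = 12345, 6789
print(from_rns(rns_mul(to_rns(x), to_rns(y))), x * y)
```

The arithmetic itself touches only small residues; the cost sits in the two converters, as noted above.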

For a scalable architecture supporting flexible field lengths and arbitrary moduli, the Montgomery algorithm [37] is commonly adopted. It is an efficient approach to modular multiplication over DFs, since long-precision integer division is not required during Montgomery multiplication (also called Montgomery modular multiplication). The key idea is that the reduction after integer multiplication can be achieved by bit shifting when the domain constant is selected to be two to the power of $m$, or $x$ with degree $m$ (i.e., $2^m$ over $GF(p)$ and $x^m$ over $GF(2^m)$), where the constants $2^m$ and $x^m$ are the so-called Montgomery constants. Another benefit for hardware implementation of the Montgomery algorithm is that the $GF(p)$ and $GF(2^m)$ arithmetic logic units (ALUs) are suitable for integration in a VLSI circuit, because the sum output of a carry-save adder is equal to two bitwise exclusive-OR operations [15, 27, 38]. The overhead is the multiplexer that selects the data path between the operating fields. In [39, 40], a word-based Montgomery multiplier is presented to avoid the high fanout of AND operators in the conventional serial-parallel architecture [15]. In [16, 24, 41], a $w \times w$ multiplier is exploited to trade off hardware speed against area cost with a flexible size $w$. When $w$ equals the field length $m$, one modular multiplication can be performed within several clock cycles [17, 42]. Note that, although the Montgomery algorithm still requires the overhead of conversion between the integer and Montgomery domains, the conversion can be achieved directly by the Montgomery division described in [38].
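The shift-based reduction can be sketched for the $GF(p)$ case as follows. This is the textbook Montgomery reduction written as a software model, not the word-based or radix-4 hardware described in the cited works; the function names are hypothetical, the domain conversion here uses ordinary modular arithmetic for brevity rather than Montgomery division, and the modulus is the NIST P-192 prime, assumed only as a convenient odd modulus.

```python
def montgomery_setup(p: int, m: int) -> int:
    """Return -p^{-1} mod 2^m, the only pre-computed constant needed (p must be odd)."""
    r = 1 << m
    return (-pow(p, -1, r)) % r

def mont_mul(a_bar: int, b_bar: int, p: int, m: int, p_neg_inv: int) -> int:
    """Compute a_bar * b_bar * 2^{-m} mod p; the reduction uses masks and shifts only."""
    mask = (1 << m) - 1
    t = a_bar * b_bar
    q = ((t & mask) * p_neg_inv) & mask      # makes t + q*p divisible by 2^m
    u = (t + q * p) >> m                     # exact division by the Montgomery constant
    return u - p if u >= p else u

def to_mont(a: int, p: int, m: int) -> int:
    """Integer domain -> Montgomery domain (a * 2^m mod p)."""
    return (a << m) % p

def from_mont(a_bar: int, p: int, m: int, p_neg_inv: int) -> int:
    """Montgomery domain -> integer domain (multiply by 1 and reduce)."""
    return mont_mul(a_bar, 1, p, m, p_neg_inv)

p, m = 2**192 - 2**64 - 1, 192               # NIST P-192 prime, Montgomery constant 2^192
p_neg_inv = montgomery_setup(p, m)
a, b = 0x123456789ABCDEF0123456789, 0xFEDCBA9876543210FEDCBA
c_bar = mont_mul(to_mont(a, p, m), to_mont(b, p, m), p, m, p_neg_inv)
c = from_mont(c_bar, p, m, p_neg_inv)
print(c == (a * b) % p)
```

A chain of multiplications stays in the Montgomery domain, so the conversion cost is paid only once at the start and once at the end.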

For high speed, a commonly adopted technique is parallel computation with multiple processing elements (PEs) of homogeneous architecture [18, 24, 43]. However, in practice, directly duplicating the PEs gives low hardware utilization across the various operations. Another approach to improving the computation speed of an ECC processor is the window method [34]. The key idea is to store some pre-computed data in the device so that the on-line running time can be reduced.
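The window idea can be sketched with a generic group operation. To keep the example self-contained, the group below is the additive group of integers modulo a small number, used as a stand-in for EC point addition; the function name, the window width of 4, and the modulus 1009 are assumptions for illustration.

```python
def window_scalar_mult(k, point, add, identity, w=4):
    """Fixed-window scalar multiplication: k * point with a pre-computed table."""
    # Pre-computation (stored in device memory): table[i] = i * point.
    table = [identity]
    for _ in range((1 << w) - 1):
        table.append(add(table[-1], point))

    # Split k into w-bit digits, least significant first.
    digits = []
    while k:
        digits.append(k & ((1 << w) - 1))
        k >>= w

    # On-line phase: w doublings and one table addition per digit.
    result = identity
    for digit in reversed(digits):
        for _ in range(w):
            result = add(result, result)
        result = add(result, table[digit])
    return result

N = 1009                                    # stand-in group: integers modulo N
add_mod = lambda a, b: (a + b) % N
print(window_scalar_mult(0xBEEF, 5, add_mod, 0), (0xBEEF * 5) % N)
```

With $w = 4$, the on-line phase performs one addition per four key bits instead of up to one per bit, at the cost of a 16-entry table, which is the memory overhead discussed next.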

By contrast, parallel computation and window methods, which require extra device memory, would not be suitable for low-power, low-cost applications such as radio-frequency identification (RFID). ISO/IEC 18000-3 [44] is an international standard for item-level identification with passive RFID, and it also describes the parameters for air interface communications at 13.56 MHz. Several previous works [22, 29, 45, 46] target low hardware complexity. In [45], a 192-bit $GF(p)$/$GF(2^m)$ ECC processor supporting a hash function [47] and consuming less than 30 µW is reported, while the execution time is over 1 second per operation because of the low operating frequency of 175 kHz. In [46], the $GF(2^m)$ fast squaring approach is exploited to compute inversion efficiently in affine coordinates. In [29], a 192-bit $GF(p)$ ECC processor is presented, where a radix-4 Montgomery multiplication approach is used and inversion is achieved by the extended Euclidean algorithm [34]. In [22], a 163-bit $GF(2^m)$ ECC design with a micro-controller and bus manager is implemented to connect to the front-end module of an RFID device. Dedicated register file management is used to avoid the high complexity of multiplexers. To further reduce the number of temporary registers, a common-Z projective coordinate system modified from [48] is exploited.

Targeting the embedded-system market, [49] implements a hardware/software co-design of an ECC processor running at 12 MHz on an 8051 micro-controller. Communication overhead due to operand transfers is reduced by integrating a direct memory access unit and by including an additional I/O register in the hardware accelerator. In [50], an FPGA-based cryptographic core compliant with the IEEE 802.15.4 standard [9] is described. It consists of three components: an AES-CCM module, a content-addressable memory implementing an access control list, and an RSA module based on Montgomery arithmetic.

1.1.2 Side-Channel Attacks (SCAs)

Traditional cryptanalysis assumes that an adversary has access only to input and output pairs, without knowledge of the internal states of the device. However, the advent of side-channel analysis showed that a cryptographic device can leak critical information. By monitoring the timing, power consumption, or electromagnetic emission of the device, or by injecting faults, adversaries can obtain information about the internally processed data or operations, and the key is then extracted from the cryptographic device without mathematically breaking the primitives. Attacks that use such side-channel information are called side-channel attacks (SCAs).
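How the sequence of operations alone can reveal a key is illustrated by the sketch below. It models the side channel as an idealized trace of "D" (doubling) and "A" (addition) symbols, uses the additive group of integers modulo 1009 as a stand-in for EC points, and contrasts plain double-and-add with a double-and-add-always variant. These are simplifying assumptions for illustration, the function names are hypothetical, and this is not the protected scheduling proposed in this dissertation.

```python
def double_and_add(k: int, point: int, n: int):
    """Left-to-right double-and-add; returns (result, operation trace)."""
    result, trace = 0, []
    for bit in bin(k)[2:]:
        result = (2 * result) % n
        trace.append("D")
        if bit == "1":                       # addition happens only for 1-bits
            result = (result + point) % n
            trace.append("A")
    return result, "".join(trace)

def double_and_add_always(k: int, point: int, n: int):
    """Every key bit triggers one doubling and one addition."""
    result, trace = 0, []
    for bit in bin(k)[2:]:
        result = (2 * result) % n
        trace.append("D")
        candidate = (result + point) % n     # computed regardless of the bit
        trace.append("A")
        result = candidate if bit == "1" else result
    return result, "".join(trace)

def recover_key_from_trace(trace: str) -> int:
    """Read the key bits off the trace: 'DA' means 1, a lone 'D' means 0."""
    bits, i = [], 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i + 1] == "A":
            bits.append("1")
            i += 2
        else:
            bits.append("0")
            i += 1
    return int("".join(bits), 2)

key = 0b101101
_, leaky_trace = double_and_add(key, 7, 1009)
_, uniform_trace = double_and_add_always(key, 7, 1009)
print(leaky_trace, recover_key_from_trace(leaky_trace) == key)
print(uniform_trace)
```

The first trace determines the key completely, while the second is identical for every key of the same length; this regularity only removes the operation-sequence leak and does not by itself address attacks on the processed data.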

In 1999, Kocher [51] presented a real threat to hardware devices based on power measurement. A detailed description of the attacks on symmetric-key crypto engines is given in [14], and the power-analysis attacks have been successfully conducted on microprocessors, ASICs, and even FPGAs. The common techniques against power-analysis attacks for symmetric-key crypto engines are dual-rail logic cells, which equalize the power consumption, and masking of the key-dependent substitution. The former requires changing the design flow, including the back-end physical layout, to ensure that the interconnect capacitances of the true and false output nodes of the logic gates are equal; the latter incurs overhead in hardware speed and cost from the added combinational circuit. Several published papers [52–55] present other kinds of logic cells to “balance” the power consumption. On the other hand, a systematic overview of most existing SCAs and countermeasures for asymmetric-key designs is reported in [56]. However, most of the previous approaches present theoretical analysis rather than real implementations with measurement results. In Chapter 3, we describe the principles in more detail and evaluate power-analysis attacks on an ECC device using power measurements.

1.1.3 Summary of Paper Survey

The progress of research on ECC hardware implementation is briefly shown in Figure 1.4. ECC processors with a small key size and a single field have lower hardware complexity [22, 25, 29, 49], but they sacrifice security. The dual-field (DF) designs [24, 28, 45, 57, 58] and large-key-size approaches [38, 59] offer a higher security level. However, relatively few designs target applications such as cloud computing and portable devices, where both flexibility and device security are necessary.

[Figure 1.4 labels: SmallKey SingleField; SmallKey DualFields; LargeKey DualFields; LargeKey DualFields SCAResistance; SmallKey DualFields SCAResistance; time markers (~ 2009), (~ 2011), (~ 2012); application targets: Cloud Computing, Portable Device; axis: Security Level, Low to High]

Figure 1.4: Research review of ECC hardware implementation.

1.2 Motivation and Design Challenge

As described in Section 1.1, a suitable ECC processor solution that provides hardware efficiency together with resistance against SCAs has not appeared so far. In our work, not only the performance but also the practical applications are taken into consideration. For instance, speed is a key factor for server computing, whereas RFID devices and portable applications require low power and low cost. These conflicting requirements pose a great difficulty for the hardware designer because of the trade-off between speed and cost in current design approaches.

Our design targets are as follows:

1. Low SCA-resistance overhead in speed, cost, and power, with no modification of the circuit design flow

2. Performance improvement by delivering a new hardware architecture

3. Compliance with current standards, such as IEEE P1363 and IEEE 802.15.4/6

4. A high-speed ECC design for cloud computing

1.3 Our Solution

[Figure 1.5 labels: ECSM: randomization against SCAs; Scheduling: key-independent parallel computation; Hardware Architecture: high radix and multiple types of processor elements; RNG: enlarging random space]

Figure 1.5: Our top-down design methods for the SCA-resistant ECC processor.

Figure 1.5 briefly illustrates our proposed solution for the design objectives. At the upper level, we randomize the processed data and schedule the operation tasks in a key-independent manner to break the dependence that the attack model relies on. Notably, these methods require little modification of either the hardware architecture or the circuit design flow, and they add little overhead to the protected design. For the hardware components, a high-radix, heterogeneous processing-element architecture is used to accelerate the modular operations with improved utilization compared with conventional approaches. Reconfigurable computing is exploited through arithmetic-unit integration to reduce hardware complexity. In addition, for the multiple-processing-element design, a memory hierarchy is adopted to address the data bandwidth, with the benefit of power saving. Finally, we use circuit-level design techniques to improve the randomization ability of the random number generator (RNG), thereby achieving robustness against SCAs.

1.4 Dissertation Organization

The remainder of this dissertation is organized as follows. Chapter 2 reviews the basics of PKC and the arithmetic of ECC over finite fields. Chapter 3 presents the principles and evaluation of various SCAs on the ECC processor. Our proposed SCA countermeasure and the hardware design of the random source are introduced in Chapter 4 and Chapter 5, respectively. The proposed processing elements, operation scheduling, parallel computation, and memory architecture of the ECC processor are given in Chapter 6. Chapter 7 shows the implementation and experimental results of our ECC processor, with performance analysis and power measurement. Finally, Chapter 8 concludes our work and suggests several research targets for the future.

Chapter 2 An Overview of Cryptographic Algorithms

2.1 Public-Key Cryptosystems (PKC)

Public-key cryptosystems (PKC) refer to a cryptographic system requiring two separate keys, one of which is secret and one of which is public. Although different, the two parts of the key pair are mathematically linked. One key locks or encrypts the plaintext, and the other unlocks or decrypts the ciphertext. Neither key can perform both functions by itself. The public key may be published without compromising security, while the private key must not be revealed to anyone not authorized to read the messages.

PKC use asymmetric-key algorithms and can also be referred to by the more generic term “asymmetric-key encryption.” The algorithms used for PKC are based on mathematical relationships that presumably have no efficient solution, the most notable being the integer factorization problem and the discrete logarithm problem (DLP). Although it is computationally easy for the intended recipient to generate the public and private keys and to decrypt the message using the private key, and easy for the sender to encrypt the message using the public key, it is extremely difficult or effectively impossible for anyone to derive the private key based only on knowledge of the public key. This is why, unlike symmetric-key algorithms, a public-key algorithm does not require a secure initial exchange of one or more secret keys between the sender and receiver. The use of these algorithms also allows the authenticity of a message to be checked by creating a digital signature of the message using the private key, which can then be verified by using the public key. In practice, only a hash of the message is typically encrypted for signature verification purposes.

There are three primary kinds of PKC: public-key distribution systems, digital signature systems, and public-key cryptosystems, which can perform both public-key distribution and digital signature services. Diffie-Hellman key (DHK) exchange is the most widely used public-key distribution system, while the digital signature algorithm (DSA) is the most widely used digital signature system.

Historically, the pioneering paper by Diffie and Hellman [60] presented a new approach to cryptography and challenged cryptologists to come up with a cryptographic algorithm that met the requirements for public-key systems. The first practical method was RSA [4]. It is a block cipher in which the plaintext and ciphertext are integers between 0 and n − 1 for some n. Plaintext is encrypted in blocks, with each block having a binary value less than n. A typical size for n is 1024 bits, or 309 decimal digits. A brief description of the RSA algorithm follows.

For some plaintext block M and ciphertext block C, encryption and decryption in RSA take the following form.

\[
C = M^{e} \bmod n
\]

\[
M = C^{d} \bmod n = (M^{e})^{d} \bmod n = M^{ed} \bmod n.
\]

Both sender and receiver must know the value of n. The sender knows the value of e, and only the receiver knows the value of d. Thus, this is a public-key encryption algorithm with a public key of PU = {e, n} and a private key of PR = {d, n}. For this algorithm to be satisfactory for public-key encryption, the following requirements must be met.

1. It is possible to find values of e, d, n such that $M^{ed} \bmod n = M$ for all $M < n$.

2. It is relatively easy to calculate $M^{e} \bmod n$ and $C^{d} \bmod n$ for all values of $M < n$.

3. It is infeasible to determine d given e and n.

The preceding relationship holds if e and d are multiplicative inverses modulo φ(n), where φ(n) is Euler’s totient function. For p, q prime, φ(pq) = (p − 1) × (q − 1). The relationship between e and d can be expressed as $ed \bmod \varphi(n) = 1$. This is equivalent to saying $ed \equiv 1 \pmod{\varphi(n)}$ and $d \equiv e^{-1} \pmod{\varphi(n)}$. That is, e and d are multiplicative inverses mod φ(n). Note that, according to the rules of modular arithmetic, this is true only if d (and therefore e) is relatively prime to φ(n) (i.e., gcd(φ(n), d) = 1).

We are now ready to state the RSA scheme. The ingredients are the following.

• p, q: two prime numbers (private, chosen)

• n = pq (public, calculated)

• e, with gcd(φ(n), e) = 1 and 1 < e < φ(n) (public, chosen)

• $d \equiv e^{-1} \pmod{\varphi(n)}$ (private, calculated)

The private key consists of {d, n} and the public key consists of {e, n}. Suppose that user Alice has published her public key and that user Bob wishes to send the message M to Alice. Then Bob calculates $C = M^{e} \bmod n$ and transmits C. On receipt of this ciphertext, user Alice decrypts by calculating $M = C^{d} \bmod n$.
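As a worked illustration of the scheme above, the following minimal Python sketch runs key generation, encryption, and decryption with deliberately tiny primes. The specific values (p = 61, q = 53, e = 17, M = 65) and the function names are illustrative assumptions, not taken from this dissertation; a real key uses an n of at least 1024 bits together with message padding.

```python
from math import gcd


def rsa_keygen(p: int, q: int, e: int):
    """Return (public_key, private_key) for primes p, q and public exponent e."""
    n = p * q                    # public modulus
    phi = (p - 1) * (q - 1)      # Euler's totient of n
    if not (1 < e < phi and gcd(phi, e) == 1):
        raise ValueError("e must satisfy 1 < e < phi(n) and gcd(phi(n), e) = 1")
    d = pow(e, -1, phi)          # d = e^-1 (mod phi(n)); needs Python 3.8+
    return (e, n), (d, n)


def rsa_encrypt(m: int, public_key) -> int:
    e, n = public_key
    return pow(m, e, n)          # C = M^e mod n


def rsa_decrypt(c: int, private_key) -> int:
    d, n = private_key
    return pow(c, d, n)          # M = C^d mod n


# Toy parameters (assumed for illustration only).
public_key, private_key = rsa_keygen(61, 53, 17)
message = 65
ciphertext = rsa_encrypt(message, public_key)
recovered = rsa_decrypt(ciphertext, private_key)
```

Here Bob would call `rsa_encrypt` with Alice's published `public_key`, and only Alice, who holds `private_key`, can call `rsa_decrypt` to obtain `recovered == message`.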

For the security of RSA, there are three approaches to attacking RSA mathematically.

1. Factor n into its two prime factors. This enables calculation of φ(n) = (p − 1) × (q − 1), which, in turn, enables determination of $d \equiv e^{-1} \pmod{\varphi(n)}$.

2. Determine φ(n) directly, without first determining p and q. Again, this enables determination of $d \equiv e^{-1} \pmod{\varphi(n)}$.

3. Determine d directly, without first determining φ(n).

Most discussions of the cryptanalysis of RSA have focused on the task of factoring n into its two prime factors. Determining φ(n) given n is equivalent to factoring n. With presently known algorithms, determining d given e and n appears to be at least as time-consuming as the factoring problem [61]. Thus, we can use factoring performance as a benchmark against which to evaluate the security of RSA.

In addition to the size of n, a number of other constraints have been suggested by researchers. To avoid values of n that may be factored more easily, the algorithm’s inventors suggest the following constraints on p and q.

1. p and q should differ in length by only a few digits. Thus, for a 1024-bit key, both p and q should be on the order of magnitude of $10^{75}$ to $10^{100}$.

2. Both (p − 1) and (q − 1) should contain a large prime factor.

3. gcd(p − 1, q − 1) should be small.

A key size of 1024 bits was generally considered the minimum necessary for the RSA encryption algorithm. However, such a key size results in high hardware cost and long execution time. Figure 2.1 compares the security strengths of ECC and RSA. It shows that, for equivalent security, the key size of ECC can be several tens of times shorter than that of RSA. The shorter key also makes the ECC approach more convenient for the user.

2.2 Arithmetic of Elliptic Curve Cryptography (ECC) over GF(p) and GF(2^m)

As described in IEEE P1363 [7], the standardized elliptic curve (EC) over GF(p) is $y^2 = x^3 + a_p x + b_p$, where $x, y \in GF(p)$ and $4a_p^3 + 27b_p^2 \neq 0 \pmod{p}$, and the one over $GF(2^m)$ is $y^2 + xy = x^3 + a_b x^2 + b_b$ with $x, y \in GF(2^m)$ and $b_b \neq 0$. For the ECC schemes, the discrete logarithm problem (DLP) is based on the elliptic curve scalar multiplication (ECSM) KP = P + P + · · · + P, with an integer private key K and a point P(x, y) on the EC. The ECSM is applied in ECC as a means of producing a trapdoor function. Thus, the security of ECC depends on the intractability of determining K from Q = KP given known values of Q and P. The fundamental arithmetic theory of ECC is described in the guide books [62, 63].

For implementation, the ECSM is the most time-critical operation, and it can be achieved by serial EC point addition and doubling (ECPA and ECPD) with the binary method. Note that the ECPA performs $P_3(P_{3x}, P_{3y}) \leftarrow P_1(P_{1x}, P_{1y}) + P_2(P_{2x}, P_{2y})$ with $P_1 \neq \pm P_2$, and the ECPD calculates $P_3(P_{3x}, P_{3y}) \leftarrow 2P_1(P_{1x}, P_{1y})$ with $P_1 \neq -P_1$.

The dual-field (DF) arithmetic of ECPA and ECPD in affine coordinates is summarized in Table 2.1, where the EC point subtraction (ECPS) can be achieved by performing the ECPA with a modified coordinate value, namely $P(x, y) \rightarrow -P(x, -y)$ over GF(p) and $P(x, y) \rightarrow -P(x, x + y)$ over $GF(2^m)$.

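Table 2.1: Formulas of EC Point Calculation (ECPC) in Affine Coordinates

| Field | ECPA | ECPD |
|---|---|---|
| GF(p) | $\lambda = \dfrac{P_{1y} - P_{2y}}{P_{1x} - P_{2x}}$ | $\lambda = \dfrac{3P_{1x}^{2} + a_p}{2P_{1y}}$ |
| | $P_{3x} = \lambda^{2} - P_{1x} - P_{2x}$ | $P_{3x} = \lambda^{2} - 2P_{1x}$ |
| | $P_{3y} = \lambda(P_{2x} - P_{3x}) - P_{2y}$ | $P_{3y} = \lambda(P_{1x} - P_{3x}) - P_{1y}$ |
| GF(2^m) | $\lambda = \dfrac{P_{1y} + P_{2y}}{P_{1x} + P_{2x}}$ | $\lambda = P_{1x} + \dfrac{P_{1y}}{P_{1x}}$ |
| | $P_{3x} = \lambda^{2} + \lambda + P_{1x} + P_{2x} + a_b$ | $P_{3x} = \lambda^{2} + \lambda + a_b$ |
| | $P_{3y} = \lambda(P_{2x} + P_{3x}) + P_{3x} + P_{2y}$ | $P_{3y} = \lambda(P_{1x} + P_{3x}) + P_{3x} + P_{1y}$ |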
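The GF(p) formulas of Table 2.1 can be exercised with a minimal Python sketch. The toy curve $y^2 = x^3 + 2x + 2$ over GF(17), the point (5, 1), and all function names are assumptions chosen for illustration; they are not parameters from this dissertation, and the special cases excluded by the table ($P_1 = \pm P_2$ for ECPA, $P_1 = -P_1$ for ECPD) are deliberately not handled.

```python
P_MOD = 17   # toy prime p (assumption, for illustration only)
A_P = 2      # curve coefficient a_p
B_P = 2      # curve coefficient b_p


def inv(v: int) -> int:
    """Modular inverse in GF(p); this plays the role of the modular division."""
    return pow(v % P_MOD, -1, P_MOD)


def on_curve(pt) -> bool:
    x, y = pt
    return (y * y - (x ** 3 + A_P * x + B_P)) % P_MOD == 0


def ecpa(p1, p2):
    """P3 <- P1 + P2 for P1 != +/-P2 (GF(p) ECPA formulas of Table 2.1)."""
    (x1, y1), (x2, y2) = p1, p2
    lam = (y1 - y2) * inv(x1 - x2) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x2 - x3) - y2) % P_MOD
    return x3, y3


def ecpd(p1):
    """P3 <- 2*P1 for P1 != -P1 (GF(p) ECPD formulas of Table 2.1)."""
    x1, y1 = p1
    lam = (3 * x1 * x1 + A_P) * inv(2 * y1) % P_MOD
    x3 = (lam * lam - 2 * x1) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return x3, y3


def ecps(p1, p2):
    """P1 - P2 via ECPA with the negated point -P2 = (x, -y) over GF(p)."""
    x2, y2 = p2
    return ecpa(p1, (x2, -y2 % P_MOD))


P = (5, 1)
P2 = ecpd(P)       # 2P
P3 = ecpa(P, P2)   # 3P
```

Each ECPA or ECPD call costs exactly one inversion (the `inv` call), which is why the ratio of division time to multiplication time decides whether affine coordinates are attractive.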


The ECPC can be implemented in several coordinate systems; the computational complexity analysis can be found in [64] and [22]. The major operations of ECPC over both GF(p) and $GF(2^m)$ are summarized in Table 2.2, where MD, MM, and MS denote modular division, multiplication, and squaring, respectively. Note that the $GF(2^m)$ MS with a fixed irreducible polynomial [34] can be performed in relatively fewer cycles than the MD and MM, but the fixed-irreducible-polynomial method restricts flexibility and results in lower security. Since our work targets support for arbitrary irreducible polynomials, the MS is regarded as an MM with identical multiplier and multiplicand. From the comparison in Table 2.2, it can be seen that when the time ratio MD/MM is smaller than 3, the ECSM is fastest in affine coordinates over DFs. Otherwise, the computation time is lower in projective coordinates.


Table 2.2: Operations for ECPC over DFs in Various Coordinates

| Field | ECPC | Coordinates | Modular Operations |
|---|---|---|---|
| GF(p) | ECPD | A ← 2A | 1MD + 1MM + 2MS |
| | | SP ← 2SP | 7MM + 5MS |
| | | J ← 2J | 4MM + 6MS |
| | | J_m ← 2J_m | 4MM + 4MS |
| | | J_C ← 2J_C | 5MM + 6MS |
| | ECPA | A ← A + A | 1MD + 1MM + 1MS |
| | | SP ← SP + SP | 12MM + 2MS |
| | | J ← J + J | 12MM + 4MS |
| | | J_m ← J_m + J_m | 13MM + 6MS |
| | | J_C ← J_C + J_C | 11MM + 3MS |
| | | J ← J + A | 8MM + 3MS |
| | | J_m ← J_m + A | 9MM + 5MS |
| | | J_C ← J_C + A | 8MM + 3MS |
| GF(2^m) | ECPD | A ← 2A | 1MD + 1MM + 1MS |
| | | LD ← 2LD | 4MM + 1MS |
| | | LD_m ← 2LD_m | 5MM + 1MS * |
| | ECPA | A ← A + A | 1MD + 1MM + 1MS |
| | | LD ← LD + LD | 2MM + 4MS |
| | | LD_m ← LD_m + LD_m | 2MM + 3MS |

A: affine, SP: standard projective, J: Jacobian, J_m: modified Jacobian, J_C: Chudnovsky Jacobian, LD: López-Dahab, LD_m: modified López-Dahab.

* The respective coordinates z are unequal values.
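One way to read the MD/MM threshold is to compare, operation by operation, the affine cost with each projective cost in Table 2.2 and to ask at which MD/MM ratio the two are equal. The Python sketch below does this under two assumptions that are ours rather than the dissertation's stated derivation: an MS is charged as one MM in both fields, and the comparison is made per operation instead of over a whole ECSM. The dictionary and function names are hypothetical.

```python
# Costs per operation as (MD, MM, MS), transcribed from Table 2.2.
AFFINE = {
    ("GF(p)", "ECPD"): (1, 1, 2),
    ("GF(p)", "ECPA"): (1, 1, 1),
    ("GF(2^m)", "ECPD"): (1, 1, 1),
    ("GF(2^m)", "ECPA"): (1, 1, 1),
}
PROJECTIVE = {
    ("GF(p)", "ECPD"): {"SP": (0, 7, 5), "J": (0, 4, 6),
                        "Jm": (0, 4, 4), "JC": (0, 5, 6)},
    ("GF(p)", "ECPA"): {"SP": (0, 12, 2), "J": (0, 12, 4), "Jm": (0, 13, 6),
                        "JC": (0, 11, 3), "J+A": (0, 8, 3),
                        "Jm+A": (0, 9, 5), "JC+A": (0, 8, 3)},
    ("GF(2^m)", "ECPD"): {"LD": (0, 4, 1), "LDm": (0, 5, 1)},
    ("GF(2^m)", "ECPA"): {"LD": (0, 2, 4), "LDm": (0, 2, 3)},
}


def cost_in_mm(cost, md_over_mm):
    """Cost in units of one MM, assuming MS = MM and MD = md_over_mm * MM."""
    md, mm, ms = cost
    return md * md_over_mm + mm + ms


def break_even_ratio(field_op, coords):
    """MD/MM ratio at which the affine and projective costs are equal."""
    _, mm, ms = AFFINE[field_op]
    return cost_in_mm(PROJECTIVE[field_op][coords], 0) - (mm + ms)


break_even = {
    (field_op, coords): break_even_ratio(field_op, coords)
    for field_op, table in PROJECTIVE.items()
    for coords in table
}
smallest = min(break_even.values())
```

Under these assumptions, affine coordinates win every individual comparison whenever MD/MM stays below `smallest`, and at least one projective variant becomes cheaper once the ratio exceeds it.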


2.3 Specifications for Applications

2.3.1 IEEE P1363

The standard IEEE P1363 [7], together with its extension [8], specifies several primitives based on ECC to realize the cryptographic schemes. For the key agreement schemes, the primitives include the elliptic curve secret value derivation primitive Diffie-Hellman (ECSVDP-DH) [5, 6, 65] and the elliptic curve secret value derivation primitive Menezes-Qu-Vanstone (ECSVDP-MQV) [66]. For the signature schemes with appendix, the primitives include the elliptic curve signature primitive Nyberg-Rueppel (ECSP-NR) [5, 6, 67] and the elliptic curve signature primitive digital signature algorithm (ECSP-DSA) [5, 6, 68]. In addition, the elliptic curve integrated encryption scheme (ECIES) [69] is adopted to implement encryption and decryption.

2.3.2 IEEE 802.15.4/6

As specified in IEEE 802.15.4 [9], the symmetric-key cryptographic algorithm uses the block cipher AES [3] with three operation modes: counter (CTR), cipher block chaining message authentication code (CBC-MAC), and CTR with CBC-MAC (CCM) [70]. These modes are applied to conduct the security schemes involving encryption, authentication, and message integrity, respectively. In addition to two AES operation modes, cipher-based message authentication code (CMAC) [71] and CCM, used in IEEE 802.15.6 [10], an asymmetric-key cryptographic algorithm based on ECC [7] is adopted to achieve the message exchange with Diffie-Hellman key (DHK) agreement [65] on an open channel.

AES algorithm

As described in [3], the AES cipher processes a 128-bit plaintext block with a 128-, 192-, or 256-bit secret key to generate a 128-bit ciphertext block. A design with a larger key size provides a higher security level but requires more processing cycles. A round is the basic transformation function in the AES algorithm, and the number of rounds for one AES encryption depends on the key size. Key sizes of 128, 192, and 256 bits correspond to 10, 12, and 14 rounds, respectively, for a single 128-bit input message. The round function consists of four basic transformations: SubByte, ShiftRow, MixColumn, and AddRoundKey, except for the last round, which omits MixColumn. The KeySchedule algorithm expands the secret key in a word-oriented fashion and generates a 128-bit round key every round, which is added to the state value by a simple bit-wise exclusive-OR operation in AddRoundKey, where the state value is the 16 × 8-bit temporary data of the AES round calculation.
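The relation between key size and round count, and the AddRoundKey step, can be sketched in a few lines of Python. This is only an illustration of the two facts stated above, not an AES implementation: SubByte, ShiftRow, MixColumn, and KeySchedule are omitted, and the function names and the sample round key are assumptions.

```python
def aes_round_count(key_bits: int) -> int:
    """Number of AES rounds: 10, 12, or 14 for 128-, 192-, or 256-bit keys."""
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    return key_bits // 32 + 6


def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """AddRoundKey: bit-wise XOR of the 16-byte state with a 128-bit round key."""
    if len(state) != 16 or len(round_key) != 16:
        raise ValueError("state and round key must both be 16 bytes")
    return bytes(s ^ k for s, k in zip(state, round_key))


state = bytes(range(16))              # sample 16 x 8-bit state (assumption)
round_key = bytes([0xA5] * 16)        # sample round key (assumption)
mixed = add_round_key(state, round_key)
restored = add_round_key(mixed, round_key)   # XOR is its own inverse
```

Because XOR is self-inverse, applying the same round key twice returns the original state, which is why decryption can reuse the same AddRoundKey hardware.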

Encryption, authentication, and message integrity

Figure 2.2(a) and Figure 2.2(b) show the AES schemes, including encryption (or decryption), authentication, and message integrity, using the CTR, CBC-MAC/CMAC, and CCM modes, respectively. In the CTR mode, the plaintext is encrypted by a bit-wise exclusive-OR with a stream of cipher blocks, which is produced from the AES output by feeding in a block message consisting of a nonce and a counter. Note that the data flow of decryption in the CTR mode is the same as that of encryption. For the CBC-MAC and CMAC modes, a message integrity code (MIC) is produced by a chain of AES encryptions for detecting any tampering with the plaintext. To achieve the message integrity scheme (i.e., authenticated encryption), the CCM mode is efficiently implemented by a combined operation of the CTR and CBC-MAC modes.

Message exchange with DHK agreement

Figure 2.3 shows the procedure that precedes message exchange between two parties communicating over an insecure channel, based on the well-known DHK agreement [65]. Address A and Address B represent the media access control addresses of Alice and Bob, respectively, and the security suite indicates the security level of the cipher function. Note that both the public-key generation and the DHK agreement can be achieved by performing the ECSM with a selected private key. When communicating over an open channel, the delivered message is encrypted and decrypted by using the AES CCM mode based on a master key (MK), which is refreshed and activated when a new party joins the network.

[Figure 2.2: data-flow diagrams of the AES modes; panels: (a) Encryption, (b) Decryption]

Figure 2.2: The data flow of each AES mode, where the nonce and initial vector (IV) are an arbitrary number and a secret value, respectively. The functional notation MSB_Tlen/LSB_Tlen denotes the most/least significant Tlen bits of the data, and Tlen/Clen

[Figure 2.3 steps, in order; suffix a: Alice, suffix b: Bob]

1a. Choose private key KA and calculate public key QA=ECSM(KA, P…

1b. Choose private key KB and calculate public key QB=ECSM(KB, P…

2a. Select 128-bit Nonce A

2b. Select 128-bit Nonce B

3a. Send Nonce A, QA, Address A, Address B, and Security Suite

4b. Reply Ack frame

5b. Send Nonce B, QB, Address A, Address B, and Security Suite

6a. Reply Ack frame

7a. Compute DHK=x-coordinate(ECSM(KA,QB……
KMAC 3A=MSB64(CMAC(MSB128(DHK…,Address A||Address B||Nonce A||Nonce B||Security Suite……
KMAC 4A=MSB64(CMAC(MSB128(DHK…,Address B||Address A||Nonce B||Nonce A||Security Suite……

7b. Compute DHK=x-coordinate(ECSM(KB,QA……
KMAC 3B=MSB64(CMAC(MSB128(DHK…,Address A||Address B||Nonce A||Nonce B||Security Suite……
KMAC 4B=MSB64(CMAC(MSB128(DHK…,Address B||Address A||Nonce B||Nonce A||Security Suite……

8b. Send Nonce B, QB, KMAC 3B, Address A, Address B, and Security Suite

9a. Reply Ack frame

10a. Check KMAC 3A=?KMAC 3B. Do not proceed if check fails

11a. Send Nonce A, QA, KMAC 4A, Address A, Address B, and Security Suite

12b. Reply Ack frame

13b. Check KMAC 4A=?KMAC 4B. Do not proceed if check fails

14. Both parties compute & activate their new MK=CMAC(LSB128(DHK…, Nonce A||Nonce B…

Figure 2.3: Messages can be securely sent to a specific party based on the MK by using both asymmetric-key and symmetric-key algorithms, without prior knowledge of the encryption and decryption keys.
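The public-key generation and DHK computation in Figure 2.3 (steps 1a/1b and 7a/7b) both reduce to an ECSM, which the following Python sketch illustrates. Everything concrete here is an assumption for illustration: the toy curve $y^2 = x^3 + 2x + 2$ over GF(17), the base point (5, 1), the private keys 3 and 7, and the function names. Nonces, Ack frames, the KMAC checks, and the MK derivation are omitted.

```python
P_MOD, A_P = 17, 2        # toy curve y^2 = x^3 + 2x + 2 over GF(17) (assumption)
BASE_POINT = (5, 1)       # public point P on the toy curve (assumption)


def point_add(p1, p2):
    """Add two affine points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                              # P + (-P) = infinity
    if p1 == p2:                                 # doubling
        lam = (3 * x1 * x1 + A_P) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:                                        # addition
        lam = (y1 - y2) * pow((x1 - x2) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def ecsm(k: int, point):
    """Left-to-right binary scalar multiplication k * point."""
    result = None
    for bit in bin(k)[2:]:
        result = point_add(result, result)
        if bit == "1":
            result = point_add(result, point)
    return result


k_a, k_b = 3, 7                       # private keys of Alice and Bob (toy values)
q_a = ecsm(k_a, BASE_POINT)           # step 1a: Alice's public key
q_b = ecsm(k_b, BASE_POINT)           # step 1b: Bob's public key
dhk_a = ecsm(k_a, q_b)[0]             # step 7a: x-coordinate of KA * QB
dhk_b = ecsm(k_b, q_a)[0]             # step 7b: x-coordinate of KB * QA
```

Both parties arrive at the same value because KA(KB·P) = KB(KA·P), while an eavesdropper who sees only QA and QB would have to solve the elliptic curve DLP to recover it.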


Chapter 3 Side-Channel Attacks (SCAs)

Modern security systems apply cryptographic algorithms to provide confidentiality, integrity, and authenticity of data, where the cryptographic algorithms are mathematical functions that usually take two input parameters: a message (also called the plaintext) and a cryptographic key. The cryptographic algorithms map these parameters to an output, called the ciphertext, and this process is regarded as encryption. In current cryptography, the cryptographic algorithms are assumed to be known, which means that all details about the cryptographic algorithms are publicly available and only the cryptographic key is kept secret. This notion can be traced back to Auguste Kerckhoffs [72], a Dutch cryptographer of the 19th century, and the concept is known as “Kerckhoffs’ principle”.

Breaking a cryptographic algorithm typically means finding the secret key based on some public information, such as pairs of plaintexts and ciphertexts. A cryptographic algorithm is considered secure in practice if no known attack can break it within a reasonable amount of time and with a reasonable amount of computing power. Many algorithms are designed such that the effort of breaking them grows significantly or exponentially with the number of bits of the key. Consequently, the length of the key is an important factor in the security of a cryptographic algorithm.

Crypto engines are electronic devices, such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or microprocessors, that implement cryptographic algorithms using the keys stored on them. Crypto engines are used to accelerate the cryptographic algorithms, but this leads to a new issue for the practical security of the algorithms. In practice, not only the security of the cryptographic algorithm must be of concern; the security of the whole system, i.e., the crypto engine that implements the cryptographic algorithms, also needs to be considered. Breaking a crypto engine means extracting the key of the device. A person who tries to extract the key of a crypto engine in an unauthorized way is an attacker, and any attempt to extract the key in an unauthorized way is viewed as an attack. In order to evaluate the security of a crypto engine, it is necessary to make assumptions about the knowledge that an attacker has about it. The strongest assumption is that the attacker knows the details of the crypto engine.

In recent years, several kinds of attacks on crypto engines have become public. Side-channel attacks (SCAs) are attacks based on information leakage from the physical implementation of cryptosystems, rather than on brute force or theoretical weaknesses in the algorithms. In Figure 3.1, for example, timing information, power consumption, electromagnetic leaks, or even sound can provide an extra source of information, which can be exploited to break the system. Among them, the power-analysis attacks, initially presented by Kocher [51], have received a large amount of attention because they are very powerful and can be conducted relatively easily. The basic idea of this kind of attack is to reveal the key of a crypto engine by analyzing its power consumption. The variation in power consumption directly reflects the differences in the key-dependent processed data, where the total power consumption P_total of a cell is the sum of the static power P_stat and the dynamic power P_dyn, as shown in Figure 3.2. Consequently, power-analysis attacks pose a serious threat to the security of crypto engines in practice.
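For reference, the decomposition mentioned above can be written out as follows. The first line restates the text; the two approximations are the usual first-order textbook CMOS model, included here as an assumption rather than as this dissertation's exact formulation, with $\alpha$ the switching activity, $C_L$ the load capacitance, $f$ the clock frequency, and $I_{leak}$ the leakage current.

```latex
\begin{align}
P_{total} &= P_{stat} + P_{dyn} \\
P_{dyn}   &\approx \alpha \, C_L \, V_{dd}^{2} \, f \\
P_{stat}  &\approx I_{leak} \, V_{dd}
\end{align}
```

In this model the switching activity $\alpha$ depends on the data being processed, which is the mechanism by which key-dependent data shows up in the measured power.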

In this dissertation, we investigate the state-of-the-art approaches to power-analysis attacks. They include simple power-analysis (SPA) attacks [51], differential power-analysis (DPA) attacks [73], zero-value power-analysis (ZPA) attacks [74], and collision power-analysis (CPA) attacks [75]. Their concepts are described in the following sub-sections, and we also show successful attacks based on power measurements conducted on the devices. Figure 3.3(a) and Figure 3.3(b) show our power-analysis verification environment for the chip, an ECC crypto engine fabricated in UMC 90-nm CMOS technology.

[Figure 3.1 labels: Processing Time; Electromagnetic Emission; Current, Voltage; Data I/O; Plaintext, Ciphertext, Key and Password]

Figure 3.1: Scenario of side-channel attacks on hardware device.

Figure 3.2: Power consumption of CMOS circuits with supply voltage Vdd and leakage

Figure 3.3: (a) Environment of power measurement. (b) Current running through the chip is recorded by measuring the voltage drop via a resistor in series with the core power

As a quick preview, the ECC processor accelerates the elliptic curve scalar multiplication (ECSM) KP, where K is the private key and P is a point on the elliptic curve. Thus, the objective of power-analysis attacks on the ECC processor is to extract the private key K from the measured power traces of the ECSM calculation. Since P is usually public, it is reasonable to assume that the attacker knows P and can control or inject arbitrary input values of P.

3.1 Simple Power-Analysis (SPA) Attacks

Simple power-analysis (SPA) is the technique that involves directly interpreting power measurement collected during cryptographic operations. In other words, the attacker tries to derive the key more or less directly from a given power trace. The SPA attacks are useful in practice if only one or very few power traces are available for a given set of inputs. In the attacked device, the key must have a significant impact on the power consumption, otherwise the effectiveness of SPA attacks is reduced by the noise.

Algorithm 1 shows the conventional ECSM using the left-to-right double-and-add (LR-DA) binary method [17]. With this approach, there is a branch in Step 4, where the ECPA is executed depending on the value of the i-th bit of the key K. This means that the execution time of the ECSM is correlated with the Hamming weight of the key, so SPA attacks become a threat that can reveal the key value by recording power traces over time.

Algorithm 1 LR-DA ECSM

Input: K and P
Output: KP
1: Let Q0 ← 0
2: For i from m − 1 to 0 do
3:   Q0 ← 2Q0
4:   If Ki = 1 then Q0 ← Q0 + P
5: Return Q0
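The key dependence of Algorithm 1 can be made visible with a small Python sketch that records which operation runs in each step. To keep it self-contained, the sketch uses plain integers as a stand-in group (a "point" is an integer, ECPD is multiplication by 2, ECPA is integer addition); this substitution, the function name, and the sample keys are assumptions for illustration, not the dissertation's implementation.

```python
def lr_da_ecsm(k: int, p: int, m: int):
    """Left-to-right double-and-add over an m-bit key.

    Returns (K*P, trace), where the trace lists 'D' for each doubling
    and 'A' for each addition, in execution order.
    """
    q0 = 0
    trace = []
    for i in range(m - 1, -1, -1):
        q0 = 2 * q0                 # Step 3: ECPD stand-in
        trace.append("D")
        if (k >> i) & 1:            # Step 4: key-dependent branch
            q0 = q0 + p             # ECPA stand-in
            trace.append("A")
    return q0, "".join(trace)


# Two 8-bit keys with different Hamming weights (sample values).
result_low, trace_low = lr_da_ecsm(0b10000001, 5, 8)
result_high, trace_high = lr_da_ecsm(0b11111111, 5, 8)
```

The number of `A` entries equals the Hamming weight of the key, and their positions spell out the key bits, which is exactly what an attacker reads off a single power trace in an SPA attack.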

Figure 3.4 shows the power traces over time for keys with different Hamming weights, obtained from an unprotected ECC chip performing the LR-DA ECSM of Algorithm 1, where the Hamming weight of the key is denoted by H(K). While processing, the chip consumes 1.79 mW at 10 MHz, which results in a voltage drop above 50 mV across the measurement resistor. From these waveforms, the key value in the chip using the LR-DA ECSM can be distinguished by visual inspection because the processing time depends on the Hamming weight of the key.

As shown in Algorithm 2, the left-to-right double-and-add-always (LR-DAA) ECSM, which performs a uniform ECPC in each iteration, can resist SPA attacks [22], but it requires a 50% ECPA operation overhead on average. In Section 4.3, we present our new method, which not only resists SPA attacks but also computes the ECSM with lower overhead.

Figure 3.4: SPA attacks on the unprotected ECC chip using LR-DA binary method of ECSM, where the power traces are recorded by 50.0 mV/div voltage resolution and 2.0 ms/div time base.

Algorithm 2 LR-DAA ECSM

Input: K and P
Output: KP
1: Let Q0 ← 0, Q1 ← P
2: For i from m − 1 to 0 do
3:   Q0 ← 2Q0
4:   Q1 ← Q0 + P
5:   Q0 ← QKi
6: Return Q0
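A minimal Python sketch of Algorithm 2 shows why its operation sequence no longer depends on the key. As a simplifying assumption, plain integers stand in for EC points (ECPD is multiplication by 2, ECPA is integer addition); the function name and sample keys are illustrative only.

```python
def lr_daa_ecsm(k: int, p: int, m: int):
    """Left-to-right double-and-add-always over an m-bit key.

    Returns (K*P, trace), where the trace lists 'D' for each doubling
    and 'A' for each addition, in execution order.
    """
    q = [0, p]                          # q[0] = Q0, q[1] = Q1
    trace = []
    for i in range(m - 1, -1, -1):
        q[0] = 2 * q[0]                 # Step 3: ECPD stand-in
        trace.append("D")
        q[1] = q[0] + p                 # Step 4: ECPA always executed
        trace.append("A")
        q[0] = q[(k >> i) & 1]          # Step 5: keep Q0 or Q1 by key bit
    return q[0], "".join(trace)


result_low, trace_low = lr_daa_ecsm(0b10000001, 5, 8)
result_high, trace_high = lr_daa_ecsm(0b11111111, 5, 8)
```

Both keys now produce the same trace of eight doublings and eight additions, at the price of performing additions whose results are sometimes discarded. In hardware, the selection in Step 5 must itself avoid key-dependent power or timing behavior; this sketch does not model that.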


High-Performance Elliptic Curve Cryptographic Processor with Side-Channel Attack Resistance

Student: Jen-Wei Lee    Advisor: Dr. Chen-Yi Lee

Department of Electronics Engineering and Institute of Electronics

National Chiao Tung University

ABSTRACT

Nowadays, the fast development of network communication in the electronics industry brings people a quick and convenient life, while the demand for protecting personal private data from disclosure increases significantly as well. In security, the conventional symmetric-key scheme can achieve encryption locally, but the decryption key and ciphertext still need to be sent without disclosure, modification, duplication, forgery, or unauthorized access. The asymmetric-key scheme, also called public-key cryptosystems (PKC), was developed to satisfy these requirements. In recent years, a newer approach, elliptic curve cryptography (ECC), has been adopted in several applications to ensure the security of information exchange. However, a suitable ECC processor solution has not appeared so far.

In this dissertation, we investigate the design of the crypto engine from a system view, from the top down, including the algorithm, operation scheduling, processing-element architecture, and circuit-level implementation. To achieve a high-performance accelerator, several techniques that improve hardware speed, hardware complexity, and power consumption are proposed. Besides, to deliver a decent crypto engine design, device security, such as the countermeasure against side-channel attacks (SCAs), is also a design target. Together, these design issues pose a big challenge: the device must be implemented without key-dependent processed data and without much SCA-resistance overhead.

To this end, we propose a new design method, based on randomized computation and key-independent scheduling, to protect the private data stored in the device from side-channel information leakage. Its feature is that it is suitable for system integration and for use with the standards without any pre-computation. Another advantage is that the overhead of the protected design is lower than that of related previous works. The robustness of the SCA resistance is examined by exploiting an on-chip true random number generator (TRNG) with sufficient randomness. Moreover, the corresponding hardware architecture is introduced, and our ECC processor outperforms other approaches in both hardware efficiency and protection against SCAs.

To further demonstrate our contributions, we carry our research through to several standard applications. Fabricated in UMC 90-nm CMOS technology, a 0.41 mm² 160-bit ECC chip achieves 0.34/0.29 ms and 11.7/9.3 µJ for one GF(p_160)/GF(2^160) elliptic curve scalar multiplication (ECSM), which is efficient in hardware cost and suitable for mobile devices; a 521-bit ECC chip performs each GF(p_521) ECSM in 3.40 ms and each GF(2^521) ECSM in 2.77 ms, and it saves 50% of the public-key data transmission through on-chip elliptic curve point generation (ECPG). This is the fastest design and is also applicable to cloud computing; a 192-bit ECC chip achieves 10.8/9.2 ms and 438/437 µW for GF(p_192)/GF(2^192) ECSM at a scaled 0.5 V and 25 MHz, which is efficient in power consumption and suitable for Internet of Things (IoT) applications. In addition, the SCA resistance of each design is demonstrated by millions of measurements.


Acknowledgements