通用型橢圓曲線密碼系統純量乘法之實現

(1)

國

立

交

通

大

學

電子工程學系電子研究所碩士班

碩

士

論

文

通用型橢圓曲線密碼系統純量乘法之實現

An Implementation of Universal Dual-Field Scalar Multiplication on Elliptic

Curve Cryptosystems

研究生：劉耀仁

指導教授：張錫嘉教授

(2)

通用型橢圓曲線密碼系統純量乘法之實現

An Implementation of Universal Dual-Field Scalar Multiplication on Elliptic

Curve Cryptosystems

研究生：劉耀仁

Student：Yao-Jen Liu

指導教授：張錫嘉

Advisor：Hsie-Chia Chang

國立交通大學

電子工程學系電子研究所碩士班

碩士論文

A Thesis

Submitted to Department of Electronics Engineering & Institute of Electronics College of Electrical Engineering and Computer Science

National Chiao Tung University in partial Fulfillment of the Requirements

for the Degree of Master

in

Electronics Engineering

January 2007

Hsinchu, Taiwan, Republic of China

(3)

通用型橢圓曲線密碼系統純量乘法之實現

學生：劉耀仁

指導教授

：張錫嘉

國立交通大學電子工程學系電子研究所碩士班

摘

要

這篇論文中介紹了一個同時適用在有限質數場和 GF(2

m

)有限場的橢圓

曲線純量乘法器的通用型硬體架構。這個所提出的純量乘法器能支援最多

256 位元任意長度的有限質數場，且它也能應付 GF(2

m

)有限場不同的場多項

式 degree 和 p(x)。實現此可變的通用型硬體架構是根據蒙哥馬利的技術，包

括蒙哥馬利的乘法器以及除法器。而所提出的蒙哥馬利模數除法理論也可以

用來取代在蒙哥馬利域中的一個模數反元素運算和一個模數乘法運算。這個

理論在計算模數除法時，比原本橢圓曲線所使用的方法效能較好，且設計成

硬體時也比其他模數除法理論需要較小的面積。而用所提出的純量乘法器架

構來計算橢圓曲線上的純量乘法也有合理的速度，例如它只需 3.3 毫秒就可

以完成一個 192 位元的純量乘法運算。

(4)

An Implementation of Universal Dual-Field Scalar Multiplication on

Elliptic Curve Cryptosystems

student：Yao-Jen Liu

Advisors：Hsie-Chia Chang

Department of Electronics Engineering & Institute of Electronics

National Chiao Tung University

ABSTRACT

An universal hardware architecture of scalar multiplier on elliptic curves

suitable for both GF(p) and GF(2

m

) is introduced in this thesis. The proposed

scalar multiplier can work in arbitrary field lengths within a maximum 256-bit

length in GF(p), and it also supports various field degrees and primitives in

GF(2

m

). The flexible universal hardware architecture is based on the Montgomery

techniques, including the Montgomery multiplier and divider. The Montgomery

modular division algorithm is also proposed to replace the inversion followed by a

multiplication in the Montgomery domain. It provides a better performance on

modular division operations than previous ECC techniques and also has a smaller

area size than other modular division architectures. The proposed scalar multiplier

architecture can perform the scalar multiplication on elliptic curves at a

reasonable speed. For a 192-bit scalar multiplication operation, it takes about 3.3

ms.

(5)

誌

謝

很珍惜在交大電子所的這兩年研究所生活，在這短短的兩年中除了學到了許多做研究的方法，更讓人難忘的是在這兩年中給予我指導的師長及帶給我歡樂的朋友。首先，要感謝的是我最敬愛的指導教授張錫嘉博士，大學時曾經修過老師的邏輯設計，便結下了這段不解之緣。老師給人的感覺就像是一位親切的學長般，帶著幽默且平易近人的方式與我們相處，不僅是良師也是益友，在研究所這兩年生涯中不斷地給我許多專業的指導和研究方向，生活上的大小事都會給予幫助；在這也要感謝老師與師母在我生病住院時還親自到醫院來探望，讓我實在倍感溫馨。還有建青學長、彥欽學姊每個禮拜都撥冗幫忙主持實驗室會議，而且都能及時給我們很多在學術上的建議與幫助。特別要感謝的是陳榮傑教授，橢圓曲線的課程給了我很多啟發，以及半年的學術研討交流，使我在論文上得到很多的幫助。再來要感謝的是曾經相處一年的已畢業學長們，年薪百萬、寫了三篇論文就一直拿來說嘴的小科學長，不辭辛勞擔任導遊、招待實驗室宜蘭花蓮五日遊然後猛收回扣的小安學長，任勞任怨、替實驗室管理主機跟郵件伺服器的毛律學長，大學一起玩社團、教導我橢圓曲線的維均學長，時常在實驗室放閃光的小天學長，口味與眾不同、蟒蛇也能當寵物的周毓堂學長；此外，還有陪伴我一起度過兩年研究所生涯的同袍，說話總是格外中肯、現在是老師博士班愛將的大頭，當了三年室友、下雨天時常讓我搭便車的胖威，熱心招待實驗室台中美食之旅的祐徵，從泰國回來就變了個樣、成天打嘴砲的大師兄；再來就是還有一年要努力的學弟妹們，能說善道、騙了很多專題小學妹進來的博元學弟，熬夜趕期中報告時會一邊熱心播報王建民即時戰況的佳瑋學妹，也是系籃明星球員而且選過模特兒的修齊學弟，會溫馨送宵夜、半夜一起在實驗室奮鬥的大嘴學弟，感謝所有一起陪伴過我的朋友們。最後要感謝的是我的家人，一直寵愛著我的父母、找不到詞可以形容的哥哥，還有很愛胡思亂想的妹妹，感謝你們一直在背後給我莫大的支持與鼓勵。

(6)

An Implementation of Universal Dual-Field Scalar

Multiplication on Elliptic Curve Cryptosystems

Student: Yao-Jen Liu

Advisor: Dr. Hsie-Chia Chang

Department of Electronics Engineering

National Chiao Tung University

(7)

Abstract

An universal hardware architecture of scalar multiplier on elliptic curves suitable for both GF (p) and GF (2m_{) is introduced in this thesis. The proposed scalar multiplier can}

work in arbitrary field lengths within a maximum 256-bit length in GF (p), and it also supports various field degrees and primitives in GF (2m_{). The flexible universal hardware}

architecture is based on the Montgomery techniques, including the Montgomery multiplier and divider. The Montgomery modular division algorithm is also proposed to replace the inversion followed by a multiplication in the Montgomery domain. It provides a better performance on modular division operations than previous ECC techniques and also has a smaller area size than other modular division architectures. The proposed scalar multiplier architecture can perform the scalar multiplication on elliptic curves at a reasonable speed. For a 192-bit scalar multiplication operation, it takes about 3.3 ms.

(8)

List of Figures

2.1 Elliptic curves over the reals . . . 8

2.2 _{Adding two distinct points P + Q = −R . . . .} 9

2.3 _{Doubling a point 2P = −R . . . 10}

4.1 Universal dual-field Montgomery multiplier . . . 31

4.2 Universal dual-field Montgomery Divider Architecture for U and V . . . . 34

4.3 Universal dual-field Montgomery Divider Architecture for R and S . . . 35

4.4 Circuit diagram for calculating λ . . . 36

4.5 Universal dual-field point adder . . . 37

4.6 Universal dual-field scalar multiplier . . . 38

5.1 ASIC design flow . . . 40

(11)

List of Tables

1.1 Comparable security strength for given cryptography . . . 2

2.1 Point addition formula over reals . . . 12

2.2 Point addition formula over GF (p) . . . 13

2.3 Point addition formula over GF (2m_{) . . . 14}

3.1 Latency of Montgomery modular inverse from and to both domain . . . 24

3.2 Latency of modular division using traditional method from and to both domain . . . 29

3.3 Latency of Montgomery modular division from and to both domain . . . . 29

4.1 Comparison of Montgomery multiplier in GF (p) . . . 32

4.2 Synthesize results for proposed universal dual-field Montgomery multiplier on ASIC and FPGA design . . . 33

4.3 The latency for Algorithm 3.6 . . . 33

4.4 A Comparison with FPGA results for modular division in GF (p) . . . 34

4.5 Synthesize results for proposed universal dual-field Montgomery divider on ASIC and FPGA design . . . 35

4.6 A Comparison with ASIC results for universal modular division algorithm 36 4.7 Synthesize results for proposed elliptic curve point adder . . . 37

4.8 Synthesize results for proposed elliptic curve scalar multiplier . . . 38

5.1 Elliptic Curve Scalar Multiplication ASIC Performance Comparison . . . . 41

(12)

Chapter 1 introduction

1.1 Background

Since the public-key cryptography was introduced by Diffe and Hellman [1] in 1976, the use of discrete logarithm problem in public-key cryptosystems has been recognized. This method of exponential key exchange came to be known as Diffie-Hellman key ex-change. RSA and El-Gamal are two of the popular public-key cyrptosystems widely used nowadays. The RSA algorithm based on the difficult of factoring large numbers was published by Rivest, Shamir and Adleman [2] at MIT1 _{in 1978. Further, the El-Gamal}

algorithm based on Diffie-Hellman key agreement describes the public-key system and digital signature schemes, and it was proposed by Taher ElGamal [3] in 1985.

The public-key cryptosystem such as RSA is still widely used in electronic commerce protocols and it is believed to be secure enough as long as it has sufficiently long keys. However, there are many efficient attacks known for both RSA and modular p discrete log based cryptosystems such as the Number Field Sieve [4] attacks for RSA and the index calculus attacks for the modular p systems.

The elliptic curve cryptography (ECC) is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields. The ECC was in-dependently proposed by Victor S. Miller of IBM2 _{in 1986 [5] and Neal Koblitz of the}

University of Washington in 1987 [6]. There are no subexponential algorithms known

1

Massachusetts Institute of Technology, located in Cambridge, MA, USA. http://web.mit.edu/

2

(13)

for the elliptic curve discrete logarithm problem (ECDLP) and denotes that there are no efficient attacks known on it. Consequently, the parameters for ECC can be chosen to be much smaller than the paramters for RSA with the same level of resistance against the best known attacks. Table 1.1 shows each different parameter size with the same level of security strengths compared with given cryptography [7].

Table 1.1: Comparable security strength for given cryptography

ECC (e.g., ECDSA) IFC (e.g., RSA) Symmetric key algorithms f = 160 − 223 k = 1024

-f = 224 − 255 k = 2048 -f = 256 − 383 k = 3072 AES-128 f = 384 − 511 k = 7680 AES-192 f = 512 ↑ k = 15360 AES-256

1 _{ECDSA (Appendix A.3).}

2 _{IFC denotes integer factorization cryptography.}

3 _{f is the size of n, where n is the order of the base point G.} 4 _{k is the size of the modulus p.}

5 _{Advanced Encryption Standard (AES) [8].}

Note that in Table 1.1, the difference of the size between ECC and RSA becomes more enormous as the security level increases. It is attractive that the ECC has much smaller parameters leads to more significant performance advantages contrast to RSA. Therefore, the ECC takes advantages for wireless applications where the computing power, memory and battery life are limited such as smart cards and wireless devices.

Furthermore, the performance of ECC mainly depends on the efficiency of its math-ematical arithmetics, namely, scalar multiplication. Given p, a positive integer, and a point P on an elliptic curve. The scalar multiplication kP can easily be defined as adding the (k − 1) copies of P to itself. There are some algorithm to compute the multiple of points on elliptic curves. More details will be discussed in chapter 2.3 later.

At the end of this thesis in appendix, some schemes for ECC are listed. Appendix A.1 is El-Gamal on elliptic curves, Appendix A.2 is Elliptic Curve Diffie-Hellman (ECDH) [9]

(14)

1.2 Motivation

In recent years, security issues on communications are more and more significant as the wireless industry explodes. The ECC has become an important role in public-key cryptographic systems. There are more and more applications using ECC as authenti-cation for transactions and encryption or signature for secure messaging. For example, ECDSA has been used to sign the product key by Microsoft3 _{since Windows 95}4_.

Since the scalar multiplications on elliptic curves are needed, there are much advanced research on modular arithmetic operations over finite fields such as the Montgomery’s technique [11] for modular multiplication which will be discussed in chapter 3.1.1. Then the Montgomery’s technique for modular inversion has been described in [12] and will be mentioned later in chapter 3.2.2. However, scalar multiplication on elliptic curves needs modular division operations in affine coordinates, and there are few implementations on ECC applications using dedicated modular division component. There were still no efficient modular division algorithms known for past hundreds of years. Instead, modular division is achieved by computing the inversion followed by multiplication, but it takes longer latency between domain transformation using Montgomery’s technique in this way. Thus, a Montgomery modular division algorithm is developed in this thesis to shorten the timing for scalar multiplication on elliptic curves using Montgomery’s technique.

In Freescale5 _{MPC190 security processor, it includes elliptic curve operations in either}

GF (p) or GF (2m_{) in its features, and its programmable field size is from 55 to 511 bits.}

The point operations (addition, doubling and multiplication) involve one or more finite field operations, which are addition, multiplication, inverse and squaring.

Therefore, in this thesis, an approach is also provided to compute the scalar multi-plication on elliptic curves in both GF (p) and GF (2m_{), and the Montgomery technique}

can be used to deal with various finite field degrees and different primitive polynomials in GF (2m_{). Further, in this feature, one more finite field operation, division, is involved}

here to replace the inversion followed by a multiplication, and it is used in hardware to accelerate the scalar multiplication operations in ECC applications.

3

Microsoft Corporation. http://www.microsoft.com/

4

A consumer-oriented graphical user interface-based operating system developed by Microsoft in 1995.

5

(15)

1.3 Thesis Organization

In this thesis, the total solution to elliptic curve operations in hardware and software is given. In Chapter 2, the preliminary mathematical background of elliptic curves is introduced. In Chapter 3, the Montgomery’s techniques for the finite field arithmetic are detailed in this chapter. It shows each algorithm in both prime field and binary extension field versions. Additionally, an algorithm for Montgomery modular division is proposed for the combination of Montgomery inversion and Montgomery multiplication. In Chapter 4, all the proposed universal dual-field architectures are described in this chapter. An universal architecture of Montgomery multiplication for both prime field and binary extension field is proposed first. Then the implementation of the proposed Montgomery modular division algorithm is presented. The division hardware is the major part of the scalar multiplier on elliptic curves. In Chapter 5, it shows the hardware implementation results and test consideration for the scalar multiplier on elliptic curves. Finally, the conclusion is given in Chapter 6.

(16)

Chapter 2 Elliptic Curves

Elliptic curves [13] [14] are not ellipses as shown in literal. In mathematics, an elliptic curve is an algebraic curve defined by a cubic equation such as y2 _{= x}3 _{+ ax + b, which}

is non-singular, i.e. its graph has no cusps or self-intersections. Elliptic curves received their name from their relation to elliptic integrals such as

Z z2 z1 dx √ x3_{+ ax + b} and Z z2 z1 x dx √ x3_{+ ax + b} (2.1)

that arose in connection with the computation of the circumference of ellipses.

2.1 Basic Facts

Let F be an algebraically closed field and F2_{denote the affine plane A}2_{, the usual plane,}

A2(F) = {(x, y)|x, y ∈ F}. Let C(x, y) be an irreducible polynomial over F, and the curve C means the set of zeros of C in the affine plane F2_{, i.e. {(x, y) ∈ F}2_{|C(x, y) = 0}.}

Assume that P is a point (xp, yp) on the curve C. If both of the partial derivatives vanish

at P , that is ∂C(xp,yp)

∂x =

∂C(xp,yp)

∂y = 0, then the point P is called a singular point on the

curve C. A curve is called a singular curve if and only if it has at least one singular point on it, otherwise it is called a non-singular curve. An elliptic curve commonly used in cryptography is a non-singular curve because of its better security level relative to a singular curve. A singular elliptic curve is thought of insecure in general. Definition 2.1 shows the algebraic equation of the elliptic curve in a more general form.

(17)

Definition 2.1. _{An elliptic curve E over the field F defined by an affine Weierstrass} equation is an equation of the form

Y2_{+ a}

1XY + a3Y = X3+ a2X2+ a4X + a6, ∀ai ∈ F (2.2)

let E(F) denote the elliptic curve E over F, i.e. the set of points (x, y) ∈ F2 _{that satisfy}

this equation, along with the point at infinity denoted by O.

Definition 2.2. _{The point at infinity called O is the intersection of the y-axis and the} line at infinity. The line at infinity is the set of points on the projective plane for which Z = 0. Therefore, the point at infinity O is (0, 1, 0) in the projective plane, i.e. the equivalence class with X = Z = 0.

No further details about projective plane are shown in this thesis since only affine coordinates are discussed in the remaining chapters.

In order to describe a singular or non-singular curve clearly, an important quantity ∆ related to the elliptic curve called the discriminant of E is defined.

Definition 2.3. ∆ is the discriminant of E and is given by

∆ = −b22b8− 8b34− 27b26+ 9b2b4b6, where                      b2 = a21+ 4a2 b4 = 2a4+ a1a3 b6 = a23+ 4a6 b8 = a21a6+ 4a2a6 − a1a3a4 +a2a23− a24 (2.3)

and the symbols above correspond to (2.2).

Theorem 2.1. A cubic curve defined by a Weierstrass equation (2.2) is singular if and only if its discriminant ∆ is zero.

The Definition 2.1 is feasible for any field F. However, the elliptic curves commonly used in cryptography are over the finite field GF (q), where q is either a large prime p or a power of p. If q is a large prime p, the prime field GF (p), also labeled as Fp or Zp, is a field

of characteristic p where p 6= 2, 3, that is, Pk

i=11 = k 6= 0 for 1 ≤ k < p and

Pp

i=11 = 0.

(18)

where p is typically chosen as 2 for the sake of binary property in hardware. The finite fields are also called Galois fields, in honor of their discoverer.

On the basis of various characteristics, the Weierstrass equation (2.2) can be simplified into different forms by a linear change of variables. The following paragraphs shows the equation for a field of characteristic 6= 2, 3 and a field of characteristic 2.

Let F be a field of characteristic 6= 2, 3 and char(F) denote the characteristic of F. Since the char(F) 6= 2, substitute (X, Y ) by (X, Y −a1X+a3

2 ) on the left hand side in (2.2).

Y2+ a1XY + a3Y substitute (X, Y ) → (X, Y − a1X + a3 2 ) ⇒ (Y −a1X+a3 2 )2+ a1X(Y − a1X+a3 2 ) + a3(Y − a1X+a3 2 ) = Y2₋a2 1 4X 2₋ a1a3 2 X − a2 3 4 (2.4)

Notice that both XY and Y term are eliminated so the coefficients a1 and a3 should be

zero. Thus the equation (2.4) results in Y2 _{by substitution for a}

1 = a3 = 0. Further, the

char(F) 6= 3 so substitute (X, Y ) by (X −a2

3 , Y ) on the right hand side in equation (2.2).

X3+ a2X2+ a4X + a6 substitute (X, Y ) → (X − a2 3 , Y ) ⇒ (X −a2 3 )3+ a2(X − a2 3 )2+ a4(X − a2 3) + a6 = X3_{+ (}−1 3 a 2 2+ a4)X + (₂₇2 a32− 13a2a4+ a6) (2.5)

Then again, the X2 _{term is eliminated so that the coefficient a}

2 should be zero and the

equation (2.5) results in X3_{+ a}

4X + a6 by setting a2 = 0. According to (2.4) and (2.5),

let a1 = a2 = a3 = 0, a4 = a, a6 = b and the equation (2.2) is modified as follows

Y2 = X3_{+ aX + b, a, b ∈ F} (2.6) where char(F) 6= 2, 3. Note that the elliptic curve is a smooth curve, i.e. the curve is non-singular. Review in Theorem (2.1), an elliptic curve should have its discriminant nonzero. Therefore, the discriminant of the cubic curve (2.6) can be derived through (2.3) by substitution for a1 = a2 = a3 = 0, a4 = a, a6 = b. Thus ∆ = −16(4a3+ 27b2) 6= 0.

For a field of characteristic 2, only the non-supersingular case is considered. In brief, non-supersingular has the result of the coefficient a1 6= 0. Since a1 6= 0, substitute (X, Y )

by (a2 1X +a 3 a1, a 3 1Y + a2 1a4+a23 a3

1 ) in (2.2) likewise. A simplified form is obtained as follows

Y2+ XY = X3+ aX2_{+ b, a, b ∈ F} (2.7) where char(F) = 2. There is no need to care whether or not the cubic polynomial on the right hand side in (2.7) has multiple roots.

(19)

2.2 Elliptic Curves Arithmetic

Elliptic curve cryptography makes use of elliptic curves where the variables and co-efficients are belong to a finite field. Two kinds of elliptic curves are commonly used in cryptographic applications. They are prime curves over GF (p) and binary curves over GF (2m_{) respectively. Before discussion on the above curves, the elliptic curves over the}

reals are first introduced because some of the basic concepts are easier to visualize.

2.2.1 Elliptic Curves over the Reals

According to equation (2.6), a definition for elliptic curves over the reals is given below.

Definition 2.4. A non-singular elliptic curve E over the reals is an equation of the form

y2 = x3+ ax + b (2.8) where a, b ∈ R are constants such that 4a3_{+ 27b}2 _{6= 0.}

It can be shown that the condition 4a3_{+ 27b}2 _{6= 0 is necessary and sufficient to ensure}

that the equation (2.8) has three distinct roots which may be real or complex numbers. Figure 2.1 shows two non-singular elliptic curves and one singular elliptic curve whose equation are y2 _{= x}3_{− 4x, y}2 _{= x}3_{+ 73, and y}2 _{= x}3_{− 3x − 2 respectively.}

−5 0 5 −10 −8 −6 −4 −2 0 2 4 6 8 10 x y y2 = x3 − 4x −10 0 10 −30 −20 −10 0 10 20 30 x y y2 = x3 + 73 −5 0 5 −10 −8 −6 −4 −2 0 2 4 6 8 10 x y y2 = x3 − 3x + 2

(20)

Let E be a non-singular elliptic curve over the reals. Given two points P and Q on E, the negative of P , denoted by −P , and the sum P + Q is defined as follows:

1. If P is the point at infinity O, then −P is O and P + Q is Q; that is, O is the additive identity which is also called zero element of the group of points.

2. If P is not the point at infinity O, then −P is the symmetry point of P on the curve E; that is, −P is the point with the same x-coordinate and negative the y-coordinate of P , i.e. −(x, y) = (x, −y). According to equation (2.8), if (x, y) is a point on the curve E, then the point (x, −y) is consequently on the curve E. 3. If P and Q are different points on E with different x-coordinates, then let l be the

line through P and Q, and the line l intersects the curve E in exactly one more point R. Then the sum P + Q = −R is defined and is illustrated in Figure 2.2.

−4 −3 −2 −1 0 1 2 3 4 5 6 −10 −8 −6 −4 −2 0 2 4 6 8 10 x y P Q R −R l E

Figure 2.2: Adding two distinct points P + Q = −R

4. If P and Q are different points on E with the same x-coordinates, that is, Q is a symmetry point of P equal to −P , then the sum P + Q = P + (−P ) = O is defined. 5. If P and Q are the same points on E, then let the line l be the tangent line to the curve at P and the point R be the only other point of intersection of l with the curve E. Thus the sum P + Q = P + P = 2P = −R is defined and is illustrated in

(21)

Figure 2.3. Furthermore, if the tangent line has a double tangency at P , that is, P is a point of inflection, then the sum P + Q = P + P = 2P = −P is defined.

−4 −3 −2 −1 0 1 2 3 4 5 6 −10 −8 −6 −4 −2 0 2 4 6 8 10 x y P R l −R E

Figure 2.3: Doubling a point 2P = −R

In figure 2.2, let (x1, y1), (x2, y2), (x3, −y3) and (x3, y3) denote the coordinates of P ,

Q, R and P + Q respectively. Let l : y = λx + β be the equation of the line through P and Q then λ = y2−y1

x2−x1 is the slope of the line l and β = y1− λx1 = y2− λx2 is the consequence

of the point P lying on the line l. Assume that t is a variable and (t, λt + β) denotes the coordinates of arbitrary points on the line l. The point on l simultaneously lies on the elliptic curve E if and only if (t, λt+β) satisfies equation (2.8) so that (λt+β)2 _{= t}3_+at+b

and rearrange it below by order of t.

t3_{+ (−λ}2)t2_{+ (a − 2λβ)t + (b − β}2) = 0 (2.9) Note that the equation has exactly three distinct roots and two of them are known as x1

and x2. Remember the relation between roots and coefficient mentioned in Vi´ete formula

first proposed by Fran¸cois Vi´ete (1540–1603), a French mathematician.

Theorem 2.2. (Vi´ete’s Formula) Assume P (x) is a polynomial of degree n with roots x1, x2, . . . , xn. For 1 ≤ i ≤ n, let Si be the sum of the products of distinct polynomial

(22)

where the roots are taken i at a time, i.e. Si is defined as the symmetric polynomial Πi(x1, . . . , xn) for i = 1, . . . , n, where Si = Πi(x1, . . . , xn) = X 1≤α1<α2<...<αk≤n xα1xα2· · · xαk (2.11)

For example, the first few values of Si are

S1 = Π1(x1, . . . , xn) = P 1≤i≤n xi = x1+ x2+ x3+ x4+ · · · S2 = Π2(x1, . . . , xn) = P 1≤i<j≤n xixj = x1x2+ x1x3+ x1x4+ x2x3+ · · · S3 = Π3(x1, . . . , xn) = P 1≤i<j<k≤n xixjxk = x1x2x3+ x1x2x4+ x2x3x4+ · · ·

and so on. Then Vi´ete’s formula states that

Si = (−1)i

an−i

an

(2.12)

Proof. The polynomial P (x) can also be written as

P (x) = an(x − x1)(x − x2) · · · (x − xn)

= an(xn− S1xn−1 + S2xn−2− · · · + (−1)nSn)

(2.13)

According to equation (2.10), setting the coefficients equal yields

an(−1)iSi = an−i

which is what the Vi´ete’s formula states for. Q.E.D.

The Viéte formula was proved by Viéte (1579) for positive roots only, and the general theorem was proved by Gérard Desargues (1591–1661). Therefore the sum of the roots s1

of a monic polynomial shown in (2.9) is equal to minus the coefficient of the second-to-highest order. A monic polynomial or normed polynomial is a polynomial whose leading coefficient is equal to 1. It concludes that the third root x3 in (2.9) is equal to λ2− x1− x2

since the sum of the three distinct roots s1 is λ2. Then the y-coordinate of R is λx3 + β

and y3 is minus the y-coordinate of R. Therefore the coordinate of P + Q in terms of

x1, x2, y1, y2 is shown below. x3 = ( y2− y1 x2− x1 )2_{− x}1− x2 y3 = ( y2− y1 x2− x1 )(x1− x3) − y1 (2.14)

(23)

In figure 2.3, let (x1, y1), (x2, y2), (x3, −y3) and (x3, y3) denote the coordinates of P ,

Q, R and P + Q respectively. Since P and Q are the same point, x2 = x1 and y2 = y1.

Let l : y = λx + β be the equation of the tangent line to the curve E at P . The slope of the tangent line at P can be derived by differentiation of the equation (2.8) as follows.

d dx(y 2_{) =} d dx(x 3_{+ ax + b)} (dy dx) = 3x2_{+ a} 2y (2.15)

So the slope of the tangent line λ = 3x21+a

2y1 . According to (2.14), substitute (

y2−y1

x2−x1) for

(3x21+a

2y1 ) and x2 = x1, y2 = y1. A formula for doubling a point is obtained.

x3 = ( 3x2 1+ a 2y1 )2_{− 2x}1 y3 = ( 3x2 1+ a 2y1 )(x1− x3) − y1 (2.16)

Table 2.1 shows the addition formula mentioned above.

Point Addition (P 6= Q) Point Doubling (P = Q) P + Q x3 = λ2− x1− x2 y3 = λ(x1− x3) − y1 λ = y2−y1 x2−x1 x3 = λ2− 2x1 y3 = λ(x1− x3) − y1 λ = 3x21+a 2y1

Table 2.1: Point addition formula over reals

2.2.2 Elliptic Curves over Prime Fields

Let p > 3 be a prime. Elliptic curves over GF (p) are defined almost the same as they are over the reals and the operations over the reals are replaced by modulus operations.

Definition 2.5. Let p > 3 be a prime. A non-singular elliptic curve E over the finite field GF (p) is an equation of the form

y2 _{≡ x}3+ ax + b (mod p) (2.17) where a, b ∈ GF (p) are constants such that 4a3_{+ 27b}2 _{6≡ 0 (mod p).}

(24)

According to equation (2.14) and (2.16), the point addition formula of the elliptic curves over GF (p) is shown in Table 2.2.

Point Addition (P 6= Q) Point Doubling (P = Q) P + Q x3 ≡ λ2 − x1 − x2 (mod p) y3 ≡ λ(x1− x3) − y1 (mod p) λ ≡ y2−y1 x2−x1 (mod p) x3 ≡ λ2− 2x1 (mod p) y3 ≡ λ(x1− x3) − y1 (mod p) λ ≡ 3x 2 1+a 2y1 (mod p)

Table 2.2: Point addition formula over GF (p)

2.2.3 Elliptic Curves over Extension of Binary Fields

Definition 2.6. Let p(x) be a primitive polynomial of degree m. A non-supersingular elliptic curve E over the extension of binary field GF (2m_{) is an equation of the form}

y2+ xy = x3 + ax2+ b (2.18) where a, b ∈ GF (2m_{) are constants.}

Note that in this subsection, all of the arithmetic operations are defined over GF (2m₎

and all of the parameters are belong to GF (2m_{), too. Assume that (x}

1, y1), (x2, y2) and

(x3, y3) denote the coordinates of P , Q and P + Q respectively. Then the coordinate of

−P is defined as (x1, x1+ y1) and P + (−P ) = O.

If P 6= Q, let l : y = λx+β be the equation of the line through P and Q then λ = y2+y1

x2+x1

is the slope of the line l and β = y1+ λx1 = y2+ λx2 is the consequence. The following

equation shows all of the points (t, λx + β) on l simultaneously lies on the curve E.

t3+ (λ2+ λ + a)t + (β2+ b) = 0 (2.19) Thus the third root x3 = λ2 + λ + x1 + x2 + a and the corresponding y-coordinate is

λx3+ β. So the negative of the y-coordinate y3 = (λx3+ β) + x3 = λ(x1+ x3) + x3+ y1.

If P = Q, let l : λx + β be the equation of the tangent line to the curve E at P . The slope of the tangent line at P can be derived by differentiation of the equation (2.18).

d dx(y 2_{+ xy) =} d dx(x 3_{+ ax}2_{+ b)} (dy dx) = x + y x (2.20)

(25)

So the slope of the tangent line λ = x1 + _xy11. Since P = Q, x3 = λ

2 _{+ λ + a and}

y3 = λ(x1 + x3) + x3 + y1; moreover, there is another formula commonly used for y3 by

changing varibale. Given λ = x1+_xy11 =

x2 1+y1

x1 that leads to λx1+ y1 = x

2

1. Thus rearrange

y3 = (λ + 1)x3+ λx1+ y1 and adapt it for y3 = (λ + 1)x3+ x21. Table 2.3 lists all obtained

formulas of the above together for GF (2m_).

Point Addition (P 6= Q) Point Doubling (P = Q)

P + Q x3 = λ2+ λ + x1+ x2+ a y3 = λ(x1+ x3) + x3+ y1 λ = y2+y1 x2+x1 x3 = λ2+ λ + a y3 = λ(x1+ x3) + x3+ y1 = (λ + 1)x3+ x21 λ = x1+_xy11

Table 2.3: Point addition formula over GF (2m₎

2.3 Elliptic Curves Scalar Multiplication

Scalar multiplication is used to compute a multiple of an Elliptic curve point kP , where P is an elliptic curve point and k is a positive integer smaller than the order of P , then kP is the point obtained by adding together k copies of P and this operation dominates the execution time of elliptic curve cryptographic schemes.

2.3.1 Double-and-Add Algorithm

Algorithm 2.1. (Double-and-Add Algorithm)

Input: A positive integer k < n, where n is the order of P ; and an elliptic curve point P . Output: The elliptic curve point kP .

1. Let knkn−1. . . k1k0 be the binary representation of k, where the leftmost bit kn is 1.

2. Set R = P .

3. For i from n − 1 down to 1 do 3.1 Set R = 2R.

3.2 If ki = 1, then set R = R + P .

(26)

The double-and-add algorithm is a basic method for calculating scalar multiplication. It achieves by repeated point double and add operations. The expected number of ones in the binary representation of k is m

2, where m is the length of the integer k. The

number of ones in k indicates the number of times that point addition performs and the number of times that point doubling operation performs is approximately equal to m. Thus Algorithm 2.1 averagely takes m

2 times point addition and m times point doubling

to perform m-bit elliptic curve scalar multiplication once.

2.3.2 Addition-Subtraction Method

If P (x, y) ∈ E(Fp) then −P = (x, −y); else if P (x, y) ∈ E(F2m) then −P = (x, x + y).

Thus the point subtraction is as efficient as point addition. Then Algorithm 2.1 is replaced by using addition-subtraction method and shown in Algorithm 2.2.

Algorithm 2.2. (Addition-Subtraction Method)

Input: A positive integer k < n, where n is the order of P ; and an elliptic curve point P . Output: The elliptic curve point kP .

1. Let enen−1. . . e1e0 be the binary representation of 3k, where the leftmost bit en is 1.

2. Let knkn−1. . . k1k0 be the binary representation of k.

3. Set R = P .

4.2 If ei = 1 and ki = 0, then set R = R + P .

4.3 If ei = 0 and ki = 1, then set R = R − P .

5. Output R.

2.3.3 Binary NAF Method

Owing to point subtraction is as efficient as point addition, the signed digit representa-tion k =P ki2iis used, where ki ∈ {0, ±1}. A non-adjacent form (NAF) is a useful signed

representation which has the property that no two consecutive bits in k are nonzero and has the fewest nonzero bits of any signed digit representation of k. Each positive integer k has its unique NAF, denoted by NAF(k). The NAF of an integer k can be computed efficiently by using Algorithm 2.3 [15].

(27)

Algorithm 2.3. (NAF of a Positive Integer) Input: A positive integer k.

Output: NAF(k). 1. Set i = 0. 2. While k ≥ 1 do

2.1 If k is odd, then set ki = 2 − (k mod 4) and then set k = k − ki; else set ki = 0.

2.2 Set k = k

2 and i = i + 1.

3. Output k, whose binary representation is (ki−1ki−2. . . k1k0).

Note that the length of NAF(k) is at most one bit longer than the binary representation of k and the average density of nonzero bits in NAF(k) is approximately m

3 [16], where m

is the length of the integer k.

Algorithm 2.4. (Binary NAF Method) Input: NAF(k) and an elliptic curve point P . Output: The elliptic curve point kP .

1. Let knkn−1. . . k1k0 be signed digit representation of k, where the leftmost bit kn is 1.

2. Set R = P .

3.2 If ki = 1, then set R = R + P .

3.3 If ki = −1, then set R = R − P .

4. Output R.

Then the Algorithm 2.4 modifies Algorithm 2.1 by using NAF(k) instead of the binary representation of k and averagely takes approximately m

3 times point addition and m times

point doubling to perform m-bit elliptic curve scalar multiplication once. Furthermore, it follows that the expected running time of Algorithm 2.2 and Algorithm 2.4 are the same.

(28)

Chapter 3 Finite Field Arithmetic

This chapter describes the basic arithmetic used in elliptic curve over GF (p) and GF (2m_{). Modular arithmetic such as modular multiplication is especially an important}

part in many cryptographic systems, so there are still many approaches to its improvement nowadays. Since affine coordinate is used for elliptic curve cryptography in this thesis, modular inverse and division arithmetics are also discussed in this chapter.

3.1 Modular Multiplication

Modular multiplication is widely used in many applications including public key cryp-tography such as RSA [2] algorithm. The RSA algorithm requires the computation of modular exponentiation and this modular exponentiation is achieved through a series of modular multiplications. Given an m-bit integer p, called the modulus, and two m-bit operands a and b, modular multiplication computes the result of a × b (mod p).

3.1.1 Montgomery Multiplication Algorithm

The Montgomery multiplication algorithm, which was proposed by P. L. Montgomery in 1985 [11], computes the modular multiplication without trial division. It turns the modular multiplication into iterations of addition and shift operations. Thus the Mont-gomery multiplication is quite appropriate for implementing modular multiplication with hardware. Let the modulus p be an m-bit integer with 2m−1 _{≤ p < 2}m_{, and let r be}

(29)

the p-residue of an integer a with 0 ≤ a < p is defined as A ≡ a · r (mod p). Given two p-residues, A and B, the Montgomery product of the two p-residues is shown below,

C ≡ A · B · r−1 (mod p) (3.1) ≡ (a · r) · (b · r) · r−1 (mod p)

≡ (a · b) · r (mod p)

≡ c · r (mod p), where 0 ≤ c < p (3.2) where r−1 _{is the inverse of r (mod p), i.e. r · r}−1 _{≡ 1 (mod p). The Montgomery}

reduction algorithm also involves another quantity, p′_{, which is an integer that satisfies}

r · r−1 _{− p · p}′ _{= 1. Here r}−1 _{and p}′ _{can both be derived by the extended Euclidean}

algorithm described later in chapter 3.2.1. Therefore, the Montgomery product shown in equation 3.1 can be obtained by Montgomery multiplication algorithm.

Algorithm 3.1. (Montgomery Multiplication Algorithm) 1. T = A · B

2. Q = T · p′ _{mod r}

3. U = (T + Q · p)/r

4. if U ≥ p then C = U − p, else C = U

Proof. Assume that the lengths of A and B are both m-bit and the value T is a product having double length equal to 2m. Step 2 simply converts Q to single length m. Note that in step 3, given r · r−1_{− p · p}′

= 1 which leads to p · p′

≡ −1 (mod r) so that Q · p ≡ T · p′

· p (mod r) ≡ −T (mod r) (3.3) where Q · p is also a product having the same length as T and T + Q · p ≡ 0 (mod r) which concludes that it can be divisable by r and U is its quotient.

U · r = T + Q · p ≡ T (mod p) (3.4) The result U ≡ T · r−1 (mod p) ≡ A · B · r−1 (mod p) is derived. Seeing that the modulus r is a little bigger than p and 0 ≤ A, B < p, the value of T/r = A · (B/r) is smaller than p and such is the same case with (Q · p)/r. Since U is an additive result of two p-residue

(30)

Among Montgomery multiplication algorithm, the operations of modulus r and di-vision by r are both trivial operations since r is a constant with power of two. Thus Montgomery multiplication has the advantage of hardware implementation, and it’s sim-pler and faster than traditional modular multiplication.

Observe in (3.2), the resulting value C of the Montgomery product is still in the p-residue format, that is to say, the output value from the Montgomery product can be used as input value to the next without converting from an ordinary residue to a p-residue. To convert an ordinary residue to a p-residue, let the integer a multiply by r2 _{(mod p), i.e.}

22m (mod p), with the Montgomery product operation. A = MonMul(a, r2)

≡ a · r2· r−1 (mod p) (3.5) ≡ a · r (mod p)

Note that the constant r2 _{(mod p) needs to be precomputed externally. Similarly, in order}

to convert the result A back to the ordinary residue, it can be achieved that A multiplies by a constant 1 with the Montgomery product operation.

a = MonMul(A, 1)

≡ A · 1 · r−1 (mod p) (3.6) ≡ (a · r) · 1 · r−1 (mod p)

Here no extra constants need to be precomputed. Moreover, the p-residue can also be called Montgomery domain. However, it’s not efficient to perform one single modular multiplication using Montgomery product since the conversion between Montgomery and real domain needs one more Montgomery product operation, and otherwise computation for the value p′ _{also takes much time.}

3.1.2 Modified Montgomery Multiplication Algorithm

There are a variety of ways to realize the Montgomery multiplication [17]. The radix-2 Montgomery multiplication algorithm [18] over GF (p) is shown in Algorithm 3.2 and it can easily adapt the field GF (p) for the field GF (2m_{). Algorithm 3.3 shows the binary}

(31)

Algorithm 3.2. (Montgomery Multiplication over GF (p)) Input: _{A, B and P , where A, B ∈ GF (p) and P is the modulus p.} Output: _{U , where U ≡ A × B × 2}−m _{(mod P ).}

1. Let am−1am−2. . . a1a0 be the binary representation of A.

2. Set U = 0. 3. For i from 0 to m − 1 do 3.1. Set T = (U + aiB). 3.2. Set U = (T + t0P )/2. 4. If U ≥ P , then set U = U − P . 5. Output U .

Algorithm 3.3. (Montgomery Multiplication over GF (2m₎₎

Input: _{A(x), B(x) and P (x), where A(x), B(x) and P (x) ∈ GF (2}m_{), and GF (2}m_{) is}

generated by P (x).

Output: _{U (x), where U (x) ≡ A(x) · B(x) · x}−m _{(mod P (x)).}

1. Let A(x) =Pm−1

i=0 aixi, ∀ai ∈ GF (2), be the polynomial representation of A(x).

2. Set U (x) = 0.

3. For i from 0 to m − 1 do

3.1. Set T (x) = (U (x) + aiB(x)).

3.2. Set U (x) = (T (x) + t0P (x))/x.

4. Output U (x).

The suffix i of the variable indicates the ith bit in the binary or polynomial repre-sentation of the variable, i.e. t0 denotes the least significant bit of T . Note that the

coefficients of the polynomial representation of A(x), i.e. am−1am−2. . . a1a0, are also the

binary representation of the integer A(2) since A(2) = Pm−1

i=0 ai2i. The addition of two

elements in GF (2m_{) is shown in the following equation,}

A(x) + B(x) = m−1 X i=0 aixi+ m−1 X i=0 bixi = m−1 X i=0 (ai⊕ bi)xi (3.7)

and the subtraction in GF (2m_{) is the same as addition in GF (2}m_{). Furthermore, division}

(32)

3.2 Modular Inversion

Modular inversion is used in cryptographic applications such as the Diffie-Hellman key exchange [20], the public and private key pair generations in RSA and point addition operations in ECC. Given an m-bit modulus p, modular inversion computes the inversion of a non-zero field element a ∈ GF (p). The inversion of a is denoted by a−1 _{(mod p),}

where a · a−1 _{≡ 1 (mod p). Furthermore, the multiplicative inverse of a exists if and only}

if a and p are relatively prime and its proof will be shown later.

3.2.1 Extended Euclidean Algorithm

In number theory, the Euclidean algorithm determines the greatest common divisor (GCD) of two integers. The GCD of a and b, written as gcd(a, b), is the largest positive integer that divides both a and b without remainder. Two integers are called relatively prime if and only if their GCD equals 1.

The extended Euclidean algorithm (EEA) [21] is an extension to the Euclidean algo-rithm and it can be used to solve the Bézout’s identity, a linear diophantine equation. In number theory, Bézout’s identity, named after Étienne Bézout (1730–1783), states that if a and b are non-negative integers, there exist integers x and y such that

ax + by = gcd(a, b) (3.8)

where x and y can be obtained by the EEA, but they are not uniquely determined. Set x′

= x − kb and y′ _{= y + ka, then (x}′_{, y}′_{) is another solution to (3.8) since ax}′ _{+ by}′ ₌

a(x −kb)+b(y +ka) = ax+by = gcd(a, b). The following is the proof of B´ezout’s identity. Proof. Let S be the set of all positive integers of ax+by, where x and y are integers. Since S is not empty, it has a smallest element by the well-ordering principle. Let s = axt+ byt

be the smallest element of the set S. According to the division algorithm, there are unique integers q and r that satisfy a = sq + r with 0 ≤ r < s. Then

r = a − sq = a − (axt+ byt)q = a(1 − qxt) + b(−qyt) (3.9)

Note that (1 − qxt) and (−qyt) are both integers so that r should be in the set S. But the

(33)

be equal to 0 and it leads to a = sq which indicates that a is divisible by s. Similarly, b is also divisible by s. Therefore s is one of the common divisor of a and b. Assume that c is another common divisor of a and b. Let a = cq1 and b = cq2, then

c|s = ax + by = c(q1x + q2y) (3.10)

Since c|s leads to c ≤ s, it implies that s is the GCD of a and b, i.e. s = gcd(a, b). Q.E.D. The EEA can solve the equation ax + by = gcd(x, y) efficiently. When a and b are relatively prime, i.e. ax + by = 1, then x is the multiplicative inverse of a (mod b). The following algorithm is a variant of the EEA.

Algorithm 3.4. (Modified Extended Euclidean Algorithm over GF (p)) Input: _{A and P , where A ∈ GF (p) and P is the modulus p.}

Output: _{R, where R ≡ A}−1 _{(mod P ).}

1. Set U = P , V = A, R = 0 and S = 1. 2. While V 6= 0 do

2.1. While U is even do 2.1.1. Set U = U/2.

2.1.2. If R is even, then set R = R/2. 2.1.3. Else set R = (R + P )/2.

2.2. While V is even do 2.2.1. Set V = V /2.

2.2.2. If S is even, then set S = S/2. 2.2.3. Else set S = (S + P )/2.

2.3. If U > V , then set U = U − V , R = R − S. 2.4. Else if V ≥ U, then set V = V − U, S = S − R. 3. Output R (mod P ).

The algorithm described above works by using the elimination method for solving the simultaneous equations below, where d and e are not really computed.

(

S · A + d · P = V

R · A + e · P = U (3.11) The algorithm terminates when V = 0, in which case U = 1, and then RA + eP = 1 leads

(34)

3.2.2 Montgomery Modular Inverse Algorithm

Based on the EEA, the algorithm proposed by Kaliski [12] computes the Montgomery modular inverse. Given an m-bit modulus p, the Montgomery modular inverse of a non-zero integer a ∈ GF (p) is defined as the integer x,

x ≡ a−12m _{(mod p)} _(3.12)

The following algorithm rewrites the Kaliski Montgomery inverse algorithm with com-bination of the two phases in one algorithm. It is alternative that the output can be in the Montgomery domain or in the integer domain.

Algorithm 3.5. (Montgomery Modular Inverse Algorithm over GF (p)) Input: _{A and P , where A ∈ GF (p) and P is the modulus p.}

Output: U , where

_{U ≡ A}−1₂m _{(mod P ).} U ≡ A−1 _{(mod P ).}

1. Set U = P , V = A, R = 0 and S = 1.

2. Set k = 0, where k is an integer with m ≤ k < 2m. 3. While V > 0 do

3.1. If U is even, then set U = U/2, S = 2S. 3.2. Else if V is even, then set V = V /2, R = 2R.

3.3. Else if U > V , then set U = (U − V )/2, R = R + S and S = 2S. 3.4. Else if V ≥ U, then set V = (V − U)/2, S = S + R and R = 2R. 3.5. Set k = k + 1.

4. For i from 1 to

_{k − m} k

do

4.1. If R is even, then set R = R/2. 4.2. Else set R = (R + P )/2.

5. If R ≥ P , then set R = 2P − R, else set R = P − R. 6. Output R.

Note that the upper in the braces derives Montgomery inverse result and the lower derives modular inverse result, that is to say, output U is equivalent to A−12m _{(mod P )}

or A−1 _{(mod P ) depends on that the iteration count is (k − m) or k in Step 4.}

The Step 3 is iterative and it steadily reduces U or V by one bit in each iteration. Obviously, U and V initially have at most 2m bits in total since 2m−1 _{≤ U < 2}m _and

(35)

0 < V < U . But U equals 1 and V equals 0 in the last iteration, therefore, the iteration count in Step 3 takes no more than (2m − 1) iterations. Similarly, U and V initially have at least m + 1 bits in total while V is equal to 1, thus, the iteration count in Step 3 takes no less than m iterations. So the boundary of k is m ≤ k < 2m.

If the input is originally in the Montgomery domain, i.e. A ≡ a2m _{(mod P ), then the}

output of the Montgomery inverse is given below.

U ≡ (a2m)−12m (mod P ) ≡ a−1 (mod P )

(3.13)

In order to convert the output to the Montgomery domain, the conversion between integer and Montgomery domain needs an additional Montgomery multiplication operation. If the input is originally in the integer domain, i.e. A ≡ a (mod P ), then the output in the integer domain needs m iterations more than output in the Montgomery domain.

Table 3.1: Latency of Montgomery modular inverse from and to both domain

Domain Latency (cycles) From → To

Int → Int Int → Mont Mont → Int Mont → Mont

MonInv MonMul Total 4m + 1 - 4m + 1 3m + 1 - 3m + 1 3m + 1 - 3m + 1 3m + 1 m + 1 4m + 2

1 _{Int means integer and Mont means Montgomery.} 2 _{MonInv denotes Montgomery inverse.}

3 _{MonMul denotes Montgomery multiplication.} 4 _{m is the bit length of the modulus p.}

Table 3.1 shows each latency in worst case of the Montgomery inverse from and to both domain. In worst case, the latencies of the result for an m-bit Montgomery inverse and for real modular inverse are (3m + 1) and (4m + 1) clock cycles respectively, and the Montgomery multiplication operation over GF (p) requires (m + 1) clock cycles. Further, the Montgomery multiplication operation over GF (2m_{) requires only m clock cycles.}

(36)

3.3 Modular Division

The modular division operation is traditionally accomplished by modular inversion followed by modular multiplication since the modular division is believed to be slow. It can be applied to the computation of the parameter λ in ECC.

3.3.1 Modular Division Algorithm

Given an m-bit modulus p, the modular division of two integers a, b ∈ GF (p), where b 6= 0, is defined as the integer x,

x ≡ ab−1 (mod p) (3.14) The following algorithm shows a binary add-and-shift algorithm proposed by Sheueling Chang Shantz [22] for modular divison in a residue class.

Algorithm 3.6. (Modular Division Algorithm over GF (p)) Input: _{A, B and P , where A, B ∈ GF (p) and P is the modulus p.} Output: _{U , where U ≡ AB}−1 _{(mod P ).}

1. Set U = P , V = B, R = 0 and S = A. 2. While U 6= V do

2.1. If U is even, then set U = U/2. 2.1.1. If R is even, then set R = R/2. 2.1.1. Else set R = (R + P )/2.

2.2. Else if V is even, then set V = V /2. 2.2.2. If S is even, then set S = S/2. 2.2.2. Else set S = (S + P )/2.

2.3. Else if U − V > 0, then set U = (U − V )/2, and R = R − S. 2.3.3. If R < 0, then set R = R + P .

2.3.3. If R is even, then set R = R/2. Else set R = (R + P )/2. 2.4. Else if V − U ≥ 0, then set V = (V − U)/2, and S = S − R.

2.4.4. If S < 0, then set S = S + P .

2.4.4. If S is even, then set S = S/2. Else set S = (S + P )/2. 3. Output R.

(37)

The modular division algorithm above is an iterative process of additions, parity-testings, and shifts. Like Montgomery modular inverse algorithm, it reduces U or V by one bit. But it is different that in the last iteration, U and V are both equal to 1. Thus the entire division routine takes no more than 2(m − 1) iterations.

Remember that in modular inverse algorithm, solving the simultaneous equation (3.11) can derive the inverse of the term A modular P . In order to solve this equation, it can be easily observed that an identical equation fits this equation and shown below.

( 1 · A + 0 · P = A

0 · A + 1 · P = P (3.15) Therefore the initial value of the variable U , V , R and S in modular inverse algorithm are set to be P , A, 0 and 1, respectively.

Further, the modular division algorithm is similar to the modular inverse algorithm that it also works by using the elimination method for solving another simultaneous equations below, where d and e are not really computed, too.

(

S · (A−1_{B) + d · P = V}

R · (A−1_{B) + e · P = U} (3.16)

Note that the modular division algorithm terminates when U = V = 1, in which case R · (A−1_{B) + eP = 1 and S · (A}−1_{B) + dP = 1. Since P is a prime, i.e. gcd(A}−1_{B, P ) = 1,}

the equation above definitely exists integer solutions that satisfy R · (A−1_{B) + e · P = 1,}

where R · (A−1_{B) ≡ 1 (mod P ), that is, R ≡ AB}−1 _{(mod P ). Similarly, S ≡ AB}−1

(mod P ), too. And it also can be easily obtained that an identical equation that fits equation (3.16) is written below,

( A · (A−1_{B) + d · P = B}

0 · (A−1_{B) + 1 · P = P} (3.17)

Thus the two algorithms of modular inverse and division only differ from the initial value of the variable S with S = A instead. Although d is not really computed, in this case, d = (−kB), where k is an integer that AA−1 _{= 1 + kP since AA}−1 _{≡ 1 (mod P ).}

(38)

3.3.2 Montgomery Modular Division Algorithm

Given an m-bit modulus p, the Montgomery modular division of the two integers a, b ∈ GF (p), where b 6= 0, is defined as the integer q, where

q ≡ ab−12m (mod p) (3.19) And given a primitive polynomial p(x) with degree m which generates the GF (2m_{), the}

Montgomery modular division of the two element a(x), b(x) ∈ GF (2m_{), where b(x) 6= 0,}

is defined as the polynomial q(x), where

q(x) ≡ a(x)b−1(x)xm _{(mod p(x))} _(3.20)

An alternative algorithm for calculating the Montgomery modular division or real modular division is proposed below and it is suitable for both GF (p) and GF (2m_).

Algorithm 3.7. (Montgomery Modular Division Algorithm over GF (p)) Input: _{A, B and P , where A, B ∈ GF (p) and P is the modulus p.}

Output: U , where

_{U ≡ AB}−1₂m _{(mod P ).} U ≡ AB−1 (mod P ). 1. Set U = P , V = B, R = 0 and S = A.

2. Set k = 0, where k is an integer with m ≤ k < 2m. 3. While V > 0 do

3.1. If U is even, then set U = U/2, S = 2S. 3.2. Else if V is even, then set V = V /2, R = 2R.

3.3. Else if U − V > 0, then set U = (U − V )/2, R = R + S and S = 2S. 3.4. Else if V − U ≥ 0, then set V = (V − U)/2, S = S + R and R = 2R. 3.5. If R ≥ P , then set R = R − P . 3.6. If S ≥ P , then set S = S − P . 3.7. Set k = k + 1. 4. For i from 1 to k − m k do

4.1. If R is even, then set R = R/2. 4.2. Else set R = (R + P )/2.

5. Set R = P − R. 6. Output R.

(39)

Algorithm 3.8. (Montgomery Modular Division Algorithm over GF (2m₎₎

Input: _{A(x), B(x) and P (x), where A(x), B(x) and P (x) ∈ GF (2}m), and GF (2m_{) is}

generated by P (x).

Output: U (x), where

_{U ≡ A(x)B}−1_(x)xm _{(mod P (x)).} U ≡ A(x)B−1_{(x) (mod P (x)).}

1. Set U (x) = P (x), V (x) = B(x), R(x) = 0 and S(x) = A(x). 2. Set k = 0, where k is an integer with m ≤ k < 2m.

3. While V (x) 6= 0 do

3.1. If U (2) is even, then set U (x) = U (x)/x, S(x) = xS(x). 3.2. Else if V (2) is even, then set V (x) = V (x)/x, R(x) = xR(x). 3.3. Else if U (2) − V (2) > 0,

then set U (x) = (U (x) + V (x))/x, R(x) = R(x) + S(x) and S(x) = xS(x). 3.4. Else if V (2) − U(2) ≥ 0,

then set V (x) = (V (x) + U (x))/x, S(x) = S(x) + R(x) and R(x) = xR(x). 3.5. If deg(R) = deg(P ), then set R(x) = R(x) + P (x).

3.6. If deg(S) = deg(P ), then set S(x) = S(x) + P (x). 3.7. Set k = k + 1.

4. For i from 1 to

k − m k do

4.1. If R(2) is even, then set R(x) = R(x)/x. 4.2. Else set R(x) = (R(x) + P (x))/x.

5. Output R(x).

The proposed Montgomery modular division algorithm above is based on EEA and the binary GCD algorithm [23]. It mainly modifies the Montgomery modular inverse algorithm by setting the dividend to the initial value of S or S(x). Note that all of the condition statements are almost the same in hardware design among these two algorithms, so it can compute not only Montgomery division but also modular division in both GF (p) and GF (2m_{) fields. The two algorithms only differ from the field addition operations and}

the modular condition statements in Step 3.5 and Step 3.6.

Although the Montgomery division can be performed by Montgomery inverse followed by Montgomery multiplication, it takes longer latency and needs extra one more

(40)

Mont-shows the latency of traditional method using a Montgomery inversion followed by a Montgomery multiplication.

Table 3.2: Latency of modular division using traditional method from and to both domain

MonInv MonMul MonMul Total 3m + 1 m + 1 - 4m + 2 3m + 1 m + 1 m + 1 5m + 3 3m + 1 m + 1 - 4m + 2 3m + 1 m + 1 m + 1 5m + 3

The total latency of the method above seems worse in some cases especially in Mont-gomery domain, so the MontMont-gomery modular division algorithm combine the inversion with multiplication to improve this shortcoming. The maximum number of iterations in Montgomery modular division algorithm is 3m or 4m depends on the output is in Montgomery or integer domain. Table 3.3 shows each latency in worst case of the Mont-gomery division from and to both domain. However, there are no additional MontMont-gomery multiplication operations needed in each case.

Table 3.3: Latency of Montgomery mod-ular division from and to both domain

Domain Latency (cycles) From → To Int → Int Int → Mont Mont → Int Mont → Mont GF (p) GF (2m₎ 4m _{4m − 1} 3m _{3m − 1} 4m _{4m − 1} 3m _{3m − 1}

According to Table 4.3, the Montgomery modular division algorithm is consequently suitable for Montgomery domain implementation design. Therefore, the proposed algo-rithm is consistent with the way of implementing ECC on affine coordinate using Mont-gomery architecture in this thesis.

(41)

Chapter 4 Proposed Universal Architectures

The proposed architecture of the universal dual-field scalar multiplier on ECC is pre-sented in this chapter. It is adapted to arbitrary lengths of the fields within a fixed length of the given design, and it is also adapted to both the prime finite fields, GF (p), and the binary extension fields, GF (2m_{). All of the required materials for mathematical theorems}

have been mentioned in early chapters. Then all of these main components used in the scalar multiplier are detailed in the following subsections.

In this thesis, all of the design in hardware is implemented using RTL (Register-Transfer-Level) Verilog HDL (hardware description language) and synthesized on both application-specific integrated circuit (ASIC) and field-programmable gate arrays (FP-GAs). The technology of ASIC design is using UMC1 _{0.18µm 1P6M CMOS process and}

the technology of FPGA design is using Xilinx2 _{Virtex-II XC2V8000 platform FPGAs.}

4.1 Universal Dual-field Montgomery Multiplier

In order to design an universal dual-field multiplier, the main problem is to deal with the variable binary extension field since the primitive polynomial and field degree are unfixed. An universal multiplier in GF (2m_{) using Montgomery multiplication algorithm}

is proposed [24] using bit-parallel architecture, but this architecture for bit-parallel com-putation is suitable for small field degree. Thus a series shift and addition operations is chosen to implement Montgomery multiplication algorithm.

1

(42)

Back to the Algorithm 3.2 and Algorithm 3.3 which are mentioned in subsection 3.1.2. It is known that the Montgomery multiplication in GF (p) always takes (m + 1) iterations, but the Montgomery multiplication in GF (2m_{) takes only m iterations which is one cycle}

less than the former since there is no modular operation in the last step. Besides, the addition operations in these two algorithms are also different that the adder in GF (2m₎

is the adder in GF (p) without carry propagation.

CSA 3to1 MUX B U/2 P mode a0 U ctrl 2to1 MUX >> 0 ADD 2to1 MUX <<

Figure 4.1: Universal dual-field Montgomery multiplier

Figure 4.1 shows the proposed architecture of the universal dual-field Montgomery multiplier. The finite field adder is achieved by a carry-save adder (CSA) follow by an adder. The sum and carry are separated by CSA, so the carry can independently be computed. The sum of CSA is directly chosen as the addition result in GF (2m_{) when the}

mode is in binary extension field. The carry of CSA is shifted left by one bit and added to the sum as the addition result in GF (p) when the mode is in prime field.

However, the negative of P is needed at the end of the Algorithm 3.2 and (−P ) is represented by its 2’s complement, P∗_{, where P}∗ _{= 2}m

− P . Then P∗

= 2m

− P = (2m

− 1 − P ) + 1 = ¯P + 1 (4.1) Note that ¯P denotes the 1’s complement of the positive integer P , where ¯P = (2m

−1)−P . The 1’s complement is to simply complement P bit-by-bit by replacing 0’s with 1’s and 1’s with 0’s. According to (4.1), the 2’s complement of the integer P can be formed by complementing P bit-by-bit and then adding 1. It can be achieved in hardware by a bit-wise inverter except the least significant bit (LSB) of P since P is odd, that is, the

(43)

2’s complement of P is always an odd, too. Moreover, the values of register U and A are shifted right by one bit and back to the input of CSA after each iteration finished.

There are some sample area and speed results for the Montgomery multiplier over GF (p) given in [25]. Its architecture is based on [26] and implemented on FPGAs using Xilinx Virtex-E XCV2000e. Using this architecture, each multiplication operation requires (m + 5) clock cycles. Further, the architecture for the Montgomery multiplier over GF (p) based on Algorithm 3.2 is almost the same as Figure 4.1 without the 2-to-1 multiplexer at the end. Each multiplication requires (m + 1) clock cycles. For more general comparison, the proposed design here is synthesized using the same technology with [25]. Table 4.1 lists the detailed area and speed results for both.

Table 4.1: Comparison of Montgomery multiplier in GF (p)

Bit-length Proposed A. Daly [25]

128-bit Area (Slices) 629 646 Frequency (MHz) 65.90 81.23 Latency (cycles) 129 133 Throughput (Mbit/s) 65.4 78.2 256-bit Area (Slices) 1164 1292 Frequency (MHz) 47.43 58.24 Latency (cycles) 257 261 Throughput (Mbit/s) 47.2 57.1 512-bit Area (Slices) 2176 2588 Frequency (MHz) 29.35 56.05 Latency (cycles) 513 517 Throughput (Mbit/s) 29.29 55.5

Note that [25] uses pipelined mux/add architecture to improve in clock speed when the bit-length of its multiplier exceeds the maximum carry chain length, i.e. 1 column of the FPGA. Its architecture is especially designed for the specific type of FPGA platforms. Since the clock period is not under constrain in this thesis, the proposed architecture is designed for smaller scale to simultaneously adapt to both GF (p) and GF (2m_).

The implementation results of the proposed universal dual-field Montgomery multiplier is given in Table 4.2. It shows the probably gate count synthesized at 100MHz and shows the area and speed results on FPGAs. Each number of gates in the field Gatecount consists

(44)

doubles when the bit length doubles. The area is approximately in proportion to the bit length m of the field.

Table 4.2: Synthesize results for proposed universal dual-field Mont-gomery multiplier on ASIC and FPGA design

ASIC FPGA

Bit-length Gatecount_(Gates) Frequency_(MHz) _(Slices)Area Frequency_(MHz)

64-bit 4.2k (1.9k + 2.3k) 100 397 83.265 128-bit 8.3k (3.8k + 4.5k) 733 65.811 256-bit 16.3k (7.4k + 8.9k) 1387 47.432 512-bit 32.1k (14.5k + 17.6k) 2266 27.118 1024-bit 63.3k (28.8k + 34.5k) 4664 16.120

4.2 Universal Dual-field Montgomery Divider

The universal dual-field Montgomery divider is mainly used in Montgomery domain. It shortens the latency of the division in Montgomery domain by combination of an inversion and a multiplication. Table 4.3 shows the latency of modular division from and to both domain using Algorithm 3.6 method.

Table 4.3: The latency for Algorithm 3.6

MonDiv MonMul Total 2(m − 1) - _{2(m − 1)} 2(m − 1) m + 1 _{3m − 1} 2(m − 1) - _{2(m − 1)} 2(m − 1) m + 1 _{3m − 1}

In comparison with Table 3.3, the total latency for either domain to Montgomery domain is similar but it needs additional Montgomery multiplier in hardware. Table 4.4 shows the FPGA results of Algorithm 3.7 in comparison with other implementations [27] using Algorithm 3.6 with the same technology.

(45)

Table 4.4: A Comparison with FPGA results for modular division in GF (p)

Proposed A. Daly [27]

Bit-length _(Slices)Area _(MHz)Freq. _(Slices)Area _(MHz)Freq.

64-bit 1163 30.361 1212 45 128-bit 2132 26.091 2215 31 256-bit 3262 15.429 3872 17

According to Algorithm 3.7 and Algorithm 3.8, note that all of the condition state-ments are almost the same in hardware design among these two algorithms. It reduces the complexity of the control flow for the two different fields. The control flow is simul-taneously suitable for GF (p) and GF (2m_).

Figure 4.2 shows the circuit diagram for the registers U and V in the universal dula-field Montgomery divider. There are two subtracters in this circuit which compute (U −V ) and (V − U) separately. One of them is reused to compute (P − R) in order to share the subtracters. The term (P − R) is needed to derive the correct result in GF (p) when the division is finished at the last step. Further, it can use only one subtracter in only GF (p) by substituting U for −U in Algorithm 3.7.

– – 2to1 MUX ctrl 2to1 MUX ctrl P R 3to1 MUX ctrl >> 3to1 MUX ctrl V >> >> >> 2to1 MUX field 2to1 MUX field U V U U V

(46)

Figure 4.3 shows the circuit diagram for the registers R and S. Note that the source U denotes (P − R) and is derived from the circuit mentioned above. Since the registers U and R will not simultaneously be equal to (R + S) in the division algorithm, they can share the same hardware in this term (R + S). The decoder is used to determine the modular operation in GF (2m_{) but also costs around 6% extra portion of area in FPGA.}

However, it is unnecessary in specific length design.

2to1 MUX ctrl P S + R R 2to1 MUX field 2to1 MUX ctrl << S P 2to1 MUX << R P – P 2to1 MUX ctrl 2to1 MUX ctrl – P >> 2to1 MUX ctrl R _MUX4to1 ctrl 2to1 MUX ctrl 2to1 MUX ctrl U 2to1 MUX ctrl S R S

log2m-to-m bit

Decoder

m

Figure 4.3: Universal dual-field Montgomery Divider Architecture for R and S

Table 4.5 shows the synthesize results for the universal dual-field Montgomery divider on ASIC and FPGA. The area is also approximately in proportion to the bit length m.

Table 4.5: Synthesize results for proposed universal dual-field Mont-gomery divider on ASIC and FPGA design

ASIC FPGA

Bit-length Gatecount_(Gates) Frequency_(MHz) _(Slices)Area Frequency_(MHz)

64-bit 10.3k (2.6k + 7.7k)

100

1095 52.366 128-bit 20.8k (5.0k + 15.8k) 2164 41.917 256-bit 42.1k (10.0k + 32.1k) 4535 28.243

通用型橢圓曲線密碼系統純量乘法之實現

國

立

交

通

大

學

電子工程學系 電子研究所碩士班

碩

士

論

文

通用型橢圓曲線密碼系統純量乘法之實現

An Implementation of Universal Dual-Field Scalar Multiplication on Elliptic

Curve Cryptosystems

研究生：劉耀仁

指導教授：張錫嘉 教授

通用型橢圓曲線密碼系統純量乘法之實現

An Implementation of Universal Dual-Field Scalar Multiplication on Elliptic

Curve Cryptosystems

研 究 生：劉耀仁

Student：Yao-Jen Liu

指導教授：張錫嘉

Advisor：Hsie-Chia Chang

國 立 交 通 大 學

電 子 工 程 學 系 電 子 研 究 所 碩 士 班

碩 士 論 文

通用型橢圓曲線密碼系統純量乘法之實現

學生：劉耀仁

指導教授

國立交通大學電子工程學系 電子研究所碩士班

摘

要

這篇論文中介紹了一個同時適用在有限質數場和 GF(2

)有限場的橢圓

曲線純量乘法器的通用型硬體架構。這個所提出的純量乘法器能支援最多

256 位元任意長度的有限質數場，且它也能應付 GF(2

)有限場不同的場多項

式 degree 和 p(x)。實現此可變的通用型硬體架構是根據蒙哥馬利的技術，包

括蒙哥馬利的乘法器以及除法器。而所提出的蒙哥馬利模數除法理論也可以

用來取代在蒙哥馬利域中的一個模數反元素運算和一個模數乘法運算。這個

理論在計算模數除法時，比原本橢圓曲線所使用的方法效能較好，且設計成

硬體時也比其他模數除法理論需要較小的面積。而用所提出的純量乘法器架

構來計算橢圓曲線上的純量乘法也有合理的速度，例如它只需 3.3 毫秒就可

以完成一個 192 位元的純量乘法運算。

An Implementation of Universal Dual-Field Scalar Multiplication on

Elliptic Curve Cryptosystems

student：Yao-Jen Liu

Advisors：Hsie-Chia Chang

Department of Electronics Engineering & Institute of Electronics

National Chiao Tung University

ABSTRACT

An universal hardware architecture of scalar multiplier on elliptic curves

suitable for both GF(p) and GF(2

) is introduced in this thesis. The proposed

scalar multiplier can work in arbitrary field lengths within a maximum 256-bit

length in GF(p), and it also supports various field degrees and primitives in

GF(2

). The flexible universal hardware architecture is based on the Montgomery

techniques, including the Montgomery multiplier and divider. The Montgomery

modular division algorithm is also proposed to replace the inversion followed by a

multiplication in the Montgomery domain. It provides a better performance on

modular division operations than previous ECC techniques and also has a smaller

area size than other modular division architectures. The proposed scalar multiplier

architecture can perform the scalar multiplication on elliptic curves at a

reasonable speed. For a 192-bit scalar multiplication operation, it takes about 3.3

ms.

誌

謝

An Implementation of Universal Dual-Field Scalar

Multiplication on Elliptic Curve Cryptosystems

Student: Yao-Jen Liu

Advisor: Dr. Hsie-Chia Chang

Department of Electronics Engineering

National Chiao Tung University

Contents

List of Figures

List of Tables

Chapter 1

introduction

電子工程學系電子研究所碩士班

指導教授：張錫嘉教授

研究生：劉耀仁

國立交通大學

電子工程學系電子研究所碩士班

碩士論文

國立交通大學電子工程學系電子研究所碩士班