Implementation Results - Design and Implementation of an Elliptic Curve Arithmetic Unit Resistant to Simple Power Analysis

Solutions for elliptic curve arithmetic in both software and hardware are given in this work. The software simulation environment is written in the C programming language.

The design and test considerations are discussed in Section 6.1. The hardware implementation results and design flow are described in Section 6.2. RTL synthesis uses Synopsys¹ Design Compiler for the ASIC and Xilinx XST or Synplicity² Synplify Pro for the FPGA. Cadence³ Encounter is used for back-end automatic place and route (APR).

6.1 Design and Test Consideration

The hardware is designed to accelerate operations on elliptic curves, and it handles different field parameters using the Montgomery technique. The main part of the hardware is the point operation on elliptic curves; scalar multiplication is implemented in hardware using only the double-and-add algorithm.
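The double-and-add scalar multiplication mentioned above can be sketched in software as follows. This is a minimal illustration in affine coordinates, not the Verilog design: the toy curve y² = x³ + 2x + 3 over GF(97), the base point (0, 10), and all function names are assumptions chosen for readability, and field inversion here uses plain modular exponentiation rather than Montgomery division.

```python
# Toy parameters (assumed for illustration only): y^2 = x^3 + A*x + B over GF(P).
P, A, B = 97, 2, 3
INF = None          # point at infinity
G = (0, 10)         # a point on the toy curve: 10^2 = 100 = 3 (mod 97)


def field_inv(x):
    """Modular inverse in GF(P) via Fermat's little theorem (P is prime)."""
    return pow(x % P, P - 2, P)


def point_add(p1, p2):
    """Add two affine points; also handles doubling and the point at infinity."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF                                    # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * field_inv(2 * y1) % P   # tangent slope
    else:
        lam = (y2 - y1) * field_inv(x2 - x1) % P          # chord slope
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k, pt):
    """Left-to-right double-and-add: scan the bits of k from the MSB down."""
    result = INF
    for bit in bin(k)[2:]:
        result = point_add(result, result)   # always double
        if bit == "1":
            result = point_add(result, pt)   # add only when the key bit is 1
    return result


if __name__ == "__main__":
    print(scalar_mult(13, G))
```

Note that the addition step runs only for key bits equal to 1, so the sequence of operations in this plain form depends on the scalar k.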

The Verilog code for this design was generated from a parameterized module for different values of m. The test patterns are generated randomly by software. The design is verified not only by hardware-software co-simulation but also against the examples in NIST⁴ publications for greater confidence. No special technique is introduced in the FPGA implementation.

¹ Synopsys, Inc. http://www.synopsys.com/

² Synplicity, Inc. http://www.synplicity.com/

³ Cadence Design Systems, Inc. http://www.cadence.com/

⁴ National Institute of Standards and Technology. http://www.nist.gov/

6.2 Implementation Results and Comparison

6.2.1 ASIC Implementation

Table 6.1 compares the ASIC synthesis results of the proposed GFAU with those of other designs. The proposed universal dual-field GFAU consumes about 75% of the total gate count of the universal dual-field Montgomery multiplier and the universal dual-field Montgomery divider proposed in [13]. In [33], a dual-field modular divider is proposed. However, its modular divider requires one more Montgomery multiplier to convert the result back into the Montgomery domain.
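The need for the extra multiplier can be illustrated with a small reference model. The sketch below assumes R = 2^192 and the NIST P-192 prime, and `mont_mul` is a hypothetical software model of a Montgomery product, not the hardware of [13] or [33]. It shows that an ordinary modular division of two operands in Montgomery form cancels the factor R, so one Montgomery multiplication by R² mod p is needed to restore it.

```python
# Reference model only; the modulus choice and all names are assumptions.
p = 2**192 - 2**64 - 1          # NIST P-192 prime
R = 1 << 192                    # Montgomery radix, R > p and gcd(R, p) = 1
R_INV = pow(R, p - 2, p)        # R^-1 mod p (p is prime)
R2 = R * R % p                  # precomputed constant R^2 mod p


def mont_mul(a, b):
    """Montgomery product a * b * R^-1 mod p (behavioural model)."""
    return a * b * R_INV % p


def mod_div(a, b):
    """Ordinary modular division a / b mod p."""
    return a * pow(b, p - 2, p) % p


x, y = 1234567, 7654321
x_m, y_m = x * R % p, y * R % p          # operands in Montgomery form

plain = mod_div(x_m, y_m)                # (xR)/(yR) = x/y: the factor R cancels
restored = mont_mul(plain, R2)           # (x/y) * R^2 * R^-1 = (x/y) * R

print(plain == mod_div(x, y))            # quotient left the Montgomery domain
print(restored == mod_div(x, y) * R % p) # one extra multiplication restores it
```

A divider that produces the result directly in Montgomery form avoids this final conversion step.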

Table 6.1: ASIC synthesis results comparison

                      Area (gate count)
Length   Freq. (MHz)  ModDiv [33]  MontDiv [13]  MontMul [13]  GFAU
128-bit  100          22.8k        20.8k         8.3k          23.65k
256-bit  100          45.6k        42.1k         16.3k         47.4k
512-bit  100          N/A          N/A           32.1k         97.3k

Notes:
1. The GFAU can be synthesized at a clock frequency of 133 MHz.
2. The GFAU is synthesized with the UMC 0.18 µm CMOS process.
3. The modular divider in [33] is synthesized with a 0.5 µm CMOS process.
4. The Montgomery divider and Montgomery multiplier in [13] are synthesized with the UMC 0.18 µm CMOS process.

In this work, a universal field elliptic curve scalar multiplier (ECSM) and a universal dual-field elliptic curve arithmetic unit (ECAU) are proposed. The most important part of both is the proposed area-efficient GFAU. The ASIC synthesized gate counts are 226K and 277.5K, respectively, at a 133 MHz clock frequency using the TSMC 0.18 µm CMOS process. It takes 1.93 ms to complete a 192-bit prime field elliptic curve multiplication using the proposed ECSM. To make a fair comparison, we multiply the GF(P192) equivalent gate count by the elliptic curve multiplication time. The resulting values for the ECSM and ECAU are 163.54 and 401.68 (gates×ms), respectively. The ECSM value is lower than those of the previous works listed in Table 6.2.
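The area-time figure of merit above can be reproduced approximately with a few lines of arithmetic. The sketch below assumes that the P192-equivalent gate count is obtained by scaling the 512-bit gate count linearly by 192/512 and that gate counts are expressed in thousands; both are assumptions inferred from the reported numbers, and the function name is hypothetical.

```python
def p192_equiv_area_time(gates_k, field_bits, time_ms, target_bits=192):
    """Scale a gate count (in thousands) linearly to the target field length,
    then multiply by the scalar multiplication time in milliseconds."""
    equivalent_gates_k = gates_k * target_bits / field_bits
    return equivalent_gates_k * time_ms


# Gate counts and timings taken from the text; linear scaling is an assumption.
ecsm = p192_equiv_area_time(226.0, 512, 1.93)
ecau = p192_equiv_area_time(277.5, 512, 3.86)

print(round(ecsm, 2), round(ecau, 2))
```

Under these assumptions the ECAU value matches the reported 401.68, and the ECSM value comes out at about 163.57, which agrees with the reported 163.54 to within the rounding of the 226K gate count.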

Table 6.2 shows a comparison for the ASIC performance of scalar multiplication.

Table 6.2: Elliptic Curve Scalar Multiplication ASIC Performance Comparison

Author                                          A. Satoh [34]            G. Z. Lu [35]            Y. J. Liu [13]          ECSM                    ECAU
Field                                           P192/2¹⁶⁰                P192/2¹⁹²                P256/2²⁵⁶               P512/2⁵¹²               P512/2⁵¹²
Process                                         0.13 µm                  0.25 µm                  0.18 µm                 0.18 µm                 0.18 µm
Area (gate count)                               118k                     26.7k                    292.5k                  225k                    277k
Freq. (MHz)                                     137.7                    285.7                    75                      133                     133
EC mult. (ms)                                   1.44/0.19                9.75/6.75                3.3                     1.93                    3.86
P192-equivalent area × EC mult. (gatecount×ms)  172.8                    260.3                    965.25                  163.54                  401.68
Coordinate                                      projective               modified Jacobian        affine                  affine                  affine
Multiplication                                  multiplier-based         systolic radix-2         radix-2                 radix-2                 radix-2
Division                                        Fermat’s little theorem  Fermat’s little theorem  Mont. division          Mont. division          Mont. division
Note                                            64-bit multiplier        8 PEs with w = 8 bits    universal architecture  universal architecture  SPA resistant

In [13], a novel Montgomery division algorithm is proposed and used in the implementation of a universal dual-field elliptic curve scalar multiplier. The Montgomery multiplier and the Montgomery divider occupy most of the area, and no area reuse technique is introduced in that work. Therefore, the gate count is 292.5k when the field length is 256. The average execution time for computing kP in GF(P192) is 3.3 ms.

In [35], the design uses Fermat’s Little Theorem for the modular inversion operation. However, this approach is not considered efficient for a large-field design, since the computational complexity increases significantly.
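To illustrate why inversion by Fermat's Little Theorem becomes costly in large fields, the sketch below computes a⁻¹ = a^(p−2) mod p by square-and-multiply and counts the modular multiplications used. It is a generic software model, not the datapath of [35]; the function name and the choice of the NIST P-192 and P-521 primes as examples are assumptions.

```python
def fermat_inverse(a, p):
    """Return (a^(p-2) mod p, number of modular multiplications),
    using left-to-right square-and-multiply. Requires p prime, a != 0 mod p."""
    result, mults = 1, 0
    for bit in bin(p - 2)[2:]:
        result = result * result % p     # one squaring per exponent bit
        mults += 1
        if bit == "1":
            result = result * a % p      # one extra multiply per 1 bit
            mults += 1
    return result, mults


P192 = 2**192 - 2**64 - 1    # NIST P-192 prime
P521 = 2**521 - 1            # NIST P-521 prime

inv192, m192 = fermat_inverse(0x1234567, P192)
inv521, m521 = fermat_inverse(0x1234567, P521)
print(m192, m521)
```

The number of multiplications grows roughly linearly with the field length, and each multiplication itself operates on wider operands, so the total cost of one inversion rises quickly as the field grows.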

In addition, the work in [34] shows great performance using an elliptic curve cryptographic processor. It has an optimized multiplier-based Montgomery multiplier and uses projective coordinates to avoid inversion operations. For scalar multiplication, it uses a software NAF method to reduce the number of nonzero terms in k.
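The effect of a non-adjacent form (NAF) recoding on the scalar can be sketched as follows. This is the textbook recoding with digits in {−1, 0, 1}, shown as an assumption about what a software NAF step looks like; it is not taken from [34], and the function name is hypothetical.

```python
def naf(k):
    """Return the non-adjacent form of k as a list of digits in {-1, 0, 1},
    least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)    # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d             # makes k divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


k = 0b1111111                                      # 127: seven 1 bits in binary
recoded = naf(k)
print(recoded)                                     # 127 = 128 - 1
print(bin(k).count("1"), sum(d != 0 for d in recoded))
```

Each nonzero digit costs one point addition or subtraction in a double-and-add loop, so fewer nonzero digits means fewer point additions; point subtraction is cheap on elliptic curves because negating a point only negates its y-coordinate.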

In the C software simulation on an Intel Core 2 Duo E7200 with 2 GB of RAM, one scalar multiplication takes around 17 seconds on average. The simulation results below show a significant improvement in the computation time for scalar multiplication.

In the automatic place and route stage, we encountered a serious problem. The data path in the proposed design is 512 bits wide, so it contains too many wires, and the CAD tool cannot place them without negative timing slack and design rule violations. We tried the UMC 0.18 µm 1P5M, TSMC 0.18 µm 1P5M, and UMC 90 nm 1P9M CMOS processes and enlarged the timing margin, but all of these efforts were ineffective. We also tried a 256-bit design, but it did not work either. This indicates that the parallel architecture is not feasible with currently available APR tools. We suggest using a word-based architecture like that of [34] to solve this problem.

6.2.2 FPGA Implementation

The FPGA synthesis results are shown in Table 6.3:

Table 6.3: 512-bit FPGA synthesis results.

                  GFAU    UESM    UEAU
Slice             17131   34384   54376
Slice Flip Flop   2744    8505    16319
4-input LUT       33074   65904   94596
Clock rate (MHz)  20.8    20.48   17.75

C. J. McIvor proposed a multiplier-based architecture in [36]. With cascaded 16 × 16-bit multipliers, it requires only 32 clock cycles to complete one 256 × 256-bit Montgomery modular multiplication. It operates fast at the cost of relatively high area consumption.

The proposed architectures do not achieve good area and timing performance on the FPGA. In our judgment, the heavily reused hardware improves the gate count synthesized by Synopsys Design Compiler, but in Xilinx ISE the larger MUXs consume many more slices than the datapath does. As a result, the FPGA synthesis shows more slices and a longer critical path.

Table 6.4: Elliptic Curve Scalar Multiplication FPGA Performance Comparison

Author          C. J. McIvor [36]  S. B. Ors [30]           Y. J. Liu [13]   ECSM            ECAU
Field           2²⁵⁶               P160                     P256/2²⁵⁶        P512/2⁵¹²       P512/2⁵¹²
Platform        XC2VP125           XV1000E                  XC2V8000         XC4VLX160       XC4VLX160
Slices          15755              N/A                      18146            34384           54376
Freq. (MHz)     39.46              91.3                     18.768           20.48           17.75
EC mult. (ms)   3.86 (256-bit)     14 (160-bit)             18.77 (192-bit)  1.93 (192-bit)  3.86 (SPA)
Coordinate      projective         modified Jacobian        affine           affine          affine
Multiplication  multiplier-based   systolic radix-2         radix-2          radix-2         radix-2
Division        ModDiv             Fermat’s little theorem  Mont. division   Mont. division  Mont. division

Note: 16-bit multipliers [36]; not optimized; scalar multiplier (ECSM); SPA resistant (ECAU).

