
Implementation Results

5.2 Hardware Implementation

5.2.1 ASIC Implementation

Figure 5.1 illustrates the entire ASIC design and testing flow and the CAD (Computer-Aided Design) tools used at each stage. The design is evaluated by pre-layout gate-level simulation, which cannot determine the circuit speed precisely. Post-layout gate-level simulation results are therefore expected to be worse than the pre-layout results reported here.

[Figure 5.1: ASIC design and testing flow, starting from RTL coding]

Table 5.1 compares the ASIC performance of scalar multiplication. In this work, computing kP in GF(P192) takes 3.3 ms on average. The execution time is estimated to range from 2.4 ms in the best case to 4.2 ms in the worst case, corresponding to a latency of roughly 180k to 320k clock cycles.
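The relation between the cycle counts and the execution times quoted above can be checked with a short calculation. The sketch below assumes the 75 MHz clock listed for the proposed design in Table 5.1; the function name `cycles_to_ms` is illustrative and not part of the thesis.

```python
def cycles_to_ms(cycles: int, freq_mhz: float) -> float:
    """Convert a latency in clock cycles to milliseconds at the given clock."""
    return cycles / (freq_mhz * 1e6) * 1e3

# Assumption: the proposed ASIC runs at 75 MHz (value taken from Table 5.1).
FREQ_MHZ = 75.0

best_ms = cycles_to_ms(180_000, FREQ_MHZ)   # best-case latency
worst_ms = cycles_to_ms(320_000, FREQ_MHZ)  # worst-case latency

print(f"best case:  {best_ms:.2f} ms")
print(f"worst case: {worst_ms:.2f} ms")
```

Under this assumption, 180k cycles correspond to 2.4 ms and 320k cycles to just under 4.3 ms, which agrees with the quoted range up to rounding.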

In contrast to the proposed design, the work in [30] achieves high performance with an elliptic curve cryptographic processor. It has a Montgomery multiplier and uses projective coordinates to avoid inversion operations. For scalar multiplication, it uses a software NAF (non-adjacent form) method to reduce the number of nonzero terms in k. The proposed design, by contrast, focuses on a powerful dual-field arithmetic operator for elliptic curves implemented as an ASIC.
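To illustrate how a NAF recoding reduces the number of nonzero terms in k, the following is a minimal sketch of the standard non-adjacent form algorithm. It is a textbook formulation, not the software routine of [30]; the function name `naf` is an assumption made for this example.

```python
def naf(k: int) -> list:
    """Return the NAF digits of k (least significant first), each in {-1, 0, 1}."""
    digits = []
    while k > 0:
        if k & 1:
            # Choose the digit so that (k - d) is divisible by 4,
            # which forces the next digit to be zero.
            d = 2 - (k % 4)  # 1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

k = 7                      # binary 111: three nonzero bits
digits = naf(k)            # signed digits of 8 - 1
nonzero = sum(1 for d in digits if d != 0)
print(digits, nonzero)
```

For k = 7 the binary expansion has three nonzero bits, while the NAF representation 8 - 1 has only two, so a double-and-add loop would perform one fewer point addition (at the cost of supporting point subtraction).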

In [31], the design uses Fermat's Little Theorem for modular inversion. This approach is not considered efficient for large fields, since the computational complexity increases significantly with the field size. The proposed Montgomery modular inversion and division algorithm, based on the EEA (extended Euclidean algorithm), clearly reduces the computation time for inversion. The division algorithm is chosen in this thesis because elliptic curve point operations in affine coordinates always require a division.
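The two inversion approaches can be contrasted in software. The sketch below shows inversion by Fermat's Little Theorem (one full modular exponentiation) next to a plain extended Euclidean inversion. It is not the Montgomery modular division circuit proposed in this thesis; the function names are illustrative, and the modulus is assumed to be the NIST P-192 prime.

```python
def inv_fermat(a: int, p: int) -> int:
    """Inverse via Fermat's Little Theorem: a^(p-2) mod p, for prime p."""
    return pow(a, p - 2, p)

def inv_eea(a: int, p: int) -> int:
    """Inverse via the extended Euclidean algorithm."""
    r0, r1 = p, a % p
    t0, t1 = 0, 1
    while r1 != 0:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    if r0 != 1:
        raise ValueError("a is not invertible modulo p")
    return t0 % p

# Assumption: GF(P192) refers to the NIST P-192 prime field.
P192 = 2**192 - 2**64 - 1
a = 0x123456789ABCDEF0123456789ABCDEF

assert inv_fermat(a, P192) == inv_eea(a, P192)
assert a * inv_eea(a, P192) % P192 == 1
```

The Fermat variant needs on the order of the field size in modular multiplications, whereas the Euclidean variant uses only subtractions, shifts and divisions by small quotients, which is the motivation given above for an EEA-based design.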

In a software simulation written in C, one scalar multiplication takes around 300 ms on average. A complete ECDSA operation takes about 1290 ms, including signature generation and verification.

Signature generation and verification together involve four scalar multiplications. Scalar multiplication therefore accounts for most of the ECDSA running time and calls for dedicated hardware acceleration. The simulation results below show a significant improvement in the computation time for scalar multiplication.
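The share of ECDSA time spent in scalar multiplication follows directly from the software figures quoted above. The short calculation below uses only those numbers (300 ms per scalar multiplication, four multiplications, 1290 ms in total); the variable names are illustrative.

```python
SCALAR_MULT_MS = 300.0    # average software time per scalar multiplication
NUM_SCALAR_MULTS = 4      # scalar multiplications in signature plus verification
ECDSA_TOTAL_MS = 1290.0   # total software time for signature plus verification

scalar_ms = NUM_SCALAR_MULTS * SCALAR_MULT_MS
share = scalar_ms / ECDSA_TOTAL_MS

print(f"scalar multiplication: {scalar_ms:.0f} ms of {ECDSA_TOTAL_MS:.0f} ms "
      f"({share:.0%})")
```

With these figures, scalar multiplication takes about 1200 ms of the 1290 ms, or roughly 93% of the total.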

Table 5.1: Elliptic Curve Scalar Multiplication ASIC Performance Comparison

| Author | A. Satoh [30] | G. Z. Lu [31] | Proposed |
|---|---|---|---|
| Field | GF(P192) / GF(2¹⁶⁰) | GF(P192) / GF(2¹⁹²) | GF(P256) / GF(2²⁵⁶) |
| Platform | 0.13 µm CMOS | 0.25 µm CMOS | 0.18 µm CMOS |
| Gate count (gates) | 120.2k | 26.7k | 292.5k |
| Frequency (MHz) | 137.7 | 285.7 | 75 |
| EC mult. (ms) | 1.44 / 0.19 | 9.75 / 6.75 | 3.3 ¹ |
| Note | 64-bit multiplier | 8 PEs with w = 8 bits | Universal dual-field architecture |

¹ The EC mult. timing of the proposed design is for a 192-bit length.

5.2.2 FPGA Implementation

Figure 5.2 illustrates the FPGA design and testing flow, in contrast to the ASIC design flow. Since this work mainly targets an ASIC implementation, no FPGA-specific techniques are used to improve performance: neither block RAM nor dedicated fixed-width multipliers are used to accelerate the design. As a result, the FPGA implementation is slightly worse in timing performance, but it is useful for fast verification and provides reliable hardware information.

[Figure 5.2: FPGA design and testing flow, starting from RTL coding]

Table 5.2 compares the FPGA performance of scalar multiplication. Few published designs use a similar parallel architecture for a universal dual-field elliptic curve scalar multiplier, so the table lists a selection of implementations for reference only.

Table 5.2: Elliptic Curve Scalar Multiplication FPGA Performance Comparison

| Author | Field | Platform | Area (Slices) | Freq. (MHz) | Latency (Cycles) | EC mult. (ms) |
|---|---|---|---|---|---|---|
| C. J. McIvor [32] | GF(P256) | Xilinx XC2VP125 (Virtex-II Pro) | 15755 | 39.46 | 151.3k | 3.86 |
| W. C. Hsu [33] | GF(2¹⁶³) | Xilinx XC2V8000 | 8815 (LUTs) | 90 | 37k | 0.41 |
| N. A. Saqib [34] | GF(2¹⁹¹) | Xilinx XCV3200E | 18314 (plus 24 BRAMs) | 9.99 | 573 | 0.05 |
| Leong [35] | GF(2¹⁷³) | Microcoded processor | - | - | 310k | 11.1 |
| Proposed | GF(2²⁵⁶) | Xilinx XC2V8000 | 18146 | 18.768 | 250k | 13.32 ¹ |

¹ The EC mult. timing of the proposed design is for a 192-bit length.

The authors of [32] have published extensively on Montgomery techniques in recent years. The latency of their Montgomery multiplication is considerably shorter than that of the design proposed in this thesis: one 256-bit multiplication takes only 32 clock cycles, achieved by cascading 16 × 16-bit multipliers. The trade-off is area, since the Montgomery multiplier alone requires 11992 slices. Despite this large area cost, the design performs 256-bit scalar multiplication quickly.
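For orientation, the operation that a Montgomery multiplier computes can be modelled in a few lines of software. The sketch below is the textbook Montgomery reduction with R = 2^bits; it describes neither the cascaded datapath of [32] nor the circuit of this thesis. The function names and the choice of the NIST P-192 prime as modulus are assumptions made for the example.

```python
def montgomery_setup(n: int, bits: int):
    """Return (R, n') with R = 2^bits and n' = -n^(-1) mod R, for odd n < R."""
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r
    return r, n_prime

def mont_mul(a: int, b: int, n: int, bits: int, n_prime: int) -> int:
    """Montgomery product a * b * R^(-1) mod n, for 0 <= a, b < n."""
    r_mask = (1 << bits) - 1
    t = a * b
    m = ((t & r_mask) * n_prime) & r_mask  # makes t + m*n divisible by R
    u = (t + m * n) >> bits                # exact division by R
    return u - n if u >= n else u

# Assumption: a 192-bit prime modulus (NIST P-192) with R = 2^192.
n, bits = 2**192 - 2**64 - 1, 192
r, n_prime = montgomery_setup(n, bits)

x, y = 1234567, 7654321
x_m, y_m = x * r % n, y * r % n                 # convert to Montgomery form
prod_m = mont_mul(x_m, y_m, n, bits, n_prime)   # product, still in Montgomery form
prod = mont_mul(prod_m, 1, n, bits, n_prime)    # convert back to normal form
assert prod == x * y % n
```

The reduction needs only multiplications, masking and a shift, with no trial division by n, which is why the technique maps well onto hardware multipliers.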

The works [33] and [34] present the design with the higher clock frequency and the design with the lower latency, respectively. Both perform well on scalar multiplication. In [33], the design works in a normal basis and uses projective coordinates to avoid inversion operations. The work in [35] reports the performance of a microcoded EC processor.

Chapter 6 Conclusion

This thesis presents a complete hardware and software solution for scalar multiplication on elliptic curves over both GF(p) and GF(2^m). Montgomery techniques are employed to handle the various field conditions. In affine coordinates, computing the parameter λ in point addition requires a slow division, so a Montgomery modular division algorithm based on the EEA is proposed to replace an inversion followed by a multiplication. The Montgomery divider plays an important role in elliptic curve scalar multiplication, since it accounts for 40% of the total latency. The Montgomery multiplier is another important component. The implementation of these two functions in this work offers a reasonable trade-off between area and speed, making it suitable for hardware acceleration of the most complicated operations on elliptic curves.
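The role of the division in affine point addition can be seen in a small software model. The sketch below implements affine addition and a double-and-add scalar multiplication on a curve y² = x³ + ax + b over GF(p), with the slope λ obtained through a single helper `mod_div`, which is the operation a hardware divider would take over. The toy curve (p = 17, a = 2, b = 2, base point (5, 1)) and all function names are assumptions for illustration; the code is not the thesis architecture and does not cover GF(2^m).

```python
def mod_div(num: int, den: int, p: int) -> int:
    """Modular division num / den mod p (the step a hardware divider replaces)."""
    return num * pow(den % p, -1, p) % p

def point_add(P, Q, a: int, p: int):
    """Affine addition on y^2 = x^3 + a*x + b over GF(p); None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                  # P + (-P) = infinity
    if P == Q:
        lam = mod_div(3 * x1 * x1 + a, 2 * y1, p)    # tangent slope (doubling)
    else:
        lam = mod_div(y2 - y1, x2 - x1, p)           # chord slope (addition)
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def scalar_mult(k: int, P, a: int, p: int):
    """Left-to-right double-and-add computation of kP."""
    R = None
    for bit in bin(k)[2:]:
        R = point_add(R, R, a, p)
        if bit == "1":
            R = point_add(R, P, a, p)
    return R

# Toy parameters (assumption): y^2 = x^3 + 2x + 2 over GF(17), base point G = (5, 1).
p, a = 17, 2
G = (5, 1)
print(point_add(G, G, a, p))      # 2G
print(scalar_mult(3, G, a, p))    # 3G
```

Every doubling and every addition calls `mod_div` exactly once, so the cost of the divider directly shapes the latency of the whole scalar multiplication.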

According to the implementation results, the design is synthesized in 0.18 µm CMOS technology with 285k gates, and on a Xilinx Virtex-II XC2V8000 FPGA with 18146 slices. A scalar multiplication takes about 300 ms in software but only about 3 ms in hardware, a speedup of roughly 100 times. Compared with other designs, the proposed fully parallel architecture for the elliptic curve scalar multiplier consumes a large area. However, the total area of the entire scalar multiplier is not taken into consideration in this thesis, which mainly reports the computation time of scalar multiplication using the proposed Montgomery method. The method provides another way of implementing point addition in affine coordinates.

Appendix A
