Previous Works - A High-Hardware-Efficiency Elliptic Curve Cryptographic Processor with Countermeasures against Side-Channel Attacks

1.1.1 Elliptic Curve Cryptographic (ECC) Processor

To date, several hardware implementations of ECC have been published [15–28]. To reduce hardware complexity, a design can adopt a single finite-field architecture, either for the prime field GF(p) [17, 19, 21, 26, 29] or for the binary extension field GF(2^m) [15, 25, 27], or a fixed-modulus approach on specific elliptic curves (ECs) [20–22]. However, the applications of IEEE P1363, including digital signatures, are specified to support dual-field (DF) functions on arbitrary ECs. Exploiting carry-save adder trees in word-based multipliers is a common technique for integrating the DF data path [16, 24, 28], but the limited degree of integration among distinct arithmetic units still results in a large hardware cost.

In general, a GF(2^m) design is faster than a GF(p) design because addition over GF(2^m) is carry-free. In addition, there are several well-known techniques for high-speed GF(2^m) ECC design. A divide-and-conquer algorithm, Karatsuba-Ofman (KO) multiplication [30], is applied to reduce the number of bit operations. Classical methods for multiplying two m-bit polynomials require O(m²) bit operations; the KO algorithm reduces this to O(m^(log_2 3)), i.e., roughly O(m^1.585). When the polynomial modulus is fixed, the reduction over GF(2^m) is simple [31], and the throughput of the KO multiplier can then be raised by adopting a fully pipelined architecture [32, 33]. Another design technique over GF(2^m) with a fixed polynomial modulus is fast squaring [34]. The binary representation of a(z)² is obtained by inserting a zero bit between consecutive bits of the binary representation of a(z). Thus the computational complexity is dominated by the reduction over GF(2^m), which is easily implemented with a combinational circuit using exclusive-OR gates only. In contrast to the standard (polynomial) representation of elements over GF(2^m), the optimal normal basis (ONB) representation [7, 34] benefits squaring because it can be achieved by simple shift operations. However, the computing overhead of converting between the standard and ONB representations is unavoidable.
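The KO multiplication and fast squaring described above can be sketched in software as follows. This is a minimal illustration, not the architecture of any cited work: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of z^i), the function names and the recursion threshold are assumptions made for this example, and the modulus used in the usage note is the NIST reduction polynomial z^163 + z^7 + z^6 + z^3 + 1.

```python
def clmul_schoolbook(a: int, b: int) -> int:
    """Shift-and-XOR product of two GF(2)[z] polynomials stored as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition over GF(2) is a carry-free XOR
        a <<= 1
        b >>= 1
    return result


def clmul_karatsuba(a: int, b: int, bits: int, threshold: int = 16) -> int:
    """Karatsuba-Ofman product; both operands must fit in `bits` coefficients.

    `threshold` (assumed >= 1) is the size below which the schoolbook
    method is used instead of recursing further.
    """
    if bits <= threshold:
        return clmul_schoolbook(a, b)
    half = (bits + 1) // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    lo = clmul_karatsuba(a_lo, b_lo, half, threshold)
    hi = clmul_karatsuba(a_hi, b_hi, half, threshold)
    # Three half-size products replace four; subtraction is XOR over GF(2).
    mid = clmul_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, half, threshold) ^ lo ^ hi
    return lo ^ (mid << half) ^ (hi << (2 * half))


def square_by_bit_spreading(a: int) -> int:
    """Square a(z) by inserting a zero bit between consecutive bits of a."""
    result = 0
    position = 0
    while a:
        if a & 1:
            result |= 1 << (2 * position)
        a >>= 1
        position += 1
    return result


def reduce_mod(c: int, modulus: int) -> int:
    """Reduce c(z) modulo a fixed polynomial using XOR operations only."""
    m = modulus.bit_length() - 1
    while c.bit_length() > m:
        c ^= modulus << (c.bit_length() - 1 - m)
    return c
```

For example, with `F163 = (1 << 163) | 0b11001001`, a field squaring is `reduce_mod(square_by_bit_spreading(a), F163)` and a field multiplication is `reduce_mod(clmul_karatsuba(a, b, 163), F163)`. In hardware the bit spreading is pure wiring, which is why the reduction dominates the cost of squaring.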

For arithmetic over GF(p), the residue number system (RNS) [35, 36], which is based on the Chinese remainder theorem, represents a large integer as a set of smaller integers so that computation can be performed more efficiently. This shortens the long delay in the data path of the carry-propagation adder, and the multiple multipliers can be implemented in parallel. RNS implementations bear the extra cost of an input converter (binary-to-RNS), which translates numbers from a standard binary format into residues, and an output converter (RNS-to-binary), which translates residues back into a binary representation. An RNS implementation applied to a GF(p) ECC processor is presented in [23], where a data flow graph technique is also used to optimize the ECC functions.
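The RNS idea can be illustrated with a small sketch. The moduli below are chosen only for illustration (they are pairwise coprime but far smaller than anything used for ECC), and the function names are hypothetical. The sketch shows plain integer multiplication carried out channel by channel; the reduction modulo p that a GF(p) processor additionally needs is not covered here.

```python
from math import prod

# Illustrative pairwise-coprime moduli (251 prime, 253 = 11*23, 255 = 3*5*17, 256 = 2^8).
MODULI = (251, 253, 255, 256)


def to_rns(x: int, moduli=MODULI) -> tuple:
    """Input converter: binary integer to residues."""
    return tuple(x % m for m in moduli)


def rns_mul(a: tuple, b: tuple, moduli=MODULI) -> tuple:
    """Multiply channel by channel; no carries cross between channels."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def rns_add(a: tuple, b: tuple, moduli=MODULI) -> tuple:
    """Add channel by channel."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues: tuple, moduli=MODULI) -> int:
    """Output converter: Chinese remainder theorem reconstruction."""
    dynamic_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = dynamic_range // m
        total += r * partial * pow(partial, -1, m)  # modular inverse, Python 3.8+
    return total % dynamic_range
```

Each channel works on 8-bit values independently, which is the source of the parallelism; the result is correct as long as it stays below the product of the moduli. For example, `from_rns(rns_mul(to_rns(123456), to_rns(7890)))` reproduces `123456 * 7890`.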

For a scalable architecture that supports flexible field lengths and arbitrary moduli, the Montgomery algorithm [37] is commonly adopted. It is an efficient approach to modular multiplication over DFs because no long-precision integer division is required during Montgomery multiplication (also called Montgomery modular multiplication). The key idea is that the reduction after multiplication can be achieved by bit shifting when the domain constant is chosen as 2^m over GF(p) or x^m over GF(2^m); this constant is called the Montgomery constant. Another benefit for hardware implementation is that the GF(p) and GF(2^m) arithmetic logic units (ALUs) are well suited to integration in a VLSI circuit, because the sum output of a carry-save adder is equivalent to two bitwise exclusive-OR operations [15, 27, 38]. The overhead is the multiplexer that selects the data path according to the operating field. In [39, 40], a word-based Montgomery multiplier is presented to avoid the high fanout of the AND operators in the conventional serial-parallel architecture [15]. In [16, 24, 41], a w × w multiplier with flexible size w is exploited to trade off hardware speed against area cost. When w equals the field length m, one modular multiplication can be completed within a few clock cycles [17, 42]. Note that, although the Montgomery algorithm still incurs the overhead of conversion between the integer and Montgomery domains, the conversion can be achieved directly by the Montgomery division described in [38].
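The shift-based reduction and the dual-field sharing described above can be sketched with a bit-serial (radix-2) Montgomery multiplication. This is a software illustration under stated assumptions, not the word-based or w × w data path of the cited designs: the function names are hypothetical, the modulus must be odd over GF(p) or have a constant term of 1 over GF(2^m), and both operands must already be reduced.

```python
def montgomery_multiply(a: int, b: int, modulus: int, m: int, binary_field: bool) -> int:
    """Return a*b*2^(-m) mod p over GF(p), or a*b*x^(-m) mod f(x) over GF(2^m).

    The only difference between the two fields is the adder: integer
    addition over GF(p), bitwise XOR (carry-free) over GF(2^m).
    """
    add = (lambda x, y: x ^ y) if binary_field else (lambda x, y: x + y)
    s = 0
    for i in range(m):
        if (a >> i) & 1:
            s = add(s, b)
        if s & 1:
            s = add(s, modulus)  # make the running sum divisible by 2 (or x)
        s >>= 1                  # the division is only a one-bit shift
    if not binary_field and s >= modulus:
        s -= modulus             # final correction, needed over GF(p) only
    return s


def gf2_reduce(c: int, f: int) -> int:
    """Reduce c(x) modulo f(x) over GF(2); used here only to obtain x^(2m) mod f."""
    m = f.bit_length() - 1
    while c.bit_length() > m:
        c ^= f << (c.bit_length() - 1 - m)
    return c


def domain_constant(modulus: int, m: int, binary_field: bool) -> int:
    """Square of the Montgomery constant, used to enter the Montgomery domain."""
    if binary_field:
        return gf2_reduce(1 << (2 * m), modulus)
    return pow(2, 2 * m, modulus)


def modular_multiply(a: int, b: int, modulus: int, m: int, binary_field: bool) -> int:
    """Ordinary field product computed through the Montgomery domain."""
    r2 = domain_constant(modulus, m, binary_field)
    a_mont = montgomery_multiply(a, r2, modulus, m, binary_field)
    b_mont = montgomery_multiply(b, r2, modulus, m, binary_field)
    c_mont = montgomery_multiply(a_mont, b_mont, modulus, m, binary_field)
    return montgomery_multiply(c_mont, 1, modulus, m, binary_field)
```

As a usage example, `modular_multiply(a, b, 2**192 - 2**64 - 1, 192, False)` multiplies modulo the NIST P-192 prime, and `modular_multiply(0x57, 0x83, 0x11B, 8, True)` multiplies in GF(2^8) with the polynomial x^8 + x^4 + x^3 + x + 1. In a real scalar multiplication the domain conversions are performed once at the start and end, not per multiplication as in this sketch.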

For high-speed targets, a commonly adopted technique is parallel computation with multiple processing elements (PEs) of homogeneous architecture [18, 24, 43]. In practice, however, directly duplicating the PEs yields low hardware utilization across the various operations. Another approach to improving the computation speed of an ECC processor is the window method [34]. The key idea is to store pre-computed data in the device so that the on-line running time is reduced.
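The window method can be illustrated with a fixed-window scalar multiplication. The sketch below is an assumption-laden toy: it uses the small curve y² = x³ + 2x + 3 over GF(97) with base point (3, 6), affine coordinates, and hypothetical function names, none of which come from the cited works. It only shows how a table of pre-computed multiples replaces most point additions.

```python
P_MOD, A, B = 97, 2, 3  # toy curve y^2 = x^3 + 2x + 3 over GF(97), illustration only
G = (3, 6)              # a point on the toy curve: 6^2 = 36 = 27 + 6 + 3


def on_curve(point) -> bool:
    """Check the curve equation; None stands for the point at infinity."""
    if point is None:
        return True
    x, y = point
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def ec_add(p1, p2):
    """Affine point addition and doubling."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mul_binary(k: int, point):
    """Reference double-and-add, one key bit per iteration."""
    result = None
    for bit in bin(k)[2:]:
        result = ec_add(result, result)
        if bit == "1":
            result = ec_add(result, point)
    return result


def precompute_table(point, w: int) -> list:
    """Off-line step: table[i] = i * point for 0 <= i < 2^w."""
    table = [None]
    for _ in range((1 << w) - 1):
        table.append(ec_add(table[-1], point))
    return table


def scalar_mul_window(k: int, table: list, w: int):
    """On-line step: w doublings and at most one table addition per w key bits."""
    digits = []
    while k:
        digits.append(k & ((1 << w) - 1))
        k >>= w
    result = None
    for digit in reversed(digits):
        for _ in range(w):
            result = ec_add(result, result)
        if digit:
            result = ec_add(result, table[digit])
    return result
```

With `table = precompute_table(G, 4)`, the call `scalar_mul_window(k, table, 4)` performs at most one addition per four key bits, at the price of storing 2^4 table entries. This memory cost is the overhead referred to in the next paragraph.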

On the other hand, parallel computation and window methods, which require extra device memory, are not suitable for low-power and low-cost applications such as radio-frequency identification (RFID). ISO/IEC 18000-3 [44] is an international standard for item-level identification with passive RFID, and it also describes the parameters for air interface communications at 13.56 MHz. Several previous works [22, 29, 45, 46] target low hardware complexity. In [45], a 192-bit GF(p)/GF(2^m) ECC processor that supports a hash function [47] and consumes less than 30 µW is reported, but its execution time exceeds 1 second per operation because of the low operating frequency of 175 kHz. In [46], the GF(2^m) fast squaring approach is exploited to compute inversion efficiently in affine coordinates. In [29], a 192-bit GF(p) ECC processor is presented, in which a radix-4 Montgomery multiplication approach is used and inversion is achieved by the extended Euclidean algorithm [34]. In [22], a 163-bit GF(2^m) ECC design with a micro-controller and a bus manager is implemented to connect to the front-end module of an RFID device. Dedicated register file management is used to avoid the high complexity of multiplexers. To further reduce the number of temporary registers, a common-Z projective coordinate system modified from [48] is exploited.
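For reference, inversion over GF(p) by the extended Euclidean algorithm can be sketched as follows. This is the textbook integer version with a hypothetical function name, written only to illustrate the operation; it does not reproduce the hardware data path of [29].

```python
def mod_inverse(a: int, p: int) -> int:
    """Return a^(-1) mod p using the extended Euclidean algorithm."""
    r_prev, r_curr = p, a % p
    t_prev, t_curr = 0, 1
    # Invariant: t_prev * a = r_prev (mod p) and t_curr * a = r_curr (mod p).
    while r_curr:
        quotient = r_prev // r_curr
        r_prev, r_curr = r_curr, r_prev - quotient * r_curr
        t_prev, t_curr = t_curr, t_prev - quotient * t_curr
    if r_prev != 1:
        raise ValueError("operand is not invertible modulo p")
    return t_prev % p
```

The loop runs for a data-dependent number of iterations, each involving a division step, which is one reason why inversion is typically the most expensive field operation in affine coordinates.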

Targeting the embedded system market, a hardware/software co-design of an ECC processor is implemented in [49] and runs at 12 MHz on an 8051 micro-controller. Communication overhead due to operand transfers is reduced by integrating a direct memory access unit and by including an additional I/O register in the hardware accelerator. In [50], an FPGA-based cryptographic core compliant with the IEEE 802.15.4 standard [9] is described. It consists of three components: an AES-CCM module, a content-addressable memory that implements an access control list, and an RSA module based on Montgomery arithmetic.

1.1.2 Side-Channel Attacks (SCAs)

Traditional cryptanalysis assumes that an adversary has access only to input and output pairs, without knowledge of the internal states of the device. However, the advent of side-channel analysis showed that a cryptographic device can leak critical information. By monitoring the timing, power consumption, or electromagnetic emission of the device, or by injecting faults, adversaries can obtain information about the internally processed data or operations, and the key can then be extracted from the cryptographic device without mathematically breaking the primitives. Attacks that use such side-channel information are called side-channel attacks (SCAs).
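A toy model shows how information about internal operations can reveal a key. The sketch assumes an idealized adversary who can tell a point doubling ("D") from a point addition ("A") in a measured trace of a plain double-and-add scalar multiplication; real traces are noisy, and the function names here are hypothetical. The regular schedule at the end is a generic, well-known style of countermeasure included for contrast, not the method proposed in this work.

```python
def double_and_add_trace(key: int) -> str:
    """Operation sequence of plain double-and-add, most significant bit first."""
    trace = []
    for bit in bin(key)[2:]:
        trace.append("D")      # every key bit causes a doubling
        if bit == "1":
            trace.append("A")  # only a 1-bit causes an addition
    return "".join(trace)


def recover_key_from_trace(trace: str) -> int:
    """Read the key back from the operation sequence alone."""
    bits = []
    i = 0
    while i < len(trace):
        if i + 1 < len(trace) and trace[i + 1] == "A":
            bits.append("1")   # a doubling followed by an addition
            i += 2
        else:
            bits.append("0")   # a doubling on its own
            i += 1
    return int("".join(bits), 2)


def regular_trace(key: int) -> str:
    """Key-independent schedule: one doubling and one addition per key bit."""
    return "DA" * key.bit_length()
```

In this model `recover_key_from_trace(double_and_add_trace(k))` returns `k` for any positive `k`, whereas `regular_trace` is identical for all keys of the same bit length, so the operation sequence alone no longer distinguishes them.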

In 1999, Kocher [51] demonstrated that power measurement poses a real threat to hardware devices. A detailed description of the attacks on symmetric-key crypto engines is given in [14], and power-analysis attacks have been successfully conducted on microprocessors, ASICs, and even FPGAs. The common countermeasures against power-analysis attacks for symmetric-key crypto engines are dual-rail logic cells, which equalize the power consumption, and masking of the substitution, which depends on the key value. The former requires changes to the design flow, including the back-end physical layout, to ensure that the interconnect capacitances of the true and false output nodes of the logic gates are equal; the latter incurs hardware speed and cost overhead from the additional combinational circuit. Several published papers [52–55] present other kinds of logic cells to “balance” the power consumption. On the other hand, a systematic overview of most of the currently known SCAs and countermeasures for asymmetric-key designs is reported in [56]. However, most of the previous approaches provide theoretical analysis rather than real implementations together with measurement results. In Chapter 3, we describe the principle in more detail and evaluate power-analysis attacks on an ECC device using power measurements.

1.1.3 Summary of Paper Survey

A research review of ECC hardware implementations is briefly shown in Figure 1.4.

ECC processors with a small key size and a single field have lower hardware complexity [22, 25, 29, 49], but they sacrifice security. DF designs [24, 28, 45, 57, 58] and large-key-size approaches [38, 59] offer a higher security level. However, relatively few designs target applications such as cloud computing and portable devices, where both flexibility and device security are necessary.

Figure 1.4: Research review of ECC hardware implementation.

1.2 Motivation and Design Challenge

As described in Section 1.1, a suitable ECC processor that provides hardware efficiency together with resistance against SCAs has not yet appeared. In our work, not only the performance but also the practical applications are taken into consideration. For instance, speed is a key factor for server computing, whereas RFID devices and portable applications require low power and low cost. With current design approaches, the trade-off between speed and cost makes it difficult for a hardware designer to meet both sets of requirements.

Our design targets are listed as follows:

1. Low SCA-resistance overhead in speed, cost, and power, with no modification of the circuit design flow

2. Performance improvement through a new hardware architecture

3. Compliance with current standards, such as IEEE P1363 and IEEE 802.15.4/6

4. A high-speed ECC design for cloud computing

5. An energy-efficient and cost-effective ECC design for portable applications
