Jacobi Symbol and Galois Field Arithmetic Unit (JS-GFAU)


6.1 Jacobi Symbol and Galois Field Arithmetic Unit (JS-GFAU)

6.1.1 Fully-Pipelining Scheme

Because the iterative operations shown in Algorithm 7 are performed within one cycle, the critical path is the calculation of operands R or S, which consists of the UV comparison followed by the modular operations. The time-critical comparisons U > V, U/2 > V, and U > V/2 in Steps 8, 11, and 14 are realized by subtraction, so each costs roughly one addition delay. Since the values of operands U and V do not depend on the results of operands R and S, a pipeline stage is inserted between the UV and RS data paths to reduce the critical path delay. Figure 6.3(a) and Figure 6.3(b) illustrate the hardware behavior of the pipelining scheme. After initialization, the UV data path is determined in the first cycle. The next cycle sets the values of operands R and S and simultaneously determines the second case of the UV comparison. The following cycles proceed in the same way until V = 0. Although pipelining requires one additional cycle, this is negligible because the division takes hundreds of cycles.

6.1.2 Programmable Data Path of Modular Reduction with Ladder Selection

Figure 6.3: (a) Data path separation of UV comparison and RS calculation. (b) The fully-pipelining scheme of hardware implementation for the proposed radix-4 RMD in Algorithm 7.

To calculate the operands within the finite field set over GF(p) in Algorithm 5, Algorithm 7, and Algorithm 8, a low-level parallel architecture with the 2's complement number system is exploited. The values of all operands are bounded by the interval [0, p). For instance, when processing the modular reduction S ≡ 4S (mod p), 4S is obtained by shifting operand S left by two bits, and the result must then be brought back into the interval [0, p). To achieve this, the arithmetic functions fS_p1 = 4S − 3p, fS_p2 = 4S − 2p, fS_p3 = 4S − p, and fS_p4 = 4S, which subtract different multiples of the modulus, are carried out simultaneously, while the correct value is sequentially determined with a ladder selection [58] by checking the sign bit: if fS_p1 is non-negative, then S = fS_p1; else if fS_p2 is non-negative, then S = fS_p2; else if fS_p3 is non-negative, then S = fS_p3; else S = fS_p4. These multiple modular operations in the iterative calculation can be effectively implemented by using a programmable data path of bit-level architecture, which consists of carry-save adders with a carry-lookahead adder at the last stage [38].
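The ladder selection for S ≡ 4S (mod p) can be illustrated in software. The following is a minimal sketch, not the hardware design itself: the function name quadruple_mod_p is hypothetical, Python integers stand in for the 2's complement data path, and the four candidates are evaluated one after another here although the text describes them as computed simultaneously.

```python
def quadruple_mod_p(s: int, p: int) -> int:
    """Return 4*s mod p for 0 <= s < p using a ladder selection.

    Hypothetical helper for illustration only; it mirrors the four
    candidates 4S - 3p, 4S - 2p, 4S - p, 4S described in the text.
    """
    if not 0 <= s < p:
        raise ValueError("operand must lie in [0, p)")

    four_s = s << 2  # 4S by shifting left two bit positions

    # In hardware these candidates are produced in parallel.
    candidates = (four_s - 3 * p, four_s - 2 * p, four_s - p, four_s)

    # Ladder: take the first candidate whose sign bit is clear.
    for candidate in candidates:
        if candidate >= 0:
            return candidate

    # Unreachable: the last candidate 4S is never negative.
    raise AssertionError("ladder selection found no candidate")


if __name__ == "__main__":
    print(quadruple_mod_p(20, 23))  # 80 - 3*23 = 11
```

Because 0 ≤ S < p implies 0 ≤ 4S < 4p, the first non-negative candidate in this order always lies in [0, p), which is why three subtractions and one pass-through are enough.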

6.1.3 Modular Halving, Quartering by Bitwise Shifting

In Algorithm 7, the halving and quartering of the UV data path can be achieved simply by shifting right by one and two bit positions, because the least significant one and two bits of the intermediate values are guaranteed to be zero. However, the least significant bits of operands R and S are undetermined during the iterative calculation. Here, the modulus p is used to force the least significant one and two bits of R and S to zero on the fly.

To simplify the illustration, the intermediate value of R or S is denoted as X, where a subscript gives the bit position in the binary representation. The modular halving operation X/2 (mod p) is achieved by performing (X + X0 · p) >> 1, since the prime p must be odd. The modular quartering operation X/4 (mod p) is conducted as follows: if (X1, X0) = (0, 0), X is shifted right by two bit positions; if (X1, X0) = (1, 0), (X − 2p) >> 2 is performed; if (X1, X0) = (1, 1) or (0, 1), there are two sub-cases. If the least significant two bits of X − p are (1, 0), (X − 3p) >> 2 is performed, because bit 1 of −3p is the complement of bit 1 of −p; otherwise, (X − p) >> 2 is performed. As a result, the overall modular halving and quartering operations in Algorithm 5, Algorithm 7, and Algorithm 8 can be implemented by bitwise shifting with simple logic gates, without time-consuming modular division.
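The case analysis above can be checked with a short software model. This is a sketch under stated assumptions: the function names halve_mod_p and quarter_mod_p are hypothetical, Python's arbitrary-precision integers (whose >> is an arithmetic shift) stand in for the 2's complement registers, and the final "add p if negative" step is an assumed stand-in for the range correction that the text assigns to the ladder selection.

```python
def halve_mod_p(x: int, p: int) -> int:
    """Return x/2 mod p for odd p and 0 <= x < p via (X + X0*p) >> 1."""
    x0 = x & 1                 # least significant bit of X
    return (x + x0 * p) >> 1   # adding odd p makes an odd X even


def quarter_mod_p(x: int, p: int) -> int:
    """Return x/4 mod p for odd p and 0 <= x < p by shifting."""
    low2 = x & 3               # the bit pair (X1, X0)
    if low2 == 0:              # (X1, X0) = (0, 0)
        t = x
    elif low2 == 2:            # (X1, X0) = (1, 0)
        t = x - 2 * p
    elif (x - p) & 3 == 2:     # X odd, low two bits of X - p are (1, 0)
        t = x - 3 * p
    else:                      # X odd, X - p already divisible by 4
        t = x - p

    q = t >> 2                 # exact: t is a multiple of 4

    # Assumption: one corrective addition of p brings a negative
    # quotient back into [0, p); the text leaves this to the ladder.
    return q + p if q < 0 else q


if __name__ == "__main__":
    print(halve_mod_p(5, 23))    # 14, since 2 * 14 = 28 = 5 (mod 23)
    print(quarter_mod_p(5, 23))  # 7, since 4 * 7 = 28 = 5 (mod 23)
```

Since the selected value t always exceeds −3p, the shifted quotient exceeds −p, so a single addition of p is sufficient in this model.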

6.1.4 Arithmetic Unit Integration

To map the multiple modular operations in Algorithm 7 onto a hardware unit without using distinct circuit components or a complex operand-selection multiplexer, symmetric operations such as (R − S)/4 (mod p) and (S − R)/4 (mod p) are executed by the same computational unit with a swap logic circuit. In Algorithm 7, the RS data path within Steps 4 to 18 is classified into two groups: the first group includes Steps 6, 9, 12, 15, and 18; the second consists of the remaining steps. The two operands R

previous cycle. Furthermore, since the ECPC and ECPG consist of serial field operations, both the temporary registers and the modular operations in Algorithm 5, Algorithm 7, and Algorithm 8 can be reused.

Figure 6.4 shows the detailed architecture of the Galois field arithmetic unit (GFAU), which supports the radix-4 RMM in Algorithm 5, the radix-4 RMD in Algorithm 7, and modular addition and subtraction over DFs. Without pipelining, the delay path is equal to (1) + (2) + (3) + (5) over GF(p) and (1) + (2) + (4) + (5) over GF(2^m). Delay path (1) is eliminated by the pipeline stage of the data path separation, so the RS select signal is delayed by one cycle relative to the UV select signal. In addition, the swap logic circuit can be implemented with an exclusive-OR operator that exchanges the input operands of the RS data path when the previous and current swap signals have inverse values.

After arithmetic processing, the ladder selection picks out the value belonging to the finite field set. Note that the MAS is implemented with a similar design approach but with less hardware complexity than the GFAU; the circuit components of the MAS are depicted in gray in Figure 6.4.

In comparison with the previous works on GF(p256) field arithmetic units in [92] and [93], we also implement our processing elements (PEs) using the identical field length on the same FPGA family. Table 6.1 gives the performance results. Owing to the pipelined and highly integrated architecture, our design has a lower area-time (AT) product and is at least two times faster than the others.


Table 6.1: Implementation Results of GF(p256) GFAU and MAS on Xilinx Virtex-II FPGA Device with Comparison

Design     Area       f       Multiplication          Division
           (Slices)   (MHz)   Time (µs/Op.)   AT      Time (µs/Op.)   AT
[93]       5,477      14      18.28           1       43.89           1
[92]       5,379      34      7.53            0.40    13.55           0.30
Our GFAU   9,213      37      3.46            0.29    4.98            0.18
Our MAS    4,843      37      3.46            0.13    -               -

AT product = area × time.

6.2 Heterogeneous Processing Elements (PEs) and

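Source document: High Hardware-Performance Elliptic Curve Cryptographic Processor with Defense Against Side-Channel Information Leakage Attacks (pp. 87-92)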