New RSA cryptosystem hardware design based on Montgomery's algorithm


Substituting for the residuesx1= m10 1 and for x3= 1 satisfying the odd parity condition and simplifying

2n < 0m1m3+ 2 0 m1 n < 0m1(n + 1) + 1 < 0m1 1 + 1_n + 1_n < 0m10 2 + 2_n < 0m3+ 2_n < 0m2: Therefore, 0m3:

Proof of C4: For the remaining range (m3 > > 0m3), the correctionm₁m₃=2 yields the correct result.

ACKNOWLEDGMENT

The authors wish to express their gratitude to the anonymous reviewers who have taken time to go through this brief, and are thankful for their comments, which have considerably enhanced the quality of this brief.

REFERENCES

[1] A. B. Premkumar, “An RNS to binary converter in 2n + 1, 2n, 2n − 1 moduli set,” IEEE Trans. Circuits Syst. II, vol. 39, pp. 480–482, July 1992.

[2] K. M. Ibrahim and S. N. Saloum, “An efficient residue to binary converter design,” IEEE Trans. Circuits Syst., vol. 35, pp. 1156–1158, Sept. 1988.

[3] S. Andraos and H. Ahmad, “A new efficient memoryless residue to binary converter,” IEEE Trans. Circuits Syst., vol. 35, pp. 1441–1444, Nov. 1988.

[4] P. Bernardson, “Fast memoryless over 64 bits, residue to binary converter,” IEEE Trans. Circuits Syst., vol. CAS-32, pp. 298–300, Mar. 1985.

[5] B. Guan and E. V. Jones, “Fast conversion between binary and residue numbers,” Electron. Lett., vol. 24, no. 19, pp. 1195–1197, Sept. 1988.

[6] B. Vinnakota and V. V. B. Rao, “Fast conversion techniques for binary to RNS,” IEEE Trans. Circuits Syst. I, vol. 41, pp. 927–929, Dec. 1994.

[7] S. J. Piestrak, “A high speed realization of residue to binary number system conversion,” IEEE Trans. Circuits Syst. II, vol. 41, pp. 661–663, Dec. 1995.

[8] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology. New York: McGraw-Hill, 1967.

A New RSA Cryptosystem Hardware Design Based on Montgomery’s Algorithm

Ching-Chao Yang, Tian-Sheuan Chang, and Chein-Wei Jen

Abstract—In this paper, we propose a new algorithm based on Montgomery’s algorithm to calculate modular multiplication, which is the core arithmetic operation in an RSA cryptosystem. The modified algorithm eliminates the over-large residue and has a very short critical path delay, which yields very high-speed processing. The new architecture based on this modified algorithm takes about $1.5n^2$ clock cycles on average to finish one n-bit RSA operation. We have implemented a 512-bit single-chip RSA processor based on the modified algorithm with the Compass 0.6-μm SPDM CMOS cell library. The simulation results show that the processor can operate at up to 125 MHz and deliver a baud rate of 164 kbit/s on average.

I. INTRODUCTION

As telecommunication networks have grown explosively and the Internet has become increasingly popular, security over the network has become the main concern for further services such as electronic commerce [1]. The fundamental security requirements include confidentiality, authentication, data integrity, and nonrepudiation. To provide such security services, most systems use public key cryptography. Among the various public key cryptography algorithms, the RSA cryptosystem [2] is the best known, most versatile, and most widely used today. In public key cryptography algorithms, the essential arithmetic operation is modular multiplication, which is used to calculate modular exponentiation. However, modular exponentiation on numbers of hundreds of bits (512 bits or more) makes it difficult for the RSA algorithm to attain high throughput.

An attractive method for faster implementations is based on Montgomery’s modular multiplication algorithm [3], [4], in which the quotient depends only on the least significant digit of the operands. Various algorithm modifications and hardware designs of Montgomery’s algorithm can be found in [5]–[11]. To speed up processing, [5] and [7] use a high-radix technique to reduce the number of clock cycles required. In [6], quotient determination is avoided by shifting the multiplicand by 2 bits, which achieved a significant speed-up of modular multiplication. However, all the methods in [5]–[9] suffer from an over-large residue, so an additional final reduction is required, which increases the hardware and time complexity. Although this problem is solved in [10] and [11], the number of iterations is doubled. Also, in [5] and [7], the intermediate result is in carry-save form or a redundant representation, whereas the input operands of the next modular multiplication are assumed to be in nonredundant binary form. Therefore, additional cycles are needed at the start of the next iteration to convert the data format. In this paper, we propose a modified Montgomery’s algorithm to eliminate the aforementioned problems. The algorithm modifies the range of the partial product by separating the multiplication and modular reduction operations so that the output falls in the right range after postprocessing. Therefore, we avoid the over-large residue problem and the additional subtraction procedure. The hardware

Manuscript received January 6, 1997; revised October 1, 1997. This work was supported by the National Science Council, R.O.C., under Grant NSC 84-2215-E009–057. This paper was recommended by Associate Editor K. K. Parhi.

The authors are with the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.

Publisher Item Identifier S 1057-7130(98)05048-4.


Montgomery’s modular multiplication algorithm [3] avoids range comparison, which is the critical operation in the traditional division used in modular multiplication. However, this algorithm requires some preprocessing and postprocessing to remove an extra factor and to limit the range of the intermediate output. To address these problems, Chen [10], [11] proposed a modified Montgomery’s algorithm. Compared with the original Montgomery algorithm, Chen’s algorithm has a simpler modular reduction step, and its output falls in the range smaller than N. However, the number of iteration steps is increased from n to 2n.

To eliminate such problems and keep the iteration step as low as in the original Montgomery algorithm, we propose a modified algorithm in C-like code as follows:

Algorithm 1 (The modified Montgomery’s algorithm)

(Inputs):
  Modulus: N (n-bit binary representation)
  Multiplier: $A = (a_{n-1} a_{n-2} \cdots a_0)_2$
  Multiplicand: B (n-bit number)
(Outputs):
  Result: $R \equiv A \cdot B \cdot 2^{-n} \pmod{N}$, $R < C_1 + N + 1$
  Quotient: $Q = (q_{n-1} q_{n-2} \cdots q_0)_2$
(Algorithm):
MM(A, B, N) {
  P1: $A \cdot B = C = C_1 \cdot 2^n + C_0$; ($C_1, C_0 < 2^n$, $C_0 = (c_{n-1} c_{n-2} \cdots c_0)_2$)
  P[0] = 0;
  P2: for (i = 0; i < n; i++) {
    $q_i$ = (P[i] + $c_i$) mod 2; // even or odd?
    P[i+1] = (P[i] + $q_i \cdot N$ + $c_i$) div 2; // right shift 1-bit
  }
  R = P[n] + $C_1$;
  Return R;
}
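Algorithm 1 can be sketched in Python as follows. This is a minimal behavioral sketch, not the authors' implementation: the function name `modified_mm` and its argument names are hypothetical, the modulus is assumed to be odd, and Python's arbitrary-precision integers stand in for the n-bit registers.

```python
def modified_mm(a: int, b: int, modulus: int, n: int) -> int:
    """Sketch of Algorithm 1: returns R with R * 2**n ≡ a * b (mod modulus).

    Assumptions (not from the paper's text): modulus is odd and the
    name/signature are illustrative only.
    """
    if modulus % 2 == 0:
        raise ValueError("modulus must be odd")
    # P1: split the product into its lower n bits (C0) and the rest (C1).
    c = a * b
    c0 = c & ((1 << n) - 1)
    c1 = c >> n
    # P2: Montgomery reduction of C0, one bit of C0 per iteration.
    p = 0
    for i in range(n):
        c_i = (c0 >> i) & 1
        q_i = (p + c_i) & 1                 # parity decides the quotient bit
        p = (p + q_i * modulus + c_i) >> 1  # sum is even, so the shift is exact
    # The upper part has the same weight as P[n], so it is simply added.
    return p + c1


if __name__ == "__main__":
    r = modified_mm(123, 200, 239, 8)
    print(r, (r * 2**8 - 123 * 200) % 239 == 0)
```

The result is not fully reduced; it only agrees with the usual Montgomery product modulo the modulus, which is what the later postprocessing step relies on.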

In this modified algorithm, similar to Chen’s algorithm, the modular multiplication operation is split into a multiplication procedure (P1) and a Montgomery modular reduction procedure (P2). In P1, we take only the lower n bits ($C_0$) of the product to do the Montgomery reduction, instead of the whole 2n bits as in Chen’s algorithm. The higher n bits ($C_1$) of the product have the same weight $2^n$ as $P[n]$. We sum $C_1$ with $P[n]$ to get the result (where $P[n]$ is the result of $C_0$ after Montgomery reduction). Although the value of our result ($R = C_1 + P[n]$) differs from the result ($R[n]$) of Montgomery’s algorithm, they are equivalent modulo N. This equivalence is verified in the Appendix.

The modified Montgomery algorithm cannot be applied directly to modular exponentiation because of the extra factor $2^{-n}$ and the residue. We add preprocessing and postprocessing steps to solve these problems. First, to bound the output range, we add one extra bit of precision to the intermediate results A and B. This increases the number of iteration steps from n to n + 2, and the extra factor becomes $2^{-(n+2)}$. The extra factor is not removed explicitly. Instead, we first preprocess M by taking M and ($2^{2(n+2)} \bmod N$) as input operands to compute $M' = M \cdot 2^{(n+2)} \pmod{N}$, so the unwanted factor is removed automatically.

After the last iteration of the modular multiplication operation, we postprocess R by taking the result and 1 as input operands to remove the extra factor, i.e., $R = \mathrm{MM}(M^E \cdot 2^{n+2}, 1)$. We can observe that if the input operand is 1, the higher n bits of the product ($C_1$) are zero, so the output of postprocessing is less than the modulus N. Therefore, we not only remove the unwanted factor $2^{n+2}$ from the result but also make the result fall in the right range after postprocessing. The new modified algorithm for computing modular exponentiation is described below:

Algorithm 2 (The modified modular exponentiation algorithm)

(Inputs):
  Modulus: N (n-bit number)
  Exponent: $E = (e_{k-1} e_{k-2} \cdots e_0)_2$
  Message: M (n-bit number)
  Constant: $C = 2^{2(n+2)} \bmod N$
(Function):
  $\mathrm{MM}(A, B, N) = A \cdot B \cdot 2^{-(n+2)} \pmod{N}$
(Outputs):
  Result: $R[k-1] = M^E \bmod N$, $R[k-1] < N$
(Algorithm):
MM_E(M, E, N, C) {
  M′ = MM(M, C, N); // pre-processing
  R[0] = M′;
  for (i = 0; i < k − 1; i++) {
    R[i+1] = MM(R[i], R[i], N);
    if ($e_{k-i-2}$ == 1) R[i+1] = MM(R[i+1], M′, N);
    else R[i+1] = R[i+1];
  }
  R[k−1] = MM(R[k−1], 1, N); // post-processing
  return R[k−1];
}
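The exponentiation procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `mm` and `mod_exp` are hypothetical, the modulus is assumed odd, the exponent is assumed to be at least 1, and the constant $2^{2(n+2)} \bmod N$ is computed with Python's built-in `pow` instead of being supplied as an input.

```python
def mm(a: int, b: int, modulus: int, steps: int) -> int:
    """Modified Montgomery product: R * 2**steps ≡ a * b (mod modulus)."""
    c = a * b
    c0 = c & ((1 << steps) - 1)  # low `steps` bits go through the reduction
    c1 = c >> steps              # high bits are added back at the end
    p = 0
    for i in range(steps):
        c_i = (c0 >> i) & 1
        q_i = (p + c_i) & 1
        p = (p + q_i * modulus + c_i) >> 1
    return p + c1


def mod_exp(m: int, e: int, modulus: int) -> int:
    """Left-to-right square-and-multiply using n + 2 reduction steps.

    Assumes an odd modulus, 0 <= m < modulus, and e >= 1.
    """
    if modulus % 2 == 0 or e < 1:
        raise ValueError("modulus must be odd and e must be >= 1")
    n = modulus.bit_length()
    steps = n + 2
    const = pow(2, 2 * steps, modulus)        # C = 2^(2(n+2)) mod N
    m_prime = mm(m, const, modulus, steps)    # pre-processing: M * 2^(n+2)
    r = m_prime
    k = e.bit_length()
    for i in range(k - 1):
        r = mm(r, r, modulus, steps)          # square
        if (e >> (k - i - 2)) & 1:
            r = mm(r, m_prime, modulus, steps)  # multiply when the bit is 1
    return mm(r, 1, modulus, steps)           # post-processing removes 2^(n+2)


if __name__ == "__main__":
    print(mod_exp(65, 17, 3233), pow(65, 17, 3233))
```

Intermediate values stay below $2^{n+1}$ in this sketch, so no comparison with the modulus is needed between multiplications; only the final call with operand 1 brings the value into range.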

This algorithm takes about the same number of cycles as Montgomery’s algorithm [3], [6] applied to modular exponentiation, but needs less time because of its shorter critical path. The number of modular multiplications is $\lceil \log_2 E \rceil + v(E) - 2$, where $v(E)$ is the number of nonzero bits in E. So, for n-bit RSA modular exponentiation with equal probability for 0 and 1, the number of modular multiplications is (2n + 2) in the worst case and (1.5n + 2) in the average case. Algorithm 2 takes (2n + 2) × n clock cycles, fewer than the (2n + 2) × 2n cycles that [10] and [11] need to complete a modular exponentiation in the worst case. Since the cycle time is equal to that in [10] and [11], our algorithm takes less time to complete RSA operations and has higher throughput.

Fig. 1. Architecture of the 512-bit RSA processor.

III. HARDWARE DESIGN AND REALIZATION

A. Hardware Design

Fig. 1 shows the architecture of a 512-bit RSA processor based on the modified algorithm. We use four 512-bit linear shift registers to store the operands needed in computing a 512-bit RSA operation ($M^E \bmod N$). The operation of the RSA processor is described below. In the initial stage, the RSA operands are loaded into the shift registers serially through an 8-bit input buffer. While loading the message M into the text register, we shift the exponent register until the first nonzero bit is in the most significant position, and we count the number of bits of the exponent, $\lceil \log_2 E \rceil$. After the initial stage, we start the multiplier. Once the first output bit of the multiplier is ready, we start the Montgomery module immediately. The execution times of the CPA, the multiplier, and the Montgomery module therefore overlap almost completely, and the function units of our design are fully utilized during computation.

1) Carry-Propagation Adder and Serial-Parallel Multiplier: The carry-propagation adder converts the carry-save output of the Montgomery module to nonredundant binary form. It generates one output bit per cycle for the serial-parallel multiplier to use in the next iteration. The serial-parallel multiplier shown in Fig. 2 performs the multiplication and squaring of two (n+1)-bit numbers. It first generates the n+2 lower bits of a product serially for the Montgomery module, then stops and holds the n higher bits of the product. The n higher bits of the product are added to the output of the Montgomery module to obtain the modular multiplication result.

The multiplier itself is a linear-array multiplier with a special input circuit. The linear array shown in Fig. 2 (neglecting the AND array) is a direct systolic implementation. When the multiplier is generating a product of two numbers, the parallel input M′ is ready in the text register and the other operand R[i] can arrive serially. However, if we want to square a number, a serial input of the operand would make the multiplier fail. We solved this problem by scheduling the serial input operand R[i] and inserting zeros so that the squaring operation does not fail, as shown in Fig. 3.

Fig. 2. Architecture and timing of linear array multiplier.

Fig. 3. The input data sequence for the squaring operation.

2) Montgomery Module: The Montgomery module shown in Fig. 4 performs the modular reduction by repeating the following procedure: $q_i = (P[i] + c_i) \bmod 2$; $P[i+1] = (P[i] + q_i \times N + c_i)/2$. Here $C_0 = (c_{n+1} c_n c_{n-1} \cdots c_1 c_0)_2$ is the n + 2 lower bits of the product from the multiplier. $C_0$ enters the Montgomery module serially, one bit per cycle, from the least significant bit to the most significant bit. The reduction step is a shift-and-add step that is very similar to the basic step of a multiplication. The quotient determination is a parity decision on the sum of the intermediate result and the incoming bit. This can be done simply by an exclusive-OR gate whose inputs are $c_i$ and the LSB of the intermediate result from the previous iteration. After n + 2 iterations, the Montgomery module adds $P[n+2]$ and the n higher bits of the product from the multiplier. The result is then sent to the carry-propagation adder for the next modular multiplication iteration.
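The bit-serial behavior of the reduction step can be modeled as below. This is a purely behavioral Python sketch with a hypothetical function name; it ignores the carry-save representation used in the actual circuit and only illustrates that the quotient bit is the exclusive-OR of the incoming product bit and the LSB of the running result.

```python
from typing import Iterable


def montgomery_module(c0_bits: Iterable[int], modulus: int) -> int:
    """Consume the low product bits LSB-first and return the reduced value P.

    Behavioral model only (assumed odd modulus); not a gate-level description.
    """
    p = 0
    for c_i in c0_bits:
        q_i = c_i ^ (p & 1)                 # XOR gate: parity of P[i] + c_i
        p = (p + q_i * modulus + c_i) >> 1  # shift-and-add step
    return p


if __name__ == "__main__":
    c0 = 0b1011001101                       # ten low-order product bits
    bits = [(c0 >> i) & 1 for i in range(10)]
    print(montgomery_module(bits, 239))
```

Because the modulus is odd, adding it exactly when the XOR output is 1 always makes the sum even, so the right shift never discards information.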

3) Hardware Cost and Performance: The total number of clock cycles for one 512-bit RSA operation in the processor is

$4 \times 512 + 519 \times 2 + (\lceil \log_2 E \rceil + v(E)) \times 519$

where $v(E)$ is the number of 1 bits in the exponent. It takes about 0.39 M clock cycles in the average case (0 and 1 equally probable) or 0.54 M clock cycles in the worst case. The hardware cost analysis of the design (excluding control and I/O buffers) is listed in Table I. The controller, which uses 11 states for the counter-based finite state machine, is quite small compared with the other parts, as the final layout shows.
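The cycle-count expression can be evaluated directly, as in the Python sketch below. The function names are hypothetical, and the 125-MHz default clock is taken from the simulation result quoted in the abstract. For a 512-bit exponent the expression gives roughly 0.40 M cycles with 256 one bits and roughly 0.53 M cycles with 512 one bits, close to the rounded figures quoted above.

```python
def rsa_clock_cycles(exponent_bits: int, one_bits: int) -> int:
    """Evaluate 4*512 + 519*2 + (ceil(log2 E) + v(E)) * 519.

    exponent_bits stands for ceil(log2 E) and one_bits for v(E).
    """
    return 4 * 512 + 519 * 2 + (exponent_bits + one_bits) * 519


def throughput_bits_per_second(cycles: int, clock_hz: float = 125e6,
                               block_bits: int = 512) -> float:
    """Bits processed per second for one block, assuming a fixed clock rate."""
    return block_bits * clock_hz / cycles


if __name__ == "__main__":
    for label, ones in (("average", 256), ("worst", 512)):
        cycles = rsa_clock_cycles(512, ones)
        rate = throughput_bits_per_second(cycles)
        print(f"{label}: {cycles} cycles, {rate / 1e3:.0f} kbit/s")
```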

4) A 512-Bit Single-Chip RSA Processor: Fig. 5 shows the layout of the 512-bit RSA chip. We partition the 512 bits into eight main blocks that were described in previous sections. The data-signal interconnections are all local, connecting only neighboring blocks. This design tries to minimize the effects of the global signal lines. We use a buffered tree structure, resembling an H-tree, to drive all global signals, including control signals and clock lines, so the gate delay penalty caused by propagating the global signals is limited to 1.5 ns. To include the effect of wire loading, we use the Compass ISM (input slope model) delay model to estimate the critical path delay as 6.06 ns, which includes the gate delay and the wire-loading delay. For a more conservative estimate, we add an extra 30% delay to the ISM result, so the total delay is 6.06 × 1.3 ≈ 7.9 ns, which we round up to 8 ns. The chip can therefore operate at a clock rate of up to 125 MHz. (The 30% extra delay is based on our empirical experience with the ISM model and physical measurement results.) The main technical characteristics of the chip are listed in Table II.

Fig. 4. The circuit of the Montgomery module.

TABLE I

HARDWARE COST ANALYSIS OF OUR DESIGN FOR THE n-BIT RSA ALGORITHM

Some RSA chips presented so far are listed in Table III, which also gives cost and performance comparisons with our design. All the data are for the worst-case scenario. This table does not include data for [5], [6], [8], and [9] since no detailed chip information is available. The highest baud rate has been achieved in [7] by incorporating the Chinese Remainder Theorem (CRT) to gain a 4× speedup. Our design can also incorporate the CRT to gain such a speedup. However, the CRT is only suitable for users who know the factorization of the modulus N. In [12], the number of clock cycles needed in a 512-bit RSA operation has been greatly reduced by using the radix-32 technique. However, the critical path also increases, so the clock rate is lower. In our design, a higher clock rate can be applied because the critical path delay is shorter than in the other designs. Therefore, our design is the fastest chip, excluding the design in [7].

TABLE II

FEATURES OF THE RSA PROCESSOR

IV. CONCLUSION

In this paper, we propose a scheme based on Montgomery’s algorithm to implement the most widely used RSA cryptosystem. The modified Montgomery’s algorithm efficiently reduces the critical path and the number of operation cycles to increase the throughput rate. This scheme avoids the final adjustment of the residue and saves the data format conversion time. A hardware design of a 512-bit RSA processor was also proposed and implemented based on this algorithm. The processor has high utilization and a regular structure that can be cascaded for larger operand sizes. The shortcoming of this design is that the global signals span 512 cells. A conservative analysis has included wire loading to estimate their effects. The processor takes about 0.54 M clock cycles to finish a 512-bit RSA encryption (decryption) and delivers a baud rate of 118 kbit/s at 125 MHz in the worst case. The variable encoding and decoding time for a 512-bit RSA operation may not be desirable in some system designs but can easily be accommodated in asynchronous system designs. Besides, since we do not always operate on all 512 bits, this design can save power.

Fig. 5. Layout of the RSA processor.

TABLE III

SOME RSA CHIPS PRESENTED SO FAR

APPENDIX

The following is to verify that our modified Montgomery algorithm is modulo equivalent to the original Montgomery algorithm:

Verification:

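In the for loop, by induction:

$P[i+1] \cdot 2^{i+1} = \sum_{j=0}^{i} (q_j \times N) \cdot 2^j + \sum_{j=0}^{i} c_j \cdot 2^j$

so, $P[n] < N + 1$ and

$P[n] \cdot 2^n = \sum_{j=0}^{n-1} (q_j \times N) \cdot 2^j + C_0 = Q \cdot N + C_0$.

Since $R = P[n] + C_1$, then

$R \cdot 2^n = P[n] \cdot 2^n + C_1 \cdot 2^n = Q \cdot N + C_0 + C_1 \cdot 2^n = Q \cdot N + A \cdot B$.

Hence, $R \equiv A \cdot B \cdot 2^{-n} \pmod{N}$, with $R < C_1 + N + 1$.

REFERENCES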

[1] A. Bhimani, “Securing the commercial internet,” Commun. ACM, vol. 39, no. 6, pp. 29–35, June 1996.

[2] R. L. Rivest, A. Shamir, and L. Adleman, “A method for obtaining digital signature and public-key cryptosystems,” Commun. ACM, vol. 21, pp. 120–126, Feb. 1978.

[3] P. L. Montgomery, “Modular multiplication without trial division,” Math. Comput., vol. 44, pp. 519–521, Apr. 1985.

[4] C. K. Koc, T. Acar, and B. S. Kaliski, Jr., “Analyzing and comparing Montgomery multiplication algorithms,” IEEE Micro. Chip, Systems, Software and Applications, pp. 26–33, June 1996.

[5] H. Orup, “Simplifying quotient determination in high-radix modular multiplication,” in Proc. 12th Symp. on Computer Arithmetic, July 1995, pp. 193–199.

[6] S. E. Eldridge and C. D. Walter, “Hardware implementation of Montgomery’s modular multiplication algorithm,” IEEE Trans. Comput., vol. 42, pp. 693–699, June 1993.

[7] M. Shand and J. Vuillemin, “Fast implementations of RSA cryptography,” in Proc. 11th Symp. on Computer Arithmetic, 1993, pp. 252–259.


Analog Implementation of Fast Min/Max Filtering

S. Siskos, S. Vlassis, and I. Pitas

Abstract—An analog implementation of running min/max filters based on current-mode techniques is presented in this brief. Switched-current delay cells and switched-current/voltage two-input min/max selectors are used for current or voltage inputs, respectively. The voltage two-input min/max circuit is designed using current conveyors, and a modified structure of it is used to implement the running min/max filter for window size n = 8. Simulation results demonstrate the feasibility of the proposed implementation, which can be extended to larger window sizes.

Index Terms—Min/max filters, mixed analog–digital integrated circuits, nonlinear filters, running filters.

I. INTRODUCTION

In recent years, the use of nonlinear filters has grown strongly because of their ability to cope with system nonlinearities, non-Gaussian noise environments, and sensor and perceptual system nonlinearities [1]. One of the most frequently used classes of nonlinear filters is based on order statistics [2]. Let us suppose that the input samples in the filter window are denoted by $x_1, x_2, \ldots, x_n$. If we order them according to their magnitude, we get their order statistics: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. The minimal input sample is $x_{(1)}$ and the maximal input sample is $x_{(n)}$. The ith-order sample is denoted by $x_{(i)}$, $1 \le i \le n$. The median of the input samples is $x_{(v+1)}$, where n = 2v + 1. Max/min filtering and median filtering are very frequently used in digital signal and image processing. In particular, maximum and minimum filtering are directly linked to the gray-scale mathematical morphology operations dilation and erosion, respectively [3]. Dilation/erosion is essentially a maximum or minimum operation, respectively, on the samples within the filter window. Both dilation and erosion have numerous applications, particularly in digital image filtering, edge detection, region segmentation, and shape analysis. In the following, we concentrate on proposing digital filter architectures that are suitable for max/min filtering and that are easily implemented in a hybrid (analog/digital) way. The motivation for building these architectures is to construct extremely fast, simple, and affordable filters that operate directly on the analog signal and can be easily incorporated into smart sensors as well as smart cameras. The proposed architectures are essentially suited to one-dimensional signal filtering (e.g., sound, ECG/EEG, measurements). However, owing to the separability property of max/min filtering, the same architectures and their implementation can be used for two-dimensional signal (image) processing by applying them along image rows and columns independently.

Manuscript received February 7, 1997; revised August 8, 1997. Paper recommended by Associate Editor L. A. Akers.

S. Siskos and S. Vlassis are with the Laboratory of Electronics, Department of Physics, Aristotle University of Thessaloniki, 54006 Thessaloniki, Greece. I. Pitas is with the Department of Informatics, Aristotle University of Thessaloniki, 54006 Thessaloniki, Greece.

Publisher Item Identifier S 1057-7130(98)05049-6.

Fig. 1. The max calculation flow diagram for n = 8.

II. FAST STRUCTURES FOR RUNNING MAX/MIN FILTERING

The problem of running max/min filtering can be formulated as follows. Let $x_i$, $i \in \mathbb{Z}$, be a one-dimensional signal. The output $y_i$, $i \in \mathbb{Z}$, of a max or min filter is given by

$y_i = T(x_i, \ldots, x_{i-n+1})$ (1)

where n is the filter length (window size) and T is the max or min operator, respectively. Equation (1) is called “running” max or min filtering because after each output calculation, the filter window is shifted one position to the right (i.e., it “runs”).

The computational complexity, measured in number of comparisons per output point, is C(n) = n − 1. It is desirable to construct filter structures that have a smaller number of comparisons per output point in order to speed up the filtering process. This is accomplished by employing the “divide-and-conquer” strategy.
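A minimal Python sketch of the divide-and-conquer idea for the max case is given here. It is a software illustration only, not the analog circuit: the function name `running_max` is hypothetical, the window size is assumed to be a power of two, and each stage reuses the partial maxima of the previous stage so that one comparison per output point is made per stage.

```python
from typing import List, Sequence


def running_max(x: Sequence[float], n: int) -> List[float]:
    """Running max over windows of length n (n a power of two).

    Returns one output per full window, i.e. for i = n-1 .. len(x)-1.
    """
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("window size must be a power of two")
    y = list(x)        # stage 0: maxima over windows of length 1
    span = 1
    while span < n:
        # Combine two adjacent length-`span` windows with one comparison each;
        # positions near the start keep their clipped (partial) window.
        y = [y[i] if i < span else max(y[i], y[i - span])
             for i in range(len(y))]
        span *= 2
    return y[n - 1:]


if __name__ == "__main__":
    print(running_max([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8], 4))
```

Replacing `max` with `min` gives the corresponding running min filter.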

Let us suppose that the filter window size n is a power of two: $n = 2^k$. It is easily seen that the max or min calculation of n numbers can be split into the max or min calculation of two subsequences of length n/2 each:

$y_i = T(x_i, \ldots, x_{i-n+1}) = T[T(x_i, \ldots, x_{i-(n/2)+1}), T(x_{i-(n/2)}, \ldots, x_{i-n+1})]$. (2)

This procedure can be repeated recursively until we reach subsequences of length 2 [4]. In this case, the max or min calculation of two numbers is done by one comparison only. The corresponding flow diagram is shown in Fig. 1 for n = 8. Each dot corresponds to one comparison $T[\cdot, \cdot]$. The flow diagram has $\log_2 n$ stages. Only one extra comparison per output point is needed at each stage. Therefore, the computational complexity of this structure is reduced to $C(n) = \log_2 n$, which is much less than the complexity n − 1