A 1.5V CMOS High-Speed 16-bit÷8-bit Divider Using the Quotient-Select Architecture and True-Single-Phase Bootstrapped Dynamic Circuit Techniques Suitable for Low-Voltage VLSI
C. C. Yeh, J. H. Lou and J. B. Kuo
Rm. 338, Dept. of Electrical Eng., National Taiwan University, Roosevelt Rd., Sec. 4, #1, Taipei, Taiwan 106-17

Abstract-This paper reports a 1.5V high-speed 16-bit÷8-bit divider circuit using the quotient-select architecture and true-single-phase bootstrapped dynamic circuit techniques. Based on a 0.8µm CMOS technology, the speed performance of this 16-bit÷8-bit divider circuit is improved by 45% as compared to the divider using the non-restoring iterative architecture and domino dynamic logic circuits without the bootstrapped technique.
I. INTRODUCTION
Division is an important function in a CPU arithmetic unit. Enhancing the speed of the divider circuit is critical in raising the speed of a VLSI CPU [1]-[3]. Fig. 1 shows the block diagram of a 16-bit÷8-bit divider circuit using a non-restoring iterative architecture [1]. The 16-bit dividend and the 8-bit divisor are assumed to be positive and smaller than 1. As in a standard binary division operation, successively right-shifted values of the divisor are subtracted from or added to the dividend.

For next-generation deep-submicron CMOS VLSI technology, a low supply voltage is the trend. For sub-0.1µm CMOS technology, 1.5V is necessary. At a supply voltage of 1.5V, CMOS dynamic logic circuits such as NORA [4], domino, and Zipper are faster than CMOS static ones as a result of reduced internal parasitic capacitances. However, when the serial fan-in is large, the associated propagation delay may increase drastically, which is especially serious at a low supply voltage. In this paper, a 1.5V high-speed CMOS bootstrapped 16-bit÷8-bit divider is reported, which combines 1.5V CMOS dynamic logic circuits using the bootstrapper technique with the quotient-select architecture. It will be shown that the speed performance of this 16-bit÷8-bit divider circuit is improved by 45% as compared to the divider using the non-restoring iterative architecture and domino dynamic logic circuits without the bootstrapped technique.
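The add-or-subtract recurrence of non-restoring division can be sketched in a few lines of Python. This is a minimal behavioural sketch, not the circuit of Fig. 1: it assumes unsigned integer operands rather than fractions smaller than 1, and the function name `nonrestoring_divide` and its parameters are illustrative only.

```python
def nonrestoring_divide(dividend, divisor, quotient_bits):
    """Behavioural model of non-restoring division (integer operands assumed).

    Requires 0 <= dividend < divisor * 2**quotient_bits so the quotient fits.
    """
    if divisor <= 0 or not 0 <= dividend < (divisor << quotient_bits):
        raise ValueError("operands out of range for this sketch")

    remainder = dividend
    quotient = 0
    for shift in range(quotient_bits - 1, -1, -1):
        # Subtract the shifted divisor if the partial remainder is
        # non-negative, otherwise add it; nothing is restored in between.
        if remainder >= 0:
            remainder -= divisor << shift
        else:
            remainder += divisor << shift
        # The quotient bit is 1 when the new partial remainder is non-negative.
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)

    # Final correction so that the returned remainder is non-negative.
    if remainder < 0:
        remainder += divisor
    return quotient, remainder


print(nonrestoring_divide(1000, 7, 9))  # (142, 6), since 1000 = 142 * 7 + 6
```

Each loop iteration needs the sign of the previous partial remainder before it can start, which is the serial dependency that limits the speed of the iterative architecture.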
This work is supported under R.O.C. National Science Council Contracts #84-2622-E002-008, 011 & #85-2215-E0002-024.
Fig. 1. The architecture of a 16-bit÷8-bit non-restoring iterative divider.
II. CONVENTIONAL NON-RESTORING ITERATIVE DIVIDER
In a conventional 16-bit÷8-bit non-restoring iterative divider, eight 'quotient rows' are required. In each quotient row, there are A (adder) cells, an S (sign) cell, and a CLA (carry look-ahead) cell. Each A cell is composed of a full adder and a control signal implemented by an EXCLUSIVE-OR gate. The control signal decides whether an add or a subtract is to be performed. In the jth A cell in the (i-1)th row, using a full adder, the sum and carry signals at the output are related to the sum and carry signals of the previous row by Eqs. (1) and (2).

Fig. 2. The functional blocks of the ith quotient row in the 16-bit÷8-bit non-restoring iterative divider.
In these equations, $\overline{Q}_{i-2}$ is the complement of the (i-2)th quotient bit, and $\overline{D}_j$ is the complement of the jth divisor bit. In addition, the propagate and generate signals for the carry look-ahead circuit are produced:

$$P_{i,j} = [(\overline{Q}_{i-2} \oplus \overline{D}_j) \oplus S_{i-1,j+1} \oplus C_{i-1,j+2}] + C_{i,j+1}, \tag{3}$$

$$G_{i,j} = [(\overline{Q}_{i-2} \oplus \overline{D}_j) \oplus S_{i-1,j+1} \oplus C_{i-1,j+2}] \cdot C_{i,j+1}. \tag{4}$$

The S cell, which contains the sum portion of the A cell, is used to compute the sign for each row:

$$S_i = (\overline{Q}_{i-2} \oplus S_{i-1,1} \oplus C_{i-1,2}) \oplus C_{i,1}. \tag{5}$$

Eqs. (1)-(4) are applicable for the cells not at the top and left boundaries ($2 \le i \le b_q + 1$ and $1 \le j \le b_d - 1$). For the cells at the top and left boundaries ($i = 1$ or $j = b_d$), where $b_q$ is the quotient bit number and $b_d$ is the divisor bit number, Eqs. (1)-(4) should be modified to include appropriate boundary conditions as shown in Fig. 1. Note that $N_i$ is the ith dividend bit.

As shown in Fig. 2, the CLA cell is used to compute the final carry signal according to the following formula:

$$Y_{i,j} = G_{i,j} + P_{i,j} Y_{i,j+1}, \quad j = 1, \ldots, b_d, \tag{6}$$

$$LA_i = Y_{i,1}. \tag{7}$$
Using an EXCLUSIVE-OR logic gate, the ith quotient bit at the output is low if either output from the S cell or the CLA cell is 1.
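The carry look-ahead recurrence of the CLA cell, Y(i,j) = G(i,j) + P(i,j)·Y(i,j+1) with LA(i) = Y(i,1), can be evaluated bit by bit as in the following sketch. It is only an illustration: the function name `lookahead_carry` is hypothetical, the lists `p` and `g` are assumed to hold the propagate and generate bits of one row for j = 1, ..., b_d in that order, and the value entering at j = b_d + 1 (`carry_in`) is an assumed boundary condition, since the paper defers boundary conditions to Fig. 1.

```python
def lookahead_carry(p, g, carry_in=0):
    """Return LA_i = Y_{i,1} for one quotient row.

    p[k] and g[k] hold P_{i,j} and G_{i,j} for j = k + 1 (so index 0 is j = 1).
    carry_in models Y_{i,b_d+1}; its value is an assumption of this sketch.
    """
    if len(p) != len(g):
        raise ValueError("p and g must have the same length")
    y = carry_in
    # Walk from j = b_d down to j = 1, applying Y_j = G_j + P_j * Y_{j+1}.
    for p_j, g_j in zip(reversed(p), reversed(g)):
        y = g_j | (p_j & y)
    return y


# A carry generated at j = 3 propagates through j = 2 and j = 1.
print(lookahead_carry([1, 1, 0], [0, 0, 1]))  # 1
# The same carry is blocked when the propagate bit at j = 2 is 0.
print(lookahead_carry([1, 0, 1], [0, 0, 1]))  # 0
```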
The speed performance of a non-restoring iterative divider is determined by the speed of the propagate and generate signals of the A cells, the delay time of the LAi signal in the CLA cell, and the speed of producing the quotient bit Qi at the XOR gate in the quotient row. After the quotient bit of a quotient row (Qi) is produced, its value is transferred to the next row. Then, the quotient bit of the next row (Qi+1) is computed. No quotient bit of the next row can be computed until the quotient bit of the previous row is obtained.

III. THE PARALLEL-OUT QUOTIENT-SELECT DIVIDER

As shown in Fig. 3, in the 3-bit parallel-out quotient-select architecture, instead of waiting for the quotient bit from the previous row, three quotient blocks are used to produce the nine output quotient bits almost simultaneously. In each block, three levels of quotient rows are arranged. For example, under the quotient row of the first level (row 0), at the second level there are two quotient rows: a and b. At the third level, there are four quotient rows: w, x, y, and z. The second and the third levels of quotient rows are arranged so that the three output quotient bits Q0, Q1, and Q2 are produced simultaneously. In these seven quotient rows in three levels, the input quotient bits of rows 0, a, b, w, x, y, and z are designated as 1, 1, 0, 1, 0, 1, and 0, respectively.

Fig. 3. Block diagram of the 1.5V 16-bit÷8-bit divider using the quotient-select architecture and true-single-phase bootstrapped dynamic circuit techniques.
Fig. 4. The 1.5V CMOS bootstrapped dynamic logic circuits including the CMOS bootstrapper circuit. (Panels: bootstrapped OR gate, Output = A + B; bootstrapped AND gate, Output = A·B.)
The seven individual output quotient bits of the rows in the three levels are Q0, Q0a, Q0b, Q0w, Q0x, Q0y, and Q0z, respectively. The inputs to the first block are the dividend bits N0-N8. The output quotient bit of row 0, Q0, may be 0 or 1. The sum and carry signals produced by row 0 are transferred to rows a and b such that Q0a and Q0b can be computed immediately without waiting for the generation of Q0. Another dividend bit, N9, is used as the input to both quotient rows a and b. Then, the output quotient bit Q1 is selected as either Q0a or Q0b by the multiplexer (MUX), depending on Q0. If Q0 is 1, MUX outputs Q0a as Q1. If Q0 is 0, MUX outputs Q0b as Q1. The outputs of the second-level quotient rows, Sa1-Sa8, Ca2-Ca8 and Sb1-Sb8, Cb2-Cb8, are used as inputs to the third level. In addition, another dividend bit, N10, is used as an input to the third level.
The output quotient bit Q2 of the third level is determined by a decision criterion similar to that of the second level. Under the third level of the quotient rows, a multiplexer is used to select the sum and carry outputs of the first block (the sum signals through S38 and the carry signals C31-C38). The second block uses the outputs from the first block to generate the output quotient bits Q3, Q4, and Q5. The third block generates the output quotient bits Q6, Q7, and Q8.
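The selection scheme described in this section can be modelled behaviourally as follows. The sketch evaluates the seven rows of a block (0, a, b, w, x, y, z) with their designated input quotient bits 1, 1, 0, 1, 0, 1, 0 and then lets multiplexers pick the valid candidates. It is an illustration under assumptions, not the authors' circuit: operands are unsigned integers, the carry-save S/C signals are collapsed into a single partial remainder, row 0 of the second and third blocks is assumed to take the last quotient bit of the preceding block as its input, and all function names are hypothetical.

```python
def quotient_row(remainder, shifted_divisor, q_in):
    """One quotient row: subtract if the input quotient bit is 1, else add."""
    if q_in:
        remainder -= shifted_divisor
    else:
        remainder += shifted_divisor
    return remainder, int(remainder >= 0)


def quotient_select_block(remainder, divisor, top_shift, q_in):
    """Produce three quotient bits from seven rows arranged in three levels."""
    # Level 1: row 0.
    r0, q0 = quotient_row(remainder, divisor << top_shift, q_in)
    # Level 2: rows a and b assume Q0 = 1 and Q0 = 0; neither waits for Q0.
    ra, q0a = quotient_row(r0, divisor << (top_shift - 1), 1)
    rb, q0b = quotient_row(r0, divisor << (top_shift - 1), 0)
    # Level 3: rows w, x follow row a; rows y, z follow row b.
    rw, q0w = quotient_row(ra, divisor << (top_shift - 2), 1)
    rx, q0x = quotient_row(ra, divisor << (top_shift - 2), 0)
    ry, q0y = quotient_row(rb, divisor << (top_shift - 2), 1)
    rz, q0z = quotient_row(rb, divisor << (top_shift - 2), 0)
    # Multiplexers select the candidates that match the actual quotient bits.
    q1 = q0a if q0 else q0b
    if q0:
        r2, q2 = (rw, q0w) if q1 else (rx, q0x)
    else:
        r2, q2 = (ry, q0y) if q1 else (rz, q0z)
    return r2, (q0, q1, q2)


def quotient_select_divide(dividend, divisor, blocks=3):
    """Divide using `blocks` cascaded 3-bit parallel-out blocks."""
    bits = 3 * blocks
    if divisor <= 0 or not 0 <= dividend < (divisor << bits):
        raise ValueError("operands out of range for this sketch")
    remainder, q_in, quotient = dividend, 1, 0
    for block in range(blocks):
        top_shift = bits - 1 - 3 * block
        remainder, q_bits = quotient_select_block(remainder, divisor, top_shift, q_in)
        for q in q_bits:
            quotient = (quotient << 1) | q
        q_in = q_bits[-1]  # assumed hand-over between blocks
    if remainder < 0:
        remainder += divisor  # final correction of a negative remainder
    return quotient, remainder


print(quotient_select_divide(1000, 7))  # (142, 6)
```

In software the seven rows run one after another, so the model only shows that the speculative rows plus multiplexers give the same quotient as the row-by-row scheme; the speed benefit comes from evaluating the rows concurrently in hardware.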
The speed of the 16-bit÷8-bit divider with the quotient-select architecture is determined by the propagation delay of each of the three blocks shown in Fig. 3. The propagation delay of producing the output quotient bits Q0, Q1, and Q2 of the first block is mainly determined by the propagation delay of the sum and carry signals (S's, C's) associated with each quotient row in all three levels of the first block. Although there are three levels in the first block, producing Q1 and Q2 is not substantially slower than producing Q0, since the critical component of the propagation delay in producing the three output quotient bits Q0, Q1, and Q2 is in the adder circuit of the A cell. Therefore, Q0, Q1, and Q2 are generated at about the same speed. Similar situations exist for Q3, Q4, and Q5 in the second block and Q6, Q7, and Q8 in the third block.
As a result, the 16-bit÷8-bit divider with the quotient-select architecture is expected to be about three times faster than that with the conventional non-restoring iterative architecture.

Fig. 5. The 1.5V CMOS buffer circuit using the bootstrapper technique.

Fig. 4 shows the 1.5V CMOS bootstrapped dynamic logic circuits including the CMOS bootstrapper circuit. When CK is low, the bootstrapper circuit is in the precharge period. During the precharge period, the internal node (Vdo) is precharged to Vdd, and the output voltage Vout is predischarged to ground via MN. The bootstrap capacitor (Cb) is charged to Vdd: the left side is grounded and the right side is at Vb = Vdd. During the precharge period, the right side of the bootstrap capacitor is separated from the output since MPB is off. When CK turns high, the logic evaluation period begins. During the logic evaluation period, MPD, MP, and MN turn off, and the internal node voltage Vdo is determined by inputs A and B. If both A and B are high, Vdo is pulled low and VI is high. Owing to the charge in the bootstrap capacitor, Vb is bootstrapped to over Vdd (the internal voltage overshoot). In addition, as MPB turns on, Vout is pulled high to over Vdd. In the CLA cell, owing to the 1.5V bootstrapped CMOS dynamic logic circuit, the signal swing of the input signals (P's and G's) exceeds 2V. As a result, the switching speed of the CLA cell is enhanced.
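A first-order charge-conservation estimate illustrates why the bootstrapped output can exceed Vdd. It is a sketch under idealised assumptions that are not stated in the paper: the left plate of Cb is driven from 0 to Vdd, the only other capacitance on the bootstrapped node is a lumped load C_L (a hypothetical symbol) that was predischarged to ground, and MPB acts as an ideal switch.

```latex
\begin{align*}
  Q_{\text{precharge}} &= C_b\,(V_{dd} - 0) + C_L \cdot 0 = C_b V_{dd}, \\
  Q_{\text{evaluate}}  &= C_b\,(V_{out} - V_{dd}) + C_L V_{out}, \\
  Q_{\text{precharge}} = Q_{\text{evaluate}}
    \;&\Rightarrow\;
    V_{out} = \frac{2\,C_b}{C_b + C_L}\,V_{dd}.
\end{align*}
% Consequences under these assumptions:
%   V_out > V_dd             whenever C_b > C_L,
%   V_out > 2 V at V_dd=1.5V whenever C_b > 2 C_L.
```

Under these assumptions the overshoot grows with the ratio of Cb to the capacitance it must drive, and the output approaches 2Vdd = 3V only when Cb dominates.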
Fig. 5 shows the 1.5V CMOS buffer (B) circuit using the bootstrapper technique [5]. During the pull-up transient, the operation of the full-swing bootstrapped CMOS buffer circuit is divided into two periods with regard to the bootstrap capacitor Cbp: (1) the charge build-up period and (2) the bootstrap period. Prior to the pull-up transient, the input is at 0V and the output of the inverter, VI, is at 1.5V. Therefore, MN1b is off, together with a second NMOS device, and MN2b is on. At the output of the buffer, Vout is at 0V. On the other hand, MN2b and Cbp of the bootstrap segment are separated from MP2 and MP1 of the fundamental segment. As a result, the bootstrap capacitor Cbp holds a charge of 1.5Cbp coulombs. After the input ramp-up period, the right side of the bootstrap capacitor Cbp is disconnected from ground since the device that tied it to ground is now off. Instead, it is connected to the gate of MP1 since MN1b is on. Due to the voltage change at the left side of the bootstrap capacitor Cbp, the right side of Cbp changes to below 0V (the internal voltage undershoot). As a result, the output voltage can switch at a faster pace since the gate of MP1 is driven below 0V. The pull-down transient has a complementary configuration.

Fig. 6. Transient waveform of the 1.5V 16-bit÷8-bit divider using the quotient-select architecture and the true-single-phase bootstrapped dynamic circuit techniques. (Horizontal axis: time, ns.)

IV. RESULTS AND DISCUSSION
Fig. 6 shows the transient waveform of the 1.5V 16-bit÷8-bit divider using the quotient-select architecture and the true-single-phase bootstrapped dynamic circuit techniques. The load at the quotient bit output is 0.1pF. At a supply voltage of 1.5V, the propagation delay of the output quotient bit Q8 is 107ns for the 16-bit÷8-bit divider using the 3-bit parallel-out quotient-select architecture. Compared with the propagation delay of the divider using the conventional non-restoring iterative architecture (192ns), a speed enhancement of 1.8× has been reached, which is less than the expected 3×. This is because the quotient-select architecture is not fully "parallel-processing": the propagation delays of two consecutive quotient rows of a block differ by the delay of a full adder. In addition, the extra delay due to the multiplexer and the buffer also contributes to the reduction in the speed enhancement.
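The reported figures and the gap to the ideal 3× can be put into a short worked calculation. The first line uses only the two delays quoted above. The second part is a rough first-order model that is not taken from the paper: it assumes n sequential quotient rows of delay t_row in the conventional divider, and n/3 blocks in the quotient-select divider, each costing one row delay plus two full-adder delays, one multiplexer delay, and one buffer delay (t_FA, t_MUX, t_B are hypothetical symbols).

```latex
\begin{align*}
  \text{speed-up} &= \frac{192\,\text{ns}}{107\,\text{ns}} \approx 1.79, \qquad
  \text{delay reduction} = \frac{192 - 107}{192} \approx 44\%. \\[4pt]
  T_{\text{conv}} &\approx n\,t_{row}, \qquad
  T_{\text{qs}} \approx \frac{n}{3}\,\bigl(t_{row} + \delta\bigr), \qquad
  \delta = 2\,t_{FA} + t_{MUX} + t_{B}, \\
  \frac{T_{\text{conv}}}{T_{\text{qs}}} &\approx \frac{3}{1 + \delta / t_{row}}
  \;\;\Rightarrow\;\;
  \frac{\delta}{t_{row}} \approx \frac{3}{1.79} - 1 \approx 0.67 .
\end{align*}
```

Within this assumed model, the measured 1.8× would correspond to a per-block overhead of roughly two thirds of a row delay, which is consistent with the qualitative explanation given in the text.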
Fig. 7. Delay time vs. supply voltage (V) of the true-single-phase CMOS bootstrapped divider. (Plot labels: conventional divider; w/o bootstrapper circuit; 3-bit parallel-out quotient-select divider.)
V. CONCLUSION
In this paper, a 3-bit parallel-out quotient-select architecture has been studied. For a large-size divider system such as a 64-bit÷32-bit divider, an 8-bit parallel-out quotient-select architecture can be used to further enhance the speed performance. The more bits used in the parallel-out quotient-select structure, the more improvement in speed can be expected. However, a larger die area is also needed.
REFERENCES
[1] M. Cappa and V. C. Hamacher, "An Augmented Iterative Array for High-Speed Binary Division," IEEE Trans. Computers, Vol. C-22, No. 2, pp. 172-175, Feb. 1973.
[2] R. Stefanelli, "A Suggestion for a High-Speed Parallel Binary Divider," IEEE Trans. Computers, Vol. C-21, No. 1, pp. 42-55, Jan. 1972.
[3] A. Vandemeulebroecke, E. Vanzieleghem, T. Denayer, and P. G. A. Jespers, "A New Carry-Free Division Algorithm and its Applications to a Single-Chip 1024-b RSA Processor," IEEE J. Solid-State Circuits, Vol. 25, No. 3, pp. 748-756, June 1990.
[4] N. F. Goncalves and H. J. De Man, "NORA: A Racefree Dynamic CMOS Technique for Pipelined Logic Structures," IEEE J. Solid-State Circuits, Vol. 18, pp. 261-266, June 1983.
[5] J. H. Lou and J. B. Kuo, "A 1.5V Full-Swing Bootstrapped CMOS Large Capacitive-Load Driver Circuit Suitable for Low-Voltage CMOS VLSI," IEEE J. Solid-State Circuits, Vol. 32, No. 1, pp. 119-121, Jan. 1997.
Fig. 7 shows the delay time vs. supply voltage of the true-single-phase CMOS bootstrapped divider. As shown in the figure, regardless of the supply voltage, the 16-bit÷8-bit divider using the 3-bit parallel-out quotient-select architecture is consistently faster than the one using the conventional non-restoring iterative architecture.