A Low-Error and Area-Time Efficient Fixed-Width Booth Multiplier

Min-An Song, Lan-Da Van*, Ting-Chun Huang, and Sy-Yen Kuo

Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.

*Chip Implementation Center (CIC), National Applied Research Laboratories, Taiwan, R.O.C.

E-mail: sykuo@cc.ee.ntu.edu.tw and ldvan@cic.org.tw

Abstract

In this paper, we develop a new methodology for designing a lower-error and area-time efficient 2's-complement fixed-width Booth multiplier that receives two n-bit numbers and produces an n-bit product. By properly choosing the generalized index and binary thresholding, we derive a better error-compensation bias to reduce the truncation error. Since the proposed error-compensation bias is realizable, the constructed low-error fixed-width Booth multiplier is area-time efficient for VLSI implementation. Finally, we successfully apply the proposed fixed-width Booth multiplier to speech signal processing. The simulation results show that the performance is superior to that using the direct-truncation fixed-width Booth multiplier.

1. Introduction

In many digital signal processing (DSP) applications such as digital filters [6, 7] and wavelet transform, it is desirable to maintain a fixed-width output word through the arithmetic operations. So, a low-error fixed-width multiplier [5-8] that receives an n-bit multiplier as well as an n-bit multiplicand and produces an n-bit output product is the most important processing element for a digital computing engine. In view of the algorithm level, most fixed-width multipliers are based on either the Baugh-Wooley algorithm [1] or the Booth algorithm [2, 3].

The Baugh-Wooley based fixed-width multipliers [5-7] have been widely studied. King and Swartzlander [5] analyzed an adaptive error-compensation bias and proposed an n-bit fixed-width multiplier. In [6, 7], we generalized this kind of Baugh-Wooley based fixed-width multiplier by properly choosing the generalized index and binary thresholding. Thus, several lower-error and area-efficient fixed-width multipliers can be obtained. However, an area-time efficient fixed-width multiplier cannot be fully achieved by the Baugh-Wooley based fixed-width multipliers. Therefore, the fixed-width Booth algorithm is currently one of the research topics.

The Booth algorithm is widely used in the implementation of ASIC-oriented products because a higher-speed and smaller multiplier can be obtained. The modified Booth algorithm was proposed by MacSorley [3], in which a triplet of bits is scanned at a time. It is known that the recoding technique of the modified Booth algorithm has two main advantages. One is that almost half the partial products, compared to the Baugh-Wooley multiplier, can be saved. Hence, the number of rows of the subproduct array can be reduced by a factor of 2. The other is that, based on the first advantage, the critical delay time can be shorter than that of the Baugh-Wooley multiplier. Area saving of a fixed-width Booth multiplier can be achieved by directly truncating the n least significant product bits and reserving the n most significant product bits. With this method, significant truncation errors would be introduced since no error compensation is considered. Thus, the Booth-based scheme in [8] has been proposed to reduce the truncation error; however, the scheme in [8] lacks systematic analysis and derivation and is therefore not a systematic methodology.

In this paper, we are motivated to propose a systematic design methodology for low-error area-time efficient Booth multipliers. The methodology includes the following steps in order: 1) propose an error-compensation bias with a new binary thresholding; 2) simulate the value of K and the error performance of the proposed error-compensation bias using our generalized index, and then select the best index having lower error and satisfying the same value of K for small width n; and 3) construct a low-error Booth multiplier structure. Based on our methodology, the resulting error-compensation circuit can be easily realized without any area overhead under the limited truncation error.

The organization of this paper is as follows. The modified Booth algorithm is concisely reviewed in Section 2. In Section 3, we propose a better error-compensation bias and present the simulation results for small width n. The improved error-compensation bias can be mapped to a new structure. The performance of the proposed fixed-width Booth multiplier is described in Section 4. Finally, brief statements in Section 5 conclude the presentation.

2. Modified Booth Multiplier

Consider the multiplication of two 2's-complement integers with n-bit multiplicand A and n-bit multiplier B as

$$P = AB = \sum_{i=0}^{2n-1} P_i 2^i, \tag{1}$$

where $A = -a_{n-1}2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$, $B = -b_{n-1}2^{n-1} + \sum_{j=0}^{n-2} b_j 2^j$, and $P_i$ denotes the i-th output product bit. Note that $a_i$ and $b_j$ indicate data bits of the multiplicand and multiplier, respectively. Assume n is even; the n-bit multiplier B can then be rewritten as

$$B = \sum_{i=0}^{(n-2)/2} (b_{2i-1} + b_{2i} - 2b_{2i+1})\, 2^{2i}, \tag{2}$$

where $b_{-1} = 0$. Note that the terms in the bracket in Eq. (2) have values of {-2, -1, 0, 1, 2}. Each recoded value performs a certain operation on the multiplicand A, and then the multiple additions at each stage would be required in order to generate the correct partial product. It is worth mentioning that the operation of -A can be realized by the inversion of the multiplicand and the addition of '1' at the least significant bit, as illustrated in Table 1.

Substituting Eq. (2) into Eq. (1), we can obtain Eq. (3) as

$$P = AB = \sum_{i=0}^{(n-2)/2} (b_{2i-1} + b_{2i} - 2b_{2i+1})\, A\, 2^{2i} = \sum_{i=0}^{(n-2)/2} S_i, \tag{3}$$

where $S_i = (b_{2i-1} + b_{2i} - 2b_{2i+1})\, A\, 2^{2i}$, and it is known that the scanning of triplets begins from $b_{-1}$ to the MSB with one-bit overlapping. Thus, only n/2 partial-product rows need to be computed.

Table 1: Modified Booth Encoding Table

b_{2i+1}   b_{2i}   b_{2i-1}   Operation on A   ADD to LSB
0          0        0          0                0
0          0        1          +A               0
0          1        0          +A               0
0          1        1          +2A              0
1          0        0          -2A              1
1          0        1          -A               1
1          1        0          -A               1
1          1        1          0                0
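The recoding of Eq. (2) and the row sum of Eq. (3) can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the rows are modelled as plain integers rather than as the bit-level partial products used in the hardware.

```python
def booth_radix4_digits(b: int, n: int) -> list[int]:
    """Recode an n-bit 2's-complement multiplier b into n/2 digits in {-2,...,2} (Eq. 2)."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= b < (1 << (n - 1)):
        raise ValueError("b does not fit in n bits")
    pattern = b & ((1 << n) - 1)  # raw 2's-complement bit pattern of b

    def bit(k: int) -> int:
        # b_{-1} is defined as 0; all other bits come from the pattern.
        return 0 if k < 0 else (pattern >> k) & 1

    # Digit i is b_{2i-1} + b_{2i} - 2*b_{2i+1}, scanning overlapping triplets.
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1) for i in range(n // 2)]


def booth_multiply(a: int, b: int, n: int) -> int:
    """Sum the n/2 rows S_i = digit_i * A * 2^(2i) of Eq. (3)."""
    rows = [d * a * 4**i for i, d in enumerate(booth_radix4_digits(b, n))]
    return sum(rows)


# Usage: -3 is 1101 in 4 bits, which recodes to the digits [1, -1] (1 - 4 = -3).
print(booth_radix4_digits(-3, 4))
print(booth_multiply(7, -3, 4))
```

A negative digit corresponds to the rows of Table 1 whose "ADD to LSB" entry is 1, since -A and -2A are formed by inversion plus one.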

Fig. 1. Modified Booth partial-product diagram with sign-generate sign extension scheme for an 8 × 8 multiplier.

So as to simplify the representation of the bit-product of each row for the Booth algorithm, we define the following notation

$$S_i = S_{i,n-1}\,2^{2i+n-1} + S_{i,n-2}\,2^{2i+n-2} + \cdots + S_{i,0}\,2^{2i}, \tag{4}$$

where $S_{i,j}$ represents the bit product of the i-th row. Based on the conventional Booth arithmetic operations, sign extension of the partial products is required for each stage. However, the extended sign bits result in larger power and area. According to the sign-generate sign extension scheme [4], for an 8 by 8 multiplier, the sign of the final result can be expressed as

$$S = \Big(S_{0,7}\sum_{j=8}^{15} 2^j\Big)2^0 + \Big(S_{1,7}\sum_{j=8}^{13} 2^j\Big)2^2 + \Big(S_{2,7}\sum_{j=8}^{11} 2^j\Big)2^4 + \Big(S_{3,7}\sum_{j=8}^{9} 2^j\Big)2^6$$

$$= (2^9 + \bar{S}_{0,7}2^8) + (2^{11} + \bar{S}_{1,7}2^{10}) + (2^{13} + \bar{S}_{2,7}2^{12}) + (2^{15} + \bar{S}_{3,7}2^{14}) + 2^8, \tag{5}$$

where S is the final result sign. From Eq. (5), the partial products of the Booth algorithm only need to add two elements (1, $\bar{S}_{i,7}$) for each row and add an extra '1' in the $2^8$-weight position as shown in Fig. 1, where main and remain represent the main part and remain part of the least significant bit (LSB), respectively. Thus, the sign-generate sign extension scheme can remove many redundant full adders compared to the conventional sign extension method. The architecture of the Booth multiplier as shown in Fig. 2 consists of Booth encoders, selectors (sel), full adders (FA) and half adders (HA). The Booth encoder generates $Ctrl_i[0:2]$ signals to control the selector to choose -2A, -A, 0, +A or +2A.
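The two forms of the sign term in Eq. (5) agree once carries out of bit 15 are discarded, that is, modulo $2^{16}$. A minimal Python sketch of that check for the 8 × 8 case follows; the function names are hypothetical and each sign bit $S_{i,7}$ is passed in as a plain 0 or 1.

```python
from itertools import product

WIDTH = 16  # 2n product bits for the 8 x 8 example


def conventional_extension(signs):
    """Row i spans bits 2i..2i+7; its sign bit is copied into every bit above, up to bit 15."""
    total = 0
    for i, s in enumerate(signs):
        total += s * sum(1 << j for j in range(2 * i + 8, WIDTH))
    return total % (1 << WIDTH)


def sign_generate_extension(signs):
    """Per row: a constant 1 plus the complemented sign bit; plus one extra 1 at weight 2^8."""
    total = 1 << 8
    for i, s in enumerate(signs):
        total += (1 << (2 * i + 9)) + (1 - s) * (1 << (2 * i + 8))
    return total % (1 << WIDTH)


def schemes_agree():
    """Compare both forms for all 16 combinations of (S07, S17, S27, S37)."""
    return all(
        conventional_extension(s) == sign_generate_extension(s)
        for s in product((0, 1), repeat=4)
    )


print(schemes_agree())
```

The check only covers the sign-extension term; the data bits of each row are unaffected by the scheme.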

3. Design of Fixed-Width Booth Multiplier

The 2n-bit product for n by n 2's-complement multiplication can be divided into two sections as

$$P = AB = MP + LP. \tag{6}$$

Fig. 2. An 8 × 8 modified Booth multiplier using sign-generate sign extension scheme.

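The most accurate truncation product is given by

$$P \cong MP + \sigma_{Temp} \times 2^n, \tag{7}$$

$$\sigma_{Temp} = \left[ LP / 2^n \right]_r, \tag{8}$$

where $[\cdot]_r$ denotes the rounding operation.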

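Without loss of generality, for n = 8, Eq. (8) can be denoted as

$$\sigma_{Temp} = \Big[ \frac{1}{2}(S_{3,1} + S_{2,3} + S_{1,5} + S_{0,7}) + \frac{1}{2^2}(S_{3,0} + S_{2,2} + S_{1,4} + S_{0,6} + Ctrl_3[2]) + \cdots + \frac{1}{2^7} S_{0,1} + \frac{1}{2^8}(S_{0,0} + Ctrl_0[2]) \Big]_r. \tag{9}$$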

Then we define the following terms

$$E_{main} = S_{3,1} + S_{2,3} + S_{1,5} + S_{0,7}, \tag{10}$$

$$E_{remain} = \frac{1}{2}(S_{3,0} + S_{2,2} + S_{1,4} + S_{0,6} + Ctrl_3[2]) + \cdots + \frac{1}{2^7}(S_{0,0} + Ctrl_0[2]). \tag{11}$$

Thus, we can rewrite Eq. (9) as

$$\sigma_{Temp} = \Big[ \frac{1}{2}(E_{main} + E_{remain}) \Big]_r. \tag{12}$$
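As a rough illustration of Eqs. (9) to (12), the Python sketch below computes $E_{main}$, $E_{remain}$ and $\sigma_{Temp}$ from the bits in the truncated columns. Several things here are assumptions rather than part of the paper: the function names, the `columns` dictionary (column index to the list of bits in that column, with the $Ctrl_i[2]$ bits included), and the choice of round-half-up for $[\cdot]_r$.

```python
import math
from fractions import Fraction


def round_half_up(x: Fraction) -> int:
    """Assumed interpretation of [x]_r: round to nearest, ties upward."""
    return math.floor(x + Fraction(1, 2))


def sigma_temp(columns: dict[int, list[int]], n: int = 8):
    """Return (E_main, E_remain, sigma_Temp) for the low-order columns 0..n-1.

    columns[c] holds every bit (S_{i,j} and Ctrl_i[2]) of weight 2^c.
    """
    # E_main: bits in the most significant truncated column (weight 2^(n-1)).
    e_main = sum(columns.get(n - 1, []))
    # E_remain: column n-2 weighted 1/2, column n-3 weighted 1/4, ..., column 0 weighted 1/2^(n-1).
    e_remain = sum(
        (Fraction(sum(columns.get(c, [])), 1 << (n - 1 - c)) for c in range(n - 1)),
        Fraction(0),
    )
    sigma = round_half_up((e_main + e_remain) / 2)
    return e_main, e_remain, sigma


# Usage: column 7 holds (S31, S23, S15, S07); column 6 holds four S bits and Ctrl3[2].
example = {7: [1, 1, 0, 1], 6: [1, 0, 1, 0, 1]}
print(sigma_temp(example))
```

Exact fractions are used so that the rounding decision is not affected by floating-point error.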

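It is convenient to perform exhaustive simulation if we define the generalized index. Here the generalized index for 8 by 8 multipliers is defined as

$$\theta_{index}(q_3, q_2, q_1, q_0) = \langle S_{3,1} \rangle^{q_3} + \langle S_{2,3} \rangle^{q_2} + \langle S_{1,5} \rangle^{q_1} + \langle S_{0,7} \rangle^{q_0}, \tag{13}$$

where the binary parameters $q_3, q_2, q_1, q_0 \in \{0, 1\}$, and the operator

$$\langle T \rangle^{q} = \begin{cases} T, & \text{if } q = 0 \\ \bar{T}, & \text{if } q = 1 \end{cases} \tag{14}$$

in which $\bar{T}$ is the complement of binary T. Furthermore, $\theta_{index}(q_3, q_2, q_1, q_0)$ is referred to as $\theta_Q$, where

$$Q = q_3 \times 2^3 + q_2 \times 2^2 + q_1 \times 2^1 + q_0 \times 2^0. \tag{15}$$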
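The generalized index of Eqs. (13) to (15) reduces to a per-bit conditional complement. The Python sketch below assumes the reading of Eq. (14) in which a bit is kept when $q = 0$ and complemented when $q = 1$; the function name `theta` and the tuple ordering of the inputs are illustrative choices.

```python
def theta(main_bits, Q: int) -> int:
    """Generalized index theta_Q for main_bits = (S31, S23, S15, S07).

    Q = q3*8 + q2*4 + q1*2 + q0, as in Eq. (15).
    """
    if not 0 <= Q <= 15:
        raise ValueError("Q must be in 0..15")
    qs = [(Q >> k) & 1 for k in (3, 2, 1, 0)]  # (q3, q2, q1, q0)
    # <T>^q keeps T when q = 0 and complements it when q = 1, i.e. T XOR q.
    return sum(t ^ q for t, q in zip(main_bits, qs))


# Usage: with Q = 0 the index is simply the number of ones among the four bits.
print(theta((1, 0, 1, 1), 0))
print(theta((1, 0, 1, 1), 0b1000))
```

Enumerating Q from 0 to 15 with this function gives the sixteen candidate indices that the exhaustive simulation compares.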

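Note that Q has a range from 0 to 15. So as to evaluate the resulting performance, by applying Eqs. (13) to (15) to Eq. (12), we get

$$\sigma_{Temp} = \theta_Q + \Big[ \frac{1}{2}E_{main} - \theta_Q + \frac{1}{2}E_{remain} \Big]_r = \big( \langle S_{3,1} \rangle^{q_3} + \langle S_{2,3} \rangle^{q_2} + \langle S_{1,5} \rangle^{q_1} + \langle S_{0,7} \rangle^{q_0} \big) + [K]_r, \tag{16}$$

where

$$K = \frac{1}{2}E_{main} - \theta_Q + \frac{1}{2}E_{remain}. \tag{17}$$

Based on the binary thresholding concept in [6, 7], Eq. (16) can be approximated as

$$\sigma_{Temp} \cong \begin{cases} \langle S_{3,1} \rangle^{q_3} + \langle S_{2,3} \rangle^{q_2} + \langle S_{1,5} \rangle^{q_1} + \langle S_{0,7} \rangle^{q_0} + [K_1]_r, & \text{if } \theta_Q < 4 \\ \langle S_{3,1} \rangle^{q_3} + \langle S_{2,3} \rangle^{q_2} + \langle S_{1,5} \rangle^{q_1} + \langle S_{0,7} \rangle^{q_0} + [K_2]_r, & \text{if } \theta_Q = 4 \end{cases} \tag{18}$$

where $K_1$ and $K_2$ are respectively the average of K for those satisfying $\theta_Q < 4$ and $\theta_Q = 4$. In order to design a simple and realizable error-compensation circuit, we choose the indices which satisfy $[K_1]_r \in \{-1, 0, 1\}$ and $[K_2]_r \in \{-1, 0, 1\}$. By exhaustive search simulation, we obtain values of $K_1$ and $K_2$ as shown in Fig. 3(a) for n = 8.

In accordance with the rounding values of $K_1$ and $K_2$, we simulate the average error as shown in Fig. 3(b). Considering the goal of lower error and the restriction of K, we can find that the $\theta_{Q=0}$ index has better performance, as shown in Fig. 3(b), with $[K_1]_r = -1$ and $[K_2]_r = 0$ for n = 8. Thus, the simple error-compensation circuit can be described as

$$\sigma_{Q=0} = \begin{cases} S_{3,1} + S_{2,3} + S_{1,5} + S_{0,7} - 1, & \text{if } \theta_{Q=0} < 4 \\ S_{3,1} + S_{2,3} + S_{1,5} + S_{0,7} + 0, & \text{if } \theta_{Q=0} = 4 \end{cases} \tag{19}$$

where $\theta_{Q=0} = S_{3,1} + S_{2,3} + S_{1,5} + S_{0,7}$. Eq. (19) has been completely simulated for n ≤ 16 and can be mapped to a new structure as shown in Fig. 4. Note that the error-compensation circuit only needs three AND gates.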
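The binary thresholding of Eq. (18) selects one of two rounded constants depending on whether the index reaches its maximum. A minimal Python sketch is given below, under two assumptions: the operator of Eq. (14) is read as "complement when q = 1", and the default constants `k1 = -1` and `k2 = 0` are the rounded values stated in the text for the Q = 0 index at n = 8. The function names are hypothetical.

```python
def theta(main_bits, Q: int) -> int:
    """Generalized index: each bit is complemented when its q bit is 1 (assumed reading of Eq. 14)."""
    qs = [(Q >> k) & 1 for k in range(len(main_bits) - 1, -1, -1)]
    return sum(t ^ q for t, q in zip(main_bits, qs))


def compensation_bias(main_bits, Q: int = 0, k1: int = -1, k2: int = 0) -> int:
    """Binary-thresholded bias of Eq. (18): theta + [K2]_r if theta is maximal, else theta + [K1]_r."""
    th = theta(main_bits, Q)
    threshold = len(main_bits)  # 4 for the 8 x 8 multiplier
    return th + (k2 if th == threshold else k1)


# Usage with main_bits = (S31, S23, S15, S07) and the Q = 0 index.
print(compensation_bias((1, 1, 1, 1)))
print(compensation_bias((1, 0, 1, 1)))
```

Detecting the case where all four bits are 1 takes three two-input AND gates, which is in line with the gate count the text quotes for the compensation circuit.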

Fig. 3. (a) Values of K1 and K2 versus different Q of the binary thresholding. (b) Average errors by exhaustive search simulation versus different Q of the binary thresholding for n = 8.

Fig. 4. Proposed low-error fixed-width 8 × 8 Booth multiplier with

4. Performance and Application Discussion

In this section, we first simulate the error performance in terms of maximum error, average error, and variance of error, as listed in Table 2, for the direct-truncation multiplier and the proposed fixed-width Booth multiplier. It is clearly seen that the new structure can achieve better error performance than the direct-truncation Booth multiplier. Regarding the number of gates and the critical path delay time as listed in Table 3, comparison results show that the new Booth multiplier saves much area cost with respect to the full-precision Booth multiplier based on the sign-generate sign extension scheme. Most importantly, the gate count and critical delay time of the proposed structure are close to those of the direct-truncation multiplier. Thus, the proposed fixed-width Booth multiplier has the area-time efficient feature with better error performance.

On the other hand, we apply the proposed fixed-width multiplier to the 35-tap FIR filter for speech processing [9]. For convenience of comparison of various fixed-width multipliers, we take 1000 samples for the consonant part and vowel part of "Chicken" as shown in Fig. 5(a). We are concerned with whether the filtered waveform is accurate via our proposed fixed-width multiplier, so the correct standard output is required. We use the error-free output as a standard, which is used to compare the accuracy performances of fixed-width multipliers. Fig. 5(b) shows the standard filtering output signals and Fig. 5(c) shows the filtering output signals processed by the 35-tap low-pass FIR filter applying a direct-truncation fixed-width multiplier. Using the direct-truncation multiplier, it is seen from Fig. 5(c) that there are large average error and variance of errors in the consonant part. The smaller average error and variance of the errors, especially for the consonant part, are obtained by using our proposed fixed-width Booth multiplier as shown in Fig. 5(d).
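The comparison procedure described above (filter the same input with an error-free multiplier and with a fixed-width one, then compare the outputs) can be sketched in Python as follows. This is an illustration under stated assumptions, not the experiment of the paper: the function names are hypothetical, the truncating multiplier drops the low n bits of the full product at the value level (a hardware direct-truncation array omits the low-order partial-product bits before addition, so its errors are larger), and the error statistics are taken over absolute errors.

```python
from statistics import mean, pvariance


def fir_filter(samples, coeffs, multiply):
    """Direct-form FIR: y[k] = sum over t of multiply(x[k-t], h[t])."""
    out = []
    for k in range(len(samples)):
        acc = 0
        for t, c in enumerate(coeffs):
            if k - t >= 0:
                acc += multiply(samples[k - t], c)
        out.append(acc)
    return out


def exact_multiply(a: int, b: int) -> int:
    """Error-free reference multiplier."""
    return a * b


def direct_truncation_multiply(a: int, b: int, n: int = 8) -> int:
    """Value-level stand-in: zero the n least significant bits of the full product."""
    return ((a * b) >> n) << n


def error_stats(reference, approximate):
    """Return (maximum, average, variance) of the absolute output errors."""
    errors = [abs(r - x) for r, x in zip(reference, approximate)]
    return max(errors), mean(errors), pvariance(errors)


# Usage with made-up integer samples and taps (not the speech data or the 35-tap filter).
x = [100, -57, 23, 90, -128, 64]
h = [3, 5, 3]
ref = fir_filter(x, h, exact_multiply)
approx = fir_filter(x, h, direct_truncation_multiply)
print(error_stats(ref, approx))
```

Any other fixed-width multiplier model can be passed as the `multiply` argument to compare it against the same reference output.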

5. Conclusions

This paper develops a new methodology for designing a low-error and area-time efficient fixed-width Booth multiplier. By properly choosing binary thresholding and the generalized index, we can derive a better error-compensation bias to improve the truncation error. Furthermore, this error-compensation bias can be easily constructed as a lower-error fixed-width Booth multiplier. It is very suitable for VLSI digital signal processing applications where the accuracy, area, and speed issues are crucial. Finally, we successfully apply the proposed fixed-width multiplier to a digital FIR filter for a speech processing application.

Fig. 5. (a) Original input voice signal, (b) standard output voice signal, (c) output signals using the direct-truncation structure, and (d) output signals using the proposed Booth multiplier structure.

6. References

[1] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithms," IEEE Trans. Comput., vol. C-22, pp. 1045-1047, Dec. 1973.

[2] A. D. Booth, "A signed binary multiplication technique," Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, part 5, pp. 236-240, 1951.

[3] O. L. MacSorley, "High-speed arithmetic in binary computer," Proc. IRE, vol. 49, pp. 67-91, 1961.

[4] E. de Angel and E. E. Swartzlander, Jr., "Low power parallel multipliers," IEEE VLSI Signal Processing IX, 1996, pp. 199-208.

[5] E. E. Swartzlander, Jr., "Truncated multiplication with approximate rounding," in Proc. 33rd Asilomar Conference on Signals, Systems, and Computers, 1999, vol. 2, pp. 1480-1483.

[6] L. D. Van, S. S. Wang, and W. S. Feng, "Design of the lower-error fixed-width multiplier and its application," IEEE Trans. Circuits Syst. II, vol. 47, pp. 1112-1118, Oct. 2000.

[7] L. D. Van and S. H. Lee, "A generalized methodology for lower-error area-efficient fixed-width multipliers," in Proc. IEEE Int. Symp. Circuits Syst., May 2002, vol. 1, pp. 65-68, Phoenix, Arizona.

[8] S. J. Jou and H. H. Wang, "Fixed-width multiplier for DSP application," IEEE Int. Conf. Computer Design, Sep. 2000, pp. 318-322.

[9] M. E. Paul, and K. Bruce, C Language Algorithms for Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1991.

Table 2: Comparison Results of Three Kinds of Errors among Different Booth Multipliers

Multiplier                     Width   Maximum Error   Average Error   Variance of Error
Direct Truncation Multiplier   4       32              10.88           67.20
                               6       192             70.50           1465.86
                               8       1024            384.25          28510.19
                               16      524288          196608.25       3661149123
This Work                      4       16              4.59            28.50
                               6       85              21.60           716.86
                               8       443             103.12          16376.65
                               16      504268          62501.62        1486362529

Table 3: Comparison Results of Area and Critical Delay Time among Different Booth Multipliers for n = 8

Multiplier Area(# ofgatecounts) CriticalDelay Time

FA VIA Selector lFulllPrecisioni

Multiplier 28 12 36 1

37TrA

+

3TIjA

Based on Sign-Generate

Direct 1 8 16 7TFA + 3THA

Truncation Multiplier

This work 16 4 20 1

OTFA

+

TIM
