Chapter 1 Introduction
1.2 Thesis Organization
The rest of this thesis is organized as follows. The Booth multiplier, Baugh-Wooley multiplier, subword multiplication, and low-error fixed-width multiplier are briefly reviewed in Chapter 2. In Chapter 3, the proposed reconfigurable fixed-width multiplication engine with four configuration modes is presented. The comparison results in terms of area and power saving are presented in Chapter 4. For the FIR filter application, the error comparison and power-scalable performance using various fixed-width and full-precision multipliers are illustrated in the same chapter. Finally, brief concluding remarks close the thesis.
Chapter 2
Fundamental Concepts
In this chapter, we review the fundamental concepts: the Booth multiplier, the Baugh-Wooley multiplier, subword multiplication, and the low-error fixed-width multiplier.
2.1 Array Multipliers
In this thesis, we introduce two kinds of reconfigurable fixed-width multipliers. One is based on the Booth multiplier and the other on the Baugh-Wooley multiplier. Both are well-known multiplication algorithms widely used in digital signal processing. In the following, we briefly review these two algorithms.
2.1.1 Booth Multiplier
Considering two 2’s-complement integer operands, we can respectively represent an n-bit multiplicand X and an n-bit multiplier Y as follows.
\[ X = -x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^{i} \qquad (1) \]
\[ Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^{i} \qquad (2) \]
where xi, yi ∈ {0, 1}. The 2n-bit full-precision product PFP can be written as
additions at each stage are required to generate the product. Substituting (4) into (3), we obtain
(MSB) with a one-bit overlap. Table 2.1 lists the recoding rule and Fig. 2.1 shows the block diagram of the Booth recoding circuit. In Fig. 2.1, the Booth encoder generates three Booth recoding bits, neg, X1, and X2, for the Booth selector, which then selects the multiplicand inputs {xj, xj-1} to form the output partial products. In order to simplify the representation of each partial product, we define the following notation.
In 2’s-complement Booth arithmetic operations, sign extension of the partial products is required at each stage, but these extended sign bits lead to a large area and power overhead. The sign S of an n by n multiplier can be expressed as
Substituting (6) and (7) into (5), we obtain the partial-product array diagram for an nxn Booth multiplier as depicted in Fig. 2.2, where the notation w means that the n+w most significant columns of the partial products are kept for fixed-width multiplication. If w=n, the fixed-width multiplier becomes a full-precision multiplier. In this thesis, we reconfigure the fixed-width multiplication engine to generate several useful multipliers under limited hardware resources for DSP and image processing applications.
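As an illustration of the recoding reviewed above, the following minimal Python sketch applies the standard radix-4 (modified Booth) rule, in which each digit is formed from three multiplier bits with a one-bit overlap, d_i = -2*y_{2i+1} + y_{2i} + y_{2i-1} with y_{-1} = 0. The function names are hypothetical and chosen for this example only, and the sketch works at the digit level; it does not model the neg, X1, and X2 signals of Fig. 2.1.

```python
def to_bits(value: int, n: int) -> list:
    """Return the n-bit 2's-complement bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def booth_radix4_digits(y: int, n: int) -> list:
    """Recode an n-bit 2's-complement multiplier into n/2 digits in {-2, ..., 2}."""
    assert n % 2 == 0, "this sketch assumes an even word length"
    bits = to_bits(y, n)
    digits = []
    for i in range(n // 2):
        y_prev = bits[2 * i - 1] if i > 0 else 0  # overlap bit, y_{-1} = 0
        digits.append(-2 * bits[2 * i + 1] + bits[2 * i] + y_prev)
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the n/2 partial products d_i * x * 4^i."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_radix4_digits(y, n)))
```

For example, booth_radix4_digits(-3, 8) yields the digits [1, -1, 0, 0], since 1 - 4 = -3, so only n/2 partial products are needed instead of n.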
Table 2.1: Modified Booth recoding table
Column headings: y2i+1, y2i, y2i-1, yi, neg, X1, X2
Fig. 2.1. Block diagram of Booth recoding circuit.
Fig. 2.2. Modified Booth partial-product diagram with sign-generate sign extension scheme for an nxn multiplier.
2.1.2 Baugh-Wooley Multiplier
Considering two 2’s-complement integer operands, we can respectively represent an n-bit multiplicand X and an n-bit multiplier Y as in (1) and (2). The 2n-bit full-precision product PFP can be written as
sums partial-product bits corresponding to each weighting. The partial-product array for nxn 2’s-complement multiplication is depicted in Fig. 2.3, where the notation w means that the n+w most significant columns of the partial products are kept for fixed-width multiplication. If w=n, the fixed-width multiplier becomes a full-precision multiplier.
In this thesis, we reconfigure the fixed-width multiplication engine to generate four useful multipliers under limited hardware resources.
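To make the Baugh-Wooley idea concrete, the sketch below sums a partial-product array in which every term involving exactly one sign bit is complemented (a NAND cell instead of an AND cell) and two correction constants, 2^n and 2^(2n-1), are added. This is a commonly used formulation of the Baugh-Wooley array and is given here as an assumption; the exact cell arrangement of Fig. 2.3 may differ in detail. The function name is hypothetical.

```python
def baugh_wooley_product(x: int, y: int, n: int) -> int:
    """Multiply two n-bit 2's-complement integers with a Baugh-Wooley style array."""
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> j) & 1 for j in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            bit = xb[i] & yb[j]
            # A term with exactly one sign bit has negative weight, so its
            # complement is accumulated instead (NAND cell).
            if (i == n - 1) != (j == n - 1):
                bit ^= 1
            total += bit << (i + j)
    # Correction constants that absorb the complemented terms.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1  # keep the 2n-bit result
    # Reinterpret the 2n-bit pattern as a signed value.
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

Because every accumulated bit is non-negative, the array can be built from ordinary adders without sign extension, which is the property the fixed-width structures in this thesis rely on.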
Fig. 2.3. Partial-product array diagram for an nxn Baugh-Wooley multiplier.
2.2 Subword Multiplication
Many DSP and computing applications operate at lower resolution, where the data can be expressed in a half-word length [21-26]. Applying the subword multiplication scheme, we can partition an n-bit operand into two independent n/2-bit operands or four independent n/4-bit operands; hence, the subword multiplier can perform not only nxn full-precision multiplication but also two n/2xn/2 or four n/4xn/4 full-precision multiplications in parallel. Fig. 2.4 illustrates subword multiplication and the partial-product array distribution [21-26]. In Fig. 2.4(a), two n-bit operands, X and Y, are partitioned into two independent pairs of n/2-bit subwords, and the two pairs of n/2-bit subwords are then multiplied to produce two independent n-bit products, P1=X1Y1 and P0=X0Y0; the corresponding partial-product array distribution is shown in Fig. 2.4(b). Similarly, n/4xn/4 subword multiplication and its partial-product array distribution are illustrated in Fig. 2.4(c) and 2.4(d), respectively. To the best of our knowledge, the existing subword scheme has been applied only to full-precision multiplication based on the full-precision multiplier infrastructure. In the following sections, we extend this subword scheme to fixed-width and full-precision multiplication using the fixed-width prototype multiplier.
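The conventional subword scheme described above can be sketched behaviorally as follows. The sketch assumes that each subword is itself interpreted as a 2's-complement number, which the text does not state explicitly, and the function names are hypothetical.

```python
def split_subwords(value: int, n: int, parts: int) -> list:
    """Split an n-bit word into `parts` signed subwords, least significant first."""
    width = n // parts
    mask = (1 << width) - 1
    subwords = []
    for k in range(parts):
        field = (value >> (k * width)) & mask
        if field >> (width - 1):  # sign bit of the subword is set
            field -= 1 << width
        subwords.append(field)
    return subwords


def subword_multiply(x: int, y: int, n: int, parts: int) -> list:
    """Return the independent products [P0, P1, ...] with Pk = Xk * Yk."""
    xs = split_subwords(x, n, parts)
    ys = split_subwords(y, n, parts)
    return [a * b for a, b in zip(xs, ys)]
```

With n=8 and two parts, the operands 0x3F and 0x52 split into (X1, X0) = (3, -1) and (Y1, Y0) = (5, 2), giving P0 = -2 and P1 = 15. Setting parts to 4 gives the n/4xn/4 case.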
Fig. 2.4. Subword multiplication: (a) two n/2xn/2 multiplications, (b) two n/2xn/2 partial-product array distribution, (c) four n/4xn/4 multiplications, and (d) four n/4xn/4 partial-product array distribution.
2.3 Low-Error Fixed-Width Multipliers
Various fixed-width multipliers with adaptive compensation biases have been widely discussed in [12-20]. Herein, considering the tradeoff between truncation error and area cost reported in [19-20], we choose w=1 (i.e., keeping the n+1 most significant columns) and Q=0 for the prototype multiplier structure, where Q is defined in [19-20]. The error-compensation bias can be summarized as
the (n+2)th column counted from left to right of Fig. 2.2) for the Booth architecture. If the Baugh-Wooley architecture is the basic multiplier, Emain = xn-1y0 + xn-2y1 + xn-3y2 from left to right of Fig. 2.3). Fig. 2.5 and Fig. 2.6(a) show the prototype Booth multiplier structure and the prototype Baugh-Wooley multiplier structure for n=8, respectively, where A, ND, HA, and FA denote an AND gate, a NAND gate, a half adder, and a full adder, respectively, and the logic diagrams of the other processing elements are depicted in Fig. 2.6(b).
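To give a feeling for the truncation error that the compensation bias has to correct, the following sketch drops the least significant columns of a signed partial-product array and measures what is lost. It models plain truncation only, not the adaptive compensation of [19-20]. It also assumes that "keeping the n+w most significant columns" means keeping the columns of weight 2^(n-w) and above, and that w is at least 1; the function names are hypothetical.

```python
def signed_partial_products(x: int, y: int, n: int):
    """Yield (column, term) pairs, where term is +/- x_i * y_j * 2^(i+j)."""
    for i in range(n):
        for j in range(n):
            bit = ((x >> i) & 1) & ((y >> j) & 1)
            # Terms with exactly one sign bit carry negative weight.
            sign = -1 if (i == n - 1) != (j == n - 1) else 1
            yield i + j, sign * (bit << (i + j))


def truncated_sum(x: int, y: int, n: int, w: int) -> int:
    """Sum only the columns of weight 2^(n-w) and above (no compensation)."""
    return sum(t for col, t in signed_partial_products(x, y, n) if col >= n - w)


def max_dropped(n: int, w: int) -> int:
    """Largest value the discarded columns can contribute, for w >= 1."""
    return sum((k + 1) << k for k in range(n - w))
```

For w >= 1 the discarded terms never involve a sign bit, so the uncompensated result always underestimates the exact product, by at most max_dropped(n, w). With w=n nothing is discarded and truncated_sum returns the full-precision product.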
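Fig. 2.5. Prototype Booth multiplier structure for n=8.
Fig. 2.6. (a) Prototype Baugh-Wooley multiplier structure for n=8 and (b) logic diagrams of AOR, ANOR, AHA, AFA, and NFA.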
Chapter 3
Design of Reconfigurable Fixed-Width Multipliers
In this chapter, we describe the design methodology of the reconfigurable fixed-width multipliers based on the Booth architecture and the Baugh-Wooley architecture, respectively.
3.1 Reconfigurable Fixed-Width Booth Multiplier
In this section, we demonstrate how to generate four different multipliers under the limited hardware resources of the fixed-width Booth multiplier. In this thesis, we use the fixed-width multiplier in Fig. 3.1 as our reconfigurable Booth multiplier prototype instead of the full-precision multiplier structure, where the fixed-width multiplier truncates the partial products of the least significant part (LSP), as shown in the dash-lined region of Fig. 3.1, and compensates the error with an adaptive compensation bias.
In Fig. 3.1, two modules denoted as MUL1 and MUL2 are used to configure the four different multipliers listed in Table 3.1 through the corresponding four configuration modes (CMs). Thus, the proposed reconfigurable fixed-width Booth multiplier employing MUL1 and MUL2 is essentially different from the full-precision ones [21-31]. Without loss of generality, we use n=8 to investigate each CM case.
Fig. 3.1. Prototype structure of the proposed reconfigurable fixed-width Booth multiplier involving MUL1, MUL2, and discarding truncated region of LSP.
Table 3.1: Proposed four configuration modes of the reconfigurable fixed-width Booth multiplier
Configuration Mode (CM) | Function Description | Mode Applications
CM1 | nxn fixed-width multiplier | High resolution computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM2 | two n/2xn/2 fixed-width multipliers | Parallel computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM3 | n/2xn/2 full-precision multiplier | Full-precision computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM4 | sum of two n/2xn/2 fixed-width multipliers | Parallel multiplication and add computations: matrix multiplication, filter, transform
3.1.1 CM1: nxn Fixed-Width Multiplier
CM1 performs nxn fixed-width multiplication, receiving two n-bit numbers and producing an n-bit product. Since CM1 is confined to w=1, the partial-product array diagram shown in Fig. 3.2(a) with n=8 can be easily obtained from Fig. 2.2. The partial products in the dash-lined region of Fig. 3.2(a), denoted as σ, are used to compute the error compensation bias in (9). Note that the error performance of the Booth-based CM1 is the same as that of [20] because we implement the same adaptive compensation bias circuits. Throughout Section 3.1, in order to realize all four configuration modes, we provide three configuration parameters, CP0, CP1, and CP2, which combine with the partial-product settings to generate
Fig. 3.2. (a) Partial-product array diagram for nxn fixed-width multiplier with n=8, and (b) configuration parameter settings.
3.1.2 CM2: Two n/2xn/2 Fixed-Width Multipliers
CM2 concurrently performs two n/2xn/2 fixed-width multiplications. In this configuration mode, we need two copies of the hardware resource to implement CM2. The corresponding fixed-width subword operation of CM2 is illustrated in Fig. 3.3, where, because of the limited hardware resource, the two subword products are X1Y0 and X0Y1, and each fixed-width multiplication has an n/2-bit wide output. Note that we could produce the subword products X0Y0 and X1Y1, as in conventional subword multiplication, by exchanging X0 and X1 before the Booth selector. However, X0Y0 and X1Y1 would lead to larger exchange hardware overhead. Thus, we adopt the X1Y0 and X0Y1 subword operations. The partial-product array diagram is depicted in Fig. 3.4(a), where σCM2-1 and σCM2-2 denote the error compensation biases of X1Y0 and X0Y1, respectively. In Fig. 3.4(a) with n=8, compared with CM1, MUL1 can remain unchanged, while MUL2 must configure {S3,4, S2,4} to their inverted values and {S3,5, S2,5} to {1,1}, circled by the dashed line and the solid line, respectively. At the logic gate level, we OR the original partial product with the control signal to generate 1 and use an XOR gate to produce the inverted sign bit, as shown in Fig. 3.4(b), where CS denotes the control signal; note that the inputs {x4, x3} are configured to {x3, x3}. In addition, S2,6, S2,7, S3,6, and S3,7 are set to zero. The configuration parameter settings of CM2 are given in Fig. 3.4(c), where CP0, CP1, and CP2 are set to 1, 1, and 0, respectively. Since the other partial products are configured to 0, we do not need to AND each of these partial products with the control signal. Instead, our approach is to AND the three Booth recoding bits with the control signal, as shown in Fig. 3.4(d), which saves AND gates as n increases. Hence, no matter how many partial products need to be configured to 0, we require only three AND gates if these partial products are in the same row. In addition, due to the permanently inverted MSB output of the partial products in MUL2, the bits denoted as 0 represent one, but the summation result of the most significant half of MUL2 still produces zero. In summary, the two products of CM2 can be generally expressed in (11a) and (11b).
1 partial-products of CM1.
Fig. 3.3. Subword operation for two n/2xn/2 fixed-width multiplications.
Fig. 3.4. (a) Partial-product array diagram for two n/2xn/2 fixed-width multipliers with n=8, (b) configuration settings of S2,4 and S3,4, (c) configuration parameter settings, and (d) configured Booth recoding circuit.
3.1.3 CM3: n/2xn/2 Full-Precision Multiplier
CM3 performs an n/2xn/2 full-precision multiplication. The corresponding subword operation of CM3 is illustrated in Fig. 3.5, where the subword product is X1Y1 with an n-bit wide output. Since the proposed reconfigurable structure implements full-precision multiplication on the fixed-width multiplier fabric, MUL2 is able to carry out this mode. The partial-product array diagram is depicted in Fig. 3.6(a) with n=8, where the configuration of S2,4 and S3,4, circled by the dashed line, differs from that of the other modes because the multiplicand inputs of their Booth selectors have to be configured from {x4, x3} to {x4, 0}, as shown in Fig. 3.6(b). The partial product S3,2, circled by the solid line, is configured to neg2 for 2’s-complement computation. In addition, S2,2, S2,3, S3,0, S3,1, and S3,3 are configured to 0 in MUL2. In this mode, the output of MUL1 must be zero, as shown in the upper diagram of Fig. 3.6(a). The configuration parameter settings of CM3 are given in Fig. 3.6(c), where CP0, CP1, and CP2 are set to 0, 1, and neg_{n/2-1}, respectively. For n=8, neg_{n/2-1} is neg3. As in CM2, we AND the three Booth recoding bits with the control signals to generate 0 in MUL2. To generate 0 in MUL1, however, we place AND gates before the Booth encoder to AND the multiplier inputs with the control signal, as shown in Fig. 3.6(d), so that the Booth encoder already produces 0 ahead of the Booth selector. In summary, the product of CM3 can be generally expressed in (12).
where Si,n/2 are the reconfigured partial products, similar to those in Fig. 3.6(b).
Fig. 3.5. Subword operation for one n/2xn/2 full-precision multiplication.
Fig. 3.6. (a) Partial-product array diagram for n/2xn/2 full-precision multiplier with n=8, (b) configuration settings of S2,4 and S3,4, (c) configuration parameter settings, and (d) configured Booth recoding circuit.
3.1.4 CM4: Sum of Two n/2xn/2 Fixed-Width Multipliers
The main function of CM4 is to add two n/2-bit wide fixed-width multiplication results. The partial product array diagram is sketched in Fig. 3.7. In Fig. 3.7, the configuration settings are the same as those of CM2; however, the output arrangement is different. The details will be explained in the next paragraph.
Fig. 3.7. Partial-product array diagram for sum of two n/2xn/2 fixed-width multipliers with n=8.
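The four configuration modes of Sections 3.1.1 to 3.1.4 can be summarized by a purely behavioral sketch for n=8. It is not a model of the hardware: an idealized floor truncation, (a*b) >> m, stands in for the truncate-and-compensate datapath, so the compensation error is ignored, and subwords are assumed to be signed. The subword pairing follows the text (X1Y0 and X0Y1 for CM2 and CM4, X1Y1 for CM3). All function names are hypothetical.

```python
def signed_field(value: int, lo: int, width: int) -> int:
    """Extract `width` bits starting at bit `lo` as a 2's-complement number."""
    field = (value >> lo) & ((1 << width) - 1)
    return field - (1 << width) if field >> (width - 1) else field


def ideal_fixed_width(a: int, b: int, m: int) -> int:
    """Idealized m-bit fixed-width product: upper half of the 2m-bit product."""
    return (a * b) >> m


def multiply(mode: str, x: int, y: int, n: int = 8):
    """Behavioral result of one configuration mode on n-bit words x and y."""
    h = n // 2
    x0, x1 = signed_field(x, 0, h), signed_field(x, h, h)
    y0, y1 = signed_field(y, 0, h), signed_field(y, h, h)
    if mode == "CM1":  # nxn fixed-width product
        return ideal_fixed_width(signed_field(x, 0, n), signed_field(y, 0, n), n)
    if mode == "CM2":  # two n/2xn/2 fixed-width products (X1Y0, X0Y1)
        return ideal_fixed_width(x1, y0, h), ideal_fixed_width(x0, y1, h)
    if mode == "CM3":  # one n/2xn/2 full-precision product X1Y1
        return x1 * y1
    if mode == "CM4":  # sum of the two CM2 products
        return ideal_fixed_width(x1, y0, h) + ideal_fixed_width(x0, y1, h)
    raise ValueError("unknown mode: " + mode)
```

For x=0x7D and y=0x6A, the subwords are (X1, X0) = (7, -3) and (Y1, Y0) = (6, -6), so CM2 returns (-3, -2), CM3 returns 42, and CM4 returns -5.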
3.1.5 Proposed Structure
According to the above partial-product array analysis of the four configuration modes, we observe the following architecture design viewpoints.
1) MUL1 and MUL2 need to operate individually for CM2. A last-stage adder sums the outputs of MUL1 and MUL2 to generate the product for CM1, CM3, and CM4.
2) The adaptive compensation circuit needs to be reconfigured for nxn and n/2xn/2 fixed-width multiplication.
3) The Booth encoder circuit needs to be reconfigured for nxn and n/2xn/2 multiplication.
4) The carryout signals need to be controlled according to the different CMs.
5) The output of the final product needs to be rearranged according to the CMs.
According to the above viewpoints, the proposed reconfigurable structure for n=8 is depicted in Fig. 3.8, where we separate the fixed-width multiplication array into two multiplier modules, MUL1 and MUL2, whose outputs are fed into the last-stage adder. First, a decoder is in charge of decoding the OP code to generate the control signals for the two multiplier modules; its truth table is listed in Table 3.2. By inspection, the parameters CP0, CP1, and CP2, circled by the dashed line in Fig. 3.8, can be easily realized by CS[1], the complement of CS[0], and CS[2]·neg3, respectively, which agrees with the settings (1, 1, 0) for CM2 and (0, 1, neg3) for CM3. For the last-stage adder, since we do not need to sum MUL1 and MUL2 in CM2, the least significant half of the adder inputs must be switched from the least significant half products of MUL2 to zero. Thus, the least significant half of the last-stage adder directly outputs the products of MUL1. In Fig. 3.8, we use AND gates to switch the least significant half products of MUL2 to zero. Second, for CM4, sign extension is required before summing MUL1 and MUL2 so that the sign bits are generated after summation. In Fig. 3.8, two multiplexers select the sign bits of MUL1 and MUL2, denoted as PMUL1[11] and PMUL2[11], respectively. Thus, the most significant half of the last-stage adder generates the sign bit denoted as S[12].
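The decoder of Table 3.2 and the derivation of the configuration parameters can be sketched as below. Two points are assumptions of this example: the OP code is treated as the two-bit value shown in the table rows, and CP1 is taken as the complement of CS[0], because CS[0] is 1 only for CM1 in Table 3.2 while the text sets CP1 to 1 for CM2 and CM3. The function names are hypothetical.

```python
# CS[3:0] for each OP code, copied from Table 3.2 (listed as CS[3], CS[2], CS[1], CS[0]).
DECODER = {
    0b00: (0, 0, 0, 1),  # CM1
    0b01: (0, 0, 1, 0),  # CM2
    0b10: (0, 1, 0, 0),  # CM3
    0b11: (1, 0, 1, 0),  # CM4
}


def decode(op: int) -> dict:
    """Return the control signals as a dict indexed by bit position."""
    cs3, cs2, cs1, cs0 = DECODER[op]
    return {3: cs3, 2: cs2, 1: cs1, 0: cs0}


def config_params(op: int, neg3: int) -> tuple:
    """Derive (CP0, CP1, CP2) from the control signals and the Booth bit neg3."""
    cs = decode(op)
    cp0 = cs[1]
    cp1 = 1 - cs[0]     # assumption: complement of CS[0], see the note above
    cp2 = cs[2] & neg3  # CS[2] AND neg3
    return cp0, cp1, cp2
```

Under these assumptions, config_params(0b01, neg3) gives (1, 1, 0) for CM2 and config_params(0b10, neg3) gives (0, 1, neg3) for CM3, matching the settings quoted for Fig. 3.4(c) and Fig. 3.6(c); CM4 yields the same parameters as CM2.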
From viewpoint 2, since σCM1 is composed of σCM2-1 and σCM2-2, the two adaptive compensation biases σCM2-1 and σCM2-2 must be carefully controlled. According to the binary thresholding mentioned in [20], if each adaptive compensation bias adds a constant K=1/2 for Q0,w10, the two adaptive compensation biases are not equivalent to the compensation design shown in Fig. 7 of [20]. Thus, the design would lead to a larger truncation error for CM1 than adding the constant K=1/2 only once. Herein, we propose sub-calibration-circuit 1 (SCC1) and sub-calibration-circuit 2 (SCC2) to avoid the double constant addition and to achieve this reconfiguration for nxn and n/2xn/2 fixed-width multiplications. The logic diagram of SCC1 and SCC2, shown in Fig. 3.9, incurs little area overhead, and the truth table of SCC1 and SCC2 is given in Table 3.3. For CM1, if Km1=1 and Km2=1 (i.e. Q0,w1 0), then SCC1=1 and SCC2=0 to avoid double addition of the constant K=1/2. Otherwise, SCC1=0 and SCC2=0 since Q0,w10. For CM2 or CM4, two independent n/2xn/2 multipliers operate in parallel. Thus, SCC1 and SCC2 follow the values of Km1 and Km2 (i.e., SCC1=Km1 and SCC2=Km2).
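The behavior of SCC1 and SCC2 described above can be captured in two Boolean expressions. The expressions below are one possible realization derived from the truth table in Table 3.3, using CS[1] as the select signal (CS[1]=1 for CM2 and CM4 in Table 3.2); they are an assumption of this sketch and not necessarily the gate structure of Fig. 3.9. The function name is hypothetical.

```python
def scc(km1: int, km2: int, cs1: int) -> tuple:
    """Return (SCC1, SCC2) for the given Km1, Km2 and select signal CS[1]."""
    # CS[1] = 0 (CM1): SCC1 = Km1 AND Km2, so the constant is added only once.
    # CS[1] = 1 (CM2 or CM4): SCC1 simply follows Km1.
    scc1 = km1 & (km2 | cs1)
    # SCC2 follows Km2 only when the two half multipliers run independently.
    scc2 = km2 & cs1
    return scc1, scc2
```

Enumerating all inputs reproduces the eight rows of Table 3.3, for example scc(1, 1, 0) gives (1, 0) and scc(1, 1, 1) gives (1, 1).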
From viewpoint 3, apart from the above-mentioned configurations, the multiplier bit y_{n/2-1} has to be configured to 0 when CM2, CM3, or CM4 is performed. In Fig. 3.8, the input of Booth encoder2 is {Y[5],Y[4],Y[3]} for CM1, whereas it is {Y[5],Y[4],0} when one of the other three modes is selected. In addition, from viewpoint 4, three carryout signals, denoted as Com1, Com2_1, and Com2_2, are also configured. Com1 is propagated to the last-stage adder only if CM1 is performed. Com2_1 and Com2_2 are propagated if either CM1 or CM3 is performed.
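The mode-dependent encoder input and carry gating described above can be written as a small sketch for n=8. It assumes that a carry which is not propagated is simply forced to 0, which the text does not spell out, and the function names and mode strings are hypothetical.

```python
def booth_encoder2_input(y: int, mode: str) -> tuple:
    """Return the three input bits of Booth encoder2, MSB first."""
    y5, y4, y3 = (y >> 5) & 1, (y >> 4) & 1, (y >> 3) & 1
    # Y[3] is the overlap bit from the lower half; it is cut off (set to 0)
    # in every mode except the nxn mode CM1.
    return y5, y4, (y3 if mode == "CM1" else 0)


def gated_carries(mode: str, com1: int, com2_1: int, com2_2: int) -> tuple:
    """Return (Com1, Com2_1, Com2_2) as seen after mode-dependent gating."""
    pass_com1 = mode == "CM1"
    pass_com2 = mode in ("CM1", "CM3")
    return (
        com1 if pass_com1 else 0,
        com2_1 if pass_com2 else 0,
        com2_2 if pass_com2 else 0,
    )
```

Cutting the overlap bit is what decouples the upper subword from the lower one: with Y[3] forced to 0, the recoding of the upper half starts exactly as a standalone n/2-bit multiplier would.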
From viewpoint 5, the output arrangement of the final product differs according to the mode. As shown in Fig. 3.8, a multiplexer selects the most significant half of the final product. If {CS[1], CS[3]}={0, 0}, either CM1 or CM3 is performed, and the output is switched to the most significant half of the last-stage adder (i.e., S[15:12]). If {CS[1], CS[3]}={1, 0}, CM2 is performed, and the output is switched to the least significant half of MUL2 (i.e., PMUL2[11:8]). If {CS[1], CS[3]}={1, 1}, CM4 is performed, and the output is extended to 8 bits with the sign bit S[12].
Fig. 3.8. Overall structure of the proposed reconfigured fixed-width Booth multiplier for n=8.
Fig. 3.9. Logic diagram of SCC1 and SCC2.
Table 3.2: Truth table of the decoder for the reconfigurable fixed-width Booth multiplier
OP[2:0] | CS[3] | CS[2] | CS[1] | CS[0]
00 (CM1) | 0 | 0 | 0 | 1
01 (CM2) | 0 | 0 | 1 | 0
10 (CM3) | 0 | 1 | 0 | 0
11 (CM4) | 1 | 0 | 1 | 0
Table 3.3: Truth table of sub-calibration-circuit1 (SCC1) and sub-calibration-circuit2 (SCC2)
Km1 | Km2 | CM1: output of SCC1 | CM1: output of SCC2 | CM2 or CM4: output of SCC1 | CM2 or CM4: output of SCC2
0 | 0 | 0 | 0 | 0 | 0
0 | 1 | 0 | 0 | 0 | 1
1 | 0 | 0 | 0 | 1 | 0
1 | 1 | 1 | 0 | 1 | 1
3.2 Reconfigurable Fixed-Width Baugh-Wooley Multiplier
In this section, we demonstrate how to generate four different multipliers under the limited hardware resources of the fixed-width Baugh-Wooley multiplier. In this thesis, we use the fixed-width multiplier in Fig. 3.10 as our reconfigurable Baugh-Wooley multiplier prototype instead of the full-precision multiplier structure, where the fixed-width multiplier truncates the partial products of the least significant part (LSP), as shown in the dash-lined region of Fig. 3.10. In Fig. 3.10, three modules denoted as MUL1, MUL2, and MUL3 are used to configure the four different multipliers listed in Table 3.4 through the corresponding four configuration modes (CMs). Thus, the proposed reconfigurable fixed-width Baugh-Wooley multiplier employing MUL1, MUL2, and MUL3 is essentially different from the full-precision ones [21-31]. Without loss of generality, we use n=8 to investigate each CM case in the following.
Fig. 3.10. Prototype structure of the proposed reconfigurable fixed-width Baugh-Wooley multiplier involving MUL1, MUL2, MUL3 and discarding truncated region of LSP.
Table 3.4: Proposed four configuration modes of the reconfigurable fixed-width Baugh-Wooley multiplier
Configuration Mode (CM) | Function Description | Mode Applications
CM1 | nxn fixed-width multiplier | High resolution computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM2 | two n/2xn/2 fixed-width multipliers | Parallel computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM3 | n/2xn/2 full-precision multiplier | Full-precision computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM4 | two n/4xn/4 fixed-width multipliers | Parallel computations: multiplication, matrix multiplication, square-root operation, filter, transform
3.2.1 CM1: nxn Fixed-Width Multiplier
CM1 is in charge of operating nxn fixed-width multiplication that receives two
n-bit numbers and produces an n-bit product. It is known that the various fixed-width