Thesis Organization - Design and Implementation of Low-Power Reconfigurable Fixed-Width Multipliers

Chapter 1 Introduction

1.2 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 briefly reviews the Booth multiplier, the Baugh-Wooley multiplier, subword multiplication, and low-error fixed-width multipliers. Chapter 3 presents the proposed reconfigurable fixed-width multiplication engine with four configuration modes. Chapter 4 presents the comparison results in terms of area and power saving and, for an FIR filter application, illustrates the error comparison and power-scalable performance using various fixed-width and full-precision multipliers. Finally, brief concluding remarks close the thesis.

Chapter 2 Fundamental Concepts

This chapter introduces the fundamental concepts used in this thesis: the Booth multiplier, the Baugh-Wooley multiplier, subword multiplication, and the low-error fixed-width multiplier.

2.1 Array Multipliers

In this thesis, we introduce two kinds of reconfigurable fixed-width multipliers: one based on the Booth multiplier and the other based on the Baugh-Wooley multiplier. Both are well-known multiplication algorithms widely used in digital signal processing. In the following, we briefly review these two algorithms.

2.1.1 Booth Multiplier

Considering two 2’s-complement integer operands, we can respectively represent an n-bit multiplicand X and an n-bit multiplier Y as follows.

X = -x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i    (1)

Y = -y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i    (2)

where x_i, y_i{0 ,1}. The 2n-bit full-precision product P_FP can be written as additions at each stage are required to generate the product. Substituting (4) into (3), we obtain (MSB) with a one-bit overlap. Table 2.1 lists the recoding rule and Fig. 2.1 shows the block diagram of the Booth recoding circuit. In Fig. 2.1, the Booth encoder generates three Booth recoding bits neg, X1, and X2 to the Booth selector and then the Booth selector selects input multiplicand {x_j, x_j-1} as output partial products. In order to simplify the representation of each partial product, we define the following notation.

In 2’s-complement Booth arithmetic operations, sign extension of the partial products is required at each stage, but these extended sign bits lead to a large area and power overhead. The sign S of an n×n multiplier can be expressed as

Substituting (6) and (7) into (5), we obtain the partial-product array diagram for the n×n Booth multiplier depicted in Fig. 2.2, where the notation w means that the n+w most significant columns of the partial products are kept for fixed-width multiplication. If w=n, the fixed-width multiplier becomes a full-precision multiplier. In this thesis, we reconfigure the fixed-width multiplication engine to generate several useful multipliers under limited hardware resources for DSP and image processing applications.

Table 2.1: Modified Booth recoding table (columns: y_{2i+1}, y_{2i}, y_{2i-1}, y_i, neg, X1, X2)

Fig. 2.1. Block diagram of Booth recoding circuit.
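The recoding rule can be illustrated with a short Python sketch. It assumes the standard radix-4 modified Booth recoding, in which each overlapping triplet {y_{2i+1}, y_{2i}, y_{2i-1}} (with y_{-1}=0) maps to a digit in {-2, -1, 0, 1, 2}. The function names are hypothetical, and the sketch works with digit values only; it does not model the neg, X1, and X2 signal encoding of Table 2.1.

```python
def booth_radix4_digits(y: int, n: int) -> list[int]:
    """Recode an n-bit two's-complement multiplier y (n even) into n/2 radix-4 digits."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    # Python's >> is arithmetic, so this yields the two's-complement bits of y.
    bits = [(y >> k) & 1 for k in range(n)]
    digits = []
    for i in range(n // 2):
        y_prev = bits[2 * i - 1] if i > 0 else 0  # y_{-1} is defined as 0
        digits.append(-2 * bits[2 * i + 1] + bits[2 * i] + y_prev)
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the n/2 partial products d_i * x * 4^i selected by the recoded digits."""
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n)))
```

Each digit selects 0, ±X, or ±2X as a partial product, so an n×n multiplication needs only n/2 partial-product rows.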

Fig. 2.2. Modified Booth partial-product diagram with sign-generate sign extension scheme for an nxn multiplier.

2.1.2 Baugh-Wooley Multiplier

Considering two 2’s-complement integer operands, we can represent an n-bit multiplicand X and an n-bit multiplier Y as in (1) and (2), respectively. The 2n-bit full-precision product P_FP can be written as sums partial-product bits corresponding to each weighting. The partial-product array for n×n 2’s-complement multiplication is depicted in Fig. 2.3, where the notation w means that the n+w most significant columns of the partial products are kept for fixed-width multiplication. If w=n, the fixed-width multiplier becomes a full-precision multiplier.

In this thesis, we reconfigure the fixed-width multiplication engine to generate four useful multipliers under limited hardware resources.

Fig. 2.3. Partial-product array diagram for an nxn Baugh-Wooley multiplier.
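The role of w can be illustrated with a small Python sketch. It expands the product of (1) and (2) into sign-weighted partial-product bits x_i·y_j·2^{i+j} and keeps only the n+w most significant columns. This is an assumption-laden illustration: it uses the raw signed expansion rather than the rearranged Baugh-Wooley array of Fig. 2.3, it applies no error-compensation bias, and all function names are hypothetical.

```python
def bits_of(value: int, n: int) -> list[int]:
    """Two's-complement bits of value, least significant bit first."""
    return [(value >> k) & 1 for k in range(n)]


def signed_partial_products(x: int, y: int, n: int) -> list[tuple[int, int]]:
    """Return (column, term) pairs; a term is negative when exactly one sign bit is involved."""
    xb, yb = bits_of(x, n), bits_of(y, n)
    terms = []
    for i in range(n):
        for j in range(n):
            sign = -1 if (i == n - 1) != (j == n - 1) else 1
            terms.append((i + j, sign * (xb[i] & yb[j])))
    return terms


def truncated_product(x: int, y: int, n: int, w: int) -> int:
    """Sum only the n+w most significant columns (columns n-w and above)."""
    cutoff = n - w
    return sum(t * (1 << c) for c, t in signed_partial_products(x, y, n) if c >= cutoff)
```

With w=n nothing is discarded and the result equals the full-precision product; with smaller w the discarded columns produce the truncation error that a compensation bias is meant to reduce.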

2.2 Subword Multiplication

Many DSP and computer applications operate at lower resolution, where the data can be expressed in a half-word length [21-26]. By applying the subword multiplication scheme, an n-bit operand can be partitioned into two independent n/2-bit operands or four independent n/4-bit operands; hence, the subword multiplier can perform not only an n×n full-precision multiplication but also two n/2×n/2 or four n/4×n/4 full-precision multiplications in parallel. Fig. 2.4 illustrates subword multiplication and the partial-product array distribution [21-26]. In Fig. 2.4(a), two n-bit operands, X and Y, are partitioned into two independent pairs of n/2-bit subwords, and the two pairs are multiplied to produce two independent n-bit products, P₁=X₁Y₁ and P₀=X₀Y₀; the corresponding partial-product array distribution is shown in Fig. 2.4(b). Similarly, n/4×n/4 subword multiplication and its partial-product array distribution are illustrated in Fig. 2.4(c) and 2.4(d), respectively. To the best of our knowledge, the current subword scheme is applied only to full-precision multiplication based on the full-precision multiplier infrastructure. In the following section, we will extend this subword scheme to fixed-width and full-precision multiplication using the fixed-width prototype multiplier.

Fig. 2.4. Subword multiplication: (a) two n/2×n/2 multiplications, (b) two n/2×n/2 partial-product array distribution, (c) four n/4×n/4 multiplications, and (d) four n/4×n/4 partial-product array distribution.
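The two-subword case of Fig. 2.4(a) can be sketched behaviorally in Python. The sketch assumes that each n/2-bit subword is an independent 2’s-complement number packed into one n-bit word; the helper names (`pack`, `split_subwords`, `subword_multiply`) are hypothetical and only model the arithmetic, not the partial-product array.

```python
def to_signed(value: int, width: int) -> int:
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


def pack(hi: int, lo: int, n: int) -> int:
    """Pack two signed n/2-bit subwords into one n-bit word."""
    half = n // 2
    mask = (1 << half) - 1
    return ((hi & mask) << half) | (lo & mask)


def split_subwords(word: int, n: int) -> tuple[int, int]:
    """Return the signed subwords (X1, X0) of an n-bit word."""
    half = n // 2
    return to_signed(word >> half, half), to_signed(word, half)


def subword_multiply(x: int, y: int, n: int) -> tuple[int, int]:
    """Return the two independent products (P1, P0) = (X1*Y1, X0*Y0)."""
    x1, x0 = split_subwords(x, n)
    y1, y0 = split_subwords(y, n)
    return x1 * y1, x0 * y0
```

For example, with n=8, packing X=(-3, 5) and Y=(7, -2) and calling `subword_multiply` yields the pair (-21, -10).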

2.3 Low-Error Fixed-Width Multipliers

Fixed-width multipliers with adaptive compensation biases have been widely discussed in [12-20]. Herein, considering the tradeoff between truncation error and area cost in [19-20], we choose w=1 (i.e., keeping the n+1 most significant columns) and Q=0 for the prototype multiplier structure, where Q is defined in [19-20]. The error-compensation bias can be summarized as

the (n+2)th column counted from left to right of Fig. 2.2) for the Booth architecture. If the Baugh-Wooley architecture is the basic multiplier, E_main = x_{n-1}y_0 + x_{n-2}y_1 + x_{n-3}y_2 from left to right of Fig. 2.3). Fig. 2.5 and Fig. 2.6(a) show the prototype Booth multiplier structure and the prototype Baugh-Wooley multiplier structure for n=8, respectively, where A, ND, HA, and FA denote an AND gate, a NAND gate, a half adder, and a full adder, respectively; the logic diagrams of the other processing elements are depicted in Fig. 2.6(b).

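Fig. 2.5. Prototype Booth multiplier structure for n=8.

Fig. 2.6. (a) Prototype Baugh-Wooley multiplier structure for n=8, and (b) logic diagrams of AOR, ANOR, AHA, AFA, NFA.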

Chapter 3 Design of Reconfigurable Fixed-Width Multipliers

In this chapter, we describe the design methodology of the reconfigurable fixed-width multipliers based on the Booth architecture and the Baugh-Wooley architecture, respectively.

3.1 Reconfigurable Fixed-Width Booth Multiplier

In this section, we demonstrate how to generate four different multipliers under the limited hardware resources of the fixed-width Booth multiplier. We use the fixed-width multiplier in Fig. 3.1, rather than the full-precision multiplier structure, as the reconfigurable Booth multiplier prototype; the fixed-width multiplier truncates the partial products of the least significant part (LSP), shown in the dashed region of Fig. 3.1, and compensates for the error with an adaptive compensation bias.

In Fig. 3.1, two modules denoted as MUL1 and MUL2 are reconfigured into the four different multipliers listed in Table 3.1 through the corresponding four configuration modes (CMs). Thus, the proposed reconfigurable fixed-width Booth multiplier employing MUL1 and MUL2 is essentially different from the full-precision ones [21-31]. Without loss of generality, we use n=8 to investigate each CM case.

Fig. 3.1. Prototype structure of the proposed reconfigurable fixed-width Booth multiplier involving MUL1, MUL2, and discarding truncated region of LSP.

Table 3.1: Proposed four configuration modes of the reconfigurable fixed-width Booth multiplier

Configuration Mode (CM) | Function Description | Mode Applications
CM1 | n×n fixed-width multiplier | High-resolution computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM2 | Two n/2×n/2 fixed-width multipliers | Parallel computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM3 | n/2×n/2 full-precision multiplier | Full-precision computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM4 | Sum of two n/2×n/2 fixed-width multipliers | Parallel multiplication and add computations: matrix multiplication, filter, transform

3.1.1 CM1: nxn Fixed-Width Multiplier

CM1 performs n×n fixed-width multiplication: it receives two n-bit numbers and produces an n-bit product. Since CM1 is confined to w=1, the partial-product array diagram shown in Fig. 3.2(a) with n=8 can be easily obtained from Fig. 2.2. The partial products in the dashed region of Fig. 3.2(a), denoted as σ, are used to compute the error compensation bias in (9). Note that the error performance of the Booth-based CM1 is the same as that of [20] because we implement the same adaptive compensation bias circuits. Throughout Section 3.1, in order to realize all four configuration modes, we provide three configuration parameters, CP0, CP1, and CP2, combined with the partial-product settings to generate

Fig. 3.2. (a) Partial-product array diagram for nxn fixed-width multiplier with n=8, and (b) configuration parameter settings.

3.1.2 CM2: Two n/2xn/2 Fixed-Width Multipliers

CM2 concurrently performs two n/2×n/2 fixed-width multiplications. In this configuration mode, we need two copies of hardware resource to implement CM2. The corresponding fixed-width subword operation of CM2 is illustrated in Fig. 3.3, where the two subword products are X₁Y₀ and X₀Y₁ because of the limited hardware resources, and each fixed-width multiplication has an n/2-bit wide output.

Note that we could produce the subword products X₀Y₀ and X₁Y₁, as in conventional subword multiplication, by exchanging X₀ and X₁ before the Booth selector. However, X₀Y₀ and X₁Y₁ would lead to larger exchange hardware overhead. Thus, we adopt the X₁Y₀ and X₀Y₁ subword operations. The partial-product array diagram is depicted in Fig. 3.4(a), where σCM2-1 and σCM2-2 denote the error compensation biases of X₁Y₀ and X₀Y₁, respectively. In Fig. 3.4(a) with n=8, compared with CM1, MUL1 can be unchanged while MUL2 must reconfigure {S_3,4, S_2,4}, circled by the dashed line, and set {S_3,5, S_2,5} to {1, 1}, circled by the solid line. At the logic gate level, we OR the original partial product with the control signal to generate 1 and use an XOR gate to produce the inverted sign bit as shown in Fig. 3.4(b), where CS denotes the control signal; note that the inputs {x4, x3} are configured to {x3, x3}. In addition, S_2,6, S_2,7, S_3,6, and S_3,7 are set to zero. The configuration parameter settings of CM2 are given in Fig. 3.4(c), where CP0, CP1, and CP2 are set to 1, 1, and 0, respectively. For the other partial products that are configured to 0, we do not need to AND each partial product with the control signal.

Our approach is to AND the three Booth recoding bits with the control signal, as shown in Fig. 3.4(d), which saves AND gates as n increases. Hence, no matter how many partial products need to be configured to 0, only three AND gates are required if these partial products are in the same row. In addition, due to the permanently inverted MSB output of the partial products in MUL2, the bits denoted as 0 represent one, but the summation result of the most significant half of MUL2 still produces zero. In summary, the two products of CM2 can be generally expressed in (11a) and (11b).

1 partial-products of CM1.

Fig. 3.3. Subword operation for two n/2xn/2 fixed-width multiplications.

Fig. 3.4. (a) Partial-product array diagram for two n/2xn/2 fixed-width multipliers with n=8, (b) configuration settings of S_2,4 and S_3,4, (c) configuration parameter settings, and (d) configured Booth recoding circuit.

3.1.3 CM3: n/2xn/2 Full-Precision Multiplier

CM3 performs an n/2×n/2 full-precision multiplication. The corresponding subword operation of CM3 is illustrated in Fig. 3.5, where the subword product is X₁Y₁ with an n-bit wide output. Since the proposed reconfigurable structure implements full-precision multiplication on the fixed-width multiplier fabric, MUL2 is able to carry out this mode. The partial-product array diagram is depicted in Fig. 3.6(a) with n=8, where the configuration of S_2,4 and S_3,4, circled by the dashed line, differs from that of the other modes because the multiplicand inputs of their Booth selector have to be configured from {x₄, x₃} to {x₄, 0}, as shown in Fig. 3.6(b). The partial product S_3,2, circled by the solid line, is configured to neg₂ for 2’s-complement computation. In addition, S_2,2, S_2,3, S_3,0, S_3,1, and S_3,3 are configured to 0 in MUL2. In this mode, the output of MUL1 needs to be zero, as shown in the upper diagram of Fig. 3.6(a). The configuration parameter settings of CM3 are given in Fig. 3.6(c), where CP0, CP1, and CP2 are set to 0, 1, and neg_{n/2-1}, respectively. For n=8, neg_{n/2-1} is neg₃. As described above, we AND the three Booth recoding bits with the control signals to generate 0 in MUL2.

However, in order to generate 0 in MUL1, we use AND gates before the Booth encoder to AND the multiplier inputs with the control signal, as shown in Fig. 3.6(d), such that the Booth encoder generates 0 before the Booth selector. In summary, the product of CM3 can be generally expressed in (12).

where S_i,n/2 are the reconfigured partial products, similar to those in Fig. 3.6(b).

Fig. 3.5. Subword operation for one n/2xn/2 full-precision multiplication.

Fig. 3.6. (a) Partial-product array diagram for n/2xn/2 full-precision multiplier with n=8, (b) configuration settings of S_2,4 and S_3,4, (c) configuration parameter settings, and (d) configured Booth recoding circuit.

3.1.4 CM4: Sum of Two n/2xn/2 Fixed-Width Multipliers

The main function of CM4 is to add two n/2-bit wide fixed-width multiplication results. The partial product array diagram is sketched in Fig. 3.7. In Fig. 3.7, the configuration settings are the same as those of CM2; however, the output arrangement is different. The details will be explained in the next paragraph.

Fig. 3.7. Partial-product array diagram for sum of two n/2xn/2 fixed-width multipliers with n=8.

3.1.5 Proposed Structure

According to the above partial-product array analysis of the four configuration modes, we observe the following architecture design viewpoints.

1) Need to individually operate MUL1 and MUL2 for CM2. There exists a last-stage adder to sum the outputs of MUL1 and MUL2 to generate product for CM1, CM3, and CM4.

2) Need to reconfigure adaptive compensation circuit for nxn and n/2xn/2 fixed-width multiplication.

3) Need to reconfigure Booth encoder circuit for nxn and n/2xn/2 multiplication.

4) Need to control carryout signals according to different CMs.

5) Need to rearrange the output of the final product according to CMs.

According to the above viewpoints, the proposed reconfigurable structure for n=8 is depicted in Fig. 3.8, where we separate the fixed-width multiplication array into two multiplier modules, MUL1 and MUL2, whose outputs are fed into the last-stage adder. First, a decoder is in charge of decoding the OP code to generate the control signals for the two multiplier modules; its truth table is listed in Table 3.2. The parameters CP0, CP1, and CP2, circled by the dashed line in Fig. 3.8, can be realized by CS[1], CS[0], and CS[2]·neg₃, respectively. For the last-stage adder, since we do not need to sum MUL1 and MUL2 for CM2, the least significant half inputs of the last-stage adder must be switched from the least significant half products of MUL2 to zero. Thus, the least significant half of the last-stage adder directly outputs the products of MUL1. In Fig. 3.8, we use AND gates to switch the least significant half products of MUL2 to zero. Second, for CM4, sign extension is required before summing MUL1 and MUL2 so that the sign bits are generated after summation. In Fig. 3.8, two multiplexers select the sign bits of MUL1 and MUL2, which are denoted as P_MUL1[11] and P_MUL2[11], respectively. Thus, the most significant half of the last-stage adder generates the sign bit denoted as S[12].

From viewpoint 2, since σCM1 is composed of σCM2-1 and σCM2-2, the two adaptive compensation biases σCM2-1 and σCM2-2 need to be carefully controlled. According to the binary thresholding mentioned in [20], if each adaptive compensation bias adds a constant K=1/2 for _Q_₀_,_w_₁0, the two adaptive compensation biases are not equivalent to the compensation design shown in Fig. 7 of [20]. Thus, the design would lead to a larger truncation error for CM1 than adding the constant K=1/2 only once. Herein, we propose sub-calibration-circuit 1 (SCC1) and sub-calibration-circuit 2 (SCC2) to avoid the double constant addition and to achieve this reconfiguration for n×n and n/2×n/2 fixed-width multiplications. The logic diagram of SCC1 and SCC2, shown in Fig. 3.9, incurs little area overhead, and their truth table is given in Table 3.3. For CM1, if K_m1=1 and K_m2=1 (i.e. _Q_₀_,_w_₁ 0), then SCC1=1 and SCC2=0 to avoid double addition of the constant K=1/2. Otherwise, SCC1=0 and SCC2=0 since _Q_₀_,_w_₁0. For CM2 or CM4, two independent n/2×n/2 multipliers operate in parallel. Thus, SCC1 and SCC2 follow the values of K_m1 and K_m2 (i.e., SCC1=K_m1 and SCC2=K_m2).
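The behavior of SCC1 and SCC2 described above and tabulated in Table 3.3 can be sketched as a small Python function. The function name and the `split_mode` flag (true for CM2 or CM4, false for CM1) are hypothetical; the sketch models only the truth table, not the gate-level circuit of Fig. 3.9, and it does not cover CM3, for which the text gives no SCC values.

```python
def scc_outputs(km1: int, km2: int, split_mode: bool) -> tuple[int, int]:
    """Return (SCC1, SCC2) for the given K_m1, K_m2, and mode.

    split_mode=True  models CM2 or CM4: each half keeps its own constant.
    split_mode=False models CM1: the constant is added at most once.
    """
    if km1 not in (0, 1) or km2 not in (0, 1):
        raise ValueError("K_m1 and K_m2 must be 0 or 1")
    if split_mode:
        # Two independent n/2 x n/2 multipliers: SCC1 = K_m1, SCC2 = K_m2.
        return km1, km2
    # CM1: add the constant once, and only when both K_m1 and K_m2 are 1.
    return km1 & km2, 0
```

In CM1 the second output is always 0, which is what prevents the constant K=1/2 from being added twice.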

From viewpoint 3, apart from the above-mentioned configurations, the multiplier bit y_{n/2-1} has to be configured to 0 when CM2, CM3, or CM4 is performed. In Fig. 3.8, the input of Booth encoder 2 is {Y[5], Y[4], Y[3]} for CM1; when one of the other three modes is selected, the input of Booth encoder 2 is {Y[5], Y[4], 0}. In addition, from viewpoint 4, three carry-out signals denoted as Co_m1, Co_m2_1, and Co_m2_2 are also configured. Co_m1 is propagated to the last-stage adder only if CM1 is performed. Co_m2_1 and Co_m2_2 are propagated if either CM1 or CM3 is performed.

From viewpoint 5, the output arrangement of the final product differs according to the mode. As shown in Fig. 3.8, a multiplexer selects the most significant half of the final product. If {CS[1], CS[3]}={0, 0}, either CM1 or CM3 is performed, and the output is switched to the most significant half of the last-stage adder (i.e., S[15:12]). If {CS[1], CS[3]}={1, 0}, CM2 is performed, and the output is switched to the least significant half of MUL2 (i.e., P_MUL2[11:8]). If {CS[1], CS[3]}={1, 1}, CM4 is performed, and the output is extended to 8 bits with the sign bit S[12].
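The decoder of Table 3.2 and the output selection just described can be sketched in Python. This is a behavioral illustration only: the two-bit OP values and four-bit CS patterns are taken from Table 3.2, while the function names and the returned descriptive strings are hypothetical.

```python
# OP code -> CS[3:0], written as CS[3] CS[2] CS[1] CS[0], as listed in Table 3.2.
DECODER = {
    0b00: 0b0001,  # CM1
    0b01: 0b0010,  # CM2
    0b10: 0b0100,  # CM3
    0b11: 0b1010,  # CM4
}


def decode(op: int) -> int:
    """Return the 4-bit control word CS for a 2-bit OP code."""
    return DECODER[op]


def cs_bit(cs: int, k: int) -> int:
    """Return CS[k]."""
    return (cs >> k) & 1


def msb_half_source(cs: int) -> str:
    """Name the source of the most significant half of the final product."""
    select = (cs_bit(cs, 1), cs_bit(cs, 3))
    if select == (0, 0):
        return "S[15:12]"            # CM1 or CM3: last-stage adder
    if select == (1, 0):
        return "P_MUL2[11:8]"        # CM2: output of MUL2
    if select == (1, 1):
        return "sign extension of S[12]"  # CM4
    raise ValueError("control word not produced by the decoder")
```

Decoding each OP code and passing the result to `msb_half_source` reproduces the three cases in the text, with CM1 and CM3 sharing the same selection.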

Fig. 3.8. Overall structure of the proposed reconfigured fixed-width Booth multiplier for n=8.

Fig. 3.9. Logic diagram of SCC1 and SCC2.

Table 3.2: Truth table of the decoder for the reconfigurable fixed-width Booth multiplier

OP[1:0] | CS[3:0]
00 (CM1) | 0 0 0 1
01 (CM2) | 0 0 1 0
10 (CM3) | 0 1 0 0
11 (CM4) | 1 0 1 0

Table 3.3: Truth table of sub-calibration-circuit1 (SCC1) and sub-calibration-circuit2 (SCC2)

K_m1 | K_m2 | CM1: SCC1 | CM1: SCC2 | CM2 or CM4: SCC1 | CM2 or CM4: SCC2
0 | 0 | 0 | 0 | 0 | 0
0 | 1 | 0 | 0 | 0 | 1
1 | 0 | 0 | 0 | 1 | 0
1 | 1 | 1 | 0 | 1 | 1

3.2 Reconfigurable Fixed-Width Baugh-Wooley Multiplier

In this section, we demonstrate how to generate four different multipliers under the limited hardware resources of the fixed-width Baugh-Wooley multiplier. We use the fixed-width multiplier in Fig. 3.10, rather than the full-precision multiplier structure, as the reconfigurable Baugh-Wooley multiplier prototype; the fixed-width multiplier truncates the partial products of the least significant part (LSP), shown in the dashed region of Fig. 3.10. In Fig. 3.10, three modules denoted as MUL1, MUL2, and MUL3 are reconfigured into the four different multipliers listed in Table 3.4 through the corresponding four configuration modes (CMs). Thus, the proposed reconfigurable fixed-width Baugh-Wooley multiplier employing MUL1, MUL2, and MUL3 is essentially different from the full-precision ones [21-31]. Without loss of generality, we use n=8 to investigate each CM case in the following.

Fig. 3.10. Prototype structure of the proposed reconfigurable fixed-width Baugh-Wooley multiplier involving MUL1, MUL2, MUL3 and discarding truncated region of LSP.

Table 3.4: Proposed four configuration modes of the reconfigurable fixed-width Baugh-Wooley multiplier

Configuration Mode (CM) | Function Description | Mode Applications
CM1 | n×n fixed-width multiplier | High-resolution computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM2 | Two n/2×n/2 fixed-width multipliers | Parallel computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM3 | n/2×n/2 full-precision multiplier | Full-precision computations: multiplication, matrix multiplication, square-root operation, filter, transform
CM4 | Two n/4×n/4 fixed-width multipliers | Parallel computations: multiplication, matrix multiplication, square-root operation, filter, transform

3.2.1 CM1: nxn Fixed-Width Multiplier

CM1 performs n×n fixed-width multiplication: it receives two n-bit numbers and produces an n-bit product. It is known that the various fixed-width
