Chapter 2 Previous Works
2.3 Summaries of Previous Works
The multiplication flow of a scalar MBE multiplier can be partitioned into three steps – PPG, PPRT, and CPA. For PPG, a race-free encoding scheme which outperforms other schemes in terms of timing, area, and power consumption is proposed. Sign encoding that prevents sign extension, and hot-one modification that integrates LSB with hot-ones both make the PPA more regular. PPRT often uses levels of FAs to perform carry-save addition, and TDM is an algorithm that helps construct a speed optimized PPRT. The number of PPs after the PPRT is reduced to two. A CPA is used to sum the two PPs to obtain the final product. SWP increases throughput and provides a performance boost in multimedia extensions or DSP processors. Without much overhead, SWP can be applied to MUL/MAC unit by rearranging PPA and the support of SWP accumulation.
The proposed scalar and SWP designs improve and innovate while utilizing some previous works. We’ll describe the proposed designs in more detail in the next chapter.
CHAPTER 3
PROPOSED MAC DESIGNS
3.0 Overview
In this chapter, the design methodology of the proposed MAC designs is elaborated. Section 3.1 presents the scalar version of the proposed MAC design: as described in Chapter 2, the MAC unit consists of three parts – PPG, PPRT, and CPA.
Based on the scalar MAC architecture, Section 3.2 presents the sub-word parallel (SWP) version of the proposed MAC. The differences, improvements, and innovations are highlighted in each section and briefly summarized in Section 3.3.
3.1 Scalar MAC (SMAC) Design
3.1.0 Specification
A high-performance scalar MAC design is proposed. It multiplies the N-bit multiplicand (mcand) by the N-bit multiplier (mlier) and optionally accumulates the product with a 2N-bit accumulator (accu). It supports signed, unsigned, and mixed-mode operation.
Table 3.1 lists the specification of the proposed SMAC. Fig. 3.1 shows the proposed SMAC execution flow. Note that the carry-out of the final result is also provided.
Table 3.1. Specification of the proposed SMAC design.
Operation: m_out = accu + mcand × mlier (mode)
Bit Width of mcand 8/16/32/64
Bit Width of mlier 8/16/32/64
Bit Width of accu 16/32/64/128
Bit Width of m_out 16/32/64/128
Available modes 01:Signed/00:Unsigned/1?:Mixed-mode
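The operation in Table 3.1 can be summarized by a small behavioral reference model. The following Python sketch is only an illustration under stated assumptions: the function names `to_signed` and `mac_reference` are hypothetical, the operands are passed as raw N-bit patterns, and the result is truncated to 2N bits. The carry-out mentioned above is not modeled because its exact definition is not given in this section.

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def mac_reference(accu, mcand, mlier, n, mode):
    """Return the 2N-bit m_out = accu + mcand * mlier for a 2-bit mode.

    mode[1] (mix) = 1 selects mixed-mode; otherwise mode[0] (tc) selects
    signed (1) or unsigned (0), following the encoding in Table 3.1.
    """
    mix, tc = (mode >> 1) & 1, mode & 1
    mask = (1 << n) - 1
    # The multiplicand is signed in signed mode and in mixed-mode.
    a = to_signed(mcand, n) if (mix or tc) else mcand & mask
    # The multiplier is signed only in signed mode.
    b = to_signed(mlier, n) if (tc and not mix) else mlier & mask
    # Truncating to 2N bits makes the signedness of accu irrelevant here.
    return (accu + a * b) & ((1 << (2 * n)) - 1)
```

For example, with N = 8 the bit pattern 0xFF is read as -1 when signed and as 255 when unsigned, so the three modes give three different products for the same inputs.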
Fig. 3.1. Execution flow of the proposed Scalar MAC design.
3.1.1 Scalar Partial Product Generation (SPPG)
The first phase of SPPG, modified Booth encoding (MBE), encodes the triplets chosen from the multiplier and then decodes the multiplicand according to the MBA selection table (Table 2.1). The proposed scalar design adopts the race-free concept in [16], which reduces energy dissipation, and benefits from the implementation in [9], which saves one logic level and reduces area.
A special operating mode, mixed-mode, is integrated into the proposed scalar design. It forces the multiplicand and the accumulator to be signed, the multiplier to be unsigned, and produces a signed result after operation. Mixed-mode operation has a larger dynamic range, and will be explained in detail in Section 5.4.
However, the MBE scheme in [9] only applies to signed operands. To support unsigned/mixed-mode operation, some modification must be performed on the MBE.
By specification, both unsigned and mixed mode treat mlier, the multiplier, as an unsigned number; however, due to two’s complement (TC) format natively utilized in MBA, the MSB of mlier is the negatively weighted sign bit. It implies N+1 bits are required in TC format to fully represent an N-bit unsigned number by forcing the (N+1)th bit, the new MSB and sign bit, to a zero. Owing to the existence of the extra zero, an always positive PP is generated to support unsigned/mixed-mode operation.
This is why N-bit DSP processors commonly include an (N+1)-bit MAC unit to support unsigned multiplication.
Two methods can generate the extra PP. The first method uses the MBE with {0,0,m} as the extra encoding triplet, where m stands for the MSB of mlier; the resulting PP equals zero or mcand, since the extra triplet is always {0,0,0} or {0,0,1}. The second method relies on the same observation: when unsigned/mixed-mode is asserted, a multiplexer with a string of zeros and mcand as its two inputs and m as its control signal selects the extra PP. The result is identical to that of the first method. Both methods support unsigned/mixed-mode operation, and neither affects signed operation, since the MBE selection for the extra sign-extended triplet {m,m,m}, or the output of the MUX2s, is always zero. Section 5.1 details how unsigned and mixed-mode operation are supported.
With either method, the logic of the extended triplet {s,s,m}, or of the extended bit s, depends on m (the MSB of mlier) and on the mode under execution.
Naming mode[1] as mix (1: mixed-mode; 0: signed/unsigned mode) and mode[0] as tc (1: signed mode; 0: unsigned mode), the logic of s is derived as:
s = m · tc · (~mix).    (3.1)
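The effect of the extended triplet {s,s,m} can be checked with a short behavioral model. The Python sketch below is an illustration only, not the gate-level MBE of the proposed design: the names `BOOTH_DIGIT`, `ext_bit`, and `booth_product` are hypothetical, the digit table is the usual radix-4 Booth selection, and the extension bit is computed as s = m · tc · (~mix), so that s equals m in signed mode and zero in unsigned and mixed-mode.

```python
# Radix-4 Booth selection: (y[2k+1], y[2k], y[2k-1]) -> digit in {-2..2}.
BOOTH_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def ext_bit(m, tc, mix):
    """Extension bit s: the multiplier MSB m is sign-extended only in signed mode."""
    return m & tc & (mix ^ 1)


def booth_product(mcand, mlier, n, mode):
    """Multiply two raw N-bit patterns (N even) using N/2 + 1 Booth triplets."""
    mix, tc = (mode >> 1) & 1, mode & 1
    # The multiplicand is signed in signed mode and in mixed-mode.
    a = mcand - (1 << n) if (mix | tc) and (mcand >> (n - 1)) & 1 else mcand
    s = ext_bit((mlier >> (n - 1)) & 1, tc, mix)
    # y[0] is the assumed zero right of the LSB; two s bits extend the MSB.
    y = [0] + [(mlier >> i) & 1 for i in range(n)] + [s, s]
    total = 0
    for k in range(n // 2 + 1):
        digit = BOOTH_DIGIT[(y[2 * k + 2], y[2 * k + 1], y[2 * k])]
        total += digit * a * (1 << (2 * k))
    return total
```

In signed mode the last triplet is {m,m,m} and contributes nothing; in unsigned and mixed-mode it is {0,0,m} and contributes m · mcand · 2^N, which is exactly the always-positive extra PP described above.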
In the proposed SPPG, the first method is used. In addition, sign encoding is integrated into the MBE, resulting in an MBE with an N-bit input and an (N+2)-bit output.
Fig. 3.2 demonstrates why the output PP requires a two-bit extension. Assume that the 4-bit operand 1000 is the mcand and that the current encoding triplet is {1,0,0} (-2x), which indicates that mcand is to be negated and then shifted left by one bit.
Because of the one-bit left shift, a 5-bit temporary value is required, as shown in the second and third rows of the figure. The bit at bit position 5 saves the correct sign, which may otherwise be shifted out of the 5-bit boundary. Under a different operating mode, this saved bit may differ even if the lower bits are the same. This bit is also useful for sign encoding. Six bits are hence required for a correct representation.
However, because the logic of the two extended bits depends on the operating mode, two 2-input AND gates (AND2) are needed at the two most significant bits of the mcand to generate them. These AND2s are added to the decoder in Fig. 2.4, while the logic of the remaining LSBs is unchanged. This modification adds a small delay, and the design remains area-reduced.
Fig. 3.2. Decoding mcand 1000 in different modes when MBE selects -2x.
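The arithmetic behind this example can be recomputed with a few lines of Python. The sketch is an illustration under assumptions: `decode_pp` is a hypothetical helper that returns the full two's-complement value of the selected multiple in N+2 bits, whereas the hardware may instead store inverted bits and add the hot-one separately, so the bit patterns need not match the rows of the figure one for one.

```python
def decode_pp(mcand, n, digit, signed_mcand):
    """Return digit * mcand as an (n+2)-bit two's-complement bit string."""
    value = mcand
    if signed_mcand and (mcand >> (n - 1)) & 1:
        value -= 1 << n                      # read the pattern as negative
    pp = digit * value
    return format(pp & ((1 << (n + 2)) - 1), "0{}b".format(n + 2))


# mcand = 1000 with the -2x selection:
signed_case = decode_pp(0b1000, 4, -2, signed_mcand=True)     # -8 * -2 = +16
unsigned_case = decode_pp(0b1000, 4, -2, signed_mcand=False)  #  8 * -2 = -16
```

Both results share the same five low bits (10000) and differ only in bit position 5, which is 0 for the signed multiplicand and 1 for the unsigned one. This is the mode-dependent saved bit discussed above.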
The second phase of SPPG, arranging scalar partial product array (SPPA), is to properly arrange the PPs generated from MBE. Two techniques, sign encoding (SE) and hot-one modification, are used to arrange the proposed SPPA.
As mentioned in Section 2.2.1, SE is done by replacing the sign-extension bits with {p,n,n} for the first PP and {1,p} for others, where n stands for the sign bit of the PP and p = ~n. This technique reduces the number of sign-extension bits to two or three and then considerably saves the area and power consumption as bit width grows.
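The reason the {p,n,n} and {1,p} patterns can stand in for full sign extension is that the constants they introduce add up to a single power of two beyond the product width. The Python sketch below checks this numerically. It is an illustration under assumptions: each PP is modeled as a w-bit two's-complement word whose top bit is the sign n, consecutive PPs are shifted by two bits, the lowest bit of each pattern occupies the PP's own sign position, and the function names are hypothetical.

```python
def sum_sign_extended(pps, w):
    """Sum the PPs with conventional (full) sign extension."""
    total = 0
    for i, pp in enumerate(pps):
        value = pp - (1 << w) if pp >> (w - 1) else pp
        total += value << (2 * i)
    return total


def sum_sign_encoded(pps, w):
    """Sum the PPs after replacing sign extension with {p,n,n} / {1,p}."""
    total = 0
    for i, pp in enumerate(pps):
        n = pp >> (w - 1)                  # sign bit of this PP
        p = n ^ 1                          # p = ~n
        low = pp & ((1 << (w - 1)) - 1)    # bits below the sign position
        if i == 0:
            row = (p << (w + 1)) | (n << w) | (n << (w - 1)) | low
        else:
            row = (1 << w) | (p << (w - 1)) | low
        total += row << (2 * i)
    return total


def equivalent(pps, w):
    """True when both sums agree modulo 2**(w - 1 + 2 * len(pps))."""
    modulus = 1 << (w - 1 + 2 * len(pps))
    return (sum_sign_encoded(pps, w) - sum_sign_extended(pps, w)) % modulus == 0
```

With k PPs the two sums differ by exactly 2^(w-1+2k), which lies one position above the top bit of the last row and is therefore discarded when the result is truncated to the product width.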
Hot-one modification aligns the hot-one bits, which arise from two’s complementing the preceding PP, all to the left position (hot2) with a slight logic change on the LSB of the preceding PP. It makes the LSB end of the PPA shorter and more regular.
Both techniques help the proposed SMAC create a narrower-width SPPA which occupies less area, consumes less power, and assists the speed optimization of TDM PPRT. The proposed SPPG is architecturally similar to the PPG in [9].
3.1.2 Scalar Partial Product Reduction Tree (SPPRT)
Three-dimensional method (TDM) [8] is utilized to construct the proposed SPPRT with the architecture of Wallace Tree. A full-adder (FA) is the basic cell to build levels of CSAs. Fig. 3.3 shows the FA cell used in the proposed SPPRT.
TDM takes into account the delay information of each cell used in the tree. Instead of using logic cells such as XOR, AND, and OR to build an FA, the SPPRT directly uses the standard high-speed FA cell provided by the cell library. This not only simplifies the generation algorithm but also makes the delay estimate more accurate. All that is required is to look up in the cell library databook [37] the delays of the six paths in an FA (a-to-sum, b-to-sum, cin-to-sum, a-to-cout, b-to-cout, and cin-to-cout).
A simple software generator is developed to connect the FAs in the SPPRT using TDM.
TDM can be further optimized if the arrival time of each input bit of the PPRT is given. Such optimization is cell-library dependent and hence hard to reuse. The proposed design, in contrast, obtains reusability easily.
Although the delay information is cell-library dependent, looking it up and feeding it into the software generator to rebuild another SPPRT is effortless, since only a standard FA cell is used. However, it is not suitable to use the whole input signal delay profile to build the SPPRT, since the synthesizer may generate a different SPPG netlist each time the timing constraint varies. The ever-changing delay profile would leave the PPRT no longer speed optimized and perhaps not reusable. As a remedy, logic optimization is left to the synthesizer. Since the delay profile is unpredictable and ultimately only an estimate, the proposed scalar design simply assumes that all signals arrive at the SPPRT simultaneously, leading to a reusable TDM SPPRT.
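To make the idea of a delay-driven generator concrete, the following Python sketch builds a reduction tree column by column with a simple greedy rule. It is a simplified illustration in the spirit of TDM, not the algorithm of [8] and not the software generator used for the proposed design. The delay values in `FA_DELAY` are made up, the function name is hypothetical, only FAs are used (no half adders), and all inputs arrive at the same time, as assumed above.

```python
import heapq

# Hypothetical FA pin-to-output delays in arbitrary units: pin -> (to_sum, to_cout).
# A real flow would take the six path delays from the cell library databook.
FA_DELAY = {"a": (2.0, 1.6), "b": (2.0, 1.6), "cin": (1.0, 1.0)}


def build_reduction_tree(heights, arrival=0.0):
    """Reduce every column to at most two signals.

    heights[i] is the number of PP bits in column i (LSB first).
    Returns (per-column arrival times of the remaining signals, FA count).
    """
    cols = [[arrival] * h for h in heights]
    fa_count = 0
    i = 0
    while i < len(cols):
        heap = cols[i]
        heapq.heapify(heap)
        while len(heap) > 2:
            # Take the three earliest signals; the latest of them uses the
            # fastest pin (cin) so that it adds the least delay.
            t_a = heapq.heappop(heap)
            t_b = heapq.heappop(heap)
            t_cin = heapq.heappop(heap)
            pins = (("a", t_a), ("b", t_b), ("cin", t_cin))
            t_sum = max(t + FA_DELAY[pin][0] for pin, t in pins)
            t_cout = max(t + FA_DELAY[pin][1] for pin, t in pins)
            heapq.heappush(heap, t_sum)       # sum stays in this column
            if i + 1 == len(cols):
                cols.append([])
            cols[i + 1].append(t_cout)        # carry moves to the next column
            fa_count += 1
        i += 1
    return cols, fa_count
```

Because the generator depends only on the column heights and on six numbers per FA cell, retargeting it to another cell library amounts to replacing the entries of the delay table.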
Fig. 3.3. FA cell used in the proposed SPPRT.
3.1.3 Scalar Carry-Propagate Adder (SCPA)
The adders in both [8] and [9] exploit the input operand delay profile to configure a hybrid adder scheme that accelerates addition and reduces area. This again is cell-library dependent and hence hardly reusable. For the proposed scalar design, architectural optimization using the delay profile is not recommended. Every bit of the two SCPA operands is assumed to leave the SPPRT and arrive at the same time.
The Fong adder [26] is implemented as the SCPA. The architecture of a 32-bit Fong adder is shown in Fig. 2.9. The Fong adder is used for three main reasons. First, it outperforms most other adders in terms of delay while minimizing area cost compared to similar architectures. Second, it provides the carry-out bit, which allows overflow/underflow checking. Last but not least, the Fong adder also supports SWP, which meets our requirement with only a slight delay and area overhead.
The proposed SWP scheme is described in Section 3.2.
3.1.4 Summaries of the Proposed Scalar MAC Design
Fig. 3.4 displays the proposed scalar architecture. It is partitioned into SPPG, SPPRT, and SCPA. In SPPG, a race-free encoding scheme is used with a high-speed and area-reduced MBE implementation supporting signed, unsigned, and mixed-mode operation. Sign encoding and hot-one modification are applied to the proposed SPPA. In SPPRT, a speed-optimized, reusable PPRT exploiting TDM is built. As for SCPA, the Fong adder is used. Note that the figure actually shows the multiplier design. A MAC operation is performed simply by feeding the accumulator (typically a previous result) into the SPPRT as another PP. The proposed SWP design uses essentially the same hardware as the proposed scalar design. The way to perform SWP is described in the next section.
Fig. 3.4. The proposed scalar architecture.
3.2 Sub-Word Parallel MAC (SWP MAC) Design
3.2.0 Specification
A high performance sub-word parallel MAC (SWP MAC) design based on the SMAC architecture is proposed. Table 3.2 lists the specification of the SWP MAC.
Kill signals separate the sub-words (SWs), and each SW operates independently in its own mode.
Table 3.3 lists the possible sub-word combinations. The detailed SWP reconfiguration scheme is provided in Section 5.3.
Table 3.2. Specification of the proposed SWP MAC design.
Operation: m_out = accu + mcand × mlier (mode)(kill)
Bit Width of mcand 8/16/32/64
Bit Width of mlier 8/16/32/64
Bit Width of accu 16/32/64/128
Bit Width of m_out 16/32/64/128
Bit Width of a Basic SW Input:8/Output:16
Bit Width of Each Kill 1
Bit Width of Each Mode 2
Available mode 01:Signed/00:Unsigned/1?:Mixed-mode; independent among all sub-words
Table 3.3. Possible sub-word combinations of the proposed SWP MAC design.
Possible Sub-Word Combinations
64-bit A 64-bit SWP MAC is viewed as consisting of two independent 32-bit SWP MACs; it has 5×5=25 possible combinations
3.2.1 Sub-Word Parallel MAC Execution Flow
Fig. 3.5 shows the execution flow of the proposed SWP MAC. It is still partitioned into three main parts: SWPPG, SWPPRT, and SWCPA. To apply SWP, some modification is needed in each part, most of it in the preprocessing of SWPPG. SWPPG is described in Section 3.2.2; SWP accumulation is divided into SWPPRT and SWCPA, which are explained in Sections 3.2.3 and 3.2.4, respectively.
Fig. 3.5. Execution flow of the 32-bit proposed SWP MAC design.
3.2.2 Sub-Word Parallel PPG (SWPPG)
The proposed SWPPG uses the same MBE scheme as the scalar PPG.
The difference lies in the preprocessing of the input operands and the arrangement of the sub-word parallel partial product array (SWPPA). The additional SWP logic operates mostly in parallel with the SPPG, so this enhancement incurs only a slight timing overhead and some area overhead.
Operand preprocessing consists of two parts: masking and multiplexing on the multiplier, and multiplexing on the multiplicand. Fig. 3.6 shows a 32-bit example of masking and multiplexing on the multiplier. The bottom row, SW_0, is the 32-bit multiplier in scalar mode. A zero is assumed to the right of the LSB for the first encoding triplet, while two s0 bits, used for unsigned/mixed-mode operation, are extended to the left of the MSB, where s0 is generated according to Eq. (3.1). These bits are necessary to complete the MBE operation. When SWP modes are under execution, the assumed zero and the extended s bits should be appended to each SW as in scalar mode. For instance, in the top row of Fig. 3.6, zeros are assumed at mlier[-1], mlier[7], mlier[15], and mlier[23], and s bits are extended to the left of each SW’s MSB. This modification results in a 3-bit overlap between SWs, and some bits differ among SWP modes. Therefore, mode-dependent multiplexing (selection) or zero-masking is required at these bit positions.
Fig. 3.6. A 32-bit example of masking and multiplexing on the multiplier.
For example, the fifth encoding triplet in 32-bit or 16-bit mode is mlier[9:7]; in 8-bit mode, mlier[7] should be masked to zero, resulting in the encoding triplet {mlier[9:8],0}. This demonstrates the necessity of zero-masking at SW boundaries. Concerning multiplier multiplexing, it is important to note that the overlapped triplet {s0,s0,mlier[7]} between SW_0 and SW_1 in 8-bit mode, used for unsigned/mixed-mode correction, is not sent to the MBE; instead, the correction PP is generated simply with an 8-bit MUX2. This multiplexing eliminates the ambiguity in selecting which triplet to send to the MBE. The MSB of SW_0, mlier[7] in this case, is the selection signal of the MUX2; that is, when mlier[7] equals one, mcand[7:0] is required as the correction PP for SW_0. The same idea applies to each SW boundary and avoids using additional 8-bit or 16-bit MBEs to generate correction PPs. All that is required is the scalar 32-bit MBE.
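The multiplier preprocessing described above can be modeled behaviorally. The Python sketch below is an illustration under assumptions and not the proposed circuit: the names `swp_triplets` and `swp_products` are hypothetical, only uniform sub-word widths are handled (every SW has the same width), and the correction PP is modeled as an addition of mcand shifted by the SW width, selected by the SW's MSB whenever its multiplier is treated as unsigned.

```python
BOOTH_DIGIT = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def swp_triplets(mlier, width, sw_bits):
    """Triplets seen by the shared scalar MBE, zero-masked at SW boundaries."""
    triplets = []
    for k in range(width // 2):
        at_boundary = (2 * k) % sw_bits == 0
        y_low = 0 if at_boundary else (mlier >> (2 * k - 1)) & 1
        triplets.append(((mlier >> (2 * k + 1)) & 1, (mlier >> (2 * k)) & 1, y_low))
    return triplets


def swp_products(mcand, mlier, width, sw_bits, modes):
    """Per-sub-word products; modes[i] is the 2-bit mode of sub-word i."""
    trip = swp_triplets(mlier, width, sw_bits)
    mask = (1 << sw_bits) - 1
    per_sw = sw_bits // 2
    products = []
    for i, mode in enumerate(modes):
        mix, tc = (mode >> 1) & 1, mode & 1
        a = (mcand >> (i * sw_bits)) & mask
        if (mix | tc) and a >> (sw_bits - 1):
            a -= 1 << sw_bits                 # signed multiplicand sub-word
        total = sum((BOOTH_DIGIT[trip[i * per_sw + j]] * a) << (2 * j)
                    for j in range(per_sw))
        msb = (mlier >> ((i + 1) * sw_bits - 1)) & 1
        s = msb & tc & (mix ^ 1)              # extension bit, as in Eq. (3.1)
        if msb and not s:                     # MUX2 selects the correction PP
            total += a << sw_bits
        products.append(total)
    return products
```

For instance, `swp_triplets(0x80, 32, 16)[4]` is (0, 0, 1), whereas `swp_triplets(0x80, 32, 8)[4]` is (0, 0, 0): the same fifth triplet sees mlier[7] in 16-bit mode and a masked zero in 8-bit mode.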
As for preprocessing on the multiplicand, the proposed SWPPA arranges the PPA of each SW in a way similar to that explained in Section 2.2.4. Some bits overlap and remain the same among different SWP modes, while others, especially bits at SW boundaries, vary and require mode-dependent multiplexing (selection). The difference lies in the sign encoding (SE) bits, plus one bit saved for the sign of the PP, and the hot-one modification bits. Fig. 3.7 shows a detailed view of the 32-bit proposed SWP PPA. Fig. 3.7a shows the SWP PPA in scalar mode, with 17 PPs including the accumulator; SE bits and hot-one modification bits are also shown. Fig. 3.7b displays the SWP PPA in 16-bit mode: the SE bits and the sign bit of PP08 are shared with those of the scalar PP08, while the hot-one modification bits are not; in contrast, PP01 has the same hot-one modification bits but differs in the SE bits and the sign bit. Fig. 3.7c depicts the 8-bit SWP PPA. Clearly, the difference lies mainly at SW boundaries. Fig. 3.7d exemplifies the selection of PP01 among different modes. Although there are three modes, only three bit positions actually require a 3-to-1 multiplexer (MUX3) for selection; some take AND2s or MUX2s, while others do not demand any selection. Note that even in the 64-bit proposed SWP design, the proposed SWP PPA still requires MUX3s at the worst-case positions.
Fig. 3.7. Detailed view of the 32-bit proposed SWPPA with a selection example.
Therefore the preprocessing on multiplicand concerns the generation of SE bits, sign bits of PPs, and hot-one modification bits of each SW and their selection among modes. Table 3.4 lists the truth table of SE bits and sign bits of PPs used in the proposed SWP design. Table 2.5 and Eq. (2.1) have shown the logic of hot-one modification bits. These bits are generated in parallel with scalar MBE without introducing any timing overhead since their logic is not as complicated as an MBE.
The area overhead, as demonstrated in Fig. 3.7d, is not huge since most bits share those in the scalar PPA.
Table 3.4. Truth table of sign encoding bits and sign bits of PPs.
Columns: n, s. Legend: tc: 1: signed / 0: unsigned; Y: multiplier; m: MSB of SW; s: sign of corresponding PP; n: SE bit.
Thanks to this SWP PPA, the proposed architecture offers more flexible SW combination schemes than previous works, provided that both SWPPRT and SWCPA also support them. The SWP combination scheme is controlled by the pre-decoded input kill signals. The pre-decoding is performed in parallel with the scalar MBE and therefore does not incur timing overhead. Fig. 3.8 shows the SWP schemes of the 32-bit proposed SWP design. Each kill signal conditionally enables or disables the carry chain. Three kill signals provide 8 SW combinations; however, if {kill2,kill1,kill0} equals {0,0,1}, {1,0,0}, or {1,0,1}, the middle 16-bit SW obtains a faulty PPA, since the corresponding PPA is never generated in this region. Fig. 3.8a to Fig. 3.8e show the five possible SW combinations; Fig. 3.8f displays an invalid SW combination scheme. For a 64-bit design using the proposed architecture, two 32-bit SW halves operate in parallel, offering a total of 25 (5×5) different SW combinations.
Fig. 3.8. SW combinations of the 32-bit proposed SWP MAC design.
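The count of five valid combinations can be reproduced by enumerating the kill patterns. The Python sketch below is an illustration under assumptions that the text does not state explicitly: kill0, kill1, and kill2 are taken to sit at bit boundaries 8, 16, and 24, a kill value of 1 is taken to break the carry chain, and a pattern is treated as valid when every resulting SW is 8, 16, or 32 bits wide and aligned to its own width. The names `BOUNDARIES`, `sub_words`, and `is_valid` are hypothetical.

```python
from itertools import product

BOUNDARIES = (8, 16, 24)  # assumed bit positions of kill0, kill1, kill2


def sub_words(kill, width=32):
    """Split `width` bits at every asserted kill; kill = (kill0, kill1, kill2).

    Returns a list of (start_bit, size) pairs, LSB sub-word first.
    """
    cuts = [0] + [pos for pos, k in zip(BOUNDARIES, kill) if k] + [width]
    return [(lo, hi - lo) for lo, hi in zip(cuts, cuts[1:])]


def is_valid(kill):
    """Valid when each SW is 8/16/32 bits wide and aligned to its width."""
    return all(size in (8, 16, 32) and lo % size == 0
               for lo, size in sub_words(kill))


valid = [k for k in product((0, 1), repeat=3) if is_valid(k)]
```

Under these assumptions the three rejected patterns are exactly {0,0,1}, {1,0,0}, and {1,0,1} in {kill2,kill1,kill0} order, each of which would require a 24-bit SW or a 16-bit SW straddling the middle boundary.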
The proposed SWP design also features per-SW mode assignment: each SW has its own operating mode. For example, if a 32-bit SWP MAC operates in 8-bit SWP mode as sketched in Fig. 3.8b, the four SWs do not have to perform the same signed/unsigned/mixed-mode MAC operation at the same time. Instead, each SW is assigned its own mode signal, and a total of 81 (3×3×3×3) different SW mode assignment schemes are allowed. Moreover, a central mode signal assigned to all SWs, as used in [10], introduces high fan-out, which consequently requires buffer insertion. Per-SW mode assignment alleviates this high fan-out.
Although this modification adds some input ports and places some restrictions on mode assignment, it provides reconfigurability and flexibility for the proposed design. In comparison with the 64-bit proposed design, [10] offers only four SW combinations, all SWs must operate in the same central mode, and mixed-mode is not supported.
3.2.3 Sub-Word Parallel PPRT (SWPPRT)
To add SWP to the scalar PPRT, the behavior of carries traversing SW boundaries requires careful handling. On the whole, it involves carry-killing (blocking, breaking, disabling, etc.) at SW boundaries on each level of the SWPPRT.
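Carry-killing at one CSA level can be illustrated with a word-level model. The Python sketch below is an assumption-laden illustration, not the proposed SWPPRT: it treats a 32-bit word as four hypothetical 8-bit lanes, models one level of FAs as a bitwise 3:2 compression, and clears every carry that would cross a lane boundary. The names `lane_kill_mask`, `csa_with_kill`, and `lane_sums` are hypothetical.

```python
def lane_kill_mask(width, lane_bits):
    """Bit mask with a 1 at every position where a carry would enter a new lane."""
    mask = 0
    for pos in range(lane_bits, width, lane_bits):
        mask |= 1 << pos
    return mask


def csa_with_kill(a, b, c, kill_mask, width=32):
    """One carry-save level: returns (sum word, carry word) with killed carries."""
    full = (1 << width) - 1
    s = (a ^ b ^ c) & full
    carry = (((a & b) | (b & c) | (a & c)) << 1) & full
    return s, carry & ~kill_mask   # carries crossing a lane boundary are dropped


def lane_sums(s, carry, width=32, lane_bits=8):
    """Final per-lane addition of the two remaining rows, modulo 2**lane_bits."""
    m = (1 << lane_bits) - 1
    return [(((s >> sh) & m) + ((carry >> sh) & m)) & m
            for sh in range(0, width, lane_bits)]
```

With an all-zero kill mask the same function behaves as an ordinary scalar CSA level, which is why the scalar tree can be reused and only the boundary carries need gating.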
Both the proposed SWPPRT and the VPPRT in [10] exploit Wallace CSA Tree, using an FA as the basic building block. It implies both designs judiciously manage the carry-out or carry-in of FAs to conditionally break the carry-chain. For example,