
Chapter 3 Proposed MAC Designs

3.2 Sub-Word Parallel MAC (SWP MAC) Design

3.2.2 Sub-Word Parallel PPG (SWPPG)

The proposed SWPPG uses the same MBE scheme as the scalar PPG. The difference lies in the preprocessing of the input operands and the arrangement of the sub-word parallel partial product array (SWPPA). The additional SWP logic operates mostly in parallel with the scalar PPG (SPPG), so this enhancement incurs only a slight timing overhead and some area overhead.

Operand preprocessing consists of two parts: masking and multiplexing on the multiplier, and multiplexing on the multiplicand. Fig. 3.6 shows a 32-bit example of masking and multiplexing on the multiplier. The bottom row, SW_0, is the 32-bit multiplier in scalar mode. A zero is assumed to the right of the LSB for use by the first encoding triplet, and two s0 bits, used for unsigned/mixed-mode operation, are extended to the left of the MSB, where s0 is generated according to Eq. (3.1). These bits are necessary to complete the MBE operation. In SWP modes, the assumed zero and the extended s bits must be appended to each SW, as in scalar mode. For instance, in the top row of Fig. 3.6, zeros are assumed at mlier[-1], mlier[7], mlier[15], and mlier[23], and s bits are extended to the left of each SW's MSB. This modification results in a 3-bit overlap between adjacent SWs, and some bits differ among SWP modes. Therefore, mode-dependent multiplexing (selection) or zero-masking is required at these bit positions.

Fig. 3.6. A 32-bit example of masking and multiplexing on the multiplier.

For example, the fifth encoding triplet in 32-bit or 16-bit mode is mlier[9:7]; in 8-bit mode, mlier[7] must be masked to zero, resulting in the encoding triplet {mlier[9:8],0}. This demonstrates the necessity of zero-masking at SW boundaries. Concerning multiplier multiplexing, note that the overlapped triplet {s0,s0,mlier[7]} between SW_0 and SW_1 in 8-bit mode, which is used for unsigned/mixed-mode correction, is not sent to the MBE; instead, the correction PP is generated simply with an 8-bit MUX2. This multiplexing eliminates the ambiguity in selecting which triplet to send to the MBE. The MSB of SW_0, mlier[7] in this case, is the selection signal of the MUX2; i.e., when mlier[7] equals one, mcand[7:0] is required as the correction PP for SW_0. The same idea can be applied at each SW boundary, which avoids using additional 8-bit or 16-bit MBEs to generate correction PPs. All that is required is the scalar 32-bit MBEs.
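The zero-masking and the MUX2-based correction described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the thesis's circuit: the function names are hypothetical, the triplets are modelled as (hi, mid, lo) tuples, the digit mapping is the standard radix-4 Booth recoding, and the correction is shown for an unsigned multiplier only, assuming s0 = 0 in that case (Eq. (3.1) is not reproduced here).

```python
# Behavioural sketch only; all names below are hypothetical.

def bit(value, pos):
    """Return bit `pos` of `value`; position -1 is the assumed zero right of the LSB."""
    return 0 if pos < 0 else (value >> pos) & 1


def mbe_triplets(mlier, sw_width, total_width=32):
    """Radix-4 encoding triplets (hi, mid, lo), low bit zero-masked at SW boundaries."""
    triplets = []
    for i in range(total_width // 2):
        # The first triplet of every SW sees the assumed zero instead of the
        # MSB of the SW below it (mlier[7], mlier[15], mlier[23] in 8-bit mode).
        at_sw_start = (2 * i) % sw_width == 0
        lo = 0 if at_sw_start else bit(mlier, 2 * i - 1)
        triplets.append((bit(mlier, 2 * i + 1), bit(mlier, 2 * i), lo))
    return triplets


def booth_digit(triplet):
    """Standard radix-4 Booth digit in {-2, -1, 0, 1, 2}."""
    hi, mid, lo = triplet
    return -2 * hi + mid + lo


def unsigned_sw_product(mcand, mlier, sw_width=8):
    """Unsigned sub-word product: Booth PPs plus a MUX2-style correction PP."""
    digits = [booth_digit(t) for t in mbe_triplets(mlier, sw_width, sw_width)]
    total = sum(d * mcand * 4 ** i for i, d in enumerate(digits))
    # MUX2: the SW's MSB selects mcand or zero (assumes unsigned, i.e. s0 = 0).
    correction_pp = mcand if bit(mlier, sw_width - 1) else 0
    return total + (correction_pp << sw_width)
```

With mlier = 0x380 (bits 7, 8, and 9 set), the fifth triplet is (1, 1, 1) in 32-bit or 16-bit mode and (1, 1, 0) in 8-bit mode, matching the {mlier[9:8],0} case in the text. The correction term works because the Booth digits of an 8-bit SW sum to its signed value, so an unsigned multiplier with MSB 1 needs mcand added back at weight 2^8.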

As for preprocessing on the multiplicand, the proposed SWPPA arranges the PPA of each SW similarly to what was explained in Section 2.2.4. Some bits overlap and remain the same among the different SWP modes, while others, especially bits at SW boundaries, vary and require mode-dependent multiplexing (selection). The difference is in the sign encoding (SE) bits, plus the one bit reserved for the sign of the PP, and the hot-one modification bits. Fig. 3.7 shows a detailed view of the proposed 32-bit SWPPA. Fig. 3.7a shows the SWPPA in scalar mode, in which there are 17 PPs including the accumulator; SE bits and hot-one modification bits are also shown. Fig. 3.7b displays the SWPPA in 16-bit mode: PP08 shares its SE bits and sign bit with scalar PP08 but not its hot-one modification bits; in contrast, PP01 has the same hot-one modification bits but differs in its SE bits and sign bit. Fig. 3.7c depicts the 8-bit SWPPA, where the differences lie mainly at SW boundaries. Fig. 3.7d exemplifies the selection of PP01 among the different modes. Although there are three modes, only three bit positions actually require a 3-to-1 multiplexer (MUX3) for selection; some take AND2s or MUX2s, while others need no selection at all. Note that even in the proposed 64-bit SWP design, the SWPPA still requires MUX3s for the worst-case positions.

Fig. 3.7. Detailed view of the 32-bit proposed SWPPA with a selection example.

Therefore, the preprocessing on the multiplicand concerns the generation of the SE bits, the sign bits of the PPs, and the hot-one modification bits of each SW, together with their selection among modes. Table 3.4 lists the truth table of the SE bits and the sign bits of the PPs used in the proposed SWP design. Table 2.5 and Eq. (2.1) have shown the logic of the hot-one modification bits. These bits are generated in parallel with the scalar MBE and introduce no timing overhead, since their logic is simpler than that of an MBE. The area overhead, as demonstrated in Fig. 3.7d, is modest, since most bits are shared with the scalar PPA.

Table 3.4. Truth table of sign encoding bits and sign bits of PPs.

tc: 1 = signed, 0 = unsigned; Y: multiplier; m: MSB of SW; s: sign of corresponding PP; n: SE bit

Thanks to this SWPPA, the proposed architecture offers more flexible SW combination schemes than previous works, provided that the SWPPRT and the SWCPA also support them. The SW combination scheme is controlled by pre-decoded input kill signals. The pre-decoding is performed in parallel with the scalar MBE and therefore does not incur timing overhead. Fig. 3.8 shows the SWP schemes of the proposed 32-bit SWP design. Each kill signal conditionally enables or disables the carry chain. Three kill signals provide 8 SW combinations; however, if {kill2,kill1,kill0} equals {0,0,1}, {1,0,0}, or {1,0,1}, the middle 16-bit SW obtains a faulty PPA, since the corresponding PPA is never generated in this region. Fig. 3.8a to Fig. 3.8e show the five possible SW combinations; Fig. 3.8f displays an invalid SW combination scheme. For a 64-bit design using the proposed architecture, the two 32-bit halves process in parallel, offering a total of 25 (5×5) different SW combinations.
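The count of five valid combinations can be checked by enumerating the kill signals. The sketch below is illustrative only: the mapping of kill0, kill1, and kill2 to the carry-chain cuts at bits 8, 16, and 24 is an assumption (the text does not state it), and the function names are hypothetical. The validity rule simply encodes the three invalid patterns listed above.

```python
from itertools import product

# Assumption for illustration: kill0, kill1, kill2 cut the carry chain
# at bit positions 8, 16, and 24 respectively.
BOUNDARIES = {0: 8, 1: 16, 2: 24}


def sw_widths(kill2, kill1, kill0, total_width=32):
    """Return the SW widths, from LSB side to MSB side, implied by the kill signals."""
    kills = {0: kill0, 1: kill1, 2: kill2}
    cuts = [BOUNDARIES[i] for i in sorted(kills) if kills[i]]
    edges = [0] + cuts + [total_width]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]


def is_valid(kill2, kill1, kill0):
    """Reject {0,0,1}, {1,0,0}, {1,0,1}: an 8-bit cut without the 16-bit cut."""
    return bool(kill1) or not (kill0 or kill2)


# Tuples are ordered (kill2, kill1, kill0), as in the text.
valid = [k for k in product((0, 1), repeat=3) if is_valid(*k)]
invalid = [k for k in product((0, 1), repeat=3) if not is_valid(*k)]

# Two independent 32-bit halves in the 64-bit design.
combos_64bit = len(valid) ** 2
```

Under these assumptions, `valid` holds five patterns (one 32-bit SW, two 16-bit SWs, four 8-bit SWs, and the two mixed 16-bit plus 8-bit arrangements), and `combos_64bit` evaluates to 25.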

Fig. 3.8. SW combinations of the 32-bit proposed SWP MAC design.

The proposed SWP design is also characterized by SWP mode assignment: each SW has its own operating mode. For example, if a 32-bit SWP MAC operates in 8-bit SWP mode as sketched in Fig. 3.8b, the four SWs do not have to perform the same signed/unsigned/mixed-mode MAC operation at the same time. Instead, each SW is assigned its own mode signal, and a total of 81 (3×3×3×3) different SW mode assignment schemes are allowed. Moreover, a central mode signal assigned to all SWs, as used in [10], introduces high fan-out and consequently requires buffer insertion. Per-SW mode assignment alleviates this high fan-out.

Although this modification adds some input ports and places some restrictions on mode assignment, it provides reconfigurability and flexibility for the proposed design. In comparison with the proposed 64-bit design, [10] offers only four SW combinations, requires all SWs to operate in the same central mode, and does not support mixed mode.
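As a quick check of the 81 mode assignment schemes quoted for four 8-bit SWs, the per-SW modes can be enumerated directly. This is a counting sketch only; the mode labels and the function name are hypothetical and do not reflect the actual encoding of the mode signals.

```python
from itertools import product

# Hypothetical labels for the three per-SW operating modes.
MODES = ("signed", "unsigned", "mixed")


def mode_assignments(num_sw):
    """Enumerate every way to give each of `num_sw` sub-words its own mode."""
    return list(product(MODES, repeat=num_sw))


four_sw_schemes = mode_assignments(4)  # four 8-bit SWs in a 32-bit word
```

Here `len(four_sw_schemes)` equals 3×3×3×3 = 81, whereas a single central mode signal would allow only three of these schemes (all SWs in the same mode).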
