Chapter 1 Introduction
1.3 Dissertation Organization
The rest of this dissertation is organized as follows. In Chapter 2, an ILP-based bitwidth-aware subexpression sharing method for area minimization in filter design is presented. In Chapter 3, we demonstrate an expandable MDC-based FFT architecture and its generator, targeting high-performance applications. Then, a probability-based static scaling optimization flow for fixed-wordlength FFT processors is developed in Chapter 4. Finally, concluding remarks and future work are given in Chapter 5.
Chapter 2
Bitwidth-Aware Subexpression Sharing in FIR Filter
2.1 FIR Filter Implementation
In this section, we first introduce the fundamentals of FIR filter design. Then, three different number representations and the related algorithms are briefly reviewed.
2.1.1 Fundamentals of FIR filter design
A general form of a linear time-invariant N-tap FIR filter can be expressed as a convolution involving the last N input data and N constant filter coefficients. The output y(n) can be calculated as:
\[ y(n) = \sum_{k=0}^{N-1} c_k \, x(n-k) \tag{1} \]
where
1. x(n) is the input sequence,
2. y(n) is the corresponding output sequence,
3. c_k are constant filter coefficients, and
4. N is the filter length.
Figure 3 illustrates a general fully-parallel transposed architecture of an FIR filter, which requires N–1 additions and N multiplications to produce a single output value. It is also observed that one of the two inputs of every multiplication is always the present input sample. Therefore, an MCM block can be used to produce a set of products between an input value and a given set of constant coefficients. Since all coefficients are fixed-point constants, those constant multiplications can be achieved solely through a series of shifts and additions, and thus the use of costly regular multipliers can be completely eliminated, as previously mentioned.
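As a minimal sketch of (1) and of the transposed structure described above, the following Python functions compute the filter output in both ways. The function names and the zero initial state (x(n) = 0 for n < 0) are assumptions made for illustration only.

```python
def fir_direct(x, c):
    """Evaluate y(n) = sum_{k=0}^{N-1} c[k] * x(n-k), assuming x(n) = 0 for n < 0."""
    N = len(c)
    return [sum(c[k] * x[n - k] for k in range(N) if n - k >= 0)
            for n in range(len(x))]


def fir_transposed(x, c):
    """Transposed form: every input sample is multiplied by all coefficients
    at once (the MCM block), and the products flow through unit delays."""
    N = len(c)
    regs = [0] * (N - 1)                      # contents of the unit delays
    y = []
    for sample in x:
        products = [ck * sample for ck in c]  # MCM: one input, N constants
        y.append(products[0] + (regs[0] if N > 1 else 0))
        # Each delay stores its product plus the content of the next delay.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < N - 1 else 0)
                for i in range(N - 1)]
    return y


x = [1, 2, 3, 4]
c = [19, 21, 31]
print(fir_direct(x, c))
print(fir_transposed(x, c))
```

Both functions return the same sequence; the transposed version makes explicit that all N products of one time step share the same multiplicand, which is what the MCM block exploits.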
2.1.2 Number representations
Three different number formats can be used to represent fixed-point constants: the pure binary form, the canonical signed digit (CSD) form, and the minimal signed digit (MSD) form. The pure binary form is the trivial unsigned binary representation, where every digit is either 0 or 1. In the CSD representation, three symbols, 0, 1, and $\bar{1}$, are used, where $\bar{1}$ denotes –1. The CSD representation has the following two properties: 1) the count of nonzero digits (i.e., 1 and $\bar{1}$) is minimal; and 2) no two adjacent digits are both nonzero. In addition, the CSD representation of a number is unique, which is where the term "canonical" comes from. Similar to the CSD form, the MSD form adopts the same three symbols and requires that the count of nonzero digits be minimal. Unlike the CSD form, the MSD form allows adjacent nonzero digits, so it is no longer a canonical representation. In other words, a number may have multiple valid MSD representations. Note that the CSD representation is also a valid MSD representation. Table 1 gives an example where the number 23 is presented in those three representations with a length of six digits. The number 23 has a unique representation in pure binary form and in CSD form, but has two feasible representations in MSD form. Moreover, for any number, the count of nonzero digits in the pure binary representation is guaranteed to be no less than that in the CSD and MSD representations.

Table 1 Three different representations for the number 23
Representation   Digits
pure binary      010111
CSD              10$\bar{1}$00$\bar{1}$
MSD              01100$\bar{1}$ or 10$\bar{1}$00$\bar{1}$

Figure 3 A general architecture of FIR filter. (The input x(n) feeds a Multiple Constant Multiplication (MCM) block with the coefficients c_{N-1}, c_{N-2}, …, c_1, c_0; the products pass through adders and unit delays to produce y(n).)
In hardware implementation, the count of nonzero digits of a value c basically determines the number of additions required to realize the multiplication by c. Figure 4 demonstrates three different ways of implementing a constant multiplication by 23. Figure 4(a) shows a direct implementation based on the pure binary representation (i.e., 10111). Since there are four nonzero digits, three adders are required to complete the multiplication (i.e., 23x = 16x + 4x + 2x + x). However, there are only three nonzero digits if the CSD format is considered (i.e., 10$\bar{1}$00$\bar{1}$). As shown in Figure 4(b), merely two adders are enough to accomplish the same multiplication (i.e., 23x = 32x – 8x – x; a subtraction a – b is regarded as an addition a + (–b) in this dissertation). Since the number 23 has two valid MSD representations, Figure 4(c) illustrates the other one (i.e., 01100$\bar{1}$), which also needs only two adders (i.e., 23x = 16x + 8x – x). Consequently, CSD and MSD representations are usually adopted in constant multiplications because both of them guarantee that the count of nonzero digits for any given constant is minimal.

Figure 4 Three different implementations for the number 23: (a), (b), and (c).
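The adder counts above can be checked with a short sketch. The conversion below uses the standard non-adjacent-form recurrence to obtain the CSD digits; the function names `to_csd` and `adders_needed` are illustrative and do not come from this dissertation.

```python
def to_csd(c):
    """Return the CSD digits (-1, 0, +1) of a positive integer, MSD first."""
    digits = []
    while c != 0:
        if c % 2 == 0:
            digit = 0
        else:
            # +1 when c = 1 (mod 4), -1 when c = 3 (mod 4); this choice
            # forces the next digit to be zero (no adjacent nonzero digits).
            digit = 2 - (c % 4)
        digits.append(digit)
        c = (c - digit) // 2
    return digits[::-1]


def adders_needed(digits):
    """Number of adders = number of nonzero digits minus one."""
    return sum(1 for d in digits if d != 0) - 1


binary_23 = [int(b) for b in bin(23)[2:]]   # 1 0 1 1 1
csd_23 = to_csd(23)                         # 1 0 -1 0 0 -1, i.e. 32 - 8 - 1
print(binary_23, adders_needed(binary_23))
print(csd_23, adders_needed(csd_23))
```

For 23 this gives three adders for the pure binary form and two for the CSD form, in line with Figure 4(a) and Figure 4(b).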
2.1.3 Digit-based algorithms
A number of algorithms have been proposed to decompose a set of constants based on a specified number representation [20-33]. However, most of these previous methods focus only on minimizing the total adder count. Since the area and delay of an adder are highly dependent on its bitwidth, as mentioned, it is unwise to neglect this impact during optimization. For example, two different ways can be used to implement the constant multiplication 11x: (1x + 2x) + 8x or (1x + 8x) + 2x. Both require two adders to complete the multiplication. However, the result of (1x + 8x) is wider than that of (1x + 2x). Since a wider result potentially requires wider adders for the succeeding additions, taking the resultant bitwidth of each addition into account leads to better optimization outcomes.
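The width difference between the two orderings of 11x can be made concrete with a small sketch. It assumes an 8-bit input and the sizing rule "input bitwidth plus ⌈log2 c⌉" for the product cx; both the rule as coded here and the name `product_bits` are assumptions for illustration.

```python
def product_bits(c, input_bits):
    """Bits assumed for c * x: input_bits + ceil(log2(c)), for an integer c >= 1."""
    return input_bits + (c - 1).bit_length()   # (c - 1).bit_length() == ceil(log2(c))


INPUT_BITS = 8  # assumed input width

# Ordering 1: (1x + 2x) + 8x has the intermediate result 3x.
# Ordering 2: (1x + 8x) + 2x has the intermediate result 9x.
print("3x :", product_bits(3, INPUT_BITS), "bits")
print("9x :", product_bits(9, INPUT_BITS), "bits")
print("11x:", product_bits(11, INPUT_BITS), "bits")
```

Under these assumptions the intermediate result 3x is two bits narrower than 9x, although both orderings use two adders and end with the same 11x.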
2.1.4 Motivation of our work
Figure 5(a) illustrates a sample multiplier-less MCM design, where the input x is 8 bits wide. Instead of using five costly multipliers, the five output values (the input x times 19, 21, 31, 121, and 125) can be produced by only seven adders. Note that every adder must be wide enough to ensure the absence of overflow. Typically, an adder with n–1 bits is used to generate an n-bit sum. For example, Adder_1 shown in Figure 5(a) is used to produce an output of 5x = x + (x << 2), where the output is 11 bits wide and the trivial implementation of Adder_1 should be 10 bits wide, as depicted in Figure 6(a). However, it is observed that the rightmost two adder bits in Figure 6(a) are actually redundant due to the 2-bit left shift (i.e., x << 2). As a consequence, Adder_1 can be safely downsized to an 8-bit adder, as shown in Figure 6(b). This example clearly indicates that a more compact adder implementation can be achieved if the relation between the two input operands is carefully investigated.
Let us reexamine the implementation shown in Figure 5(a). The number labeled within a circle specifies the minimal bitwidth of the corresponding adder. Hence, a total of 67 bits is required for those 7 adders. Meanwhile, Figure 5(b) illustrates another implementation of the same MCM, which requires 8 adders but only 64 adder bits. That is, the implementation depicted in Figure 5(b) consumes more adders but fewer adder bits than that shown in Figure 5(a). As previously explained, we consider the solution given in Figure 5(b) to be better. However, previous approaches that minimize the adder count would prefer the solution shown in Figure 5(a). This motivates us to develop a new area minimization algorithm for MCM designs that concentrates on minimizing the total adder bitwidth.
The details are elaborated in the following two sections.
Figure 5 A motivational example of multiplier-less MCM: (a) an MCM design with 7 adders and 67 bits; (b) the same MCM design with 8 adders and 64 bits.
Figure 6 Two alternative implementations of Adder_1, x + (x << 2): (a) a 10-bit adder; (b) an 8-bit adder.
2.2 Proposed Algorithm
In this section, we present a bitwidth-aware ILP-based area minimization algorithm for MCM designs. It uses a graph-based approach that keeps track of common subexpressions among constants as well as calculates the exact required bitwidth of each adder simultaneously. The details are described in the following sections.
2.2.1 Number decomposition and bitwidth calculation
The essence of the MCM optimization problem is to maximize common subexpression sharing among the given constants. Hence, it is generally true that the optimization outcome can be better if more ways of decomposing a constant are considered, i.e., a larger solution space. However, the number of possible ways of decomposing a number is actually infinite. For example, the number 3 can be obtained as 1+2, 4–1, 5–2, 6–3, 7–4, and so on. Fortunately, not all of them are appropriate for constructing an area-efficient solution. As explained in Section 2.1.2, the number of adders (A_z) needed to accomplish a constant multiplication by z is equal to the number of nonzero digits of z (B_z) minus one, i.e., A_z = B_z – 1. Hence, there is no reason to decompose z as x + y if B_z < B_x + B_y during optimization. For instance, it is not wise to decompose the number 3 as 5–2. In our method, a set of target constants D = {d_i} is first converted into C = { c_i | d_i = c_i · 2^l, c_i is an odd number }. That is, all constants c_i are assumed to be odd numbers initially. Next, for every constant c with k (where k > 1) nonzero digits in its MSD representation, the proposed algorithm considers only a finite set of number decompositions complying with the following format:
\[ c = d + e = d + f \cdot 2^{l}, \quad l > 0, \tag{2} \]
where d is odd and e = f · 2^l is even.
For example, the set of decompositions for the number 153 ($1010\bar{1}001$ in MSD) is enumerated in Table 2. A number decomposition c = d + e is created by partitioning the nonzero digits in c's MSD form into two nonempty disjoint groups. The group that contains the least significant digit (LSD) defines the value of d (odd), while the other group specifies the value of e (even). As shown in the last column of Table 2, the nonzero digits are partitioned into gray and non-gray groups. Therefore, for a number c that has m valid MSD representations, each with k nonzero digits, the number of decompositions is at most
\[ m \left( 2^{k-1} - 1 \right). \tag{3} \]
The values |d| and |f| that appear in these decompositions are the subexpressions of c. For example, the number 153 has eleven different subexpressions: {1, 3, 5, 7, 15, 19, 25, 33, 121, 129, 161}. Hence, according to (2), the decompositions and the subexpressions of any odd number larger than 1 can be identified using the approach described above.
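The partitioning step can be sketched as follows. The code takes a signed-digit string as a list (most significant digit first), keeps the least significant digit in the group that forms d, and assigns every proper subset of the remaining nonzero digits to d. The function names and the list encoding of the digits are assumptions made for this example.

```python
from itertools import combinations


def decompositions(digits):
    """Enumerate (d, e) with c = d + e from signed digits (-1, 0, +1), MSD first.
    The number is assumed odd, so the last digit is nonzero."""
    width = len(digits)
    terms = [d * (1 << (width - 1 - i)) for i, d in enumerate(digits) if d != 0]
    lsd_term, others = terms[-1], terms[:-1]
    c = sum(terms)
    result = []
    for size in range(len(others)):           # the other group must stay nonempty
        for group in combinations(others, size):
            d = lsd_term + sum(group)
            result.append((d, c - d))
    return result


def odd_part(e):
    """Split a nonzero even number e into (f, l) with e = f * 2**l and f odd."""
    l = (e & -e).bit_length() - 1
    return e >> l, l


msd_153 = [1, 0, 1, 0, -1, 0, 0, 1]           # 128 + 32 - 8 + 1
decs = decompositions(msd_153)
subexpressions = sorted({abs(d) for d, _ in decs} |
                        {abs(odd_part(e)[0]) for _, e in decs})
for d, e in decs:
    f, l = odd_part(e)
    print(f"153 = {d} + {e} = {d} + ({f}) * 2^{l}")
print(subexpressions)
```

For the digit string of 153 this yields the seven decompositions listed in Table 2 and the eleven subexpressions given above.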
An adder is required to carry out a decomposition of a constant multiplication cx = dx + fx · 2^l = p + (q << l). Assume p is m bits long and q is n bits long; the adder bitwidth can then be determined based on the following two scenarios:
1) m ≥ n + l: the adder bitwidth is m – l,
2) m < n + l: the adder bitwidth is n.
The bitwidth of the addition result is determined by the larger of m and n + l, namely, max(m, n + l). However, since none of the l least significant bits generates a carry during addition in either scenario, the required adder bitwidth can be safely reduced to max(m, n + l) – l, as illustrated in Figure 7. In general, an adder with a smaller bitwidth occupies less area and has a shorter delay, as mentioned earlier.
Revisiting the previous decomposition example of 153, we further assume the input x is 8 bits wide and is used to produce the output 153x. By Table 2, a possible decomposition is 153 = 121 + 32 = 121 + 1 · 2^5, which indicates that the output 153x can be obtained from the summation of 121x and 32x. Obviously, 32x is simply x shifted left by 5 bits. Also note that the left shift requires no additional hardware in the real implementation. Meanwhile, 121x needs to be further decomposed based on (2) in the same manner, and its bitwidth is 15 bits (i.e., $8 + \lceil \log_2 121 \rceil$). The adder type required to sum up 121x and 32x is the one illustrated in Figure 7(a), where m = 15, n = 8, and l = 5. Therefore, the required bitwidth of the adder is 15 – 5 = 10 bits.

Table 2 All feasible decompositions of the number 153
index   d + e         d + f·2^l         153 in MSD
1       1 + 152       1 + 19·2^3        1010$\bar{1}$001
2       161 + (–8)    161 + (–1)·2^3    1010$\bar{1}$001
3       121 + 32      121 + 1·2^5       1010$\bar{1}$001
4       25 + 128      25 + 1·2^7        1010$\bar{1}$001
5       (–7) + 160    (–7) + 5·2^5      1010$\bar{1}$001
6       33 + 120      33 + 15·2^3       1010$\bar{1}$001
7       129 + 24      129 + 3·2^3       1010$\bar{1}$001
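The two scenarios for the adder bitwidth collapse into the single expression max(m, n + l) – l, which can be sketched in a few lines. The function name `adder_bits` is illustrative.

```python
def adder_bits(m, n, l):
    """Adder bitwidth for p + (q << l), where p has m bits and q has n bits.
    Scenario 1 (m >= n + l) gives m - l; scenario 2 (m < n + l) gives n."""
    return max(m, n + l) - l


# 153x = 121x + (x << 5) with an 8-bit input: m = 15, n = 8, l = 5.
print(adder_bits(15, 8, 5))
# Adder_1 of Section 2.1.4, 5x = x + (x << 2): m = 8, n = 8, l = 2.
print(adder_bits(8, 8, 2))
```

The first call reproduces the 10-bit adder of the 153 example, and the second reproduces the 8-bit version of Adder_1.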
2.2.2 Bitwidth-aware multiplier-less MCM design flow
Figure 7(a) m ≥ n + l

Figure 8 gives the overall flow of the proposed bitwidth-aware ILP-based area minimization algorithm for MCM designs. At first, all constants (odd numbers) are collected into the constant-for-decomposition set C; that is, every constant in C needs to be further decomposed. In addition, the resultant subexpression set S is created to keep track of all constants that may be utilized during the MCM block construction. Then, an arbitrary constant c is removed from C for decomposition. Based on the specified number representation, i.e., pure binary, CSD, or MSD, all decompositions of c are enumerated, and the associated hardware cost in terms of total adder bit count is calculated using the method presented in Section 2.2.1. Next, c is added into S after all decompositions of c are identified. Every subexpression of c that is not yet present in S is added into C for further decomposition. This process does not terminate until C is empty. As a result, the set S contains all possible constant numbers that may appear in the final MCM block.
While performing constant number decomposition, the proposed approach concurrently builds a subexpression graph to keep track of all feasible decompositions for a given constant c. The graph also records the cost of every decomposition (i.e., adder bit count). Finally, based on this subexpression graph, a set of corresponding ILP constraints can be derived, and then an ILP solver is utilized to produce an MCM design solution with the minimal total adder bits. The details of the ILP formulation are given in Section 2.4. Conventional look-up table based approaches require a pre-computed table to store all decompositions of every odd number within a fixed bitwidth (e.g., 13-bit). Therefore, the table can become very large as the bitwidth increases. In contrast, the proposed algorithm enumerates decompositions only for a limited number of subexpressions.

C ← { all constant numbers }
S ← { 1 }
while ( C is not empty )
    Remove an arbitrary constant c from C
    Add c into S
    foreach ( decomposition d of c )
        Identify two subexpressions s & t of d
        Calculate the required adder bit count
        Record d in the subexpression graph G
        Add s & t into C if they are not in S yet
Derive ILP constraints from G
Find a solution with minimal adder bit count by ILP
Figure 8 The proposed algorithm flow.
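The worklist part of the flow in Figure 8 can be sketched as below. This is only an illustration under stated assumptions: the CSD form is used for every number, the constants are odd and positive, the product cx is assumed to need "input bitwidth plus ⌈log2 |c|⌉" bits, and all function names and the dictionary layout of the graph are hypothetical.

```python
from itertools import combinations


def csd_terms(c):
    """Signed power-of-two terms of the CSD form of a positive integer, LSD first."""
    terms, shift = [], 0
    while c != 0:
        if c % 2:
            digit = 2 - (c % 4)               # +1 or -1
            terms.append(digit * (1 << shift))
            c -= digit
        c //= 2
        shift += 1
    return terms


def decompositions(c):
    """Yield (d, e) with c = d + e; d always takes the least significant term."""
    terms = csd_terms(c)
    lsd_term, others = terms[0], terms[1:]
    for size in range(len(others)):           # the other group stays nonempty
        for group in combinations(others, size):
            d = lsd_term + sum(group)
            yield d, c - d


def width(value, input_bits):
    """Assumed sizing rule: input_bits + ceil(log2(|value|))."""
    return input_bits + (abs(value) - 1).bit_length()


def build_graph(constants, input_bits):
    """Map each number to its decompositions (d, f, l, adder_bits)."""
    pending = list(constants)                 # set C of Figure 8
    known = {1}                               # set S of Figure 8
    graph = {}
    while pending:
        c = pending.pop()
        if c in known:
            continue
        known.add(c)
        graph[c] = []
        for d, e in decompositions(c):
            l = (e & -e).bit_length() - 1     # e = f * 2**l with f odd
            f = e >> l
            m, n = width(d, input_bits), width(f, input_bits)
            graph[c].append((d, f, l, max(m, n + l) - l))
            pending.extend(s for s in (abs(d), abs(f)) if s not in known)
    return graph


graph = build_graph([21, 153], input_bits=8)
print(sorted(graph))
print(graph[21])
```

Only the numbers reachable from the given constants are ever decomposed, which is the contrast with a pre-computed table drawn in the text.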
2.3 Example of Subexpression Graph Construction
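In this subsection, the CSD representation is used for simplicity; the proposed algorithm is equally applicable to the MSD representation. We use an example to demonstrate how a subexpression graph is dynamically constructed. In the following example, the 8-bit input is multiplied by two constant numbers, 21 and 153. First, these two constants are transformed into CSD form:
\[ 21 = (10101)_{\mathrm{CSD}}, \qquad 153 = (1010\bar{1}001)_{\mathrm{CSD}}. \]
According to its CSD form, each constant can be further decomposed into a set of subexpressions based on the method described in Section 2.2.1:
\[ 21 = 1 + 5 \cdot 2^{2} = 17 + 1 \cdot 2^{2} = 5 + 1 \cdot 2^{4}, \]
\[ 153 = 1 + 19 \cdot 2^{3} = 161 + (-1) \cdot 2^{3} = 121 + 1 \cdot 2^{5} = 25 + 1 \cdot 2^{7} = (-7) + 5 \cdot 2^{5} = 33 + 15 \cdot 2^{3} = 129 + 3 \cdot 2^{3}. \]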
There are three and seven decompositions for 21 and 153, respectively, which agrees with (3). After decomposition, we also find that the number 21 has three subexpressions, {1, 5, 17}, and the number 153 has eleven subexpressions, {1, 3, 5, 7, 15, 19, 25, 33, 121, 129, 161}. Every subexpression (except 1) needs to be further decomposed to find all its feasible decompositions and the associated adder bit counts, following the algorithm flow presented in Figure 8.
There are two types of vertices within a subexpression graph G: 1) a number vertex is associated with a constant value, and 2) an adder vertex specifies a decomposition and associates the decomposed number with its two subexpressions.
When a constant number c is removed from C, a corresponding number vertex is added into G if it is not present in G. For every feasible decomposition of c, a corresponding adder vertex is added into G. Similarly, a number vertex associated with each of two subexpressions, s and t, is also inserted into G if it is not present in G.
The adder vertex not only relates c to both s and t but also keeps track of the required adder bit count for this decomposition. Also note that every unique number c has exactly one corresponding vertex in G.
Figure 9 illustrates the partial subexpression graph for 21 and 153. The rightmost adder vertex represents a decomposition of 21 (i.e., it is a fanin of the number vertex associated with 21). It also shows that 1 and 17 are the two subexpressions (i.e., its two fanin number vertices) of this decomposition. In addition, it also specifies that the … decompositions for 21 and 153 respectively as shown in Figure 9. A subexpression larger than one should be further decomposed. Except for 17, Figure 9 does not show the succeeding decompositions due to space limitations. Note that Figure 9 also shows that 5 is a common subexpression of both 21 and 153, which implies that the hardware cost may be reduced if 5 is shared by both of them in the final implementation. All opportunities for common subexpression sharing can be exhaustively identified in the proposed subexpression graph.

Figure 9 Partial subexpression graph for the numbers 21 and 153.
2.4 ILP Formulation
The problem of bitwidth-aware area minimization for MCM design is solved through integer linear programming (ILP). Three types of constraints are derived: 1) an existence constraint guarantees that at least one decomposition is selected for a specified number, 2) a dependency constraint ensures that the two corresponding subexpressions are also implemented if a specified decomposition is selected, and 3) an output constraint guarantees that all the output constants of the given MCM are implemented. The ILP-related notations used in this section are given below.
s_n: a 0-1 variable indicating whether the subexpression of value n is implemented.
d_n: the count of feasible decompositions of the number n.
a_{n,i}: a 0-1 variable indicating whether the i-th decomposition of the number n is selected for implementation.
b_{n,i}: the required adder bit count for implementing the i-th decomposition of the number n.
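2.4.1 Existence constraints
For every implemented number n larger than 1, at least one of its decompositions must be selected for implementation, which can be formulated as the following constraint:
\[ \sum_{i=1}^{d_n} a_{n,i} \ge s_n. \tag{4} \]
For example, the number 21 has three decompositions, as shown in Figure 9. According to (4), at least one of those three decompositions must be selected.
2.4.2 Dependency constraints
Every decomposition is carried out by an adder. That is, two subexpressions serve as the inputs to the adder. Hence, there is no way to get the addition outcome if those two inputs are not available. That is, if the i-th decomposition of the number n is selected, both of its subexpressions must be implemented as well. Letting p and q denote those two subexpressions, the constraint can be given as:
\[ a_{n,i} \le s_p, \quad a_{n,i} \le s_q, \quad \forall i \in \{1, \ldots, d_n\}. \tag{5} \]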
2.4.3 Output constraint
Assume that the set C includes all constant numbers of the target MCM block; the following output constraint is applied to ensure that every element n ∈ C is properly implemented:
\[ s_n = 1, \quad \forall n \in C. \tag{6} \]
2.4.4 Optimization goal
As aforementioned, the hardware cost of the target MCM block can be lowered if the total adder bit count in use can be minimized. Since every implemented adder must be associated with a variable an,i set to 1 and the bitwidth of that adder is bn,i, the goal of total adder bitwidth minimization can thus be accomplished through setting the objective of the ILP formulation as:
\[ \text{minimize} \quad \sum_{n \in C} \sum_{i=1}^{d_n} b_{n,i} \, a_{n,i} \tag{7} \]
subject to (4), (5), and (6).
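To illustrate what the formulation selects, the sketch below replaces the ILP solver with exhaustive enumeration on a very small instance, so it is not the solver-based method used in this work. The graph is written by hand for the single output 21 with an 8-bit input; the adder bit counts are computed with max(m, n + l) – l under the assumed sizing rule "input bitwidth plus ⌈log2 c⌉", and the names `GRAPH`, `OUTPUTS`, and `cheapest_implementation` are hypothetical.

```python
from itertools import combinations

# number -> list of (adder_bits, subexpression_s, subexpression_t)
# 21 = 1 + 5*2^2 (11 bits), 21 = 17 + 1*2^2 (11 bits), 21 = 5 + 1*2^4 (8 bits)
GRAPH = {
    21: [(11, 1, 5), (11, 17, 1), (8, 5, 1)],
    5: [(8, 1, 1)],
    17: [(8, 1, 1)],
}
OUTPUTS = {21}


def cheapest_implementation(graph, outputs):
    """Try every set of implemented numbers (s_n = 1) that contains the outputs.
    A set is feasible if each member has a decomposition whose two
    subexpressions are implemented (or equal to 1). The graph is assumed
    acyclic, which holds when every subexpression has fewer nonzero digits."""
    optional = [n for n in graph if n not in outputs]
    best_cost, best_choice = None, None
    for size in range(len(optional) + 1):
        for extra in combinations(optional, size):
            implemented = set(outputs) | set(extra)
            available = implemented | {1}
            choice, cost = {}, 0
            for n in implemented:
                feasible = [dec for dec in graph[n]
                            if dec[1] in available and dec[2] in available]
                if not feasible:
                    break                      # existence constraint violated
                choice[n] = min(feasible)      # cheapest selected decomposition
                cost += choice[n][0]
            else:
                if best_cost is None or cost < best_cost:
                    best_cost, best_choice = cost, choice
    return best_cost, best_choice


cost, choice = cheapest_implementation(GRAPH, OUTPUTS)
print(cost, choice)
```

On this toy graph the cheapest solution implements 5 first and then 21 = 5 + (x << 4), for a total of 16 adder bits; the enumeration grows exponentially with the number of subexpressions, which is why an ILP solver is the practical choice.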
2.5 Experimental Results
To evaluate the proposed algorithm, we compare it against a widely used graph-based technique presented in [14] as well as a state-of-the-art digit-based technique presented in [29]. All experiments have been conducted on a Linux platform with two Intel Xeon 2.4 GHz processors and 12 GB of main memory. For the preparation of test cases, the Remez algorithm [51, 52] is utilized to randomly generate FIR filters of various types, including low-pass, high-pass, band-pass, and band-stop. In addition, the MSD representation is adopted for the number decomposition, and the Gurobi Optimizer [53] is used as the ILP solver.
In Table 3, twelve 8-tap filters are evaluated with 12-bit coefficients and input data. For each method, #adders reports the total number of adders required in the MCM block, while #bits gives the total number of adder bits. Since RAG-n [14] allows the use of right shifters and accepts a mapping result that induces extra adders for a coefficient to maximize global subexpression sharing, it is capable of finding a design solution that is not present in the solution space of digit-based algorithms. Hence, RAG-n is likely to better minimize the number of adders for an MCM design, as the results show. However, our method requires fewer adder bits than RAG-n in every test case. Even in the case of bs8-3, the result of the proposed method needs two more adders but still requires one fewer adder bit than that of RAG-n.
Table 3 8-tap filter cost comparisons among the three methods
Filters   RAG-n [14]        Ho et al. [28]    Ours
          #adders  #bits    #adders  #bits    #adders  #bits
lp8-1     6        97       6        91       6        91
lp8-2     8        155      8        129      8        126
lp8-3     7        136      8        127      8        121
hp8-1     6        104      6        104      6        100
hp8-2     8        139      8        134      8        134
hp8-3     7        118      8        120      8        112
bp8-1     10       206      11       190      11       181
bp8-2     9        169      9        161      9        152
bp8-3     10       189      10       180      10       161
bs8-1     9        157      9        162      9        156
bs8-2     7        118      7        109      7        105
bs8-3     8        143      9        152      10       142

To evaluate the area cost of a real implementation more precisely, we also generate the corresponding Verilog RTL code for a set of 128-tap filter designs and then synthesize the RTL code into gate-level designs based on TSMC 0.18 µm technology. In Table 4 and Table 5, two configurations with different coefficient widths, 12-bit and 16-bit, are examined for each filter. In addition, the input is assumed to be as wide as the coefficient. Similarly, #adders reports the total number of adders in the MCM block, #bits gives the total number of adder bits, and #gates reports the NAND2-equivalent gate count in the synthesized circuit. Table 4 clearly shows that, for every test case, the proposed algorithm needs at least as many adders as the method in [28].
However, our method requires a smaller adder bit count in every test case. For the twelve 12-bit test cases in Table 4, the average reductions on the adder bit count and the gate