Thesis Organization - Delay- and Area-Optimized Finite Impulse Response Filter Synthesis Using Binary Common Subexpression Sharing

Chapter 1 Introduction

1.3 Thesis Organization

The remainder of this thesis is organized as follows. In Chapter 2, we briefly introduce the necessary terminology and contrast against prior related works. Chapter 3 gives some motivational examples. The proposed method is demonstrated in Chapter 4. Chapter 5 shows our experimental setup and presents the experimental results. Finally, Chapter 6 gives the concluding remarks of this thesis.

Chapter 2 Background

In this chapter, we briefly review the background knowledge and the primary previous work. [14] proposes a new CSE method that uses the binary representation of coefficients, called Binary Subexpression Elimination (BSE). Section 2.1 presents the basic terminology used in BSE. Section 2.2 describes a heuristic method for realizing the BSE architecture.

2.1 Basic Terminology

• Coefficient, C: A binary number.

• Non-zero bits, NZB(C): The number of non-zero bits in a binary number C.

For example, NZB(101001)=3, NZB(11101)=4.

• Fragment, F(S, i): A number generated by left-shifting the symbol S by i bits.

For example, F(S5,3)=101<<3=101000, where S5 denotes the symbol 101.

• Match, M: A match for a coefficient C with respect to an alphabet A is a set of fragments {F_i}, each built from a symbol in A, such that $\sum_i F_i = C$ and $\sum_i \mathrm{NZB}(F_i) = \mathrm{NZB}(C)$.

For example, assume A={1, 11, 101, 111, 1001, 10101} and C=11010. We can find a match M0={F(S3,3), F(S1,1)} such that 11000+10=11010, and also M1={F(S9,1), F(S1,3)} such that 10010+1000=11010. Likewise, M2={F(S13,1)} is a match for C.

However, {F(S21,0), F(S5,0)} is not a match because NZB(10101)+NZB(101)=5, which is not equal to NZB(11010)=3.
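The two match conditions can be checked mechanically. The following is a minimal sketch, assuming coefficients and symbols are held as Python integers and a fragment is written as a (symbol, shift) pair; the function names are illustrative and not taken from [14]. It does not check that each symbol belongs to the alphabet A.

```python
def nzb(value: int) -> int:
    """NZB(C): number of non-zero bits of a binary number."""
    return bin(value).count("1")


def fragment(symbol: int, shift: int) -> int:
    """F(S, i): the symbol S left-shifted by i bits."""
    return symbol << shift


def is_match(coefficient: int, fragments: list[tuple[int, int]]) -> bool:
    """Check both match conditions for a list of (symbol, shift) pairs."""
    values = [fragment(s, i) for s, i in fragments]
    same_sum = sum(values) == coefficient
    same_nzb = sum(nzb(v) for v in values) == nzb(coefficient)
    return same_sum and same_nzb


C = 0b11010
print(is_match(C, [(0b11, 3), (0b1, 1)]))        # M0
print(is_match(C, [(0b1001, 1), (0b1, 3)]))      # M1
print(is_match(C, [(0b1101, 1)]))                # M2
print(is_match(C, [(0b10101, 0), (0b101, 0)]))   # sum is right, NZB is not
```

The last call returns False even though 10101+101=11010, because the two fragments overlap and the addition produces carries.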

2.2 Previous Works

[14] proposes a new CSE method that uses the binary representation of coefficients, called Binary Subexpression Elimination (BSE). BSE realizes the two-stage multiplier block architecture shown in Fig. 3. The first stage, called the alphabet generation unit, generates the binary common subexpressions (also called symbols). The second stage, called the fragment summation unit, uses these binary common subexpressions to realize constant multiplication. Another application of BSE, constant multiplication for the DCT, is proposed in [13].

[14] claims that their method reduces the number of logical operators by 21% on average compared with the two CSE methods [15] and [16]. Even though the CSD representation has fewer non-zero bits than the corresponding binary representation, the CSD-based CSE technique has less potential than the binary one to reduce the number of adders by forming common subexpressions when the number of non-zero bits is at its minimum.

The most important issue in the BSE method is how to find a match for a coefficient. [14] proposes a heuristic method called Sequential Longest Symbol Match (SLSM), which uses the fixed alphabet A={1, 11, 101, 111, 1001}. The authors claim that the symbols (common subexpressions) 11, 101, 111, and 1001 occur frequently in coefficients and that the reductions are not significant when longer symbols such as 10001 are used. SLSM processes one match at a time and sequentially matches a coefficient from MSB to LSB in a piece-wise fashion, choosing the longest symbol for each partial match. Note that SLSM finds only one match for a coefficient even though many solutions are possible. For example, assume C=1001101011. At the MSB, either 1(S1) or 1001(S9) can be chosen, and we choose 1001(S9) because it is the longest symbol available; the remaining bits are handled in the same way. We finally choose S9=1001 left-shifted by 6 bits, S5=101 left-shifted by 3 bits, and S3=11. The resulting match is M={F(S9,6), F(S5,3), F(S3,0)} such that 1001000000+101000+11=C. After SLSM, we use these fragments to realize the fragment summation unit, as shown in Fig. 4.
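The piece-wise, longest-first matching described above can be sketched as follows. This is an illustrative reading of SLSM and not the code of [14]: it assumes that a symbol fits only when the coefficient bits under it are exactly the bits of the symbol, and the names `slsm` and `FIXED_ALPHABET` are hypothetical.

```python
FIXED_ALPHABET = (0b1, 0b11, 0b101, 0b111, 0b1001)


def slsm(coefficient: int, alphabet=FIXED_ALPHABET) -> list[tuple[int, int]]:
    """Return one match as (symbol, shift) pairs, scanning from MSB to LSB."""
    by_length = sorted(alphabet, key=int.bit_length, reverse=True)
    match = []
    pos = coefficient.bit_length() - 1           # index of the current bit
    while pos >= 0:
        if not (coefficient >> pos) & 1:
            pos -= 1                             # skip zeros between pieces
            continue
        for symbol in by_length:                 # longest symbol first
            width = symbol.bit_length()
            shift = pos - width + 1
            if shift < 0:
                continue                         # symbol would run past the LSB
            window = (coefficient >> shift) & ((1 << width) - 1)
            if window == symbol:
                match.append((symbol, shift))
                pos = shift - 1                  # continue below this piece
                break
        else:
            raise ValueError("alphabet must contain the symbol 1")
    return match


print(slsm(0b1001101011))   # [(9, 6), (5, 3), (3, 0)]
```

The printed pairs correspond to F(S9,6), F(S5,3), and F(S3,0) in the example above.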

Fig. 3 BSE-based filter architecture

Fig. 4 An Example of SLSM

Chapter 3 Motivation

In this chapter, we briefly review the adder architectures and give four motivational examples that demonstrate the limitations of previous work. With SLSM and a fixed alphabet, only one match can be found for a coefficient. We explore other possible matches in order to obtain further savings.

3.1 Delay and Area Calculation

When designing a multiplier-less multiple constant multiplication (MCM) block, different adder architectures can be used depending on the design constraints. Two widely used adder architectures are shown in Fig. 5. A general carry propagation adder (CPA) adds two numbers x and y to produce a number a. The carry save adder (CSA) takes three numbers, x, y, and z, and transforms them into two numbers, a and b, such that x + y + z = a + b. Note that the outputs of a CSA are not the final summation result: an extra CPA is required to add the final two subexpressions. It is therefore preferable to use CSAs iteratively to reduce the subexpressions and then use one CPA to obtain the final result.

Because a CPA is more complicated than a CSA, we assume that the delay and area ratio of CPA to CSA is 2. To add 9 numbers, we use 7 CSAs and 1 CPA, as in Fig. 6 (a), rather than 8 CPAs, as in Fig. 6 (b), which saves 7 units of area and 2 units of delay.
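The 9-operand comparison can be reproduced with a small cost model. This is a sketch under stated assumptions: a CSA costs 1 unit of area and delay, a CPA costs `ratio` units, each CSA level compresses every full group of three operands into two, and the CPA-only version is a balanced tree. The figure itself is not available here, so the tree shapes and the function names are assumptions.

```python
def csa_levels(operands: int) -> int:
    """Levels of 3:2 compression needed to reduce the operands to two."""
    levels = 0
    while operands > 2:
        operands -= operands // 3    # every full group of three becomes two
        levels += 1
    return levels


def csa_tree_cost(operands: int, ratio: int = 2) -> tuple[int, int]:
    """(area, delay) of a CSA tree followed by one final CPA."""
    if operands < 2:
        return 0, 0
    return (operands - 2) + ratio, csa_levels(operands) + ratio


def cpa_tree_cost(operands: int, ratio: int = 2) -> tuple[int, int]:
    """(area, delay) of a balanced tree built from CPAs only."""
    if operands < 2:
        return 0, 0
    depth = (operands - 1).bit_length()          # ceil(log2(operands))
    return (operands - 1) * ratio, depth * ratio


print(csa_tree_cost(9))   # (9, 6)
print(cpa_tree_cost(9))   # (16, 8)
```

Under these assumptions the difference is 16-9=7 units of area and 8-6=2 units of delay, which agrees with the numbers quoted in the text.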

Fig. 5 Two Kinds of Adder Architectures

Fig. 6 An Example of Delay and Area Calculation

3.2 An Example of Alphabet Selection

In the SLSM process, the number of fragments in the resulting match depends on the length of the symbols chosen from the fixed alphabet in each partial selection. In this example, we therefore add a longer symbol, 10001, to the alphabet (the extended alphabet is called A1) and try to reduce the number of fragments in the resulting match.

We use SLSM with the coefficient C=1111000110010 to realize the alphabet generation unit and the fragment summation unit. Fig. 7 (a) shows the circuitry that corresponds to the match Ma={F(S7,10), F(S1,9), F(S3,4), F(S1,1)} with alphabet A0={1, 11, 101, 111, 1001}, and Fig. 7 (b) shows the circuitry that corresponds to an alternative match Mb={F(S7,10), F(S17,5), F(S9,1)} with alphabet A1. The fragment summation unit needs one CSA fewer, which saves 1 unit of area and 1 unit of delay. Adding a longer symbol changes the alphabet and may yield a more efficient fragment summation unit.

Fig. 7 An Example of Alphabet Selection

3.3 An Example of Match Choice

SLSM provides only one match for a coefficient. In fact, we only need to ensure that every 1 in the coefficient is covered by a fragment. The following example shows an alternative match that covers all 1’s in the coefficient in an interleaved way.

We use the fixed alphabet A=A0 and assume the coefficient C=1001101101. Fig. 8 (a) shows the SLSM result Ma={F(S9,6), F(S5,3), F(S5,0)}. However, the pattern 101101 in C can be split in an interleaved way into the two fragments F(S9,2) and F(S9,0) instead of F(S5,3) and F(S5,0). The alternative match Mb={F(S9,6), F(S9,2), F(S9,0)} is shown in Fig. 8 (b). In this case, choosing a different match eliminates the symbol S5, which is no longer used, so the alphabet generation unit saves 2 units of area.

Fig. 8 An Example of Match Choice

3.4 An Example of Sharing Issue

In most MCM methods, common subexpression sharing is an important way to reduce the hardware cost. However, SLSM does not consider the other coefficients when it processes a coefficient.

To consider more than one coefficient, this example uses the two coefficients C0=101101 and C1=011010.

Fig. 9 (a) shows the SLSM results Ma0={F(S5,3), F(S5,0)} for C0 and Ma1={F(S3,3), F(S1,1)} for C1. However, we can find the more efficient matches Mb0={F(S5,3), F(S5,0)} for C0 and Mb1={F(S5,1), F(S1,4)} for C1, as shown in Fig. 9 (b). Comparing Ma1 with Mb1, Mb1 reuses the symbol S5 from Mb0 without generating a new symbol S3, which saves 2 units of area in the alphabet generation unit. In this case, we use the common symbol S5 to realize both coefficients instead of processing the two coefficients individually.

Fig. 9 An Example of Sharing Issue: (a) Circuitry of SLSM; (b) An alternative matching

3.5 An Example of Timing Issue

SLSM cannot trade off area against delay because its result is fixed. Under different timing constraints, we must choose a suitable match for a coefficient rather than a single fixed match.

The following example shows two different matches. We assume the coefficient C=111000110010 and the alphabet A0. Fig. 10 (a) shows the match Ma={F(S7,9), F(S3,4), F(S1,1)} found by SLSM, and Fig. 10 (b) shows another match Mb={F(S3,10), F(S1,9), F(S3,4), F(S1,1)}. Compared with Ma, Mb has a longer maximum delay but a smaller area cost. In our work, the user provides a timing constraint; we enumerate all possible matches under this constraint and select the best solution.

Fig. 10 An Example of Timing Issue

3.6 Problem Formulation

In this thesis, we address the problem of optimal MCM design based on the two-stage multiplier block architecture. We are given:

• a set of fixed-point coefficients {C0, C1, …, Cn-1},

• the timing constraint from the input X to the output X*C of the fragment summation unit,

• the delay and area ratio of CSA and CPA.

Our goal is to minimize the total area cost, including the alphabet generation unit and the fragment summation unit, under the given timing constraint.

Chapter 4 Our Proposed Algorithm

In this chapter, we describe our algorithm, called Global Optimal Symbol Match (GOSM), which finds matches for coefficients and implements delay- and area-optimal BSE-based FIR filters. Section 4.1 presents our algorithm, including terminology (4.1.1), pseudo code (4.1.2), complexity analysis (4.1.3), and two enhanced methods (4.1.4 and 4.1.5). Section 4.2 illustrates a working example. Finally, Section 4.3 gives the Integer Linear Programming (ILP) formulation used to optimize delay and area.

4.1 Algorithm Flow

The detailed process is shown in Fig. 11. In Step 1, we enumerate all possible matches and construct the Coefficient Assembly Tree (CAT). In Step 2, we eliminate the redundant paths to reduce the complexity. In Step 3, we use ILP to decide the best paths for all coefficients.

Fig. 11 GOSM Flow

4.1.1 Terminology

• Coefficient Assembly Tree, CAT(C): A tree that is expanded for a coefficient C.

For example, Fig. 12 (a) illustrates the CAT for the coefficient C=6’b011010.

• Path: A match that runs from the root to a leaf in CAT(C).

For example, in Fig. 12 (a), the CAT includes 5 paths, which correspond to the 5 possible matches for this coefficient.

• SymSet(Path): The set of symbols that are used on Path.

For example, the marked path in Fig. 12 (a) includes two fragments, (S1,4) and (S5,1); therefore, SymSet(Path)={S1,S5}.

• Delay(Path): The delay of Path, including the symbol generation time.

For example, Fig. 12 (b) shows the implementation of the alphabet generation unit and the fragment summation unit that corresponds to Path={(S1,4),(S5,1)}. The maximum delay, from X to X*C, is 4. The alphabet generation unit includes two area costs, Area(S1)=0 and Area(S5)=2.

• Trim leading zeros, TrimLZ(C): Trim the leading 0’s in C.

For example, TrimLZ(011010)=11010, TrimLZ(0001011)=1011.

• Trim MSB, TrimMSB(C): Trim the MSB in C.

For example, TrimMSB(11010)=1010.

• |C|: The bit width of TrimLZ(C), where C is a binary number.

For example, |1001|=4, |011010|=5, |11010|=5.

• Difference of length, DOL(C, S): Returns |C|-|S|, where C is a coefficient and S is a symbol.

For example, DOL(11010,1001)=5-4=1, DOL(11010,11)=5-2=3.

• Residue(B, C): TrimLZ(B-C), where B and C are binary numbers.

For example, Residue(11010,10010)=TrimLZ(11010-10010)=TrimLZ(01000)=1000.
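The bit-level helpers above translate directly into code. The following is a minimal sketch that assumes binary numbers are kept as strings of '0' and '1' so that leading zeros stay visible; the Python names are illustrative, and `residue` assumes B is not smaller than C.

```python
def trim_lz(c: str) -> str:
    """TrimLZ(C): drop the leading 0's of a bit string."""
    return c.lstrip("0")


def trim_msb(c: str) -> str:
    """TrimMSB(C): drop the most significant bit."""
    return c[1:]


def bitwidth(c: str) -> int:
    """|C|: bit width of TrimLZ(C)."""
    return len(trim_lz(c))


def dol(c: str, s: str) -> int:
    """DOL(C, S) = |C| - |S|."""
    return bitwidth(c) - bitwidth(s)


def residue(b: str, c: str) -> str:
    """Residue(B, C) = TrimLZ(B - C); an empty string stands for zero."""
    return trim_lz(format(int(b, 2) - int(c, 2), "b"))


print(trim_lz("011010"), trim_msb("11010"), bitwidth("011010"))
print(dol("11010", "1001"), residue("11010", "10010"))
```

The calls reproduce the examples given with each definition.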

Fig. 12 (a) A CAT Example. (b) The Delay and Area Calculations.

4.1.2 Coefficient Assembly Tree Enumerator

The CAT enumerator enumerates a CAT for a coefficient. To find all CATs, we execute the CAT enumerator n times for the n coefficients in the coefficient set. The recursive pseudo code for the CAT enumerator is as follows.

Initial: A=∅;

Note that the alphabet A is initially empty. To simplify the calculation, we trim the leading zeros in C before starting CAT. The first for loop, in line 2 to line 5, generates the usable symbols. Num_1 indicates the number of 1’s that must be chosen in C’. For example, assume C=011010, so that C’=TrimMSB(11010)=1010. The loop takes the 0-combinations of the two 1’s in C’ ($C_0^2$), the 1-combinations of the two 1’s in C’ ($C_1^2$), and the 2-combinations of the two 1’s in C’ ($C_2^2$). After the for loop, these three kinds of combinations give 4 possible symbols, 1(S1), 11(S3), 1001(S9), and 1101(S13), which are added to a set Symbol. In the second for loop, in line 6 to line 10, for each symbol S in Symbol we create a child node r=F(S, d), where d is DOL(C, S), add S to A, and recursively call CAT with Residue(C, S<<d). For example, we create a child node r=F(1,d) for Root, where d is DOL(011010,1)=4, add 1(S1) to A, and call CAT(r,1010). The recursion terminates when the first for loop does not generate any symbols.

Sym_Enum is a sub-function, called in line 5 of CAT, that generates the usable symbols for C. Before starting Sym_Enum, we trim the first 1 of C and use it as the initial symbol S. Sym_Enum grows the symbol S into the usable symbols and stores them in the set Symbol. In each recursive call, we check MSB(C’). If it is 0, we skip this bit, let S=S<<1, and recursively call Sym_Enum for TrimMSB(C’). If it is 1, we can either treat this 1 as a 0 and skip it, provided that enough 1’s remain in C’, or pick this 1, let S=(S<<1)+1, and decrease the index num_1 by 1. The recursion terminates when num_1 counts down to 0. An example for Sym_Enum(1010,1,1) is illustrated in Fig. 14. In this example, S grows into 11(S3) and 1001(S9), which are the 1-combinations of the two 1’s in C’=1010 ($C_1^2$).
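The enumeration described in this section can be sketched compactly. This is not the thesis pseudo code: it works on integers, replaces the recursive Sym_Enum with `itertools.combinations`, and returns the list of paths instead of building tree nodes. The names `usable_fragments` and `enumerate_paths` are hypothetical.

```python
from itertools import combinations


def usable_fragments(c: int) -> list[tuple[int, int]]:
    """(symbol, shift) pairs covering the MSB of c plus any subset of its other 1's."""
    msb = c.bit_length() - 1
    other_ones = [b for b in range(msb) if (c >> b) & 1]
    result = []
    for num_1 in range(len(other_ones) + 1):       # how many extra 1's to pick
        for chosen in combinations(other_ones, num_1):
            shift = min(chosen, default=msb)        # lowest covered bit, equals DOL(C, S)
            value = (1 << msb) | sum(1 << b for b in chosen)
            result.append((value >> shift, shift))  # the symbol always ends in 1
    return result


def enumerate_paths(c: int) -> list[list[tuple[int, int]]]:
    """All root-to-leaf paths of CAT(c); each path is one match for c."""
    if c == 0:
        return [[]]                                 # nothing left to cover
    paths = []
    for symbol, shift in usable_fragments(c):
        rest = c - (symbol << shift)                # Residue(C, S << d)
        for tail in enumerate_paths(rest):
            paths.append([(symbol, shift)] + tail)
    return paths


paths = enumerate_paths(0b011010)
print(len(paths))                                               # 5
print(sorted({s for s, _ in usable_fragments(0b011010)}))       # [1, 3, 9, 13]
```

For C=011010 the sketch yields the four symbols S1, S3, S9, and S13 at the root and five paths in total, including the path {(S1,4),(S5,1)} mentioned in Section 4.1.1.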

Fig. 13 Illustration for CAT(R, TrimLZ(011010))

Fig. 14 Illustration for Sym_Enum(1010,1,1)

4.1.3 Coefficient Assembly Tree Complexity

To analyze the complexity of CAT, we derive NumP(k).

• NumP(k): The number of possible paths in CAT(C) such that NZB(C)=k.

In fact, NumP(k) depends only on NZB(C). For example, the CAT of 10011 and the CAT of 11001 have the same number of paths because they have the same number of non-zero bits. According to the CAT enumerator, the recursive equation is as follows:

$$\mathrm{NumP}(k) = \begin{cases} 1, & k = 0, 1 \\ \sum_{i=0}^{k-1} C_i^{k-1} \cdot \mathrm{NumP}(k-i-1), & \text{otherwise} \end{cases} \qquad (4.1)$$

where $C_i^{k-1}$ represents the number of i-combinations of k-1 bits.

When k=2, the number of paths is NumP(2) = $C_0^1$·NumP(1) + $C_1^1$·NumP(0) = 1+1 = 2; when k=3, NumP(3) = $C_0^2$·NumP(2) + $C_1^2$·NumP(1) + $C_2^2$·NumP(0) = 1·2+2+1 = 5. Table I shows NumP(k).
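Recurrence (4.1) is easy to evaluate numerically. The sketch below is an illustration only; the name `num_p` is hypothetical, and it assumes Python 3.8 or later for `math.comb`.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def num_p(k: int) -> int:
    """Number of CAT paths for a coefficient with k non-zero bits, per (4.1)."""
    if k <= 1:
        return 1
    return sum(comb(k - 1, i) * num_p(k - i - 1) for i in range(k))


print([num_p(k) for k in range(6)])   # [1, 1, 2, 5, 15, 52]
```

The values follow the same recurrence as the Bell numbers, so the number of paths grows faster than exponentially with the number of non-zero bits, which motivates pruning the tree.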

4.1.4 Pruned Coefficient Assembly Tree

To account for timing, we can check the maximum delay while enumerating the CAT and prune the paths that exceed the timing constraint. The pseudo code of the modified CAT, called PCAT, is as follows.

Initial: A=∅;

12 PCAT(r, Residue(C, S<<d), Pathmiddle);

End

The differences between CAT and PCAT are marked in bold text. To calculate the maximum delay of a path, we call PCAT with Pathmiddle and record every node we enumerate. The significant difference is that, before creating a child node, we must estimate the delay that results if the child node F(S, d) is added to the temporary path Pathmiddle, which corresponds to line 8 and line 9.

For example, the coefficient C is 11011101, the delay and area ratio of CPA and CSA is assumed to be 2, and the timing constraint is 5. When we reach the node F(S5,4) in Fig. 15, that is, PCAT(F(S5,4), 1101, {F(S1,7), F(S5,4)}), we add the node F(S1,3), marked with an arrow, to Pathmiddle and calculate the maximum delay of Pathmiddle={F(S1,7), F(S5,4), F(S1,3)} to decide whether to create the child node F(S1,3). Because the delay of Pathmiddle is 5, which is equal to the timing constraint, PCAT is not executed for the node F(S1,3). As a result of pruning at F(S1,3), the child nodes of F(S1,3), which are marked by dashed circles, are not enumerated.

Because a Pruned Coefficient Assembly Tree (PCAT) need not be enumerated as completely as a CAT, its complexity is not as pessimistic as the analysis in Section 4.1.3. In Section 5.4, we show the reduction rate achieved by pruning under different timing constraints.

Fig. 15 A Pruning Example

4.1.5 Reduced Pruned Coefficient Assembly Tree

In this section, we propose a method, called the reduction phase, that further reduces the complexity of the Coefficient Assembly Tree. Once a path is completed in the PCAT process (that is, no further child node is created), we check SymSet(Path) and Area(Path) to decide whether to eliminate this path. For each distinct SymSet(Path), we keep only the smallest-area path among the paths completed so far.

Using a hash table, we can perform the SymSet(Path) check in linear time. For example, suppose that the area ratio of CPA and CSA is 2 and the coefficient is C=11011101. Assume that a path Path1={F(S1,7),F(S5,4),F(S1,3),F(S1,2),F(S1,0)} with SymSet(Path1)={S1, S5} was stored in the temporary storage by a previous step, and that a path Path2={F(S1,7),F(S5,4),F(S1,3),F(S5,0)} with SymSet(Path2)={S1, S5} has now been completed. Because they have the same SymSet and Area(Path2)=4 is smaller than Area(Path1)=5, we replace Path1 with Path2 in the temporary storage.

Path1 is a redundant path in this case. Eliminating Path1 does not affect the optimal solution because Path2 is a better choice regardless of the other coefficients. In Chapter 5, we show that the reduction phase reduces the total number of paths by about 20%.
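The reduction phase amounts to keeping one entry per symbol set in a hash table. The sketch below is a hypothetical illustration: paths are lists of (symbol, shift) pairs, and Area(Path) is assumed to be the cost of summing m fragments with m-2 CSAs and one CPA, an assumption that reproduces the areas 5 and 4 quoted above.

```python
def path_area(path: list[tuple[int, int]], ratio: int = 2) -> int:
    """Area of the fragment summation for one path: (m - 2) CSAs plus one CPA."""
    m = len(path)
    return 0 if m < 2 else (m - 2) + ratio


def reduce_paths(paths, ratio: int = 2) -> dict:
    """Keep only the smallest-area path for each distinct SymSet(Path)."""
    best = {}
    for path in paths:
        key = frozenset(symbol for symbol, _ in path)   # SymSet(Path) as a hashable key
        if key not in best or path_area(path, ratio) < path_area(best[key], ratio):
            best[key] = path
    return best


path1 = [(1, 7), (5, 4), (1, 3), (1, 2), (1, 0)]
path2 = [(1, 7), (5, 4), (1, 3), (5, 0)]
kept = reduce_paths([path1, path2])
print(path_area(path1), path_area(path2))    # 5 4
print(kept[frozenset({1, 5})] == path2)      # True
```

Using a `frozenset` as the dictionary key makes the lookup independent of the order and multiplicity of the symbols on the path.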

4.2 Working Example

We illustrate a complete example with two coefficients. The coefficient set is {101101(C0), 011010(C1)}, the delay and area ratio of CPA and CSA is 2, and the timing constraint is 4. Fig. 16 shows the overall process; a dark node indicates that pruning occurs there because of a timing violation. After PCAT has been executed for C0, the alphabet A is fully extended. For each coefficient, the PCAT enumerator and the reduction phase produce an RPCAT, giving two RPCATs with five paths each, as shown in Fig. 17.

Fig. 16 Illustration of RPCAT(C0) in Working Example

Fig. 17 The RPCAT Results for C0 and C1

4.3 Integer Linear Programming (ILP) Formulation

In the previous section, we enumerated all possible matches and constructed the CAT. Next, we formulate our problem as an ILP problem and use an ILP solver to decide the best paths for all coefficients.

4.3.1 Variables

In the proposed ILP formulation, two kinds of variables are used to model the choice of a path in a CAT. The first, VarPath, indicates whether a path is selected. The second, VarS, indicates whether a symbol is selected in the alphabet. The following equation lists the corresponding ILP formulations.

Our proposed FIR filter has a two-stage architecture consisting of the alphabet generation unit and the fragment summation unit. Since VarS is a 0-1 variable, we can calculate the area of the alphabet generation unit as $\sum_{i=0}^{m-1} \mathrm{Area}(S_i)\cdot\mathrm{VarS}_i$. Similarly, VarPath is also a 0-1 variable, so the area of the fragment summation unit can be calculated as $\sum_{i=0}^{n-1}\sum_{j=0}^{k-1} \mathrm{Area}(\mathrm{Path}_{i,j})\cdot\mathrm{VarPath}_{i,j}$. In order to minimize the total area cost, the objective function can be formulated as:

$$\text{Minimize}\quad \sum_{i=0}^{m-1} \mathrm{Area}(S_i)\cdot\mathrm{VarS}_i + \sum_{i=0}^{n-1}\sum_{j=0}^{k-1} \mathrm{Area}(\mathrm{Path}_{i,j})\cdot\mathrm{VarPath}_{i,j} \qquad (4.4)$$

where n is the number of coefficients, k is the number of paths of the ith coefficient, and m is the number of symbols in alphabet A.

4.3.3 Existence Constraint

In a CAT, each path represents one implementation, which is composed of several symbols. If a Path_{i,j} is selected, every corresponding symbol S ∈ SymSet(Path_{i,j}) must also exist in the alphabet. The existence constraint therefore guarantees that all symbols of a selected path exist in the alphabet. The formulation is as follows:

$$\mathrm{VarPath}_{i,j} \le \min\{\mathrm{VarS}_0, \dots, \mathrm{VarS}_{n-1}\} \quad \forall\, \mathrm{Path}_{i,j} \mid S \in \mathrm{SymSet}(\mathrm{Path}_{i,j}) \qquad (4.5)$$

where VarS_0, …, VarS_{n-1} are the variables of the symbols in SymSet(Path_{i,j}).

4.3.4 Uniqueness Constraint

A coefficient can be produced by several implementations. Choosing multiple implementations for it wastes hardware. To ensure that exactly one path is chosen for each coefficient, the uniqueness constraint is formulated as:

$$\sum_{j=0}^{k-1} \mathrm{VarPath}_{i,j} = 1 \qquad (4.6)$$

where k is the number of paths for the ith coefficient.

4.3.5 ILP Example

In Section 4.2, we illustrated an example with two coefficients. After the above procedure, we obtain 5 possible paths for each of the coefficients C0 and C1, as shown in Fig. 17. The objective is then to minimize

4VarPath0,0+2VarPath0,1+2VarPath0,2+2VarPath0,3+0VarPath0,4+
3VarPath1,0+2VarPath1,1+2VarPath1,2+2VarPath1,3+0VarPath1,4+
0VarS1+2VarS3+2VarS5+2VarS9+3VarS11+3VarS13+3VarS37+3VarS41+4VarS45.

In addition, the existence constraints follow (4.5) for every path in Fig. 17, and the uniqueness constraints are:

VarPath0,0+VarPath0,1+VarPath0,2+VarPath0,3+VarPath0,4=1;

VarPath1,0+VarPath1,1+VarPath1,2+VarPath1,3+VarPath1,4=1.

Finally, we use the ILP solver Gurobi [17] to solve this ILP problem. The ILP result is VarPath0,2=1, VarPath1,3=1, and VarS9=1, as shown in Fig. 18, and the best solution is Area(Path0,2)+Area(Path1,3)+Area(S9)=2+2+2=6.
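For an instance this small, the optimum can be cross-checked by exhaustive search instead of an ILP solver. The sketch below is not the Gurobi model: it enumerates one path per coefficient (uniqueness) and charges every symbol used by a chosen path (existence). The cost model is an assumption: summing n operands costs n-2 CSAs plus one CPA, a symbol with n non-zero bits is built the same way, and the delay of a path is the delay of its slowest symbol plus the delay of the fragment summation. All function names are hypothetical.

```python
from itertools import combinations, product

RATIO = 2     # delay and area ratio of CPA to CSA, as in the working example
TIMING = 4    # timing constraint of the working example


def csa_levels(n: int) -> int:
    levels = 0
    while n > 2:
        n -= n // 3               # every full group of three becomes two
        levels += 1
    return levels


def sum_cost(n: int) -> tuple[int, int]:
    """(area, delay) of adding n operands with CSAs and one final CPA."""
    return (0, 0) if n < 2 else (n - 2 + RATIO, csa_levels(n) + RATIO)


def symbol_cost(s: int) -> tuple[int, int]:
    return sum_cost(bin(s).count("1"))


def matches(c: int) -> list:
    """All matches of c as lists of (symbol, shift) pairs."""
    if c == 0:
        return [[]]
    msb = c.bit_length() - 1
    ones = [b for b in range(msb) if (c >> b) & 1]
    out = []
    for r in range(len(ones) + 1):
        for chosen in combinations(ones, r):
            shift = min(chosen, default=msb)
            frag = (1 << msb) | sum(1 << b for b in chosen)
            out += [[(frag >> shift, shift)] + t for t in matches(c - frag)]
    return out


def path_delay(path) -> int:
    # Assumption: every fragment waits for the slowest symbol on the path.
    return max(symbol_cost(s)[1] for s, _ in path) + sum_cost(len(path))[1]


def best_total_area(coefficients):
    candidates = [[p for p in matches(c) if path_delay(p) <= TIMING]
                  for c in coefficients]
    best = None
    for choice in product(*candidates):          # exactly one path per coefficient
        symbols = {s for path in choice for s, _ in path}
        area = (sum(symbol_cost(s)[0] for s in symbols)
                + sum(sum_cost(len(path))[0] for path in choice))
        if best is None or area < best[0]:
            best = (area, choice)
    return best


area, choice = best_total_area([0b101101, 0b011010])
print(area)   # 6
```

Under these assumptions the minimum total area is 6, in agreement with the ILP result. The search finds more than one selection with this area (for example, sharing S5 instead of S9), so only the area value is compared here.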

Fig. 18 ILP Results of Working Example

Chapter 5 Experimental Results

5.1 Experiments Setup

The proposed BSE algorithm, GOSM, is implemented in C++ in a Linux environment. We also implement SLSM [14] in the same environment. The coefficient sets of the test designs are generated by the Matlab FDAtool [18].

We consider two widely used CPA architectures. First, the ripple carry adder (RCA) is the simplest adder structure, in which each carry bit must wait for the previous full adder. Its critical path delay is therefore longer than that of other adder structures. Second, the carry look-ahead adder (CLA) calculates the carry bits before the sum, which reduces the critical path delay dramatically.

Table II reports the synthesis results of these two CPA architectures and the CSA (see Section 3.1) for different bit widths in the TSMC 180 nm process. RCA is the smallest but has the longest computation time, while CSA is the fastest and its area grows linearly with the bit width. In high-speed applications such as software defined radio (SDR), a filter with a shorter critical path delay is desired. Therefore, we choose CLA as the CPA and set the area ratio to 4 (1368/376) for 16-bit and 3 (968/306) for 12-bit designs, and the delay ratio to 4 (1.2/0.32) for 16-bit and 3 (0.95/0.32) for 12-bit designs.

Table II Synthesis Results of Different Adder Architectures in TSMC 0.18 μm

Architecture  Metric       4-bit  8-bit  12-bit  16-bit
RCA           Delay (ns)   2.05   2.85   3.92    4.99
RCA           Area (μm²)   134    141    211     282
CLA           Delay (ns)   0.60   0.79   0.95    1.2
CLA           Area (μm²)   301    814    968     1368
CSA           Delay (ns)   0.32   0.32   0.32    0.32
CSA           Area (μm²)   190    235    306     376
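The integer ratios used in the experiments can be recomputed from the CLA and CSA rows of Table II. The sketch below assumes that the thesis rounds each quotient to the nearest integer; the dictionary layout and the name `ratio` are illustrative.

```python
# Values copied from Table II (CLA and CSA rows, 12-bit and 16-bit columns).
cla = {"area": {12: 968, 16: 1368}, "delay": {12: 0.95, 16: 1.2}}
csa = {"area": {12: 306, 16: 376}, "delay": {12: 0.32, 16: 0.32}}


def ratio(metric: str, bits: int) -> int:
    """CPA-to-CSA ratio rounded to the nearest integer."""
    return round(cla[metric][bits] / csa[metric][bits])


for bits in (16, 12):
    print(bits, ratio("area", bits), ratio("delay", bits))
```

The unrounded quotients are about 3.64 and 3.75 for 16-bit designs and about 3.16 and 2.97 for 12-bit designs, so the ratios 4 and 3 are approximations.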

5.2 Case Study I : Versus SLSM

Tables III and IV list 10 filter designs with 16-bit and 12-bit coefficients, respectively, named lp (lowpass), hp (highpass), bp (bandpass), and bs (bandstop), together with their filter lengths. The area and delay ratio is 4 for 16-bit and 3 for 12-bit coefficients. First, we find the maximum delay and area cost of all designs using the SLSM method, as shown in the second and third columns. Then, we use those maximum delays as the timing constraints in the GOSM method; the corresponding area results are shown in the right-hand columns. Rate is the percentage (area cost by SLSM - area cost by GOSM)/area cost by SLSM, and #sym. represents the number of symbols that the ILP solver actually chooses in GOSM.

Under the same timing constraint, GOSM reduces the area cost by 26% on average for 16-bit coefficients and by 22% for 12-bit coefficients. With SLSM, the area cost of the filter increases linearly as the filter length grows, whereas with GOSM the area cost increases more slowly.

Furthermore, with GOSM the ILP solver chooses more than five symbols (avg. 18.), and the number grows with the filter length. For longer filters, complex symbols appear in the coefficients more frequently. The benefit of these complex symbols explains the difference in reduction ratio between SLSM and GOSM for longer filters. Therefore, the area reduction ratio increases with the filter length.
