Chapter 3 Motivation
3.6 Problem Formulation
In this thesis, we address the problem of optimal MCM design based on the two-stage multiplier block architecture. We are given:
a set of fixed-point coefficients {C0, C1, …, Cn-1},
the timing constraint from the input X to the output X*C of the fragment summation unit,
the delay and area ratios of CPA to CSA.
Our goal is to minimize the total area cost, including the alphabet generation unit and the fragment summation unit, under the given timing constraint.
Chapter 4
Our Proposed Algorithm
In this chapter, we describe our algorithm, called Global Optimal Symbol Match (GOSM), which finds matches for the coefficients and implements delay- and area-optimal BSE-based FIR filters. Section 4.1 presents the algorithm, including terminology (4.1.1), pseudo code (4.1.2), complexity analysis (4.1.3) and two enhancements (4.1.4 and 4.1.5).
Section 4.2 illustrates a working example. Finally, Section 4.3 gives the Integer Linear Programming (ILP) formulation used to optimize delay and area.
4.1 Algorithm Flow
The detailed flow is shown in Fig. 11. In Step 1, we enumerate all possible matches and construct the Coefficient Assembly Tree (CAT). In Step 2, we eliminate redundant paths to reduce the complexity. In Step 3, we use ILP to select the best path for every coefficient.
Fig. 11 GOSM Flow
4.1.1 Terminology
Coefficient Assembly Tree, CAT(C): A tree that is expanded for a coefficient C.
For example, Fig. 12 (a) illustrates the CAT for the coefficient C=6’b011010.
Path: A match, i.e., a route from the root to a leaf in CAT(C).
For example, the CAT in Fig. 12 (a) includes 5 paths, which correspond to the 5 possible matches for this coefficient.
SymSet(Path): The set of symbols that are used on Path.
For example, the marked path in Fig. 12 (a) includes two fragments, (S1,4) and (S5,1); therefore, SymSet(Path)={S1,S5}.
Delay(Path): Delay of Path, including the symbol generation time.
For example, Fig. 12 (b) shows the implementation of the alphabet generation unit and the fragment summation unit that corresponds to Path={(S1,4),(S5,1)}. The maximum delay, from X to X*C, is 4. The alphabet generation unit includes two area costs, Area(S1)=0 and Area(S5)=2.
Trim leading zeros, TrimLZ(C): Trim the leading 0’s of C.
For example, TrimLZ(011010)=11010 and TrimLZ(0001011)=1011.
Trim MSB, TrimMSB(C): Trim the MSB of C.
For example, TrimMSB(11010)=1010.
|C|: Bitwidth of TrimLZ(C), where C is a binary number.
For example, |1001|=4, |011010|=5 and |11010|=5.
Difference of length, DOL(C, S): Returns |C|-|S|, where C is a coefficient and S is a symbol.
For example, DOL(11010,1001)=5-4=1 and DOL(11010,11)=5-2=3.
Residue(B, C): TrimLZ(B-C), where B and C are binary numbers.
For example, Residue(11010,10010)=TrimLZ(11010-10010)=TrimLZ(01000)=1000.
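The terminology above can be checked with a few small helpers. This is a minimal sketch, assuming binary numbers are held as Python strings of 0's and 1's and that B ≥ C in Residue(B, C); the function names are illustrative and do not come from the thesis.

```python
def trim_lz(bits: str) -> str:
    """TrimLZ(C): drop the leading 0's (an empty string stands for zero)."""
    return bits.lstrip("0")


def trim_msb(bits: str) -> str:
    """TrimMSB(C): drop the most significant bit."""
    return bits[1:]


def width(bits: str) -> int:
    """|C|: bitwidth of TrimLZ(C)."""
    return len(trim_lz(bits))


def dol(c: str, s: str) -> int:
    """DOL(C, S) = |C| - |S|."""
    return width(c) - width(s)


def residue(b: str, c: str) -> str:
    """Residue(B, C) = TrimLZ(B - C); assumes B >= C."""
    return trim_lz(format(int(b, 2) - int(c, 2), "b"))


print(trim_lz("011010"), trim_msb("11010"), dol("11010", "1001"), residue("11010", "10010"))
# prints: 11010 1010 1 1000
```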
Fig. 12 (a) A CAT Example. (b) The Delay and Area Calculations.
4.1.2 Coefficient Assembly Tree Enumerator
The CAT enumerator enumerates the CAT of one coefficient. To find all CATs, we execute the CAT enumerator n times for the n coefficients in the coefficient set. The recursive pseudo code of the CAT enumerator is as follows.
Initial: A=∅;
Note that the alphabet A is initially empty. To simplify the calculation, we trim the leading zeros of C before starting CAT. The first for loop (line 2 to line 5) generates the usable symbols. Num_1 indicates the number of 1’s that must be chosen in C’. For example, assume C=011010, so that C’=TrimMSB(11010)=1010. The loop covers the 0-combinations ($C^2_0$), the 1-combinations ($C^2_1$) and the 2-combinations ($C^2_2$) of the two 1’s in C’. After the for loop, these three kinds of combinations yield 4 possible symbols, 1(S1), 11(S3), 1001(S9) and 1101(S13), which are added to the set Symbol. In the second for loop (line 6 to line 10), for each symbol S in Symbol, we create a child node r=F(S, d), where d is DOL(C, S), add S to A, and recursively call CAT with Residue(C, S<<d). For example, we create a child node r=F(1,d) for Root, where d is DOL(011010,1)=4, add 1(S1) to A and call CAT(r,1010). The recursion terminates when the first for loop does not generate any symbol.
Sym_Enum is a sub-function called in line 5 of CAT to generate the usable symbols for C. Before starting Sym_Enum, we trim the first 1 of C and use it as the initial symbol S. Sym_Enum then grows S into the usable symbols and stores them in the set Symbol. In each recursive call, we check MSB(C’). If it is 0, we skip this bit, let S=S<<1 and recursively call Sym_Enum for TrimMSB(C’). If it is 1, we can either treat this 1 as a 0 and skip it, provided that enough 1’s remain in C’, or pick this 1, let S=(S<<1)+1 and decrease num_1 by 1. The recursion terminates when num_1 counts down to 0. An example for Sym_Enum(1010,1,1) is illustrated in Fig. 14. In this example, S grows into 11(S3) and 1001(S9), which are the 1-combinations of the two 1’s in C’=1010 ($C^2_1$).
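The enumeration described above can be sketched compactly. This is not the thesis pseudo code: it is a minimal Python sketch that assumes coefficients are non-negative integers, replaces the bit-by-bit recursion of Sym_Enum with `itertools.combinations`, and returns each path as a list of (symbol, shift) fragments. The names `enumerate_symbols` and `enumerate_paths` are hypothetical.

```python
from itertools import combinations


def enumerate_symbols(c: int):
    """Return (symbol, shift) pairs whose symbol contains the MSB of c.

    Every subset of the remaining 1's of c is combined with the MSB;
    trailing zeros are shifted out so that the symbol itself is odd.
    """
    ones = [i for i in range(c.bit_length()) if (c >> i) & 1]
    msb, rest = ones[-1], ones[:-1]
    result = []
    for num_1 in range(len(rest) + 1):           # number of extra 1's to pick
        for chosen in combinations(rest, num_1):
            low = min(chosen, default=msb)       # lowest bit used by the symbol
            value = (1 << msb) + sum(1 << i for i in chosen)
            result.append((value >> low, low))
    return result


def enumerate_paths(c: int):
    """Return all root-to-leaf paths of the tree as lists of (symbol, shift)."""
    if c == 0:
        return [[]]                              # nothing left to match
    paths = []
    for symbol, shift in enumerate_symbols(c):
        for tail in enumerate_paths(c - (symbol << shift)):
            paths.append([(symbol, shift)] + tail)
    return paths


for path in enumerate_paths(0b11010):
    print(path)
```

For C=11010 this yields the symbols 1, 3, 9 and 13 at the root and five paths in total, including the path {(S1,4),(S5,1)} used in the terminology examples.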
Fig. 13 Illustration for CAT(R, TrimLZ(011010))
Fig. 14 Illustration for Sym_Enum(1010,1,1)
4.1.3 Coefficient Assembly Tree Complexity
To analyze the complexity of the CAT, we derive NumP(k).
NumP(k): Number of possible paths in CAT(C) such that NZB(C)=k.
NumP(k) depends only on NZB(C). For example, the CAT of 10011 and the CAT of 11001 have the same number of paths because they have the same number of non-zero bits. According to the CAT enumerator, the recursive equation is as follows:
$$NumP(k) = \begin{cases} 1, & k = 0, 1 \\ \sum_{i=0}^{k-1} C^{k-1}_i \cdot NumP(k-i-1), & \text{otherwise} \end{cases} \qquad (4.1)$$
where $C^{k-1}_i$ represents the number of i-combinations of k-1 bits.
When k=2, the number of paths is NumP(2)=$C^1_0$·NumP(1)+$C^1_1$·NumP(0)=1+1=2; when k=3, NumP(3)=$C^2_0$·NumP(2)+$C^2_1$·NumP(1)+$C^2_2$·NumP(0)=1·2+2·1+1·1=5. Table I shows NumP(k).
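The recurrence can be evaluated directly. This is a minimal sketch that assumes the reading of (4.1) implied by the worked values NumP(2)=2 and NumP(3)=5, i.e. a sum over i of (i-combinations of k-1 bits) times NumP(k-i-1); the function name `num_p` is illustrative.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def num_p(k: int) -> int:
    """Number of paths in a CAT whose coefficient has k non-zero bits."""
    if k <= 1:
        return 1
    # Pick i of the k-1 non-leading 1's for the first symbol;
    # the residue then has k-i-1 non-zero bits.
    return sum(comb(k - 1, i) * num_p(k - i - 1) for i in range(k))


print([num_p(k) for k in range(7)])
# prints: [1, 1, 2, 5, 15, 52, 203]
```

Under this reading the values coincide with the Bell numbers, which grow faster than exponentially and explain why the later sections concentrate on shrinking the tree.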
4.1.4 Pruned Coefficient Assembly Tree
To take timing into account, we can check the maximum delay while enumerating the CAT and prune the paths that violate the timing constraint. The pseudo code of the modified CAT, called PCAT, is as follows.
Initial: A=∅;
12 PCAT(r, Residue(C, S<<d), Pathmiddle);
End
The differences between CAT and PCAT are marked in bold. To calculate the maximum delay of a path, we call PCAT with Pathmiddle, which records every node enumerated so far. The significant difference is that, before creating a child node, we must estimate the delay that results if the child node F(S, d) is added to the temporary path Pathmiddle (line 8 and line 9).
For example, let the coefficient C be 11011101, and assume that the delay and area ratio of CPA to CSA is 2 and the timing constraint is 5. When we reach the node F(S5, 4) in Fig. 15, i.e., PCAT(F(S5,4), 1101, {F(S1,7), F(S5,4)}), we add the node F(S1,3), marked with an arrow, to Pathmiddle and calculate the maximum delay of Pathmiddle={F(S1,7), F(S5,4), F(S1,3)} to decide whether to create the child node F(S1,3). Because the delay of Pathmiddle is 5, which is equal to the timing constraint, PCAT is not executed for node F(S1,3). By pruning at F(S1,3), the child nodes of F(S1,3), marked by dashed circles, are not enumerated.
Because of this pruning, a Pruned Coefficient Assembly Tree (PCAT) is generally not enumerated as completely as a CAT, so its complexity is less pessimistic than the analysis in Section 4.1.3. In Section 5.4, we show the reduction rate achieved by pruning under different timing constraints.
Fig. 15 A Pruning Example
4.1.5 Reduced Pruned Coefficient Assembly Tree
In this section, we propose a method, called the reduction phase, to further reduce the complexity of the Coefficient Assembly Tree. Once a path is completed in the PCAT process (i.e., no further child node is created), we check SymSet(Path) and Area(Path) to decide whether to eliminate this path. For each distinct SymSet(Path), we keep only the smallest-area path among the paths completed so far.
Using a hash table, the SymSet(Path) check can be done in linear time. For example, let the area ratio of CPA to CSA be 2 and the coefficient be C=11011101. Assume that path1={F(S1,7),F(S5,4),F(S1,3),F(S1,2),F(S1,0)}, with SymSet(Path1)={S1, S5}, has already been stored in the temporary storage, and that path2={F(S1,7),F(S5,4),F(S1,3),F(S5,0)}, with SymSet(Path2)={S1, S5}, is now completed. Because they have the same SymSet and Area(Path2)=4 is smaller than Area(Path1)=5, we replace path1 with path2 in the temporary storage.
In this case path1 is redundant: eliminating it does not affect the optimal solution, because path2 is the better choice regardless of the other coefficients. In Chapter 5, we show that the reduction phase removes about 20% of the total number of paths.
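The reduction phase amounts to keeping one minimum-area path per symbol set in a hash table. The sketch below is illustrative only. It assumes that a path is a list of (symbol, shift) fragments and that a path with n ≥ 2 fragments costs (n-2) CSAs plus one CPA in units of CSA area; this area model is an assumption inferred from the numbers Area(Path1)=5 and Area(Path2)=4 above, not a formula quoted from the thesis. The names `path_area` and `reduce_paths` are hypothetical.

```python
def path_area(path, cpa_ratio=2):
    """Assumed area model: (n-2) CSAs + 1 CPA for n >= 2 fragments, else 0."""
    n = len(path)
    return 0 if n < 2 else (n - 2) + cpa_ratio


def reduce_paths(paths, cpa_ratio=2):
    """Keep only the smallest-area path for each distinct SymSet(Path)."""
    best = {}  # hash table: frozenset of symbols -> best path seen so far
    for path in paths:
        key = frozenset(symbol for symbol, _ in path)
        kept = best.get(key)
        if kept is None or path_area(path, cpa_ratio) < path_area(kept, cpa_ratio):
            best[key] = path
    return list(best.values())


# The two paths for C=11011101 from the example; both use SymSet {S1, S5}.
path1 = [(1, 7), (5, 4), (1, 3), (1, 2), (1, 0)]
path2 = [(1, 7), (5, 4), (1, 3), (5, 0)]
print(path_area(path1), path_area(path2), reduce_paths([path1, path2]) == [path2])
# prints: 5 4 True
```

Keying the table on a `frozenset` makes the lookup independent of the order in which symbols appear on the path.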
4.2 Working Example
We illustrate a complete example with two coefficients. The coefficient set is {101101(C0), 011010(C1)}, the delay and area ratio of CPA to CSA is 2, and the timing constraint is 4. Fig. 16 shows the overall process; a dark node indicates that pruning occurs there because of a timing violation. After PCAT is executed for C0, the alphabet A is fully extended. Applying the PCAT enumerator and the reduction phase to each coefficient yields two RPCATs with five paths per coefficient, as shown in Fig. 17.
Fig. 16 Illustration of RPCAT(C0) in the Working Example
Fig. 17 The RPCAT Results for C0 and C1
4.3 Integer Linear Programming (ILP) Formulation
In the previous sections, we enumerated all possible matches and constructed the CAT.
Next, we formulate our problem as an ILP problem and use an ILP solver to select the best path for every coefficient.
4.3.1 Variables
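In the proposed ILP formulation, two kinds of variables model the choice of a path in a CAT. VarPath indicates whether a path is selected, and VarS indicates whether a symbol is selected into the alphabet. The following equations list the corresponding ILP formulations.
$$VarPath_{i,j} = \begin{cases} 1, & \text{if } Path_{i,j} \text{ is selected} \\ 0, & \text{otherwise} \end{cases} \qquad (4.2)$$
$$VarS_i = \begin{cases} 1, & \text{if } S_i \text{ is selected into the alphabet} \\ 0, & \text{otherwise} \end{cases} \qquad (4.3)$$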
Our proposed FIR filter has a two-stage architecture: the alphabet generation unit and the fragment summation unit. Since VarS is a 0-1 variable, the area of the alphabet generation unit is $\sum_{i=0}^{m-1} Area(S_i) \cdot VarS_i$. Similarly, since VarPath is also a 0-1 variable, the area of the fragment summation unit is $\sum_{i=0}^{n-1} \sum_{j=0}^{k-1} Area(Path_{i,j}) \cdot VarPath_{i,j}$. To minimize the total area cost, the objective function is formulated as:
$$\text{minimize} \quad \sum_{i=0}^{m-1} Area(S_i) \cdot VarS_i + \sum_{i=0}^{n-1} \sum_{j=0}^{k-1} Area(Path_{i,j}) \cdot VarPath_{i,j} \qquad (4.4)$$
where n is the number of coefficients, k is the number of paths of the ith coefficient and m is the number of symbols in the alphabet A.
4.3.3 Existence Constraint
In a CAT, each path represents one implementation, which is composed of several symbols. If Pathi,j is selected, its symbols must also exist in the alphabet, i.e., every S ∈ SymSet(Pathi,j). Therefore, the existence constraint guarantees that all symbols of a selected path exist in the alphabet. The formulation is as follows:
$$VarPath_{i,j} \le \min\{ VarS_s \mid S_s \in SymSet(Path_{i,j}) \}, \quad \forall\, Path_{i,j} \qquad (4.5)$$
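Because min is not a linear operator, a constraint such as (4.5) is commonly passed to an ILP solver as one inequality per symbol. The following is a sketch of this standard linearization, assuming every VarS and VarPath is a 0-1 variable; the index s is introduced here only for illustration.

```latex
VarPath_{i,j} \le VarS_s \qquad \forall\, Path_{i,j},\ \forall\, S_s \in SymSet(Path_{i,j})
```

If $VarPath_{i,j}=1$, every such inequality forces the corresponding $VarS_s=1$; if $VarPath_{i,j}=0$, the inequalities impose no restriction, which matches the intent of the existence constraint.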
4.3.4 Uniqueness Constraint
A coefficient can be produced by several implementations. Choosing more than one implementation wastes hardware. To ensure that exactly one path is chosen for each coefficient, the uniqueness constraint is formulated as:
$$\sum_{j=0}^{k-1} VarPath_{i,j} = 1 \qquad (4.6)$$
where k is the number of paths of the ith coefficient.
4.3.5 ILP Example
Section 4.2 illustrates an example with two coefficients. After the above procedure, we obtain 5 possible paths for each of the coefficients C0 and C1, as shown in Fig. 17. The objective is then to minimize
$4VarPath_{0,0}+2VarPath_{0,1}+2VarPath_{0,2}+2VarPath_{0,3}+0VarPath_{0,4}$
$+\,3VarPath_{1,0}+2VarPath_{1,1}+2VarPath_{1,2}+2VarPath_{1,3}+0VarPath_{1,4}$
$+\,0VarS_1+2VarS_3+2VarS_5+2VarS_9+3VarS_{11}+3VarS_{13}+3VarS_{37}+3VarS_{41}+4VarS_{45}$.
In addition, the corresponding constraints are listed below:
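Existence constraint: by (4.5), each $VarPath_{i,j}$ is bounded by the VarS variables of the symbols in SymSet(Pathi,j) shown in Fig. 17.
Uniqueness constraint:
$VarPath_{0,0}+VarPath_{0,1}+VarPath_{0,2}+VarPath_{0,3}+VarPath_{0,4}=1$;
$VarPath_{1,0}+VarPath_{1,1}+VarPath_{1,2}+VarPath_{1,3}+VarPath_{1,4}=1$.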
Finally, we use the ILP solver Gurobi [17] to solve this ILP problem. The ILP result is $VarPath_{0,2}=1$, $VarPath_{1,3}=1$ and $VarS_9=1$, as shown in Fig. 18, and the best solution is Area(Path0,2)+Area(Path1,3)+Area(S9)=2+2+2=6.
Fig. 18 ILP Results of Working Example
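For an instance this small, the optimum can be cross-checked without an ILP solver. The sketch below swaps in an exhaustive search for the ILP: it tries every combination of one path per coefficient (uniqueness) and charges the area of every symbol those paths use (existence). The symbol areas are the coefficients of the objective above. The C1 entries follow the five matches of 011010, while the C0 list is only an illustrative subset assumed for this sketch, since Fig. 17 is not reproduced here. All names are hypothetical.

```python
from itertools import product

# Symbol areas taken from the objective function of the ILP example.
SYMBOL_AREA = {1: 0, 3: 2, 5: 2, 9: 2, 13: 3, 45: 4}

# Candidate paths as (fragment summation area, set of symbols used).
# Assumption: the C0 list is an illustrative subset, not the content of Fig. 17.
PATHS = {
    "C0": [(4, {1}), (2, {5}), (2, {9}), (0, {45})],
    "C1": [(3, {1}), (2, {1, 5}), (2, {1, 9}), (2, {1, 3}), (0, {13})],
}


def best_selection(paths, symbol_area):
    """Exhaustively pick one path per coefficient with minimum total area."""
    names = sorted(paths)
    best_cost, best_choice = None, None
    for choice in product(*(range(len(paths[name])) for name in names)):
        used, cost = set(), 0
        for name, j in zip(names, choice):
            area, symbols = paths[name][j]
            cost += area          # fragment summation unit
            used |= symbols       # symbols that must exist in the alphabet
        cost += sum(symbol_area[s] for s in used)  # alphabet generation unit
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, dict(zip(names, choice))
    return best_cost, best_choice


cost, choice = best_selection(PATHS, SYMBOL_AREA)
print(cost)  # prints: 6
```

On this toy data two selections tie at a total area of 6, one of which shares S9 between both coefficients as in the reported result. Exhaustive search grows exponentially with the number of coefficients, which is why the thesis hands the real problem to an ILP solver.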
Chapter 5
Experimental Results
5.1 Experiments Setup
The proposed BSE algorithm, GOSM, is implemented in C++ in a Linux environment. We use the same environment to implement SLSM [14]. The coefficient sets of the test designs are generated with the Matlab FDAtool [18].
We consider two widely used CPA architectures. First, a ripple carry adder (RCA) is the simplest adder structure, in which each carry bit must wait for the previous full adder; thus, its critical path delay is longer than that of other adder structures. Second, a carry look-ahead adder (CLA) calculates the carry bits before the sum, which reduces the critical path delay dramatically.
Table II reports the synthesis results of these two CPA architectures and of the CSA (refer to Section 3.1) for different bitwidths in the TSMC 180nm process. RCA is the smallest but the slowest, and CSA is the fastest, with an area that increases linearly with the bitwidth. High-speed applications, such as software defined radio (SDR), require a filter with a short critical path delay. Therefore, we choose CLA as the CPA and set the area ratio to 4 (1368/376) for 16 bits and 3 (968/306) for 12 bits, and the delay ratio to 4 (1.2/0.32) for 16 bits and 3 (0.95/0.32) for 12 bits, each rounded to the nearest integer.
Table II Synthesis Results of Different Adder Architectures in TSMC 0.18μm
| Architecture | Metric | 4-bit | 8-bit | 12-bit | 16-bit |
|---|---|---|---|---|---|
| RCA | Delay (ns) | 2.05 | 2.85 | 3.92 | 4.99 |
| RCA | Area (μm²) | 134 | 141 | 211 | 282 |
| CLA | Delay (ns) | 0.60 | 0.79 | 0.95 | 1.2 |
| CLA | Area (μm²) | 301 | 814 | 968 | 1368 |
| CSA | Delay (ns) | 0.32 | 0.32 | 0.32 | 0.32 |
| CSA | Area (μm²) | 190 | 235 | 306 | 376 |
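The integer ratios used in the experiments can be recomputed from the CLA and CSA entries of Table II. This is a small sketch; the dictionary layout and the name `ratio` are illustrative, and rounding to the nearest integer is an assumption that reproduces the quoted values.

```python
# 12-bit and 16-bit CLA and CSA results copied from Table II.
TABLE_II = {
    "CLA": {"delay": {12: 0.95, 16: 1.2}, "area": {12: 968, 16: 1368}},
    "CSA": {"delay": {12: 0.32, 16: 0.32}, "area": {12: 306, 16: 376}},
}


def ratio(metric: str, bits: int) -> float:
    """CPA (CLA) to CSA ratio for the given metric and bitwidth."""
    return TABLE_II["CLA"][metric][bits] / TABLE_II["CSA"][metric][bits]


for bits in (16, 12):
    print(bits, round(ratio("area", bits)), round(ratio("delay", bits)))
# prints:
# 16 4 4
# 12 3 3
```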
5.2 Case Study I : Versus SLSM
Tables III and IV list 10 filter designs with 16-bit and 12-bit coefficients, respectively. The designs are named lp (lowpass), hp (highpass), bp (bandpass) or bs (bandstop), followed by the filter length. The area and delay ratio is 4 for 16-bit and 3 for 12-bit coefficients. First, we obtain the maximum delay and the area cost of every design with the SLSM method, as shown in the 2nd and 3rd columns. Then, we use these maximum delays as the timing constraints of the GOSM method; the corresponding area results are shown in the columns on the right. Rate is the percentage (area cost of SLSM − area cost of GOSM)/(area cost of SLSM), and #sym. is the number of symbols that the ILP solver actually chooses in GOSM.
Under the same timing constraint, GOSM reduces the area cost by 26% on average for 16-bit coefficients and by 22% for 12-bit coefficients. With SLSM, the area cost of the filter increases linearly with the filter length, whereas with GOSM it increases more slowly.
Furthermore, with GOSM the ILP solver is not limited to five symbols: it chooses about 18 symbols on average, and the number grows with the filter length. In longer filters, complex symbols appear frequently in the coefficients. The benefit of these complex symbols explains why the gap in area between SLSM and GOSM widens for longer filters. Therefore, the area reduction rate increases with the filter length.
Because GOSM expands the solution space, it can also find solutions under tighter timing constraints, as illustrated in the delay-1 and delay-2 columns. The maximum delay can be reduced by at most 20%.
Table III Results of Filters with Bitwidth of Coefficient=16 and Area & Delay Ratio=4
| Filter | SLSM delay | SLSM area | GOSM (delay): area (rate) | #sym. | GOSM (delay-1): area (rate) | #sym. | GOSM (delay-2): area (rate) | #sym. |
|---|---|---|---|---|---|---|---|---|
| bs_31 | 10 | 92 | 77 (16.3%) | 6 | 80 (13%) | 7 | 97 (-5.4%) | 10 |
| lp_32 | 10 | 89 | 76 (14.6%) | 6 | 80 (10.1%) | 8 | 95 (-6.7%) | 9 |
| hp_63 | 10 | 144 | 126 (12.5%) | 10 | 138 (4.2%) | 8 | 166 (-15.3%) | 12 |
| bp_64 | 11 | 156 | 135 (13.5%) | 11 | 135 (13.5%) | 12 | 147 (5.7%) | 11 |
| lp_127 | 10 | 231 | 186 (19.5%) | 13 | 189 (18.2%) | 13 | 234 (-1.3%) | 22 |
| bp_128 | 10 | 306 | 248 (18.9%) | 17 | 271 (11.4%) | 17 | 344 (-12.4%) | 33 |
| hp_255 | 10 | 387 | 267 (31%) | 27 | 276 (28.7%) | 27 | 311 (19.6%) | 33 |
| lp_256 | 10 | 395 | 254 (35.7%) | 21 | 262 (33.7%) | 21 | 300 (24.1%) | 26 |
| bs_511 | 10 | 626 | 354 (43.5%) | 35 | 357 (43%) | 35 | 424 (32.3%) | 44 |
| lp_512 | 10 | 820 | 378 (53.9%) | 36 | 387 (52.8%) | 36 | 440 (46.4%) | 44 |
| Avg. | 10.1 | 324.6 | 210.1 (26%) | 18.2 | 217.5 (23%) | 18.4 | 255.8 (16.3%) | 24.4 |
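The rate column can be reproduced from the area columns. The sketch below recomputes the rates for the GOSM results under the SLSM delay and shows that the 26% in the Avg. row corresponds to the mean of the per-design rates; the names `reduction_rate` and `AREAS` are illustrative.

```python
def reduction_rate(slsm_area: float, gosm_area: float) -> float:
    """Rate = (area of SLSM - area of GOSM) / area of SLSM, in percent."""
    return 100.0 * (slsm_area - gosm_area) / slsm_area


# (SLSM area, GOSM area under the same delay) for the ten designs of Table III.
AREAS = [(92, 77), (89, 76), (144, 126), (156, 135), (231, 186),
         (306, 248), (387, 267), (395, 254), (626, 354), (820, 378)]

rates = [reduction_rate(slsm, gosm) for slsm, gosm in AREAS]
print(round(rates[0], 1), round(rates[-1], 1), round(sum(rates) / len(rates)))
# prints: 16.3 53.9 26
```

A negative rate, such as reduction_rate(92, 97) for bs_31 in the delay-2 column, means that GOSM pays extra area to meet the tighter timing constraint.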
Table IV Results of Filters with Bitwidth of Coefficient=12 and Area & Delay Ratio=3
Designs SLSM GOSM
filter delay area delay delay-1 delay-2
area (rate) #sym. area (rate) #sym. area (rate) #sym.
5.3 Case Study II : Synthesis Result
We use Synopsys Design Compiler [19] with the TSMC 0.18μm CMOS process on a workstation. Table V shows the synthesis result of the test design lp_32 with 16-bit coefficients.
To compare GOSM with SLSM, our estimations are taken from Case Study I and multiplied by the delay or area of one CSA unit; the corresponding synthesis results are shown on the right. In this design, we slightly overestimate, by about 13% in delay and 6.6% in area. In the Diff. row, we estimate that GOSM reduces the delay by 20% at the cost of a 6.74% area overhead. In synthesis, GOSM reduces the delay by 21% at the cost of a 7.1% area overhead. This means that the reduction rates in Tables III and IV correspond closely to the synthesis results.
Table V Synthesis Result of Test Design: lp_32
| Algorithm | Our estimation: Delay (ns) | Our estimation: Area (μm²) | Synthesis result of multiplier block: Delay (ns) (error rate) | Synthesis result of multiplier block: Area (μm²) (error rate) |
|---|---|---|---|---|
| SLSM | 10*0.32=3.2 | 89*376=33464 | 2.8 (12.5%) | 31201.6 (6.7%) |
| GOSM | 8*0.32=2.56 | 95*376=35720 | 2.2 (14.4%) | 33420.3 (6.5%) |
| Diff. | 20% | -6.74% | 21% | -7.1% |
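The Diff. row follows from the SLSM and GOSM rows by a relative difference with SLSM as the baseline. This is a minimal sketch using the values of Table V; the helper name `diff_percent` is illustrative.

```python
def diff_percent(slsm: float, gosm: float) -> float:
    """Relative change from SLSM to GOSM in percent (positive = reduction)."""
    return 100.0 * (slsm - gosm) / slsm


# Estimated values: unit counts from Case Study I times one 16-bit CSA.
est_delay = diff_percent(10 * 0.32, 8 * 0.32)
est_area = diff_percent(89 * 376, 95 * 376)

# Synthesized values from the right-hand columns of Table V.
syn_delay = diff_percent(2.8, 2.2)
syn_area = diff_percent(31201.6, 33420.3)

print(round(est_delay, 2), round(est_area, 2), round(syn_delay), round(syn_area, 1))
# prints: 20.0 -6.74 21 -7.1
```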
5.4 Case Study III : Pruned CAT
In this case study, Table VI compares the number of paths without and with pruning (Section 4.1.4). The coefficient bitwidth is 16. The 2nd column shows the number of paths without pruning, i.e., with the CAT algorithm. The columns on the right show the number of paths with the PCAT algorithm under timing constraints of 8 to 11. With a timing constraint of 8, fewer than 5% of the paths remain. With a timing constraint of 10, pruning still removes 37.5% of the paths on average.
Table VI Results of PCAT
Designs # of possible paths with reduction phase(reduction rate) filter Without
Table VII compares the number of paths before and after the reduction phase (Section 4.1.5). The coefficient bitwidth is 16. There is no timing constraint and hence no pruning.
In the # of possible paths column, the reduction phase removes 27.8% of the paths on average and saves ILP solver run time. For bp_128, we spend no more than 1 extra second on enumerating the RCAT but gain about 50% speedup in ILP solving time. The reduction phase thus reduces the ILP solver overhead at the cost of a slightly longer enumeration time.
Table VII Results of RCAT
Designs Before reduction After reduction
filter # of possible paths Run time(sec.) # of possible paths (reduction rate)
5.6 Case Study V : Reduced Pruned CAT
Combining pruning with the reduction phase, Table VIII shows the results of RPCAT.
Table VIII corresponds to Table VI with the reduction phase applied. Compared with Table VI, RPCAT shows the same trend along the timing axis but reduces the number of paths by about a further 20% when the timing constraint is 9 to 11. Together, these two strategies succeed in reducing the CAT complexity.
Table VIII Results of RPCAT
Number of possible paths with reduction phase (reduction rate)
| Filter | Without pruning | Timing constraint 11 | Timing constraint 10 | Timing constraint 9 | Timing constraint 8 |
|---|---|---|---|---|---|
| bs_31 | 3965 | 3725 (6.1%) | 2397 (39.5%) | 397 (90%) | 49 (98.8%) |
| lp_32 | 1428 | 1407 (1.5%) | 1186 (17%) | 336 (76.5%) | 49 (96.6%) |
| hp_63 | 19782 | 17636 (10.8%) | 5940 (70%) | 450 (97.7%) | 111 (99.4%) |
| bp_64 | 13789 | 12800 (7.2%) | 7591 (44.9%) | 575 (95.8%) | 113 (99.2%) |
| lp_127 | 6786 | 6481 (4.5%) | 4641 (31.6%) | 1017 (85%) | 214 (96.8%) |
| bp_128 | 43710 | 39167 (10.4%) | 14120 (67.7%) | 1260 (97.1%) | 226 (99.5%) |
| hp_255 | 12930 | 12121 (6.3%) | 7673 (40.7%) | 1409 (89.1%) | 447 (96.5%) |
| lp_256 | 12732 | 11902 (6.5%) | 7471 (41.3%) | 1255 (90%) | 469 (96.3%) |
| bs_511 | 12179 | 11737 (3.6%) | 8793 (27.8%) | 2118 (82.6%) | 786 (93.5%) |
| lp_512 | 12393 | 12019 (3.1%) | 9403 (24.1%) | 3223 (74%) | 1000 (92%) |
| Avg. | 13969.4 | 12899 (6%) | 6921.5 (40.5%) | 1204 (88%) | 346 (96.9%) |
Chapter 6
Conclusions & Future Works
In this thesis, Global Optimal Symbol Match (GOSM) is proposed for FIR filter synthesis.
This method explores a large solution space, finds an optimal solution under the given timing constraint by solving the formulated ILP problem, produces delay- and area-optimal BSE-based FIR filters, and supports trade-offs between area and delay.
Compared to SLSM, GOSM reduces the area cost by about 25% under the same timing constraint and reduces the maximum delay by at most 20%.
According to Case Study II, a BSE-based FIR filter generated by GOSM can achieve a clock rate of up to about 400MHz in the 0.18μm process, which makes it suitable for high-speed DSP applications.
We also propose two methods, PCAT and RCAT, to reduce the complexity of the coefficient assembly tree. PCAT removes 37.5% of the paths when the timing constraint is 10. RCAT speeds up ILP solving by up to 50%.
GOSM produces an optimal solution in the BSE architecture, with two reduction methods that lower the complexity of the coefficient assembly tree. However, for some coefficients the number of paths is still large and the ILP solver takes too long. For coefficients with a large NZB(C) (e.g., up to 13 bits), we can split the coefficient C into several sub-coefficients with fewer non-zero bits. Although optimality is sacrificed, a good-quality BSE-based filter can still be generated for an extremely large filter.
References
[1] K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[2] D. R. Bull and D. H. Horrocks, “Primitive operator digital filters,” Proceedings Inst. Elect. Eng. - Circuits Devices Systems, vol. 138, no. 3, pp. 401–412, Jun. 1991.
[3] A. G. Dempster and M. D. Macleod, “Use of minimum-adder multiplier blocks in FIR digital filters,” IEEE Transactions on Circuits and Systems. II, Analog Digital Signal Process, vol. 42, no. 9, pp. 569–577, Sep. 1995.
[4] H.-J. Kang and I.-C. Park, “FIR filter synthesis algorithms for minimizing the delay and the number of adders,” IEEE Transactions on Circuits and Systems. II, Analog Digital Signal Process, vol. 48, no. 8, pp. 770–777, Aug. 2001.
[5] A. Dempster et al., “Designing multiplier blocks with low logic depth,” IEEE international symposium on Circuits and Systems, May 2002, vol. 5, pp. 773–776.
[6] Y. Takahashi and M. Yokoyama, “New cost-effective VLSI implementation of multiplierless FIR filter using common subexpression elimination," IEEE international symposium on Circuits and Systems, May 2005, vol. 2, pp. 1445–1448.
[7] C. Yao, H. Chen, T. Lin, C. Chien and C. Hsu, “A novel common subexpression elimination method for synthesizing fixed-point FIR filters,” IEEE Transactions on Circuits and Systems I, pp. 2211–2215, Nov. 2004.
[8] A. Hosangadi et al., “Algebraic methods for optimizing constant multiplications in linear systems,” J. VLSI Signal Process Systems, vol. 49, no. 1, pp. 31–50, Oct. 2007.
[9] O. Gustafsson and L. Wanhammar, “ILP modelling of the common subexpression sharing problem,” IEEE International Conference on Electronics, Circuits and Systems, Dec. 2002, vol. 3, pp. 1171–1174.
[10] S. Vijay et al., “A greedy common subexpression elimination algorithm for implementing FIR filters,” IEEE international symposium on Circuits and Systems, May 2007, pp. 3451–3454.
[11] R. M. Hewlitt and E. S. Swartzlander, “Canonical signed digit representation for FIR digital filters,” IEEE Workshop on Signal Processing Systems, 2000, pp. 416–426.
[12] J. H. Choi, et al., "Variation-aware low-power synthesis methodology for fixed-point FIR filters," IEEE Transactions on Computer-Aided Design Integrated Circuits and Systems, vol. 28, pp. 87-97, 2009.
[13] G Karakonstantis, N. Banerjee and K. Roy, “Process-variation resilient and voltage-scalable DCT architecture for robust low-power computing,” IEEE Transactions on Very Large Scale Integrated Systems, pp. 1461-1470, 2010.
[14] R. Mahesh and A. P. Vinod, “A new common subexpression elimination algorithm for realizing low complexity higher order digital filters,” IEEE Transactions on Computer-Aided Design Integrated Circuits and Systems, pp. 217–219, Feb. 2008.
[15] M. M. Peiro, E. I. Boemo, and L. Wanhammar, “Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm,” IEEE Transactions on Circuits and Systems. II, Analog Digital Signal Process, vol. 9, no. 3, pp. 196–203, Mar. 2002.
[16] F. Xu, C. H. Chang, and C. C. Jong, “Contention resolution algorithm for common subexpression elimination in digital filter design,” IEEE Trans. Circuits Syst. II, Exp.