
Chapter 4 Our Proposed Algorithm

4.3 Integer Linear Programming (ILP) Formulation

4.3.5 ILP Example

In Section 4.2, we illustrated the example with two coefficients. After the above procedure, we obtain 5 possible paths for each of the coefficients C0 and C1, as shown in Fig. 17. The objective is then to minimize

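\[
\begin{aligned}
& 4\,\mathit{VarPath}_{0,0} + 2\,\mathit{VarPath}_{0,1} + 2\,\mathit{VarPath}_{0,2} + 2\,\mathit{VarPath}_{0,3} + 0\,\mathit{VarPath}_{0,4} \\
&+ 3\,\mathit{VarPath}_{1,0} + 2\,\mathit{VarPath}_{1,1} + 2\,\mathit{VarPath}_{1,2} + 2\,\mathit{VarPath}_{1,3} + 0\,\mathit{VarPath}_{1,4} \\
&+ 0\,\mathit{VarS}_{1} + 2\,\mathit{VarS}_{3} + 2\,\mathit{VarS}_{5} + 2\,\mathit{VarS}_{9} + 3\,\mathit{VarS}_{11} + 3\,\mathit{VarS}_{13} + 3\,\mathit{VarS}_{37} + 3\,\mathit{VarS}_{41} + 4\,\mathit{VarS}_{45}.
\end{aligned}
\]

In addition, the corresponding constraints are listed below: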

Existence constraint:

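\[
\mathit{VarPath}_{0,0} + \mathit{VarPath}_{0,1} + \mathit{VarPath}_{0,2} + \mathit{VarPath}_{0,3} + \mathit{VarPath}_{0,4} = 1;
\]

\[
\mathit{VarPath}_{1,0} + \mathit{VarPath}_{1,1} + \mathit{VarPath}_{1,2} + \mathit{VarPath}_{1,3} + \mathit{VarPath}_{1,4} = 1.
\]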

Finally, we use an ILP solver, Gurobi [17], to solve this ILP problem. The ILP result is VarPath0,2 = 1, VarPath1,3 = 1 and VarS9 = 1, as shown in Fig. 18, and the best solution is

Area(Path0,2)+Area(Path1,3)+Area(S9)=2+2+2=6.

Fig. 18 ILP Results of Working Example
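
The reported solution can be checked against the objective and the existence constraints with a few lines of Python. This is only a sanity check under stated assumptions, not the solver model: the cost tables are copied from the objective above, the dictionary and function names are illustrative, and the constraints that link paths to symbols are not listed in this section, so they are not modelled here.

```python
# Objective coefficients copied from the ILP example above.
# Keys of path_cost are (coefficient index, path index); names are illustrative.
path_cost = {
    (0, 0): 4, (0, 1): 2, (0, 2): 2, (0, 3): 2, (0, 4): 0,
    (1, 0): 3, (1, 1): 2, (1, 2): 2, (1, 3): 2, (1, 4): 0,
}
symbol_cost = {1: 0, 3: 2, 5: 2, 9: 2, 11: 3, 13: 3, 37: 3, 41: 3, 45: 4}

# Reported result: VarPath0,2 = VarPath1,3 = VarS9 = 1, all other variables 0.
chosen_paths = {(0, 2), (1, 3)}
chosen_symbols = {9}


def satisfies_existence(paths, num_coefficients=2):
    """Existence constraint: exactly one path is selected per coefficient."""
    counts = [0] * num_coefficients
    for coefficient, _path in paths:
        counts[coefficient] += 1
    return all(count == 1 for count in counts)


def objective(paths, symbols):
    """Sum of the costs of the selected paths and selected symbols."""
    return sum(path_cost[p] for p in paths) + sum(symbol_cost[s] for s in symbols)


print(satisfies_existence(chosen_paths))        # True
print(objective(chosen_paths, chosen_symbols))  # 6
```

The value 6 matches Area(Path0,2)+Area(Path1,3)+Area(S9)=2+2+2.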

Chapter 5 Experimental Results

5.1 Experiments Setup

The proposed BSE algorithm, GOSM, is implemented in a C++/Linux environment. We also implement SLSM [14] in the same environment. The coefficient sets of the test designs are generated with the Matlab FDAtool [18].

We consider two widely used CPA architectures. First, a ripple carry adder (RCA) is the simplest adder structure, in which each full adder must wait for the carry bit of the previous full adder. Thus, its critical path delay is longer than that of other adder structures. Second, a carry look-ahead adder (CLA) calculates the carry bits before the sum, which reduces the critical path delay dramatically.
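
The difference between the two carry schemes can be sketched with a small behavioral model. This is an illustration only, not the synthesized hardware: the function names are hypothetical, and a software loop does not reflect gate delay. The point is that the ripple version needs the previous carry at every stage, while the look-ahead version expresses each carry purely in terms of the generate and propagate signals.

```python
def ripple_carry_add(a, b, width):
    """Add bit by bit; each stage waits for the carry of the previous stage."""
    carry, total = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def lookahead_add(a, b, width):
    """Compute every carry directly from generate/propagate signals, then the sum."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]  # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]  # propagate
    carries = [0]  # carry into bit 0 is assumed to be 0
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | p[i]p[i-1]g[i-2] | ...
        # The expression uses only g and p, never an earlier carry.
        c, prod = g[i], p[i]
        for j in range(i - 1, -1, -1):
            c |= prod & g[j]
            prod &= p[j]
        carries.append(c)
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]
```

Both functions return the `width`-bit sum together with the carry out, so they can be compared exhaustively for small widths.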

Table II reports the synthesis results of these two CPA architectures and of the CSA (see Section 3.1) at different bitwidths in the TSMC 180 nm process. It shows that the RCA is the smallest but has the longest delay, while the CSA is the fastest, with an area that increases linearly. In high-speed applications such as software defined radio (SDR), a filter with a shorter critical path delay is desired. Therefore, we choose the CLA as the CPA and set the CLA-to-CSA area ratio to 4 (1368/376) for 16-bit and 3 (968/306) for 12-bit, and the delay ratio to 4 (1.2/0.32) for 16-bit and 3 (0.95/0.32) for 12-bit, each rounded to the nearest integer.

Table II Synthesis Results of Different Adder Architectures in TSMC 0.18 μm

Architecture   Metric        4-bit   8-bit   12-bit   16-bit
RCA            Delay (ns)    2.05    2.85    3.92     4.99
               Area (μm²)    134     141     211      282
CLA            Delay (ns)    0.60    0.79    0.95     1.2
               Area (μm²)    301     814     968      1368
CSA            Delay (ns)    0.32    0.32    0.32     0.32
               Area (μm²)    190     235     306      376
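
The integer ratios used in the rest of this chapter follow from the CLA and CSA rows of Table II. A minimal sketch, assuming the ratios are obtained by rounding to the nearest integer (the variable and function names are illustrative):

```python
# CLA and CSA figures for 12-bit and 16-bit adders, copied from Table II.
cla = {"delay_ns": {12: 0.95, 16: 1.2}, "area_um2": {12: 968, 16: 1368}}
csa = {"delay_ns": {12: 0.32, 16: 0.32}, "area_um2": {12: 306, 16: 376}}


def cla_to_csa_ratio(metric, bitwidth):
    """Ratio of the CLA figure to the CSA figure for one metric and bitwidth."""
    return cla[metric][bitwidth] / csa[metric][bitwidth]


for bitwidth in (16, 12):
    for metric in ("area_um2", "delay_ns"):
        ratio = cla_to_csa_ratio(metric, bitwidth)
        print(bitwidth, metric, round(ratio, 2), "->", round(ratio))
```

For 16-bit adders both ratios round to 4, and for 12-bit adders both round to 3.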

5.2 Case Study I : Versus SLSM

Tables III and IV list 10 filter designs with 16-bit and 12-bit coefficients, respectively. Each design is named lp (lowpass), hp (highpass), bp (bandpass) or bs (bandstop), followed by its filter length. The area and delay ratio is 4 for 16-bit and 3 for 12-bit coefficients. First, we find the maximum delay and area cost of every design using the SLSM method, as shown in the 2nd and 3rd columns. Then, we use those maximum delays as the timing constraints for the GOSM method; the corresponding area results are shown in the columns on the right. Rate is the percentage (area cost of SLSM - area cost of GOSM) / area cost of SLSM, and #sym. is the number of symbols that the ILP solver actually chooses in GOSM.

Under the same timing constraint, GOSM reduces the area cost by 26% and 22% on average for 16-bit and 12-bit coefficients, respectively. With SLSM, the area cost of the filter increases linearly as the filter length grows, whereas with GOSM the area cost increases more slowly.

Furthermore, with GOSM the ILP solver chooses more than five symbols (about 18 on average), and the number grows with the filter length. In longer filters, complex symbols appear in the coefficients frequently. The benefit of those complex symbols accounts for the larger difference in reduction rate between SLSM and GOSM for longer filters. Therefore, the area reduction rate increases with the filter length.

Because it expands the solution space, GOSM can also find solutions when the timing constraint is tighter, as shown in the delay-1 and delay-2 columns. The maximum delay can be reduced by at most 20%.

Table III Results of Filters with Bitwidth of Coefficient=16 and Area & Delay Ratio=4

          SLSM            GOSM (delay)           GOSM (delay-1)         GOSM (delay-2)
filter    delay   area    area (rate)    #sym.   area (rate)    #sym.   area (rate)      #sym.
bs_31     10      92      77 (16.3%)     6       80 (13%)       7       97 (-5.4%)       10
lp_32     10      89      76 (14.6%)     6       80 (10.1%)     8       95 (-6.7%)       9
hp_63     10      144     126 (12.5%)    10      138 (4.2%)     8       166 (-15.3%)     12
bp_64     11      156     135 (13.5%)    11      135 (13.5%)    12      147 (5.7%)       11
lp_127    10      231     186 (19.5%)    13      189 (18.2%)    13      234 (-1.3%)      22
bp_128    10      306     248 (18.9%)    17      271 (11.4%)    17      344 (-12.4%)     33
hp_255    10      387     267 (31%)      27      276 (28.7%)    27      311 (19.6%)      33
lp_256    10      395     254 (35.7%)    21      262 (33.7%)    21      300 (24.1%)      26
bs_511    10      626     354 (43.5%)    35      357 (43%)      35      424 (32.3%)      44
lp_512    10      820     378 (53.9%)    36      387 (52.8%)    36      440 (46.4%)      44
Avg.      10.1    324.6   210.1 (26%)    18.2    217.5 (23%)    18.4    255.8 (16.3%)    24.4
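
The rate column can be reproduced from the area values. The sketch below uses the SLSM areas and the GOSM areas under the same timing constraint (the "delay" column) from Table III. The names are illustrative, and reading the Avg. rate as the mean of the per-filter rates is an assumption that happens to agree with the 26% reported for this column.

```python
# (SLSM area, GOSM area under the same timing constraint), copied from Table III.
areas = {
    "bs_31": (92, 77), "lp_32": (89, 76), "hp_63": (144, 126), "bp_64": (156, 135),
    "lp_127": (231, 186), "bp_128": (306, 248), "hp_255": (387, 267),
    "lp_256": (395, 254), "bs_511": (626, 354), "lp_512": (820, 378),
}


def reduction_rate(slsm_area, gosm_area):
    """Rate = (area cost of SLSM - area cost of GOSM) / area cost of SLSM, in percent."""
    return (slsm_area - gosm_area) / slsm_area * 100


rates = {name: reduction_rate(s, g) for name, (s, g) in areas.items()}
mean_rate = sum(rates.values()) / len(rates)

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"mean of per-filter rates: {mean_rate:.1f}%")
```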

Table IV Results of Filters with Bitwidth of Coefficient=12 and Area & Delay Ratio=3

Designs SLSM GOSM

filter delay area delay delay-1 delay-2

area (rate) #sym. area (rate) #sym. area (rate) #sym.

5.3 Case Study II : Synthesis Result

We use Synopsys Design Compiler [19] with the TSMC 0.18 μm CMOS process on a workstation. Table V shows the synthesis result of the test design lp_32 with 16-bit coefficients.

To compare GOSM with SLSM, our estimates are taken from Case Study I and multiplied by the delay and area of one CSA unit; the corresponding synthesis results are shown on the right side. For this design, our estimates are slightly high, by about 13% in delay and 6.6% in area. In the Diff. row, we estimate that GOSM reduces delay by 20% at the cost of a 6.74% area overhead. After synthesis, GOSM actually reduces delay by 21% with a 7.1% area overhead. This means that the reduction rates in Tables III and IV correspond closely to the synthesis results.

Table V Synthesis Result of Test Design: lp_32

             Our estimation                   Synthesis result of multiplier block (error rate)
Algorithm    Delay (ns)     Area (μm²)        Delay (ns)      Area (μm²)
SLSM         10*0.32=3.2    89*376=33464      2.8 (12.5%)     31201.6 (6.7%)
GOSM         8*0.32=2.56    95*376=35720      2.2 (14.4%)     33420.3 (6.5%)
Diff.        20%            -6.74%            21%             -7.1%
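
The estimation columns and the Diff. row of Table V can be reproduced as follows. This sketch assumes the 16-bit CSA figures from Table II (0.32 ns, 376 μm²) as the unit factor and takes the lp_32 delay and area counts from Table III; the constant and function names are illustrative.

```python
CSA_DELAY_NS = 0.32   # 16-bit CSA delay from Table II
CSA_AREA_UM2 = 376    # 16-bit CSA area from Table II


def estimate(delay_units, area_units):
    """Scale the unit-less delay and area counts by one CSA unit."""
    return delay_units * CSA_DELAY_NS, area_units * CSA_AREA_UM2


# lp_32: SLSM (delay 10, area 89) and GOSM at delay-2 (delay 8, area 95).
slsm_delay, slsm_area = estimate(10, 89)
gosm_delay, gosm_area = estimate(8, 95)

est_delay_saving = (slsm_delay - gosm_delay) / slsm_delay * 100
est_area_overhead = (gosm_area - slsm_area) / slsm_area * 100

# Synthesis results from Table V.
syn_delay_saving = (2.8 - 2.2) / 2.8 * 100
syn_area_overhead = (33420.3 - 31201.6) / 31201.6 * 100

print(f"estimated: {est_delay_saving:.1f}% delay saving, {est_area_overhead:.2f}% area overhead")
print(f"synthesis: {syn_delay_saving:.1f}% delay saving, {syn_area_overhead:.1f}% area overhead")
```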

5.4 Case Study III : Pruned CAT

In this case study, Table VI compares the number of paths without and with pruning (Section 4.1.4). The bitwidth of the coefficients is 16. The 2nd column shows the number of paths without pruning, i.e., using the CAT algorithm. The columns on the right show the number of paths using the PCAT algorithm with timing constraints of 8~11. With a timing constraint of 8, fewer than 5% of the paths remain. With a timing constraint of 10, pruning still reduces the number of paths by 37.5% on average.

Table VI Results of PCAT

Designs # of possible paths with reduction phase(reduction rate) filter Without

5.5 Case Study IV : Reduced CAT

Table VII compares the number of paths before and after the reduction phase (Section 4.1.5). The bitwidth of the coefficients is 16. No timing constraint is applied, so no pruning occurs.

As the # of possible paths columns show, the reduction phase reduces the number of paths by 27.8% on average and saves ILP solver run time. For bp_128, we spend no more than 1 sec extra on enumerating the RCAT but obtain about a 50% speedup in ILP solving time. The reduction phase thus lowers the ILP solver overhead at the cost of only a small increase in enumeration time.

Table VII Results of RCAT

Designs Before reduction After reduction

filter # of possible paths Run time(sec.) # of possible paths (reduction rate)

5.6 Case Study V : Reduced Pruned CAT

Combining pruning with the reduction phase, Table VIII shows the results of RPCAT. Table VIII corresponds to Table VI with the reduction phase added. Compared to Table VI, RPCAT shows the same trend along the timing axis but reduces the number of paths by about 20% overall when the timing constraint is 9~11. Together, these two strategies succeed in reducing the CAT complexity.

Table VIII Results of RPCAT

# of possible paths with reduction phase (reduction rate)

          Without     Timing constraint
filter    pruning     11               10               9               8
bs_31     3965        3725 (6.1%)      2397 (39.5%)     397 (90%)       49 (98.8%)
lp_32     1428        1407 (1.5%)      1186 (17%)       336 (76.5%)     49 (96.6%)
hp_63     19782       17636 (10.8%)    5940 (70%)       450 (97.7%)     111 (99.4%)
bp_64     13789       12800 (7.2%)     7591 (44.9%)     575 (95.8%)     113 (99.2%)
lp_127    6786        6481 (4.5%)      4641 (31.6%)     1017 (85%)      214 (96.8%)
bp_128    43710       39167 (10.4%)    14120 (67.7%)    1260 (97.1%)    226 (99.5%)
hp_255    12930       12121 (6.3%)     7673 (40.7%)     1409 (89.1%)    447 (96.5%)
lp_256    12732       11902 (6.5%)     7471 (41.3%)     1255 (90%)      469 (96.3%)
bs_511    12179       11737 (3.6%)     8793 (27.8%)     2118 (82.6%)    786 (93.5%)
lp_512    12393       12019 (3.1%)     9403 (24.1%)     3223 (74%)      1000 (92%)
Avg.      13969.4     12899 (6%)       6921.5 (40.5%)   1204 (88%)      346 (96.9%)
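
The trend along the timing axis can be seen by recomputing the reduction rates of a single row. The sketch below uses the bp_128 row of Table VIII and assumes the rate is (paths without pruning - remaining paths) / paths without pruning, which agrees with the values printed in the table; the names are illustrative.

```python
# bp_128 row of Table VIII.
paths_without_pruning = 43710
paths_by_timing_constraint = {11: 39167, 10: 14120, 9: 1260, 8: 226}


def path_reduction_rate(before, after):
    """Percentage of paths removed relative to the unpruned count."""
    return (before - after) / before * 100


bp128_rates = {
    constraint: round(path_reduction_rate(paths_without_pruning, remaining), 1)
    for constraint, remaining in paths_by_timing_constraint.items()
}
print(bp128_rates)  # {11: 10.4, 10: 67.7, 9: 97.1, 8: 99.5}
```

The rate rises steeply as the timing constraint tightens from 11 to 8.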

Chapter 6 Conclusions & Future Works

In this thesis, Global Optimal Symbol Match (GOSM) is proposed for FIR filter synthesis. The method explores a large solution space, gives an optimal solution under the given timing constraint by formulating an ILP problem, produces delay- and area-optimal BSE-based FIR filters, and allows a trade-off between area and delay.

Compared to SLSM, under the same timing constraint, GOSM reduces the area cost by about 25% and reduces the maximum delay by at most 20%.

According to Case Study II, a BSE-based FIR filter synthesized by the GOSM method can achieve a clock rate of up to about 400 MHz in a 0.18 μm process, which could make it suitable for high-speed DSP applications.

We also propose two methods, PCAT and RCAT, to reduce the complexity of the coefficient assembly tree. PCAT reduces the number of paths by 37.5% when the timing constraint is 10. RCAT achieves a speedup of up to about 50% in ILP solving time.

GOSM produces an optimal solution for the BSE architecture, and the two reduction methods work together to minimize the complexity of the coefficient assembly tree. However, for some coefficients, the number of paths is still large and the ILP solver takes too long. In cases where NZB(C) is much larger (i.e., up to 13 bits), we can split the coefficient C into several sub-coefficients, each with fewer non-zero bits. Although optimality is sacrificed, a good-quality BSE-based filter can still be generated for an extremely large filter.

References

[1] K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.

[2] D. R. Bull and D. H. Horrocks, “Primitive operator digital filters,” Proceeding Inst. Elect. Eng. - Circuits Devices Systems, vol. 138, no. 3, pp. 401–412, Jun. 1991.

[3] A. G. Dempster and M. D. Macleod, “Use of minimum-adder multiplier blocks in FIR digital filters,” IEEE Transactions on Circuits and Systems. II, Analog Digital Signal Process, vol. 42, no. 9, pp. 569–577, Sep. 1995.

[4] H.-J. Kang and I.-C. Park, “FIR filter synthesis algorithms for minimizing the delay and the number of adders,” IEEE Transactions on Circuits and Systems. II, Analog Digital Signal Process, vol. 48, no. 8, pp. 770–777, Aug. 2001.

[5] A. Dempster et al., “Designing multiplier blocks with low logic depth,” IEEE international symposium on Circuits and Systems, May 2002, vol. 5, pp. 773–776.

[6] Y. Takahashi and M. Yokoyama, “New cost-effective VLSI implementation of multiplierless FIR filter using common subexpression elimination,” IEEE international symposium on Circuits and Systems, May 2005, vol. 2, pp. 1445–1448.

[7] C. Yao, H. Chen, T. Lin, C. Chien and C. Hsu, “A novel common subexpression elimination method for synthesizing fixed-point FIR filters,” IEEE Transactions on Circuits and Systems I, pp. 2211–2215, Nov. 2004.

[8] A. Hosangadi et al., “Algebraic methods for optimizing constant multiplications in linear systems,” J. VLSI Signal Process Systems, vol. 49, no. 1, pp. 31–50, Oct. 2007.

[9] O. Gustafsson and L. Wanhammar, “ILP modelling of the common subexpression sharing problem,” IEEE International Conference on Electronics, Circuits and Systems, Dec. 2002, vol. 3, pp. 1171–1174.

[10] S. Vijay et al., “A greedy common subexpression elimination algorithm for implementing FIR filters,” IEEE international symposium on Circuits and Systems, May 2007, pp. 3451–3454.

[11] R. M. Hewlitt and E. S. Swartzlander, “Canonical signed digit representation for FIR digital filters,” IEEE Workshop on Signal Processing Systems, 2000, pp. 416–426.

[12] J. H. Choi, et al., "Variation-aware low-power synthesis methodology for fixed-point FIR filters," IEEE Transactions on Computer-Aided Design Integrated Circuits and Systems, vol. 28, pp. 87-97, 2009.

[13] G. Karakonstantis, N. Banerjee and K. Roy, “Process-variation resilient and voltage-scalable DCT architecture for robust low-power computing,” IEEE Transactions on Very Large Scale Integrated Systems, pp. 1461–1470, 2010.

[14] R. Mahesh and A. P. Vinod, “A new common subexpression elimination algorithm for realizing low complexity higher order digital filters,” IEEE Transactions on Computer-Aided Design Integrated Circuits and Systems, pp. 217–219, Feb. 2008.

[15] M. M. Peiro, E. I. Boemo, and L. Wanhammar, “Design of high-speed multiplierless filters using a nonrecursive signed common subexpression algorithm,” IEEE Transactions on Circuits and Systems. II, Analog Digital Signal Process, vol. 9, no. 3, pp. 196–203, Mar. 2002.

[16] F. Xu, C. H. Chang, and C. C. Jong, “Contention resolution algorithm for common subexpression elimination in digital filter design,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 52, no. 10, pp. 695–700, Oct. 2005.

[17] Gurobi Optimizer. [Online]. Available: http://www.gurobi.com/

[18] Matlab FDAtool

[19] Synopsys Design Compiler
