
Chapter 3 Motivation

3.2 Problem Formulation

Given a set of coefficients {C0, C1, …, Cn-1} and an upper bound D on the number of levels of the CSA tree, where D constrains the timing of the MCM block, the objective of this work is to decide a match for each coefficient and to determine an alphabet that leads to the minimum total area cost of the MCM block.


Chapter 4 Proposed Algorithm

Our proposed algorithm, called Global Optimal Symbol Match (GOSM), is described in this chapter. First, we introduce the proposed architecture and the definitions used in GOSM in Sections 4.1 and 4.2, respectively; then the flow chart of GOSM is illustrated in Section 4.3. GOSM consists of two main parts: the enumeration of legal paths, described in Sections 4.4 and 4.5, and the ILP formulation, described in Section 4.6.

4.1 Proposed Architecture

A two-stage MCM block using memory-based multiplication is depicted in Figure 9. The first stage generates the product values of the common symbols, which are then used to realize all constant multiplications in the second stage. The delay of the first stage is almost constant since it consists of a memory unit. The delay of the second stage, which comprises the CSA tree and the CPAs, is flexible: it is determined by the number of levels of the CSA tree, which is why D is used to constrain timing.

Figure 9. Proposed MCM block architecture.


4.2 Definitions

• Coefficient Assembly Tree, CAT(C): CAT(C) is a tree expanded from a coefficient C, in which each node is a fragment. A CAT for the coefficient C=4’b1011 is depicted in Figure 10.

• Path: A path from the root to a leaf in CAT(C) is a match for the coefficient C. For example, there are five possible paths in the CAT(C) of Figure 10.

• Pathi,j: the j-th path of the i-th coefficient.

• SymSet(Path): the set of symbols that are used on Path. For example, Path0,1 in Figure 10 includes F(S1, 3) and F(S3, 0), so SymSet(Path0,1) = {S1, S3}.

Figure 10. A CAT for a coefficient C=4’b1011.


Figure 11. (a) Example of the support number of Path0,3. (b) Example of the support number of Path0,9.

• NumSup(Path): the total number of supports on Path. Every symbol except S1 contributes two supports because its product values are read from a dual-port memory; S1 contributes one. For example, Figure 11 illustrates NumSup(Path) on two different kinds of paths. In Figure 11(a), Path0,3 includes F(S1, 6), F(S3, 3), and F(S3, 0), so NumSup(Path0,3) equals five. Similarly, in Figure 11(b), NumSup(Path0,9) equals three since the path includes F(S45, 1) and F(S1, 0).

• NumSupmax: the maximum number of supports allowed for a CSA tree. The values of NumSupmax for different values of D are listed in Table 4.

• Legal path: a path whose NumSup(Path) is less than or equal to NumSupmax. That is, a legal path can be implemented with a CSA tree that has no more than NumSupmax supports.


Table 4. Maximum numbers of supports of CSA tree.

• Trim leading zeros, TrimLZ(C): remove the leading 0’s of C. For example, TrimLZ(01101)=1101 and TrimLZ(000101101)=101101.
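The definitions above can be made concrete with a small Python sketch. It assumes, purely for illustration, that a fragment F(S, d) is stored as a tuple (S, d) meaning the symbol value S shifted left by d bits, and that a path is a list of such tuples. All function names are hypothetical and are not taken from the thesis implementation.

```python
# Illustrative sketch; the tuple encoding and the function names are assumptions.
# A fragment F(S, d) is modelled as (S, d): symbol value S shifted left by d bits.

def trim_lz(bits: str) -> str:
    """TrimLZ: drop the leading 0's of a binary string."""
    return bits.lstrip("0")

def path_value(path):
    """Coefficient realized by a path: the sum of its shifted symbols."""
    return sum(s << d for s, d in path)

def sym_set(path):
    """SymSet(Path): the symbols used on the path."""
    return {s for s, _ in path}

def num_sup(path):
    """NumSup(Path): one support for S1, two for every other symbol."""
    return sum(1 if s == 1 else 2 for s, _ in path)

def is_legal(path, num_sup_max):
    """A path is legal when NumSup(Path) <= NumSupmax."""
    return num_sup(path) <= num_sup_max

path_0_1 = [(1, 3), (3, 0)]                    # F(S1, 3), F(S3, 0) from Figure 10
print(path_value(path_0_1) == 0b1011)          # True
print(sym_set(path_0_1))                       # {1, 3}
print(num_sup([(1, 6), (3, 3), (3, 0)]))       # 5, as in Figure 11(a)
print(num_sup([(45, 1), (1, 0)]))              # 3, as in Figure 11(b)
print(is_legal([(1, 6), (3, 3), (3, 0)], 3))   # False for NumSupmax = 3
```

Under this encoding, both paths of Figure 11 evaluate to the same coefficient, which is what makes them alternative matches.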

4.3 Flow Chart of GOSM

The overall flow of our proposed algorithm is illustrated in Figure 12. The inputs of GOSM are a set of coefficients and an upper bound on the number of levels of the CSA tree. GOSM consists of two main parts. In the first part, we enumerate all possible legal paths and construct a Coefficient Assembly Tree (CAT) for each coefficient. In the second part, we formulate the problem as an integer linear programming (ILP) problem and use an ILP solver to find globally optimal matches for the coefficients. Finally, the output is a Verilog file of the FIR filter generated by the GOSM method.


Figure 12. The overall flow of proposed algorithm.

4.4 Coefficient Assembly Tree Construction

In this section, we introduce the CAT enumerator, which enumerates all possible paths and constructs a CAT for a coefficient. The pseudo code of the CAT enumerator is shown below.

Initial: A=;

CAT(Root, TrimLZ(C))

1 Symbol=;

2 for num_1 from 0 to NZB(C)-1 3 C’=TrimMSB(C);

4 S=1;

5 Sym_Enum(C’, S, num_1);


5   Sym_Enum(TrimMSB(C’), S<<1, num_1); //skip current 0
6 else //MSB(C’)==1
7   if(NZB(C’)>num_1) //enough remaining 1’s?
8     Sym_Enum(TrimMSB(C’), S<<1, num_1); //skip current 1
9   Sym_Enum(TrimMSB(C’), (S<<1)+1, num_1-1); //pick current 1
End

Initially, the alphabet A is empty; the inputs of the CAT enumerator are a root node and a coefficient without leading 0’s. The set Symbol, which records every S generated by the first for loop in lines 2 to 5, is initialized as empty. The variable num_1 in line 2 indicates the number of 1’s that must be chosen in C’, where C’=TrimMSB(C). The first for loop enumerates all possible S that include the MSB of C. For instance, assume C0=01011; the loop runs from 0 to 2 because NZB(C0)-1=2. S1 is returned when num_1 equals 0. Similarly, S5 and S9 are returned when num_1 equals 1, while S11 is returned when num_1 equals 2. Four different symbols, 1 (S1), 101 (S5), 1001 (S9), and 1011 (S11), are thus stored in the set Symbol. The second for loop in lines 6 to 10 creates a fragment for each S in Symbol and takes the remaining part as a new coefficient with which to recursively call the same function CAT. For instance, for the first node of Path0,0 in Figure 13, we create a child node r=F(S1, d) for Root, where d=DOL(C0, S1), and add S1 into A. Finally, we call the function CAT again with the node r and Residue(C0, S1<<3) as inputs. The recursion of CAT ends when Symbol is empty.

Figure 13. A CAT(01011) example to illustrate the function CAT.


Figure 14. An example to illustrate the sub-function Sym_Enum when num_1=1.

An example of the sub-function Sym_Enum for the coefficient C0 when num_1 equals 1 is depicted in Figure 14. Sym_Enum, which is also recursive, finds all possible symbols for a given num_1. In each recursion, we check whether the MSB of C’ is 1. If the MSB of C’ is 0, we shift S left by 1 bit and recursively call Sym_Enum. If the MSB of C’ is 1, we set S to (S<<1)+1, decrease num_1 by 1, and recursively call Sym_Enum. In addition, if the number of 1’s in the remaining part is greater than num_1, we may also treat this bit as 0 and proceed as in the case where the MSB of C’ is 0. Sym_Enum does not return a symbol until num_1 equals 0.
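The recursion described above can be sketched in Python. This is one reading of the textual description, not the thesis code: the coefficient is assumed to be a binary string, the termination checks are filled in by assumption, and `sym_enum` and `symbols_with_msb` are hypothetical names.

```python
def sym_enum(rest: str, s: int, num_1: int) -> list:
    """Symbols grown from prefix value s by picking num_1 more 1's from rest (MSB first)."""
    if num_1 == 0:                 # enough 1's picked: s is a complete symbol
        return [s]
    if not rest:                   # no bits left before num_1 reached 0
        return []
    tail = rest[1:]                # TrimMSB
    if rest[0] == "0":
        return sym_enum(tail, s << 1, num_1)            # skip current 0
    found = []
    if rest.count("1") > num_1:                         # enough remaining 1's?
        found += sym_enum(tail, s << 1, num_1)          # skip current 1
    found += sym_enum(tail, (s << 1) + 1, num_1 - 1)    # pick current 1
    return found

def symbols_with_msb(coeff: str) -> list:
    """All symbols that contain the MSB of coeff (the first loop of CAT)."""
    c = coeff.lstrip("0")          # TrimLZ
    nzb = c.count("1")             # NZB(C)
    out = []
    for num_1 in range(nzb):       # num_1 = 0 .. NZB(C)-1
        out += sym_enum(c[1:], 1, num_1)
    return out

print(sorted(symbols_with_msb("01011")))   # [1, 5, 9, 11]
print(sorted(sym_enum("011", 1, 1)))       # [5, 9]
```

The first call reproduces the four symbols S1, S5, S9, and S11 of the C0=01011 example, and the second reproduces the num_1=1 case.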

4.5 Tree Pruning

In Section 4.4, we enumerate all possible paths for a coefficient, but illegal paths are also enumerated in the CAT. This wastes time because an illegal path can never be chosen as a match for a coefficient. To avoid enumerating illegal paths in a CAT, we use branch-and-bound to modify CAT.

The pseudo code of the modified CAT, called PCAT, is shown below.

Initial: A=;

9 if(NumSup(Pathmiddle)<=NumSupmax)

10 create a child node r=F(S, d) for Root;

11 add symbol S into alphabet A;

12 PCAT(r, Residue(C, S<<d), Pathmiddle);

End

The pseudo code of PCAT is very similar to that of CAT. In PCAT, we maintain a fragment set called Pathmiddle. In each recursion, we add a new fragment to Pathmiddle and then check whether NumSup(Pathmiddle) is less than or equal to NumSupmax. If so, we create a child node r=F(S, d) for Root and recursively call PCAT; otherwise, we skip it. An example of PCAT for the coefficient C=1111101 when D=1 (NumSupmax=3) is depicted in Figure 15. NumSup(Pathmiddle) at the third node equals NumSupmax, so its child nodes, which are marked by dashed circles in Figure 15, cannot be created. Thus, we construct a CAT for a coefficient quickly without enumerating illegal paths.

Figure 15. An example to illustrate the function PCAT.
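The branch-and-bound idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the thesis implementation: coefficients are Python integers, a fragment F(S, d) is a tuple (S, d), symbols are generated with `itertools.combinations` over the positions of the 1 bits instead of Sym_Enum, and `enumerate_paths` is a hypothetical name. Passing `num_sup_max=None` gives the unpruned enumeration of Section 4.4.

```python
from itertools import combinations

def num_sup(path):
    """One support for S1, two for any other symbol."""
    return sum(1 if s == 1 else 2 for s, _ in path)

def enumerate_paths(coeff: int, num_sup_max=None):
    """Return every path of coeff as a list of (symbol, shift) fragments.

    With num_sup_max set, a branch is abandoned as soon as the partial path
    (Pathmiddle) exceeds the bound, so illegal paths are never expanded.
    """
    paths = []

    def grow(residue: int, prefix: list):
        if residue == 0:                       # leaf: every 1 bit is covered
            paths.append(prefix)
            return
        ones = [b for b in range(residue.bit_length()) if residue >> b & 1]
        msb, lower = ones[-1], ones[:-1]
        for k in range(len(lower) + 1):        # extra 1's joined to the MSB
            for picked in combinations(lower, k):
                mask = (1 << msb) | sum(1 << b for b in picked)
                d = min(picked, default=msb)   # shift that makes the symbol odd
                middle = prefix + [(mask >> d, d)]
                if num_sup_max is not None and num_sup(middle) > num_sup_max:
                    continue                   # bound: prune this branch
                grow(residue ^ mask, middle)

    grow(coeff, [])
    return paths

print(len(enumerate_paths(0b1011)))                      # 5
print(len(enumerate_paths(0b1111101, num_sup_max=3)))    # 7
```

Pruning on the partial path is safe because adding a fragment can only increase the number of supports. The order in which paths are produced here is not claimed to match the path indices used in the figures.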

4.6 ILP Formulation

In Sections 4.4 and 4.5, we introduced how to enumerate all possible legal paths for a coefficient. In this section, we illustrate how to find globally optimal matches for all coefficients by using an Integer Linear Programming (ILP) solver.


4.6.1 Variables

Two kinds of 0-1 variables are defined to model the choice of a path in a CAT. VarS indicates whether a symbol is selected into A, and VarPath indicates whether a path is selected.

4.6.2 Existence Constraint

Each path in CAT(C) is a match for the coefficient C, and a match is composed of a set of fragments. Thus, if Pathi,j is chosen, every symbol in SymSet(Pathi,j) must be chosen. The converse does not hold: even when every symbol in SymSet(Pathi,j) is chosen, Pathi,j need not be chosen. The existence constraint ensures that all symbols on a selected path are included in A. The formulation is as follows:

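$$\mathit{VarPath}_{i,j} \le \mathit{VarS}_k \quad \text{for every } S_k \in \mathit{SymSet}(\mathit{Path}_{i,j}) \qquad (4)$$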

4.6.3 Uniqueness Constraint

A CAT contains many paths, but only one path should be chosen for each coefficient; taking more than one path wastes area. The uniqueness constraint ensures that exactly one path is selected for each coefficient. It is formulated as:

$$\sum_{j=0}^{k_i-1} \mathit{VarPath}_{i,j} = 1, \quad i = 0, 1, \ldots, n-1 \qquad (5)$$

where ki is the total number of legal paths for the i-th coefficient.


4.6.4 Objective Function

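The objective of the ILP problem is to minimize the total area, including the memory and the CSA trees. Our proposed architecture has two stages. The first stage consists of a dual-port memory that stores the product values of all symbols in A. Since VarS is a 0-1 variable, the memory area can be calculated by summing the area cost of each symbol (Section 4.6.6) weighted by its VarS. Likewise, the CSA tree area of the second stage is the sum of the area cost of each path (Section 4.6.5) weighted by its VarPath. Finally, equation (6) is the cost function of the ILP problem.

$$\min \; \sum_{k} \mathit{Area}(S_k)\,\mathit{VarS}_k + \sum_{i=0}^{n-1} \sum_{j=0}^{k_i-1} \mathit{Area}(\mathit{Path}_{i,j})\,\mathit{VarPath}_{i,j} \qquad (6)$$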
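To make the roles of the two constraints and the objective concrete, the following sketch solves a toy instance by exhaustive search instead of calling an ILP solver. The function name `solve_by_enumeration`, the symbol sets, and all cost values are placeholders chosen for illustration; they are not data from the thesis.

```python
from itertools import product

def solve_by_enumeration(paths, path_cost, sym_cost):
    """Minimize the path costs plus the costs of all symbols that are used.

    paths[i][j] is SymSet(Path i,j). Picking exactly one j per coefficient i
    enforces uniqueness; charging every symbol on the chosen paths enforces
    existence (VarS = 1 whenever a chosen path uses S).
    """
    best = None
    for choice in product(*(range(len(p)) for p in paths)):
        alphabet = set().union(*(paths[i][j] for i, j in enumerate(choice)))
        cost = sum(path_cost[i][j] for i, j in enumerate(choice))
        cost += sum(sym_cost[s] for s in alphabet)
        if best is None or cost < best[0]:
            best = (cost, choice, alphabet)
    return best

# Toy instance with made-up numbers (hypothetical, for illustration only).
paths = [
    [{"S1", "S3"}, {"S11"}],                    # candidate paths of coefficient 0
    [{"S1", "S3"}, {"S1", "S11"}, {"S23"}],     # candidate paths of coefficient 1
]
path_cost = [[20, 0], [40, 25, 0]]
sym_cost = {"S1": 0, "S3": 30, "S11": 35, "S23": 45}

cost, choice, alphabet = solve_by_enumeration(paths, path_cost, sym_cost)
print(cost, choice, sorted(alphabet))   # 60 (1, 1) ['S1', 'S11']
```

In this toy instance, sharing one symbol between both coefficients is cheaper than giving each coefficient its own zero-cost single-symbol path, which is the effect the objective is designed to capture. Exhaustive search is only practical for tiny instances; the thesis uses an ILP solver for real coefficient sets.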

4.6.5 Area Cost of Path

In a CAT, different paths have different area costs, but the area cost of each path can easily be estimated. First, we calculate the bit-width of each support in the CSA tree and rank the supports in ascending order. Table 5 shows the schemes for numbers of supports from three to six and the width of each CSA in the CSA tree. An n-bit CSA consists of n disjoint full adders, and one full adder requires two 4-input LUTs on the FPGA. Therefore, the area cost of a path can be estimated by multiplying the total CSA bit-width by two.


Table 5. Area cost for different numbers of supports.

4.6.6 Area Cost of Symbol

All product values of the symbols in A are stored in a dual-port memory. Two 4-input LUTs can be combined into a 32×1-bit memory, as shown in Figure 16. Therefore, a 2^L×W-bit dual-port memory is equivalent to 2^(L-3)·W LUTs. For instance, a 2^5×7-bit dual-port memory is equivalent to 28 LUTs.

Figure 16. An example of combining two LUTs into a 32x1 bit memory.
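The two area estimates of Sections 4.6.5 and 4.6.6 can be written as small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and the factor of two for the second memory port is inferred from the 28-LUT example rather than stated explicitly in the text.

```python
def csa_path_luts(csa_widths):
    """Path area: each CSA bit is one full adder, i.e. two 4-input LUTs."""
    return 2 * sum(csa_widths)

def dual_port_mem_luts(addr_bits, width):
    """LUT-equivalent of a 2**addr_bits x width dual-port memory.

    Two LUTs give one 32x1-bit memory. Doubling for the second port is an
    assumption chosen so that 2**5 x 7 bits maps to 28 LUTs, as in the text.
    """
    single_port = 2 * (2 ** addr_bits * width) / 32
    return 2 * single_port

print(dual_port_mem_luts(5, 7))   # 28.0
print(csa_path_luts([15]))        # 30 (a single 15-bit CSA; the width is a made-up input)
```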

4.6.7 ILP example

In this section, we demonstrate an ILP example for the coefficient set {1011, 10111}. First, CAT(C0) and CAT(C1) are constructed by the function PCAT with D equal to two, as shown in Figure 17. Assume that the input word length is 10.

According to CAT(C0) and CAT(C1), two kinds of constraints are listed below:

Existence constraint:

Then, the objective shown below is minimized.

Objective:

26VarPath0,0 + 26VarPath0,1 + 28VarPath0,2 + 28VarPath0,3 + 0VarPath0,4 + 52VarPath1,0 + 52VarPath1,1 + 54VarPath1,2 + 54VarPath1,3 + 28VarPath1,4 + 50VarPath1,5 + 54VarPath1,6 + 54VarPath1,7 + 56VarPath1,8 + 54VarPath1,9 + 56VarPath1,10 + 30VarPath1,11 + 30VarPath1,12 + 60VarPath1,13 + 0VarPath1,14.

Finally, the ILP problem is solved with the Gurobi ILP solver [19]. The solver selects VarPath0,4, VarPath1,11, and VarS11. The total estimated area cost is 66 LUTs, and the alphabet A contains S11.


Figure 17. ILP results for coefficient set {1011, 10111}.


Chapter 5 Experimental Results

5.1 Experimental Environment

The proposed algorithm, GOSM, is implemented in C++ in a Linux environment. The experiments are run on a workstation with an Intel Xeon 2.4 GHz CPU and 48 GB of RAM. The ILP solver used to find globally optimal matches for the coefficients is Gurobi 5.0 [19]. The FIR filter produced by the GOSM method is described at the RTL level in Verilog HDL. Targeting the Altera Stratix III device EP3SL50F484C2, synthesis and post-simulation are conducted with Quartus II 11.0 and ModelSim SE 6.3a.

Table 6 lists 8 FIR filters with a 14-bit coefficient word length and Table 7 lists 8 FIR filters with a 16-bit coefficient word length, where #tap is the number of coefficients and Width is the bit-width of the filter coefficients. All filter coefficients are available in [20]. We separate the test cases into two groups according to the bit-width of the filter coefficients. In each group, the test cases are ranked in ascending order of the number of taps. Note that the minimum and maximum numbers of taps in Table 6 are 30 and 121, respectively. Similarly, the minimum and maximum numbers of taps in Table 7 are 20 and 279, respectively.


Table 6. FIR filters with 14-bit coefficient word length.

Table 7. FIR filters with 16-bit coefficient word length.

5.2 Experimental Results for Different Width

Table 8 and Table 9 present the implementation results of the FIR filters obtained by the OMS and GOSM methods. In these tables, Delay denotes the critical-path delay in ns, LUTs denotes the required number of LUTs, and Memory bits denotes the total number of memory bits used. Reduction rate is (cost by OMS - cost by GOSM)/cost by OMS, expressed as a percentage. The input bit-width of each FIR filter is assumed to equal the coefficient bit-width, and D is set to two. The delay of the OMS method does not include the delay of the address encoder because Stratix III only supports synchronous dual-port ROM. Thus, the actual delay of an FIR filter produced by the OMS method is larger.

The area-minimized MCM blocks are obtained by the ILP solver. For coefficient bit-widths of 14 and 16, the maximum ILP time was 1.6 s and 44.87 s, respectively, which indicates that the generated ILP problems are easy to solve. The ILP time is affected by the coefficient bit-width because a longer coefficient has more possible legal paths than a shorter one. Therefore, the ILP time for 16-bit coefficients is much longer than that for 14-bit coefficients.

The results show that FIR filters generated using GOSM clearly outperform those generated by the existing OMS method in terms of delay and area. As shown in Table 8, the maximum improvements in delay, LUTs, and memory bits are 21.34%, 55.68%, and 82.98%, respectively. Similarly, in Table 9, the maximum improvements in delay, LUTs, and memory bits are 21.79%, 57.25%, and 81.77%, respectively.

The GOSM method significantly improves delay and LUTs because it replaces the overhead of the OMS architecture, such as barrel shifters and control circuits, with a CSA tree. GOSM also shares common symbols among coefficients, so memory bits are markedly reduced; the reduction rate of memory bits becomes more prominent as the number of constant coefficients increases.

On average, the reduction rates of LUTs and memory bits for a 14-bit width do not differ significantly from those for a 16-bit width, whereas the reduction rate of delay tends to decline when the bit-width changes from 14 to 16. The decline is caused by two factors. First, the denominator of the delay reduction rate becomes larger as the bit-width increases, because the CPA delay is positively correlated with the coefficient bit-width. Second, routing delay is not considered in GOSM: when one symbol is shared simultaneously by several coefficients, routing complexity increases.

Table 8. Results for width=14.

Table 9. Results for width=16.

5.3 Experimental Results for Different D

In this study, we compare memory bits, LUTs, and delay for different values of D (from 1 to 3), as shown in Figure 18, Figure 19, and Figure 20, respectively. The coefficient and the input are both 16 bits wide. The maximum number of supports for a CSA tree grows as D increases, and it is easier to find common symbols among coefficients when the number of supports is large. Thus, the number of memory bits used in all cases is negatively correlated with D. On the other hand, the critical-path delay and the required number of LUTs are positively correlated with D. The required ILP time is also affected by D: the maximum ILP time was 0.04 s when D equals 1 but 447.14 s when D equals 3. This increase is expected, since the CAT contains more legal paths as D increases.

Figure 18. Results of memory bits for different D.


Figure 19. Results of LUTs for different D.

Figure 20. Results of delay for different D.


Chapter 6 Conclusion

In this thesis, the global optimal symbol match (GOSM) algorithm is proposed for minimizing the area of a multiple constant multiplication (MCM) block. The key concept of GOSM is to share common symbols among coefficients. GOSM consists of two major parts. In the first part, we enumerate all possible legal paths and construct a coefficient assembly tree (CAT) for each coefficient. In the second part, to find globally optimal matches for the coefficients, we formulate the problem using integer linear programming (ILP) and solve it with an ILP solver. As a result, the memory stores only the product values of the symbols chosen by the ILP solver.

FIR filters generated using GOSM clearly outperform those generated by the existing OMS method in terms of delay and area. The experimental results show that, on average, GOSM achieves a reduction of more than 10% in delay and 50% in area. Moreover, GOSM is more flexible than OMS since it offers an adjustable upper bound on the number of levels of the CSA tree, which helps to control the delay. Therefore, FIR filters generated by GOSM are very suitable for high-speed DSP applications.


References

[1] H.-R. Lee, C.-W. Jen, and C.-M. Liu, “On the design automation of the memory-based VLSI architectures for FIR filters,” IEEE Trans. Consum. Electron., vol. 39, no. 3, pp. 619–629, Aug. 1993.

[2] . . , “Constant multipliers for FPGAs,” in 2nd Intl Workshop on Engineering of Reconfigurable Hardware/Software Objects (ENREGLE), pp. 167–173, Jun. 2000.

[3] J.-I. Guo, C.-M. Liu, and C.-W. Jen, “The efficient memory-based VLSI array designs for DFT and DCT,” IEEE Trans. Circuits Syst. II, Analog and Digit. Signal Process., vol. 39, no. 10, pp. 723–733, Oct. 1992.

[4] D. F. Chiper, “A systolic array algorithm for an efficient unified memory-based implementation of the inverse discrete cosine and sine transforms,” in IEEE Conf. Image Process., Oct. 1999, pp. 764–768.

[5] D. F. Chiper, M. N. S. Swamy, M. O. Ahmad, and T. Stouraitis, “Systolic algorithms and a memory-based design approach for a unified architecture for the computation of DCT/DST/IDCT/IDST,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 6, pp. 1125–1137, Jun. 2005.

[6] P. K. Meher and M. N. S. Swamy, “New systolic algorithm and array architecture for prime-length discrete sine transform,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 3, pp. 262–266, Mar. 2007.

[7] P. K. Meher, J. C. Patra, and M. N. S. Swamy, “High-throughput memory-based architecture for DHT using a new convolutional formulation,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 7, pp. 606–610, Jul. 2007.

[8] M. J. Wirthlin, “Constant coefficient multiplication using look-up tables,” J. VLSI Signal Process., vol. 36, no. 1, pp. 7–15, Jan. 2004.

[9] M. Faust and C. H. Chang, “Bit-parallel multiple constant multiplication using look-up tables on FPGA,” in Proc. 2011 IEEE Int. Symp. Circuits Syst., ISCAS 2011, May 2011, pp. 657–660.

[10] P. K. Meher, “Memory-based hardware for resource-constraint digital signal processing systems,” in Proc. 6th Int. Conf. ICICS, Dec. 2007, pp. 1–4.

[11] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “LMS adaptive filters using distributed arithmetic for high throughput,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 7, pp. 1327–1337, Jul. 2005.

[12] H. Yoo and D. V. Anderson, “Hardware-efficient distributed arithmetic architecture for high-order digital filters,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, ICASSP 2005, Mar. 2005, pp. v/125–v/128.

[13] H.-C. Chen, J.-I. Guo, T.-S. Chang, and C.-W. Jen, “A memory-efficient realization of cyclic convolution and its application to discrete cosine transform,” IEEE Trans. Circuits and Systems for Video Technol., vol. 15, no. 3, pp. 455–453, Mar. 2005.

[14] P. K. Meher, “Novel input coding technique for high-precision LUT-based multiplication for DSP applications,” The 18th IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC 2010), pp. 201–206, Sept. 2010.

[15] P. K. Meher, “New look-up-table optimizations for memory-based multiplication,” in Proc. 12th Int. Symp. Integrated Circuits, ISIC 2009, Dec. 2009, pp. 663–666.

[16] P. K. Meher, “LUT optimization for memory-based computation,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 57, no. 4, pp. 285–289, Apr. 2010.

[17] P. K. Meher, “New approach to LUT implementation and accumulation for
