Summary - Proposed Approach - A High-Performance and Low-Cost Parameterizable Fast Fourier Transform Hardware Generator

Chapter 3 Proposed Approach

3.4 Summary

In Chapter 3, we proposed two directional trade-off approaches based on the R2²MDC and R2MDC architectures. In the vertical direction, we provide an expansion approach for the R2²MDC architecture to increase the throughput; in the horizontal direction, we provide a compression approach for the R2MDC architecture to decrease the throughput. Under a given throughput constraint, our approach generates exactly one solution, whereas the previous work discussed in Chapter 2 searches for the desired solution exhaustively. Table 2 lists the hardware and throughput comparison between our approach and previous work. Table 3 lists the hardware comparison under the same throughput, obtained by replacing jk with t log₂N.

Table 2 Hardware Requirement Comparison

Architecture (FFT length N)   Multipliers       Adders      Registers   Throughput
Pease                         jk                2jk         N           2jk / (N log₂N)
R2²MDC_P                      t(2 log₄N − 2)    2t log₂N    N − 2t      2t / N
R2MDC_F                       t log₂N           2t log₂N    N           2t / N

Table 3 Hardware Requirement Comparison with the same throughput

Architecture (FFT length N)   Multipliers       Adders      Registers   Throughput
Pease                         t log₂N           2t log₂N    N           2t / N
R2²MDC_P                      t(2 log₄N − 2)    2t log₂N    N − 2t      2t / N

Chapter 4 Experiments

4.1 Experimental Environment

We implement two kinds of FFT architectures: the R2²MDC vertical expansion architecture and the R2MDC horizontal compression architecture. Each PE stage of the FFT architecture is pipelined. A complex adder contains two real adders, and a complex multiplier contains four real multipliers and two real adders, as shown in Figure 40. For each complex multiplier, we design a ROM that contains all the possible twiddle factor values for that multiplier.

Figure 40 Complex Multiplier
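The structure in Figure 40 can be sketched in software as follows. This is a minimal illustration in floating point, not the synthesized datapath. The function names are hypothetical, and the ROM here simply stores every power W_N^k = exp(−2πik/N), which is an assumption about the ROM contents rather than the per-multiplier tables used in the generator.

```python
import cmath


def complex_mul(a_re, a_im, b_re, b_im):
    """Complex product using four real multiplications and two real additions."""
    # Four real multipliers.
    p0 = a_re * b_re
    p1 = a_im * b_im
    p2 = a_re * b_im
    p3 = a_im * b_re
    # Two real adders (the first one operates as a subtractor).
    return p0 - p1, p2 + p3


def twiddle_rom(n):
    """Hypothetical ROM contents: all N twiddle factors W_N^k for k = 0..N-1."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]


# Multiply a sample by a twiddle factor read from the ROM.
rom = twiddle_rom(8)
w = rom[2]  # W_8^2, which equals -j up to rounding
re, im = complex_mul(1.0, 2.0, w.real, w.imag)
print(re, im)
```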

The logic gate model includes adders, multipliers, and multiplexers. We use the UMC 0.18 µm cell library and Synopsys DesignWare [15] to synthesize the designs at a 100 MHz clock rate. The experiments run on a dual Intel Pentium Xeon machine at 2.5 GHz with 32 GB of main memory, running Linux.

We use Matlab [16] to generate random inputs and calculate the SQNR to verify the correctness of the generated FFT architectures. The SQNR in our simulations is between 80 dB and 90 dB.
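The thesis performs this check in Matlab. The Python sketch below only illustrates how an SQNR value in dB can be obtained by comparing a quantized FFT against a floating-point reference. The rounding model (every butterfly output rounded to `frac_bits` fractional bits) and all names are assumptions made for illustration, not the word-length scheme of the generated hardware.

```python
import cmath
import math
import random


def quantize(z, frac_bits):
    """Round real and imaginary parts to frac_bits fractional bits."""
    scale = 1 << frac_bits
    return complex(round(z.real * scale) / scale, round(z.imag * scale) / scale)


def fft(x, frac_bits=None):
    """Recursive radix-2 DIT FFT; rounds each butterfly output if frac_bits is set."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], frac_bits)
    odd = fft(x[1::2], frac_bits)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        a, b = even[k] + t, even[k] - t
        if frac_bits is not None:
            a, b = quantize(a, frac_bits), quantize(b, frac_bits)
        out[k], out[k + n // 2] = a, b
    return out


def sqnr_db(reference, test):
    """SQNR = 10*log10(signal power / quantization noise power)."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - t) ** 2 for r, t in zip(reference, test))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


random.seed(0)
x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(256)]
ref = fft(x)
for bits in (12, 16, 20):
    print(bits, "fractional bits:", sqnr_db(ref, fft(x, bits)), "dB")
```

Each additional fractional bit halves the rounding step, so the reported SQNR should grow as `frac_bits` increases.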

4.2 Experimental Results

Figure 41 shows the relation between throughput and area for N=256, where area is measured in gate counts. For Pease, three architectures are generated; from left to right, the parameters are j=1, j=2, and j=4. For all architectures, we assume k=1. For R2MDC, three architectures are generated; from left to right, the parameters are […] same as the area of R2MDC under the same throughput. From Table 3, we can see that the hardware requirement is also the same under the same throughput. Figure 42 shows the relation between throughput and area for N=1024.

[Plot: area (gate counts) versus throughput, FFT length N=256; series: Pease, R2MDC]

Figure 41 Relation between throughput and area for Pease and R2MDC, N=256

[Plot: area (gate counts) versus throughput, FFT length N=1024; series: Pease, R2MDC]

Figure 42 Relation between throughput and area for Pease and R2MDC, N=1024

Figure 43 shows the relation between throughput and area for N=256. For Pease, five architectures are generated; from left to right, the parameters are j=8, j=16, …, and j=128. For R2²MDC, six architectures are generated; from left to right, the parameters are t=1, t=2, …, and t=32. The area of Pease is much larger than the area of the R2²MDC vertical expansion architectures under the same throughput, because Pease uses many more multipliers. This can also be seen in Table 3. Figure 44 shows the relation between throughput and area for N=1024. For Pease, five architectures are generated; from left to right, the parameters are j=8, j=16, …, and j=128. For R2²MDC, five architectures are generated; from left to right, the parameters are t=1, t=2, …, and t=16. The area of Pease is still much larger than the area of the R2²MDC vertical expansion architectures under the same throughput.
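The throughput of each plotted design point follows from the Table 2 formulas. The sketch below enumerates them, reading the ellipses in the parameter lists as powers of two (an assumption, although it matches the stated counts of five and six architectures). The helper names are hypothetical.

```python
from fractions import Fraction


def pease_throughput(n, j, k=1):
    """Pease throughput from Table 2: 2jk / (N log2 N)."""
    stages = n.bit_length() - 1  # log2(N) for a power-of-two N
    return Fraction(2 * j * k, n * stages)


def r22mdc_throughput(n, t):
    """R2²MDC vertical expansion throughput from Table 2: 2t / N."""
    return Fraction(2 * t, n)


def powers_of_two(lo, hi):
    """lo, 2*lo, 4*lo, ... up to and including hi."""
    values = []
    v = lo
    while v <= hi:
        values.append(v)
        v *= 2
    return values


for n, t_max in ((256, 32), (1024, 16)):
    pease_points = [(j, float(pease_throughput(n, j))) for j in powers_of_two(8, 128)]
    r22_points = [(t, float(r22mdc_throughput(n, t))) for t in powers_of_two(1, t_max)]
    print("N =", n)
    print("  Pease  (j, throughput):", pease_points)
    print("  R2²MDC (t, throughput):", r22_points)
```

For N=256, Pease with j=8 and R2²MDC with t=1 both give 1/128, about 0.0078. For N=1024, the first points are 1/640 (about 0.0016) for Pease and 1/512 (about 0.002) for R2²MDC.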

[Plot: area (gate counts) versus throughput, FFT length N=256; series: Pease, R2²MDC]

Figure 43 Relation between throughput and area for Pease and R2²MDC, N=256

[Plot: area (gate counts) versus throughput (axis from 0 to 0.035), FFT length N=1024; series: Pease, R2²MDC]

Figure 44 Relation between throughput and area for Pease and R2²MDC, N=1024

Figure 45 shows the joint result of Figure 41 and Figure 43, and Figure 46 shows the corresponding joint result for N=1024.

[Plot: area (gate counts) versus throughput, FFT length N=256; series: Pease, R2MDC/R2²MDC]

Figure 45 Relation between throughput and area for Pease and R2MDC/R2²MDC, N=256

[Plot: area (gate counts) versus throughput (axis from 0 to 0.035), FFT length N=1024; series: Pease, R2MDC/R2²MDC]

Figure 46 Relation between throughput and area for Pease and R2MDC/R2²MDC, N=1024

For FFT lengths of 256 and 1024, the generated FFT processor saves about 30.8% area compared with the Pease architecture under throughput constraints, as shown in Table 4.

Table 4 Area comparison

FFT Length (N)   Pease                   R2²MDC                  Area Reduction
                 Throughput   Area       Throughput   Area       Percentage (%)
256              0.0078       190524     0.0078       128033     32.8
                 0.0156       307040     0.0156       202469     34.06
                 0.0313       533357     0.0313       350469     34.29
                 0.0625       1044244    0.0625       641511     38.57
1024             0.0016       434154     0.002        313669     27.75
                 0.0031       565576     0.0039       417760     26.14
                 0.0063       825269     0.0078       623772     24.42
                 0.0125       1314636    0.0156       1029338    21.70
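The reduction column of Table 4 can be recomputed from the two area columns as (Pease area − R2²MDC area) / Pease area. The following sketch does that arithmetic; the variable and function names are illustrative only, and the gate counts are copied from the table.

```python
# (FFT length N, Pease area, R2²MDC area) in gate counts, copied from Table 4.
TABLE4 = [
    (256, 190524, 128033),
    (256, 307040, 202469),
    (256, 533357, 350469),
    (256, 1044244, 641511),
    (1024, 434154, 313669),
    (1024, 565576, 417760),
    (1024, 825269, 623772),
    (1024, 1314636, 1029338),
]


def area_reduction_percent(pease_area, r22mdc_area):
    """Area saved relative to the Pease design, in percent."""
    return 100.0 * (pease_area - r22mdc_area) / pease_area


for n, pease_area, r22mdc_area in TABLE4:
    reduction = area_reduction_percent(pease_area, r22mdc_area)
    print(f"N={n}: {pease_area} -> {r22mdc_area} gates, {reduction:.2f}% smaller")
```

Note that for N=1024 the paired rows do not have identical throughput: each R2²MDC design runs slightly faster than the Pease design it is compared with.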

Chapter 5 Conclusions and Future Work

The FFT processor is an important computing block in communication and signal processing systems. To improve productivity and shorten time-to-market, an automatic FFT generator can be used to design a specified FFT processor. In this thesis, we propose a parameterizable FFT generator with two approaches that make a good design trade-off between throughput and area under the design constraints. First, the vertical expansion approach parallelizes the datapath to increase the throughput. Second, the horizontal compression approach folds the datapath to reduce the hardware usage. In addition, our FFT generator produces only the best FFT architecture under the user-specified throughput constraint, which reduces the computation time. For FFT lengths of 256 and 1024, the generated FFT processor saves about 30.8% area compared with the Pease architecture under throughput constraints.

Various FFT architectures have been proposed in the literature, and they can be incorporated into our FFT generator. In the future, more FFT algorithms, such as the R2³MDC FFT algorithm and the mixed-radix FFT algorithm [17], will be considered to enlarge the search space. In addition, the bit-width optimization techniques proposed in [18] will be considered.

References

[1] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series,” Math. Computation, vol. 19, pp. 297-301, April 1965.

[2] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Prentice-Hall, Inc., 1975.

[3] E. H. Wold and A. M. Despain, “Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementation,” IEEE Trans. Computers, vol. 33, no. 5, pp. 414-426, May 1984.

[4] A. M. Despain, “Fourier Transform Computer using CORDIC Iterations,” IEEE Trans. Comput., vol. C-23, no. 10, pp. 993-1001, Oct. 1974.

[5] S. He and M. Torkelson, “A New Approach to Pipeline FFT Processor,” in Proc. 10th Int’l Parallel Processing Symp. (IPPS ’96), pp. 766-770, 1996.

[6] R. Storn, “Radix-2 FFT-pipeline Architecture with Reduced Noise-to-signal Ratio,” IEE Proceedings - Vision, Image and Signal Processing, vol. 141, pp. 81-86, 1994.

[7] S. He and M. Torkelson, “Designing Pipeline FFT Processor for OFDM (de)Modulation,” International Symposium on Signals, Systems, and Electronics, pp. 257-262, Oct. 1998.

[8] P. Duhamel and H. Hollmann, “Split Radix FFT Algorithm,” Electronics Letters, vol. 20, pp. 14-16, January 1984.

[9] P. Duhamel and H. Hollmann, “Split Radix FFT Algorithm,” Electronics Letters, vol. 20, pp. 14-16, Jan. 5, 1984.

[11] G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel, “Automatic Generation of Customized Discrete Fourier Transform IPs,” In Proc. of ACM/IEEE Design Automation Conf., pp. 471-474, 2005.

[12] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, “Fast and Accurate Resource Estimation of Automatically Generated Custom DFT IP Cores,” In Proc. of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 211-220, 2006.

[13] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Püschel, “Formal Datapath Representation and Manipulation for Implementing DSP Transforms,” In Proc. of ACM/IEEE Design Automation Conf., pp. 385-390, 2008.

[14] J. Takala, T. Järvinen, P. Salmela, and D. Akopian, “Multi-port Interconnection Networks for Radix-r Algorithms,” In Proc. IEEE International Conference Acoustics, Speech, Signal Processing, pp. 1177-1180, 2001.

[15] Synopsys DesignWare [Online], Available: http://www.synopsys.com .

[16] Matlab [Online], Available: http://www.mathworks.com .

[17] R. C. Singleton, “An Algorithm for Computing the Mixed Radix Fast Fourier Transform,” IEEE Trans. Audio Electroacoustics, vol. 17, no. 2, pp. 93-103, June 1969.

[18] C. Y. Wang, C. B. Kuo, and J. Y. Jou, “Hybrid Word-Length Optimization Methods of Pipelined FFT Processors,” IEEE Trans. Computers, vol. 56, no. 8, pp. 1105-1118, Aug. 2007.

[19] P.D. Welch, “A Fixed-Point Fast Fourier Transform Error Analysis,” IEEE Trans. Audio Electroacoustics, vol. 17, pp. 151-157, June 1969.

[20] A. Pomerleau, H.L. Buijs, and M. Fournier, “A Two-Pass Fixed Point Fast Fourier Transform Error Analysis,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 25, pp. 582-585, Dec. 1977.
