Chapter 3 Proposed Approach
3.4 Summary
In Chapter 3, we proposed trade-off approaches in two directions, based on the R22MDC and R2MDC architectures. In the vertical direction, we provide an expansion approach for the R22MDC architecture to increase the throughput; in the horizontal direction, we provide a compression approach for the R2MDC architecture to lower the throughput in exchange for less hardware. Under a throughput constraint, our approach produces exactly one solution, whereas the previous work discussed in Chapter 2 searches for the desired solution exhaustively. Table 2 lists the hardware and throughput comparison between our approach and previous work. Table 3 lists the hardware comparison at the same throughput, obtained by replacing jk with t log2 N.
Table 2 Hardware Requirement Comparison
Architecture (FFT length N)   Multipliers        Adders       Registers   Throughput
Pease                         jk                 2jk          N           2jk / (N log2 N)
R22MDC_P                      t(2 log4 N - 2)    2t log2 N    N - 2t      2t / N
R2MDC_F                       t log2 N           2t log2 N    N           2t / N
Table 3 Hardware Requirement Comparison with the same throughput
Architecture (FFT length N)   Multipliers        Adders       Registers   Throughput
Pease                         t log2 N           2t log2 N    N           2t / N
R22MDC_P                      t(2 log4 N - 2)    2t log2 N    N - 2t      2t / N
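As a rough illustration of how the Pease and R22MDC_P rows of Table 2 and Table 3 relate, the following Python sketch evaluates the closed-form counts. The function names and dictionary keys are hypothetical, N is assumed to be a power of two, and the counts come from the table formulas, not from the synthesized designs.

```python
def log2_exact(n):
    """Return log2(n), assuming n is a power of two (illustrative helper)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    return n.bit_length() - 1


def pease_cost(n, j, k=1):
    """Counts for the Pease row of Table 2 with parameters j and k."""
    jk = j * k
    return {
        "multipliers": jk,
        "adders": 2 * jk,
        "registers": n,
        "throughput": 2 * jk / (n * log2_exact(n)),
    }


def r22mdc_p_cost(n, t):
    """Counts for the R22MDC_P row with vertical expansion factor t."""
    stages = log2_exact(n)
    return {
        # t(2 log4 N - 2) equals t(log2 N - 2) because 2 log4 N = log2 N
        "multipliers": t * (stages - 2),
        "adders": 2 * t * stages,
        "registers": n - 2 * t,
        "throughput": 2 * t / n,
    }


if __name__ == "__main__":
    n, t = 256, 1
    jk = t * log2_exact(n)  # the substitution jk = t log2 N used for Table 3
    print(pease_cost(n, jk))
    print(r22mdc_p_cost(n, t))
```

For N=256 and t=1, the substitution gives jk=8. Both architectures then reach a throughput of 2/256, while the formulas yield 8 multipliers for Pease and 6 for R22MDC_P.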
Chapter 4 Experiments
4.1 Experimental Environment
We implement two kinds of FFT architectures: the R22MDC vertical expansion architecture and the R2MDC horizontal compression architecture. Each PE stage of the FFT architecture is pipelined. A complex adder contains two real adders, and a complex multiplier contains four real multipliers and two real adders, as shown in Figure 40. For each complex multiplier, we design a ROM that contains all the possible twiddle factor values for that multiplier.
Figure 40 Complex Multiplier
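The structure described for Figure 40 can be sketched in software as follows. This is a minimal illustration of a complex multiplication built from four real multiplications and two real additions (one of them a subtraction); the function name and argument order are hypothetical and do not reflect the actual hardware description.

```python
def complex_multiply(a_re, a_im, b_re, b_im):
    """Multiply (a_re + j*a_im) by (b_re + j*b_im) with 4 real multiplies."""
    # Four real multiplications
    p_rr = a_re * b_re
    p_ii = a_im * b_im
    p_ri = a_re * b_im
    p_ir = a_im * b_re
    # Two real additions (the real part uses a subtraction)
    out_re = p_rr - p_ii
    out_im = p_ri + p_ir
    return out_re, out_im


if __name__ == "__main__":
    # (1 + 2j) * (3 + 4j) = -5 + 10j
    print(complex_multiply(1, 2, 3, 4))
```

In the hardware, the second operand would be a twiddle factor read from the ROM attached to the multiplier.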
The logic gate model includes adders, multipliers, and multiplexers. We use the UMC 0.18 µm cell library and Synopsys DesignWare [15] to synthesize the designs at a 100 MHz clock rate. The platform is an Intel dual Pentium Xeon machine at 2.5 GHz with 32 GB of main memory, running Linux.
We use Matlab [16] to generate random inputs and calculate the SQNR to verify the correctness of the generated FFT architecture. The SQNR in our simulations is between 80 dB and 90 dB.
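For reference, an SQNR check of this kind can be sketched as below. The sketch assumes the common definition of SQNR as the ratio of reference signal power to quantization error power, expressed in dB. The text does not state the exact formula used in the Matlab flow, so the definition and the function name are assumptions.

```python
import math


def sqnr_db(reference, measured):
    """SQNR in dB between a reference output and a fixed-point output.

    Assumes SQNR = 10 * log10(sum |ref|^2 / sum |ref - measured|^2).
    """
    signal_power = sum(abs(r) ** 2 for r in reference)
    noise_power = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    if noise_power == 0:
        return math.inf  # outputs are identical
    return 10 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    ref = [1.0, 1.0]
    out = [1.0, 0.9]  # second sample carries an error of 0.1
    print(round(sqnr_db(ref, out), 2))  # about 23.01 dB
```

Here `reference` would hold the floating-point FFT outputs and `measured` the outputs of the generated fixed-point architecture for the same random inputs.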
4.2 Experimental Results
Figure 41 shows the relation between throughput and area for N=256, where area is the gate count. For Pease, three architectures are generated; from left to right, the parameters are j=1, j=2, and j=4, respectively. For all architectures, we assume k=1. For R2MDC, three architectures are generated; from left to right, the parameters are 1 [...] same as the area of R2MDC under the same throughput. From Table 3, we can find that the hardware requirement is also the same under the same throughput. Figure 42 shows the relation between throughput and area for N=1024.
Figure 41 Relation between throughput and area for Pease and R2MDC, N=256
Figure 42 Relation between throughput and area for Pease and R2MDC, N=1024
Figure 43 shows the relation between throughput and area for N=256. For Pease, five architectures are generated; from left to right, the parameters are j=8, j=16, …, and j=128, respectively. For R22MDC, six architectures are generated; from left to right, the parameters are t=1, t=2, …, and t=32, respectively. The area of Pease is much larger than the area of the R22MDC vertical expansion architectures under the same throughput, because Pease uses a larger number of multipliers. This can also be seen in Table 3. Figure 44 shows the relation between throughput and area for N=1024. For Pease, five architectures are generated; from left to right, the parameters are j=8, j=16, …, and j=128, respectively. For R22MDC, five architectures are generated; from left to right, the parameters are t=1, t=2, …, and t=16, respectively. The area of Pease is still much larger than the area of the R22MDC vertical expansion architectures under the same throughput.
Figure 43 Relation between throughput and area for Pease and R22MDC, N=256
Figure 44 Relation between throughput and area for Pease and R22MDC, N=1024
Figure 45 shows the joint result of Figure 41 and Figure 43, and Figure 46 shows the joint result of Figure 42 and Figure 44.
Figure 45 Relation between throughput and area for Pease and R2MDC/R22MDC, N=256
Figure 46 Relation between throughput and area for Pease and R2MDC/R22MDC, N=1024
Compared with the Pease architecture, the generated FFT processor saves about 30.8% area under throughput constraints for FFT lengths of 256 and 1024, as shown in Table 4.
Table 4 Area comparison
FFT Length (N)   Pease Throughput   Pease Area   R22MDC Throughput   R22MDC Area   Area Reduction Percentage (%)
256              0.0078             190524       0.0078              128033        32.80
256              0.0156             307040       0.0156              202469        34.06
256              0.0313             533357       0.0313              350469        34.29
256              0.0625             1044244      0.0625              641511        38.57
1024             0.0016             434154       0.002               313669        27.75
1024             0.0031             565576       0.0039              417760        26.14
1024             0.0063             825269       0.0078              623772        24.42
1024             0.0125             1314636      0.0156              1029338       21.70
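The last column of Table 4 can be reproduced from the two area columns. The short Python sketch below recomputes it, assuming the reduction is defined as (Pease area - R22MDC area) / Pease area. The variable and function names are illustrative only.

```python
# (FFT length, Pease area, R22MDC area, listed reduction in %) from Table 4
TABLE_4 = [
    (256, 190524, 128033, 32.80),
    (256, 307040, 202469, 34.06),
    (256, 533357, 350469, 34.29),
    (256, 1044244, 641511, 38.57),
    (1024, 434154, 313669, 27.75),
    (1024, 565576, 417760, 26.14),
    (1024, 825269, 623772, 24.42),
    (1024, 1314636, 1029338, 21.70),
]


def area_reduction_percent(pease_area, r22mdc_area):
    """Relative area saving of R22MDC over Pease, in percent."""
    return 100.0 * (pease_area - r22mdc_area) / pease_area


if __name__ == "__main__":
    for n, pease_area, r22mdc_area, listed in TABLE_4:
        computed = area_reduction_percent(pease_area, r22mdc_area)
        print(f"N={n}: computed {computed:.2f}%, listed {listed:.2f}%")
```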
Chapter 5 Conclusions and Future Work
The FFT processor is an important computing block in communication and signal processing systems. To improve productivity and shorten time-to-market, an automatic FFT generator can be used to design a specified FFT processor. In this thesis, we propose a parameterizable FFT generator with two approaches that make good design trade-offs between throughput and area under the design constraints. First, the vertical expansion approach parallelizes the datapath to increase the throughput. Second, the horizontal compression approach folds the datapath to reduce the hardware usage. In addition, to reduce the computation time of the proposed FFT generator, only the best FFT architecture is generated under the user-specified throughput constraint. Compared with the Pease architecture, the generated FFT processor saves about 30.8% area under throughput constraints for FFT lengths of 256 and 1024.
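To illustrate why a single architecture can be picked directly instead of searching, the sketch below chooses a vertical expansion factor from a throughput constraint. It is not the generator's actual selection procedure. It assumes the R22MDC throughput 2t/N from Table 3, restricts t to powers of two as in the experiments, and covers only vertical expansion (t >= 1). The function name is hypothetical.

```python
def smallest_expansion_factor(n, target_throughput):
    """Smallest power-of-two t >= 1 such that 2*t/n >= target_throughput.

    Returns None when even t = n/2 (throughput 1) cannot meet the target.
    Illustrative only: assumes area grows with t, so the smallest
    feasible t is taken as the preferred design point.
    """
    t = 1
    while t <= n // 2:
        if 2 * t / n >= target_throughput:
            return t
        t *= 2
    return None


if __name__ == "__main__":
    for target in (0.0016, 0.0031, 0.0063, 0.0125):
        t = smallest_expansion_factor(1024, target)
        print(f"target {target}: t={t}, throughput {2 * t / 1024:.4f}")
```

With N=1024, the four Pease throughputs listed in Table 4 map to t=1, 2, 4, and 8, whose throughputs 2t/N round to the R22MDC throughput values in the same rows.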
Various FFT architectures have been proposed in the literature, and they can be incorporated into our proposed FFT generator. In the future, more FFT algorithms, such as the R23MDC FFT algorithm and the mixed-radix FFT algorithm [17], will be considered to enlarge the search space. In addition, the bitwidth optimization techniques proposed in [18] will also be considered.
References
[1] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series,” Math. Computation, vol. 19, pp. 297-301, April 1965.
[2] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Prentice-Hall, Inc., 1975.
[3] E. H. Wold and A. M. Despain, “Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementation,” IEEE Trans. Computers, vol. 33, no. 5, pp. 414-426, May 1984.
[4] A. M. Despain, “Fourier Transform Computer using CORDIC Iterations,” IEEE Trans. Computers, vol. C-23, no. 10, pp. 993-1001, Oct. 1974.
[5] S. He and M. Torkelson, “A New Approach to Pipeline FFT Processor,” in Proc. 10th Int’l Parallel Processing Symp. (IPPS ’96), pp. 766-770, 1996.
[6] R. Storn, “Radix-2 FFT-pipeline Architecture with Reduced Noise-to-signal Ratio,” IEE Proceedings - Vision, Image and Signal Processing, vol. 141, pp. 81-86, 1994.
[7] S. He and M. Torkelson, “Designing Pipeline FFT Processor for OFDM (de)Modulation,” International Symposium on Signals, Systems, and Electronics, pp. 257-262, Oct. 1998.
[8] P. Duhamel and H. Hollmann, “Split Radix FFT Algorithm,” Electronics Letters, vol. 20, pp. 14-16, Jan. 1984.
[9] P. Duhamel, and H. Hollmann, “Split Radix FFT Algorithm,” Electronics Letters, vol. 20, pp. 14-16, Jan. 5, 1984.
[11] G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel, “Automatic Generation of Customized Discrete Fourier Transform IPs,” In Proc. of ACM/IEEE Design Automation Conf., pp. 471-474, 2005.
[12] P. A. Milder, M. Ahmad, J. C. Hoe, and M. Püschel, “Fast and Accurate Resource Estimation of Automatically Generated Custom DFT IP Cores,” In Proc. of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 211-220, 2006.
[13] P. A. Milder, F. Franchetti, J. C. Hoe, and M. Püschel, “Formal Datapath Representation and Manipulation for Implementing DSP Transforms,” In Proc. of ACM/IEEE Design Automation Conf., pp. 385-390, 2008.
[14] J. Takala, T. Jarvinen, P. Salmela, and D. Akopian, “Multi-port Interconnection Networks for Radix-r Algorithms,” In Proc. IEEE International Conference Acoustics, Speech, Signal Processing, pp. 1177-1180, 2001.
[15] Synopsys DesignWare [Online], Available: http://www.synopsys.com.
[16] Matlab [Online], Available: http://www.mathworks.com.
[17] R. C. Singleton, “An Algorithm for Computing the Mixed Radix Fast Fourier Transform,” IEEE Trans. Audio Electroacoustics, vol. 17, no. 2, pp. 93-103, June 1969.
[18] C. Y. Wang, C. B. Kuo, and J. Y. Jou, “Hybrid Word-Length Optimization Methods of Pipelined FFT Processors,” IEEE Trans. Computers, vol. 56, no. 8, pp. 1105-1118, Aug. 2007.
[19] P.D. Welch, “A Fixed-Point Fast Fourier Transform Error Analysis,” IEEE Trans. Audio Electroacoustics, vol. 17, pp. 151-157, June 1969.
[20] A. Pomerleau, H.L. Buijs, and M. Fournier, “A Two-Pass Fixed Point Fast Fourier Transform Error Analysis,” IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 25, pp. 582-585, Dec. 1977.