Chapter 4 The Proposed Approach
4.3 Restricted Number of Blocks
As mentioned for the convergent block scaling method, dynamic scaling scales the data only when it is necessary to avoid a loss of accuracy. Grouping the data into several blocks improves the SQNR because each block has its own exponent, so data with different dynamic ranges can be represented separately. It is therefore natural to expect that a larger number of blocks yields higher precision. However, convergent block scaling divides each block into two at every stage, from the first stage to the last. That is, the number of blocks, and with it the area of the exponent storage, doubles at each stage. For an N-point FFT, there are N/2 blocks in the last stage, so N/2 exponent units are required.
As a result, the storage cost is large. We therefore define Bmax = 2^(s-1) as the total number of blocks in convergent block scaling, where the number of blocks is doubled only until stage s. Fig. 33 shows convergent block scaling with different values of Bmax.
Fig. 33 The convergent block scaling with (a) Bmax = 1 (b) Bmax = 2 (c) Bmax = 4
Take as an example the 8192-point FFT with a 16-bit wordlength, using MRCBS and the S2-type detector with 10-bit comparators; its SQNR and area are shown in Fig. 34. The area of the exponent storage keeps increasing as Bmax grows, while the SQNR saturates. In other words, in the deeper stages, doubling the area of the exponent storage no longer yields the expected SQNR gain. If we divide the blocks only until stage 11, which requires just 1024 exponent units, the area overhead of the exponent storage is only 1/4 of that required when dividing until stage 13, yet the SQNR is only 0.13 dB lower. Thus, by doubling the number of blocks only until a certain stage rather than all the way to the last stage, we can obtain the desired SQNR improvement while economizing on exponent storage.
Although restricting the number of blocks does not give the highest achievable SQNR, it still provides an acceptable SQNR and reduces the area cost accordingly.
Fig. 34 The SQNR and area cost with different Bmax
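The block-doubling schedule with a restricted Bmax can be illustrated with a short Python sketch. It assumes, as described in this section, that stage k of the unrestricted scheme uses 2^(k-1) blocks and that each block needs one exponent unit. The function names blocks_per_stage and exponent_units are hypothetical and are not part of the MRCBS implementation.

```python
def blocks_per_stage(n_points, s_cap):
    """Number of blocks at stages 1..log2(N) when doubling stops at stage s_cap."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    stages = n_points.bit_length() - 1      # log2(N) radix-2 stages
    if not 1 <= s_cap <= stages:
        raise ValueError("s_cap must lie between 1 and log2(N)")
    b_max = 2 ** (s_cap - 1)                # Bmax = 2^(s-1)
    return [min(2 ** (k - 1), b_max) for k in range(1, stages + 1)]


def exponent_units(n_points, s_cap):
    """Exponent units needed, assuming one unit per block at the widest stage."""
    return max(blocks_per_stage(n_points, s_cap))


if __name__ == "__main__":
    # 8192-point FFT: stop doubling at stage 11 versus at the last stage (13).
    restricted = exponent_units(8192, 11)
    unrestricted = exponent_units(8192, 13)
    print(restricted, unrestricted, restricted / unrestricted)
```

For the 8192-point example, the sketch gives 1024 exponent units when doubling stops at stage 11 and 4096 when it continues to stage 13, which is the 1/4 ratio quoted above.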
Chapter 5
Experimental Results
The proposed MRCBS method generates many hardware solutions for SQNR improvement and finds the one that meets the SQNR constraint with the minimum area cost.
Here we define the performance pair (PP): (SQNR, AREA), which gives an SQNR performance together with its corresponding area cost. Each solution obtained by MRCBS has its own PP, denoted PPT: (SQNRT, AREAT), where SQNRT is the total SQNR performance and AREAT is the minimized total area cost.
The PPT is determined by the quintuple (N, WL, Type, BW, Bmax), where N is the given FFT size and WL is the storage wordlength, from 14 bits to 18 bits. Type indicates the type of detector: Type = Cj denotes the circular-type detectors and Type = Sj the square-type ones, where j = 2, 4, or 6. The Cj detector includes four multipliers with bit width BW, two adders with bit width 2*BW, and 2j comparators with a fixed bit width of 10, while the Sj detector includes 2j comparators with bit width BW. BW can be chosen from 5 to 10. The total number of blocks Bmax can be 2^(s-1), where s ranges from 1 to log2N.
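To make the size of this design space concrete, the following Python sketch enumerates the (WL, Type, BW, Bmax) combinations for a given N, using only the ranges stated above. The function name design_space and the tuple layout are assumptions made for illustration.

```python
from itertools import product


def design_space(n_points):
    """List every (WL, Type, BW, Bmax) configuration for an N-point FFT."""
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    stages = n_points.bit_length() - 1                  # log2(N)
    wordlengths = range(14, 19)                         # WL = 14..18 bits
    detector_types = ["C2", "C4", "C6", "S2", "S4", "S6"]
    bit_widths = range(5, 11)                           # BW = 5..10
    block_limits = [2 ** (s - 1) for s in range(1, stages + 1)]  # Bmax = 2^(s-1)
    return list(product(wordlengths, detector_types, bit_widths, block_limits))


if __name__ == "__main__":
    for n in (1024, 2048, 4096, 8192):
        print(n, len(design_space(n)))
```

Each N therefore has 5 * 6 * 6 * log2N candidate configurations, for example 1800 for N = 1024 and 2340 for N = 8192.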
In this work, we choose the radix-2 FFT for implementation, and the FFT size and the SQNR constraint are user defined. Our experimental results cover FFT sizes N = 1024, 2048, 4096, and 8192, with SQNR constraints ranging from 50 dB to 70 dB. Given the FFT size, we apply the MRCBS method and build tables of PPs by simulation and synthesis. Combining those tables yields many solutions, among which we find the one that meets the SQNR constraint with the minimum area overhead. In addition to our MRCBS scheme, the traditional forced scaling method [7] and the conditional scaling method [14] are implemented and compared with our approach.
The fixed-point FFT model is built in C++, and the SQNR performance is obtained by simulation with random input signals. The circuit is implemented with the TSMC 90 nm cell library and synthesized with Synopsys DesignWare at a 100 MHz clock rate. Both the C++ simulations and the Synopsys DesignWare synthesis are run on an Intel dual Pentium Xeon machine at 2.53 GHz with 50 GB of main memory.
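The C++ model itself is not reproduced here. As a rough illustration of how an SQNR figure can be derived from such a simulation, the Python sketch below uses the usual definition, 10 log10 of the ratio of signal power to quantization-noise power, comparing a reference output with a fixed-point output. The function name sqnr_db and the sample values are hypothetical, and the exact measurement procedure used in this work may differ.

```python
import math


def sqnr_db(reference, quantized):
    """SQNR in dB between a reference output and its fixed-point counterpart."""
    if len(reference) != len(quantized):
        raise ValueError("sequences must have the same length")
    signal_power = sum(abs(r) ** 2 for r in reference)
    noise_power = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    if noise_power == 0:
        return math.inf                     # no quantization error at all
    return 10.0 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    # Hypothetical complex FFT outputs: ideal values versus quantized values.
    ideal = [1.0 + 0.5j, -0.25 + 0.75j, 0.5 - 0.5j]
    fixed = [1.0 + 0.5j, -0.25 + 0.7421875j, 0.5078125 - 0.5j]
    print(round(sqnr_db(ideal, fixed), 2))
```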
5.1 The Solution Generated by MRCBS
The MRCBS scheme improves the SQNR in two ways. One is dividing the data into blocks with additional exponent storage; the other is adding the multi-region detector to the basic memory-based FFT design of [7], which is implemented with forced scaling.
Therefore, the total performance is the combination of PP+ and PPBase, as shown in (5.1). The operation of combining two PPs is shown in (5.2).
PPBase: (SQNRBase, AREABase) is the basic SQNR performance and original area cost of the traditional memory-based FFT. PP+ is the SQNR improvement and additional area overhead obtained from multi-region detection and convergent block scaling. PPx: (SQNRx, AREAx) is the additional SQNR performance obtained by multi-region detection, together with the extra area cost of the detector and predictor. PPy: (SQNRy, AREAy) is the additional SQNR performance obtained by block scaling, together with the extra area cost of the exponent array. We can therefore obtain these three performance pairs separately and combine them to acquire the PPTs. The simulation results for these PPs are presented in the following subsections.
$$PP_T = PP_x \oplus PP_y \oplus PP_{Base} \tag{5.1}$$
$$PP_1 \oplus PP_2 = (SQNR_1 + SQNR_2,\ AREA_1 + AREA_2) \tag{5.2}$$
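A minimal Python sketch of the combination follows, under the assumption that combining two PPs adds the SQNR terms (in dB) and the area terms componentwise. That is what the definitions of PPx and PPy as additional SQNR and extra area suggest. The class name PP, the function combine, and all numeric values are hypothetical placeholders, not measured results.

```python
from typing import NamedTuple


class PP(NamedTuple):
    """Performance pair: SQNR in dB and area in square micrometres."""
    sqnr_db: float
    area_um2: float


def combine(a: PP, b: PP) -> PP:
    """Combine two performance pairs by adding SQNR and area componentwise."""
    return PP(a.sqnr_db + b.sqnr_db, a.area_um2 + b.area_um2)


if __name__ == "__main__":
    pp_base = PP(50.0, 1000.0)   # placeholder baseline pair
    pp_x = PP(1.0, 100.0)        # placeholder detector/predictor offset
    pp_y = PP(2.0, 200.0)        # placeholder exponent-array offset
    pp_t = combine(combine(pp_x, pp_y), pp_base)
    print(pp_t)
```

Because both components are plain sums, the order in which the three pairs are combined does not matter.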
5.1.1 Performance Pair of the Forced Scaling FFT
We define PPBase: (SQNRBase, AREABase) as the performance pair of the traditional FFT design [7]. The PPBase values, which are obtained by SQNR simulation and hardware synthesis and are determined by (N, WL), are shown in Table 5.
Table 5 The PPBase determined by (N, WL): (a) SQNRBase (dB) (b) AREABase (µm²)
5.1.2 Improvement from Multi-Region Detection
To determine the effect of the detector and predictor on SQNR and area, we fix the number of blocks at Bmax = 1 and the wordlength at WL = 16. That is, the PPs are determined by (N, WL = 16, Type, BW, Bmax = 1) through simulation and synthesis. Since we want the SQNR improvement and the additional area produced by multi-region detection relative to the traditional FFT, denoted PPx, these PPs are offset by the PPBase for (N, WL = 16) given in Table 5. The SQNRx values for N = 1024, 2048, 4096, and 8192 are presented in Table 6(a), (b), (c), and (d), respectively. Since the area of the detector and predictor is the same for every N, the AREAx values of the PPx's are shown only once, in Table 7.
By simulation, the SQNRx saturates when BW is larger than 10, so BW is chosen only from 5 to 10 for the six types of detectors.
Table 6 The SQNRx (dB) of the PPx for (a) 1024-point (b) 2048-point (c) 4096-point (d) 8192-point FFT
Table 7 The AREAx (µm²) of the PPx
5.1.3 Improvement from Convergent Block Scaling
To determine the relationship between the total number of blocks Bmax and the area and SQNR performance, we fix BW = 10 and WL = 16 and obtain the PPs by simulation and synthesis. These PPs are offset by the Bmax = 1 case to obtain the additional SQNR and area cost produced by the block scaling scheme with shared exponents, denoted PPy. That is, PPy is obtained from (N, WL = 16, Type, BW = 10, Bmax). Table 8(a), (b), (c), and (d) show the SQNRy of PPy for N = 1024, 2048, 4096, and 8192, respectively. Because AREAy consists of the exponent storage and the control circuits for exponent accesses, it depends only on Bmax and N. Therefore, AREAy is shown only once, in the second row of each table.
A larger Bmax implies more exponent storage and hence a larger area. In addition, the control circuit that accesses the exponent units becomes more complicated as N grows, so the AREAy of the 8192-point FFT is larger than that of the 1024-point FFT for the same Bmax.
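The offsets used in Sections 5.1.2 and 5.1.3 can be written out explicitly. The notation SQNR(N, WL, Type, BW, Bmax) and AREA(N, WL, Type, BW, Bmax) for the simulated SQNR and synthesized area of one configuration is introduced here only for illustration and is an assumption about how the offsets are formed:

```latex
\begin{align*}
SQNR_x &= SQNR(N, 16, Type, BW, 1) - SQNR_{Base}(N, 16), \\
AREA_x &= AREA(N, 16, Type, BW, 1) - AREA_{Base}(N, 16), \\
SQNR_y &= SQNR(N, 16, Type, 10, B_{max}) - SQNR(N, 16, Type, 10, 1), \\
AREA_y &= AREA(N, 16, Type, 10, B_{max}) - AREA(N, 16, Type, 10, 1).
\end{align*}
```

With these definitions, PPy is zero for Bmax = 1, so it captures only the contribution of the additional blocks.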
Table 8 The PPy (dB, µm²) for (a) 1024-point (b) 2048-point (c) 4096-point (d) 8192-point FFT
5.1.4 Performance Pair Combination
To obtain the total area and total SQNR performance PPT, we combine PPx, PPy, and PPBase as shown in (5.1). PPBase can be found in Table 5, PPx in Table 6 and Table 7, and PPy in Table 8. Although WL is fixed to 16 in PPx and PPy, we found that WL has little effect on these results, so we assume that different values of WL give the same results. Consequently, given the FFT size N, we combine PPx, PPy, and PPBase with WL from 14 to 18 to obtain 5 (WL) * 6 (Type) * 6 (BW) * log2N (Bmax) PPTs.
Among these PPTs, some produce the same SQNRT with different AREATs. We therefore delete every PPT that has a larger AREAT but no higher SQNRT than another PPT, keeping only the irreplaceable PPTs. Consequently, in each 6 dB range, we have 40 PPTs to choose from to satisfy the SQNR constraint.
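The removal of replaceable PPTs amounts to keeping the non-dominated (SQNRT, AREAT) pairs. The Python sketch below shows one way to do this, assuming each PPT is a plain (sqnr, area) tuple. The function name prune_dominated and the sample values are hypothetical.

```python
import math


def prune_dominated(points):
    """Keep the (sqnr, area) pairs that no cheaper or equal-area pair matches in SQNR."""
    kept = []
    best_sqnr = -math.inf
    # Visit candidates from smallest area to largest; for equal areas,
    # visit the higher SQNR first so that the weaker one is dropped.
    for sqnr, area in sorted(points, key=lambda p: (p[1], -p[0])):
        if sqnr > best_sqnr:
            kept.append((sqnr, area))
            best_sqnr = sqnr
    return kept


if __name__ == "__main__":
    candidates = [(60.0, 100.0), (58.0, 120.0), (62.0, 150.0),
                  (62.0, 140.0), (55.0, 90.0)]
    print(prune_dominated(candidates))
```

For the sample candidates, (58.0, 120.0) is dropped because (60.0, 100.0) is both smaller and better, and (62.0, 150.0) is dropped because (62.0, 140.0) reaches the same SQNR with less area. The surviving list is ordered by increasing area and increasing SQNR.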
In addition, our PPTs include the solutions obtained by the conditional scaling scheme in [14]. Those solutions are the special cases determined by (N, WL, Type = S2, BW = 10, Bmax = 1). In Fig. 35, Fig. 36, Fig. 37, and Fig. 38, for N = 1024, 2048, 4096, and 8192 respectively, the black dots are the PPTs obtained by the proposed MRCBS method, the gray diamonds are the solutions obtained by the scheme in [14], and the triangles are the solutions obtained by the scheme in [7], which are the PPBases.
Fig. 35 The PPTs for 1024-point FFT generated by MRCBS
Fig. 36 The PPTs for 2048-point FFT generated by MRCBS
Fig. 37 The PPTs for 4096-point FFT generated by MRCBS
Fig. 38 The PPTs for 8192-point FFT generated by MRCBS
5.2 Area Minimization under SQNR Constraint
Among the irreplaceable PPTs for a given FFT size, a higher SQNRT always comes with a larger AREAT. Therefore, we sort the PPTs by SQNRT in ascending order and search for the first SQNRT that satisfies the requirement. The PPT found in this way is the solution with the smallest AREAT.
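This search can be sketched in Python as follows, assuming the irreplaceable PPTs are given as (sqnr, area) tuples in which area increases with SQNR. The function name select_min_area and the sample values are hypothetical, and the binary search from the standard bisect module is just one convenient way to find the first entry that meets the constraint.

```python
from bisect import bisect_left


def select_min_area(irreplaceable, sqnr_constraint_db):
    """Return the cheapest (sqnr, area) pair meeting the constraint, or None."""
    ordered = sorted(irreplaceable)              # ascending SQNR, hence ascending area
    sqnrs = [sqnr for sqnr, _ in ordered]
    index = bisect_left(sqnrs, sqnr_constraint_db)  # first SQNR >= constraint
    return ordered[index] if index < len(ordered) else None


if __name__ == "__main__":
    ppts = [(55.0, 90.0), (60.0, 100.0), (62.0, 140.0)]
    print(select_min_area(ppts, 58.0))
    print(select_min_area(ppts, 63.0))
```

With the sample data, a 58 dB constraint selects (60.0, 100.0), and a 63 dB constraint returns None because no listed solution reaches it.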
Table 9, Table 10, Table 11, and Table 12 show 8 different SQNR requirements for FFT sizes N = 1024, 2048, 4096, and 8192, respectively. Under each constraint, the solution gives the required wordlength, the type of detector, the bit width in the detector, and the total number of blocks. The exact SQNR is obtained by simulation and is almost equal to the SQNRT estimated by the MRCBS method. If the previous work has area cost K, the area reduction is computed as (K - AREAT) / K. Compared with the traditional FFT implemented with forced scaling, our method reduces the area cost by 12.61% for N = 1024 and by 23.57% for N = 8192 in the best case.
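The area reduction formula is simple enough to state as a one-line Python helper. The function name area_reduction and the numbers in the usage lines are hypothetical placeholders, not values from Tables 9 to 12.

```python
def area_reduction(previous_area, area_t):
    """Relative area saving (K - AREA_T) / K of a solution against previous work."""
    if previous_area <= 0:
        raise ValueError("previous_area must be positive")
    return (previous_area - area_t) / previous_area


if __name__ == "__main__":
    # Placeholder areas in square micrometres: previous work K versus AREA_T.
    saving = area_reduction(200000.0, 150000.0)
    print(f"{saving:.2%}")
```

A negative result would indicate that the solution is larger than the previous work it is compared with.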
Conditional scaling is known to perform better than forced scaling. Nevertheless, in cases where the conditional scaling scheme just meets the constraint, our method can reduce the wordlength by one bit to save memory area. If the constraint becomes tighter, so that the conditional scaling scheme has to add one bit to meet it, our method instead uses more blocks or a more precise detector to meet the requirement while keeping the wordlength unchanged. In that case, we save 2 bits of wordlength. Moreover, with a larger FFT, the memory area taken by 2 bits of wordlength becomes larger. Accordingly, we reduce the area cost by 6.34% for N = 1024 but by 12.84% for N = 8192.
Table 9 The solutions under the SQNR constraints for 1024-point FFT
Table 10 The solutions under the SQNR constraints for 2048-point FFT
Table 11 The solutions under the SQNR constraints for 4096-point FFT
Table 12 The solutions under the SQNR constraints for 8192-point FFT
Chapter 6
Conclusions and Future Works
In this thesis, we propose a scaling scheme for the memory-based FFT design that improves the SQNR in an area-efficient way. The method takes advantage of both conditional scaling and convergent block scaling. By using different detectors and different numbers of shared exponents, it generates many solutions with different SQNR and area performance. Moreover, by applying this method, the SQNR requirement can be satisfied with only an economical increase in area.
The experimental results show that the method saves at least one bit of wordlength, reducing the area by about 5.6% relative to the previous conditional scaling method. If the constraint is slightly tighter, our method can satisfy the required SQNR with a small increase in area rather than the additional bit of wordlength needed by previous approaches. As a result, the proposed scheme saves 2 bits of wordlength, which brings about a 13% area reduction over the conditional scaling scheme for the 8192-point FFT in the best case.
In the future, the multi-region detection and the convergent block scaling method can be improved to optimize the SQNR and the area of the FFT core for different architectures and different algorithms.
References
[1] C. T. Lin, Y. C. Yu, L. D. Van, “A low-power 64-point FFT/IFFT design for IEEE 802.11a WLAN application,” IEEE International Symposium on Circuits and Systems, pp. 4523-4526, 2006.
[2] R. V. Nee, R. Prasad, OFDM for Multimedia Communications, Artech House, 2000.
[3] ETSI, “Digital Video Broadcasting (DVB); Framing Structure, Channel Coding and Modulation for Digital Terrestrial Television,” ETSI EN 300 744 v1.4.1, 2001.
[4] E. Grass, K. Tittelbach, U. Jagdhold, A. Troya, G. Lippert, O. Kruger, J. Lehmann, K. Maharatna, K. Dombrowski, N. Fiebig, R. Kraemer, P. Mahonen, “On the single-chip implementation of a Hiperlan/2 and IEEE 802.11a capable modem,” IEEE Personal Communications, vol. 8, no. 6, pp. 48-57, 2001.
[5] S. Li, H. Xu, W. Fan, Y. Chen, X. Zeng, “A 128/256-point pipeline FFT/IFFT processor for MIMO OFDM system IEEE 802.16e,” IEEE International Symposium on Circuits and Systems, pp. 1488-1491, 2010.
[6] C. Y. Wang, C. B. Kuo, J. Y. Jou, “Hybrid Wordlength Optimization Methods of Pipelined FFT Processors,” IEEE Transactions on Computers, vol. 56, no. 8, pp. 1105-1118, 2007.
[7] A. V. Oppenheim, C. J. Weinstein, “Effects of finite register length in digital filtering and the fast Fourier transform,” Proceedings of the IEEE, vol. 60, no. 8, pp. 957-976, 1972.
[8] Y. W. Lin, H. Y. Liu, C. Y. Lee, “A dynamic scaling FFT processor for DVB-T applications,” IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 2005-2013, 2004.
[9] S. N. Tang, J. W. Tsai, T. Y. Chang, “A 2.4-GS/s FFT Processor for OFDM-Based WPAN Applications,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, no. 6, pp. 451-455, 2010.
[10] Y. Chen, Y. C. Tsao, Y. W. Lin, C. H. Lin, C. Y. Lee, “An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 55, no. 2, pp. 146-150, 2008.
[11] S. Ramakrishnan, J. Balakrishnan, K. Ramasubramanian, “Exploiting signal and noise statistics for fixed point FFT design optimization in OFDM systems,” National Conference on Communications (NCC), pp. 1-5, 2010.
[12] E. Bidet, D. Castelain, C. Joanblanq, P. Senn, “A fast single-chip implementation of 8192 complex point FFT,” IEEE Journal of Solid-State Circuits, vol. 30, no. 3, pp. 300-305, 1995.
[13] R. R. Shively, “A Digital Processor to Generate Spectra in Real Time,” IEEE Transactions on Computers, vol. C-17, no. 5, pp. 485-491, 1968.
[14] R. Koutsoyannis, P. Milder, C. R. Berger, M. Glick, J. C. Hoe, M. Puschel, “Improving Fixed-point Accuracy of FFT Cores in O-OFDM Systems,” IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
[15] J. W. Cooley, J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of Computation, vol. 19, pp. 297-301, 1965.
[16] W. C. Yeh, C. W. Jen, “High-speed and low-power split-radix FFT,” IEEE Transactions on Signal Processing, vol. 51, no. 3, pp. 864-874, 2003.
[17] Y. W. Lin, H. Y. Liu, C. Y. Lee, “A 1-GS/s FFT/IFFT processor for UWB applications,” IEEE Journal of Solid-State Circuits, vol. 40, no. 8, pp. 1726-1735, 2005.
[18] R. C. Agarwal, J. W. Cooley, “Vectorized mixed radix discrete Fourier transform algorithms,” Proceedings of the IEEE, vol. 75, no. 9, pp. 1283-1292, 1987.
[19] P. Y. Tsai, C. Y. Lin, “A Generalized Conflict-Free Memory Addressing Scheme for Continuous-Flow Parallel-Processing FFT Processors With Rescheduling,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 12, pp. 2290-2302, 2011.
[20] D. Reisis, N. Vlassopoulos, “Conflict-Free Parallel Memory Accessing Techniques for FFT Architectures,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 55, no. 11, pp. 3438-3447, 2008.
[21] Y. W. Lin, “The study of FFT processors for OFDM systems,” Ph. D. thesis, Dept. of Electronic Engineering, National Chiao Tung University, Hsinchu, R.O.C., 2004.
[22] S. Lee, S. C. Park, “Modified SDF Architecture for Mixed DIF/DIT FFT,” IEEE International Symposium on Circuits and Systems, pp. 2590-2593, 2007.
[23] A. Cortes, I. Velez, J. F. Sevillano, “Radix r^k FFTs: Matricial Representation and SDC/SDF Pipeline Implementation,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2824-2839, 2009.
[24] B. C. Lin, Y. H. Wang, J. D. Huang, J. Y. Jou, “Expandable MDC-based FFT architecture and its generator for high-performance applications,” IEEE International SOC Conference (SOCC), pp. 188-192, 2010.
[25] E. Bidet, D. Castelain, C. Joanblanq, P. Senn, “A fast single-chip implementation of 8192 complex point FFT,” IEEE Journal of Solid-State Circuits, vol. 30, no. 3, pp. 300-305, 1995.