
Chapter 4 The Proposed Approach

4.3 Restricted Number of Blocks

As described for the convergent block scaling method, dynamic scaling scales the data only when it is necessary to avoid a loss of accuracy. Grouping the data into several blocks improves the SQNR, because each block has its own exponent and can therefore represent data with a different dynamic range. A larger number of blocks is thus expected to give higher precision. However, the convergent block scaling method divides every block into two at each stage, from the first stage to the last. That is, the number of blocks, and with it the area of the exponent storage, doubles at every stage. For an N-point FFT, there are N/2 blocks in the last stage, so N/2 exponent units are required.

As a result, the exponent storage becomes very costly. We therefore define Bmax = 2^(s-1) as the total number of blocks in convergent block scaling, where the number of blocks is doubled only until stage s. Fig. 33 shows the convergent block scaling with different Bmax.
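The block-count rule can be sketched in a few lines of Python. This is an illustration only, assuming that stage 1 holds a single block and that the count doubles at each later stage until it reaches Bmax = 2^(s-1); the function name `blocks_per_stage` is hypothetical.

```python
def blocks_per_stage(n_points, s):
    """Number of blocks at each FFT stage when doubling stops at stage s."""
    stages = n_points.bit_length() - 1  # log2(N) for a power-of-two N
    if not 1 <= s <= stages:
        raise ValueError("s must be between 1 and log2(N)")
    b_max = 2 ** (s - 1)
    # Assumption: one block in stage 1, doubled per stage, capped at Bmax.
    return [min(2 ** (stage - 1), b_max) for stage in range(1, stages + 1)]


print(blocks_per_stage(8, 3))  # unrestricted: [1, 2, 4]
print(blocks_per_stage(8, 2))  # restricted:   [1, 2, 2]
```

With s = log2N the last stage reaches N/2 blocks, which matches the unrestricted case described above.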

Fig. 33 The convergent block scaling with (a) Bmax = 1 (b) Bmax = 2 (c) Bmax = 4

Take as an example the 8192-point FFT with a 16-bit wordlength, implemented with MRCBS using the S2-type detector with 10-bit comparators. Its SQNR and area are shown in Fig. 34. As Bmax grows, the area of the exponent storage keeps increasing while the SQNR saturates. In other words, in the deeper stages, doubling the area of the exponent storage no longer yields the expected SQNR gain. If the blocks are divided only until stage 11, which requires only 1024 exponent units, the area overhead of the exponent storage is only 1/4 of that required when dividing until stage 13, yet the SQNR is just 0.13 dB lower. Thus, by doubling the number of blocks only until a certain stage rather than all the way to the last stage, we can economize on exponent storage and still obtain the SQNR improvement we want.

Although restricting the number of blocks does not give the highest achievable SQNR, it still gives an acceptable SQNR at a reduced area cost.

Fig. 34 The SQNR and area cost with different B_max

Chapter 5 Experimental Results

The proposed MRCBS method generates many hardware solutions for SQNR improvement and finds the one that meets the SQNR constraint with the minimum area cost.

Here we define the performance pair (PP): (SQNR, AREA), which pairs an SQNR performance with its corresponding area cost. Each solution obtained by MRCBS has its own PP, denoted PP^T: (SQNR^T, AREA^T), where SQNR^T is the total SQNR performance and AREA^T is the minimized total area cost.

PP^T is determined by the quintuple (N, WL, Type, BW, Bmax), where N is the given FFT size and WL is the storage wordlength, from 14 to 18 bits. Type indicates the type of detector: Type = Cj denotes the circular-type detectors and Type = Sj the square-type ones, where j = 2, 4, or 6. The Cj detector includes four multipliers of bit width BW, two adders of bit width 2*BW, and 2j comparators of fixed bit width 10, while the Sj detector includes 2j comparators of bit width BW. BW can be chosen from 5 to 10. The total number of blocks Bmax can be 2^(s-1), where s ranges from 1 to log2N.
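The component inventory of the two detector families can be written out as a small lookup. The sketch below only restates the counts given in the text; the function name `detector_components` and the dictionary layout are illustrative assumptions, not part of the design.

```python
def detector_components(det_type, bw):
    """Return {(component, bit_width): count} for a Cj or Sj detector."""
    kind, j = det_type[0], int(det_type[1:])
    if kind not in ("C", "S") or j not in (2, 4, 6):
        raise ValueError("type must be C2, C4, C6, S2, S4, or S6")
    if not 5 <= bw <= 10:
        raise ValueError("BW must be between 5 and 10")
    if kind == "C":
        # Circular type: multipliers and adders scale with BW,
        # comparators keep a fixed bit width of 10.
        return {
            ("multiplier", bw): 4,
            ("adder", 2 * bw): 2,
            ("comparator", 10): 2 * j,
        }
    # Square type: only comparators, all of bit width BW.
    return {("comparator", bw): 2 * j}


print(detector_components("S2", 10))
print(detector_components("C4", 8))
```

The lookup makes the structural difference visible: only the Cj detectors need multipliers and adders, while the Sj detectors consist of comparators alone.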

In this work, we choose the radix-2 FFT for implementation; the FFT size and the SQNR constraint are user defined. Our experimental results cover the FFT sizes N = 1024, 2048, 4096, and 8192, with SQNR constraints ranging from 50 dB to 70 dB. Given the FFT size, we apply the MRCBS method and build tables of PPs by simulation and synthesis. Combining those tables yields many solutions, among which we find the one that meets the SQNR constraint with the minimum area overhead. In addition to our MRCBS scheme, the traditional forced scaling method [7] and the conditional scaling method [14] are implemented and compared with our approach.

The fixed-point FFT model is written in C++, and the SQNR performance is obtained by simulation with random input signals. The circuit is implemented with the TSMC 90 nm cell library and synthesized with Synopsys DesignWare at a 100 MHz clock rate. Both the C++ simulations and the Synopsys DesignWare syntheses are run on an Intel dual Pentium Xeon at 2.53 GHz with 50 GB of main memory.
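The SQNR figure itself can be illustrated with a short Python sketch. It is not the C++ FFT model used in this work: it only quantizes a random complex signal to a fixed-point grid and evaluates the usual ratio of signal power to quantization-error power in dB. The helper names, the number of fractional bits, and the uniform input distribution are all assumptions made for the illustration.

```python
import math
import random


def quantize(x, frac_bits):
    """Round x to a fixed-point grid with the given number of fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def sqnr_db(reference, quantized):
    """SQNR in dB: reference signal power over quantization-error power."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    return 10 * math.log10(signal / noise)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(4096)]


def measure(frac_bits):
    """SQNR of the random test signal after quantizing both components."""
    q = [complex(quantize(s.real, frac_bits), quantize(s.imag, frac_bits))
         for s in samples]
    return sqnr_db(samples, q)


gain_per_bit = measure(15) - measure(14)
print(f"SQNR gain from one extra bit: {gain_per_bit:.2f} dB")
```

For plain rounding noise, each extra bit is expected to add roughly 6 dB of SQNR, which is why trading one bit of wordlength against a small detector or exponent-storage overhead is attractive.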

5.1 The Solution Generated by MRCBS

The MRCBS scheme improves the SQNR in two ways. One is dividing the data into blocks with additional exponent storage; the other is adding the multi-region detector to the basic memory-based FFT design proposed in [7], which is implemented with forced scaling.

Therefore, the total performance is the combination of PP^+ and PP^Base, as shown in (5.1).

The operation of combining two PPs is shown in (5.2).

PP^Base: (SQNR^Base, AREA^Base) is the basic SQNR performance and original area cost obtained by the traditional memory-based FFT. PP^+, on the other hand, is the SQNR improvement and additional area overhead obtained from multi-region detection and convergent block scaling. PP_x^+: (SQNR_x^+, AREA_x^+) is the additional SQNR performance obtained by multi-region detection, with the extra area cost of the detector and predictor.

PP_y^+: (SQNR_y^+, AREA_y^+) is the additional SQNR performance obtained by block scaling, with the extra area cost of the exponent array. We can therefore obtain these three performance pairs separately and combine them to acquire the PP^T values. The simulation results for these PPs are presented in the following subsections.

PP^T PP_x^PP_y^PP^Base (5.1)

1 2 1 2 1 2

PPPP (SQNR SQNR , AREA AREA ) (5.2)

5.1.1 Performance Pair of the Forced Scaling FFT

We define PP^Base: (SQNR^Base, AREA^Base) as the performance pair of the traditional FFT design [7]. The PP^Base values, obtained by SQNR simulation and hardware synthesis, are determined by (N, WL) and are shown in Table 5.

Table 5 The PP^Base determined by (N, WL): (a) SQNR^Base (dB) (b) AREA^Base (µm²)

5.1.2 Improvement from Multi-Region Detection

To see how the detector and predictor affect SQNR and area, we fix the number of blocks Bmax = 1 and the wordlength WL = 16. That is, the PPs are determined by (N, WL = 16, Type, BW, Bmax = 1) through simulation and synthesis. Since we want the improvement in SQNR and area produced by multi-region detection relative to the traditional FFT, denoted PP_x^+, those PPs are offset by PP^Base (N, WL = 16), which can be obtained from Table 5. The SQNR_x^+ values for N = 1024, 2048, 4096, and 8192 are given in Table 6(a), (b), (c), and (d), respectively. Since the area of the detector and predictor is the same for every N, the AREA_x^+ values are shown only once, in Table 7.

Simulations show that SQNR_x^+ saturates once BW exceeds 10, so BW is chosen only from 5 to 10 for the six types of detectors.

Table 6 The SQNR_x^+ (dB) of PP_x^+ for the (a) 1024- (b) 2048- (c) 4096- (d) 8192-point FFT

Table 7 The AREA_x^+ (µm²) of PP_x^+

5.1.3 Improvement from Convergent Block Scaling

To see the relationship between the total number of blocks Bmax and the area and SQNR performance, we fix BW = 10 and WL = 16 and obtain PPs by simulation and synthesis. Those PPs are offset by the Bmax = 1 case to obtain the additional SQNR and area cost produced by the block scaling scheme with shared exponents, denoted PP_y^+. That is, PP_y^+ is obtained from (N, WL = 16, Type, BW = 10, Bmax). Table 8(a), (b), (c), and (d) show the SQNR_y^+ of PP_y^+ for N = 1024, 2048, 4096, and 8192, respectively. Because AREA_y^+ consists of the exponent storage and the control circuits for exponent access, it depends only on Bmax and N. Therefore, AREA_y^+ is shown only once, in the second row of each table.

A larger Bmax implies more exponent storage and hence a larger area. In addition, the control circuit that accesses the exponent units becomes more complicated as N grows, so the AREA_y^+ of the 8192-point FFT is larger than that of the 1024-point FFT for the same Bmax.

Table 8 The PP_y^+ (dB, µm²) for the (a) 1024- (b) 2048- (c) 4096- (d) 8192-point FFT

5.1.4 Performance Pair Combination

To obtain the total area and total SQNR performance PP^T, we combine PP_x^+, PP_y^+, and PP^Base as (5.1) shows. PP^Base can be found in Table 5, PP_x^+ in Table 6 and Table 7, and PP_y^+ in Table 8. Although WL is fixed to 16 for PP_x^+ and PP_y^+, we found that WL has little effect on these results, and we assume that other wordlengths give the same results. Consequently, for a given FFT size N, we combine PP_x^+, PP_y^+, and PP^Base with WL from 14 to 18 to get 5 (WL) * 6 (Type) * 6 (BW) * log2N (Bmax) PP^T values.

Among these PP^T values, some produce the same SQNR^T with different AREA^T. We therefore delete every PP^T that has a larger AREA^T without a higher SQNR^T than another, keeping only the irreplaceable PP^T values. Consequently, in each 6 dB range, we have 40 PP^T values to choose from to satisfy the SQNR constraint.
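This deletion step amounts to keeping the non-dominated (SQNR, area) pairs. A minimal Python sketch follows, assuming a pair is replaceable whenever another pair offers at least the same SQNR at no larger area; the function name and the sample values are illustrative only.

```python
def prune_dominated(pps):
    """Keep only (sqnr, area) pairs that no other pair makes redundant."""
    # Sort by area ascending; for equal area, put the higher SQNR first.
    ordered = sorted(set(pps), key=lambda p: (p[1], -p[0]))
    kept = []
    best_sqnr = float("-inf")
    for sqnr, area in ordered:
        # A pair survives only if it beats every cheaper-or-equal pair on SQNR.
        if sqnr > best_sqnr:
            kept.append((sqnr, area))
            best_sqnr = sqnr
    return kept


# Made-up candidates: (SQNR in dB, area in arbitrary units).
candidates = [(60.0, 100.0), (60.0, 120.0), (58.0, 110.0),
              (62.0, 130.0), (55.0, 90.0)]
print(prune_dominated(candidates))
```

The surviving list is ordered by increasing area and increasing SQNR at the same time, which is the property the later search relies on.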

In addition, our PP^T values include the solutions obtained by the conditional scaling scheme in [14]. Those solutions are the special cases determined by (N, WL, Type = S2, BW = 10, Bmax = 1).

In Fig. 35, Fig. 36, Fig. 37, and Fig. 38, for N = 1024, 2048, 4096, and 8192 respectively, the black dots are the PP^T values obtained by the proposed MRCBS method, the gray diamonds are the solutions obtained by the scheme in [14], and the triangles are the solutions obtained by the scheme in [7], which are the PP^Base values.

Fig. 35 The PP^Ts for 1024-point FFT generated by MRCBS

Fig. 36 The PP^Ts for 2048-point FFT generated by MRCBS

Fig. 37 The PP^Ts for 4096-point FFT generated by MRCBS

Fig. 38 The PP^Ts for 8192-point FFT generated by MRCBS

5.2 Area Minimization under SQNR Constraint

Among the irreplaceable PP^T values for a given FFT size, a higher SQNR^T always comes with a larger AREA^T. We therefore sort the PP^T values by SQNR^T in ascending order and search for the first SQNR^T that satisfies the requirement. The PP^T found this way is the solution with the smallest AREA^T.
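The search can be sketched as follows, assuming the input already contains only irreplaceable pairs so that area grows with SQNR. The function name and the sample values are hypothetical.

```python
def min_area_solution(frontier, sqnr_constraint):
    """Return the first (sqnr, area) pair, in ascending SQNR order,
    that meets the constraint, or None if no pair does."""
    for sqnr, area in sorted(frontier):
        if sqnr >= sqnr_constraint:
            # On a pruned frontier the first feasible pair is the cheapest.
            return (sqnr, area)
    return None


# Made-up frontier: (SQNR in dB, area in arbitrary units).
frontier = [(62.0, 130.0), (55.0, 90.0), (60.0, 100.0)]
print(min_area_solution(frontier, 58.0))
print(min_area_solution(frontier, 70.0))
```

If the list were not pruned first, the first feasible pair in SQNR order would not necessarily have the smallest area, so the pruning step and the search belong together.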

Table 9, Table 10, Table 11, and Table 12 show 8 different SQNR requirements for FFT sizes N = 1024, 2048, 4096, and 8192, respectively. For each constraint, the solution gives the required wordlength, the type of the detector, the bit width in the detector, and the total number of blocks. The exact SQNR is obtained by simulation and is almost equal to the SQNR^T estimated by the MRCBS method. If the previous work has area cost K, the area reduction is (K - AREA^T) / K. Compared with the traditional FFT implemented with forced scaling, our method reduces the area cost by 12.61% for N = 1024 and by 23.57% for N = 8192 in the best case.
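The area-reduction figure is a plain relative difference, which the short sketch below makes explicit. The function names are hypothetical, and the areas passed in are placeholders rather than values from the tables.

```python
def area_reduction(previous_area, total_area):
    """Relative saving (K - AREA^T) / K against a previous design of area K."""
    if previous_area <= 0:
        raise ValueError("previous_area must be positive")
    return (previous_area - total_area) / previous_area


def area_after_reduction(previous_area, reduction):
    """Inverse view: the area left after a given relative reduction."""
    return previous_area * (1 - reduction)


print(area_reduction(200.0, 150.0))          # placeholder areas
print(area_after_reduction(1.0, 0.1261))     # fraction of K left at 12.61%
```

Read this way, a 12.61% reduction means the MRCBS solution occupies about 0.8739 of the reference area K, and a 23.57% reduction about 0.7643 of it.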

Conditional scaling is known to perform better than forced scaling. Even so, in cases where the conditional scaling scheme just meets the constraint, our method can use one bit less of wordlength and so save memory area. If the constraint becomes slightly tighter, so that the conditional scaling scheme has to add one bit to meet it, our method can instead use more blocks or a more precise detector and keep the same wordlength. In that case we save 2 bits of wordlength. The larger the FFT, the larger the share of area taken by 2 bits of memory wordlength: we reduce the area cost by 6.34% for N = 1024 but by 12.84% for N = 8192.

Table 9 The solutions under the SQNR constraints for 1024-point FFT

Table 10 The solutions under the SQNR constraints for 2048-point FFT

Table 11 The solutions under the SQNR constraints for 4096-point FFT

Table 12 The solutions under the SQNR constraints for 8192-point FFT

Chapter 6 Conclusions and Future Works

In this thesis, we propose a scaling scheme for memory-based FFT design that improves SQNR in an area-efficient way. The method takes advantage of both conditional scaling and convergent block scaling. By using different detectors and different numbers of shared exponents, it generates many solutions with different SQNR and area performance, so that an SQNR requirement can be satisfied with an economical increase in area.

The experimental results show that the method saves at least one bit of wordlength, reducing the area by about 5.6% relative to the previous conditional scaling method. If the constraint is slightly tighter, our method can satisfy the required SQNR with a small increase in area rather than the extra bit of wordlength needed by previous approaches. As a result, the proposed scheme saves 2 bits of wordlength, which gives about 13% area reduction relative to the conditional scaling scheme for the 8192-point FFT in the best case.

In the future, the multi-region detection and the convergent block scaling method can be improved to optimize the SQNR and the area of the FFT core for different architectures and different algorithms.

References

[1] C. T. Lin, Y. C. Yu, L. D. Van, “A low-power 64-point FFT/IFFT design for IEEE 802.11a WLAN application,” IEEE International Symposium on Circuits and Systems, pp. 4 pp. -4526, 2006.

[2] R. V. Nee, R. Prasad, OFDM for Multimedia Communications, Artech House, 2000.

[3] ETSI, “Digital Video Broadcasting (DVB); Framing Structure, Channel Coding and Modulation for Digital Terrestrial Television,” ETSI EN 300 744 v1.4.1, 2001.

[4] E. Grass, K. Tittelbach, U. Jagdhold, A. Troya, G. Lippert, O. Kruger, J. Lehmann, K. Maharatna, K. Dombrowski, N. Fiebig, R. Kraemer, P. Mahonen, “On the single-chip implementation of a Hiperlan/2 and IEEE 802.11a capable modem,” IEEE Personal Communications, vol. 8, no. 6, pp. 48-57, 2001.

[5] S. Li, H. Xu, W. Fan, Y. Chen, X. Zeng, “A 128/256-point pipeline FFT/IFFT processor for MIMO OFDM system IEEE 802.16e,” IEEE International Symposium on Circuits and Systems, pp. 1488-1491, 2010.

[6] C. Y. Wang, C. B. Kuo, J. Y. Jou, “Hybrid Wordlength Optimization Methods of Pipelined FFT Processors,” IEEE Transactions on Computers, vol. 56, no. 8, pp. 1105-1118, 2007.

[7] A. V. Oppenheim, C. J. Weinstein, “Effects of finite register length in digital filtering and the fast Fourier transform,” Proceedings of the IEEE, vol. 60, no. 8, pp. 957-976, 1972.

[8] Y. W. Lin, H. Y. Liu, C. Y. Lee, “A dynamic scaling FFT processor for DVB-T applications,” IEEE Journal of Solid-State Circuits, vol. 39, no. 11, pp. 2005-2013, 2004.

[9] S. N. Tang, J. W. Tsai, T. Y. Chang, “A 2.4-GS/s FFT Processor for OFDM-Based WPAN Applications,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, no. 6, pp. 451-455, 2010.

[10] Y. Chen, Y. C. Tsao, Y. W. Lin, C. H. Lin, C. Y. Lee, “An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 55, no. 2, pp. 146-150, 2008.

[11] S. Ramakrishnan, J. Balakrishnan, K. Ramasubramanian, “Exploiting signal and noise statistics for fixed point FFT design optimization in OFDM systems,” National Conference on Communications (NCC), pp. 1-5, 2010.

[12] E. Bidet, D. Castelain, C. Joanblanq, P. Senn, “A fast single-chip implementation of 8192 complex point FFT,” IEEE Journal of Solid-State Circuits, vol. 30, no. 3, pp. 300-305, 1995.

[13] R. R. Shively, “A Digital Processor to Generate Spectra in Real Time,” IEEE Transactions on Computers, vol. C-17, no. 5, pp. 485-491, 1968.

[14] R. Koutsoyannis, P. Milder, C. R. Berger, M. Glick, J. C. Hoe, M. Puschel, “Improving Fixed-point Accuracy of FFT Cores in O-OFDM Systems,” IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.

[15] J. W. Cooley, J. W. Tukey, “An algorithm for machine computation of complex Fourier series,” Math. Computation, vol. 19, pp. 291-301, 1965.

[16] W. C. Yeh, C. W. Jen, “High-speed and low-power split-radix FFT,” IEEE Transactions on Signal Processing, vol. 51, no. 3, pp. 864-874, 2003.

[17] Y. W. Lin, H. Y. Liu, C. Y. Lee, “A 1-GS/s FFT/IFFT processor for UWB applications,” IEEE Journal of Solid-State Circuits, vol. 40, no. 8, pp. 1726-1735, 2005.

[18] R. C. Agarwal, J. W. Cooley, “Vectorized mixed radix discrete Fourier transform algorithms,” Proceedings of the IEEE, vol. 75, no. 9, pp. 1283-1292, 1987.

[19] P. Y. Tsai, C. Y. Lin, “A Generalized Conflict-Free Memory Addressing Scheme for Continuous-Flow Parallel-Processing FFT Processors With Rescheduling,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 19, no. 12, pp. 2290-2302, 2011.

[20] D. Reisis, N. Vlassopoulos, “Conflict-Free Parallel Memory Accessing Techniques for FFT Architectures,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 55, no. 11, pp. 3438-3447, 2008.

[21] Y. W. Lin, “The study of FFT processors for OFDM systems,” Ph. D. thesis, Dept. of Electronic Engineering, National Chiao Tung University, Hsinchu, R.O.C., 2004.

[22] S. Lee, S. C. Park, “Modified SDF Architecture for Mixed DIF/DIT FFT,” IEEE International Symposium on Circuits and Systems, pp. 2590-2593, 2007.

[23] A. Cortes, I. Velez, J. F. Sevillano, “Radix r^k FFTs: Matricial Representation and SDC/SDF Pipeline Implementation,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2824-2839, 2009.

[24] B. C. Lin, Y. H. Wang, J. D. Huang, J. Y. Jou, “Expandable MDC-based FFT architecture and its generator for high-performance applications,” IEEE International SOC Conference (SOCC), pp. 188-192, 2010.

[25] E. Bidet, D. Castelain, C. Joanblanq, P. Senn, “A fast single-chip implementation of 8192 complex point FFT,” IEEE Journal of Solid-State Circuits, vol. 30, no. 3, pp. 300-305, 1995.
