Optimized Scaling Behavior for 8192-point FFT

Chapter 4 Experimental Results

4.2 Optimized Scaling Behavior for 8192-point FFT

Table V presents the scaling behavior for the 8192-point, radix-2 FFT with different wordlengths (WL). The input signals are uniformly distributed in [-1, 1] and use the <1, WL-1> number format. The runtime of the optimization flow ranges from 0.04 seconds to 127.2 seconds for wordlengths of 8 to 16 bits. Since the major computation is the convolution at each stage, the time complexity is O(2^WL*s), where s is the number of stages. The SQNR of our result is much higher than that of Oppenheim’s approach [12]. Fig. 28 shows that, for the same SQNR, our approach needs about 3 bits less wordlength than Oppenheim’s approach.

Therefore, about 48k bits of storage (3*8k*2, counting both real and imaginary parts) can be saved by our approach. Compared to the BFP approach [1], the performance of our method is almost the same, without the high hardware complexity of the dynamic scaling method.
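The convolution mentioned above can be illustrated with a minimal sketch. It assumes that the analysis tracks the probability mass function (PMF) of the fixed-point codes and that the two operands of a butterfly addition are independent, so that the PMF of their sum is the convolution of the operand PMFs. The function names `uniform_pmf`, `convolve_pmf`, and `overflow_probability` are hypothetical and are not taken from the thesis, and the direct double loop is chosen for readability rather than speed.

```python
from fractions import Fraction

def uniform_pmf(wl):
    """PMF over the 2**wl integer codes of a signed wl-bit value, all equally likely."""
    n = 2 ** wl
    lo = -(n // 2)
    return {code: Fraction(1, n) for code in range(lo, lo + n)}

def convolve_pmf(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q (direct convolution)."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0) + pa * qb
    return out

def overflow_probability(pmf, wl):
    """Probability mass that lies outside the signed wl-bit code range."""
    lo, hi = -(2 ** (wl - 1)), 2 ** (wl - 1) - 1
    return sum(p for code, p in pmf.items() if code < lo or code > hi)

wl = 4                        # small wordlength so the example runs instantly
x = uniform_pmf(wl)           # assumed input distribution
s = convolve_pmf(x, x)        # distribution of an unscaled sum of two inputs
p_ovf = overflow_probability(s, wl)
print(f"overflow probability without scaling: {p_ovf}")
```

In this toy setting a quarter of the unscaled sums no longer fit into the original wordlength, which is the kind of quantity a per-stage scaling decision has to weigh against the truncation noise introduced by scaling down.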

Table V Scaling behavior for 8192-point, radix-2 FFT

WL     Integer part of each stage                    Runtime  SQNR        SQNR
(bit)  1  2  3  4  5  6  7  8  9  10  11  12  13     (sec)    (proposed)  [12]
8      2  3  4  4  5  5  6  6  7  7   8   8   9      0.04     18.08 dB    -3.75 dB
9      2  3  4  4  5  5  6  6  7  7   8   8   9      0.06     23.70 dB     2.14 dB
10     2  3  4  5  5  6  6  7  7  8   8   9   9      0.07     27.47 dB     8.11 dB
11     2  3  4  5  5  6  6  7  7  8   8   9   9      0.11     33.47 dB    14.10 dB
12     2  3  4  5  5  6  6  7  7  8   8   9   9      1.60     39.50 dB    20.13 dB
13     2  3  4  5  5  6  6  7  7  8   8   9   9      3.52     45.51 dB    26.15 dB
14     2  3  4  5  5  6  6  7  7  8   8   9   9      18.36    51.50 dB    32.16 dB
15     2  3  4  5  5  6  7  7  8  8   9   9   10     44.44    55.28 dB    38.18 dB
16     2  3  4  5  6  6  7  7  8  8   9   9   10     127.2    60.83 dB    44.18 dB
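The roughly 3-bit advantage can be cross-checked from the two SQNR columns of Table V. The sketch below assumes the usual rule of thumb that one additional bit of wordlength improves the SQNR by 20*log10(2), about 6.02 dB; the name `equivalent_bits` is hypothetical, and the SQNR values are copied from the table.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB per bit (rule-of-thumb assumption)

# Wordlength -> (SQNR proposed, SQNR of [12]) in dB, copied from Table V.
TABLE_V = {
    8: (18.08, -3.75),
    9: (23.70, 2.14),
    10: (27.47, 8.11),
    11: (33.47, 14.10),
    12: (39.50, 20.13),
    13: (45.51, 26.15),
    14: (51.50, 32.16),
    15: (55.28, 38.18),
    16: (60.83, 44.18),
}

def equivalent_bits(gap_db):
    """Convert an SQNR gap in dB into an equivalent number of wordlength bits."""
    return gap_db / DB_PER_BIT

gain_bits = {wl: equivalent_bits(prop - opp) for wl, (prop, opp) in TABLE_V.items()}

for wl, bits in gain_bits.items():
    print(f"WL = {wl:2d}: SQNR gap corresponds to {bits:.2f} bits")
```

With the Table V values the gap stays between roughly 2.8 and 3.6 bits over the whole range, and the 11-bit proposed result (33.47 dB) lies within about 1.3 dB of the 14-bit result of [12] (32.16 dB).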

[Figure: SQNR (dB) versus wordlength (bit), 8 to 16 bits, with curves for Oppenheim [12], BFP [1], and the proposed method; the gap to the proposed curve is annotated as 3 bits.]

Fig. 28 SQNR vs. wordlength for 8192-point, radix-2 FFT

Table VI compares the 11-bit and 14-bit wordlengths for the 8192-point, radix-2 FFT processor. The FFT architecture is memory-based and synthesized with a cycle time of 10 ns, which gives a throughput of 100M samples per second. As shown in Fig. 28, the SQNR of the 11-bit design obtained by our proposed method is almost the same as that of the 14-bit design obtained by Oppenheim’s approach. The area and power reductions excluding storage are 33.65 % and 26.89 %, respectively, and the storage reduction is 21.41 %.

Table VI Hardware Comparison of 11-bit and 14-bit wordlength for 8192-point, radix-2 FFT

8192-point radix-2 FFT (100MS/s)

                          11-bit WL       14-bit WL        Reduction
Area excluding storage    85,771.2 μm²    129,852.7 μm²    33.65 %
Power excluding storage   1.9229 mW       2.6302 mW        26.89 %
Storage                   180k bits       229k bits        21.41 %
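The storage figures can be reproduced with a short calculation. The sketch assumes that the data memory holds one complex word per FFT point, with real and imaginary parts each stored in WL bits; `storage_bits` is a hypothetical helper name.

```python
def storage_bits(n_points, wordlength):
    """Bits needed for n_points complex words (real + imaginary part) of the given wordlength."""
    return n_points * 2 * wordlength

N = 8192
bits_11 = storage_bits(N, 11)   # 180,224 bits, i.e. about 180k
bits_14 = storage_bits(N, 14)   # 229,376 bits, i.e. about 229k
saved = bits_14 - bits_11       # 49,152 bits = 48 * 1024 = 3 * 8192 * 2
reduction = saved / bits_14     # equals 3/14

print(bits_11, bits_14, saved, f"{100 * reduction:.1f} %")
```

Under this assumption the difference is exactly the 48k bits (3*8k*2) quoted earlier, and the relative saving of 3/14 is roughly 21.4 %, in line with the storage row of Table VI.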

Fig. 29 shows the SQNR with different wordlengths for the 8192-point, radix-4 FFT. Our method again gains about 3 bits compared to Oppenheim’s scheme. Note that Ramakrishnan’s approach [13] is not suitable for uniformly distributed inputs, so its error performance is poor under this assumption. The case of the radix-8 FFT is shown in Fig. 30, where about 4 bits of wordlength can be saved.

Fig. 29 SQNR vs. wordlength for 8192-point, radix-4 FFT

Fig. 30 SQNR vs. wordlength for 8192-point, radix-8 FFT

For inputs with a normal distribution, Fig. 31 and Fig. 32 show the experimental results for deviations σ = 0.2 and σ = 0.4. When σ = 0.2, our analyzed results for 10-bit to 15-bit wordlengths are the same as those of Ramakrishnan’s approach, namely an increase of 1 bit at each radix-4 stage. However, Ramakrishnan’s approach is not feasible when σ = 0.4, while our method still performs well. Fig. 33 shows the SQNR for different deviations with a 12-bit wordlength; Ramakrishnan’s approach is feasible only in a narrow range. Except at σ = 0.2, where the performance is the same, our approach is better than Ramakrishnan’s method in all other cases.
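To give a feel for how different the two deviations are, the following sketch computes the fraction of input samples that would fall outside the range [-1, 1]. It assumes zero-mean normal inputs and the same <1, WL-1> input range as in the uniform case; both are assumptions made here for illustration, and the calculation is not the analysis used in the thesis. The name `out_of_range_probability` is hypothetical.

```python
import math

def out_of_range_probability(sigma, limit=1.0):
    """P(|x| >= limit) for x drawn from a zero-mean normal distribution with deviation sigma."""
    return math.erfc(limit / (sigma * math.sqrt(2.0)))

p_02 = out_of_range_probability(0.2)   # limit is 5 sigma away
p_04 = out_of_range_probability(0.4)   # limit is 2.5 sigma away

print(f"sigma = 0.2: {p_02:.2e}")
print(f"sigma = 0.4: {p_04:.2e}")
```

Under these assumptions, roughly 1.2 % of the samples exceed the range for σ = 0.4, against fewer than one in a million for σ = 0.2, so a scaling rule tuned to one deviation does not necessarily carry over to the other.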

Fig. 32 SQNR vs. wordlength for 8192-point, radix-4 FFT with normally distributed inputs (σ= 0.4)

Fig. 33 SQNR with different deviations for 8192-point, radix-4 FFT with normally distributed inputs (WL = 12)

Chapter 5 Conclusion & Future Work

In this thesis, a fast precision optimization approach is proposed for FFT processor design with fixed-wordlength storage; it fixes the scaling behavior at each stage so that the SQNR is optimized. The method is based on probability-based analysis, which utilizes the concept of the derived distribution. It can evaluate the overflow and truncation behavior at each stage in terms of probability and induced noise. The proposed flow can handle different FFT sizes, input distributions, algorithms, and storage wordlengths. The experimental results show that about 3 bits of wordlength can be saved for the 8192-point, radix-2 FFT processor compared to Oppenheim’s approach [12] without any hardware overhead. Furthermore, about 4 bits of wordlength can be saved for the 8192-point, radix-8 FFT. Area and power consumption can be further reduced significantly. The performance also comes close to that of the dynamic scaling method [1]: the SQNR difference is within 2 dB, which corresponds to about 1/3 bit at the same wordlength.
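The conversion from 2 dB to about 1/3 bit follows from the usual rule of thumb, assumed here, that each additional bit of wordlength improves the SQNR by 20 log10(2) dB:

```latex
\Delta\mathrm{SQNR}_{\text{per bit}} = 20\log_{10} 2 \approx 6.02\ \text{dB},
\qquad
\frac{2\ \text{dB}}{6.02\ \text{dB/bit}} \approx 0.33\ \text{bit} \approx \tfrac{1}{3}\ \text{bit}.
```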

In the future, more DSP designs, such as FIR filters, can be analyzed with the concept of the derived distribution to optimize the wordlength. The scaling decision is a good guideline for design automation and optimization.

References

[1] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, "A dynamic scaling FFT processor for DVB-T applications," IEEE J. of Solid-State Circuits, vol. 39, pp. 2005-2013, 2004.

[2] C.-T. Lin, Y.-C. Yu, and L.-D. Van, "A low-power 64-point FFT/IFFT design for IEEE 802.11a WLAN application," in Proc. of IEEE International Symposium on Circuits and Systems, 2006, pp. 4 pp.-4526.

[3] S. Li, H. Xu, W. Fan, Y. Chen, and X. Zeng, "A 128/256-point pipeline FFT/IFFT processor for MIMO OFDM system IEEE 802.16e," in Proc. of IEEE International Symposium on Circuits and Systems, 2010, pp. 1488-1491.

[4] Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, "A 2.4-GSample/s DVFS FFT processor for MIMO OFDM communication systems," IEEE J. of Solid-State Circuits, vol. 43, pp. 1260-1273, 2008.

[5] Y.-W. Lin and C.-Y. Lee, "Design of an FFT/IFFT processor for MIMO OFDM systems," IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 54, pp. 807-815, 2007.

[6] C. M. Chen, C. C. Hung, and Y. H. Huang, "An energy-efficient partial FFT processor for the OFDMA communication system," IEEE Trans. on Circuits and Systems II: Express Briefs, vol. 57, pp. 136-140, Feb. 2010.

[7] S. N. Tang, J. W. Tsai, and T. Y. Chang, "A 2.4-GS/s FFT processor for OFDM-based WPAN applications," IEEE Trans. on Circuits and Systems II: Express Briefs, vol. 57, pp. 1-5, 2010.

[8] E. Grass, K. Tittelbach, U. Jagdhold, A. Troya, G. Lippert, O. Krueger, J. Lehman, K. Maharatna, N. Fiebig, K. Dombrowski, R. Kraemer, and P. Maehoenen, "On the single-chip implementation of a Hiperlan/2 and IEEE 802.11a capable modem," IEEE Personal Communications, vol. 8, pp. 48-57, 2001.

[9] Y. Chen, Y.-C. Tsao, Y.-W. Lin, C.-H. Lin, and C.-Y. Lee, "An indexed-scaling pipelined FFT processor for OFDM-Based WPAN applications," IEEE Trans. on Circuits and Systems II: Express Briefs, vol. 55, pp. 146-150, 2008.

[10] W.-H. Chang and T. Q. Nguyen, "On the fixed-point accuracy analysis of FFT algorithms," IEEE Trans. on Signal Processing, vol. 56, pp. 4673-4682, 2008.

[11] C.-Y. Wang, C.-B. Kuo, and J.-Y. Jou, "Hybrid wordlength optimization methods of pipelined FFT processors," IEEE Trans. on Computers, vol. 56, pp. 1105-1118, 2007.

[12] A. V. Oppenheim and C. J. Weinstein, "Effects of finite register length in digital filtering and the fast Fourier transform," Proc. of IEEE, vol. 60, pp. 957-976, 1972.

[13] S. Ramakrishnan, J. Balakrishnan, and K. Ramasubramanian, "Exploiting signal and noise statistics for fixed point FFT design optimization in OFDM systems," in Proc. of National Communication Conference (NCC), 2010, pp. 1-5.

[14] E. Bidet, D. Castelain, C. Joanblanq, and P. Senn, "A fast single-chip implementation of 8192 complex point FFT," IEEE J. of Solid-State Circuits, vol. 30, pp. 300-305, 1995.

[15] T. Lenart and V. Owall, "A 2048 complex point FFT processor using a novel data scaling approach," in Proc. of IEEE International Symposium on Circuits and Systems, 2003, pp. 45-48.

[16] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Computation, vol. 19, pp. 297-301, 1965.

[17] W.-C. Yeh and C.-W. Jen, "High-speed and low-power split-radix FFT," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 51, pp. 864-874, 2003.

[18] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, "A 1-GS/s FFT/IFFT processor for UWB applications," IEEE J. of Solid-State Circuits, vol. 40, pp. 1726-1735, 2005.

[19] R. C. Agarwal and J. W. Cooley, "Vectorized mixed radix discrete Fourier transform algorithms," Proc. of IEEE, vol. 75, pp. 1283-1292, 1987.

[20] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 1999.

[21] P. Y. Tsai and C. Y. Lin, "A generalized conflict-free memory addressing scheme for continuous-flow parallel-processing FFT processors with rescheduling," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. PP, pp. 1-1, 2010.

[22] D. Reisis and N. Vlassopoulos, "Conflict-free parallel memory accessing techniques for FFT architectures," IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 55, pp. 3438-3447, 2008.

[23] Y.-W. Lin, "The study of FFT processors for OFDM systems," Ph. D. thesis, Dept. of Electronic Engineering, National Chiao Tung University, Hsinchu, R.O.C., 2004.

[24] S. Lee and S.-C. Park, "Modified SDF Architecture for Mixed DIF/DIT FFT," in Proc. of IEEE International Symposium on Circuits and Systems, 2007, pp. 2590-2593.

[25] A. Cortes, I. Velez, and J. F. Sevillano, "Radix 2^k FFTs: matricial representation and SDC/SDF pipeline implementation," IEEE Trans. on Signal Processing, vol. 57, pp. 2824-2839, 2009.

[26] B.-C. Lin, Y.-H. Wang, J.-D. Huang, and J.-Y. Jou, "Expandable MDC-based FFT architecture and its generator for high-performance applications," in Conf. of IEEE International SOC Conference, 2010.

[27] D. P. Bertsekas and J. N. Tsitsiklis, Introduction to Probability. Nashua, NH: Athena Scientific, 2008.

Vita

Ming-En Shih was born in Hsinchu, R.O.C., on February 1, 1987. He received the B.S. degree from the Electronic Engineering Department of National Chiao Tung University in June 2009. He is a graduate student of Prof. Jing-Yang Jou. His research interests include digital hardware design, digital signal processing, and electronic design automation (EDA).
