Design of Data Interface - EDA Realization of the New Multi-Standard

Chapter 5 EDA Realization of the New Multi-Standard

5.3 Design of Data Interface

Because a practical FFT processor receives its input data serially, the N input samples have to be temporarily stored in a buffer before the FFT operations start.

Similarly, the FFT output data should be stored in a memory buffer for the subsequent channel equalization or demodulation operations.

Fig. 5.5 shows three popular memory arrangement schemes that handle the input and output data. The first scheme inserts an input RAM buffer that performs serial-to-parallel conversion and an output RAM buffer that preserves the previous FFT results. The second uses three identical memory blocks, where one of them acts as the PE’s data memory and the remaining two act as the input buffer and the output buffer respectively, with the roles alternating. The third scheme [51] reads the input RAM buffer and performs the first-stage FFT before the guard interval has passed. Furthermore, during the final-stage FFT operation, the computational results are written to the output RAM for the subsequent demodulation operations, instead of to the main RAM used for intermediate data reads and writes.

In the structure of Fig. 5.5(a), there is a clock rate difference between the front-end function modules and the FFT processor, because of the mismatch between the input data rate, which is proportional to N, and the total operation count, O(N log_r N). Namely, the intermediate data memory is accessed at the faster PE clock rate, while the input buffer is accessed at the slower front-end system clock rate. Similarly, the output buffer is accessed at the back-end system clock rate. However, when an FFT computation has been completed, we have to directly transfer the N-point output data from the intermediate data memory to the output buffer in a short time and then load the next N-point

rate is a critical issue during the input and output data transfers, and the input (output) buffer has to be driven by another clock that is faster than the front-end (back-end) system clock. This kind of clock difference is not too hard to handle with state-of-the-art VLSI technology, but direct input and output data transfers without memory remapping are inefficient.

[Fig. 5.5: (a) The 1st data interface structure for FFT PE: an input buffer, an intermediate data memory, and an output buffer, each an N-word RAM, with data loaded between the buffers and the data memory. (b) The 2nd data interface structure for FFT PE: three N-word RAMs (RAM_1, RAM_2, RAM_3) switched among input loading, PE access, and output loading. (c) The third data interface structure: an input buffer, an intermediate data memory, and an output buffer, each an N-word RAM, where the PE reads from the input buffer and writes to the output buffer.]

In the interface structure of Fig. 5.5(b), three identical memory blocks take turns serving as the input buffer, the PE’s data memory, and the output buffer. Namely, while one memory block is loading the next N-point input data, another provides the current N-point FFT data processed by the PE, and the third holds the previous FFT result for the back-end function module. When the next symbol period begins, the memory blocks change roles and repeat the process. For instance, the memory block that stored the input data acts as the PE’s data memory in the next period. However, the clock of a memory block is synchronous to the front-end function modules when it works as the input buffer, whereas it must be synchronous to the faster FFT processor when it works as the PE’s data memory. As a result, the memory blocks have to be driven by different clocks. This situation is similar to that of the first interface structure, but without the direct data transfers.
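The role rotation of the three memory blocks can be illustrated with a small behavioral sketch. It is only a model of the schedule described above, not the hardware design: the function name `rotate_roles` and the role labels are hypothetical, and we assume that the block used as the PE’s data memory in one symbol period becomes the output buffer in the next, since the in-place FFT leaves its results in that block.

```python
from collections import deque

# Role labels are illustrative; the block names follow Fig. 5.5(b).
ROLES = ("input buffer", "PE data memory", "output buffer")


def rotate_roles(num_symbols, blocks=("RAM_1", "RAM_2", "RAM_3")):
    """Return one role-to-block mapping per symbol period.

    Assumed rotation: input buffer -> PE data memory -> output buffer
    -> input buffer, applied once at every symbol boundary.
    """
    order = deque(blocks)  # order[i] currently holds ROLES[i]
    schedule = []
    for _ in range(num_symbols):
        schedule.append(dict(zip(ROLES, order)))
        # Shift every block to its next role; the old output buffer
        # wraps around and becomes the new input buffer.
        order.rotate(1)
    return schedule


if __name__ == "__main__":
    for period, mapping in enumerate(rotate_roles(4)):
        print(period, mapping)
```

In this model the assignment repeats every three symbol periods, so each block spends one period in each clock domain.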

In the interface structure of Fig. 5.5(c), the N input data collected in the input buffer are read by the PE, which performs the first-stage FFT operation and writes the results to the PE’s intermediate data memory before the guard interval has passed. Therefore, we do not have to execute copy operations between the data memory and the input buffer. Similarly, the results of the last-stage FFT operation are written to the output buffer instead of the PE’s data memory. However, the proposed CORDIC-based FFT PE needs more operation cycles than a multiplier-based FFT PE. Consequently, in order to complete the required computation within the guard interval, we would have to speed up the clock rate of the CORDIC-based PE, especially for DVB-T and 802.16. Therefore, we do not adopt this structure.

With the interface structure of Fig. 5.5(b), the total number of CORDIC iteration operations required for various OFDM communication systems is shown in Table 5.2. In this table, 802.16 is the most demanding in terms of speed. If

cover all the OFDM communication systems listed in Table 5.2.

Table 5.2 Required operation counts and clock rates of the proposed CORDIC-based PE for various OFDM communication specifications (output precision is 12-bit)

Standard   Mode      Symbol duration (µs)   Total PE operation cycles   Cycle duration (ns)     Clock rate (MHz)
DVB-T      8K mode   924                    68252                       924/68252 = 13.5        73.8
DVB-T      2K mode   231                    14472                       231/14472 = 15.9        62.6
DAB        2048      1246                   14472                       1246/14472 = 86.1       11.6
DAB        1024      623                    5204                        623/5204 = 119.7        8.3
DAB        512       312                    3092                        312/3092 = 101          9.9
DAB        256       156                    1232                        156/1232 = 126.6        7.9
802.16     2048      105.6                  14472                       105.6/14472 = 7.3       137
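The arithmetic behind Table 5.2 can be reproduced with a short script. This is a minimal sketch that assumes, as the table’s own expressions do, that all PE operation cycles of one FFT must fit within one symbol duration; the function and variable names are illustrative and not part of the design.

```python
# (standard and mode, symbol duration in microseconds, total PE operation cycles),
# copied from Table 5.2.
TABLE_5_2 = [
    ("DVB-T 8K", 924.0, 68252),
    ("DVB-T 2K", 231.0, 14472),
    ("DAB 2048", 1246.0, 14472),
    ("DAB 1024", 623.0, 5204),
    ("DAB 512", 312.0, 3092),
    ("DAB 256", 156.0, 1232),
    ("802.16 2048", 105.6, 14472),
]


def cycle_duration_ns(symbol_us, cycles):
    """Longest allowed cycle time in ns if all cycles fit in one symbol."""
    return symbol_us * 1000.0 / cycles


def clock_rate_mhz(symbol_us, cycles):
    """Minimum PE clock rate in MHz (cycles per microsecond)."""
    return cycles / symbol_us


def most_demanding(rows):
    """Name of the row that requires the highest clock rate."""
    return max(rows, key=lambda row: clock_rate_mhz(row[1], row[2]))[0]


if __name__ == "__main__":
    for name, symbol_us, cycles in TABLE_5_2:
        print(f"{name:12s} {cycle_duration_ns(symbol_us, cycles):7.1f} ns "
              f"{clock_rate_mhz(symbol_us, cycles):6.1f} MHz")
    print("Most demanding:", most_demanding(TABLE_5_2))
```

The computed rates agree with the table to within its rounding, and the 802.16 entry gives the highest required clock rate, which is why it sets the target for the PE.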

Chapter 6 Conclusion

In this thesis, we propose an in-place, memory-based, variable-length FFT processor architecture that is suited to multi-mode and multi-standard OFDM systems, including 802.16a, DAB, and DVB-T. The design features a variable-length data address generator that replaces the original area-consuming barrel-shifter-based designs with a few simpler multiplexer-based addressing functions. Furthermore, we propose an efficient twiddle factor generator, which has the merits of low area complexity and high speed. Analysis and simulations show that it compares favorably with existing twiddle factor generators for practical FFT operations. The proposed design is mainly suitable for situations where FFT lengths are long and adjustable, as required by the multi-mode and multi-standard operations defined in the mentioned systems. Finally, we propose a new CORDIC algorithm that reduces the iteration number significantly. This is achieved by combining several design techniques, including an efficient high-radix rotation scheme, angle encoding, leading-one bit detection, and on-line variable factor compensation. Since the biggest advantage of a CORDIC-based FFT is that the twiddle factor generator can be eliminated, we replace the conventional complex multiplier and look-up table approach with CORDIC-based butterfly rotation operations.

The FFT core is currently under EDA realization and will eventually be implemented in silicon. In the future, we will focus on its integration into OFDM baseband systems.

Bibliography

[1] J. W. Cooley and J. W. Tukey, “An algorithm for machine computation of complex Fourier series,” Math. Computation, Vol. 19, pp. 297-301, Apr. 1965.

[2] Shousheng He and Mats Torkelson, “A new approach to pipeline FFT processor,” Parallel Processing Symposium, The 10th International, pp. 766-770, 1996.

[3] Shousheng He and Mats Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” URSI International Symposium on Signals, Systems and Electronics, pp. 257-262, 1998.

[4] Shousheng He and Mats Torkelson, “Design and implementation of a 1024-point FFT processor,” in Proc. IEEE Custom Integrated Circuit Conference, pp. 131-134, 1998.

[5] E. H. Wold and A. M. Despain, “Pipeline and parallel-pipeline FFT processors for VLSI implementation,” IEEE Transactions on Computers, Vol. 33, No. 5, pp. 414-426, May 1984.

[6] L. G. Johnson, “Conflict free memory addressing for dedicated FFT hardware,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 39, No. 5, pp. 312-316, May 1992.

[7] Hsin-Fu Lo, Ming-Der Shieh, and Chien-Ming Wu, “Design of an efficient FFT processor for DAB system,” IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 654-657, 2001.

[8] Yutai Ma, “An effective memory addressing scheme for FFT processors,” IEEE Transactions on Signal Processing, Vol. 47 Issue: 3, pp. 907-911, Mar. 1999.

[9] Yutai Ma and Lars Wanhammar, “A hardware efficient control of memory addressing for high-performance FFT processors,” IEEE Transactions on Signal Processing, Vol. 48 Issue: 3, pp. 917-921, Mar. 2000.

[10] C. H. Chang, C. L. Wang and Y. T. Chang, “Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse,” IEEE Transactions on Signal Processing, Vol. 48 Issue: 11, pp. 3206-3216, Nov. 2000.

[11] C. L. Wang and C. H. Chang, “A new memory-based FFT processor for VDSL transceivers,” IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 670-673, 2001.

[12] A. M. Despain, “Fast Fourier transform using CORDIC iterations,” IEEE Trans. Comput., Vol. C-23, No. 10, pp. 993-1001, Oct. 1974.

[13] G. Bi and E. V. Jones, “A pipelined FFT processor for word sequential data,” IEEE Trans. Acoust., Speech, Signal Processing, Vol. 37, No. 12, pp. 1982-1985, Dec. 1989.

[14] L. R. Rabiner and B. Gold, Theory and application of digital signal processing, Prentice-Hall Inc., 1975.

[15] B. S. Kim and L. S. Kim, “Low power pipelined FFT architecture for synthetic aperture radar signal processing,” in Proc. IEEE Midwest Symposium on Circuits and Systems, Vol.3, pp. 1367-1370, 1996.

[16] M. M. Jamali, S. C. Kwatra and D. H. Shetty, “Module generation based VLSI implementation of a demultiplexer for satellite communications,” in Proc. IEEE International Symposium on Circuits and Systems, Vol.4, pp. 364-367, 1996.

[17] A. Delaruelle, J. Huisken, J. van Loon, and F. Welten, “A channel demodulator IC for digital audio broadcasting,” in Proc. IEEE Custom Integrated Circuits Conference, pp. 47-50, 1994.

[18] D. Cohen, “Simplified control of FFT hardware,” IEEE Trans. Acoust., Speech

[19] C.K. Chang, “Investigation and design of FFT core for OFDM communication systems,” NCTU, Master Thesis, Jun. 2002.

[20] C.P. Hung, “Design of variable-length FFT processor,” NCTU, Master Thesis, Jun. 2003.

[21] J.C. Chi and S.G. Chen, “An efficient FFT twiddle factor generator,” 12th EUSIPCO, Sep. 2004.

[22] L. Fanucci, R. Roncella, and R. Saletti, “A sine wave digital synthesizer based on a quadratic approximation,” Proceedings of IEEE Frequency Control Symposium and PDA Exhibition, pp. 806-810, Jun. 2001.

[23] A.M. Sodagar and G. Roientan, “A novel architecture for ROM-less sine-output direct digital frequency synthesizers by using the 2nd-order parabolic approximation,” Proceedings of IEEE Frequency Control Symposium and Exhibition, pp. 284-289, Jun. 2002.

[24] A. Bellaouar, M. Obrecht, A. Fahim, and M.I. Elmasry, “A low-power direct digital frequency synthesizer architecture for wireless communications,” Proceedings of IEEE Custom Integrated Circuit, pp. 593-596, May 1999.

[25] F. Curticapean and J. Niittylahti, “Low-power direct digital frequency synthesizer,” Proceedings of IEEE Circuit and System, vol: 2, pp. 8-11, Aug. 2000.

[26] A.M. Eltawil and B. Daneshrad, “Piece-wise parabolic interpolation for direct digital frequency synthesis,” Proceedings of IEEE Custom Integrated Circuits, pp. 401-404, May 2002.

[27] L. Xiu and Z. You, “A new frequency synthesis method based on flying-adder architecture,” Trans. on IEEE Circuits and Systems, vol: 50, pp. 130-134, Mar. 2003.

[28] A.M. Eltawil and B. Daneshrad, “Interpolation based direct digital frequency synthesis for wireless communications,” Proceedings of IEEE WCNC, vol: 1, pp. 73-77, Mar. 2002.

[29] N.J. Fliege and J. Wintermantel, “Complex digital oscillator and FSK modulators,” Trans. on IEEE Signal Processing, vol: 40, pp. 333-342, Feb. 1992.

[30] M.M. Al-Ibrahim, “A simple recursive digital sinusoidal oscillator with uniform frequency spacing,” Proceedings of IEEE Circuits and Systems, pp. 689-692, Mar. 2001.

[31] A.V. Oppenheim, R.W. Schafer and J.R. Buck, Discrete-time signal processing, 2nd Ed., Prentice Hall, 1999.

[32] J.E. Volder, “The CORDIC trigonometric computing technique,” IRE Trans. Electronic Comput., Vol. EC-8, pp. 330-334, 1959.

[33] J.S. Walther, “A unified algorithm for elementary functions,” AFIPS Spring Joint Comput. Conf., pp. 379-385, 1971.

[34] M.D. Ercegovac and T. Lang, “Redundant and on-line CORDIC: application to matrix triangularization and SVD,” IEEE Trans. on Computers, Vol. 39, No. 6, pp. 725-740, Jun. 1990.

[35] N. Takagi, T. Asada, and S. Yajima, “Redundant CORDIC method with constant scale factor for sine and cosine computation,” IEEE Trans. on Computers, Vol. 40, No. 9, pp. 989-995, 1991.

[36] D. Timmermann, H. Hahn, and B.J. Hosticka, “Low latency time CORDIC algorithms,” IEEE Trans. on Computers, Vol. 41, No. 8, pp. 1010-1015, 1992.

[37] E. Antelo, J. Villalba, J. D. Bruguera, and E. L. Zapata, “High performance rotation architectures based on the radix-4 CORDIC algorithm,” IEEE Trans. on Computers, Vol. 46, No. 3, pp. 855-870, Aug. 1997.

[38] P.R. Rao and I. Chakrabarti, “High-performance compensation technique for the radix-4 CORDIC algorithm,” Proceedings of IEEE International Symposium on

[39] E. Antelo, T. Lang, and J.D. Bruguera, “Very-high radix CORDIC rotation based on selection by rounding,” Journal of VLSI Signal Processing, Vol. 25, pp. 141-153, 2000.

[40] E. Antelo, T. Lang, and J.D. Bruguera, “Very-high radix circular CORDIC: vectoring and unified rotation/vectoring,” IEEE Trans. on Computers, Vol. 49, No. 7, pp. 727-739, Jul. 2000.

[41] S.F. Hsiao and C.Y. Lau, “Design of a unified arithmetic processor based on redundant constant-factor CORDIC with merged scaling operation,” Proceedings of IEEE International Symposium on Circuits and Systems, pp. 137-140, 2000.

[42] H. Dawid, and H. Meyr, “The differential CORDIC algorithm: constant scale factor redundant implementation without correcting iterations,” IEEE Trans. on Computers, Vol. 45, No. 3, pp. 307-318, Mar. 1996.

[43] Y.H. Hu and S. Naganathan, “An angle recoding method for CORDIC algorithm implementation,” IEEE Trans. on Computers, Vol. 42, No. 1, pp. 99-102, Jan. 1993.

[44] C.S. Wu, A.Y. Wu, and C.H. Lin, “A high-performance/low-latency vector rotational CORDIC architecture based on extended elementary angle set and trellis-based searching schemes,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 50, No. 9, pp. 589-601, Sep. 2003.

[45] C.C. Li and S.G. Chen, “New redundant CORDIC algorithm with fast variable scale factor compensations,” Proceedings of IEEE International Symposium on Circuits and Systems, pp. 264-267, May 1996.

[46] C.C. Li and S.G. Chen, “A radix-4 redundant CORDIC algorithm with fast on-line variable scale factor compensation,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 639-642, 1997.

[47] S.G. Chen and C.F. Lin, “A CORDIC algorithm with fast rotation prediction and small iteration number,” Proceedings of IEEE International Symposium on Circuits and Systems, pp. 229-232, 1998.

[48] J.C. Chin and S.G. Chen, “Fast CORDIC algorithm based on a new recoding scheme for rotation angles and variable scale factors,” Journal of VLSI Signal Processing, Vol. 8, pp. 56-61, 2002.

[49] T.C. Chen, “Automatic computation of exponentials, logarithms, ratios and square roots,” IBM Journal Res. And Dev., Vol. 16, pp. 380-388, Jul. 1972.

[50] H. Dawid, and H. Meyr, “The differential CORDIC algorithm: constant scale factor redundant implementation without correcting iterations,” IEEE Trans. on Computers, Vol. 45, No. 3, pp. 307-318, Mar. 1996.

[51] J.A. Huisken, M.J.G. Bekooij, G.C.M. Gielis, P.W.F. Gruijters, and F.P.J. Welten, “A power-efficient single-chip OFDM demodulator and channel decoder for multimedia broadcasting,” ISSCC 1998 IEEE International, pp. 40-41, Feb. 1998.
