
Chapter 4 Processing Elements of FFT Processor

4.3 Comparison of FFT Processing Elements

Table 4.11 compares the multiplier-based PE with the CORDIC-based PE.

The 8192-point radix-2² FFT PE, with 12-bit accuracy, is synthesized with the UMC 0.18 μm standard cell library using Synopsys Design Analyzer. The multiplier-based PE includes three 1024-word twiddle factor ROM tables.

The proposed CORDIC-based PE performs the front add/sub of a butterfly operation in the first cycle, and then executes the rotation operations that carry out the butterfly complex multiplications. On average, a butterfly computation takes about 4.76 cycles for an 8192-point FFT.

Table 4.11 Comparison of the multiplier-based PE and CORDIC-based PE

| | Proposed CORDIC-based PE (word serial architecture) | Multiplier-based PE |
|---|---|---|
| Gate counts | 5163 | 34591 (single complex multiplier: 5746) |
| Path delay | 2.15 ns | 9.76 ns |
| Required operation cycles per butterfly computation | 4.76 (averaged) | 1 |
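The figures in Table 4.11 can be combined into a rough area and time comparison. The sketch below is only illustrative: it assumes that each PE is clocked at exactly its reported path delay and ignores memory access and control overhead, neither of which the table states. The dictionary and function names are hypothetical.

```python
# Values copied from Table 4.11. Treating the path delay as the clock
# period is an assumption made for this comparison only.
cordic = {"gates": 5163, "delay_ns": 2.15, "cycles": 4.76}
multiplier = {"gates": 34591, "delay_ns": 9.76, "cycles": 1.0}


def butterfly_time_ns(pe):
    """Time per butterfly = cycles per butterfly x assumed clock period."""
    return pe["delay_ns"] * pe["cycles"]


# How many times larger the multiplier-based PE is, in gate count.
area_ratio = multiplier["gates"] / cordic["gates"]
# How much longer the CORDIC-based PE takes per butterfly.
time_ratio = butterfly_time_ns(cordic) / butterfly_time_ns(multiplier)

print(f"area ratio (multiplier / CORDIC): {area_ratio:.2f}")
print(f"time ratio (CORDIC / multiplier): {time_ratio:.2f}")
```

Under these assumptions the CORDIC-based PE is several times smaller while its time per butterfly stays close to that of the multiplier-based PE, because the shorter path delay offsets most of the extra cycles.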

Chapter 5 EDA Realization of the New Multi-Standard CORDIC-Based FFT Processor

5.1 Design Overview

The proposed design is an in-place memory-based FFT processor. The processor needs a four-bank memory that matches the in-place memory address generator for high-bandwidth data access. To meet the specifications of 802.16, DAB, and DVB-T, we employ a variable-length data address generator that covers five different FFT lengths: 256, 512, 1024, 2048, and 8192 points. Correspondingly, the processing element is based on the radix-2² DIF FFT algorithm and also supports non-power-of-4 FFT computation, as discussed in Chapter 3. Since we replace the conventional complex multipliers of the PE with a CORDIC processor, the ROM table that stores twiddle factors can be eliminated. The block diagram of our design is shown in Fig. 5.1.

[Figure: SRAM banks 0 to 3, read commutator, CORDIC-based PE, write commutator, rotation angle generator, data address generator]

Fig. 5.1 Block diagram of the proposed FFT processor

5.2 Components of FFT Processor

5.2.1 The Data Memory

The memory block of our FFT processor design is a 4-bank synchronous SRAM. Each SRAM bank has 2048 words of 24 bits and is generated by the Artisan™ UMC™ 0.18 µm SRAM generator. Each data word holds 12 bits for the real part and 12 bits for the imaginary part of the FFT data, 24 bits in total.

The details of the memory partition scheme and the address generation method are presented in Chapter 3. The data address within each memory bank is obtained by shifting the one-dimensional data address right by two bits, which is easy to implement. The bank index is obtained by performing a summation and a modulo-4 operation on the one-dimensional data address, as mentioned in Section 3.1.
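The address mapping described above can be sketched as follows. Section 3.1 is not reproduced here, so the reading of "summation" as the sum of the base-4 digits of the address is an assumption, and the function name `bank_and_row` is hypothetical. The sketch only illustrates that such a mapping places every address in a unique (bank, row) location.

```python
def bank_and_row(addr):
    """Map a one-dimensional data address to (bank index, address in bank).

    Assumption: the bank index is the sum of the base-4 digits of the
    address, taken modulo 4. The in-bank address is the address shifted
    right by two bits, as stated in the text.
    """
    row = addr >> 2
    digit_sum = 0
    remaining = addr
    while remaining:
        digit_sum += remaining & 3  # lowest base-4 digit
        remaining >>= 2
    return digit_sum % 4, row


# For an 8192-point FFT, every address should land in a distinct location,
# with 2048 words in each of the four banks.
locations = [bank_and_row(a) for a in range(8192)]
words_per_bank = [sum(1 for bank, _ in locations if bank == b) for b in range(4)]
print(len(set(locations)), words_per_bank)
```

With this mapping, four addresses that differ in a single base-4 digit always fall in four different banks, which is the property a four-input butterfly needs to fetch its operands in parallel.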

For a general FFT processor with a multiplier-based PE, the data required by the butterfly unit need to be read from and written to main memory simultaneously in order to avoid stalls and increase throughput. Since continuous read or write operation is not allowed in the SRAM design, this is a serious problem when continuous memory access is assumed and preferred. One solution is to use dual-port SRAM. The disadvantage of dual-port memory is that it has a larger area than single-port memory, because it contains two read/write ports, two sense amplifiers, and two address generators. Power consumption is also a problem: according to Table 5.1, the power consumption per MHz of the dual-port memory is larger than that of a single-port memory of the same size.

Table 5.1 Power consumption of SRAM at 0.18 µm process

| SRAM size = 2048 × 24 | Power (mW/MHz) |
|---|---|
| Single port | 0.21 |
| Dual port | 0.49 |

In our design with the CORDIC-based PE, the butterfly unit needs at least two cycles to execute the required operation. Since continuous read or write operation is avoided, single-port memory, which has smaller area and lower power consumption, can be adopted.

5.2.2 The Processing Element

We replace the complex multipliers of the PE shown in Fig. 3.6 with the CORDIC processor shown in Fig. 4.8. The resulting CORDIC-based PE structure is shown in Fig. 5.2. When the control signals of MUX_1 are set to 0, the PE executes a radix-2² FFT butterfly. When the control signals are set to 1, the PE instead processes two radix-2 butterflies simultaneously. The rotation angles required by the variable-length CORDIC-based FFT processor can be generated by hardware similar to the coefficient address generator mentioned in Section 3.1.

[Figure: Data in 0 to 3, add/sub stages with a -j multiplication, MUX_1 (radix-2²/2 select), three CORDIC processors (Fig. 4.8) each with MUX_2 and a register, driven by rotation angles 1 to 3, Data out 0 to 3]

Fig. 5.2 The CORDIC-based PE structure
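The two operating modes selected by MUX_1 can be illustrated with a small behavioral model. This is a sketch, not the circuit of Fig. 5.2: the twiddle-factor rotations are omitted, and both the pairing of inputs in radix-2 mode and the output ordering are assumptions. The names `pe_butterfly` and `dft4` are hypothetical.

```python
import cmath


def pe_butterfly(x, radix2_mode=False):
    """Behavioral model of the PE add/sub network (rotations omitted).

    radix2_mode=False: one radix-2^2 butterfly on four inputs.
    radix2_mode=True : two independent radix-2 butterflies, assumed to
                       pair (x0, x2) and (x1, x3).
    """
    x0, x1, x2, x3 = x
    a0, a1 = x0 + x2, x1 + x3  # first add/sub stage, sums
    a2, a3 = x0 - x2, x1 - x3  # first add/sub stage, differences
    if radix2_mode:
        return [a0, a1, a2, a3]
    a3 = -1j * a3              # trivial -j multiplication
    return [a0 + a1, a0 - a1, a2 + a3, a2 - a3]


def dft4(x):
    """Direct 4-point DFT, used only as a reference."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / 4) for n in range(4))
            for k in range(4)]


x = [1 + 2j, -3 + 0.5j, 2 - 1j, 0.25 + 4j]
y = pe_butterfly(x)
ref = dft4(x)
# In this model the radix-2^2 outputs appear in the order X0, X2, X1, X3.
print([abs(y[i] - ref[k]) < 1e-9 for i, k in enumerate([0, 2, 1, 3])])
```

In radix-2² mode the model reproduces a 4-point DFT up to output ordering, which is why only a -j multiplication is needed inside the butterfly and the nontrivial rotations can be left to the CORDIC processors.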

5.2.3 Controller

By combining the trivial (√2/2)(±1 ± j) multiplications and the front add/sub of a butterfly operation with the basic CORDIC rotation operation (for the butterfly complex multiplications), we can design a flexible CORDIC processor that executes all three sub-operations of a butterfly operation. The operation flow chart and the timing diagram of the CORDIC-based FFT processor are shown in Fig. 5.3 and Fig. 5.4, respectively.

Fig. 5.3 The flow chart of the butterfly operations with the proposed CORDIC-based PE

[Figure: timing rows for the data address generator, angle value generator, SRAM address (read/write), write_enable, and write back flag, together with the PE schedule (butterfly, angle decomposition, CORDIC iterations 1 to 3, residue angle = 0) for Data 0 to Data 4]

Fig. 5.4 Timing diagram of CORDIC-based FFT processor
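For readers unfamiliar with the basic rotation that the controller sequences, the following sketch shows a conventional radix-2 CORDIC rotation in floating point. It is not the reduced-iteration algorithm proposed in this thesis: it uses a fixed number of iterations and a constant scale factor, and it converges only for rotation angles of roughly |θ| ≤ 1.74 rad. The function name and the iteration count are assumptions for illustration.

```python
import math


def cordic_rotate(x, y, theta, iterations=16):
    """Rotate the vector (x, y) by theta radians with conventional CORDIC.

    Each iteration applies a shift-and-add micro-rotation by
    +/- atan(2**-i); the accumulated gain is removed at the end.
    """
    z = theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotation direction
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)        # residue angle
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x / gain, y / gain


# Multiplying by a twiddle factor W_N^k = exp(-j*2*pi*k/N) is a rotation
# by -2*pi*k/N; here N = 8192 and k = 100.
angle = -2.0 * math.pi * 100 / 8192
re, im = cordic_rotate(1.0, 0.0, angle)
print(re, im, math.cos(angle), math.sin(angle))
```

Because the rotation angle is supplied directly, no stored sine or cosine value is needed, which is the property that allows the twiddle factor ROM to be removed.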

5.3 Design of Data Interface

Since a practical FFT processor receives its input data serially, the N input samples have to be stored temporarily in a buffer before the FFT operations start. Similarly, the FFT output data should be held in a memory buffer for the subsequent channel equalization or demodulation operations.

Fig. 5.5 shows three popular memory arrangement schemes that handle the input and output data. The first scheme inserts an input RAM buffer that performs serial-to-parallel conversion and an output RAM buffer that preserves the previous FFT results. The second uses three identical memory blocks, where one acts as the PE's data memory and the remaining two act as the input buffer and the output buffer, respectively, with the roles alternating. The third scheme [51] reads the input RAM buffer and performs the first-stage FFT before the guard interval has passed. Furthermore, during the final-stage FFT operation, the computational results are written to the output RAM for the subsequent demodulation operations, instead of to the main RAM used for intermediate data reads and writes.

In the structure of Fig. 5.5(a), there is a clock rate difference between the front-end function modules and the FFT processor, because of the mismatch between the input data rate N and the total operation count O(N log_r N). Namely, the intermediate data memory is accessed at the faster PE clock rate, while the input buffer is accessed at the slower front-end system clock rate. Similarly, the output buffer is accessed at another, back-end system clock rate. However, when an FFT computation has been completed, we have to transfer the N-point output data directly from the intermediate data memory to the output buffer in a short time and then load the next N-point [...] rate is a critical issue during the input and output data transfers, and the input (output) buffer has to be driven by another clock rate that is faster than the front-end (back-end) system clock rate. This kind of clock difference is not too hard to handle with state-of-the-art VLSI technology, but direct input and output data transfers without memory remapping are inefficient.

[Figure: input buffer, intermediate data memory, and output buffer, each an N-word RAM; input data enter the input buffer, data are loaded between the buffers and the intermediate data memory, and output data leave the output buffer]

(a) The 1st data interface structure for FFT PE

[Figure: RAM_1, RAM_2, and RAM_3, each an N-word RAM, connected through switches to input data load, PE access, and output data load]

(b) The 2nd data interface structure for FFT PE

[Figure: input buffer, intermediate data memory, and output buffer, each an N-word RAM, with PE access, read, and write paths between input data and output data]

In the interface structure of Fig. 5.5(b), three identical memory blocks take turns serving as the input buffer, the PE's data memory, and the output buffer. Namely, while one memory block is loading the next N-point input data, another provides the current N-point FFT data processed by the PE, and the third holds the previous FFT result for the back-end function module. When the next symbol period begins, the memory blocks change roles and the process repeats. For instance, the memory block that stores the input data acts as the PE's data memory in the next period. However, the clock of a memory block is synchronous to the front-end function modules when it works as the input buffer, while it must be synchronous to the faster FFT processor when it works as the PE's data memory. As a result, the memory blocks have to be driven by different clocking systems. The situation is similar to the first interface structure, but without the direct data transfers.
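The role rotation among the three memory blocks can be sketched as a simple state update per symbol period. The text states that the input buffer becomes the PE's data memory; the remaining two transitions (PE data memory to output buffer, output buffer to input buffer) are inferred here as an assumption, and the names `ROLES` and `rotate_roles` are hypothetical.

```python
# Order in which a memory block cycles through its roles (assumed).
ROLES = ("input buffer", "PE data memory", "output buffer")


def rotate_roles(assignment):
    """Return the role assignment for the next symbol period.

    assignment maps a RAM name to its current role; each RAM moves on
    to the next role in ROLES, wrapping around at the end.
    """
    return {ram: ROLES[(ROLES.index(role) + 1) % len(ROLES)]
            for ram, role in assignment.items()}


start = {"RAM_1": "input buffer",
         "RAM_2": "PE data memory",
         "RAM_3": "output buffer"}

state = start
for period in range(3):
    print(period, state)
    state = rotate_roles(state)
```

No data are copied between blocks in this scheme; only the role of each block, and therefore the clock domain that drives it, changes from one symbol period to the next.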

In the interface structure of Fig. 5.5(c), the N input data collected in the input buffer are read by the PE to perform the first-stage FFT operation and then written back to the PE's intermediate data memory before the guard interval has passed. Therefore, we do not have to execute copy operations between the data memory and the input buffer. Similarly, the results of the last-stage FFT operation are written to the output buffer instead of the PE's data memory. However, the proposed CORDIC-based FFT PE needs more operation cycles than the multiplier-based FFT PE. Consequently, in order to complete the required computation within the guard interval, we would have to speed up the operation clock rate of the CORDIC-based PE, especially for DVB-T and 802.16. Therefore, we do not adopt this structure.

By employing the interface structure of Fig. 5.5(b), the total required number of CORDIC iteration operations for various OFDM communication systems is shown in Table 5.2. In this table, 802.16 is the most demanding in terms of speed. If [...] cover all the OFDM communication systems listed in Table 5.2.

Table 5.2 The required operation counts and clock rates of the proposed CORDIC-based PE for various OFDM communication specifications (output precision is 12-bit)

| Standards | Symbol duration | Total PE operation cycles | Cycle duration (ns) | Clock rate (MHz) |
|---|---|---|---|---|
| DVB-T | 8K mode (924 µs) | 68252 | 924/68252 = 13.5 | 73.8 |
| DVB-T | 2K mode (231 µs) | 14472 | 231/14472 = 15.9 | 62.6 |
| DAB | 2048 (1246 µs) | 14472 | 1246/14472 = 86.1 | 11.6 |
| DAB | 1024 (623 µs) | 5204 | 623/5204 = 119.7 | 8.3 |
| DAB | 512 (312 µs) | 3092 | 312/3092 = 101 | 9.9 |
| DAB | 256 (156 µs) | 1232 | 156/1232 = 126.6 | 7.9 |
| 802.16 | 2048 (105.6 µs) | 14472 | 105.6/14472 = 7.3 | 137 |
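The cycle durations and clock rates in Table 5.2 follow from dividing each symbol duration by the total number of PE operation cycles. The sketch below repeats that arithmetic, assuming, as the table's formula does, that all PE cycles of one FFT must fit within one symbol duration. The names `ROWS` and `required_clock` are hypothetical; the numbers are copied from the table.

```python
# (label, symbol duration in microseconds, total PE operation cycles)
ROWS = [
    ("DVB-T 8K mode", 924.0, 68252),
    ("DVB-T 2K mode", 231.0, 14472),
    ("DAB 2048", 1246.0, 14472),
    ("DAB 1024", 623.0, 5204),
    ("DAB 512", 312.0, 3092),
    ("DAB 256", 156.0, 1232),
    ("802.16 2048", 105.6, 14472),
]


def required_clock(symbol_us, cycles):
    """Return (cycle duration in ns, clock rate in MHz)."""
    cycle_ns = symbol_us * 1000.0 / cycles   # microseconds to nanoseconds
    clock_mhz = 1000.0 / cycle_ns
    return cycle_ns, clock_mhz


for name, symbol_us, cycles in ROWS:
    cycle_ns, clock_mhz = required_clock(symbol_us, cycles)
    print(f"{name:14s} {cycle_ns:7.1f} ns  {clock_mhz:6.1f} MHz")

# The standard that needs the highest clock rate.
most_demanding = max(ROWS, key=lambda r: required_clock(r[1], r[2])[1])[0]
print(most_demanding)
```

The computed values can differ from the table in the last digit because the table appears to truncate rather than round some entries.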

Chapter 6 Conclusion

In this thesis, we propose an in-place memory-based variable-length FFT processor architecture, which is suited for multi-mode and multi-standard OFDM systems, including 802.16a, DAB, and DVB-T. The design features a variable-length data address generator that replaces the original area-consuming barrel-shifter-based designs with a few simpler multiplexer-based addressing functions. Furthermore, we propose an efficient twiddle factor generator, which has the merits of low area complexity and high speed. Analysis and simulations show that it compares favorably with the existing twiddle factor generators for practical FFT operations. The proposed design is mainly suitable for situations where FFT lengths are long and adjustable, as required by the multi-mode and multi-standard operations defined in the mentioned systems. Finally, we propose a new CORDIC algorithm that reduces the iteration number significantly. This is achieved by combining several design techniques, including an efficient high-radix rotation scheme, angle encoding, leading-one bit detection, and on-line variable factor compensation. Since the biggest advantage of a CORDIC-based FFT is that the twiddle factor generator can be eliminated, we replace the conventional complex multiplier and look-up table approach with CORDIC-based butterfly rotation operations.

The FFT core is currently under EDA realization and will eventually be implemented in silicon. In future work, we will focus on its integration into OFDM baseband systems.

Bibliography

[1] J. W. Cooley and J. W. Tukey, “An algorithm for machine computation of complex fourier series,” Math. Computation, Vol. 19, pp. 297-301, Apr. 1965.

[2] Shousheng He and Mats Torkelson, “A new approach to pipeline FFT processor,” Parallel Processing Symposium, The 10th International, pp. 766-770, 1996.

[3] Shousheng He and Mats Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” URSI International Symposium on Signals, Systems and Electronics, pp. 257-262, 1998.

[4] Shousheng He and Mats Torkelson, “Design and implementation of a 1024-point FFT processor,” in Proc. IEEE Custom Integrated Circuit Conference, pp. 131-134, 1998.

[5] E. H. Wold and A. M. Despain, “Pipeline and parallel-pipeline FFT processors for VLSI implementation,” IEEE Transactions on Computers, Vol. 33 No. 5, pp. 414-426, May 1984.

[6] L. G. Johnson, “Conflict free memory addressing for dedicated FFT hardware,” IEEE Transactions on Circuit and System-II: Analog and Digital Signal Processing, Vol. 39 No. 5, pp. 312-316, May 1992.

[7] Hsin-Fu Lo, Ming-Der Shieh, and Chien-Ming Wu, “Design of an efficient FFT processor for DAB system,” IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 654-657, 2001.

[8] Yutai Ma, “An effective memory addressing scheme for FFT processors,” IEEE Transactions on Signal Processing, Vol. 47 Issue: 3, pp. 907-911, Mar. 1999.

[9] Yutai Ma and Lars Wanhammar, “A hardware efficient control of memory addressing for high-performance FFT processors,” IEEE Transactions on Signal Processing, Vol. 48 Issue: 3, pp. 917-921, Mar. 2000.

[10] C. H. Chang, C. L. Wang and Y. T. Chang, “Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse,” IEEE Transactions on Signal Processing, Vol. 48 Issue: 11, pp. 3206-3216, Nov. 2000.

[11] C. L. Wang and C. H. Chang, “A new memory-based FFT processor for VDSL transceivers,” IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 670-673, 2001.

[12] A. M. Despain, “Fast Fourier transform using CORDIC iterations,” IEEE Trans. Comput., Vol. C-23 No. 10, pp. 933-1001, Oct. 1974.

[13] G. Bi and E. V. Jones, “A pipelined FFT processor for word sequential data,” IEEE Trans. Acoust., Speech, Signal Processing, Vol. 37 No. 12, pp. 1982-1985, Dec. 1989.

[14] L. R. Rabiner and B. Gold, Theory and application of digital signal processing, Prentice-Hall Inc., 1975.

[15] B. S. Kim and L. S. Kim, “Low power pipelined FFT architecture for synthetic aperture radar signal processing,” in Proc. IEEE Midwest Symposium on Circuits and Systems, Vol.3, pp. 1367-1370, 1996.

[16] M. M. Jamali, S. C. Kwatra and D. H. Shetty, “Module generation based VLSI implementation of a demultiplexer for satellite communications,” in Proc. IEEE International Symposium on Circuits and Systems, Vol.4, pp. 364-367, 1996.

[17] A. Delaruelle, J. Huisken, J. van Loon, and F. Welten, “A channel demodulator IC for digital audio broadcasting,” in Proc. IEEE Custom Integrated Circuits Conference, pp. 47-50, 1994.

[18] D. Cohen, “Simplified control of FFT hardware,” IEEE Trans. Acoust., Speech

[19] C.K. Chang, “Investigation and design of FFT core for OFDM communication systems,” NCTU, Master Thesis, Jun. 2002.

[20] C.P. Hung, “Design of variable-length FFT processor,” NCTU, Master Thesis, Jun. 2003.

[21] J.C. Chi and S.G. Chen, “An efficient FFT twiddle factor generator,” 12th EUSIPCO, Sep. 2004.

[22] L. Fanucci, R. Roncella, and R. Saletti, “A sine wave digital synthesizer based on a quadratic approximation,” Proceedings of IEEE Frequency Control Symposium and PDA Exhibition, pp. 806-810, Jun. 2001.

[23] A.M. Sodagar and G. Roientan, “A novel architecture for ROM-less sine-output direct digital frequency synthesizers by using the 2nd-order parabolic approximation,” Proceedings of IEEE Frequency Control Symposium and Exhibition, pp. 284-289, Jun. 2002.

[24] A. Bellaouar, M. Obrecht, A. Fahim, and M.I. Elmasry, “A low-power direct digital frequency synthesizer architecture for wireless communications,” Proceedings of IEEE Custom Integrated Circuit, pp. 593-596, May 1999.

[25] F. Curticapean and J. Niittylahti, “Low-power direct digital frequency synthesizer,” Proceedings of IEEE Circuit and System, vol: 2, pp. 8-11, Aug. 2000.

[26] A.M. Eltawil and B. Daneshrad, “Piece-wise parabolic interpolation for direct digital frequency synthesis,” Proceedings of IEEE Custom Integrated Circuits, pp. 401-404, May 2002.

[27] L. Xiu and Z. You, “A new frequency synthesis method based on flying-adder architecture,” Trans. on IEEE Circuits and Systems, vol: 50, pp. 130-134, Mar. 2003.

[28] A.M. Eltawil and B. Daneshrad, “Interpolation based direct digital frequency synthesis for wireless communications,” Proceedings of IEEE WCNC, vol: 1, pp. 73-77, Mar. 2002.

[29] N.J. Fliege and J. Wintermantel, “Complex digital oscillator and FSK modulators,” Tran. on IEEE Signal Processing, vol: 40, pp. 333-342, Feb. 1992.

[30] M.M. Al-Ibrahim, “A simple recursive digital sinusoidal oscillator with uniform frequency spacing,” Proceedings of IEEE Circuits and Systems, pp. 689-692, Mar. 2001.

[31] A.V. Oppenheim, R.V. Schafer and J.R. Buck, Discrete-time signal processing, 2nd Ed., Prentice Hall, 1999.

[32] J.E. Volder, “The CORDIC trigonometric computing technique,” IRE Trans. Electronic Comput., Vol. EC-8, pp. 330-334, 1959.

[33] J.S. Walther, “A unified algorithm for elementary functions,” AFIPS Spring Joint Comput. Conf., pp. 379-385, 1971.

[34] M.D. Ercegovac and T. Lang, “Redundant and on-line CORDIC: application to matrix triangularization and SVD,” IEEE Trans. on Computers, Vol. 39, No. 6, pp. 725-740, Jun. 1990.

[35] N. Takagi, T. Asada, and S. Yajima, “Redundant CORDIC method with constant scale factor for sine and cosine computation,” IEEE Trans. on Computers, Vol. 40, No. 9, pp. 989-995, 1991.

[36] D. Timmermann, H. Hahn, and B.J. Hosticka, “Low latency time CORDIC algorithms,” IEEE Trans. on Computers, Vol. 41, No. 8, pp. 1010-1015, 1992.

[37] E. Antelo, J. Villalba, J. D. Bruguera, and E. L. Zapata, “High performance rotation architectures based on the radix-4 CORDIC algorithm,” IEEE Trans. on Computers, Vol. 46, No. 3, pp. 855-870, Aug. 1997.

[38] P.R. Rao and I. Chakrabarti, “High-performance compensation technique for the radix-4 CORDIC algorithm,” Proceedings of IEEE International Symposium on

[39] E. Antelo, T. Lang, and J.D. Bruguera, “Very-high radix CORDIC rotation based on selection by rounding,” Journal of VLSI Signal Processing, Vol. 25, pp. 141-153, 2000.

[40] E. Antelo, T. Lang, and J.D. Bruguera, “Very-high Radix circular CORDIC: vectoring and unified rotation/vectoring,” IEEE Trans. on Computers, Vol. 49, No. 7, pp. 727-739, Jul. 2000.

[41] S.F. Hsiao and C.Y. Lau, “Design of a unified arithmetic processor based on redundant constant-factor CORDIC with merged scaling operation,” Proceedings of IEEE International Symposium on Circuits and Systems, pp. 137-140, 2000.

[42] H. Dawid, and H. Meyr, “The differential CORDIC algorithm: constant scale factor redundant implementation without correcting iterations,” IEEE Trans. on Computers, Vol. 45, No. 3, pp. 307-318, Mar. 1996.

[43] Y.H. Hu, and S. Naganathan, “An angle recoding method for CORDIC algorithm implementation,” IEEE Trans. on Computers, Vol. 42, No. 1, pp. 99-102, Jan. 1993.

[44] C.S. Wu, A.Y. Wu, and C.H. Lin, “A high-performance/low-latency vector rotational CORDIC architecture based on extended elementary angle set and trellis-based searching schemes,” Circuits and Systems II: Analog and Digital Signal Processing, IEEE Transactions on Volume: 50, Issue: 9, pp. 589-601, Sep. 2003.

[45] C.C. Li and S.G. Chen, “New redundant CORDIC algorithm with fast variable scale factor compensations,” Proceedings of IEEE International Symposium Circuits and systems, pp. 264-267, May 1996.

[46] C.C. Li and S.G. Chen, “A radix-4 redundant CORDIC algorithm with fast on-line variable scale factor compensation,” Proceedings of IEEE International Conference on Acoustic, Speech and Signal Processing, pp. 639-642, 1997.

[47] S.G. Chen and C.F. Lin, “A CORDIC algorithm with fast rotation prediction and small iteration number,” Proceedings of IEEE International Symposium on Circuits and Systems, pp. 229-232, 1998.

[48] J.C. Chin and S.G. Chen, “Fast CORDIC algorithm based on a new recoding scheme for rotation angles and variable scale factors,” Journal of VLSI signal processing, Vol. 8, pp.56-61, 2002.

[49] T.C. Chen, “Automatic computation of exponentials, logarithms, ratios and square roots,” IBM Journal Res. And Dev., Vol. 16, pp. 380-388, Jul. 1972.

[50] H. Dawid, and H. Meyr, “The differential CORDIC algorithm: constant scale factor redundant implementation without correcting iterations,” IEEE Trans. on Computers, Vol. 45, No. 3, pp. 307-318, Mar. 1996.

[51] J.A. Huisken, M.J.G. Bekooij, G.C.M. Gielis, P.W.F. Gruijters, F.P.J. Welten, “A power-efficient single-chip OFDM demodulator and channel decoder for multimedia broadcasting,” ISSCC 1998 IEEE International, pp. 40-41, Feb. 1998.
