Comparison of Separated Twiddle Factor ROM

Chapter 4 Parallel-In-Parallel-Out FFT/IFFT Processor Architecture Design

4.5 Hardware Implementation Result

4.5.2 Comparison of Separated Twiddle Factor ROM

Fig. 4.29 Area comparison of separated twiddle factor ROM

As discussed in Section 4.3.3, the conventional twiddle factor ROM is partitioned among the PEs, and each partition is called a PE-based TW ROM. Each PE-based TW ROM is implemented as combinational logic so that the synthesis tool can optimize its area. For instance, many of the twiddle factors for PE0 share the same value, as discussed in Section 4.3.3. Thus, as shown in Fig. 4.29, the PE-based TW ROM for PE0 occupies less area than the other twiddle factor ROMs.

4.5.3 Hardware Implementation Result

As the hardware implementation results in Table 4-8 show, the proposed 1024-point FFT/IFFT processor achieves a throughput rate of up to 1.28 G samples/sec and an execution time as low as 7.3 us when working at 160 MHz. At the system-required 78.4 MHz, the execution time is 14.9 us, which meets the system requirement of 25 us, and the power consumption is 21.7 mW. The design uses 155792 gates (including memory) and occupies 0.545 mm² in a 90 nm CMOS 1P9M 1V process.

Table 4-8 Hardware Implementation of the Proposed FFT/IFFT Processor

Items                      Specification
FFT Size                   1024 points
Process Technology         90 nm CMOS 1P9M 1V
Max Working Frequency      160 MHz
System Working Frequency   78.4 MHz
Throughput Rate            1.28 G samples/sec @ 160 MHz
Power Consumption          45.1 mW @ 160 MHz; 21.7 mW @ 78.4 MHz (estimated by Design Compiler)
Gate Count / Area          155792 gates @ 78.4 MHz (including memory) / 0.545 mm²
Memory Size                8 x 128 words x 40 bits
Memory Area                77816 gates (8 bank memories)
Execution Time             7.3 us @ 160 MHz; 14.9 us @ 78.4 MHz
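The throughput and execution-time figures in Table 4-8 can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original design flow. It assumes that throughput equals samples per clock cycle times clock frequency, and that execution time equals a fixed cycle count divided by clock frequency. The function names are hypothetical.

```python
# Cross-check of the Table 4-8 figures (illustrative sketch only).
# Assumption: throughput = samples per cycle x clock frequency,
#             execution time = cycle count / clock frequency.

def samples_per_cycle(throughput_sps: float, freq_hz: float) -> float:
    """Number of samples processed in each clock cycle."""
    return throughput_sps / freq_hz

def cycle_count(exec_time_s: float, freq_hz: float) -> float:
    """Number of clock cycles implied by an execution time."""
    return exec_time_s * freq_hz

parallelism = samples_per_cycle(1.28e9, 160e6)   # samples per cycle at 160 MHz
cycles_max = cycle_count(7.3e-6, 160e6)          # cycles implied at 160 MHz
cycles_sys = cycle_count(14.9e-6, 78.4e6)        # cycles implied at 78.4 MHz

print(f"samples per cycle : {parallelism:.2f}")
print(f"cycles @ 160 MHz  : {cycles_max:.1f}")
print(f"cycles @ 78.4 MHz : {cycles_sys:.1f}")
```

Under these assumptions both operating points imply roughly the same cycle count per 1024-point transform, which is consistent with the execution time scaling inversely with the clock frequency.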

4.6 Summary

To evaluate the proposed FFT/IFFT processor, we compare its computation complexity and memory requirement with those of other architectures in Table 4-9. Compared with the R8MDC and radix-8 memory-based architectures, the proposed FFT processor requires fewer complex multipliers and no reorder buffer. As a result, the proposed FFT/IFFT architecture can meet the system requirement with the least hardware complexity.

Table 4-9 Comparison of several high throughput FFT architectures (column headings recovered: R2³SDF, R8MDC, Radix-8)

We also use the hardware cost of the proposed FFT processor to estimate the hardware cost of the other architectures; the results are shown in Table 4-10. The proposed FFT processor saves about 81.0% and 43.9% of the complex multipliers compared with the R8MDC and radix-8 memory-based architectures, respectively. Its memory banks, built from single-port memories, reduce the memory area by about 64.8% and 32.0% relative to R8MDC and radix-8 memory-based, respectively. Moreover, it removes the need for a reorder buffer, which would occupy 69490 gates. The data latency of the proposed FFT processor is 14.9 us at 78.4 MHz, which meets the system requirement of 25 us. The proposed FFT processor therefore meets the system requirement with the lowest hardware cost among the compared architectures.

Table 4-10 Comparison of hardware cost for different architectures (column headings recovered: R8MDC, R8M, Proposed; entries recovered: Complex Adders 22243 (100%), Reorder Buffer Size 69490 (100%))

Chapter 5 Chip Implementation of IEEE 802.16e Receiver

This chapter introduces the chip design flow for the IEEE 802.16e baseband receiver. The 802.16e baseband receiver includes a frequency divider, a synchronization block, an FFT processor for the FFT_dem block with 5 memory banks, and a channel estimation block with FFT/IFFT processors for the FFT_ch/IFFT_ch blocks, as shown in Fig. 2.5 and Fig. 2.6.

5.1 Design Flow

The 802.16e baseband receiver system is first modeled in C. For hardware implementation, each component is simulated in fixed point, and its word length is chosen to minimize the performance loss relative to the floating-point system model. After the word length of each component is decided, each component is described in Verilog (RTL design). Besides the performance analysis, the Verilog RTL of each component is verified in Verilog-XL against the results generated by the C model (RTL verification). Once RTL verification passes, the RTL code is synthesized with Design Compiler under suitable constraints; we usually over-constrain timing at this stage because the synthesis tool does not know the actual wire delays. The gate-level netlist generated by the synthesis tool is verified in Verilog-XL against the RTL verification results (gate-level verification). An APR (Automatic Place and Route) tool, such as SOC Encounter, then implements the chip from the gate-level netlist and helps ensure that the chip meets the layout design rules. The layout file (GDS) generated by the APR tool is verified in Calibre by DRC (Design Rule Check) and LVS (Layout versus Schematic). Post-layout simulation, using the gate-level netlist generated by the APR tool, is compared with the gate-level simulation. Usually, a post-layout transistor-level simulation is run before the chip is taped out. However, the simulation time is too long if the chip has too many transistors. The 802.16e baseband receiver has over 1 million gates and is too large for post-layout transistor-level simulation, so we skip this stage. Finally, the chip is taped out.

Fig. 5.1 Cell based chip design flow
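The word-length decision in the fixed-point simulation step can be illustrated with a small sweep. The sketch below is a generic illustration, not the C model used in this work. The test signal (uniform random samples in [-1, 1)), the rounding quantizer without saturation, and the names quantize and sqnr_db are all assumptions made for the example.

```python
import math
import random

def quantize(x: float, frac_bits: int) -> float:
    """Round x to a fixed-point grid with frac_bits fractional bits (no saturation)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def sqnr_db(samples, frac_bits: int) -> float:
    """Signal-to-quantization-noise ratio in dB for the given fractional word length."""
    signal = sum(s * s for s in samples)
    noise = sum((s - quantize(s, frac_bits)) ** 2 for s in samples)
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)

# Hypothetical test signal: uniform random samples standing in for real data.
random.seed(0)
samples = [random.uniform(-1.0, 1.0) for _ in range(4096)]

# Sweep the word length and observe how the quantization noise shrinks.
for bits in (6, 8, 10, 12):
    print(f"{bits:2d} fractional bits -> SQNR {sqnr_db(samples, bits):.1f} dB")
```

In practice the sweep would be run on the system-level performance metric of the receiver rather than on SQNR alone, and the smallest word length with acceptable loss would be carried into the RTL.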

5.2 Multi-Frequency Design

The 802.16e baseband receiver has two clock domains. One runs at 11.2 MHz to achieve the data rate required by IEEE 802.16e. The other runs at 7 times 11.2 MHz, which equals 78.4 MHz. Since there are combinational circuits between registers in different clock domains, we have to set different timing constraints on those paths so that the intended timing checks are applied. An example of two clock domains is shown in Fig. 5.2, and the default timing check is shown in Fig. 5.3.

Fig. 5.2 Combinational logic circuits between 2 clock domains (figure labels: DFF A1, DFF A2, DFF B1, DFF B2, CLK_A, CLK_B, Comb. Logic)

Fig. 5.3 Default timing check in 2 clock domains

The frequency of CLK_A is 3 times that of CLK_B, and the two timing-check cases between the clock domains are shown in Fig. 5.3. In the upper case of Fig. 5.3 (DFFA1 to DFFB2), the default timing check limits the combinational path to one cycle of CLK_A. In the lower case of Fig. 5.3 (DFFB1 to DFFA2), the default timing check also limits the combinational path to one cycle of CLK_A. However, in the lower case, the expected timing constraint of the combinational circuits is usually one cycle of CLK_B, as shown in Fig. 5.4. Therefore, we have to correct the default timing check by setting SDC (Synopsys Design Constraints) constraints in the synthesis tool. The SDC commands for changing the timing constraint from Fig. 5.3 to Fig. 5.4 are

“set_multicycle_path 3 -end -setup -from CLK_B -to CLK_A” and

“set_multicycle_path 2 -end -hold -from CLK_B -to CLK_A”.

Fig. 5.4 Expected timing constraint for DFFB1 to DFFA2
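The effect of the setup and hold multipliers can be checked with a small idealized timing model. The sketch below assumes ideal clocks with aligned edges at t = 0 and a launch from the CLK_B register at t = 0. Edge positions are counted in CLK_A cycles, and the helper names are hypothetical. It follows the usual SDC convention that, with -end, a setup multiplier N moves the setup check to the N-th capture edge, and that the hold check defaults to one capture cycle before the setup edge and moves back further by the hold multiplier.

```python
# Idealized model of the capture edges used for timing checks on a path
# from CLK_B (slow, launch) to CLK_A (fast, capture).
# Assumption: both clocks rise at t = 0 and the data is launched at t = 0.
# All edge positions are expressed in CLK_A cycles after the launch edge.

def setup_edge(setup_mult: int = 1) -> int:
    """Capture edge checked for setup when the setup multiplier is setup_mult."""
    return setup_mult

def hold_edge(setup_mult: int = 1, hold_mult: int = 0) -> int:
    """Capture edge checked for hold: one cycle before setup, minus hold_mult."""
    return setup_mult - 1 - hold_mult

RATIO = 3  # CLK_A runs 3 times faster than CLK_B, as in the example of Fig. 5.3

default = (setup_edge(), hold_edge())                    # no multicycle constraint
setup_only = (setup_edge(RATIO), hold_edge(RATIO))       # setup multiplier 3 only
corrected = (setup_edge(RATIO), hold_edge(RATIO, RATIO - 1))  # setup 3, hold 2

print("default    (setup, hold) edges:", default)
print("setup only (setup, hold) edges:", setup_only)
print("corrected  (setup, hold) edges:", corrected)
```

In this model the setup multiplier alone extends the path budget to three CLK_A cycles, which is one CLK_B cycle, but it also drags the hold check to a later capture edge. The hold multiplier of 2 brings the hold check back to the launch edge, which is why the two commands are used together.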

In our case, the FFT_dem block receives its input from the synchronization block. The synchronization block works at the low clock frequency of 11.2 MHz, while the FFT_dem block works at the high clock frequency of 78.4 MHz. Thus, we also have to set SDC constraints for this path, using commands similar to those above.

Since we have two clock domains, a frequency divider is used in our baseband receiver design. If the receiver is synthesized together with the frequency divider, the divider introduces clock skew at the synthesis stage, although this skew should be fixed later by the APR tool. Therefore, we separate the receiver into two parts: the frequency divider, and the remaining circuits with two ideal clock inputs. The two parts are synthesized individually and combined in the APR tool, as shown in Fig. 5.5. In addition, gate-level verification is performed on the gate-level netlist of the circuits without the frequency divider, simulated with two ideal clocks.

Fig. 5.5 Synthesis flow of chip with frequency divider (figure labels: 802.16e Receiver; Circuits with 2 Ideal Clock Input; Frequency Divider; Synthesis; Gate-Level Netlist; APR Tool; Gate-Level Simulation with 2 Ideal Clock)

5.3 Chip Floor Plan

Since the 802.16e baseband receiver is a sequential system, its floor plan follows the processing order of the receiver, as shown in Fig. 5.6. The components of the receiver are placed in processing order from the north to the south of the chip.

After APR, the chip size of the 802.16e baseband receiver is 3211 × 3211 um². However, this square chip is too large to fit together with other chips in a shuttle, since the shuttle size is 4000 × 4000 um². In order to tape out with other chips, a rectangular version of the receiver chip, shown in Fig. 5.7, replaces the square version. The chip size of the rectangular version is 3955 × 2755 um², which is larger than that of the square version but is more flexible to combine with other chips in a shuttle.

Fig. 5.6 Floor plan of the 802.16e baseband receiver

Fig. 5.7 Rectangular version floor plan of the 802.16e baseband receiver
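The trade-off between the two floor plans can be made concrete with the dimensions quoted above. The sketch below is only an illustration. It assumes the chip is placed in one corner of the 4000 × 4000 um² shuttle and reports the width of the free strip left along each side. How chips are actually packed in a shuttle is not described here, and the function names are hypothetical.

```python
# Compare the square and rectangular chip outlines against the shuttle size.
# Assumption: the chip sits in one corner of the shuttle, so the free space
# is described by the strip left along the width and along the height.

SHUTTLE_UM = 4000  # shuttle side length in um

def area_um2(width_um: int, height_um: int) -> int:
    """Chip area in square micrometres."""
    return width_um * height_um

def free_strips(width_um: int, height_um: int, shuttle_um: int = SHUTTLE_UM):
    """Free strip widths (along x, along y) left beside the chip, in um."""
    return shuttle_um - width_um, shuttle_um - height_um

square = (3211, 3211)
rectangular = (3955, 2755)

for name, dims in (("square", square), ("rectangular", rectangular)):
    print(name, "area:", area_um2(*dims) / 1e6, "mm^2",
          "free strips (um):", free_strips(*dims))
```

Under this assumption the square version leaves two narrow strips of equal width, while the rectangular version leaves one wider strip spanning almost the full shuttle width, at the cost of a somewhat larger chip area.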

5.4 Chip Summary

The chip summary is shown in Table 5-1. The square version is prepared for tape-out through CIC, and the rectangular version is taped out directly through UMC. The cell library and PAD library differ between the two versions: the square version uses libraries from Faraday, and the rectangular version uses libraries from UMC. As shown in Table 5-1, the working frequency of the square version meets the system specification, while that of the rectangular version does not. In summary, the taped-out chip measures 3955 × 2755 um², consumes 47.1 mW at 8.2/57.1 MHz, and uses the UMC 90 nm 1V CMOS process. Moreover, the two FFT/IFFT processors for the FFT_ch/IFFT_ch blocks in the DF DFT-based CE of the taped-out chip occupy 1.711 mm² and consume 20.2 mW at 57.1 MHz.

Table 5-1 Chip summary

Item                  Square Version    Rectangular Version (taped out)   FFT_ch/IFFT_ch Processors (taped out)
Technology            UMC 90nm CMOS 1P9M
Core Area (um²)       2411×2411         3144×1944
PAD Core Area (um²)   3057×3057         3799×2599
Chip Area (um²)       3211×3211         3955×2755                         1.711 mm²
Working Frequency     11.2/78.4 MHz     8.2/57.1 MHz                      57.1 MHz
Power Consumption

Chapter 6 Conclusion and Future Work

In this thesis, an FFT/IFFT processor with parallel-in-parallel-out in normal order, used in a DF DFT-based channel estimation block, is proposed. An 802.16e baseband receiver including this DF DFT-based channel estimation is taped out.

To design an FFT/IFFT processor with parallel-in-parallel-out in normal order, we analyze different parallel-in-parallel-out FFT architectures and base the design on the memory-based architecture. Memory allocation lets the FFT/IFFT processor deliver parallel-in-parallel-out data in normal order, and the commutator design lets us use single-port memories to reduce the memory area. These two methods can also be applied to parallel-in-parallel-out FFT processors with different specifications. According to the synthesis results, the proposed 1024-point FFT/IFFT processor achieves a throughput rate of up to 1.28 G samples/sec and an execution time as low as 7.3 us when working at 160 MHz. When working at the system-required 78.4 MHz, it consumes 21.7 mW with 155792 gates (including memory) that occupy 0.545 mm² in a 90 nm, 1V CMOS process.

A study of partial FFT for DF DFT-based channel estimation is also presented in this thesis. The pruning algorithm, which computes only a subset of input or output points, reduces the hardware cost of the FFT processor, and handling multiple subsets of input or output points saves more power in the FFT computation. According to the analysis, the proposed partial FFT processor reduces the memory size by 75.1%, the complex multipliers by 22.3%, and the complex adders by 30% compared with the traditional radix-2 SDF FFT architecture. Furthermore, by adding partial FFT control to the proposed partial FFT processor, the proposed partial FFT reduces multiplication operations by up to 65.3% and addition operations by up to 49.5%, which may save more power if the indices of the 8 valid output points have common bits.

In the future, since we have only implemented the FFT/IFFT processor with parallel-in-parallel-out in normal order, an FFT/IFFT processor better suited to DF DFT-based channel estimation remains to be studied, such as one that combines the partial FFT algorithm with the MIMO FFT concept.

