
5 Long-Length based Effective Pipeline FFT/IFFT Processor

5.5 Chip Implementation

Following functional verification in the Matlab environment, the proposed R4²SDF- and R4³SDF-based 4096-point FFT/IFFT architectures, with internal word-lengths of 14 bits and 13 bits respectively, were synthesized with Design Compiler in TSMC 0.13 µm CMOS technology. The single-port SRAM uses a 6T bit cell under the standard logic process rules. Floorplanning and layout were performed with Astro. Post-layout simulation was run with NC-Simulator to verify functionality after back-annotation from the Star-RC extractor, and static timing was signed off with PrimeTime. Finally, power analysis and DRC were conducted using Astro Rail and Dracula, respectively. The post-layout core areas of the R4²SDF and R4³SDF designs are 1.01 and 0.89 mm², including power rings and power straps, as depicted in Fig. 30(a) and 30(b), respectively. The gate-count usage of each building block in the two designs is listed in Table 10. Compared with the R4²SDF architecture, the R4³SDF architecture replaces one complex multiplier with one constant multiplier in the 4096-point FFT/IFFT computation, as depicted in Fig. 22 and 23. As a result, the multiplier share of the gate count (complex plus constant multipliers) is 3.9 percentage points lower in the R4³SDF design than in the R4²SDF design, as listed in Table 10. The 4095-word feedback memory clearly dominates both chips, accounting for 77.34 % and 80.72 % of the R4²SDF and R4³SDF designs, respectively.
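The 4095-word figure can be checked with a short count. The sketch below assumes the usual single-path delay feedback arrangement, in which radix-4 stage s holds three delay lines of N/4^s words each; the function name `sdf_feedback_words` is hypothetical and the stage organization is an assumption, not taken from the chip netlist.

```python
def sdf_feedback_words(n_points, radix):
    """Total feedback-memory words of an SDF pipeline.

    Assumption: each stage keeps (radix - 1) delay lines whose length is
    the stage span divided by the radix.
    """
    total = 0
    span = n_points
    while span > 1:
        span //= radix                 # delay-line length of this stage
        total += (radix - 1) * span    # (radix - 1) feedback lines per stage
    return total


words_4096 = sdf_feedback_words(4096, 4)
words_256 = sdf_feedback_words(256, 4)
print(words_4096, words_256)
```

Under this assumption the total is always N - 1 words, which is why the memory dominates the gate count of a long-length design.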

Both chips operate at 20 MHz, which satisfies the throughput requirement. Regarding speed, because a pipelined multiplier is easy to design for a clock rate of 20 MHz or higher, the proposed architectures can reach a high clock rate through simple pipelining of the arithmetic components. The average power dissipation of the R4²SDF- and R4³SDF-based 4096-point FFT/IFFT designs is 6.3725 and 5.985 mW, respectively, at 20 MHz and a 1.2 V supply voltage. The R4²SDF layout shown in Fig. 30(a) has 68 I/O pins, eight of which are power supply pins. Owing to its smaller data width, the R4³SDF layout shown in Fig. 30(b) has only 64 I/O pins. Both the R4²SDF- and R4³SDF-based 4096-point FFT/IFFT implementations satisfy the system performance required by the DVB-H standard.
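A rough timing and energy check illustrates the throughput claim. The sketch assumes that the pipeline accepts one sample per clock cycle and that the DVB-H 4K mode in an 8 MHz channel uses an elementary period of 7/64 µs; both are assumptions of this example rather than figures stated in the text, and the variable names are hypothetical.

```python
CLOCK_HZ = 20e6                        # operating clock from the text
N_POINTS = 4096
ELEMENTARY_PERIOD_S = 7 / 64 * 1e-6    # assumption: DVB-T/H, 8 MHz channel

# Assumption: one sample enters the pipeline per clock cycle.
transform_time_s = N_POINTS / CLOCK_HZ
# Useful OFDM symbol duration for a 4096-carrier mode.
symbol_time_s = N_POINTS * ELEMENTARY_PERIOD_S

# Average power figures reported in the text (watts).
power_w = {"R4^2SDF": 6.3725e-3, "R4^3SDF": 5.985e-3}
energy_j = {name: p * transform_time_s for name, p in power_w.items()}

print(f"transform time: {transform_time_s * 1e6:.1f} us")
print(f"symbol time:    {symbol_time_s * 1e6:.1f} us")
for name, e in energy_j.items():
    print(f"{name}: {e * 1e6:.4f} uJ per transform")
```

Under these assumptions one 4096-point transform takes less than half of a useful symbol period, which is consistent with the statement that 20 MHz is sufficient.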

Additionally, the proposed R4³SDF-based 4096-point FFT/IFFT implementation has low power consumption (5.985 mW) and the lowest hardware requirement among the tested pipeline architectures. These findings indicate that the proposed design meets the requirements of a highly effective pipeline FFT/IFFT processor for SoC IP.

(a) The layout view of proposed R4²SDF design.

(b) The layout view of proposed R4³SDF design.

Fig. 30: The layout view of proposed 4096-point pipeline FFT/IFFT processor.

Table 10: The Gate Count Usage of Each Building Block in the Proposed Design.

Categories   Control   Butterfly Cores   Complex Multiplier   Constant Multipliers   Shift Registers
R4²SDF       0.33 %    10.1 %            9.83 %               2.4 %                  77.34 %
R4³SDF       0.35 %    10.6 %            5.03 %               3.3 %                  80.72 %
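The 3.9-point multiplier saving quoted in Section 5.5 follows directly from the Table 10 entries. The short check below copies the percentages from the table; the dictionary layout and the names `TABLE_10` and `multiplier_share` are only illustrative.

```python
# Gate-count shares (percent) copied from Table 10.
TABLE_10 = {
    "R4^2SDF": {"control": 0.33, "butterfly": 10.1, "complex_mult": 9.83,
                "constant_mult": 2.4, "shift_registers": 77.34},
    "R4^3SDF": {"control": 0.35, "butterfly": 10.6, "complex_mult": 5.03,
                "constant_mult": 3.3, "shift_registers": 80.72},
}


def multiplier_share(design):
    """Complex plus constant multiplier share of the gate count."""
    row = TABLE_10[design]
    return row["complex_mult"] + row["constant_mult"]


for name, row in TABLE_10.items():
    print(name, "row total:", round(sum(row.values()), 2),
          "multipliers:", round(multiplier_share(name), 2))

reduction = multiplier_share("R4^2SDF") - multiplier_share("R4^3SDF")
print("reduction in percentage points:", round(reduction, 2))
```

Each row sums to 100 %, so the entries are shares of the total gate count and the difference is a difference in percentage points.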

5.6 Summary

This work develops two highly effective pipeline VLSI architectures, R4²SDF and R4³SDF, that support long-length FFT/IFFT computations. The proposed R4³SDF pipeline FFT/IFFT architecture has lower multiplicative complexity and a higher hardware utilization rate at a smaller cost than R4²SDF and other pipeline architectures. Following fixed-point analysis in a 40 dB AWGN environment, the proposed R4²SDF- and R4³SDF-based 4096-point FFT/IFFT designs were successfully implemented in 0.13 µm CMOS technology with internal word-lengths of 14 and 13 bits, respectively. They consume 6.3725 and 5.985 mW, respectively, at 20 MHz and a 1.2 V supply voltage. These features indicate that the proposed R4³SDF pipeline 4096-point FFT/IFFT processor meets the requirements of a highly effective VLSI architecture.

Chapter 6 Effective Triple-Mode Reconfigurable Pipeline FFT/IFFT/2D-DCT Processor

Tell et al. [8] presented an FFT/WALSH/1-D DCT processor for the multiple radio standards of upcoming 4th-generation wireless systems. However, some designs [8-10] support only 1-D DCT computation and have no 2-D DCT support, although 2-D DCT is desirable for video compression in wireless communication applications. This study not only presents a single reconfigurable architecture for the 256-point FFT/IFFT modes and the 8×8 2-D DCT mode, but also achieves high cost-efficiency in portable multimedia applications. A comprehensive comparison further indicates that the proposed R4²SDF-based pipeline processor achieves higher utilization with a smaller hardware requirement than the R2²SDF-based pipeline processor [31] in the 256-point FFT/IFFT mode, and thus has higher cost-efficiency. The proposed R4²SDF-based design also achieves satisfactory performance for the DV encoding standard at the lowest cost in the 8×8 2-D DCT mode.

The remainder of this chapter is organized as follows. Section 6.1 presents a new R4²SDF FFT/IFFT and 8×8 2-D DCT algorithm. Section 6.2 describes the proposed FFT/IFFT/2-D DCT pipeline architecture using the R4²SDF algorithm. Section 6.3 gives the finite word-length analysis, which indicates that the proposed architecture achieves the required system performance in both the 256-point FFT/IFFT and 8×8 2-D DCT modes with the lowest hardware cost. Section 6.4 tabulates the comparison results in terms of hardware utilization and cost to demonstrate the high cost-efficiency of the proposed architecture, and also discusses the chip implementation. Section 6.5 draws conclusions.

6.1 8×8 2D FFT and 8×8 2D DCT Formula

Two concurrent 2D DCTs can be calculated by a single 2D shifted FFT (SFFT) algorithm [74] through input reordering and post-computation. This study presents a high-speed pipeline processor that supports the triple-mode 256-point FFT/IFFT/8×8 2D DCT with the radix-4² algorithm. Two concurrent 2D DCT results can be obtained by the [...]

This study neglects the post-scaling factor of [...] could then be reordered as [...] 1/4 samples. A detailed description of the transfer function between the 8×8 2D SFFT and the 8×8 2D DCT can be found in Appendix A.
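The role of input reordering and a quarter-sample shift can be illustrated with the textbook identity for a real 1-D DCT-II, applied row by row and column by column to an 8×8 block. The sketch below is not the concurrent two-DCT 2D SFFT formulation of this chapter or of [74]; it only shows, under that simpler assumption, that an unscaled DCT equals the real part of a DFT whose time index is shifted by 1/4 sample after reordering. All function names are hypothetical, and the post-scaling factor is omitted, as in the text.

```python
import cmath
import math


def reorder(x):
    """Even-indexed samples first, then odd-indexed samples in reverse."""
    return x[0::2] + x[1::2][::-1]


def shifted_dft(v):
    """N-point DFT with a quarter-sample shift: sum v[n] * W_N^((n + 1/4) k)."""
    n_pts = len(v)
    return [
        sum(v[n] * cmath.exp(-2j * math.pi * (n + 0.25) * k / n_pts)
            for n in range(n_pts))
        for k in range(n_pts)
    ]


def dct_via_sfft(x):
    """Unscaled DCT-II as the real part of the shifted DFT of the reordered input."""
    return [y.real for y in shifted_dft(reorder(x))]


def dct_direct(x):
    """Unscaled DCT-II from its cosine definition, used as a reference."""
    n_pts = len(x)
    return [
        sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * n_pts))
            for n in range(n_pts))
        for k in range(n_pts)
    ]


def dct2d(block, dct1d):
    """Row-column 2-D transform built from a 1-D transform."""
    rows = [dct1d(list(r)) for r in block]
    cols = [dct1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# Arbitrary real-valued 8x8 test block.
block = [[(3 * r + 5 * c) % 11 - 5 for c in range(8)] for r in range(8)]
fast = dct2d(block, dct_via_sfft)
ref = dct2d(block, dct_direct)
max_err = max(abs(a - b) for fr, rr in zip(fast, ref) for a, b in zip(fr, rr))
print("results agree:", max_err < 1e-9)
```

Because only the real part is used for one real input, a complex-input transform has spare capacity, which is the intuition behind computing two DCTs concurrently.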

[equation not recoverable from the source]

where 0 ≤ k₁, k₂, n₁, n₂ ≤ 7. Since the input data y(n₁, n₂) form a real-valued sequence, the second half of the output can be derived as

[equation not recoverable from the source]

[...] respectively, can then be created as

[equation not recoverable from the source]

[...] substituted as X[8k₁+k₂] and x(8n₁+n₂), respectively. Then, the specific two-dimensional (2D) linear index map is applied as follows:

[...]

[...] minimized by following the specific mapping in (96). The 8×8 2-D FFT CFA form can then [...]

The butterfly structures for the 8×8 2D DCT, corresponding to equations (88)-(97) above, are summarized as follows:

Butterfly stage I:

The time-domain shift stage of 2-D DCT:

[equation not recoverable from the source] (98)

The two 8×8 2D DCT results X₁[8k₁+k₂] and X₂[8k₁+k₂] are calculated concurrently in the post-computation of butterfly stage IV. The 8×8 2-D IDCT computation can be obtained by a similar decomposition procedure. Because of the cost constraint in the physical design, this study considers only the triple-mode FFT/IFFT and 2-D DCT computations. The derivation of the radix-4² based FFT/IFFT/2-D DCT algorithm indicates that every butterfly computation can be implemented with four four-input complex adders and some shuffle circuits, so the radix-4 butterfly structure needs no multipliers. Additionally, a regular structure can be easily derived for both the 8×8 2-D DCT and the 256-point FFT/IFFT pipeline processor architectures.
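The multiplier-free property of a radix-4 butterfly can be seen in a few lines: multiplying by ±j only swaps the real and imaginary parts and flips a sign, so each output is a four-input complex addition. The sketch below is a generic radix-4 butterfly on (real, imaginary) integer pairs, offered as an illustration rather than the shuffle circuit of the proposed design; the helper names are hypothetical.

```python
import cmath
import math


def neg(z):
    return (-z[0], -z[1])


def times_j(z):
    """(a + jb) * j = -b + ja: a swap and one sign change."""
    return (-z[1], z[0])


def times_neg_j(z):
    """(a + jb) * (-j) = b - ja: a swap and one sign change."""
    return (z[1], -z[0])


def add4(p, q, r, s):
    """Four-input complex adder."""
    return (p[0] + q[0] + r[0] + s[0], p[1] + q[1] + r[1] + s[1])


def radix4_butterfly(a, b, c, d):
    """4-point DFT using only additions, swaps, and sign changes."""
    return [
        add4(a, b, c, d),
        add4(a, times_neg_j(b), neg(c), times_j(d)),
        add4(a, neg(b), c, neg(d)),
        add4(a, times_j(b), neg(c), times_neg_j(d)),
    ]


def dft4_reference(samples):
    """Direct 4-point DFT with complex multiplications, for comparison."""
    return [
        sum(complex(*samples[n]) * cmath.exp(-2j * math.pi * n * k / 4)
            for n in range(4))
        for k in range(4)
    ]


inputs = [(1, 2), (3, -1), (0, 4), (-2, 5)]
outputs = radix4_butterfly(*inputs)
reference = dft4_reference(inputs)
matches = all(abs(complex(*o) - r) < 1e-9 for o, r in zip(outputs, reference))
print(outputs, matches)
```

Each of the four outputs uses one call to `add4`, which mirrors the count of four four-input complex adders given in the text.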

6.2 Pipeline 256-Point FFT/IFFT/8×8 2D-DCT Processor
