Design of FFT Processor with Parallel-In-Parallel-Out in Normal Order


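National Chiao Tung University, Department of Electronics Engineering and Institute of Electronics, Master's Thesis. Design of FFT Processor with Parallel-In-Parallel-Out in Normal Order. Student: Hsiang-Sheng Hu. Advisor: Dr. Shyh-Jye Jou. November 2008 (ROC year 97).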

Design of FFT Processor with Parallel-In-Parallel-Out in Normal Order. Student: Hsiang-Sheng Hu. Advisor: Dr. Shyh-Jye Jou. A Thesis Submitted to Department of Electronics Engineering & Institute of Electronics, College of Electrical and Computer Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering. July 2006, Hsinchu, Taiwan, Republic of China. November 2008 (ROC year 97).

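Design of FFT Processor with Parallel-In-Parallel-Out in Normal Order. Student: Hsiang-Sheng Hu. Advisor: Dr. Shyh-Jye Jou. Department of Electronics Engineering and Institute of Electronics, National Chiao Tung University.

Abstract (translated from the Chinese)

To design a parallel-in-parallel-out fast Fourier transform (FFT) processor suited to the channel estimation method in a high-mobility wireless metropolitan area network (WMAN) baseband receiver, this thesis studies, at the hardware design level, various hardware architecture techniques for parallel-in-parallel-out FFTs. In addition, to reduce the control-signal complexity and the registers needed for serial data input and serial data output of the FFT circuit, the thesis proposes an FFT circuit with parallel input and parallel output in normal order, which meets the system's input and output specification for the FFT processor. The thesis then proposes an architecture for an FFT processor with parallel input and parallel output in normal order that suits DFT-based channel estimation in the 802.16e communication system. Furthermore, based on the special requirements that DFT-based channel estimation places on the FFT, it proposes a partial FFT architecture suited to this channel estimation method. Finally, the proposed FFT processor has been implemented in a 2×1 STBC/OFDMA baseband receiver. The processor achieves a throughput of up to 1.28 G samples/s. At the maximum operating frequency of 160 MHz, its data latency is only 7.3 us. At the system-specified frequency of 78.4 MHz, it consumes 21.7 mW with 155792 gates (including memory), which occupy 0.545 mm^2 in a 90 nm, 1 V CMOS process.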

Design of FFT Processor with Parallel-In-Parallel-Out in Normal Order. Student: Hsiang-Sheng Hu. Advisor: Dr. Shyh-Jye Jou. Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University.

ABSTRACT

In order to design a parallel-in-parallel-out Fast Fourier Transform (FFT) processor suitable for channel estimation in a highly mobile wireless metropolitan area network (WMAN) baseband receiver, this thesis studies various parallel-in-parallel-out FFT design techniques at the hardware architecture level. To reduce the control complexity and buffer overhead of streaming data into and out of the FFT processor, the thesis proposes an FFT processor with parallel-in-parallel-out in normal order that meets the system's input and output data requirements. The thesis then proposes a 1024-point FFT processor architecture with parallel-in-parallel-out in normal order, which meets the needs of DFT-based channel estimation in the 802.16e communication system. Furthermore, according to the special requirements of DFT-based channel estimation, the thesis proposes a partial FFT processor architecture suitable for this channel estimation. The FFT/IFFT processor is designed and implemented together with a 2×1 STBC/OFDMA baseband receiver. The proposed 1024-point FFT/IFFT processor achieves a throughput rate of up to 1.28 G samples/sec and an execution time as low as 7.3 us when working at 160 MHz. When working at the system-required 78.4 MHz, it consumes 21.7 mW with 155792 gates (including memory), which occupy 0.545 mm^2 in a 90 nm, 1 V CMOS process.

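Acknowledgements

First, I would like to thank my advisor, Professor Shyh-Jye Jou, for his guidance over more than two years. When I was lost in my research he gave me direction and goals to work toward, and he offered much valuable advice on both personal conduct and research attitude. Next, I thank my senior labmate momo, who was always patient in helping me solve difficulties in my research. I also thank 黃紳叡, a senior student in Professor 陳紹基's laboratory, whose assistance helped me make clear progress in both theory and implementation. 誠文 shared a great deal of his working experience, which made my research go more smoothly, and 庭楨 gave me much help with theory and implementation. I am grateful to all of these senior students for helping me complete my research. I also thank my lab partners 紹維, 盈志, 小胖, 運翔, 舒蓉, 儷蓉, 莊立, and many other classmates, who cheered me on when I was most discouraged and gave me encouragement and understanding. Finally, I thank my parents, my family, and my girlfriend for their long-standing support and encouragement. Without you this research path would have been hard to keep to, and I offer you my deepest gratitude and respect.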

Contents

Chapter 1 Introduction
1.1 Background
1.2 Thesis Organization
Chapter 2 FFT Application in OFDM Communication System
2.1 Concept of OFDM
2.2 Introduction of IEEE 802.16e
2.3 DFT-Based Channel Estimation
2.4 System Specification
2.4.1 Specification of FFT Processor on Demodulation Path
2.4.2 Specification of FFT Processor in Channel Estimation
Chapter 3 FFT Algorithms and Architectures
3.1 Concept of FFT Algorithms
3.1.1 Radix-2 DIF FFT Algorithm
3.2 Concept of FFT Architectures
3.2.1 Pipeline-Based FFT Architecture
3.2.1.1 Radix-r Multi-Path Delay Commutator Architecture
3.2.1.2 Radix-r Single-Path Delay Feedback Architecture
3.2.2 Memory-Based FFT Architecture
3.3 Comparison of Different FFT Architecture
3.4 Partial FFT Design
3.4.1 Concept of Partial FFT
3.4.2 DFT with only a Subset of Input or Output Points
3.4.3 DFT with Multiple Subsets of Output Points
3.4.4 DFT with Multiple Subsets of Input and Output Points
3.4.5 Partial FFT Processor Design in DFT-Based Channel Estimation
3.5 Summary
Chapter 4 Parallel-In-Parallel-Out FFT/IFFT Processor Architecture Design
4.1 System Requirement of the FFT/IFFT Processor
4.2 Architecture of the FFT/IFFT Processor
4.3 FFT Sub_Module Design
4.3.1 Radix-2/4/8 SDF Processing Element
4.3.2 Complex Multiplier
4.3.3 ROM Table
4.3.4 Memory Allocation
4.3.5 Commutator Design
4.3.6 Mixed FFT/IFFT Processor
4.3.7 Fixed-Point Block Design with Dynamic Scaling
4.4 The FFT/IFFT Processor Fixed Point Simulation
4.4.1 Fixed Point Simulation for Constant Multiplier in Radix-2/4/8 PE
4.4.2 Fixed Point Simulation for Twiddle Factor
4.4.3 Fixed Point Simulation for FFT/IFFT Processor
4.5 Hardware Implementation Result
4.5.1 Comparison for the FFT Processor Design Flow
4.5.2 Comparison of Separated Twiddle Factor ROM
4.6 Summary
Chapter 5 Chip Implementation of IEEE 802.16e Receiver
5.1 Design Flow
5.2 Multi-Frequency Design
5.3 Chip Floor Plan
5.4 Chip Summary
Chapter 6 Conclusion and Future Work
Reference

List of Tables

Table 2-1 Comparisons of IEEE 802.16 standards
Table 2-2 System specification of IEEE 802.16e transceiver system
Table 3-1 Comparison of different FFT architecture
Table 3-2 Control counter and function of FFT with partial output points
Table 3-3 Control counter and function of FFT with partial input and output points
Table 3-4 Comparison with Partial FFT and Conventional FFT
Table 3-5 Reduced operations of partial FFT with radix-2 SDF architecture
Table 4-1 FFT/IFFT system requirement
Table 4-2 Twiddle factors value for different PE in different stages
Table 4-3 Address of PE-based TW ROM in each stage
Table 4-4 Read or write address for the processing elements in each stage
Table 4-5 Scale down block parameter for FFT/IFFT mode
Table 4-6 System required SQNR for FFT/IFFT processor
Table 4-7 Comparison of different version FFT processor
Table 4-8 Comparison of several high throughput FFT architectures
Table 4-9 Comparison of hardware cost for different architectures
Table 5-1 Chip summary

List of Figures

Fig. 2.1 Bandwidth allocation for sub-channels in FDM system
Fig. 2.2 Bandwidth allocation for sub-channels in OFDM system
Fig. 2.3 Basic block diagram of an OFDM transceiver system
Fig. 2.4 Block diagram of DFT-based channel estimation
Fig. 2.5 Block diagram of baseband transceiver in IEEE 802.16e
Fig. 2.6 Block diagram of decision feedback DFT-based channel estimation
Fig. 2.7 FFT Processor with 5 shared memories
Fig. 2.8 Time chart for the 5 memory banks
Fig. 2.9 FFT processor in decision feedback DFT-based channel estimation
Fig. 3.1 Radix-2 DIF FFT algorithm architecture
Fig. 3.2 Radix-2 butterfly module
Fig. 3.3 Vertical projection mapping of 8-point radix-2 DIF FFT
Fig. 3.4 64-point FFT with R4MDC architecture
Fig. 3.5 Modified input stage and output stage of 64-point R4MDC architecture
Fig. 3.6 512-point FFT with R8MDC architecture
Fig. 3.7 Modified input stage and output stage of 512-point R8MDC architecture
Fig. 3.8 64-point FFT with radix-2 SDF architecture
Fig. 3.9 64-point FFT with R8SDF architecture
Fig. 3.10 64-point FFT with R2^3SDF architecture
Fig. 3.11 8-point FFT radix-2/4/8 SDF architecture
Fig. 3.12 Radix-8 memory-based (R8M) FFT architecture
Fig. 3.13 Markel's pruned 16-point FFT with a subset of nonzero input (L=2)
Fig. 3.14 Skinner's pruned 16-point FFT with a subset of nonzero input (L=2)
Fig. 3.15 Markel's pruned 16-point FFT with a subset of output points (L=2)
Fig. 3.16 Skinner's pruned 16-point FFT with a subset of output points (L=2)
Fig. 3.17 8-point DFT with butterfly function of each butterfly unit output point
Fig. 3.18 Example of 8-point DFT with multiple subsets of output points
Fig. 3.19 Example of 8-point DFT with multiple subsets of input and output points
Fig. 3.20 System specification for the partial FFT/IFFT processor
Fig. 3.21 Pipeline-based partial FFT/IFFT processor
Fig. 3.22 Partial FFT/IFFT processor in IFFT mode
Fig. 3.23 Fig. 3.24 Partial FFT/IFFT processor in FFT mode
Fig. 4.1 Decision feedback DFT-based channel estimation block diagram
Fig. 4.2 The proposed 1024-point FFT/IFFT processor architecture
Fig. 4.3 FFT/IFFT processing structure
Fig. 4.4 Radix-2/4/8 SDF processing element
Fig. 4.5 Radix-2/4/8 SDF with DIT algorithm
Fig. 4.6 Processing elements of radix-2/4/8 SDF with DIT algorithm
Fig. 4.7 Reorder buffer input and output timing flow graph
Fig. 4.8 Architecture of multiplication of -j
Fig. 4.9 Architecture of multiplication of W_8^1
Fig. 4.10 Architecture of multiplication by W_8^1 with CSA tree
Fig. 4.11 Delay optimized architecture of multiplication by W_8^1 with CSA tree
Fig. 4.12 Architecture of complex multiplication
Fig. 4.13 Modified architecture of complex multiplication
Fig. 4.14 System requirement for multi-input and multi-output in normal order
Fig. 4.15 Memory allocation of the FFT/IFFT input data
Fig. 4.16 Memories read write operations for different PE in stage 1
Fig. 4.17 Memories read write operations for different PE in stage 2
Fig. 4.18 Memories read write operations for different PE in stage 3
Fig. 4.19 State diagram of FFT/IFFT processor
Fig. 4.20 The FFT/IFFT processor in the DF DFT-based CE block diagram
Fig. 4.21 Modified processing elements with conjugate operation
Fig. 4.22 System required SQNR simulation model
Fig. 4.23 SQNR versus constant multiplier truncate bits
Fig. 4.24 SQNR versus word length of twiddle factor
Fig. 4.25 SQNR versus internal word length in IFFT mode
Fig. 4.26 SQNR versus internal word length in FFT mode
Fig. 4.27 Area comparisons for different versions of FFT processor
Fig. 4.28 Data latency comparisons for different versions of FFT processor
Fig. 4.29 Area comparison of separated twiddle factor ROM
Fig. 5.1 Cell based chip design flow
Fig. 5.2 Combinational logic circuits between 2 clock domains
Fig. 5.3 Default timing check in 2 clock domains
Fig. 5.4 Expected timing constraint for DFFB1 to DFFA2
Fig. 5.5 Synthesis flow of chip with frequency divider
Fig. 5.6 Floor plan of the 802.16e baseband receiver
Fig. 5.7 Rectangular version floor plan of the 802.16e baseband receiver

Chapter 1 Introduction

1.1 Background

The Fast Fourier Transform (FFT) has become increasingly important in digital signal processing applications, especially in communication systems. Orthogonal frequency division multiplexing (OFDM) technology [1] is used in most modern wired and wireless communication systems, such as ADSL, VDSL, 802.11a, DVB-T, 802.16-2004 [2], and 802.16e [3], and it needs an FFT processor to transform data between the time domain and the frequency domain. The FFT processor is a critical component in many OFDM-based communication systems because of its high hardware complexity. For this reason, many FFT processors have been designed specifically for OFDM-based communication systems so that the FFT can be implemented efficiently in the system.

As VLSI technology advances, improved modulation and channel estimation can be implemented at reasonable cost. OFDM is an improved modulation technique that provides a high data rate, immunity to delay spread, resistance to frequency-selective fading, and efficient bandwidth usage. In wireless communication, OFDM also reduces the inter-symbol interference (ISI) and inter-carrier interference (ICI) caused by multipath effects. In addition, Discrete Fourier Transform (DFT)-based channel estimation [4] with space-time block code (STBC) [5] has been proposed for channel estimation in OFDM wireless communication systems, and it is effective in high-mobility channel environments. In these applications the FFT plays an important role in determining system performance and hardware cost. A high-throughput FFT processor with low hardware cost is therefore an important module for making more advanced modulation and channel estimation algorithms practical to implement on chip. To design a high-throughput FFT and also speed up the operations before and after the FFT processor, this thesis introduces a parallel-in-parallel-out FFT and proposes, as a design example, a 1024-point FFT processor with parallel-in-parallel-out in normal order for DFT-based channel estimation in 802.16e.

1.2 Thesis Organization

This thesis proposes FFT/IFFT designs for robust channel estimation in a high-mobility STBC/OFDMA communication system. System simulation, architecture and circuit design, and implementation of the FFT/IFFT processor with an 802.16e baseband are carried out in this thesis.

Chapter 2 introduces IEEE 802.16e, DFT-based channel estimation, and the system block we use. Because the system block includes two kinds of FFT/IFFT processor, we also introduce the system requirements for each kind: one for OFDMA demodulation and the other for DFT-based channel estimation. The requirements on the FFT processor for 802.16e OFDMA demodulation do not differ from those of other OFDM communication systems, so Chapter 2 introduces the conventional FFT processor we use. A shared-memory concept is used between the FFT processor for OFDMA demodulation and channel estimation. The FFT/IFFT processor used in DFT-based channel estimation differs from the conventional FFT processor in two respects. One is parallel-in-parallel-out data access; the other is that the FFT processor has several zero-valued inputs or only several valid outputs, which is called a partial FFT processor. The thesis focuses on the hardware design of the FFT/IFFT processor for channel estimation with parallel-in-parallel-out in normal order, and then demonstrates the concept of partial FFT processor design.

Chapter 3 investigates the conventional FFT algorithm and various parallel-in-parallel-out FFT architectures. Conventional high-throughput FFT processors usually use a pipeline-based FFT architecture, which provides high throughput but also has high hardware cost. A memory-based FFT architecture has the advantage of low hardware cost, and it can also provide high throughput through parallel-in-parallel-out access with multi-partitioned memories. Chapter 3 also compares the different parallel-in-parallel-out FFT architectures, and the comparison results guide the FFT processor design in our system. At the end of Chapter 3, the concept of partial FFT processor design is introduced to address the other goal of the FFT processor for DFT-based channel estimation.

Chapter 4 proposes the architecture of the FFT processor with parallel-in-parallel-out in normal order, including a novel memory allocation method for parallel-in-parallel-out in normal order. The designs of the processing elements, memory allocation, commutator, scale-down block, and coefficient ROM table for the proposed FFT processor are introduced and are considered the key contribution of this thesis. At the end of Chapter 4, the hardware implementation result is compared with other FFT processors with parallel-in-parallel-out in normal order.

Chapter 5 introduces the backend design flow for the 802.16e receiver chip. In order to tape out the chip, two versions of the chip implementation results are presented, one for the UMC shuttle and the other for CIC. The chip floor plan and design flow are also presented in Chapter 5. Finally, the conclusion and future work are presented in Chapter 6.

Chapter 2 FFT Application in OFDM Communication System

2.1 Concept of OFDM

Orthogonal Frequency Division Multiplexing (OFDM) is based on frequency division multiplexing (FDM). FDM translates several message signals to different spectral locations. An example of bandwidth allocation in FDM is shown in Fig. 2.1.

Fig. 2.1 Bandwidth allocation for sub-channels in FDM system (sub-channels SC.1 to SC.8 along the frequency axis)

FDM keeps the sub-channels from overlapping by inserting guard bands, which protect against inter-channel interference (ICI) from adjacent sub-channels. However, the guard bands carry no message signal and therefore reduce bandwidth efficiency, which is important in a communication system. OFDM instead uses orthogonal sub-carriers so that the sub-channels can overlap, carrying more message signals in the same bandwidth than FDM, as shown in Fig. 2.2.
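The orthogonality that allows OFDM sub-channels to overlap can be checked numerically. The following is a minimal sketch, not taken from the thesis: the size N = 8, the helper names `subcarrier`, `inner`, and `dft`, and the QPSK test pattern are illustrative assumptions, and a direct O(N^2) DFT stands in for the FFT. It shows that distinct sub-carriers have zero inner product over one symbol, and that data placed on the sub-carriers by an inverse DFT is recovered by a forward DFT.

```python
import cmath

N = 8  # illustrative number of sub-carriers (the thesis system uses 1024)

def subcarrier(k):
    """Samples of sub-carrier k over one OFDM symbol of N samples."""
    return [cmath.exp(2j * cmath.pi * k * n / N) for n in range(N)]

def inner(a, b):
    """Complex inner product <a, b>."""
    return sum(x * y.conjugate() for x, y in zip(a, b))

def dft(x, inverse=False):
    """Direct O(N^2) DFT or inverse DFT, used here in place of an FFT."""
    size = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / size)
               for n in range(size)) for k in range(size)]
    return [v / size for v in out] if inverse else out

# Distinct sub-carriers have zero inner product, so their spectra may overlap.
max_leak = max(abs(inner(subcarrier(k), subcarrier(m)))
               for k in range(N) for m in range(N) if k != m)

# QPSK data on each sub-carrier: inverse DFT at the transmitter,
# forward DFT at the receiver (ideal channel, no guard interval).
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j] * (N // 4)
time_signal = dft(qpsk, inverse=True)
recovered = dft(time_signal)
max_error = max(abs(r - q) for r, q in zip(recovered, qpsk))

print(f"largest cross-carrier inner product: {max_leak:.2e}")
print(f"largest round-trip error: {max_error:.2e}")
```

Both printed values should be at the level of floating-point rounding error, which is why no guard band is needed between adjacent OFDM sub-channels.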

Fig. 2.2 Bandwidth allocation for sub-channels in OFDM system

A basic block diagram of an OFDM system is shown in Fig. 2.3. The transmitter needs an IFFT module to modulate the message signal, called OFDM modulation, and the receiver needs an FFT module for OFDM demodulation. The FFT processor is therefore a key block in an OFDM transceiver system.

Fig. 2.3 Basic block diagram of an OFDM transceiver system (blocks shown: Signal Mapper, S/P, IFFT, P/S, Guard Interval Insertion, D/A, Up Converter, Channel, Down Converter, A/D, Guard Interval Removal, S/P, FFT, Equalizer, Signal DeMapper)

2.2 Introduction of IEEE 802.16e

IEEE 802.16 is a broadband wireless access (BWA) standard. The first IEEE 802.16 standard, approved in December 2001, is called IEEE 802.16-2001 [6]. It specifies transmission in 10-66 GHz with line-of-sight (LOS) capability only, for use in Wireless Metropolitan Area Networks (WiMAN), and it uses a single carrier (SC) physical (PHY) layer. IEEE 802.16a is an extension of IEEE 802.16-2001. It transmits in 2-11 GHz, supports both LOS and non-line-of-sight (NLOS) operation, and suffers less distortion from rain than IEEE 802.16-2001.

IEEE 802.16-2004 (also called IEEE 802.16d) is a fixed BWA standard that combines the IEEE 802.16-2001 and IEEE 802.16a standards. IEEE 802.16-2004 describes the media access control (MAC) layer and the PHY layer in 2-66 GHz in more detail. It supports multiple PHY specifications for operation in different frequency ranges: WiMAN-SC, WiMAN-OFDM, WiMAN-OFDMA, and WiMAN-SCa. For operating frequencies in 10-66 GHz, the WiMAN-SC PHY, based on a single carrier, is specified. For operating frequencies below 11 GHz, where IEEE 802.16-2004 transmits in NLOS conditions, three alternative PHY specifications are provided: WiMAN-OFDM (based on orthogonal frequency division multiplexing), WiMAN-OFDMA (based on orthogonal frequency division multiple access), and WiMAN-SCa (based on a single carrier).

IEEE 802.16e, a fixed and mobile BWA standard, is an enhancement of the IEEE 802.16-2004 standard. It fills the gap between very high data rate local area networks and very high mobility cellular systems. An extended PHY layer specification called scalable OFDMA (SOFDMA), based on WiMAN-OFDMA, provides different FFT sizes for OFDMA, such as 128, 512, 1024, and 2048 points. Table 2-1 is the summary of IEEE 802.16-2001, IEEE 802.16a, and IEEE 802.16e.

Table 2-1 Comparisons of IEEE 802.16 standards

Item | IEEE 802.16-2001 | IEEE 802.16a | IEEE 802.16e
Spectrum | 10-66 GHz | 2-11 GHz | 2-6 GHz
Channel Bandwidth | 20, 25, 28 MHz | 1.5 to 20 MHz | 1.5 to 20 MHz
Carrier | Single Carrier | OFDM/OFDMA | OFDM/SOFDMA
FFT Size | N/A | 256 (OFDM), 2048 (OFDMA) | 256 (OFDM), 128/512/1024/2048 (SOFDMA)
Modulation | QPSK, 16QAM, 64QAM | QPSK, 16QAM, 64QAM | QPSK, 16QAM, 64QAM
Bit Rate | 32-134 Mbps (28 MHz) | 75 Mbps (20 MHz) | 15 Mbps (5 MHz)
Channel Conditions | LOS | Non-LOS | Non-LOS
Typical Cell Radius | 2-5 km | 7-10 km, max 50 km | 2-5 km
Application | Fixed | Fixed and portable | Fixed and mobile

2.3 DFT-Based Channel Estimation

Channel estimation in a conventional OFDM system uses a simple one-tap equalizer, because the channel gain varies slowly between adjacent OFDM symbols. In a mobile wireless communication environment, however, such as the channel in IEEE 802.16e, the channel gain varies rapidly between adjacent OFDM symbols, so a one-tap equalizer is not well suited to the time-varying channel. The one-tap equalizer can be realized as a least square (LS) channel estimator, which has low hardware complexity but lower performance than the minimum-mean-square-error (MMSE) estimator. The MMSE estimator performs better, but its hardware complexity is too high. DFT-based channel estimation [7-9] combines the LS and MMSE estimators and reduces the hardware complexity of the MMSE estimator. A simple block diagram of DFT-based channel estimation is shown in Fig. 2.4. R(k) is the received data on sub-carrier k after OFDM demodulation, X(k) is the decision data, which is determined using the channel estimate from the latest OFDM symbol, and H(k) is the channel estimate used in the next OFDM symbol.

Fig. 2.4 Block diagram of DFT-based channel estimation

DFT-based channel estimation can provide a more accurate channel gain with lower hardware complexity than the original MMSE estimator. However, it needs both an IFFT block and an FFT block to implement the algorithm. A suitable FFT or IFFT processor design can therefore reduce the hardware cost of DFT-based channel estimation.

2.4 System Specification

For a mobile WMAN baseband transceiver based on the IEEE 802.16e standard, we proposed a baseband transceiver [10]. A simplified block diagram of the 2×1 multiple-input-single-output (MISO) IEEE 802.16e OFDM system is shown in Fig. 2.5. For chip implementation, we implement only the receiver part of Fig. 2.5. The key system specifications are listed in Table 2-2.

Fig. 2.5 Block diagram of baseband transceiver in IEEE 802.16e

Table 2-2 System specification of IEEE 802.16e transceiver system

Items | Specification
Bandwidth | 10 MHz
PHY Layer Specification | WiMAN-SOFDMA
FFT Size | 1024
Sample Rate | 11.2 MHz
Guard Interval | 1/8
Constellation | QPSK, 16QAM
OFDM Symbol Time | 102.9 us

The channel estimation block is a decision feedback (DF) DFT-based channel estimation [10], which combines channel estimation and data detection as shown in Fig. 2.6. The system requirement for channel estimation is introduced in the following sections.

Fig. 2.6 Block diagram of decision feedback DFT-based channel estimation (blocks shown: Preamble Match, IFFT_ch, Path Selection, Inverse Hessian Matrix Calculation, FFT_ch, Channel Estimator, Gradient Estimator, Search Direction Estimator Calculation, D, Channel Estimator Modification, STBC Decoder)

There are two kinds of FFT processor in the receiver. FFT_dem, located after the synchronization block, is called the OFDM demodulator. The FFT_ch and IFFT_ch blocks are required in the channel estimation block. The following sections introduce the system specifications of these two kinds of FFT processor.
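As a quick consistency check on Table 2-2, the OFDM symbol time can be derived from the FFT size, sample rate, and guard interval. The short sketch below is illustrative only; the variable names are assumptions, and the sub-carrier spacing is a derived quantity that the table does not list.

```python
FFT_SIZE = 1024            # points per OFDM symbol (Table 2-2)
SAMPLE_RATE_HZ = 11.2e6    # sample rate (Table 2-2)
GUARD_FRACTION = 1 / 8     # guard interval as a fraction of the useful symbol

# Useful part of the symbol: FFT_SIZE samples at the sample rate.
useful_time_us = FFT_SIZE / SAMPLE_RATE_HZ * 1e6
# Guard interval prepended to the useful part.
guard_time_us = useful_time_us * GUARD_FRACTION
# Total OFDM symbol time.
symbol_time_us = useful_time_us + guard_time_us
# Derived quantity, not listed in the table: spacing between sub-carriers.
subcarrier_spacing_khz = SAMPLE_RATE_HZ / FFT_SIZE / 1e3

print(f"useful symbol time: {useful_time_us:.2f} us")       # 91.43 us
print(f"guard interval:     {guard_time_us:.2f} us")        # 11.43 us
print(f"OFDM symbol time:   {symbol_time_us:.1f} us")       # 102.9 us
print(f"sub-carrier spacing: {subcarrier_spacing_khz} kHz") # 10.9375 kHz
```

The result, about 102.9 us, agrees with the OFDM symbol time listed in the table.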

2.4.1 Specification of FFT Processor on Demodulation Path

The FFT_dem processor in Fig. 2.5 receives data from the synchronization block and passes it to channel estimation and space-time decoding. The input data format of the FFT processor is the same as in other OFDM communication systems. However, the output side has to buffer 2 OFDM symbols, because we use a 2×1 MISO system with STBC coding and DF DFT-based channel estimation. For this reason, we design a conventional memory-based FFT processor [11] with 5 memory banks, shown in Fig. 2.7.

Fig. 2.7 FFT Processor with 5 shared memories

SYN_wr is the data from the synchronization block, and only one memory bank is written by the synchronization block in each OFDM symbol time. The bank that has been written is then used by the FFT processor to compute the FFT while, at the same time, the synchronization block writes data to another memory bank. After the two OFDM symbols in an STBC time slot have been processed by the FFT processor, the memories that store the FFT results of these two OFDM symbols are read by channel estimation, called CE_rd, over two OFDM symbol times.

The time chart of the 5 memory banks is shown in Fig. 2.8. At the first symbol, the preamble, the data from the synchronization block are written to MEM_1_0. At the second and third symbols, the data from the synchronization block are written to MEM_0_0 and MEM_1_1 while the data in MEM_1_0 are processed by the FFT processor and read by channel estimation. Furthermore, the memory operations for OFDM symbol index 12 are the same as for index 0, so the memory operations of the 5 memory banks repeat every 12 OFDM symbols.

Fig. 2.8 Time chart for the 5 memory banks (rows: Syn_wr to MEM_X, MEM to do FFT, CE1_rd from MEM_X, CE0_rd from MEM_X; columns: OFDM symbol index 0 (preamble) to 13; banks: MEM_0_0, MEM_0_1, MEM_0_2, MEM_1_0, MEM_1_1)

(25) 2.4.2 Specification of FFT Processor in Channel Estimation

The FFT_ch and IFFT_ch blocks in decision feedback DFT-based channel estimation (DF DFT-based CE) are shown in Fig. 2.9. Before introducing the system requirements, we briefly describe the DF DFT-based CE. It has two parts. The first is the initial channel gain, calculated from the preamble signals. Its operational blocks are the preamble match block, two IFFT_ch blocks, the path selection block, the inverse Hessian matrix calculation, two FFT_ch blocks, and the channel estimator block. The channel gain should be calculated within 2 OFDM symbol times. The second part is the channel gain tracking loop. Its operational blocks are the gradient estimator, two IFFT_ch blocks, the search direction estimator calculation, two FFT_ch blocks, the channel estimator modification block, and the channel estimator block. The tracking loop calculates the channel gain in 2 iterations. In the first iteration, called global tracking, the channel gain variance for the channel estimator modification block is determined by the pilot signals, since the pilot signals have higher SNR than the data signals. In the second iteration, called local tracking, the variance is determined by the data signals. Both parts can share the same IFFT_ch and FFT_ch blocks.

Fig. 2.9 FFT processor in decision feedback DFT-based channel estimation

(26) Since the channel estimation includes a tracking loop, the channel gain should be calculated within 2 OFDM symbol times, before the data buffers for channel estimation in Fig. 2.7 are updated; thus, data latency is an important issue when implementing the channel estimation block in hardware. For this purpose, a parallel-in-parallel-out (PIPO) FFT/IFFT processor is necessary, not only to increase the throughput rate of the FFT/IFFT processor itself but also to increase the throughput rate of the other blocks in the channel estimation block.

The DFT-based channel estimation places a special requirement on the FFT_ch and IFFT_ch blocks. Only a subset of the output data is required at the IFFT_ch output ports. Also, the input data of the FFT_ch block may contain several zero points, which need not be computed together with the nonzero points. An FFT processor designed for only a subset of input or output points is called a partial FFT [23]. This thesis introduces the idea of partial FFT processor design for DFT-based channel estimation.

In summary, the FFT processor design has two goals: one is an FFT processor with parallel-in-parallel-out in normal order, and the other is a partial FFT processor. The thesis focuses on the FFT processor design with parallel-in-parallel-out in normal order. The partial FFT processor design concept is introduced at the end of the next chapter.

(27) Chapter 3 FFT Algorithms and Architectures

3.1 Concept of FFT Algorithms

The Discrete Fourier Transform (DFT) is a key block in OFDM communication systems and is widely used in many applications; however, its computational complexity is so high that implementing the DFT directly is not feasible under a low-cost design goal. Fortunately, early contributors, particularly Cooley and Tukey in 1965 [12], exploited the redundancy of the DFT operations by iteratively decomposing the computation. The resulting radix-2 FFT algorithm reduces the computational complexity from O(N^2) to O(N log2 N). Based on Cooley and Tukey's FFT algorithm, various FFT algorithms were later developed, which provide flexible choices for implementation. According to the way the DFT is decomposed, there are two types of FFT algorithms: the decimation-in-time (DIT) decomposition, which decomposes the time domain input sequence into successively smaller subsequences, and the decimation-in-frequency (DIF) decomposition, which decomposes the frequency domain output sequence into smaller subsequences. The basic N-point DFT equation is defined as

\[ X(k) = \sum_{n=0}^{N-1} x(n) \cdot W_N^{k \cdot n} \tag{3.1} \]

where $W_N^{k \cdot n} = \exp(-j 2\pi nk / N)$ is the DFT coefficient. Since multiplying a complex number by this coefficient is equivalent to a vector rotation, the DFT coefficient is also called the twiddle factor.
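As a minimal illustration of Eq. (3.1), the sketch below evaluates the DFT literally, which costs N×N complex multiplications. The function names `twiddle` and `dft_direct` are chosen here for illustration only and do not come from the thesis.

```python
import cmath

def twiddle(N, m):
    """Twiddle factor W_N^m = exp(-j*2*pi*m/N)."""
    return cmath.exp(-2j * cmath.pi * m / N)

def dft_direct(x):
    """Evaluate Eq. (3.1) term by term: N*N complex multiplications."""
    N = len(x)
    return [sum(x[n] * twiddle(N, k * n) for n in range(N)) for k in range(N)]

# Small 4-point example
X = dft_direct([1, 2, 3, 4])
print([complex(round(v.real, 6), round(v.imag, 6)) for v in X])
```

The double loop hidden in the list comprehension is the O(N^2) cost that the FFT decompositions discussed next are meant to avoid.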

(28) The key feature of the FFT algorithm is that it divides a complete DFT operation into several smaller DFT operations; moreover, it exploits the symmetry properties of the twiddle factors. First, the radix-2 FFT algorithm uses the symmetry property $W_N^{k \cdot n + N/2} = -W_N^{k \cdot n}$, which halves the number of multiplications in Eq. (3.1), as shown in Eq. (3.2) for odd k, where $W_N^{k \cdot N/2} = -1$:

\[ X(k) = \sum_{n=0}^{N/2-1} x(n) \cdot W_N^{k \cdot n} + \sum_{n=0}^{N/2-1} x\left(n + \frac{N}{2}\right) \cdot W_N^{k \cdot (n + N/2)} = \sum_{n=0}^{N/2-1} \left[ x(n) - x\left(n + \frac{N}{2}\right) \right] \cdot W_N^{k \cdot n} \tag{3.2} \]

Another symmetry lies in a phase difference of ±90°: $W_N^{k \cdot n + N/4} = -j \cdot W_N^{k \cdot n}$. To multiply a complex number by −j, we just exchange the real part and the imaginary part and then negate the imaginary part. Therefore, we can reduce the computational complexity of Eq. (3.1) by using Eq. (3.3):

\[ A \times W_N^{k \cdot n} + B \times W_N^{k \cdot n + N/4} = (A - jB) \times W_N^{k \cdot n} \tag{3.3} \]

Finally, the symmetry of a phase difference of ±45° is also commonly used in FFT algorithms. Based on this symmetry, the equation can be reduced to

\[ A \times W_N^{k \cdot n} + B \times W_N^{k \cdot n + N/8} = \left( A + \frac{1}{\sqrt{2}} (1 - j) \times B \right) \times W_N^{k \cdot n} \]
\[ \frac{1}{\sqrt{2}} (1 - j) \times B = \frac{1}{\sqrt{2}} (1 - j) \times (c + jd) = \frac{1}{\sqrt{2}} \left( (c + d) + (d - c) j \right) \tag{3.4} \]

The multiplication by $1/\sqrt{2}$ can be realized by a constant multiplier, which may be implemented with shifters and adders, as will be demonstrated in Chapter 4. By exploiting these symmetries of the twiddle factors, the computational complexity of the DFT operation can be reduced to a fraction of the original. We take the radix-2 DIF FFT algorithm as an example in the following subsection.

3.1.1 Radix-2 DIF FFT Algorithm

The DIF FFT algorithm decomposes the frequency domain output sequence

(29) into smaller subsequences. Here we take the radix-2 DIF FFT algorithm as an example. It divides the frequency domain sequence into even and odd parts and uses the twiddle factor symmetry of Eq. (3.2), as shown below.

\[ X(k_1 + 2k_2) = \sum_{n_2=0}^{N/2-1} \sum_{n_1=0}^{1} x\left(\frac{N}{2} n_1 + n_2\right) \cdot W_N^{(k_1 + 2k_2) \cdot (\frac{N}{2} n_1 + n_2)} \]
\[ = \sum_{n_2=0}^{N/2-1} \left\{ \left[ \sum_{n_1=0}^{1} x\left(\frac{N}{2} n_1 + n_2\right) \cdot W_2^{k_1 \cdot n_1} \right] \cdot W_N^{k_1 \cdot n_2} \right\} \cdot W_{N/2}^{k_2 \cdot n_2}, \quad \begin{cases} n_1 = 0, 1 \\ n_2 = 0, 1, \ldots, (N/2) - 1 \\ k_1 = 0, 1 \\ k_2 = 0, 1, \ldots, (N/2) - 1 \end{cases} \tag{3.5} \]

In Eq. (3.5), the inner sum over $n_1$ is a 2-point DFT, $W_N^{k_1 \cdot n_2}$ is the twiddle factor, and the outer sum over $n_2$ is an N/2-point DFT.

The DFT operation can thus be divided into 2 stages: a 2-point DFT stage and an N/2-point DFT stage, as shown below.

Fig. 3.1 Radix-2 DIF FFT algorithm architecture

After the first decomposition, the N-point DFT operation is divided into N/2 2-point DFT operations and two N/2-point DFT operations. It is well known that the 2-point DFT can be realized as a radix-2 butterfly (BF) module, shown in Fig. 3.2.
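The decomposition in Eq. (3.5) can be sketched as a short recursive routine. This is only an illustrative software model, assuming N is a power of 2; the names `fft_dif_radix2` and `dft_direct` are hypothetical, and the routine places the results back in normal order rather than modelling the bit-reversed output of a hardware signal flow graph.

```python
import cmath

def dft_direct(x):
    """Reference N-point DFT, evaluated directly from the definition."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_dif_radix2(x):
    """Radix-2 decimation-in-frequency FFT (N must be a power of 2)."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # 2-point DFT (butterfly) on x[n2] and x[n2 + N/2]
    upper = [x[n] + x[n + half] for n in range(half)]            # k1 = 0 branch
    lower = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]                               # k1 = 1 branch, twiddle W_N^{n2}
    out = [0j] * N
    out[0::2] = fft_dif_radix2(upper)   # even outputs X(2*k2)
    out[1::2] = fft_dif_radix2(lower)   # odd outputs X(2*k2 + 1)
    return out

x = [complex(n, -n % 3) for n in range(8)]
fast = fft_dif_radix2(x)
ref = dft_direct(x)
max_err = max(abs(a - b) for a, b in zip(fast, ref))
```

Each recursion level performs N/2 butterflies, so log2 N levels give the O(N log2 N) cost mentioned in Section 3.1.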

(30) Fig. 3.2 Radix-2 butterfly module

Similar to the first decomposition, we can further decompose the N/2-point DFTs into even smaller DFTs until all DFTs are decomposed into 2-point DFTs.

3.2 Concept of FFT Architectures

FFT processor architectures can be broadly divided into two types: pipeline-based FFT architectures [13-15] and memory-based FFT architectures [16-17]. A pipeline-based FFT architecture has the advantages of high throughput rate and low data latency, but the disadvantage of high hardware cost; in contrast, a memory-based FFT architecture has low hardware cost but high data latency. For both types, raising the working clock rate is the simplest way to meet a throughput constraint; however, it also increases the hardware cost and power consumption of the FFT processor. In this chapter, we discuss different architectures for high-throughput FFT processors with multi-input and multi-output in normal order.

3.2.1 Pipeline-Based FFT Architecture

Pipeline-based FFT architectures are the most popular architectures in many applications because they are designed for high-speed performance and sequential data input; however, in order to produce the output data in normal order, they usually need a

(31) reorder buffer in the output stage, which requires a very high hardware cost. The best way to obtain a pipeline-based architecture is through vertical projection of the signal flow graph (SFG). Fig. 3.3 shows an example of the vertical projection mapping of an 8-point radix-2 DIF FFT.

Fig. 3.3 Vertical projection mapping of 8-point radix-2 DIF FFT

Each stage obtained by vertical projection is called a processing element (PE), which contains a delay buffer (Buffer), a radix-2 butterfly unit (Radix-2 BF), and a complex multiplier. The delay buffer reorders the input data for the butterfly unit of each stage. There are two types of delay buffer: delay-feedback (DF) and delay-commutator (DC). According to this structural difference, pipeline-based FFT architectures can be divided into three types: single-path delay feedback (SDF) architecture, single-path delay commutator (SDC)

(32) architecture, and multi-path delay commutator (MDC) architecture. Since the SDC architecture provides only a single-path output data stream, like the SDF architecture, while its hardware cost lies between those of the SDF and MDC architectures, it can neither provide parallel data streams nor achieve the lowest hardware cost. We therefore focus only on the SDF and MDC architectures. In the following subsections, we introduce different radix-r SDF and MDC pipeline-based FFT architectures, where r is the radix of the decimation-in-time (DIT) or decimation-in-frequency (DIF) algorithm.

3.2.1.1 Radix-r Multi-Path Delay Commutator Architecture

The radix-r MDC architecture [18-19] uses commutators to break the input data into r parallel data streams that flow forward and, through proper delays, enter the butterfly units in the correct order. Two examples of the MDC architecture are introduced in the following discussion.

(1). Radix-4 Multi-Path Delay Commutator (R4MDC) Architecture

Fig. 3.4 shows a 64-point FFT with the radix-4 multi-path delay commutator (R4MDC) architecture. The elements of the R4MDC architecture are commutators, shift registers, and radix-4 butterfly units. The butterfly unit is also called an arithmetic element (AE). At the beginning, the first 16 points of input data are delayed on the first input line of AE1, the next 16 points are delayed on the second line, and the next 16 points are delayed on the third line. When the 49th point of input data arrives on the fourth line, AE1 computes the first butterfly. With proper delays and commutation between the AEs, the input data of each AE arrive in the correct order for computing a radix-4 butterfly. Finally, the output data of AE3 are in 2-bit-reversed order with respect to the input data order.
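The output ordering can be made concrete with a small sketch. It assumes that "2-bit-reversed order" means the 6-bit index of the 64-point FFT is treated as three base-4 digits (2 bits each) whose order is reversed; the helper name `digit_reverse` is hypothetical.

```python
def digit_reverse(index, radix, num_digits):
    """Reverse the order of the base-`radix` digits of `index`."""
    rev = 0
    for _ in range(num_digits):
        rev = rev * radix + index % radix   # append the lowest remaining digit
        index //= radix
    return rev

# 64-point radix-4 FFT: three base-4 digits per index
order = [digit_reverse(i, 4, 3) for i in range(64)]
print(order[:8])
```

Because the mapping is its own inverse, applying it a second time to the output sequence restores normal order, which is the job of the reorder stage discussed below.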

(33) Fig. 3.4 64-point FFT with R4MDC architecture

In order to adapt the 64-point R4MDC FFT to multi-input and multi-output in normal order, we have to replace the first stage and add a reorder stage at the output, as shown in Fig. 3.5.

Fig. 3.5 Modified input stage and output stage of 64-point R4MDC architecture

For multi-input in normal order, the commutator of the first stage has to be changed from one input to four inputs. In addition, in order to write four data per cycle into one of the input shift registers of AE1, the shift registers have to be changed into random access registers. Thus, the shift registers on the first to third input lines of AE1 are changed into N/4 random access registers each. Furthermore, the fourth line has to add 4×4 random access registers to buffer the input data, because it receives four input data per cycle but outputs only one. For multi-output in normal order, a reorder stage has to be added at the output to convert the output data from bit-reversed order to normal order. This reordering uses a delay commutator in a similar way.

(34) For N-point FFT computation, R4MDC needs 11N/4 − 4 registers, 3·(log4N − 1) complex multipliers, and 8·log4N complex adders. The latency is 11N/16 − 5 cycles.

(2). Radix-8 Multi-Path Delay Commutator (R8MDC) Architecture

The radix-8 multi-path delay commutator (R8MDC) architecture is similar to the R4MDC architecture. It uses the radix-8 algorithm with the MDC architecture, and with 8 parallel data streams it provides a higher throughput rate than R4MDC. However, it also needs more delay buffers and arithmetic elements. The 512-point FFT with R8MDC architecture is shown in Fig. 3.6.

Fig. 3.6 512-point FFT with R8MDC architecture

For multi-input and multi-output in normal order, the first stage again has to be revised and a reorder stage added at the output, as shown in Fig. 3.7. The input stage and output stage are similar to those of the R4MDC architecture with multi-input and multi-output in normal order. As a result, the reorder stage needs more delay buffers when a higher radix is used in the MDC architecture.

For N-point FFT computation, R8MDC needs 23N/8 − 8 registers, 7×(log8N − 1) complex multipliers, and (24+2T)×log8N complex adders, where the parameter T indicates the number of adders required to implement the multiplications by constant values. The latency is 23N/64 − 9 cycles.
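To get a feel for these cost expressions, the sketch below simply evaluates them for the two sizes shown in the figures (64 points for R4MDC and 512 points for R8MDC). The function names are hypothetical, and the value T = 2 is only a placeholder, since the number of constant-multiplier adders is not given here.

```python
def exact_log(n, base):
    """Return e with base**e == n; raise if n is not a power of base."""
    e = 0
    while n > 1:
        if n % base:
            raise ValueError(f"{n} is not a power of {base}")
        n //= base
        e += 1
    return e

def r4mdc_cost(N):
    """R4MDC cost expressions as quoted in the text."""
    stages = exact_log(N, 4)
    return {"registers": 11 * N // 4 - 4,
            "complex_multipliers": 3 * (stages - 1),
            "complex_adders": 8 * stages,
            "latency_cycles": 11 * N // 16 - 5}

def r8mdc_cost(N, T):
    """R8MDC cost expressions; T = adders per constant multiplication (assumed)."""
    stages = exact_log(N, 8)
    return {"registers": 23 * N // 8 - 8,
            "complex_multipliers": 7 * (stages - 1),
            "complex_adders": (24 + 2 * T) * stages,
            "latency_cycles": 23 * N // 64 - 9}

r4 = r4mdc_cost(64)
r8 = r8mdc_cost(512, T=2)   # T = 2 is a placeholder value
print(r4)
print(r8)
```

Note that `exact_log` rejects sizes that are not a power of the radix, for example N = 1024 with radix 8.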

(35) Fig. 3.7 Modified input stage and output stage of 512-point R8MDC architecture

3.2.1.2 Radix-r Single-Path Delay Feedback Architecture

Unlike the multi-path delay commutator (MDC) architecture, the single-path delay feedback (SDF) architecture combines the commutator and the radix-r butterfly unit, and uses delay feedback to reuse the delay buffer of each stage for reordering the input data of the butterfly unit. The SDF architecture needs less hardware than the MDC architecture, but its data latency is longer. Moreover, because the SDF architecture has only one path between butterfly units, its throughput rate does not increase even when a higher-radix FFT algorithm is used. For input and output data in normal order, it needs a reorder buffer at the output stage, and the buffer size is about N/2 for an N-point DFT. We consider three cases of the SDF architecture in the following discussion.

(1). Radix-2 Single-Path Delay Feedback (R2SDF) Architecture

The radix-2 single-path delay feedback (R2SDF) architecture merges the commutator and the radix-2 butterfly unit of the radix-2 MDC architecture into a single radix-2 butterfly unit, as shown in Fig. 3.8. Instead of sending 2 parallel data streams from the radix-2 butterfly unit to the next stage, R2SDF has only one output to the next stage,

(36) and the other output is fed back and stored in the delay buffer; therefore, it is called the single-path delay feedback architecture. For N-point FFT computation, R2SDF needs N−1 registers, (log2N − 1) complex multipliers, and 2×log2N complex adders. The latency is N−1 cycles without the reorder buffer.

Fig. 3.8 64-point FFT with radix-2 SDF architecture

(2). Radix-8 Single-Path Delay Feedback (R8SDF) Architecture

The block diagram of the 64-point radix-8 single-path delay feedback (R8SDF) architecture is shown in Fig. 3.9. It needs fewer multipliers than the R2SDF architecture: for a 64-point FFT, R8SDF saves 80% of the complex multipliers. However, it also needs more register banks to store the data for the BF unit, which may increase power consumption. For N-point FFT computation, R8SDF needs N−1 registers, (log8N − 1) complex multipliers, and (24+2T)×log8N complex adders. The latency is N−1 cycles without the reorder buffer.
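The 80% figure follows directly from the two multiplier counts quoted above, evaluated at N = 64:

\[ \text{R2SDF: } \log_2 64 - 1 = 5, \qquad \text{R8SDF: } \log_8 64 - 1 = 1, \qquad \frac{5 - 1}{5} = 80\% \]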

(37) Fig. 3.9 64-point FFT with R8SDF architecture

(3). Radix-2/4/8 Single-Path Delay Feedback (R23SDF) Architecture

The radix-2/4/8 single-path delay feedback (R23SDF) architecture, shown in Fig. 3.10, is based on the R2SDF architecture with the radix-8 FFT algorithm. It replaces the radix-2 butterfly unit with the radix-8 FFT processing element shown in Fig. 3.11. It requires the same number of complex multipliers as the R8SDF architecture but fewer complex adders; moreover, its registers are split into fewer partitions than in the R8SDF architecture, which may reduce power consumption.

Fig. 3.10 64-point FFT with R23SDF architecture

(38) For N-point FFT computation, R23SDF needs N−1 registers, (log8N − 1) complex multipliers, and (6+2T)·log8N complex adders. The latency is N−1 cycles without the reorder buffer.

Fig. 3.11 8-point FFT radix-2/4/8 SDF architecture (three processing elements PE1, PE2, and PE3 with feedback delays of 4, 2, and 1)

3.2.2 Memory-Based FFT Architecture

A memory-based FFT architecture, unlike a pipeline-based one, has only a few arithmetic elements (AE), which are also called processing elements (PE). It has two advantages. One is that the hardware area of the processing elements stays the same even when the DFT size N is very large. The other is that the total number of memory banks is smaller than in a pipeline-based FFT architecture, because it uses only a few PEs and needs fewer read or write

(39) operations at the same time. Fig. 3.12 shows a radix-8 memory-based FFT architecture, which has only one radix-8 butterfly unit and 8 memory banks.

Fig. 3.12 Radix-8 memory-based (R8M) FFT architecture

For multi-input in normal order, the different input data arriving in one cycle should be written to different memory banks, but this requirement conflicts with the data access pattern of the radix-r FFT algorithm in a memory-based architecture. As in the MDC architecture, a radix-r memory-based FFT architecture can add a reorder stage at the input so that parallel data are written to different memory banks. Likewise, for multi-output in normal order, it needs a reorder stage at the output. Another choice for a memory-based FFT architecture with multi-input and multi-output in normal order is to rearrange the data in memory, at the price of higher control complexity. The next chapter presents the proposed FFT processor architecture, which is based on this concept.

(40) For N-point FFT computation, R8M needs 7N/8 + 56 registers and N words of memory in 8 memory banks, 7 complex multipliers, and (24+2T) complex adders. The latency is 15N/64 − 8 + (N/8)·log8N cycles.

3.3 Comparison of Different FFT Architecture

Table 3-1 Comparison of different FFT architecture

                      R2SDF      R8SDF            R23SDF          R4MDC          R8MDC            R8M
Complex Multipliers   log2N−1    log8N−1          log8N−1         3·(log4N−1)    7·(log8N−1)      7
Complex Adders        2·log2N    (24+2T)·log8N    (6+2T)·log8N    8·log4N        (24+2T)·log8N    24+2T
Memory Size           N−1        N−1              N−1             7N/4+12        15N/8+56         N
Reorder Buffer Size   N/2        7N/8             N/2             N−16           N−64             N−64
Data Latency          3N/2−1     15N/8−1          3N/2−1          11N/16−5       23N/64−9         15N/64−8+(N/8)·log8N
Throughput Rate       R          R                R               4R             8R               8R

Table 3-1 compares the different FFT architectures with multi-input and multi-output in normal order, where N is the FFT size and R is the internal clock rate of the FFT processor. Because of the FFT algorithms, all of these architectures need a reorder buffer at the input stage or the output stage, and the hardware cost of the reorder buffer is so high that the conventional FFT architectures cannot produce the output sequence in normal order efficiently. For this reason, we have to develop an FFT processor that provides a high throughput rate with multi-input and multi-output in normal order at low hardware cost. With low hardware cost as the goal, the radix-8 memory-based FFT architecture has the lowest hardware cost for a high throughput rate at the same working clock frequency. However, it also needs a very

(41) large reorder buffer. Therefore, the main issue for an FFT architecture with multi-input and multi-output in normal order is to reduce the reorder buffer. The proposed FFT architecture provides a high throughput rate with multi-input and multi-output in normal order and does not need any reorder buffer. It is introduced in the next chapter.

3.4 Partial FFT Design

3.4.1 Concept of Partial FFT

Partial FFT design studies the redundancies that arise in the standard FFT algorithm when the number of input or output points is reduced. In most applications, the numbers of input and output points of the DFT operation are equal, but there are some applications, such as DFT-based channel estimation, where they differ. Hence, many studies of partial FFT design have been presented to remove the redundant operations of the FFT algorithm. This thesis introduces partial FFT design from two points of view in the following subsections: one considers the case where only a single subset of the input or output points of the DFT operation is computed, and the other considers the case where multiple subsets of the input or output points are computed. Finally, we propose a partial FFT design suitable for DFT-based channel estimation, which combines the reduction methods for a single subset and for multiple subsets of input or output points.

3.4.2 DFT with only a Subset of Input or Output Points

There are two situations in which we have to design a partial FFT with only a subset of input or output points: one is that only a narrow band of the spectrum is of interest but the resolution within the band has to be very high; the other is that a very high resolution

(42) spectrum is obtained by padding the input sequence with a large number of zeros. A regular FFT is usually used to compute the results, but it is very inefficient if the number of nonzero inputs or the number of required outputs is small compared with the DFT length. The pruning algorithms [25][26] and transform decomposition [27] were presented for efficient DFT computation with only a subset of input or output points. Because the transform decomposition method is not suitable for our application, we only introduce the pruning algorithms in the following.

The pruning algorithm was first developed by Markel [25] for computing only a subset of input or output points. An example of Markel's pruned 16-point FFT with a subset of nonzero inputs is shown in Fig. 3.13; Markel's pruning algorithm is based on the radix-2 DIF FFT algorithm. We focus on the case where the nonzero input points are the first L points of the input sequence, because this case is similar to that of the FFT processor in DFT-based channel estimation. As Fig. 3.13 shows, the algorithm saves N×log2(N/L) complex additions and (N/2)·log2(N/L) − N + L complex multiplications compared with the original FFT algorithm, where L is a power of 2.

Fig. 3.13 Markel's pruned 16-point FFT with a subset of nonzero input (L=2)

Skinner developed a more efficient pruning algorithm [26] than that of Markel

(43) as shown in Fig. 3.14. However, Skinner's algorithm applies only when L is a power of 2. It is achieved by pruning a decimation-in-time algorithm instead of the decimation-in-frequency algorithm on which Markel's algorithm is based. In Skinner's pruning algorithm, the first log2(N/L) stages contain no complex additions and no complex multiplications, which means that it saves N×log2(N/L) complex additions and (N/2)·log2(N/L) complex multiplications. Therefore, Skinner's algorithm with a subset of nonzero inputs saves N−L more complex multiplications than Markel's algorithm when L is a power of 2.

Fig. 3.14 Skinner's pruned 16-point FFT with a subset of nonzero input (L=2)

Pruning algorithms for an FFT with a subset of output points were also presented by Markel and Skinner, as shown in Fig. 3.15 and Fig. 3.16. In this case, Markel's pruning algorithm is based on the decimation-in-time algorithm, while Skinner's is based on the decimation-in-frequency algorithm. Markel's algorithm saves N·log2(N/L) − N + L complex additions and (N/2)·log2(N/L) − N + L complex multiplications, and Skinner's algorithm saves N×log2(N/L) complex additions and (N/2)·log2(N/L) complex multiplications.
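The savings quoted for the input-pruning case can be tabulated with a few lines of code. This sketch only evaluates the expressions given in the text; the function name `input_pruning_savings` is hypothetical, and both N and L are assumed to be powers of 2.

```python
def input_pruning_savings(N, L):
    """Operations saved by pruning when only the first L of N inputs are nonzero."""
    ratio = N // L
    if N % L or ratio & (ratio - 1):
        raise ValueError("N/L must be a power of 2")
    pruned_stages = ratio.bit_length() - 1          # log2(N/L)
    return {
        "additions_saved": N * pruned_stages,                         # Markel and Skinner
        "markel_multiplications_saved": (N // 2) * pruned_stages - N + L,
        "skinner_multiplications_saved": (N // 2) * pruned_stages,
    }

small = input_pruning_savings(16, 2)       # the 16-point example with L = 2
large = input_pruning_savings(1024, 128)   # 1024-point FFT with 128 nonzero inputs
print(small)
print(large)
```

In both cases the difference between the two multiplication counts equals N − L, which matches the comparison between Skinner's and Markel's algorithms stated above.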

(44) Fig. 3.15 Markel's pruned 16-point FFT with a subset of output points (L=2)

Fig. 3.16 Skinner's pruned 16-point FFT with a subset of output points (L=2)

3.4.3 DFT with Multiple Subsets of Output Points

Conventional partial FFT algorithms can extract only one subset of the spectrum. An efficient partial FFT algorithm for the DFT with multiple subsets of output points has been presented in [28]; it focuses on the control of a DFT with multiple subsets of output points. An example of an 8-point DFT based on the concept of [28] is shown in Fig. 3.17.

(45) (a) Signal flow graph of 8-point DFT. (b) Butterfly function for each butterfly output

Fig. 3.17 8-point DFT with butterfly function of each butterfly unit output point

The signal flow graph of the 8-point DFT is shown in Fig. 3.17(a), and the butterfly function for each butterfly output is shown in Fig. 3.17(b). In order to remove the redundant operations of the butterfly units, we have to decide which butterflies need to be computed and which operations the needed butterflies perform. An example for an 8-point DFT with multiple subsets of output points is shown in Table 3-2.

(46) Table 3-2 Control counter and function of FFT with partial output points

Stage     Butterfly Counter   Butterfly Function
Stage 1   b1b0                Q0 = {0,1} → Normal;  Q0 = 0 → Addition;  Q0 = 1 → Subtraction
Stage 2   Q0b0                Q1 = {0,1} → Normal;  Q1 = 0 → Addition;  Q1 = 1 → Subtraction
Stage 3   Q0Q1                Q2 = {0,1} → Normal;  Q2 = 0 → Addition;  Q2 = 1 → Subtraction

If the system is interested in multiple output subcarriers whose indices are [G2 G1 G0], [H2 H1 H0], [I2 I1 I0], …, the Qn in Table 3-2 can be defined as Qn = {Gn∪Hn∪In∪…}; the possible results for Qn are then {0,1}, {0}, and {1}. The needed butterflies and their operations are defined by the butterfly counter and the butterfly function in Table 3-2. In addition, b1b0 is the original butterfly counter, counting from 0 to 3. Clearly, all butterflies have to be computed in stage 1 (the stages are defined in Fig. 3.17) for any set of output points, but if Q0 equals 0 or 1, all butterflies compute only the addition or only the subtraction butterfly function, respectively. In stage 2, a butterfly has to be computed only if its butterfly counter equals Q0b0, and its operation is decided by Q1. Similarly, in stage 3, a butterfly has to be computed only if its butterfly counter equals Q0Q1, and its operation is decided by Q2.

An example of a DFT with multiple output points is shown in Fig. 3.18. The expected signal flow graph is shown in the upper part, and the active operations and butterflies are shown in the lower part. In stage 1 and stage 2, the active operations and butterflies match the expected signal flow graph, but in stage 3 the addition operation of butterfly counter 3 is redundant, because the butterfly function control is shared by all butterflies. Although there are still redundant operations in this

(47) algorithm, it provides an efficient way to simplify the control of the partial FFT.

Fig. 3.18 Example of 8-point DFT with multiple subsets of output points (Q2 = {0,1}, Q1 = {0,1}, Q0 = {1}; the upper part shows the expected signal flow graph and the lower part lists, for stages 1 to 3, the active addition and subtraction operations of butterfly counters 0 to 3)

3.4.4 DFT with Multiple Subsets of Input and Output Points

Based on the algorithm presented in Section 3.4.3, we extend the algorithm from multiple subsets of output points only to multiple subsets of both input and output points. The modified butterfly counter and operations are shown in Table 3-3. Qn represents the indices of the multiple output subcarriers, as mentioned in Section 3.4.3, and Pn represents the indices of the multiple nonzero input points. A new butterfly function, bypassing the input values, is added because several input points of the DFT operation are zero. An example of a DFT operation with multiple nonzero input and output points is shown in Fig. 3.19.
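The control sets Qn of Section 3.4.3 can be derived from the requested output indices with a short sketch. The function names and the example request {1, 3, 7} are hypothetical; the example was chosen only because it yields the same sets as the caption of Fig. 3.18, and bit 0 is taken as the least significant bit of the index, following the [G2 G1 G0] notation.

```python
def control_sets(indices, num_bits):
    """Q_n = set of values taken by bit n over all requested indices."""
    return [{(i >> n) & 1 for i in indices} for n in range(num_bits)]

def output_butterfly_function(q):
    """Butterfly function selected by one control set (as in Table 3-2)."""
    if q == {0, 1}:
        return "normal"
    if q == {0}:
        return "addition"
    if q == {1}:
        return "subtraction"
    raise ValueError("control set must be {0}, {1} or {0, 1}")

# Hypothetical request: outputs X[1], X[3] and X[7] of an 8-point DFT
Q = control_sets([1, 3, 7], num_bits=3)
functions = [output_butterfly_function(q) for q in Q]   # stage 1, 2, 3 use Q0, Q1, Q2
print(Q, functions)
```

With this request, stage 1 performs subtraction only, while stages 2 and 3 run the normal butterfly, because both bit values occur in Q1 and Q2.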

(48) Table 3-3 Control counter and function of FFT with partial input and output points

Stage 1 (Butterfly Counter: P1P0)
  P2 = {0,1}:  Q0 = {0,1} → Normal;  Q0 = 0 → Addition;  Q0 = 1 → Subtraction
  P2 = 0:      Q0 = {0,1} → Bypass upper input to both upper and lower output;  Q0 = 0 → Bypass upper input to upper output;  Q0 = 1 → Bypass upper input to lower output
  P2 = 1:      Q0 = {0,1} → Bypass lower input to both upper and lower output;  Q0 = 0 → Bypass lower input to upper output;  Q0 = 1 → Bypass lower input to lower output

Stage 2 (Butterfly Counter: Q0P0)
  P1 = {0,1}:  Q1 = {0,1} → Normal;  Q1 = 0 → Addition;  Q1 = 1 → Subtraction
  P1 = 0:      Q1 = {0,1} → Bypass upper input to both upper and lower output;  Q1 = 0 → Bypass upper input to upper output;  Q1 = 1 → Bypass upper input to lower output
  P1 = 1:      Q1 = {0,1} → Bypass lower input to both upper and lower output;  Q1 = 0 → Bypass lower input to upper output;  Q1 = 1 → Bypass lower input to lower output

Stage 3 (Butterfly Counter: Q0Q1)
  P0 = {0,1}:  Q2 = {0,1} → Normal;  Q2 = 0 → Addition;  Q2 = 1 → Subtraction
  P0 = 0:      Q2 = {0,1} → Bypass upper input to both upper and lower output;  Q2 = 0 → Bypass upper input to upper output;  Q2 = 1 → Bypass upper input to lower output
  P0 = 1:      Q2 = {0,1} → Bypass lower input to both upper and lower output;  Q2 = 0 → Bypass lower input to upper output;  Q2 = 1 → Bypass lower input to lower output

(49) Fig. 3.19 Example of 8-point DFT with multiple subsets of input and output points (P2 = {0}, P1 = {0,1}, P0 = {1}; Q2 = {0,1}, Q1 = {0,1}, Q0 = {1}; the upper part shows the expected signal flow graph and the lower part lists, for stages 1 to 3, the active addition, subtraction, and bypass operations of butterfly counters 0 to 3)

Table 3-3 also shows how the butterfly counter depends on the parameters Qn and Pn in each stage. In the first stage, the butterflies that need to be computed depend only on the set of valid input point indices, which is the parameter Pn. In the second stage, the butterfly counter depends on half of the set of valid input point indices and half of the set of valid output point indices. In the final stage, the butterfly counter depends only on the set of valid output point indices, Qn. As a result, this algorithm removes more redundant operations than the one in Fig. 3.18 by exploiting the multiple zero input points, and it is helpful for designing the partial FFT in DFT-based channel estimation, as explained later.
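The selection rule of Table 3-3 can be written as a small lookup. This is only a sketch of the table's logic, not of the hardware control; the function name `butterfly_function` and the returned strings are chosen here for illustration.

```python
def butterfly_function(p, q):
    """Select the butterfly function from an input set p and an output set q.

    p: values of the relevant input-index bit over all nonzero inputs
    q: values of the relevant output-index bit over all requested outputs
    """
    valid = ({0}, {1}, {0, 1})
    if p not in valid or q not in valid:
        raise ValueError("p and q must each be {0}, {1} or {0, 1}")
    if p == {0, 1}:
        # Both butterfly inputs may be nonzero: ordinary arithmetic
        return {2: "normal", 0: "addition", 1: "subtraction"}[2 if q == {0, 1} else min(q)]
    source = "upper" if p == {0} else "lower"   # the only input that can be nonzero
    if q == {0, 1}:
        target = "both upper and lower output"
    else:
        target = "upper output" if q == {0} else "lower output"
    return f"bypass {source} input to {target}"

# Sets taken from the caption of Fig. 3.19
P = {2: {0}, 1: {0, 1}, 0: {1}}
Q = {2: {0, 1}, 1: {0, 1}, 0: {1}}
stage1 = butterfly_function(P[2], Q[0])
stage2 = butterfly_function(P[1], Q[1])
stage3 = butterfly_function(P[0], Q[2])
print(stage1, stage2, stage3, sep="\n")
```

The stage-to-parameter pairing (P2 with Q0, P1 with Q1, P0 with Q2) follows Table 3-3.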

(50) 3.4.5 Partial FFT Processor Design in DFT-Based Channel Estimation

There are two purposes for designing a partial FFT for DFT-based channel estimation. First, the partial FFT processor should compute IDFT operations with N input points and N×GI output points, and DFT operations with N×GI input points and N output points, as discussed in Section 2.3. Second, the partial FFT should remove the redundant operations caused by irregular input or output points. For example, in the IDFT, operations on the zero input points of the guard band and on output points that correspond to unusable multipath responses should be avoided. Hence, we design the partial FFT for these two purposes using the algorithms described in Section 3.4.3 and Section 3.4.4.

The specification of the partial FFT processor for the proposed DF DFT-based CE in the 802.16e baseband receiver is an FFT size of 1024 points and a guard interval of 1/8. Thus, we have to design a partial FFT/IFFT processor that performs an IDFT from 1024 input points to 128 output points and a DFT from 128 nonzero input points to 1024 output points, as shown in Fig. 3.20.

Fig. 3.20 System specification for the partial FFT/IFFT processor

For this purpose, the FFT and IFFT blocks in the DF DFT-based CE are well suited to a partial FFT/IFFT design with only a subset of input / only a subset of output

(51) points. A pipeline-based architecture for the partial FFT is presented in Fig. 3.21; it uses the concept of Section 3.4.3 and combines the IFFT and FFT in the same hardware.

Fig. 3.21 Pipeline-based partial FFT/IFFT processor

The active blocks of the partial FFT/IFFT processor in IFFT mode are shown in Fig. 3.22. Following the DIF algorithm, the 1024-point IFFT operation can be partitioned into eight 128-point IDFT operations whose output data are combined by a radix-8 butterfly. Since we only need to compute the first subset of output points, we replace the radix-8 butterfly unit with a single complex adder. As a result, we use only one 128-point FFT/IFFT processor to compute the 128-point IDFT operations and a 128-word memory to buffer the combined output data. Finally, the data are sent out from the 128-word buffer memory.
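The IFFT-mode decomposition can be checked numerically with a small sketch. The index mapping used here is an assumption chosen to be consistent with the description: sub-transform r takes inputs r, r+8, r+16, ..., and each of its outputs is multiplied by a twiddle factor before it is accumulated. The thesis does not state where that multiplication takes place, so treat it as illustrative. The names `idft` and `partial_idft_first_block` are hypothetical, and the 1/N scaling is omitted.

```python
import cmath
import random


def idft(x):
    """Naive unnormalised inverse DFT (no 1/N scaling), O(N^2)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * k * i / n) for k in range(n))
            for i in range(n)]


def partial_idft_first_block(x, parts=8):
    """First len(x)//parts outputs of the IDFT of x, built from `parts`
    short IDFTs whose twiddled outputs are summed in an accumulator."""
    n = len(x)
    m = n // parts
    acc = [0j] * m                       # models the output buffer memory
    for r in range(parts):
        sub = idft(x[r::parts])          # m-point IDFT of inputs r, r+parts, ...
        for i in range(m):
            twiddle = cmath.exp(2j * cmath.pi * r * i / n)
            acc[i] += twiddle * sub[i]   # accumulation replaces the radix-8 butterfly
    return acc


# Small-size check (64 points, 8 outputs); the thesis case is 1024 and 128.
random.seed(0)
x = [complex(random.random(), random.random()) for _ in range(64)]
reference = idft(x)[:8]
result = partial_idft_first_block(x, parts=8)
print(max(abs(a - b) for a, b in zip(reference, result)) < 1e-9)  # True
```

The accumulator only ever holds one block of outputs, which is why a memory of 128 words suffices in the 1024-point case.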

(52) Fig. 3.22 Partial FFT/IFFT processor in IFFT mode

The active blocks of the partial FFT/IFFT processor in FFT mode are shown in Fig. 3.23. Following the DIT algorithm, the 1024-point FFT operation can be partitioned into eight 128-point DFT operations whose inputs are modified by a radix-8 butterfly unit. Since the nonzero input points of the DFT operation lie only in the first subset of input points, we replace the radix-8 butterfly unit with a single complex multiplier. Therefore, in FFT mode the partial FFT/IFFT processor first buffers the input data in the 128-word memory, then reads the data from the memory and multiplies them by the appropriate twiddle factors to form the input of the 128-point FFT/IFFT processor. Finally, the output data appear in bit-reversed order with respect to the input order.

Fig. 3.23 Partial FFT/IFFT processor in FFT mode
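The FFT-mode data flow (buffer the nonzero inputs, multiply by twiddle factors, run a short DFT) can be sketched in the same way. The mapping of pass r to the output indices 8q + r is an assumption that matches the description, and the sketch writes outputs in natural order rather than the bit-reversed order of the hardware. The names `dft` and `partial_dft_first_block_input` are hypothetical.

```python
import cmath
import random


def dft(x):
    """Naive forward DFT, O(N^2)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def partial_dft_first_block_input(x_nonzero, parts=8):
    """DFT of length parts*len(x_nonzero) whose input is x_nonzero followed
    by zeros, computed as `parts` short DFTs of twiddled copies of the input."""
    m = len(x_nonzero)                   # buffered nonzero samples
    n = m * parts
    out = [0j] * n
    for r in range(parts):
        # One complex multiplication per sample replaces the radix-8 butterfly.
        twiddled = [x_nonzero[i] * cmath.exp(-2j * cmath.pi * i * r / n)
                    for i in range(m)]
        sub = dft(twiddled)
        for q in range(m):
            out[parts * q + r] = sub[q]  # pass r yields outputs r, r+parts, ...
    return out


# Small-size check (8 nonzero inputs, 64 outputs); the thesis case is 128 and 1024.
random.seed(1)
head = [complex(random.random(), random.random()) for _ in range(8)]
reference = dft(head + [0j] * 56)
result = partial_dft_first_block_input(head, parts=8)
print(max(abs(a - b) for a, b in zip(reference, result)) < 1e-9)  # True
```

Because every pass rereads the same buffered samples with a different twiddle sequence, the input memory needs to hold only the nonzero block.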

(53) Moreover, with suitable control we can apply the concept of the DFT with multiple subsets of input and output points to our 128-point FFT processing element. This helps remove the redundant operations caused by the zero input points of the guard band or by unusable multipath responses. In our proposed DFT-based channel estimation, according to system simulation, the path selector chooses only 8 path impulses from the 128 output points of the IDFT operation. Hence, we can add partial FFT control to the 128-point FFT processing elements to remove more redundant operations. The comparison of hardware complexity is shown in Table 3-4. Compared with the traditional radix-2 SDF FFT architecture, the proposed partial FFT reduces the memory size by 75.1%, the number of complex multipliers by 22.3%, and the number of complex adders by 30%. Furthermore, with the additional partial FFT control for the 128-point FFT processor, shown in Table 3-5, the proposed partial FFT reduces the multiplication operations by up to 65.3% and the addition operations by up to 49.5%, which may save more power if the indices of the 8 valid output points have common bits.

Table 3-4 Comparison of Partial FFT and Conventional FFT
Item                   Conventional Radix-2 SDF   Partial FFT with Radix-2 SDF
Memory Size (words)    1023 (100%)                255 (24.9%)
Complex Multiplier     9 (100%)                   7 (77.7%)
Complex Adder          20 (100%)                  14 (70.0%)
Data Latency           1023 (100%)                1023 (100%)

(54) Table 3-5 Reduced operations of partial FFT with radix-2 SDF architecture
Operations                             Original FFT    Modified Architecture Partial FFT   Modified Control Partial FFT          Reduced
Operations of Complex Multiplications  4608 (100%)     3584 (77.7%)                        Max 3584 (77.7%), Min 1600 (34.7%)    Max 65.3%
Operations of Complex Additions        10240 (100%)    8064 (78.8%)                        Max 8064 (78.8%), Min 5176 (50.5%)    Max 49.5%

3.5 Summary

This chapter introduces the methods for designing a parallel-in-parallel-out FFT processor and a partial FFT/IFFT processor. To tape out the 802.16e baseband receiver chip, the parallel-in-parallel-out FFT processor is the more urgent requirement for realizing the DFT-based channel estimation. Hence, this thesis focuses only on the hardware implementation of a parallel-in-parallel-out FFT processor, and the partial FFT processor design is left as future work to improve our system. The next chapter introduces the design of the FFT/IFFT processor with parallel-in-parallel-out in normal order.

(55) Chapter 4 Parallel-In-Parallel-Out FFT/IFFT Processor Architecture Design

4.1 System Requirement of the FFT/IFFT Processor

The block diagram of the decision feedback DFT-based channel estimation (DF DFT-based CE) is shown in Fig. 4.1. It needs parallel-in-parallel-out FFT_ch and IFFT_ch blocks so that the circuit blocks before and after them can be sped up with parallel computation.

Fig. 4.1 Decision feedback DFT-based channel estimation block diagram

According to the analysis in Chapter 3 of high-throughput FFT/IFFT processor architectures with multiple inputs and outputs, the memory-based architecture is the best choice for the lowest hardware cost when data latency is not a concern. To speed up the memory-based FFT/IFFT architecture so that it meets the data latency required by the system, a parallel memory-based architecture is used in our FFT/IFFT processor design. Furthermore, to reduce the hardware cost and control complexity of the

(56) processing elements, we use pipeline-based SDF processing elements to replace the radix-r butterfly units of the memory-based architecture. As a result, the proposed FFT/IFFT processor is based on a parallel memory-based FFT architecture with pipeline-based SDF processing elements. The system requirement of the FFT/IFFT processor is shown in Table 4-1.

Table 4-1 FFT/IFFT system requirement
Items                                       Specification
System Clock Rate                           78.4 MHz
FFT Size                                    1024 points
No. of Inputs or Outputs of FFT processor   8
Data Latency                                25 us

The FFT_ch/IFFT_ch blocks have to be designed as a 1024-point FFT/IFFT processor with 8 inputs and 8 outputs working at the system clock rate of 78.4 MHz, and the data latency of the FFT/IFFT processor must be less than 1/4 of the OFDM symbol time, that is, about 25 us.

4.2 Architecture of the FFT/IFFT Processor

According to Chapter 3, we focus on the memory-based FFT processor design with parallel-in-parallel-out in normal order. The conventional memory-based FFT processor with 1 PE and 1 dual-port memory cannot achieve the goal of 8 parallel-in-parallel-out data streams. Thus, we first change the memory from 1 dual-port memory to 8 dual-port memories to support 8 parallel-in-parallel-out data streams. However, the data latency is too long for the memory-based FFT processor with only 1 PE. An FFT/IFFT processor with 4 PEs and 8 memory banks is therefore designed to reduce the data latency. In the later discussion, we will show that the