National Chiao Tung University

Department of Electrical and Control Engineering

Doctoral Dissertation

Design and Implementation of High-Effective Pipelined Processors for Discrete-Time Fourier Transform Applications

Student: Yuan-Chu Yu

Advisor: Chin-Teng Lin

May 2008


Design and Implementation of High-Effective Pipelined Processors for Discrete-Time Fourier Transform Applications

Student: Yuan-Chu Yu

Advisor: Chin-Teng Lin

A Dissertation

Submitted to Department of Electrical and Control Engineering

College of Electrical Engineering and Computer Science

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

In

Electrical and Control Engineering

May 2008

Hsinchu, Taiwan, Republic of China


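Abstract in Chinese

This thesis designs high-performance pipelined processors for the Fourier transform. Four different real-time applications serve as examples for which corresponding high-performance designs are proposed: the dual-tone multi-frequency (DTMF) detector in high-channel-density VoP applications, MIMO-OFDM wireless LANs, long-length FFT computation in handheld digital video broadcasting systems, and FFT/IFFT/2-D DCT computation in next-generation mobile multimedia applications. For these four clearly distinct applications, the thesis proposes six specific hardware-oriented designs that yield the most effective pipelined processor architectures, evaluated in terms of throughput, computation latency, computational complexity, hardware cost, and hardware utilization. For the DTMF detector, the thesis adopts a reduced input sequence scheme, distributed memory, and an improved recursive transform based on the Chebyshev polynomial to achieve a low computation cycle count and high power efficiency. The resulting single-core DTMF detector processes twice the amount of data at the same operating speed and within the same computation time. For 2×2 and 4×4 MIMO-OFDM wireless LANs, the thesis proposes two effective FFT/IFFT processors: the radix-2/8 multiple-path delay feedback architecture (R28MDF) and the radix-2/8 multiple-path delay commutator architecture (R28MDC). Based on the retrenched radix-8 FFT unit (R8-FFT) together with the multiplication-after-write (MAW) technique, both architectures achieve 100% butterfly utilization and a throughput high enough to meet the requirements of 2×2 and 4×4 MIMO-OFDM wireless LANs. For long-length FFT computation, the thesis proposes two new architectures, the radix-4² single-path delay feedback architecture and the radix-4³ single-path delay feedback architecture, which attain the low computational complexity of the radix-16 and radix-64 algorithms with the smaller hardware requirement of radix-4. Comparison with several existing pipelined processors shows that the proposed architectures achieve the highest hardware utilization at the lowest hardware cost and therefore meet the requirements of high-performance applications. Finally, based on the radix-4² single-path delay feedback architecture combined with the segment shift register and overturn shift register structures, a "triple-mode processor" is built to support 256-point FFT/IFFT and 2-D DCT computations. Likewise, comparison with several existing pipelined processors shows that the proposed architecture achieves the highest hardware utilization at the lowest hardware cost and therefore meets the requirements of high-performance applications. All six processors in this thesis were implemented and verified in the TSMC 0.13µm CMOS process. The implementation results and rigorous comparisons show that the proposed RDFT, R28MDF/R28MDC, R42SDF/R43SDF, and triple-mode processors all achieve high processing efficiency for the DTMF detector, MIMO-OFDM wireless LAN, long-length FFT computation, and next-generation mobile multimedia applications.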


Design and Implementation of High-Effective Pipelined

Processors for Discrete-Time Fourier Transform Applications

Student: Yuan-Chu Yu

Advisor: Chin-Teng Lin

Department of Electrical and Control Engineering, National Chiao Tung University

ABSTRACT

This thesis presents the design and implementation of effective pipeline processors for the Fourier transform. Four different real-time applications are considered: the dual-tone multi-frequency (DTMF) detector in high-channel-density voice over packet (VoP) applications, the multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) wireless LAN (WLAN) system, long-length FFT/IFFT computation in the digital video broadcasting-handheld (DVB-H) standard, and FFT/IFFT/2-D DCT computation in next-generation mobile multimedia applications. For these four applications, six specific hardware-oriented designs of highly effective pipeline processors are proposed and evaluated in terms of throughput, computation latency, computational complexity, hardware cost, and hardware utilization.

For the DTMF standards, a low-computation-cycle and power-efficient recursive DFT/IDFT processor is proposed that combines input strength reduction, the Chebyshev polynomial, and register-splitting schemes. Applying this low-computation-cycle architecture doubles the throughput rate and the channel density of the DTMF detector in high-channel-density VoP applications without increasing the operating frequency. Two effective FFT/IFFT processors are proposed for the 2×2 and 4×4 MIMO-OFDM WLAN systems: one based on radix-2/8 multiple-path delay feedback (R28MDF) and one based on radix-2/8 multiple-path delay commutator (R28MDC), respectively. By applying the retrenched 8-point FFT (R8-FFT) unit combined with the proposed multiplication-after-write (MAW) method, the R28MDF and R28MDC architectures achieve 100% butterfly utilization and an appropriate throughput rate with few hardware resources for the 2×2 and 4×4 MIMO-OFDM applications, respectively. For long-length FFT/IFFT computation, two novel designs, radix-4² single-path delay feedback (R42SDF) and radix-4³ single-path delay feedback (R43SDF), combine the low computational complexity of the radix-16 and radix-64 algorithms with the low hardware requirement of the radix-4 algorithm. They achieve the smallest hardware cost and the highest hardware utilization among the tested architectures and thus have the highest efficiency. Based on the R42SDF architecture with the segment shift register (SSR) and overturn shift register (OSR) structures, the proposed triple-mode processor not only supports both 256-point FFT/IFFT and 8×8 2-D DCT computations, but also has the smallest hardware requirement and the largest hardware utilization among the tested architectures for FFT/IFFT computation, and thus has the highest cost efficiency.

All six processors in this thesis were implemented in the TSMC 0.13µm CMOS process. The comprehensive comparisons and implementation results demonstrate that the proposed RDFT, R28MDF/R28MDC, R42SDF/R43SDF, and triple-mode designs are highly effective for the DTMF, MIMO-OFDM WLAN, DVB-H, and next-generation mobile multimedia applications.


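Acknowledgements

There are far too many people to thank for their part in the research behind this thesis. Junior master's students as well as my fellow and senior doctoral students all gave me a great deal of support and encouragement, which allowed me to keep improving throughout my doctoral studies. My deepest thanks go to my advisor, Dr. Chin-Teng Lin, for his careful guidance. He always pointed my research in the right direction and offered sound advice; however busy he was, he never forgot to encourage and guide me, and he gave me considerable freedom so that I could learn the right attitude and methods for facing and solving problems. I also thank Professor 邱創乾, who travelled a long way to chair my oral defense committee; the suggestions and guidance of the committee members made this thesis more substantial and complete. I next thank Professor 范倫達 of the Department of Computer Science, National Chiao Tung University, who gave me many suggestions and much help on the content and direction of my research, offered the greatest assistance and understanding through every difficulty and low point of my doctoral studies, and shared many unforgettable memories with me during this time. I am also grateful to Vice President 顏國隆 of ELAN Microelectronics Corporation (義隆電子股份有限公司) for his recognition and financial support, which allowed me to complete such demanding doctoral studies in the time left over from work. Finally, I most want to thank my wife, who supported me quietly, as well as my mother, my younger brother, and my daughter for all their emotional and material support, and my other relatives and friends for their care and encouragement. Your care and support were what kept my research going. I dedicate this thesis to my family and to all the teachers and friends who have cared about me.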

Contents

Abstract in Chinese

Abstract in English

Acknowledgements in Chinese

Contents

List of Figures

List of Tables

1 Introduction

1.1 Motivation

1.2 Objectives

1.3 Contributions

1.4 Organization

2 Literature Review

2.1 The Goertzel Algorithm

2.1.1 The Recursive DFT Algorithm

2.1.2 The Recursive DFT Architecture

2.2 The Review of FFT Algorithm

2.2.1 Radix-2 DIF FFT Algorithm

2.2.2 Radix-4 DIF FFT Algorithm

2.2.3 Radix-8 DIF FFT Algorithm

2.2.4 Radix-2/4 DIF FFT Algorithm

2.2.5 Radix-2/8 DIF FFT Algorithm

2.2.6 Radix-2² DIF FFT Algorithm

2.2.7 Radix-2³ DIF FFT Algorithm

2.3 The Review of Pipeline FFT Architecture

2.4 The MIMO-FFT Architecture

3 The Low-Computation Cycle and Power-Efficient Recursive DFT/IDFT Design

3.1 New Recursive Algorithm and Architecture

3.3 The Comparison of different Recursive DFT/IDFT Architecture

3.4 Summary

4 Effective FFT/IFFT Processors for MIMO-OFDM WLAN Systems

4.1 The Proposed Modified Radix-2/8 FFT/IFFT Algorithm

4.2 The Proposed MIMO-FFT Architecture

4.2.1 R28MDF-based 64-Point FFT/IFFT Processor for 2×2 MIMO-OFDM system

4.2.2 R28MDC-based 64-Point pipeline FFT/IFFT Processor for 4×4 MIMO-OFDM system

4.3 Circuit Implementation

4.4 The Comparison Discussion of MIMO-FFT Architecture

4.4.1 2×2 MIMO-OFDM WLAN application

4.4.2 4×4 MIMO-OFDM WLAN application

4.5 Summary

5 Long-Length based Effective Pipeline FFT/IFFT Processor

5.1 New Radix-4² and Radix-4³ based FFT/IFFT Algorithm

5.1.1 Radix-4² based FFT Formula

5.1.2 Radix-4² based IFFT Formula

5.1.3 Radix-4³ based FFT/IFFT Formula

5.2 Pipeline 4096-Point R42SDF and R43SDF Based FFT/IFFT VLSI Architecture

5.2.1 Radix-4 Butterfly

5.2.2 Memory Structure

5.2.3 Constant Multiplier

5.2.4 Eight Folded Complex Multiplier

5.3 Finite Word-Length Analysis

5.4 The MIMO-FFT Architecture

5.5 Chip Implementation

5.6 Summary

6 Effective Triple-Mode Reconfigurable Pipeline FFT/IFFT/2D-DCT Processor

6.1 8×8 2D FFT and 8×8 2D DCT Formula

6.2.1 Radix-4 Butterfly and Radix-2 Butterfly

6.2.2 Memory Structure

6.2.3 Input Re-ordering and First Butterfly Computation

6.2.4 Constant Multiplier

6.2.5 Eight Folded Complex Multiplier

6.2.5 Post Computation

6.3 Finite Word-Length Analysis

6.3.1 Pipeline 256-Point FFT/IFFT

6.3.2 Pipeline 8×8 2-D DCT

6.4 Comparison and Chip Implementation

6.4.1 Comparison between R42SDF and R22SDF

6.4.2 8×8 2-D DCT Comparison

6.4.3 Chip Implementation

6.5 Summary

7 Conclusion and Future Work

8 Bibliography

List of Figures

Fig. 1: MIMO-FFT architectures. (a) Parallel multi-path MIMO-FFT architecture. (b) Serial multi-stream MIMO-FFT architecture. (c) Serial blockwise MIMO-FFT architecture.

Fig. 2: (a) Block diagram of the first-order recursive DFT structure and (b) a multiplexer-type dash-line implementation with down-sampling value of N.

Fig. 3: Block diagram of the second-order recursive DFT structure.

Fig. 4: Three 256-point pipeline FFT architectures. (a) The R4SDF architecture. (b) The R4MDC architecture. (c) The R4SDC architecture.

Fig. 5: Block diagram of low-computation cycle for (a) DCT part and (b) DST part of the DFT computation.

Fig. 6: Block diagram of the proposed low-computation cycle and power-efficient recursive DFT architecture.

Fig. 7: Block diagram of the proposed low-computation cycle and power-efficient recursive IDFT architecture.

Fig. 8: Dataflow of the DTMF detection [21].

Fig. 9: Block diagram of the proposed high channel density DTMF architecture.

Fig. 10: Bit level SNR simulation environment.

Fig. 11: Bit level SNR simulation results.

Fig. 12: The 212/106-point recursive DFT/IDFT chip layout.

Fig. 13: The "L" shaped butterfly of novel radix-2/8 FFT algorithm.

Fig. 14: Block diagram of the proposed R28MDF-based 64-point FFT/IFFT architecture for 2×2 MIMO-OFDM system.

Fig. 15: The timing sequence of the proposed block based input unit.

Fig. 16: Block diagram of the proposed R8-FFT/IFFT unit.

Fig. 17: Block diagram of the proposed MAW-based multiplier unit.

Fig. 18: The timing sequence of the proposed R28MDF and R28MDC architectures. (a) The first stage: Multiplication Stage. (b) The second stage: Output Stage. (c) The timing sequence of R28MDF design. (d) The pipeline timing sequence of R28MDC design.

Fig. 19: Block diagram of the proposed R28MDC-based 64-point FFT/IFFT architecture for 4×4 MIMO-OFDM system.

Fig. 20: Layout view of the proposed 64-point FFT/IFFT processors. (a) The R28MDF implementation. (b) The R28MDC implementation.

Fig. 21: The CFA decomposition procedure of the proposed radix-4² based N-point FFT algorithm.

Fig. 22: Block diagram of the R42SDF-based 4096-point FFT/IFFT VLSI architecture.

Fig. 23: Block diagram of the R43SDF-based 4096-point FFT/IFFT VLSI architecture.

Fig. 24: Block diagram of the radix-4 butterfly architecture.

Fig. 25: The proposed 4 operation modes of the radix-4 butterfly stage in the R42SDF and R43SDF based 4096-point FFT/IFFT VLSI architecture. (a) The proposed 4 operation modes in the radix-4 based butterfly stages. (b) The timing sequences of 4 operation modes in the proposed pipeline architecture.

Fig. 26: The proposed memory architecture of the butterfly stage I and II in the R42SDF and R43SDF based 4096-point FFT/IFFT VLSI architecture. (a) The proposed FIFO shift registers architecture on the butterfly stage I. (b) The proposed single port SRAM with independent word control. (c) The memory context on the proposed butterfly stage I. (d) The timing sequence of proposed memory architecture in the operation mode 3.

Fig. 27: Block diagram of the proposed constant multiplier in R42SDF design.

Fig. 28: The block diagram of eight-folded algorithm in the coefficient ROM.

Fig. 29: Finite word-length analysis of the proposed pipeline R42SDF and R43SDF-based 4096-point FFT/IFFT architecture.

Fig. 30: The layout view of proposed 4096-point pipeline FFT/IFFT processor. (a) The layout view of proposed R42SDF design. (b) The layout view of proposed R43SDF design.

Fig. 31: Block diagram of the R42SDF-based 256-point FFT/IFFT and 8×8 2D-DCT architecture.

Fig. 32: Block diagram of the radix-4 butterfly architecture.

Fig. 33: Block diagram of the proposed first radix-4 butterfly stage in the R42SDF-based 256-point FFT/IFFT and 8×8 2D-DCT architecture. (a) The proposed 12 reconfigurable operation mechanisms of the first butterfly stage. (b) The timing sequences of operation mechanism in the first butterfly stage. (c) The storage content in SSR in the 8×8 2D-DCT mode. (d) The content of the 8×8 2-D DCT computation result in SSR.

Fig. 34: Block diagram of the proposed constant multiplier architecture.

Fig. 35: The block diagram of eight-folded algorithm in the coefficient ROM.

Fig. 36: Block diagram of the proposed fourth butterfly stage in the R42SDF-based 256-point FFT/IFFT and 8×8 2D-DCT architecture. (a) The data context of the fourth butterfly stage in the 8×8 2D DCT mode. (b) The OSR structure of the fourth butterfly stage.

Fig. 37: Finite wordlength analysis of the proposed pipeline R42SDF-based 256-point FFT/IFFT architecture.

Fig. 38: Finite wordlength analysis of the proposed pipeline R42SDF-based 8×8 2D DCT architecture. (a) Overall mean square error analysis. (b) Peak mean square error analysis. (c) Overall mean error analysis. (d) Peak mean error analysis.

Fig. 39: The layout view and design characteristics of proposed pipeline 256-point FFT/IFFT/8×8 2D DCT processor.

List of Tables

Table 1: Number of complex multiplications needed for the computation of a 64-point FFT/IFFT processor.

Table 2: Chip Characteristics of the Proposed DTMF Detector.

Table 3: Comparison Results among the Recursive DFT/IDFT Architectures.

Table 4: Area usage of each building block in the proposed R28MDF and R28MDC Design.

Table 5: Comparison results of the 64-point FFT/IFFT chip designs in 2×2 MIMO-OFDM system.

Table 6: Comparison results of the 64-point pipelined FFT/IFFT architecture in 4×4 MIMO-OFDM system.

Table 7: The Data Control of the Coefficient ROM in the R43SDF design.

Table 8: Hardware Cost Comparisons of the Pipelined FFT/IFFT Architecture.

Table 9: Hardware Utilization Rate Comparisons of the Pipelined FFT/IFFT Architecture.

Table 10: The Gate Count Usage of Each Building Block in the Proposed Design.

Table 11: The Corresponding Equation Numbers for Each Building Block.

Table 12: The Data Control of the Coefficient ROM.

Table 13: Hardware Cost Comparisons of the Pipelined FFT/IFFT Architecture.

Table 14: Hardware Utilization Rate Comparisons of the Pipelined FFT/IFFT Architecture.

Table 15: Hardware Requirement Comparison of 8×8 2D DCT Architecture.

Table 16: The Gate Count Usage of Each Building Block.


Chapter 1 Introduction

The increased demand for communication, multimedia, and other consumer products has created the need for low-cost, low-power, high-throughput processors that use Fourier transforms for signal processing or data manipulation. The discrete Fourier transform (DFT) converts time-domain data into frequency-domain data [1]. "Discrete" means that the signal is sampled in time rather than being continuous, so the DFT is an approximation of the continuous Fourier transform [2]. Unlike the continuous Fourier transform, the DFT covers a finite time and frequency span. Depending on which DFT results are required, the effective algorithms for DFT computation fall into two categories: 1) fast Fourier transform (FFT) algorithms and 2) recursive algorithms. FFT algorithms are a group of algorithms that significantly speed up the computation of the DFT when all N points of the DFT result are required. The most widely known of them is attributed to Cooley and Tukey [3] and applies when the number of points N is a power of two. In practice, many applications require spectrum analysis over only a subset of the N center frequencies rather than the complete FFT result. The recursive algorithm is an effective alternative in this case: it outperforms the FFT algorithm when only a few sparse DFT results are needed, producing a single complex DFT spectral bin value for every N input samples. The best-known recursive algorithm is the Goertzel algorithm [4], which uses the periodicity properties to reduce the DFT computation. Correspondingly, two kinds of effective DFT processors exist: 1) FFT-based processors and 2) recursive processors. This study presents one highly effective recursive processor and, to meet different requirements, five different pipeline FFT/IFFT processors.
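To make the idea of a recursive single-bin computation concrete, the following is a minimal Python sketch of the textbook second-order Goertzel recursion. It is not the recursive architecture proposed in this thesis; the function names `goertzel_bin` and `dft_bin` are chosen here for illustration only, and the direct DFT sum is included solely as a reference for comparison.

```python
import cmath
import math


def goertzel_bin(samples, k):
    """Return the k-th DFT bin of `samples` via the Goertzel recursion."""
    n = len(samples)
    omega = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(omega)  # the only coefficient used inside the loop
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        # Second-order recursion: s[m] = x[m] + 2cos(w) s[m-1] - s[m-2]
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    # One complex multiplication after the last sample yields X[k].
    return cmath.exp(1j * omega) * s_prev - s_prev2


def dft_bin(samples, k):
    """Reference: direct evaluation of the k-th DFT bin."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * m / n)
               for m, x in enumerate(samples))


if __name__ == "__main__":
    tone = [math.sin(2.0 * math.pi * 3 * m / 16) + 0.5 for m in range(16)]
    print(abs(goertzel_bin(tone, 3)), abs(dft_bin(tone, 3)))
```

The loop needs one multiplication by a real coefficient per input sample, which is why the recursion is attractive when only a few bins are wanted.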


1.1 Motivation

Many researchers have concentrated on designing optimized reconfigurable DSP processors that achieve a high processing rate and low power consumption in next-generation mobile multimedia applications [5][6]. Software-based architectures such as the co-processor and dual-MAC designs have been proposed by Chai et al. [5] and Kolagotla et al. [6], respectively. However, their high flexibility comes at the cost of a large chip size. Vorbach et al. have presented hardware-based concepts such as the processing element (PE) array [7], which achieves a high processing rate with reasonable flexibility. However, the processing kernel suffers from a low utilization rate because of its large array memory and multiple MACs, leading to poor cost efficiency. A dedicated ASIC design based on a fast computation algorithm provides high cost efficiency [8]-[10]. Depending on the real-time application, several design decisions for an ASIC-based FFT processor must be made according to the following specifications:

Required portions of DFT results: The primary advantage of recursive algorithms is that they allow a subset of the DFT's N output terms to be calculated efficiently. In terms of computational complexity, direct evaluation of all N DFT values requires a total of N² complex multiplications and N(N-1) complex additions. If only M of the N DFT results are required, the computational complexities of the Goertzel algorithm and the radix-2 FFT algorithm are NM and N log₂ N, respectively. Clearly, the computational saving of the radix-2 FFT algorithm is then not significant (less than a factor of two). The Goertzel algorithm is therefore efficient for certain applications, such as the dual-tone multi-frequency (DTMF) standards [11-16] for voice over packet (VoP) networks [17-19], the discrete multi-tone equalizer of multi-carrier modulation systems [20, 21], and speed detection.
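The operation counts quoted above can be compared directly. The short Python sketch below uses only the three expressions from the text (N² for the direct DFT, NM for the Goertzel algorithm, N log₂ N for the radix-2 FFT); the function names and the example values N = 256, M = 4 and M = 16 are illustrative assumptions, not figures from the thesis.

```python
import math


def direct_dft_ops(n):
    """Complex multiplications for direct evaluation of all N bins."""
    return n * n


def goertzel_ops(n, m):
    """Operation count N*M when only M of the N bins are required."""
    return n * m


def radix2_fft_ops(n):
    """Operation count N*log2(N) for the radix-2 FFT (all N bins)."""
    return n * math.log2(n)


def goertzel_is_cheaper(n, m):
    """True when N*M < N*log2(N), i.e. when M < log2(N)."""
    return goertzel_ops(n, m) < radix2_fft_ops(n)


if __name__ == "__main__":
    for m in (4, 16):
        print(m, goertzel_ops(256, m), radix2_fft_ops(256),
              goertzel_is_cheaper(256, m))
```

Under these counts the recursive approach wins whenever the number of required bins M is below log₂ N, which is the regime of the sparse-spectrum applications listed above.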

Number of FFT channels: Future broadband wireless access systems, including wireless LANs (WLANs) and fourth-generation (4G) mobile radio systems, need much higher spectral efficiency and service quality than current standards provide [22, 23]. Multiple-input multiple-output (MIMO) wireless systems have been studied extensively in recent years because of their potential for raising system capacity [24, 25, 26]. The orthogonal frequency division multiplexing (OFDM) modulation scheme not only decreases receiver complexity but also improves performance on highly dispersive channels. An especially promising candidate for next-generation fixed and mobile wireless systems is the combination of MIMO technology with OFDM, called the MIMO-OFDM system. A MIMO-OFDM system with k antennas at the transmitter and the receiver comprises k OFDM baseband processors working in parallel and thus requires k FFT processors, one for each antenna [24-26]. A high-throughput FFT processor that can perform the multi-channel FFT computations is therefore required.

Transform length of FFT computation: The size of the transform directly affects the frequency resolution, the memory requirements, and the speed at which the computation can be done. In practice, many applications require FFT/IFFT implementations that perform long-length computations while exhibiting low cost, low power consumption, and high throughput. Long-length FFT/IFFT processors have been widely applied in real-time applications such as DVB-H (Digital Video Broadcasting-Handheld) [27, 28], VDSL (Very-high-speed Digital Subscriber Line) [29], and audio measurement [30]. Since such long-length FFT computations are rather time-consuming, efficient FFT processors are necessary to meet real-time requirements. Furthermore, handheld devices, which include multimedia mobile phones with color displays as well as personal digital assistants (PDAs) and pocket PCs, must be small, lightweight, portable, and battery-powered.

Number of dimensions: All multidimensional FFTs are computed as a sequence of one-dimensional FFTs. The number of dimensions (usually one, two, or three) determines how many FFTs are needed and how the data must be organized to process the multiple dimensions. This affects the chip processing load and the choice of architecture. To improve on the radix-2 FFT algorithm, He et al. [31] presented the radix-2² and radix-2³ algorithms for higher computation efficiency. The design in [31] thereby achieves high hardware utilization and low hardware resource usage.
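The statement that a multidimensional transform reduces to a sequence of one-dimensional transforms can be checked with a small row-column sketch. The Python code below uses a plain O(N²) DFT in place of an FFT to stay short; the helper names and the 4×4 test block are assumptions made for illustration and do not come from the thesis.

```python
import cmath
import math


def dft_1d(seq):
    """Plain one-dimensional DFT (stands in for a 1-D FFT)."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * math.pi * k * m / n)
                for m in range(n))
            for k in range(n)]


def dft_2d_row_column(block):
    """2-D DFT computed as 1-D DFTs over every row, then every column."""
    rows_done = [dft_1d(row) for row in block]
    cols_done = [dft_1d(list(col)) for col in zip(*rows_done)]
    # Transpose back so that result[u][v] is indexed (row freq, column freq).
    return [list(r) for r in zip(*cols_done)]


def dft_2d_direct(block):
    """Reference: direct double-sum definition of the 2-D DFT."""
    rows, cols = len(block), len(block[0])
    return [[sum(block[r][c]
                 * cmath.exp(-2j * math.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]


if __name__ == "__main__":
    block = [[(3 * r + 5 * c) % 7 for c in range(4)] for r in range(4)]
    a = dft_2d_row_column(block)
    b = dft_2d_direct(block)
    print(max(abs(a[u][v] - b[u][v]) for u in range(4) for v in range(4)))
```

For an R×C block the row-column method performs R + C one-dimensional transforms, for example 16 eight-point transforms for an 8×8 block, and the intermediate result must be transposed or re-addressed between the two passes.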

Algorithm construction: The chosen algorithm determines the computational complexity and the computation speed of the design. Low-radix algorithms are well known to have higher multiplicative complexity than high-radix algorithms. Notably, the design with the highest complex-multiplication complexity also has the highest power consumption [26, 28, 31-33].

Architectures: Much research has concentrated on efficient FFT realizations [26, 31, 34-36]. The algorithm and architecture of an FFT processor should be chosen by trading off processing speed against chip cost. The pipeline architecture possesses regularity, modularity, local connection, and a high throughput rate at a lower clock frequency [37]. Furthermore, a pipeline FFT processor processes data continuously at a clock frequency equal to the input data sampling frequency. An analysis has shown that a single operating frequency close or equal to the sampling frequency is preferable for the FFT processor when power consumption is constrained by the application environment, as in handheld communications [26, 31, 32, 34, 38]. There are two main pipeline architectures: multipath delay commutator (MDC) architectures [33, 36, 39, 40] and single-path delay feedback (SDF) architectures [31, 32, 34, 35, 42, 43]. SDF architectures are well known to be more efficient than MDC architectures in terms of memory utilization, since the butterfly outputs share the same storage as the inputs [31, 32, 34]. Therefore, this investigation focuses on "hardware-oriented" pipeline architectures, in which the arithmetic operations can be tightly scheduled for effective hardware utilization.

1.2 Objectives

The objective of this thesis is to propose highly effective pipeline processors for DFT computation in different real-time applications. Four applications are considered: recursive DFT computation in the DTMF standard [12-15], the multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) wireless LAN (WLAN) [22, 23], long-length FFT/IFFT computation in the digital video broadcasting-handheld (DVB-H) standard [27, 28], and FFT/IFFT/2-D DCT computation in next-generation mobile multimedia applications [5-7, 44]. The objectives of the four designs are described below:

1. Recursive DFT/IDFT Design: The Goertzel algorithm has been widely applied to compute the spectra of interest in the dual-tone multi-frequency (DTMF) standards [11]-[16] for voice over packet (VoP) networks [17]-[19], in the discrete multitone equalizer of multicarrier modulation systems [20], [21], and in speed detection. State-of-the-art applications demand a high channel-density dual-tone detector [17]-[19]. Some advanced DTMF detectors for high-density VoP network applications have been realized on an embedded DSP processor [12]-[14], [17]-[19]. Although a DSP-processor-based design offers maximum flexibility, it may not be cost effective, and it may lose the advantages of high throughput, low power, and small area compared with application-specific integrated circuit (ASIC) designs [45]. In [13], the DSP-processor-based DTMF detector needs a large amount of memory to decode only 24 channels: 800 words of data memory and 1000 words of program memory, with a 16-bit word length. It also has to operate at the higher frequency of 24 MHz. To optimize overall system performance and cost, much research [46]-[53] has concentrated on dedicated core designs. In [46]-[48], recursive expressions for the DCT computation that are suitable for VLSI implementation are presented. It is worth noting that in [46]-[48] the recursive algorithms are used solely to design recursive DCT architectures rather than recursive DFT architectures. In the past two decades, several recursive DFT algorithms and architectures have been explored [49]-[53]. Compared with the conventional second-order recursive DFT/IDFT architecture, Van et al. [51] utilized resource-sharing and register-splitting schemes to save two multipliers and to speed up the computation, respectively. Yang et al. [52] proposed two unified IIR filter structures to save hardware cost for the DFT computation. Nevertheless, neither Van et al. [51] nor Yang et al. [52] reduced the number of computation cycles. In [53], Fan et al. applied the previously proposed method to reduce the computation cycles, but the improvement is limited. Moreover, [53] proposes only the recursive DFT algorithm and not the IDFT algorithm. A short description of the proposed algorithm has been presented in the associated conference papers [54, 55]. This thesis provides the detailed description of a high-performance and power-efficient VLSI algorithm and architecture for the DTMF application that combines an input strength reduction scheme, the Chebyshev polynomial, and a register-splitting scheme. The derived recursive algorithm and the devised architecture [54, 55] feature a low computation cycle count (i.e., high throughput) and power efficiency, at the expense of slightly increased area overhead compared with existing recursive DFT/IDFT structures.

2. MIMO-OFDM FFT design: Future broadband wireless access systems, including wireless LANs (WLANs) and fourth-generation (4G) mobile radio systems, need much higher spectral efficiency and service quality than current standards provide [22, 23]. Multiple-input multiple-output (MIMO) wireless systems have been studied extensively in recent years because of their potential for raising system capacity [24-26]. The orthogonal frequency division multiplexing (OFDM) modulation scheme not only decreases receiver complexity but also improves performance on highly dispersive channels. An especially promising candidate for next-generation fixed and mobile wireless systems is the combination of MIMO technology with OFDM, called the MIMO-OFDM system. A MIMO-OFDM system with k antennas at the transmitter and the receiver comprises k OFDM baseband processors working in parallel and thus requires k FFT processors, one for each antenna [24-26]. Because of the high throughput requirement of the FFT computation in a MIMO-OFDM system, three 4×4 MIMO-FFT architectures have been presented [25]: the parallel multi-path architecture, the serial multi-stream architecture, and the serial blockwise architecture, depicted in Fig. 1(a)-(c), respectively. The parallel multi-path architecture includes k FFT blocks for k antennas, as depicted in Fig. 1(a), so its area cost rises linearly with the number of antennas (i.e., k times the FFT block area). Conversely, the serial multi-stream and serial blockwise architectures require only one FFT block to handle the concurrent computation for k antennas. The serial multi-stream architecture uses one lower-throughput FFT processor with a k-times larger buffer for intermediate results, as depicted in Fig. 1(b). To process k channels, it must operate at a clock frequency higher than the data sampling frequency Fs; its operating frequency grows linearly with the number of antennas (i.e., k times Fs). In the serial blockwise architecture, the input data of the FFT block are provided in parallel by k embedded input buffers, as depicted in Fig. 1(c). Using one higher-throughput FFT processor, the serial blockwise processor completes the k channel FFT computations concurrently. Among the three architectures, the serial blockwise architecture alone needs only one FFT block operating at the data sampling frequency Fs. An analysis has shown that a single operating frequency close or equal to the sampling frequency Fs is preferable for the FFT processor when power consumption is constrained by the application environment, as in mobile communications [26, 31, 38, 56, 57]. In terms of memory cost, the serial blockwise architecture is slightly more expensive than the other architectures because it needs one extra buffer of size N. However, this memory overhead becomes increasingly minor as the number of antennas in the MIMO-OFDM system grows. Consequently, the serial blockwise MIMO-FFT architecture uses a single FFT block to achieve the appropriate throughput while minimizing power consumption for MIMO-OFDM WLAN applications.

Fig. 1: MIMO-FFT architectures for four channels (A/D sampling frequency: Fs; outputs Z₁(k) to Z₄(k)).

(a) Parallel multi-path MIMO-FFT architecture: FFT blocks #1 to #4, each operating at Fs, with buffers of size N.

(b) Serial multi-stream MIMO-FFT architecture: buffer of size 4N, MUX, FFT block #1 operating at 4Fs, DeMUX.

(c) Serial blockwise MIMO-FFT architecture: buffers of size N, MUX, FFT block #1 operating at Fs, DeMUX.
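
The trade-off among the three architectures can be tabulated with a small first-order model. The following Python sketch is only illustrative: the class and function names are hypothetical, and the buffer totals (k buffers of size N for the parallel architecture, one buffer of size kN for the serial multi-stream architecture, and k+1 buffers of size N for the serial blockwise architecture) are assumptions read from the annotations of Fig. 1, not figures reported for any chip.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MimoFftCost:
    name: str
    fft_blocks: int      # number of FFT blocks instantiated
    clock_multiple: int  # FFT clock as a multiple of the sampling frequency Fs
    buffer_words: int    # total buffer storage in words (assumed from Fig. 1)


def mimo_fft_costs(k: int, n: int) -> List[MimoFftCost]:
    """First-order cost model for k antennas and an n-point FFT."""
    return [
        # k FFT blocks, each running at Fs.
        MimoFftCost("parallel multi-path", k, 1, k * n),
        # One FFT block running at k * Fs with a k-times larger buffer.
        MimoFftCost("serial multi-stream", 1, k, k * n),
        # One FFT block running at Fs, plus one extra buffer of size n.
        MimoFftCost("serial blockwise", 1, 1, (k + 1) * n),
    ]


if __name__ == "__main__":
    for k in (2, 4, 8):
        for cost in mimo_fft_costs(k, n=64):
            print(k, cost)
```

Under these assumptions the extra buffer of the serial blockwise architecture is a fraction 1/k of the total buffer of the other two, which is why its relative memory penalty shrinks as the number of antennas grows.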

3. Long-Length FFT Design: The FFT and IFFT are essential in digital signal processing (DSP) and communication systems. In practice, many applications require FFT/IFFT implementations that perform long-length computations while exhibiting low cost, low power consumption and high throughput. Long-length FFT/IFFT processors have been widely applied in real-time applications such as DVB-H (Digital Video Broadcasting-Handheld) [27, 28], VDSL (Very-high-speed Digital Subscriber Line) [29], and audio measurement [30]. DVB-H is a digital broadcast standard offering high-data-rate audio/video content delivery to handheld devices; it requires a 4096-point FFT computation (i.e. the 4k mode) for flexible network design in single frequency networks (SFNs) [27, 28]. The VDSL transceiver and the audio analyzer also involve complicated FFT computations with a transform length of 4096 points [29, 30]. Since such long-length FFT computations are time-consuming, efficient FFT processors are necessary to meet real-time operation. Furthermore, handheld devices, which include multimedia mobile phones with color displays, personal digital assistants (PDAs) and pocket PCs, must be small, lightweight, portable and battery-powered.

4. Triple-mode reconfigurable FFT/IFFT/2-D DCT design: Next-generation mobile multimedia devices, including mobile phones and personal digital assistants (PDAs), require sufficiently high processing power for multimedia applications. Multimedia applications include video/audio codecs, speech recognition and echo cancellers. Speech recognition requires speech extraction and autocorrelation coefficient computations [58] in the voice command application. The video codec is the most challenging element of a multimedia application, since it requires much processing power and bandwidth. Hence, a flexible and low-cost pipeline processor with a high processing rate is required to realize the necessary computation-intensive algorithms, such as the 256-point FFT/IFFT and the 8×8 2-D DCT [5]-[7]. Additionally, a major integration challenge is to design the digital baseband and the accompanying control logic. The WiMAX baseband is built around orthogonal frequency division multiplexing (OFDM) technology, which requires high processing throughput. The fixed, IEEE 802.16e [44], version of WiMAX also needs a 256-point FFT computation. Many researchers have recently concentrated on designing an optimized reconfigurable DSP processor to achieve a high processing rate and low power consumption in next-generation mobile multimedia applications [5][6]. Software-based architectures such as the co-processor and dual-MAC designs have been proposed by Chai et al. [5] and Kolagotla et al. [6], respectively. However, their high flexibility results in a large chip size. Vorbach et al. have also presented hardware-based concepts such as the processing element (PE) array [7], which achieves a high processing rate with reasonable flexibility. However, the processing kernel suffers from a low utilization rate with a large array memory and multiple MACs, leading to poor cost efficiency. ASIC-based designs built on a fast computation algorithm provide high cost efficiency [8]-[10]. Tell et al. [8] presented an FFT/WALSH/1-D DCT processor for multiple radio standards of the upcoming 4th generation wireless systems. However, these designs [8]-[10] support only 1-D DCT computation and provide no 2-D DCT support, although the 2-D DCT is desirable for video compression in wireless communication applications. This study not only presents a single reconfigurable architecture for the 256-point FFT/IFFT modes and the 8×8 2-D DCT mode, but also achieves high cost-efficiency in portable multimedia applications.

1.3 Contributions

To support these four applications, six ASIC-based pipeline processors are presented in this thesis: the recursive DFT/IDFT (RDFT) based processor, the radix-2/8 multiple-path delay feedback (R28MDF) based processor, the radix-2/8 multiple-path delay commutator (R28MDC) based processor, the radix-4² single-path delay feedback (R42SDF) based processor, the radix-4³ single-path delay feedback (R43SDF) based processor and the reconfigurable triple-mode FFT/IFFT/2-D DCT processor. The contributions are described below:

1. RDFT Design: Based on the proposed RDFT architecture, a high-throughput (i.e. high channel density) and power-efficient DTMF detector has been proposed. To achieve high power efficiency, we perform a bit-level SNR simulation to decide the best configuration for the DTMF detector system. The results show that the proposed design needs only a 9-bit word-length, one bit less than the second-order Goertzel structure, to achieve satisfactory resolution in a 15 dB SNR environment. In this thesis, the resulting DTMF detector uses a 12-bit word-length, where the additional 3 bits serve as design margin to obtain better performance. Moreover, the design saves 4 bits compared with the 16-bit DSP processor based designs [12]-[14]. In summary, the proposed DTMF structure not only reduces area cost, but also reduces power consumption owing to the register-splitting scheme [51] and the smaller word-length requirement. Most importantly, the computation cycles can be reduced to 50%, so that twice the throughput rate and channel density are obtained without increasing the operating frequency. The proposed DFT/IDFT chip can serve over 128 telephone channels for the high channel density DTMF detector [16] without any DSP processor inside. Each channel consumes 9.77 µW under 1.2V@20 MHz in a TSMC 0.13 µm 1P8M CMOS process. This is a significant contribution, as high channel density and low power are demanded in communication systems.

2. R28MDF and R28MDC Design: This investigation presents two new efficient designs, the R28MDF based and R28MDC based FFT/IFFT processors, for the 2×2 and 4×4 multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) wireless LAN (WLAN) systems, respectively. The novel radix-2/8 algorithm halves the constant multiplier requirement in the proposed retrenched 8-point FFT (R8-FFT) unit compared with the conventional radix-2/8 algorithm, and has a multiplicative complexity as low as that of a radix-8 based algorithm. By applying the R8-FFT unit combined with the proposed multiplication-after-write (MAW) method, the R28MDF and R28MDC architectures achieve 100% butterfly utilization and an appropriate throughput rate with few hardware resources for the 2×2 and 4×4 MIMO-OFDM applications, respectively. Implementation results indicate that the two chips consume only 19.42 mW and 23.57 mW under 1.2V@20 MHz in a TSMC 0.13 µm 1P8M CMOS process. The comparison with existing 64-point FFT/IFFT processor architectures is comprehensively discussed. The architecture analyses and chip implementation indicate that the proposed FFT/IFFT processor architectures are suitable for MIMO-OFDM WLAN systems.

3. R42SDF and R43SDF Design: In this investigation, we propose the novel radix-4² and radix-4³ algorithms, which combine the low computational complexity of the radix-16 and radix-64 algorithms with the low hardware requirement of the radix-4 algorithm. Based on the multiplierless radix-4 butterfly structure, the proposed R42SDF and R43SDF designs support 4096-point FFT/IFFT computations. Moreover, the retrenched constant multiplier and eight-folded complex multiplier structures are adopted to decrease the multiplier cost and the coefficient ROM size by means of the complex conjugate symmetry rule and the subexpression elimination technique. To further decrease the chip cost, a finite word-length analysis shows that the proposed R42SDF and R43SDF architectures require only 14-bit and 13-bit internal word-lengths, respectively, to achieve 40 dB SNR performance in the 4096-point FFT/IFFT computation. The comprehensive comparison results indicate that the proposed R43SDF design has the smallest hardware cost and the highest hardware utilization among the tested architectures for the FFT/IFFT computation, and thus has the highest efficiency. The implementation results show that the proposed R42SDF and R43SDF based 4096-point pipeline FFT/IFFT processors consume only 6.3725 mW and 5.985 mW, respectively, at 20 MHz and a 1.2 V supply voltage in a TSMC 0.13 µm CMOS process.

4. The triple-mode reconfigurable FFT/IFFT/2-D DCT Design: By applying the R42SDF architecture with the specific linear mapping of the common factor algorithm (CFA), the proposed triple-mode design supports both the 256-point FFT/IFFT modes and the 8×8 2-D DCT mode with a highly efficient feedback shift register architecture. The segment shift register (SSR) and overturn shift register (OSR) structures are adopted to minimize the register cost of the input re-ordering and post-computation operations in the 8×8 2-D DCT mode, respectively. Moreover, the retrenched constant multiplier and eight-folded complex multiplier structures are adopted to decrease the multiplier cost and the coefficient ROM size by means of the complex conjugate symmetry rule and the subexpression elimination technique. To further decrease the chip cost, a finite wordlength analysis shows that the proposed architecture requires only a 13-bit internal wordlength to achieve 40 dB SNR performance in the 256-point FFT/IFFT modes and high digital video (DV) compression quality in the 8×8 2-D DCT mode. The comprehensive comparison results indicate that the proposed cost-effective reconfigurable design has the smallest hardware requirement and the largest hardware utilization among the tested architectures for the FFT/IFFT computation, and thus has the highest cost efficiency. The derivation and chip implementation results show that the proposed pipeline 256-point FFT/IFFT/2-D DCT triple-mode chip consumes 22.37 mW at 100 MHz and a 1.2 V supply voltage in a TSMC 0.13 µm CMOS process, which makes it well suited as an RSoC IP for next-generation handheld devices.


1.4 Organization

The remainder of this thesis is organized as follows.

Chapter 2 reviews the literature related to the work presented in this thesis, covering four topics. The first topic is a review of the Goertzel algorithm and its hardware architecture. The second topic is a review of mixed-radix based FFT algorithms. The third topic is a comparative review of high-radix based FFT algorithms. The final topic is a review of the DCT algorithm.

Chapter 3 presents a new recursive DFT/IDFT algorithm and architecture that combines input strength reduction, Chebyshev polynomial, and register-splitting schemes. Applying this new architecture, the DTMF application is demonstrated. After bit-level SNR simulation, the 212/106-point DFT/IDFT chip has been successfully implemented for the DTMF detector system. Furthermore, the comparison results are tabulated in terms of the number of computation cycles for each output as well as for the N-point DFT/IDFT, the maximum channel density, the clock period, and the number of real multipliers.

Chapter 4 describes a modified radix-2/8 FFT/IFFT algorithm. Using this mixed-radix based algorithm, we discuss the corresponding R28MDF and R28MDC architectures and the detailed timing considerations. Furthermore, the implementation issues are discussed. Finally, the comparison results of the 64-point FFT/IFFT architectures for the 2×2 and 4×4 MIMO-OFDM systems are summarized.

Chapter 5 describes the new radix-4² and radix-4³ FFT/IFFT algorithms. Applying these algorithms, the proposed R42SDF and R43SDF VLSI architectures are demonstrated. Based on the finite word-length analysis, we show that the proposed architectures achieve satisfactory system performance. Furthermore, the comparison results in terms of hardware utilization and cost demonstrate the high cost-efficiency of the proposed architectures. The chip implementation is also presented.

Chapter 6 describes a new triple-mode radix-4² FFT/IFFT and 8×8 2-D DCT algorithm. Using the proposed radix-4² algorithm, the proposed R42SDF based FFT/IFFT/2-D DCT pipeline architecture is demonstrated. The finite wordlength analysis indicates that the proposed architecture achieves the required system performance in both the 256-point FFT/IFFT and 8×8 2-D DCT modes with the lowest hardware cost. The comparison results in terms of hardware utilization and cost demonstrate the high cost-efficiency of the proposed architecture. Finally, the chip implementation is presented.


Chapter 2 Literature Review

The research work described in this thesis pertains to the design and realization of highly effective pipeline processors for DFT/IDFT computations in the different applications discussed in Chapter 1. In this chapter, we consider a number of algorithms for computing the DFT. The algorithms vary in efficiency, but all of them require fewer multiplications and additions than direct evaluation of the DFT. This chapter reviews four topics related to the four applications discussed in Chapter 1. First, the Goertzel algorithm and its hardware architecture are reviewed. Second, mixed-radix based FFT algorithms are reviewed. Third, a comparative review of high-radix based FFT algorithms is given. Finally, the algorithm mapping between the FFT and the DCT is reviewed in detail.

2.1 The Goertzel Algorithm

In this section, we first discuss the Goertzel algorithm [4], which requires computation proportional to $N^2$, but with a smaller constant of proportionality than that of the direct computation of the DFT. Notably, the Goertzel algorithm is not restricted to computation of the DFT, but is equally valid for the computation of any desired set of samples of the Fourier transform of a sequence. By exploiting the periodicity of the sequence $W_N^{kn}$, the Goertzel algorithm efficiently reduces the computational complexity of the DFT.

2.1.1 The Recursive DFT Algorithm

Given the input sequence x[n] and the DFT output sequence X[k], the N-point DFT is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \tag{1}$$

where $W_N = e^{-j2\pi/N}$. The Goertzel algorithm [4] makes use of the periodicity of the sequence $W_N^{kn}$ to reduce computation. For convenience in deriving a new architecture, we begin the review of the recursive DFT expression based on the Goertzel algorithm by noting that

$$W_N^{-kN} = e^{j(2\pi/N)Nk} = e^{j2\pi k} = 1. \tag{2}$$

Because of Eq. (2), we may multiply the right side of Eq. (1) by $W_N^{-kN}$ without affecting the equation. Thus,

$$X[k] = W_N^{-kN} \sum_{r=0}^{N-1} x[r]\, W_N^{kr} = \sum_{r=0}^{N-1} x[r]\, W_N^{-k(N-r)}. \tag{3}$$

In order to simplify the final expression, let us define the sequence

$$y_k[n] = \sum_{r=-\infty}^{\infty} x[r]\, W_N^{-k(n-r)}\, u[n-r]. \tag{4}$$

From Eqs. (3) and (4) and the fact that x[n] = 0 for n < 0 and n ≥ N, it follows that

$$X[k] = y_k[n]\,\big|_{n=N}. \tag{5}$$

Eq. (4) can be interpreted as a discrete convolution of the finite-duration sequence x[n], 0 ≤ n ≤ N-1, with the sequence $W_N^{-kn}u[n]$. Consequently, $y_k[n]$ can be regarded as the response of a system with impulse response $W_N^{-kn}u[n]$ to the finite-length input x[n]. In particular, X[k] is the value of the output when n = N. Taking the z-transform of Eq. (4), we obtain the first-order transfer function

$$H_k(z) = \frac{1}{1 - W_N^{-k} z^{-1}}. \tag{6}$$

It is possible to retain this simplification while reducing the number of multiplications by a factor of 2. To see how, consider again the transfer function of the first-order recursive DFT structure. Multiplying both the numerator and the denominator of $H_k(z)$ by the factor $(1 - W_N^{k} z^{-1})$, we obtain the second-order transfer function

$$H_k(z) = \frac{1 - W_N^{k} z^{-1}}{(1 - W_N^{-k} z^{-1})(1 - W_N^{k} z^{-1})} = \frac{1 - W_N^{k} z^{-1}}{1 - 2\cos(2\pi k/N)\, z^{-1} + z^{-2}}. \tag{7}$$
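
For reference, the real-coefficient denominator of Eq. (7) follows in one step from Euler's formula. Written out with the symbols defined above:

$$
\begin{aligned}
(1 - W_N^{-k} z^{-1})(1 - W_N^{k} z^{-1})
&= 1 - \left(W_N^{k} + W_N^{-k}\right) z^{-1} + W_N^{-k} W_N^{k}\, z^{-2} \\
&= 1 - \left(e^{-j2\pi k/N} + e^{j2\pi k/N}\right) z^{-1} + z^{-2} \\
&= 1 - 2\cos(2\pi k/N)\, z^{-1} + z^{-2}.
\end{aligned}
$$

The two complex-conjugate poles therefore combine into a single real feedback coefficient, $2\cos(2\pi k/N)$, which is what allows the pole section to be computed with real multiplications only.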

2.1.2 The Recursive DFT Architecture

Fig. 2: (a) Block diagram of the first-order recursive DFT structure and (b) a multiplexer-type dash-line implementation with down-sampling value of N.

Eq. (6) can be mapped onto the first-order recursive DFT structure shown in Fig. 2(a), where initial rest conditions are assumed and the vertical dashed line denotes down-sampling by N on each signal path that crosses it. The dashed line in Fig. 2(a) can be implemented by either a multiplexer-type or a register-type down-sampling realization. Here, we adopt the multiplexer-type realization shown in Fig. 2(b) because it needs less area and maps the equation exactly onto the architecture. In Fig. 2(b), if sel=1, the lower-side signal is passed to the output; otherwise, the upper-side signal is selected as the output of the multiplexer. Since the input x[n] and the coefficient $W_N^{-k}$ are both complex, computing each new value of $y_k[n]$ with the first-order structure of Fig. 2(a) requires four real multiplications and four real additions. All the intervening values $y_k[1], y_k[2], \ldots, y_k[N-1]$ must be computed in order to compute $y_k[N] = X[k]$, so the first-order recursive DFT structure requires 4N real multiplications and 4N real additions to compute X[k] for a particular value of k. Hence, the first-order recursive DFT architecture still requires a large number of multiplications, even though it avoids the computation or storage of the coefficients $W_N^{kn}$ of Eq. (1) at each time index n.
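
In the time domain, Eq. (6) corresponds to the difference equation y_k[n] = W_N^{-k} y_k[n-1] + x[n] with initial rest conditions. The following Python sketch is a software illustration of that recursion only, not of the hardware structure in Fig. 2; the function names are hypothetical. It evaluates one DFT bin recursively and can be checked against the direct sum of Eq. (1).

```python
import cmath


def dft_bin_direct(x, k):
    """Eq. (1): direct evaluation of a single DFT bin X[k]."""
    n_pts = len(x)
    return sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
               for n in range(n_pts))


def dft_bin_first_order(x, k):
    """First-order recursion y_k[n] = W_N^{-k} * y_k[n-1] + x[n]; X[k] = y_k[N]."""
    n_pts = len(x)
    w_inv = cmath.exp(2j * cmath.pi * k / n_pts)  # W_N^{-k}
    y = 0j                                        # initial rest: y_k[-1] = 0
    for sample in x:                              # n = 0 .. N-1
        y = w_inv * y + sample                    # one complex multiply per sample
    return w_inv * y                              # n = N, where x[N] = 0


if __name__ == "__main__":
    x = [1, 2 - 1j, 0.5j, -3, 4, 0, 1.5, -2j]
    for k in range(len(x)):
        print(k, abs(dft_bin_first_order(x, k) - dft_bin_direct(x, k)))
```

Only the single coefficient W_N^{-k} is needed per bin, but every iteration performs a full complex multiplication, which is the cost that the second-order form reduces.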

Eq. (7) can be mapped into the second-order recursive DFT structure as shown in Fig. 3.

Fig. 3: Block diagram of the second-order recursive DFT structure.

In Fig. 3, only two real multiplications per sample are required to implement the poles of the system. Note that the coefficients in the denominator of Eq. (7) are real and that the factor -1 need not be counted as a multiplication. It is worth emphasizing that the complex multiplication by $-W_N^{k}$ required to implement the zero of the transfer function need not be performed at every iteration of the difference equation, but only after the Nth iteration. Thus, the total computation is 2N real multiplications and 4N real additions for the poles plus four real multiplications and four real additions for the zero. The coefficients $W_N^{kn}$ are again computed implicitly in the iteration of the recursion formula implied by Fig. 3. The second-order recursive DFT structure thus decreases the number of multiplications by means of the Goertzel algorithm; however, this comes at the cost of more multipliers and a longer critical period. Hence, the structures in Figs. 2(a) and 3 are not efficient.
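
The second-order form of Eq. (7) can be sketched in software as follows. This is a minimal Python illustration with hypothetical function names, assuming the usual split of Eq. (7) into a pole section v[n] = x[n] + 2cos(2πk/N) v[n-1] - v[n-2] that runs for every sample and a zero section y_k[N] = v[N] - W_N^k v[N-1] that is applied once at the end; it does not model the hardware of Fig. 3.

```python
import cmath
import math


def dft_bin_direct(x, k):
    """Eq. (1): direct evaluation of a single DFT bin X[k]."""
    n_pts = len(x)
    return sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
               for n in range(n_pts))


def dft_bin_second_order(x, k):
    """Second-order (Goertzel) recursion derived from Eq. (7)."""
    n_pts = len(x)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n_pts)  # real pole coefficient
    v1 = 0j  # v[n-1], initial rest
    v2 = 0j  # v[n-2], initial rest
    for sample in list(x) + [0]:        # n = 0 .. N, with x[N] = 0
        v0 = sample + coeff * v1 - v2   # real coefficient times complex state
        v2, v1 = v1, v0
    w_k = cmath.exp(-2j * math.pi * k / n_pts)  # W_N^k
    return v1 - w_k * v2                # zero applied once: v[N] - W_N^k v[N-1]


if __name__ == "__main__":
    x = [1, 2 - 1j, 0.5j, -3, 4, 0, 1.5, -2j]
    for k in range(len(x)):
        print(k, abs(dft_bin_second_order(x, k) - dft_bin_direct(x, k)))
```

Inside the loop the only multiplication is a real coefficient times a complex state, i.e. two real multiplications per sample, while the single complex multiplication by W_N^k is deferred to the final step.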


2.2 The Review of FFT Algorithm

Because of its heavy computational load, direct evaluation of the entire DFT accumulates serious quantization noise. The FFT is a family of algorithms that significantly speed up the computation of the DFT; by reducing the number of arithmetic operations, FFT-based algorithms also lower the quantization noise. Notably, the design with the highest computational complexity also has the highest power consumption [26, 28, 31-33]. The best known of these algorithms is attributed to Cooley and Tukey and applies when the number of points N is a power of two [3]. The number of applications for specific FFTs continues to grow and includes such diverse areas as speech recognition, video/audio codecs and MIMO-OFDM based mobile communication. There are many ways to measure the complexity and efficiency of an implementation or algorithm, and a final assessment depends on both the available technology and the intended application [62]. The numbers of arithmetic multiplications and additions are well known to be good measures of computational complexity. In this section, some popular FFT algorithms are reviewed first. Some well-known pipeline FFT architectures are then discussed in detail. Finally, some design issues are noted, such as high-throughput and long-length FFT design.

Depending on which sequence is decomposed, two common classes of FFT algorithms are distinguished: decimation-in-time (DIT) and decimation-in-frequency (DIF) based FFT algorithms. With in-place computation, one form can be conveniently converted into the other [62]. Since there is no difference in computational complexity and signal flow graph (SFG) between the two types of algorithms, this thesis focuses on DIF based FFT algorithms. As discussed above, low computational complexity is desirable for high-speed and low-power VLSI implementation. In this sub-section, the radix-2, radix-4 and radix-8 DIF equations are therefore discussed first to compare the computational complexity of the different FFT algorithms.


2.2.1 Radix-2 DIF FFT Algorithm

The DIF FFT algorithms are all based on structuring the DFT computation by forming smaller and smaller subsequences of the output sequence X[k]. With N restricted to a power of 2, the radix-2 DIF FFT algorithm computes the even-numbered frequency samples and the odd-numbered frequency samples separately. Separating X[k] into X[2r] and X[2r+1], we obtain the following equations.

$$X[2r] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(2r)} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{2rn} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{2rn}, \tag{8}$$

$$X[2r+1] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(2r+1)} = \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2r+1)} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{n(2r+1)}, \tag{9}$$

where r = 0, 1, …, (N/2 - 1). Due to the periodicity of $W_N^{2rn}$, we can substitute the variables in the second summation to obtain the following equations.

$$
\begin{aligned}
X[2r] &= \sum_{n=0}^{N/2-1} x[n]\, W_N^{2rn} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{2r(n+N/2)} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{2rn} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{2rn} \\
&= \sum_{n=0}^{N/2-1} \{x[n] + x[n+N/2]\}\, W_N^{2rn}
 = \sum_{n=0}^{N/2-1} \{x[n] + x[n+N/2]\}\, W_{N/2}^{rn},
\end{aligned} \tag{10}
$$

$$
\begin{aligned}
X[2r+1] &= \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2r+1)} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{(2r+1)(n+N/2)} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{n(2r+1)} - \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{n(2r+1)} \\
&= \sum_{n=0}^{N/2-1} \{x[n] - x[n+N/2]\}\, W_N^{n(2r+1)}
 = \sum_{n=0}^{N/2-1} \{x[n] - x[n+N/2]\}\, W_N^{n}\, W_{N/2}^{rn}.
\end{aligned} \tag{11}
$$

Following a similar decomposition procedure, the two N/2-point DFTs can be further decomposed into four N/4-point DFTs. After $\log_2 N$ recursive decompositions, we obtain the radix-2 DIF FFT algorithm.
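
The recursive decomposition of Eqs. (10) and (11) can be written directly as a short program. The following Python sketch is a plain recursive illustration with hypothetical function names, not a pipelined or in-place implementation: the even-indexed outputs come from the N/2-point DFT of the sums, and the odd-indexed outputs from the N/2-point DFT of the twiddled differences.

```python
import cmath


def dft_direct(x):
    """Direct N-point DFT, used only as a reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]


def fft_dif_radix2(x):
    """Recursive radix-2 DIF FFT; len(x) must be a power of two."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    if n_pts % 2:
        raise ValueError("length must be a power of two")
    half = n_pts // 2
    # Eq. (10): X[2r] is the N/2-point DFT of x[n] + x[n + N/2].
    sums = [x[n] + x[n + half] for n in range(half)]
    # Eq. (11): X[2r+1] is the N/2-point DFT of (x[n] - x[n + N/2]) * W_N^n.
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_pts)
             for n in range(half)]
    out = [0j] * n_pts
    out[0::2] = fft_dif_radix2(sums)    # even-numbered frequency samples
    out[1::2] = fft_dif_radix2(diffs)   # odd-numbered frequency samples
    return out


if __name__ == "__main__":
    x = [complex(n, -n % 3) for n in range(16)]
    err = max(abs(a - b) for a, b in zip(fft_dif_radix2(x), dft_direct(x)))
    print("max error:", err)
```

Each recursion level applies N/2 twiddle multiplications, and there are $\log_2 N$ levels, which is where the complexity advantage over the direct DFT comes from.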

Direct computation of the N-point DFT requires $N^2$ complex multiplications and N(N-1) complex additions. It is well known that each complex multiplication requires four real multiplications and two real additions, and each complex addition requires two real additions. The direct computation of the DFT of a sequence x[n] therefore requires a total of $4N^2$ real multiplications and N(4N-2) real additions. From Eqs. (10) and (11), the radix-2 algorithm requires $(N/2)\log_2 N$ complex multiplications and $N\log_2 N$ complex additions. Alternatively, the radix-2 algorithm requires $\frac{3}{2}N\log_2 N - \frac{7}{2}N + 8$ real multiplications and $\frac{5}{2}N\log_2 N - \frac{7}{2}N + 8$ real additions.
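
The real-operation totals for the direct DFT follow from the per-operation costs stated above. As a worked check, with N = 64 chosen here purely as an illustrative value:

$$
\begin{aligned}
\text{real multiplications} &= 4 \cdot N^2 = 4N^2, \\
\text{real additions} &= 2 \cdot N^2 + 2 \cdot N(N-1) = 4N^2 - 2N = N(4N-2), \\
N = 64:\quad 4N^2 &= 16384, \qquad N(4N-2) = 64 \cdot 254 = 16256.
\end{aligned}
$$

The first term of the addition count comes from the two real additions inside each of the $N^2$ complex multiplications, and the second from the two real additions in each of the N(N-1) complex additions.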

2.2.2 Radix-4 DIF FFT Algorithm

From the discussion in subsection 2.2.1, it is obvious that the radix-2 DIF FFT algorithm computes the DFT more efficiently than the direct method. Compared with the radix-2 algorithm, the radix-4 algorithm further reduces the computational complexity while keeping the same regularity in each butterfly computation. A radix-4 DIF FFT algorithm can be derived by recursively decimating the frequency series into four subsets. Separating X[k] into X[4r], X[4r+1], X[4r+2] and X[4r+3], we obtain the following equations.

$$X[4r] = \sum_{n=0}^{N-1} x[n]\, W_N^{4rn} = \sum_{n=0}^{N/4-1} x[n]\, W_N^{4rn} + \sum_{n=N/4}^{N/2-1} x[n]\, W_N^{4rn} + \sum_{n=N/2}^{3N/4-1} x[n]\, W_N^{4rn} + \sum_{n=3N/4}^{N-1} x[n]\, W_N^{4rn} \tag{12}$$

$$X[4r+1] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(4r+1)} = \sum_{n=0}^{N/4-1} x[n]\, W_N^{n(4r+1)} + \sum_{n=N/4}^{N/2-1} x[n]\, W_N^{n(4r+1)} + \sum_{n=N/2}^{3N/4-1} x[n]\, W_N^{n(4r+1)} + \sum_{n=3N/4}^{N-1} x[n]\, W_N^{n(4r+1)} \tag{13}$$

$$X[4r+2] = \sum_{n=0}^{N-1} x[n]\, W_N^{n(4r+2)}$$
