A Variable FFT for MIMO-OFDM Systems over Wi-MAX Applications

National Chiao Tung University

Department of Electrical and Control Engineering

Master's Thesis

Student: Bo-Xian Ye

Advisor: Prof. Shang-Ho Tsai

A Variable FFT for MIMO-OFDM Systems over Wi-MAX Applications

Student: Bo-Xian Ye

Advisor: Shang-Ho Tsai

National Chiao Tung University

Department of Electrical and Control Engineering

Master's Thesis

A Thesis Submitted to the Department of Electrical and Control Engineering, College of Electrical Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electrical and Control Engineering

November 2008

Hsinchu, Taiwan, Republic of China

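A Variable FFT for MIMO-OFDM Systems over Wi-MAX Applications

Student: Bo-Xian Ye

Advisor: Shang-Ho Tsai

Master's Program, Department (Institute) of Electrical and Control Engineering, National Chiao Tung University

Abstract (Chinese)

In this thesis, we introduce a variable-length fast Fourier transform (FFT) that can be applied to Wi-MAX systems. It supports multiple FFT lengths and multi-antenna transmission. The 2048/1024/512/128-point variable-length FFT is based on the radix-2 and radix-$2^3$ FFT algorithms. We also propose a memory sharing method to reduce memory usage. Compared with the R2SDF approach, this method reduces the ROM table size from 1023N/1024 to N/4, where N is the FFT length. In addition, we use the radix-$2^3$ FFT algorithm to reduce the number of complex multipliers, and we use a modified complex multiplier to lower the gate count, which in turn saves power. The proposed variable-length FFT is fabricated in a TSMC 0.18 um CMOS process with a chip area of 25 mm$^2$. The processor requires 181 mW when operating at 40 MHz.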

A Variable FFT for MIMO-OFDM Systems over Wi-MAX Applications

Student: Bo-Xian Ye

Advisor: Dr. Shang-Ho Tsai

Department (Institute) of Electrical and Control Engineering

National Chiao Tung University

ABSTRACT

In this thesis, we present a variable FFT that supports multiple FFT sizes and multiple antennas for Wi-MAX systems. The 2048/1024/512/128-point variable FFT is based on the radix-2 and radix-$2^3$ FFT algorithms. We propose a memory sharing method to reduce the memory size. Compared with R2SDF, this method reduces the ROM table size from 1023N/1024 to N/4, where N is the FFT size. Furthermore, we use the radix-$2^3$ FFT algorithm to reduce the number of complex multipliers, and the modified complex multiplier leads to a smaller gate count. Thus, the power consumption can be reduced as well. The proposed variable FFT is fabricated using a TSMC 0.18 um CMOS technology with a chip area of 25 mm$^2$. The average dynamic power consumption is 181 mW at a 40 MHz operating frequency.

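Acknowledgements

My two years of graduate study are finally coming to a close. For the successful completion of this thesis, I would first like to thank my advisor, Prof. Shang-Ho Tsai. Over these two years he tirelessly led us, step by step, into the field of communication IC design. I greatly admire his dedication to research and his seemingly superhuman stamina, which gave me a clearer goal in my own studies. I hope that, busy as he is, he will take good care of his health. I also thank my oral defense committee members, Prof. 林源倍, Prof. 簡鳳村, and Prof. 董蘭榮, whose experience and suggestions made this thesis more complete.

I also thank the senior students and classmates of Lab 535; your help and advice with coursework resolved many of my doubts. I am likewise grateful to the labmates who worked hard alongside me and showed me more ways of thinking through the difficulties I met in my research. Thanks also to the junior students who joined the lab; you made my graduate life more enjoyable.

Finally, I want to thank my wonderful mother, who has always quietly given so much for me and supported me from behind the scenes. Your support and encouragement give me more strength and confidence in everything I do.

I dedicate this thesis to everyone who has cared for and helped me. Thank you.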


A Variable FFT for MIMO-OFDM Systems over

Wi-MAX Applications

Bo-Xian Ye

Advisor: Dr. Shang-Ho Tsai

Department of Electrical and Control Engineering

National Chiao Tung University

November 28, 2008

Abstract

In this thesis, we present a variable FFT that supports multiple FFT sizes and multiple antennas for Wi-MAX systems. The 2048/1024/512/128-point variable FFT is based on the radix-2 and radix-$2^3$ FFT algorithms. We propose a memory sharing method to reduce the memory size. Compared with R2SDF, this method reduces the memory size from 1023N/1024 to N/4, where N is the FFT size. Furthermore, we use the radix-$2^3$ FFT algorithm to reduce the number of complex multipliers, and the modified complex multiplier leads to a smaller gate count. Thus, the power consumption can be reduced as well. The proposed variable FFT is fabricated using a TSMC 0.18 um CMOS technology with a chip area of 25 mm$^2$. The average dynamic power consumption is 181 mW at a 40 MHz operating frequency.

List of Figures

2.1 The SFG of 128-point mixed-radix FFT.
2.2 Block diagram FFT/IFFT.
2.3 Block diagram of the 128/64-point FFT/IFFT processor.
2.4 (a) Order of input; (b) Order of output.
2.5 Block diagram of Module 1.
2.6 Relation between Module 1 input and Module 1 output.
2.8 (a) Architecture of multiplexer; (b) Operation mode of multiplexer.
2.9 Architecture of four antenna R2SDF FFT 128-point at stage one.
2.10 Save data in memory.
2.11 Operation of radix-2.
2.12 Module 2 memory bank.
2.14 Two operation mode.
2.15 Eight region of twiddle factor.
2.17 Architecture of R2MDC.
2.18 Architecture of R2SDF.
2.19 Block diagrams of a variable FFT processor.
2.20 The SFG of 64-point mixed-radix FFT.
3.1 The SFG of stage 1 to stage 5 (radix-2).
3.2 The SFG of stage 6 to stage 7 (radix-$2^3$).
3.4 The input and output relationship of FFT.
3.5 (a) Read and write with column; (b) Read and write with row.
3.6 Relation between Module 1 input and Module 1 output.
3.8 Memory sharing from Module 2 to Module 6.
3.9 (a) ROM table at clock cycle 1024 to 2047; (b) ROM table at clock cycle 2048 to 3071.
3.10 (a) ROM table at clock cycle 1024 to 2047; (b) ROM table at clock cycle 2048 to 3701.
3.11 Analysis for critical path.
3.12 The FFT critical path.
3.13 System model of SQNR.
3.14 SQNR v.s. SNR.
4.1 A Design flow.
4.2 Simulation environment for variable FFT.
4.3 Bit-width in all stage.
4.4 BIST circuit.
4.5 From Flip-Flop to scan Flip-Flop.

List of Tables

2.1 Mapping table of twiddle factors in different regions.
2.2 FFT size in several OFDM systems.
3.1 Comparison of hardware requirement.
4.1 Expected chip performance of the proposed FFT processor.


Chapter 1 Introduction

1.1 Motivation and goal

Wi-MAX (Worldwide Interoperability for Microwave Access) is a technology aimed at wireless metropolitan area networking (WMAN) applications. It was developed by the IEEE 802.16 group and was adopted by both the IEEE and the ETSI HIPERMAN groups. The IEEE 802.16 group was formed in 1998 to develop an air-interface standard for wireless broadband. Initially, Wi-MAX solutions targeted fixed applications, e.g. IEEE 802.16-2004, which is also called fixed Wi-MAX. In December 2005, the IEEE group completed and approved the IEEE 802.16e-2005 standard, an amendment to the IEEE 802.16-2004 standard that added mobility support. IEEE 802.16e-2005 is often called mobile Wi-MAX. Wi-MAX offers a rich set of features with a lot of flexibility in deployment options and potential service offerings. Some important physical-layer features of Wi-MAX are as follows:

• OFDM techniques:

The Wi-MAX physical layer is based on orthogonal frequency division multiplexing (OFDM), which is robust to multipath effects. Thus, it can be used in non-line-of-sight (NLOS) environments. In OFDM-based systems, the FFT (Fast Fourier Transform) is a key component and hence has been widely studied in recent years.

• Variable bandwidth:

Wi-MAX supports variable bandwidth in the physical layer. That is, it can adjust the transmission rate by changing the bandwidth. As a result, FFTs of various sizes are needed. For example, the bandwidth of a Wi-MAX system can be 1.25 MHz, 5 MHz, 10 MHz, or 20 MHz, corresponding to the 128-, 512-, 1024-, and 2048-point FFT, respectively. This dynamic adjustment of FFT size and bandwidth allocation enables users to roam across different networks. Thus, the variable FFT is a key component for OFDM systems with various bandwidths.

• MIMO techniques:

The Wi-MAX solution uses multiple-antenna techniques, such as beamforming, space-time coding, and spatial multiplexing, to enhance system performance, including the overall system capacity and spectral efficiency. Therefore, an FFT architecture that can be used efficiently in MIMO-OFDM systems is also important.

The traditional FFT algorithms can be roughly classified into three types. The first type is the fixed-radix FFT algorithm, which can be further divided into the radix-2, radix-4/radix-$2^2$, and radix-8/radix-$2^3$ algorithms [1] - [3]. The second type is the split-radix FFT algorithm, which can be divided into the radix-2/4, radix-2/8, and radix-2/4/8 algorithms [4] - [5]. The third type is the mixed-radix algorithm, which can be performed using the common-factor algorithm (CFA) or the prime-factor algorithm (PFA) [6].

The traditional pipeline FFT architectures can be roughly classified into two types. The first type is the single-path delay feedback (SDF), which can be divided into R2SDF, R4SDF/R$2^2$SDF, and R8SDF/R$2^3$SDF. The second type is the multi-path delay commutator (MDC), which can be further divided into R2MDC, R4MDC/R$2^2$MDC, and R8MDC/R$2^3$MDC. Architectures other than the pipeline include the memory-based FFT [6], the CORDIC-based FFT [7], and the systolic FFT [8] - [16]. The above architectures are not suitable for Wi-MAX systems without modification, since Wi-MAX systems need both MIMO support and a variable FFT size.

Recently, some architectures for MIMO FFT or variable FFT have been proposed. For example, for the combination of MIMO and OFDM, the mixed-radix multi-path delay feedback (MRMDF) architecture was proposed by Lin [19]. The authors combine the mixed-radix, multi-path, and feedback techniques in the FFT and apply it to the IEEE 802.11n standard. As for the variable FFT, the authors in [8] used radix-$2^3$ and multiplexers to implement a low-power FFT architecture. For other variable FFT architectures, refer to [9] - [11]. In general, to the best of our knowledge, the number of FFT processing units must be increased in order to increase the throughput rate in MIMO-OFDM systems. Thus, the hardware complexity and power consumption increase as well. When MIMO-OFDM needs a variable FFT size, the hardware complexity increases dramatically. However, little research has been conducted on architectures that combine MIMO-OFDM and variable FFT for Wi-MAX systems. A good processor not only needs to support a high throughput rate and variable FFT sizes, but also needs to be efficient for hardware implementation. It is challenging to combine the advantages of MIMO-OFDM and variable FFT size.

1.2 Contributions and Features

The contributions of this research include:

• We propose a method that reduces the memory size to 25% of that required by R2SDF in Wi-MAX: Since the maximum supported FFT size is 2048 points and the major part of the FFT module uses the radix-2 algorithm, the memory accounts for a large share of the gate count. Thus, reducing the memory leads to a reduction in die area and power consumption.

• Multiplier sharing: In each radix-2 stage, the number of complex multipliers can be reduced from 4 to 2. Thus, the utilization rate of the multipliers increases from 50% to 100%.

• High radix to reduce the complexity: Because a higher-radix algorithm can reduce the number of multipliers and the power consumption, we employ radix-$2^3$

Chapter 2 Background

2.1 FFT for MIMO systems

The combination of multiple-input multiple-output (MIMO) signal processing with orthogonal frequency-division multiplexing (OFDM) is considered a promising solution for enhancing the data rates of next-generation wireless communication systems operating in frequency-selective fading environments. Because MIMO can increase the data rate of an OFDM system, the IEEE 802.11n standard, which uses a MIMO-OFDM system, raises the data throughput from the original 54 Mb/s to in excess of 600 Mb/s. However, the IEEE 802.11n standard also increases the computational and hardware complexity compared with the current SISO standards. It is a challenge to realize the physical layer of a MIMO-OFDM system with small hardware complexity and power consumption in a very-large-scale integration (VLSI) implementation. With the traditional approach to processing multiple simultaneous data sequences, several FFT/IFFT processors are needed in the physical layer of a MIMO-OFDM system. We therefore review the fast Fourier transform (FFT)/inverse FFT (IFFT) architecture proposed by Lin for MIMO-OFDM systems [19]. The mixed-radix multi-path delay feedback (MRMDF) FFT architecture provides a higher throughput rate at small hardware cost and supports the transmission of 1-4 data sequences. The MRMDF architecture utilizes the advantages of the following two FFT architectures: the single-path delay feedback and the multi-path delay commutator [2].

2.1.1 Algorithm

A basic N-point discrete Fourier transform (DFT) is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1, \tag{2.1}$$

where x(n) and X(k) are complex numbers. The twiddle factor is

$$W_N^{nk} = e^{-j\frac{2\pi nk}{N}} = \cos\frac{2\pi nk}{N} - j\sin\frac{2\pi nk}{N}. \tag{2.2}$$

From equation (2.1), the computational complexity of directly performing the required computation is $O(N^2)$. By using the FFT algorithm, the computational complexity can be reduced to $O(N\log_r N)$, where r denotes the radix of the FFT algorithm. Although a higher-radix FFT algorithm needs fewer processing element (PE) iterations, it generally requires higher PE complexity in implementation. One well-known approach to solving this problem is the method introduced by He and Torkelson [3]: the radix-$2^3$ algorithm replaces the radix-8 algorithm, so that the PE complexity is reduced to that of the radix-2 FFT algorithm. Because 128 is not a power of 8, a mixed-radix algorithm is needed for the 128-point FFT. It combines the radix-2 and radix-8 FFT algorithms and is derived in detail below.

First, let N = 128 and

$$n = 64n_1 + n_2, \qquad n_1 = 0, 1, \qquad n_2 = 0, 1, \ldots, 63,$$

$$k = k_1 + 2k_2, \qquad k_1 = 0, 1, \qquad k_2 = 0, 1, \ldots, 63.$$

Then equation (2.1) can be rewritten as

$$X(2k_2 + k_1) = \sum_{n_2=0}^{63}\sum_{n_1=0}^{1} x(64n_1 + n_2)\, W_{128}^{(64n_1+n_2)(2k_2+k_1)}$$

$$= \sum_{n_2=0}^{63}\Biggl\{\underbrace{\sum_{n_1=0}^{1} x(64n_1 + n_2)\, W_2^{n_1k_1}}_{\text{2-point}}\ \underbrace{W_{128}^{n_2k_1}}_{\text{twiddle factor}}\Biggr\}\underbrace{W_{64}^{n_2k_2}}_{\text{64-point}} \tag{2.3}$$

$$= \sum_{n_2=0}^{63} BU_2(k_1, n_2)\, W_{64}^{n_2k_2}. \tag{2.4}$$
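As a numerical sanity check of the split in equations (2.3) and (2.4), the following minimal Python sketch computes the 128-point DFT as a 2-point butterfly, a twiddle multiplication by $W_{128}^{n_2k_1}$, and a 64-point DFT. The helper names `dft` and `fft128_two_step` are illustrative and not part of the original design, and the 64-point DFT is evaluated directly rather than with the radix-$2^3$ stages.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used here only as a reference and for the 64-point part."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def fft128_two_step(x):
    """128-point DFT via the 2 x 64 split of equations (2.3) and (2.4)."""
    assert len(x) == 128
    out = [0j] * 128
    for k1 in (0, 1):
        # BU_2(k1, n2): 2-point DFT over n1, then the twiddle factor W_128^(n2*k1).
        bu2 = [
            (x[n2] + (-1) ** k1 * x[64 + n2])
            * cmath.exp(-2j * cmath.pi * n2 * k1 / 128)
            for n2 in range(64)
        ]
        # The 64-point DFT over n2 yields X(2*k2 + k1).
        for k2, value in enumerate(dft(bu2)):
            out[2 * k2 + k1] = value
    return out
```

For any 128-sample input, `fft128_two_step(x)` should agree with `dft(x)` up to floating-point rounding.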

Equation (2.3) can be regarded as a two-dimensional DFT: one dimension is a 64-point DFT and the other is a 2-point DFT. This completes the 128-point mixed-radix FFT operation. Furthermore, we can decompose the 64-point DFT into 8-point DFTs by applying the decomposition recursively twice, and replace each 8-point DFT by the radix-$2^3$ FFT algorithm. Because the radix-$2^3$ FFT algorithm is more efficient for VLSI design, the PE complexity is further reduced by using radix-2 processing elements. By a four-dimensional linear index map, $n_2$ and $k_2$ can be rewritten as

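$$\begin{aligned} n_2 &= 32\alpha_1 + 16\alpha_2 + 8\alpha_3 + \alpha_4, & \alpha_1, \alpha_2, \alpha_3 &= 0, 1; & \alpha_4 &= 0, 1, \ldots, 7, \\ k_2 &= \beta_1 + 2\beta_2 + 4\beta_3 + 8\beta_4, & \beta_1, \beta_2, \beta_3 &= 0, 1; & \beta_4 &= 0, 1, \ldots, 7. \end{aligned} \tag{2.5}$$

By means of equation (2.5), equation (2.4) takes the form

$$X\bigl(2(\beta_1 + 2\beta_2 + 4\beta_3 + 8\beta_4) + k_1\bigr) = \sum_{\alpha_4=0}^{7}\sum_{\alpha_3=0}^{1}\sum_{\alpha_2=0}^{1}\sum_{\alpha_1=0}^{1} BU_2(k_1, 32\alpha_1 + 16\alpha_2 + 8\alpha_3 + \alpha_4)\, W_{64}^{(32\alpha_1+16\alpha_2+8\alpha_3+\alpha_4)(\beta_1+2\beta_2+4\beta_3+8\beta_4)}$$

$$= \sum_{\alpha_4=0}^{7} BU_8(k_1, \beta_1, \beta_2, \beta_3, \alpha_4)\, W_8^{\alpha_4\beta_4}, \tag{2.6}$$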

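where $BU_8(k_1, \beta_1, \beta_2, \beta_3, \alpha_4)$ is given by

$$BU_8(k_1, \beta_1, \beta_2, \beta_3, \alpha_4) = \sum_{\alpha_3=0}^{1}\sum_{\alpha_2=0}^{1}\sum_{\alpha_1=0}^{1} \widetilde{BU}_2\, W_4^{\alpha_2\beta_1}\, W_2^{\alpha_2\beta_2}\, W_8^{\alpha_3(\beta_1+2\beta_2)}\, W_2^{\alpha_3\beta_3}\, W_{64}^{\alpha_4(\beta_1+2\beta_2+4\beta_3)}, \tag{2.7}$$

where $\widetilde{BU}_2 = BU_2(k_1, 32\alpha_1 + 16\alpha_2 + 8\alpha_3 + \alpha_4)\, W_2^{\alpha_1\beta_1}$. The signal flow graph (SFG) of the 128-point mixed-radix FFT algorithm is shown in Fig. 2.1. The radix-2 FFT algorithm is used in the first stage, and the radix-8 FFT algorithm is applied in the second and third stages. The black points between the stages are twiddle factors.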

Figure 2.1: The SFG of 128-point mixed-radix FFT (Stage 1, Stage 2, Stage 3).

The IFFT of an N-point sequence X(k), k = 0, 1, ..., N-1, is defined as

$$x(n) = \frac{1}{N}\sum_{k=0}^{N-1} X(k)\, W_N^{-nk}. \tag{2.8}$$

In order to implement the IFFT algorithm more efficiently, equation (2.8) can be rewritten as

$$x(n) = \frac{1}{N}\left[\sum_{k=0}^{N-1} X^{*}(k)\, W_N^{nk}\right]^{*}. \tag{2.9}$$

According to equation (2.9), the IFFT can be performed by taking the complex conjugate of the input data and then taking the complex conjugate of the output data, without changing any coefficient in the original FFT architecture. Thus, the hardware implementation is more efficient. The block diagram of the FFT/IFFT is shown in Fig. 2.2. Multiplexers are used to switch the operation mode between FFT and IFFT.
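The conjugation identity in equation (2.9) can be sketched in Python as follows. The function names are illustrative; `fft` is a plain recursive radix-2 routine standing in for the hardware FFT core, not the MRMDF architecture itself.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft_via_fft(spectrum):
    """IFFT per equation (2.9): conjugate, forward FFT, conjugate, scale by 1/N."""
    n = len(spectrum)
    conjugated = [complex(v).conjugate() for v in spectrum]
    return [v.conjugate() / n for v in fft(conjugated)]
```

Applying `ifft_via_fft` to the output of `fft` should return the original sequence up to rounding, which mirrors how the same FFT datapath serves both directions.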

Figure 2.2: Block diagram FFT/IFFT. Multiplexers (MUX) select whether the conjugate blocks [·]* before and after the FFT are used.

2.1.2 Architecture

The MRMDF FFT provides 128/64-point FFT/IFFT operation and supports the transmission of 1-4 data sequences for MIMO-OFDM systems. As shown in Fig. 2.3, the system architecture consists of Module 1 (data reordering), Module 2 (radix-2), Module 3 (radix-$2^3$), Module 4 (radix-$2^3$), conjugate blocks, a division block, and multiplexers. The characteristics of the MRMDF FFT architecture with size 128/64 are the following:

• The 128/64-point FFT with 1-4 simultaneous data sequences can be operated in this design.

• The FFT architecture can provide 1-4 throughput rates to meet the requirements of the IEEE 802.11n standard.

• Only a small memory is needed, thanks to the delay feedback scheme.

• A high throughput rate can be achieved by using the multi-path scheme.

• A higher-radix FFT algorithm can be implemented to save power consumption.

• The modified complex multiplier can be implemented with constant multipliers to save power consumption.

Because the MRMDF architecture is based on a radix-2 butterfly, the order of the output sequence is the bit reversal of the order of the input sequence, as shown in Fig. 2.4. The FFT and IFFT operations are selected by the FFT/IFFT control signal shown in Fig. 2.3. The details of this FFT architecture are described below.

Figure 2.3: Block diagram of the 128/64-point FFT/IFFT processor. Data path: Data In, MUX with conjugate block [·]*, Module 1 (Data Reordering), Module 2 (Radix-2 FFT), MUX, Module 3 (Radix-8 FFT), Module 4 (Radix-8 FFT), conjugate block [·]* with 1/N scaling, MUX, Data Out. The FFT/IFFT Mode signal controls the multiplexers.

a) Module 1: Module 1 contains several delay elements of different sizes and a switch block, as shown in Fig. 2.5. The function of Module 1 is to reorder the input data sequences to achieve two goals. First, it lets Module 2, Module 3, and Module 4 perform the FFT/IFFT operation more efficiently. Second, it prevents the data sequences in Module 3 from being multiplied by the same twiddle factor in every data path simultaneously. Thus, the modified complex multiplier can be used in Module 3 to reduce the hardware complexity by using the shift-and-add method [20]. The operation of Module 1 is shown in Fig. 2.6.

Figure 2.4: (a) Order of input; (b) Order of output.

Figure 2.5: Block diagram of Module 1 (delay units 1, 2, 3, a switch, and delay units 1, 2, 3).

First, the four adjoining data sequences pass through different delay units, and then the four adjoining data are reordered by the appropriate operation of the switch. Finally, the adjoining data are made simultaneous again by different delay units. The reordered data are separated into 32 groups or 16 groups for the 128- or 64-point FFT calculation, respectively. If four data sequences are transmitted, each group contains data from the four sequences A, B, C, and D, and the data in each group have the same sub-index, as shown in Fig. 2.6. If only three data sequences are transmitted, the number of operations in each group is three, and so on. The reordering module makes the FFT/IFFT operation more efficient.

Figure 2.6: Relation between Module 1 input and Module 1 output.

b) Module 2: Module 2 contains memory, two complex multipliers, four butterfly units, two ROM tables, and some multiplexers, as shown in Fig. 2.7. There are two kinds of radix-2 butterfly units in the FFT/IFFT processor. The one in Module 2 is denoted by BF1. The function of BF1 is X(i) = x(i) − y(i) and Y(i) = x(i) + y(i). The dotted rectangle in Fig. 2.7 is redrawn in more detail in Fig. 2.8(a). The control signal of the multiplexer selects one of the two operation modes of data exchange, as shown in Fig. 2.8(b). When a 64-point FFT/IFFT is used in this architecture, the input data skip Module 2 and go directly to Module 3. Four memory units are needed to save the results of the butterfly operation. Only 1/8 cycle of the cosine and sine values needs to be stored in the ROM table, and the other values can be reconstructed from these stored values. Thus, the ROM table size can be reduced. In general, four complex multipliers are needed to implement the FFT/IFFT with four parallel data sequences in the traditional radix-2 SDF architecture, but only two multipliers are needed in this module. We can first multiply two data sequences by their twiddle factors, and then multiply the other two data sequences. We call this method time sharing. Time sharing is explained below.

Figure 2.7: Module 2, with BF1 butterfly units (BU2_A) for the data sequences A, B, C, and D, memory banks BANK1 to BANK4 of depth 64, two ROM tables, two complex multipliers, and 4-input multiplexers (MUX 4).

Time sharing: Consider the first stage of the traditional 128-point R2SDF FFT architecture with four data sequences, as shown in Fig. 2.9. When the data from x(0) to x(63) arrive, they are stored in memory (clock cycles 0 to 63), as shown in Fig. 2.10. When the data from x(64) to x(127) arrive, the radix-2 butterfly starts to work (clock cycles 64 to 127). Then, the sums are fed to the next stage, and the differences are sent back and saved in memory, as shown in Fig. 2.11, where $\hat{A}_0 = A_0 - A_{64}$, $\hat{A}_1 = A_1 - A_{65}$, $\hat{A}_2 = A_2 - A_{66}$, ..., $\hat{A}_{63} = A_{63} - A_{127}$. Finally, the data are read from memory, multiplied by the appropriate twiddle factors, and passed to the next stage (clock cycles 128 to 191). No addition or subtraction is needed in this period, since these operations were completed before clock cycle 128. Consequently, we can utilize the two periods of clock cycles 64 to 127 and 128 to 191 to multiply the four data sequences by twiddle factors. Thus, two multipliers suffice to complete the twiddle-factor multiplication for four data sequences.

Figure 2.8: (a) Architecture of multiplexer; (b) Operation mode of multiplexer (Mode 1 and Mode 2).

Figure 2.9: Architecture of four antenna R2SDF FFT 128-point at stage one.

Figure 2.10: Save data in memory (A0 to A63, B0 to B63, C0 to C63, and D0 to D63).

When a 128-point FFT/IFFT is used in this architecture, the data from x(0) to x(63) arrive first; they are stored in memory, and the multiplexer operates in mode 2 (clock cycles 0 to 63), as shown in Fig. 2.12. When the data from x(64) to x(127) arrive, the radix-2 PE starts to work (clock cycles 64 to 127). Then, the sums are fed to the next stage, and the differences of two of the sequences are multiplied by the appropriate twiddle factors and saved in memory. During this period the multiplexer operates in mode 1. Finally, the data stored in memory are read, the other two sequences are multiplied by the appropriate twiddle factors, and the data are passed to the next stage (clock cycles 128 to 191). In the traditional single-path delay feedback architecture, the utilization rate of the complex multipliers is only 50%. With time sharing, only two complex multipliers are needed and their utilization reaches 100%.

Figure 2.11: Operation of radix-2 (the memory holds the differences $\hat{A}_0$ to $\hat{A}_{63}$, $\hat{B}_0$ to $\hat{B}_{63}$, $\hat{C}_0$ to $\hat{C}_{63}$, and $\hat{D}_0$ to $\hat{D}_{63}$).

Figure 2.12: Module 2 memory bank (BANK1 holds A0 B0 C0 D0, ..., A60 B60 C60 D60; BANK2 holds A1 B1 C1 D1, ..., A61 B61 C61 D61; BANK3 holds A2 B2 C2 D2, ..., A62 B62 C62 D62; BANK4 holds A3 B3 C3 D3, ..., A63 B63 C63 D63).
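The utilization figures quoted for time sharing follow from simple slot counting, as the short Python sketch below illustrates. The schedule representation and the pairing of streams (A and B first, then C and D) are assumptions made for illustration; the text only states that two sequences are multiplied in each period.

```python
def multiplier_utilization(schedule, num_multipliers):
    """Fraction of multiplier slots that are busy over the scheduled periods."""
    busy = sum(len(streams) for streams in schedule)
    return busy / (num_multipliers * len(schedule))


# Each entry lists the streams whose twiddle multiplications run in one
# 64-cycle period. Period 0 is clock cycles 64..127, period 1 is 128..191.
traditional = [[], ["A", "B", "C", "D"]]  # four multipliers, idle while the butterfly runs
time_shared = [["A", "B"], ["C", "D"]]    # two multipliers, busy in both periods (assumed pairing)

print(multiplier_utilization(traditional, 4))  # 0.5
print(multiplier_utilization(time_shared, 2))  # 1.0
```

The same four multiplications per sample index are performed in both cases; time sharing only spreads them over the period in which the traditional multipliers sit idle.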

c) Module 3: Module 3 contains three radix-2 PEs and one modified complex multiplier, as shown in Fig. 2.13. The BU2_B includes a control signal, which selects the operation mode of the radix-2 butterfly, as shown in Fig. 2.14.

Figure 2.13: Module 3, with BF2 butterfly units (BU2_B) in Step 1 to Step 3, the twiddle factors 1, $-j$, $W_8^1$, and $W_8^3$, and the modified complex multiplier.

Figure 2.14: Two operation mode (Mode 1 and Mode 2, selected by the control signal).

Module 3 implements the radix-$2^3$ FFT algorithm proposed by He and Torkelson [3], whose SFG is shown in the second stage of Fig. 2.1. The function of Module 3 is to perform the second-stage butterfly of Fig. 2.1, where each stage is multiplied by the twiddle factors 1, $-j$, $W_8^3$, and $W_8^1$. As Fig. 2.9 suggests, it is inefficient to use four complex multipliers to multiply by different twiddle factors. Here we can utilize an approach proposed by Maharatna [20] to reduce the complexity of the complex multipliers. The twiddle factor in Module 3 is $W_{64}^{p} = e^{-j2\pi p/64} = X_p + jY_p = \cos(2\pi p/64) - j\sin(2\pi p/64)$, where p ranges from 0 to 49, as shown in Fig. 2.15. Due to the symmetric or anti-symmetric properties of the sine and cosine functions, only nine sets of twiddle factors need to be constructed. That is, only $X_p$ and $Y_p$ with p = 0 to 8 in region A are needed, because the twiddle factors in the other seven regions can be obtained by swapping the stored values and changing their signs, as shown in Table 2.1. Thus, these complex values can be realized more efficiently by using the shift-and-add method [20]. The gate count of this method is about 38% lower than that of the approach with four complex multipliers, while its performance is equivalent.

Figure 2.15: Eight region of twiddle factor (regions A to H, bounded by p = 0, 8, 16, 24, 32, 40, 48, 56).

Region   Real     Imaginary
A        X_p      Y_p
B        −Y_p     −X_p
C        Y_p      −X_p
D        −X_p     Y_p
E        −X_p     −Y_p
F        Y_p      X_p
G        −Y_p     X_p
H        X_p      −Y_p

Table 2.1: Mapping table of twiddle factors in different regions.
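The mapping in Table 2.1 can be checked numerically with the sketch below. It stores only the nine region-A values and rebuilds $W_{64}^p$ for every p. The index folding (regions B, D, F, and H read the table backwards, i.e. entry 8 minus the offset within the region) is an assumption, since the table writes the same subscript p in every row; the names `twiddle64` and `REGION_MAP` are illustrative.

```python
import math

# Stored table: region A only, p = 0..8 (nine entries), W_64^p = X_p + j*Y_p.
X = [math.cos(2 * math.pi * p / 64) for p in range(9)]
Y = [-math.sin(2 * math.pi * p / 64) for p in range(9)]

# (real, imaginary) of each region in terms of the stored pair, following Table 2.1.
REGION_MAP = {
    "A": lambda x, y: (x, y),
    "B": lambda x, y: (-y, -x),
    "C": lambda x, y: (y, -x),
    "D": lambda x, y: (-x, y),
    "E": lambda x, y: (-x, -y),
    "F": lambda x, y: (y, x),
    "G": lambda x, y: (-y, x),
    "H": lambda x, y: (x, -y),
}


def twiddle64(p):
    """Rebuild W_64^p from the nine stored region-A entries."""
    p %= 64
    region_index, offset = divmod(p, 8)
    # Assumed folding: odd-numbered regions (B, D, F, H) index the table backwards.
    q = 8 - offset if region_index % 2 else offset
    real, imag = REGION_MAP["ABCDEFGH"[region_index]](X[q], Y[q])
    return complex(real, imag)
```

Under these assumptions, `twiddle64(p)` matches $\cos(2\pi p/64) - j\sin(2\pi p/64)$ for all p, so the ROM only needs the nine region-A pairs.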

d) Module 4: The block diagram of Module 4 is shown in Fig. 2.16. Module 4 implements the radix-$2^3$ FFT algorithm and is directly mapped to the third stage of Fig. 2.1. Although Module 3 and Module 4 both implement the radix-$2^3$ FFT algorithm, the architecture of Module 4 differs from that of Module 3, because this scheme uses the circuit more efficiently and reduces the number of processing units. Referring to Fig. 2.16, the four data sequences are processed simultaneously in one clock cycle, so the data are ready for step 2 and step 3. Hence, the data sequences do not need to be stored in memory at step 2 and step 3.

Figure 2.16: Block diagram of Module 4 (Step 1 with four BF2 units, followed by Step 2 and Step 3).

2.2 Variable FFT

The DAB, DVB-T, VDSL, and Wi-MAX standards need various FFT sizes, as shown in Table 2.2. Hence, the design of a variable FFT for different purposes becomes more important. In this section we present a variable FFT that supports 64-, 32-, 16-, and 8-point operation. The design is easily modified, by adding 128-point or other $2^k$-point FFT stages, to create any required FFT length, where k is an integer. Hence, the circuit is convenient and simple to adapt for DAB, DVB-T, Wi-MAX, and VDSL systems. For other variable FFT designs, please refer to [8].

Communication system   FFT size
Wi-MAX                 128, 512, 1024, 2048
VDSL [15]              8192, 4096, 2048, 1024, 512
DAB [12]               2048, 1024, 512, 256
DVB-T [13]             8192, 2048

Table 2.2: FFT size in several OFDM systems.

2.2.1 Pipeline FFT processor architecture

The traditional radix-2 pipeline FFT architectures can be roughly classified into the multi-path delay commutator and the single-path delay feedback [3]. A radix-2 multi-path delay commutator (R2MDC) architecture with N = 8 is shown in Fig. 2.17. The data sequence is divided into two data paths by the commutator and then properly scheduled on the two paths. The processing element (PE) implements the radix-2 algorithm. The numbers of multipliers, PE units, and delay elements are $(\log_2 N - 2)$, $\log_2 N$, and $(3N/2) - 2$, respectively. A radix-2 single-path delay feedback (R2SDF) architecture with N = 8 is shown in Fig. 2.18. R2SDF uses the delay elements more efficiently than R2MDC by sharing the memory. The numbers of multipliers, PE units, and delay elements for R2SDF are $(\log_2 N - 1)$, $\log_2 N$, and $N - 1$, respectively. The SDF FFT and MDC FFT are described as follows.

• a) SDF FFT: Because the SDF FFT uses feedback to reuse memory, it reduces the memory usage. Its drawback is that the throughput rate is low.

• b) MDC FFT: Because the MDC FFT uses multiple data paths, it increases the throughput rate. Its drawback is that the memory size is large.

Figure 2.17: Architecture of R2MDC.

Figure 2.18: Architecture of R2SDF (three radix-2 PEs with feedback delays of 4, 2, and 1).

2.2.2 Variable FFT processor architecture

Consider a variable FFT that supports 64-point, 32-point, 16-point, and 8-point operation, as shown in Fig. 2.19. The radix-2/$2^3$ 64-point mixed-radix SFG is shown in Fig. 2.20, where stages 1 to 3 use the radix-2 algorithm and stage 4 uses the radix-$2^3$ algorithm. The design uses the above architectures and multiplexers to perform the variable FFT. This FFT handles four transform lengths and is easily modified to any transform length. For the 64-point FFT, all stages are active. For the 32-point FFT, the input data skip the first stage and go directly to the second stage. For the 16-point FFT, the input data skip the first and second stages and go to the third stage. For the 8-point FFT, the input data skip the first three stages and go directly to the fourth stage. Multiplexers are used to switch between the different FFT sizes.
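The stage-skipping behaviour can be modelled in software as below. This is a behavioural sketch under stated assumptions, not the pipeline hardware: each stage is a radix-2 decimation-in-frequency step with exact twiddle factors, the radix-$2^3$ stage is represented by its three radix-2 steps, and the function names are illustrative. A smaller transform bypasses the stages whose block size exceeds its length, which is the role the multiplexers play.

```python
import cmath


def dif_stage(data, block):
    """One radix-2 DIF stage applied to consecutive blocks of size `block`."""
    half = block // 2
    out = list(data)
    for start in range(0, len(data), block):
        for i in range(half):
            a, b = data[start + i], data[start + half + i]
            out[start + i] = a + b
            out[start + half + i] = (a - b) * cmath.exp(-2j * cmath.pi * i / block)
    return out


def variable_fft(x, max_size=64):
    """Run only the stages whose block size does not exceed len(x).

    The result is in bit-reversed order, as expected from radix-2 butterflies.
    """
    n = len(x)
    assert n in (8, 16, 32, 64) and n <= max_size
    data = list(x)
    block = max_size
    while block >= 2:
        if block <= n:  # larger stages are bypassed (the multiplexer path)
            data = dif_stage(data, block)
        block //= 2
    return data


def bit_reverse(i, bits):
    """Reverse the `bits`-bit binary representation of i."""
    return int(format(i, "0{}b".format(bits))[::-1], 2)
```

For an N-point input, entry j of the result holds X(`bit_reverse(j, log2 N)`), so a reordering step is needed if natural-order output is required.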

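Figure 2.19: Block diagrams of a variable FFT processor. Three radix-2 PEs with feedback delays of 32, 16, and 8, each followed by a ROM-driven multiplier and a MUX, precede a radix-$2^3$ block built from radix-2 PEs with delays 4, 2, and 1 and the twiddle factors 1, $-j$, $W_8^1$, and $W_8^3$.

Figure 2.20: The SFG of 64-point mixed-radix FFT (Stage 1 to Stage 4, with the 32-point, 16-point, and 8-point entry points marked).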

Chapter 3 The proposed variable FFT for MIMO systems

In this chapter, we detail the FFT in IEEE 802.16e for MIMO-OFDM applications. The variable FFT supports multiple antennas and 2048/1024/512/128-point FFT sizes. In Sec. 3.1 we derive the FFT algorithm and show the SFG. In Sec. 3.2 we detail the FFT architecture, module by module. In Sec. 3.3 we compare the hardware requirements of several classes of FFT with the proposed approach for the 2048-point FFT. Finally, in Sec. 3.4 we show the SQNR simulation and explain how the bit width is determined.

3.1 Algorithm

From (2.1), the N-point DFT operation can be decomposed into $N_1 \times N_2 \times \cdots \times N_k$ point DFT operations. We use radix-$2^3$ as much as possible to reduce the number of multipliers. The decomposition is shown in equation (3.1).

$$N = 2048 = \underbrace{\underbrace{\underbrace{\underbrace{8 \times 8 \times 2}_{\text{128-point}} \times 2 \times 2}_{\text{512-point}} \times 2}_{\text{1024-point}} \times 2}_{\text{2048-point}}. \tag{3.1}$$

(36)

n = N₂n₁+ n₂= 1024n₁+ n₂, n₁ = 0, 1 n₂ = 0, 1, . . . , 1023 and k = k₁+ N₁k₂= k₁+ 2k₂, k₁ = 0, 1 k₂ = 0, 1, . . . , 1023 (2.1) can be rewritten as X(k₁+ 2k₂) = 1023 n2=0 1 n1=0 x(1024n₁+ n₂)W₂₀₄₈(1024n1+n2)(k1+2k2) = 1023 n2=0 ⎧ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ 1 n1=0 x(1024n₁+ n₂)Wn1k1 2 2-point Wn2k1 2048 twiddle factor ⎫ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎬ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎭ Wn2k2 1024 1024-point (3.2) = 1024 n2=0 B₂(1)(n₂)Wn2k2 1024, (3.3)

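where B_r^{(k)} denotes the output of the radix-r algorithm at stage k. Now the remaining length 1024 is factored again as N₂ × N₃ = 2 × 512. The symbols n₂ and k₂ are reused: on the left-hand side below they are the 1024-point indices of (3.3), and on the right-hand side they are the new binary indices.

\[
n_2 = N_3 n_2 + n_3 = 512 n_2 + n_3, \qquad n_2 = 0, 1, \quad n_3 = 0, 1, \ldots, 511,
\]

and

\[
k_2 = k_2 + N_2 k_3 = k_2 + 2 k_3, \qquad k_2 = 0, 1, \quad k_3 = 0, 1, \ldots, 511.
\]

We have

\begin{align}
X(k_1 + 2k_2 + 4k_3) &= \sum_{n_3=0}^{511} \sum_{n_2=0}^{1} B_2^{(1)}(512 n_2 + n_3)\, W_{1024}^{(512 n_2 + n_3)(k_2 + 2 k_3)} \notag \\
&= \underbrace{\sum_{n_3=0}^{511} \Bigg\{ \underbrace{\sum_{n_2=0}^{1} B_2^{(1)}(512 n_2 + n_3)\, W_{2}^{n_2 k_2}}_{\text{2-point}} \; \underbrace{W_{1024}^{n_3 k_2}}_{\text{twiddle factor}} \Bigg\} W_{512}^{n_3 k_3}}_{\text{512-point}} \notag \\
&= \sum_{n_3=0}^{511} B_2^{(2)}(n_3)\, W_{512}^{n_3 k_3}. \tag{3.4}
\end{align}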
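The first decomposition step, (3.2) and (3.3), can be checked numerically. The sketch below assumes NumPy and uses `np.fft.fft` only as a reference DFT; the function name `first_stage` is hypothetical. It forms the 2-point butterfly over n₁, applies the twiddle factor W_N^{n₂k₁}, and confirms that the two resulting half-length DFTs give the even and odd outputs X(k₁ + 2k₂).

```python
import numpy as np


def first_stage(x):
    """One radix-2 decimation-in-frequency stage, following (3.2)-(3.3)."""
    n = len(x)
    half = n // 2
    n2 = np.arange(half)
    top, bottom = x[:half], x[half:]
    # k1 = 0: butterfly sum, twiddle factor W_N^0 = 1
    b_even = top + bottom
    # k1 = 1: butterfly difference, then twiddle factor W_N^{n2}
    b_odd = (top - bottom) * np.exp(-2j * np.pi * n2 / n)
    return b_even, b_odd


rng = np.random.default_rng(0)
x = rng.standard_normal(2048) + 1j * rng.standard_normal(2048)
b_even, b_odd = first_stage(x)
X = np.fft.fft(x)

# The 1024-point DFTs of the two branches are X(2*k2) and X(1 + 2*k2).
print(np.allclose(np.fft.fft(b_even), X[0::2]))
print(np.allclose(np.fft.fft(b_odd), X[1::2]))
```

Applying the same step to each branch reproduces (3.4), and repeating it gives the five radix-2 stages.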

Continuing in this way through the five radix-2 stages, we obtain

\[
X(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5 + 32k_6) = \sum_{n_6=0}^{63} B_2^{(5)}(n_6)\, W_{64}^{n_6 k_6}, \tag{3.5}
\]

where k₆ = 0, 1, ..., 63. Decomposing this 64-point DFT into 8-point DFTs gives the 2048-point mixed-radix FFT algorithm, in which n₆, k₆, n₇ and k₇ each range from 0 to 7:

\[
X(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5 + 32k_6 + 256k_7) = \sum_{n_7=0}^{7} \left\{ \sum_{n_6=0}^{7} B_2^{(5)}(8 n_6 + n_7)\, W_{8}^{n_6 k_6}\, W_{64}^{n_7 k_6} \right\} W_{8}^{n_7 k_7}. \tag{3.6}
\]

Because the radix-8 butterfly unit uses adders and multipliers inefficiently, we replace the radix-8 FFT algorithm with the radix-2³ FFT algorithm [3]. This further reduces the complexity of the butterfly, because each radix-8 butterfly is realized by three cascaded radix-2 butterflies. The SFG of the 2048-point mixed-radix FFT algorithm is shown in Fig. 3.1 and Fig. 3.2.

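Letting n₆ = 4α₁ + 2α₂ + α₃ and k₆ = β₁ + 2β₂ + 4β₃ in (3.6), where each αᵢ and βᵢ is 0 or 1, we obtain the radix-2³ form given in equations (3.7) and (3.8):

\[
X(k_1 + 2k_2 + 4k_3 + 8k_4 + 16k_5 + 32(\beta_1 + 2\beta_2 + 4\beta_3) + 256k_7) = \sum_{n_7=0}^{7} B_8^{(6)}(n_7, k_7)\, W_{8}^{n_7 k_7}, \tag{3.7}
\]

where

\[
B_8^{(6)}(n_7, k_7) = \sum_{\alpha_3=0}^{1} \sum_{\alpha_2=0}^{1} \sum_{\alpha_1=0}^{1} B_2^{(5)}\big(8(4\alpha_1 + 2\alpha_2 + \alpha_3) + n_7\big)\, W_{2}^{\alpha_1 \beta_1}\, W_{4}^{\alpha_2 \beta_1}\, W_{2}^{\alpha_2 \beta_2}\, W_{8}^{\alpha_3(\beta_1 + 2\beta_2)}\, W_{2}^{\alpha_3 \beta_3}\, W_{64}^{n_7(\beta_1 + 2\beta_2 + 4\beta_3)}. \tag{3.8}
\]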
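The triple sum in (3.8) is an 8-point DFT computed by three radix-2 butterflies. The sketch below evaluates that core alone, leaving out the inter-stage factor W₆₄^{n₇k₆}, and compares it with a reference DFT. It assumes NumPy; the function name `dft8_radix2_cubed` is hypothetical, and the loops favour readability over the in-place dataflow of a hardware butterfly.

```python
import numpy as np


def dft8_radix2_cubed(x):
    """8-point DFT from three radix-2 butterflies, with n = 4*a1 + 2*a2 + a3
    and k = b1 + 2*b2 + 4*b3 as in (3.8)."""
    x = np.asarray(x, dtype=complex).reshape(2, 2, 2)  # axes: a1, a2, a3
    w8 = np.exp(-2j * np.pi / 8)
    out = np.empty(8, dtype=complex)
    for b1 in (0, 1):
        # Butterfly over a1 (factor W_2^{a1*b1}); result indexed [a2, a3].
        s1 = x[0] + (-1) ** b1 * x[1]
        # Trivial twiddle W_4^{a2*b1}: the a2 = 1 row is scaled by (-j)^b1.
        s1 = s1 * np.array([[1], [(-1j) ** b1]])
        for b2 in (0, 1):
            # Butterfly over a2 (factor W_2^{a2*b2}); result indexed [a3].
            s2 = s1[0] + (-1) ** b2 * s1[1]
            # Twiddle W_8^{a3*(b1 + 2*b2)}: one of 1, W_8^1, -j, W_8^3.
            s2 = s2 * np.array([1, w8 ** (b1 + 2 * b2)])
            for b3 in (0, 1):
                # Butterfly over a3 (factor W_2^{a3*b3}).
                out[b1 + 2 * b2 + 4 * b3] = s2[0] + (-1) ** b3 * s2[1]
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal(8) + 1j * rng.standard_normal(8)
print(np.allclose(dft8_radix2_cubed(x), np.fft.fft(x)))
```

Only W₈¹ and W₈³ need a real multiplication; the factors 1 and -j are trivial, which is where radix-2³ saves multipliers over a plain radix-2 chain.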

Figure 3.1: The SFG of stage 1 to stage 5 (radix-2).

Figure 3.2: The SFG of stage 6 to stage 7 (radix-2³).

3.2 Architecture

The variable FFT for the MIMO-OFDM system provides 2048/1024/512/128-point FFT/IFFT operations and supports T data streams, from T = 1 to T = 4. As shown in Fig. 3.3, the system consists of Module 1 (data reordering), Module 2 to Module 6 (radix-2), Module 7 and Module 8 (radix-2³), conjugate blocks, divide blocks and multiplexers. Because the FFT is based on radix-2 butterflies, the output sequence appears in bit-reversed order relative to the input, as shown in Fig. 3.4.

Figure 3.3: Block diagram of the variable FFT processor.

Figure 3.4: The input and output relationship of FFT.
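Two properties of the block diagram can be illustrated in a few lines: a radix-2 decimation-in-frequency pipeline leaves its output in bit-reversed order, and an IFFT can be obtained from a forward FFT with a conjugation before and after plus a division by N. The sketch below assumes NumPy. That the conjugate and 1/N blocks of Fig. 3.3 are used in exactly this way is an assumption, and all function names are hypothetical.

```python
import numpy as np


def dif_fft_bit_reversed(x):
    """Iterative radix-2 DIF FFT; the result is left in bit-reversed order."""
    a = np.array(x, dtype=complex)
    n = len(a)
    span = n
    while span > 1:
        half = span // 2
        w = np.exp(-2j * np.pi * np.arange(half) / span)
        for start in range(0, n, span):
            top = a[start:start + half].copy()
            bot = a[start + half:start + span].copy()
            a[start:start + half] = top + bot          # sum branch
            a[start + half:start + span] = (top - bot) * w  # difference branch
        span = half
    return a


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    return int(format(i, f"0{bits}b")[::-1], 2)


def ifft_via_conjugate(X, fft):
    """IFFT built from a forward FFT: conj(FFT(conj(X))) / N."""
    X = np.asarray(X, dtype=complex)
    return np.conj(fft(np.conj(X))) / len(X)


rng = np.random.default_rng(1)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)
X = np.fft.fft(x)
y = dif_fft_bit_reversed(x)
order = [bit_reverse(k, 4) for k in range(16)]
print(np.allclose(y[order], X))                          # y[bitrev(k)] == X[k]
print(np.allclose(ifft_via_conjugate(X, np.fft.fft), x))
```

Reading the pipeline output at the bit-reversed address restores natural order, so no extra arithmetic is needed for the reordering.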

3.2.1 Module 1 (data reordering)

Module 1 is implemented with a 4 × 4 register array, and clock gating is used to save power. The time schedule of Module 1 is shown in Fig. 3.5, and its input-output relationship in Fig. 3.6. Module 1 re-permutes the input data sequences so that the radix-2 modules and Module 8 can operate efficiently, as explained later. In Fig. 3.6, n = 128, 512, 1024 or 2048 is the FFT size and N = n/4 = 32, 128, 256 or 512 is the corresponding number of groups. For example, when four sequences are transmitted and the FFT size is 128, there are 32 groups and each group contains the four data sequences.

Figure 3.5: (a) Read and write with column; (b) Read and write with row.

Figure 3.6: Relation between Module 1 input and Module 1 output.
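The reordering of Fig. 3.6 amounts to transposing each 4 × 4 block of samples (four streams by four consecutive time slots). The sketch below follows that reading of the figure and is an assumption, not the register-level design; it assumes NumPy, and `module1_reorder` is a hypothetical name.

```python
import numpy as np


def module1_reorder(streams):
    """streams has shape (4, n): row s is data stream s, column t is time t.

    Returns an array of the same shape in which lane i at time 4*g + j
    carries sample 4*g + i of stream j (a 4 x 4 transpose per group g)."""
    s = np.asarray(streams)
    rows, n = s.shape
    assert rows == 4 and n % 4 == 0
    blocks = s.reshape(4, n // 4, 4)     # [stream, group, offset in group]
    out = blocks.transpose(2, 1, 0)      # [offset, group, stream]
    return out.reshape(4, n)


# Label each sample as 100 * stream + time so that the result is easy to read.
n = 8
streams = np.array([[100 * s + t for t in range(n)] for s in range(4)])
print(module1_reorder(streams))
```

With streams A to D encoded as 0, 100, 200 and 300, the first output lane starts with A0 B0 C0 D0 A4 B4 C4 D4, matching the top row of the Module 1 output in Fig. 3.6. Applying the permutation twice returns the original order.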

3.2.2 Module 2 to 6 (radix-2 FFT algorithm)

Module 2 contains four butterfly units, two multipliers, four memory banks, a ROM table, a rotation factor and some multiplexers, as shown in Fig. 3.7. Module 2 is similar to the architecture described in Sec. 2.1.2. The difference is that Module 2 includes the rotation factor W_N^1 to reduce the memory size. Module 2 to Module 6 share the same architecture and differ only in memory size: 1024, 512, 256, 128 and 64, respectively. These modules realize the radix-2 FFT algorithm whose SFG is shown in Fig. 3.1, where the value of N for Module 2 to Module 6 is 2048, 1024, 512, 256 and 128, respectively. In addition to the time sharing described in Sec. 2.1.2, Module 2 also benefits from memory sharing, which is explained below.

[Figure 3.7: Module 2 block diagram, showing butterfly BF1, memory banks BANK1 to BANK4, the ROM, the rotation factor W_N^1, the multiplexers, and the multiplier inputs P1 and P2.]

Memory sharing: The twiddle factors of the first-stage butterfly for T = 4 are shown in Fig. 3.9, where A, B, C and D represent the four data sequences. The figure also shows which twiddle factor is used at each time instant. Originally, N/2 twiddle factors must be stored for the first-stage butterfly, and further ROM tables are needed for the radix-2 butterflies in Modules 3, 4, 5 and 6. The total number of twiddle factors is therefore N/2 + N/4 + N/8 + N/16 + N/32 = 62N/64, where N = 2048 here, so the memory becomes large when N is large. We propose a memory sharing method that reduces the memory size from 62N/64 to 31N/64. Moreover, the memory can be shared among Module 2 to Module 6 as shown in Fig. 3.8, so that the memory size is reduced to N/4. The memory sharing method is described as follows.

When the data sequences from x(0) to x(1023) arrive, they are stored in the memory of Module 2 (clock cycles 0 to 1023). When the data sequences from x(1024) to x(2047) arrive, the radix-2 PEs start to work (clock cycles 1024 to 2047): the sums are fed to the next stage, two of the four difference sequences are multiplied by the appropriate twiddle factors, and all four difference sequences are saved in memory. Finally, the four stored sequences are read out, the two that have not yet been multiplied by twiddle factors are multiplied by the appropriate ones, and all of them are passed to the next stage (clock cycles 2048 to 3071). Hence, a ROM table of size N/2 is needed to store the twiddle factors of the first-stage butterfly.

[A single ROM feeds Module 2 to Module 6, each through its own rotation factor: W_2048^1, W_1024^1, W_512^1, W_256^1 and W_128^1.]

Figure 3.8: Memory sharing from Module 2 to Module 6.

However, Fig. 3.9 shows that the twiddle factors at P1 and P2 differ by a constant ratio of W_N^1, where P1 and P2 are the multiplier inputs marked in Fig. 3.7. The ROM therefore only needs to hold the P1 factors, as in Fig. 3.10, and the P2 factors are obtained by applying the rotation factor W_N^1. Since the rotation factor is implemented with shifts and adds, it does not lengthen the critical path (Fig. 3.11). The critical path, which is 13 ns, is marked in Fig. 3.12.
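The storage argument can be checked with a small model. The sketch assumes the shared ROM holds the even-indexed first-stage factors W_N^{2m}, m = 0, ..., N/4 - 1, and that each module derives its odd-indexed factors with its own rotation factor W_M^1. The ROM layout and the name `twiddle` are assumptions made for illustration, and the rotation is modelled as an ordinary complex multiplication rather than the shift-and-add circuit of the thesis.

```python
import numpy as np

N = 2048
STAGE_SIZES = (2048, 1024, 512, 256, 128)   # value of N for Modules 2 to 6

# Assumed shared ROM: W_N^{2m} for m = 0 .. N/4 - 1 (N/4 entries in total).
rom = np.exp(-2j * np.pi * 2 * np.arange(N // 4) / N)


def twiddle(stage_size, k):
    """Return W_M^k (M = stage_size, 0 <= k < M/2) from the shared ROM."""
    step = N // stage_size                  # W_M^1 equals W_N^step
    rotation = np.exp(-2j * np.pi / stage_size)
    base = rom[(k // 2) * step]             # even part: W_M^{2*(k//2)}
    return base * rotation if k % 2 else base


separate = sum(m // 2 for m in STAGE_SIZES)  # one full table per module
rotated = sum(m // 4 for m in STAGE_SIZES)   # even factors only, per module
print(separate, rotated, len(rom))           # 62N/64, 31N/64 and N/4 entries
```

Every twiddle factor of the five radix-2 modules is recovered from the N/4 ROM entries, which is the reduction from 62N/64 to N/4 described above.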

[Twiddle factors at the multiplier inputs P1 and P2; each factor is applied to sequences A, B, C and D in turn.
(a) P1: W_N^0, W_N^4, ..., W_N^(N/2-4)    P2: W_N^1, W_N^5, ..., W_N^(N/2-3)
(b) P1: W_N^2, W_N^6, ..., W_N^(N/2-2)    P2: W_N^3, W_N^7, ..., W_N^(N/2-1)]

Figure 3.9: (a) ROM table at clock cycle 1024 to 2047; (b) ROM table at clock cycle 2048 to 3071.

[Twiddle factors at P1 only; each factor is applied to sequences A, B, C and D in turn.
(a) P1: W_N^0, W_N^4, ..., W_N^(N/2-4)
(b) P1: W_N^2, W_N^6, ..., W_N^(N/2-2)]

Figure 3.10: (a) ROM table at clock cycle 1024 to 2047; (b) ROM table at clock cycle 2048 to 3071.

Figure 3.11: Analysis for critical path (annotated path delays: 5.04 ns and 3.26 ns).

Figure 3.12: The FFT critical path (13 ns).

3.2.3 Module 7 (radix-2³ FFT algorithm)

Module 7 is the same as Module 3 in Sec. 2.1.2. Its function is to perform the 6th-stage butterfly of Fig. 3.2.

3.2.4 Module 8 (radix-2³ FFT algorithm)

Module 8 is the same as Module 4 in Sec. 2.1.2. Its function is to perform the 7th-stage butterfly of Fig. 3.2.

3.3 Complexity comparison

Let T = 4, and compare the hardware complexity of the proposed architecture with that of other FFT architectures, as listed in Table 3.1. For four data sequences, the proposed approach saves 69% of the complex multipliers and 75% of the ROM table entries compared with R2SDF. Note that the other architectures in Table 3.1 may not support the variable FFT sizes required by the Wi-MAX standard. For instance, although R2³SDF saves 70% of the complex multipliers, its FFT size is limited to powers of eight.

Table 3.1 (four data sequences; percentages are relative to R2SDF)

Architecture        | Proposed (variable 2048-pt) | R2SDF (2048-pt)    | R2²SDF (2048-pt)  | R2³SDF (2048-pt)
Complex multiplier  | 10 + 4 (31.1%)              | 10 × 4 = 40 (100%) | 4 × 4 = 16 (40%)  | 3 × 4 = 12 (30%)
Complex adder       | 80 (90.9%)                  | 88 (100%)          | 88 (100%)         | 88 (100%)
ROM table           | 512 (25%)                   | 2046 (100%)        | 2040 (99.7%)      | 2044 (99.9%)
Memory size         | 8188 (100%)                 | 8188 (100%)        | 8188 (100%)       | 8188 (100%)
Throughput rate     | 4R                          | 4R                 | 4R                | 4R

3.4 Simulation

Determining an appropriate bit width for the FFT processor is important, since the bit width directly affects the hardware cost. The bit width can be determined by fixed-point simulation. We use a signal-to-quantization-noise ratio (SQNR) system model to determine the FFT bit width; the model is shown in Fig. 3.13.

Figure 3.13: System model of SQNR.

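The SQNR is defined as

\[
\mathrm{SQNR} = \frac{1}{N} \sum_{k=0}^{N-1} \frac{\sigma_x^2}{|y_k - u_k| + \beta^2 \sigma_e^2}, \tag{3.9}
\]

where y_k and u_k are the two FFT outputs compared in Fig. 3.13. Taking the SNR (signal-to-noise ratio) to be one ($\sigma_x^2 / \sigma_e^2 = 1$), we can rewrite the equation as

\[
\mathrm{SQNR} = \frac{1}{N} \sum_{k=0}^{N-1} \frac{1}{|y_k - u_k| + \beta^2}, \tag{3.10}
\]

where $\beta = 10^{-\mathrm{SNR}/20}$. The relationship between the SQNR (with the final bit width) and the SNR is shown in Fig. 3.14.

A large SQNR means that the quantization error is small. From the figure, the final bit width supports an SQNR of up to 35 dB.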
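The idea of comparing a fixed-point FFT with a floating-point one can be sketched as follows. This is not formula (3.9) or (3.10): it uses the conventional ratio of signal power to quantization-error power, and it quantizes only the FFT input and output rather than every stage of the pipeline. It assumes NumPy, and the names `quantize` and `sqnr_db` are hypothetical.

```python
import numpy as np


def quantize(z, frac_bits):
    """Round real and imaginary parts to a grid of 2**-frac_bits (no saturation)."""
    z = np.asarray(z, dtype=complex)
    scale = 2.0 ** frac_bits
    return (np.round(z.real * scale) + 1j * np.round(z.imag * scale)) / scale


def sqnr_db(x, frac_bits):
    """Conventional SQNR in dB between a quantized FFT and a floating-point FFT."""
    n = len(x)
    u = np.fft.fft(x) / n                                   # floating-point reference
    y = quantize(np.fft.fft(quantize(x, frac_bits)) / n, frac_bits)
    noise_power = np.sum(np.abs(y - u) ** 2)
    return 10 * np.log10(np.sum(np.abs(u) ** 2) / noise_power)


rng = np.random.default_rng(0)
x = rng.standard_normal(2048) + 1j * rng.standard_normal(2048)
for bits in (6, 8, 10, 12):
    print(bits, round(float(sqnr_db(x, bits)), 1))
```

Sweeping the number of fractional bits in this way and picking the smallest width that meets the target SQNR is the usual fixed-point sizing procedure; a bit-accurate model would apply `quantize` after each butterfly stage instead.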

Figure 3.14: SQNR vs. SNR (both axes in dB, 0 to 100).

Chapter 4 Chip implementation and verification

In this chapter we describe how to build the MATLAB platform of the MIMO variable FFT and how to verify the design. The MATLAB platform is used to verify the function of the RTL platform. The design flow is illustrated in Fig. 4.1. It is the flow suggested by CIC (Chip Implementation Center) for cell-based design, and it includes several important steps:

• Analysis and verification of the architecture (simulation in MATLAB).

• Architecture design (simulation with NC-Verilog, ModelSim and Debussy).

• Design for testability (circuit synthesis and testability estimation with DFT Compiler and TetraMAX).

• Gate-level simulation and timing verification.

• Power estimation (with Encounter).

• Post-layout gate-level simulation.

• Layout verification, DRC/LVS (with Calibre).

• Static timing analysis.

[Figure 4.1: design flow chart with the following steps.
RTL verification: specification development and system model; build the system model (MATLAB platform); RTL code (Verilog platform); RTL-level simulation (NC-Verilog and ModelSim).
Gate-level pre-layout verification: logic synthesis (Design Compiler); gate-level netlist; gate-level simulation; scan chain synthesis (DFT Compiler); testability estimation (TetraMAX).
Gate-level post-layout verification: place and route (SoC Encounter); layout verification (DRC/LVS); RC extraction; delay calculation; power analysis; gate-level STA.
Circuit-level verification: layout merging (Calibre); layout verification (DRC/LVS); circuit extraction; circuit-level simulation; circuit-level STA; tapeout.]

4.1 Cell-based design flow

• Build the MATLAB platform and verify the RTL platform

Figure 4.2: Simulation environment for variable FFT.

We need to build both the MATLAB platform and the Verilog platform. Fig. 4.2 shows the simulation environment for bit-true verification. First, we develop a floating-point FFT model in the MATLAB environment. Because the built-in FFT function of MATLAB is a floating-point function, we build a quantizable MATLAB FFT function that follows the proposed architecture described earlier. Once the quantizable FFT function is ready, we determine the bit width of the individual stages using the procedure described in Sec. 3.4. Fig. 4.3 shows the quantized result at all stages; note that the total number of bits at every stage is 16. When the fixed-point FFT model is ready, we generate random patterns and obtain both the design input and the expected output from the MATLAB platform, and we save them in a text file for bit-true verification. The Verilog platform reads its input from the saved text file, and its output is compared with the MATLAB output to verify the result.

Point:        A  B  C  D  E  F  G  H  I
Integer bits: 2  2  3  3  4  4  4  6  7

Figure 4.3: Bit-width in all stages.

• Synthesis

We use Design Compiler to synthesize the RTL code. In the synthesis phase, we apply constraints on timing, area and other design rules to meet the specification. The coding style is important, since it strongly affects the synthesis results.

• Gate-level simulation

After synthesis, the gate-level circuit includes timing information, so we need to check functional correctness again. The nWave tool helps us check the function together with the timing information.

• Memory BIST (Built-In Self-Test)

We use the memory generator to describe the specification and create the memory. We also use the memory specification to generate the memory BIST circuit, shown in Fig. 4.4. In Fig. 4.4, BistMode selects between function mode and test mode. The BIST circuit contains a memory wrapper and a BIST controller. When the memory size is large, the number of BistFail, ErrorMap and Finish pins increases. We therefore OR together the BistFail pins, and likewise the ErrorMap pins, to reduce the number of pads.

Figure 4.4: BIST circuit.

• Scan chain insertion

Figure 4.5: From Flip-Flop to scan Flip-Flop.

For testability, scan chain synthesis is needed; it is done with DFT Compiler. Fig. 4.5 shows a flip-flop after scan chain insertion.

• ATPG (Automatic Test Pattern Generation)

We use TetraMAX to generate the test patterns. The patterns target stuck-at-0 and stuck-at-1 faults. After ATPG we obtain the fault coverage, which indicates how likely it is that a chip passing the test is in good condition.

• Scan gate level simulation

After scan chain synthesis, we need to run the gate-level simulation again on the scanned netlist. As Fig. 4.5 shows, the added multiplexers lengthen the critical path, so the timing must be adjusted to keep the function correct.

• APR (Automatic Place and Route)

We use SoC Encounter to perform placement and routing. After APR, we check parameters such as timing, power and design rule violations.

• DRC/LVS verification

We verify DRC (Design Rule Checking) and LVS (Layout Versus Schematic) using Calibre.
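The bit-true comparison described in the first step of this section can be sketched as follows: the expected output is written as fixed-width two's-complement hex words, read back, and compared for exact equality with the simulator output. The 16-bit word length comes from the text; the file layout (one "real imaginary" pair per line) and all function names are hypothetical, and only the Python standard library is used.

```python
import os
import tempfile

WIDTH = 16  # total word length stated in the text


def to_hex(value, width=WIDTH):
    """Two's-complement hex string of a signed integer."""
    return format(value & ((1 << width) - 1), f"0{width // 4}x")


def from_hex(text, width=WIDTH):
    """Inverse of to_hex: hex string back to a signed integer."""
    raw = int(text, 16)
    return raw - (1 << width) if raw >= 1 << (width - 1) else raw


def write_vectors(path, samples):
    """Write one 'real imag' pair of hex words per line (assumed layout)."""
    with open(path, "w") as f:
        for re_part, im_part in samples:
            f.write(f"{to_hex(re_part)} {to_hex(im_part)}\n")


def read_vectors(path):
    """Read the pairs back as tuples of signed integers."""
    with open(path) as f:
        return [tuple(from_hex(word) for word in line.split()) for line in f]


def bit_true_match(expected, actual):
    """Bit-true means exact integer equality, sample by sample."""
    return len(expected) == len(actual) and all(
        e == a for e, a in zip(expected, actual)
    )


expected = [(0, -1), (32767, -32768), (123, -456)]
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "expected.txt")
    write_vectors(path, expected)
    print(bit_true_match(expected, read_vectors(path)))
```

In the flow of Fig. 4.2 the file would be written by the fixed-point model and the second argument of `bit_true_match` would come from the Verilog simulation; comparing integers rather than floats is what makes the check bit-true.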


4.2 Chip summary

• Chip layout

Figure 4.6: Layout view of the proposed FFT processor.

The proposed variable MIMO FFT processor is fabricated in TSMC 0.18 um 1P6M CMOS technology. The layout is shown in Fig. 4.6. Table 4.1 lists the expected chip performance, which satisfies the requirements of the IEEE 802.16e standard. A 208-pin package will be used for the chip, of which 175 pins are I/O pins and the others are power pins. The core size is 25 mm², including a total of 31.9375 KB of SRAM used as feedback memory, as shown in Fig. 4.6. The total power consumption and total area are 181 mW and 41.8 mm², respectively.

Items             | Specification
Technology        | TSMC 0.18 um
Package           | CQFP208
Core size         | 25 mm²
Die size          | 41.8 mm²
Gate count        | 1350 K
Memory            | 31.9375 KB
Max frequency     | 40 MHz
Power consumption | 181 mW

Table 4.1: Expected chip performance of the proposed FFT processor.

• Performance comparison

Table 4.2 compares the performance of this work with other FFT architectures. The proposed architecture has advantages in throughput and area.

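Table 4.2: Performance comparison.

Design    | Technology | FFT size | Throughput | Work frequency | Power  | Area (core / die, as extracted)
This work | 0.18 um    | 2048-pt  | 4R         | 40 MHz         | 181 mW | core 25 mm², die 41.8 mm²
[20]      | 0.18 um    | 1024-pt  | R          | 52 MHz         | 32 mW  | 7.6 mm²
[22]      | 0.6 um     | 256-pt   | R          | 50 MHz         | N.A.   | N.A.
[16]      | 0.13 um    | 128-pt   | 4R         | 40 MHz         | 5.2 mW | 2.69 mm²
[19]      | 0.35 um    | 2048-pt  | R          | 60 MHz         | 574 mW | N.A., 12.25 mm²
[21]      | 0.5 um     | 1024-pt  | R          | 30 MHz         | N.A.   | N.A., 40 mm²

Three further area entries (4.6 mm², 36 mm² and 1.4 mm²) belong to the area columns, but their rows are not recoverable.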

Chapter 5 Conclusions

In this thesis, we proposed a variable FFT for MIMO-OFDM Wi-MAX applications. In Chapter 2 we discussed several FFT architectures that can be applied in MIMO systems with various FFT sizes. In Chapter 3 we used the advantages of those architectures to implement our design, and we proposed a memory sharing method that reduces the twiddle-factor ROM size to 25% of that of R2SDF. In Chapter 4 we discussed how to build the system platform in the MATLAB environment and the chip implementation flow. Finally, we followed the CIC design flow to implement the proposed FFT processor in TSMC 0.18 um technology. The total area and power consumption are 41.8 mm² and 181 mW, respectively.

Bibliography

[1] A. V. Oppenheim and R. W. Schafer, “Discrete-Time Signal Processing,” Prentice-Hall Inc., 1999.

[2] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proc. of Int. Parallel Processing Symposium, pp. 766-770, Apr. 1996.

[3] S. He and M. Torkelson, “Designing Pipeline FFT Processor for OFDM (de) Modulation,” URSI International Symposium on Signals, Systems and Electronics, pp. 257-262, 1998.

[4] L. Jia, Y. Gao, J. Isoaho and H. Tenhunen, “A New VLSI-Oriented FFT Algorithm and Implementation,” IEEE ASIC Conference, pp. 337-341, Sep. 1998.

[5] W. C. Yeh and C. W. Jen, “High-speed and low-power split-radix FFT,” IEEE Trans. Acoust, Speech, Signal Processing, vol. 51, pp. 864-874, Mar. 2003.

[6] B. M. Baas, “An approach to low power, high performance, fast Fourier transform processor design,” PhD Thesis, Stanford University, 1999.

[7] C. P. Hsu, “Design of Fast Fourier Transform Processor in DVB-T Inner Receiver,” MS Thesis, Central University, 2005.

[8] Y. T. Lin, P. Y. Tsai, and T. D. Chiueh, “Low-power Variable-length Fast Fourier Transform Processor,” IEEE Proc. Computer. Digit. Tech., vol. 152, no. 4, pp. 499-506, Jul. 2005.

[9] J. G. Nash, “A High Performance Scalable FFT,” IEEE Wireless Communications and Networking Conference, pp. 2367-2372, Mar. 2007.

[10] Y. Zhao, A. T. Erdogan, and T. Arslan, “A low-power and domain-specific reconfigurable FFT fabric for system-on-chip applications,” 19th IEEE Int. Parallel and Distributed Processing Symposium, pp. 4, Apr. 2005.

[11] K. Manolopoulos, K. Nakos, D. Reisis, N. Vlassopoulos, and V. A. Chouliaras, “High Performance 16K, 64K, 256K complex points VLSI Systolic FFT Architectures,” 14th IEEE International Conference on Electronics, Circuits and Systems, pp. 146-149, Dec. 2007.

[12] ETSI EN 300 401 (v1.3.2): “Radio broadcasting systems; digital audio broadcasting (DAB) to mobile, portable and fixed receivers,” Sep. 2000.

[13] ETSI EN 300 744 (v1.2.1): “Digital video broadcasting (DVB); framing structure, channel coding and modulation for digital terrestrial television,” Jul. 1999.

[14] T1E1.4/98-007R4: “Standards project for interfaces relating to carrier to customer connection of asymmetrical digital subscriber line (ADSL) equipment,” Jun. 1998.

[15] ETSI TS 101 270-2 (v1.1.1): “Transmission and multiplexing (TM); access transmission systems on metallic access cables; very high speed digital subscriber line (VDSL); Part 2: Transceiver specification,” Feb. 2001.

[16] S. F. Hsiao and W. R. Shiue, “Design of low-cost and high-throughput linear arrays for DFT computations: algorithms, architectures, and implementation,” IEEE Trans. on Circ. and Syst. II, vol. 47, pp. 1188-1203, 2000.

[17] V. Boriakoff, “FFT computation with systolic arrays, a new architecture,” IEEE Trans. on Circ. and Syst. II, vol. 41, pp. 278-284, 1994.

[18] Y. W. Lin, H. Y. Liu, and C. Y. Lee, “A 1-GS/s FFT/IFFT processor for UWB applications,” IEEE Journal of Solid-State Circuits, vol. 40, issue 8, pp. 1726-1735, Aug. 2005.

[19] Y. W. Lin and C. Y. Lee, “Design of an FFT/IFFT Processor for MIMO-OFDM Systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 54, no. 4, pp. 807-815, Apr. 2007.

[20] K. Maharatna, E. Grass, and U. Jagdhold, “A 64-Point Fourier Transform Chip For High-Speed Wireless LAN Application Using OFDM,” IEEE Journal of Solid-State Circuits, vol. 39, no. 3, pp. 484-493, Mar. 2004.

[21] J. C. Kuo, C. H. Wen, and A. Y. Wu, “Implementation of a Programmable 64∼2048-Point FFT/IFFT Processor for OFDM-based Communication Systems,” in Proc. IEEE ISCAS, vol. 2, pp. 121-124, May 2003.

[22] H. Zou and B. Daneshrad, “VLSI implementation for a low power mobile OFDM receiver ASIC,” in Proc. IEEE Wireless Communications and Networking Conference, vol. 4, pp. 2120-2124, Mar. 2004.

[23] S. He and M. Torkelson, “Design and Implementation of a 1024-point pipeline FFT processor,” in Proc. IEEE Custom Integrated Circuits Conf., pp. 131-134, May 1998.

[24] L. Fanucci, M. Forliti, and F. Gronchi, “Single-Chip Mixed-Radix FFT Processor for Real-Time On-Board SAR Processing,” in Proc. IEEE Int. Conf. on Electronics, Circuits and Systems, vol. 2, pp. 1135-1138, Sep. 1999.


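Acknowledgments

Two years of graduate research are finally coming to a close. For the successful completion of this thesis, I first thank my advisor, Prof. Shang-Ho Tsai (蔡尚澕). Over these two years he tirelessly led us, step by step, into the field of communication chip design. I also admire his dedication to research and his extraordinary stamina, which gave me clearer goals in my own studies. I hope he will take good care of his health despite his busy schedule. I also thank my oral defense committee members, Prof. 林源倍, Prof. 簡鳳村 and Prof. 董蘭榮, whose experience made this thesis more complete.

I also thank the senior students and classmates of Lab 535: your help with coursework and your suggestions resolved many of my questions. I further thank the labmates who worked alongside me, who gave me more ways of thinking about the difficulties I met in research. Finally, I thank the junior students for joining the lab; you made my graduate life more enjoyable.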
