Thesis Organization - Design of a Fast Fourier Transform Processor with Parallel-In-Parallel-Out in Normal Order

Chapter 1 Introduction

1.2 Thesis Organization

In this thesis, FFT/IFFT designs for robust channel estimation in a high-mobility STBC/OFDMA communication system are proposed. System simulation, architecture and circuit design, and implementation of the FFT/IFFT processor for an 802.16e baseband are carried out in this thesis. IEEE 802.16e, DFT-based channel estimation, and the system block we use are introduced in Chapter 2. Since the system includes two kinds of FFT/IFFT processor, we also introduce the system requirements for each kind: one for OFDMA demodulation, the other for DFT-based channel estimation. The requirements on the FFT processor for 802.16e OFDMA demodulation do not differ from those in other OFDM communication systems; thus, Chapter 2 introduces the conventional FFT processor we use. A shared-memory concept is used between the FFT processor for OFDMA demodulation and the channel estimation block. The FFT/IFFT processor used in DFT-based channel estimation differs from the conventional FFT processor in two aspects. One is parallel-in-parallel-out data access; the other is that several inputs are zero-valued or only several outputs are valid, which is called a partial FFT processor. The thesis focuses on the hardware design of the FFT/IFFT processor for channel estimation with parallel-in-parallel-out in normal order, and then the concept of partial FFT processor design is demonstrated.

An investigation of the conventional FFT algorithm and various parallel-in-parallel-out FFT architectures is presented in Chapter 3. Conventional high-throughput FFT processors usually use a pipeline-based FFT architecture, which provides high throughput but also has a high hardware cost. A memory-based FFT architecture has the advantage of low hardware cost, and it can also provide high throughput through parallel-in-parallel-out access with multi-partitioned memories. Comparisons among the different parallel-in-parallel-out FFT architectures are also carried out in Chapter 3, and the results guide the FFT processor design in our system. At the end of Chapter 3, the concept of partial FFT processor design is introduced to address the other goal of the FFT processor for DFT-based channel estimation.

The architecture design of the FFT processor with parallel-in-parallel-out in normal order is proposed in Chapter 4. A novel memory allocation method for parallel-in-parallel-out in normal order is proposed in this chapter. The designs of the processing elements, memory allocation, commutator, scale-down block, and coefficient ROM table for the proposed FFT processor are introduced, and they are considered the key contribution of this thesis. At the end of Chapter 4, the hardware implementation results are compared with other FFT processors with parallel-in-parallel-out in normal order.

The backend design flow for the 802.16e receiver chip is introduced in Chapter 5. For tape-out, two versions of the chip implementation results are presented, one for the UMC shuttle and the other for CIC. The chip floor plan and design flow are also presented in Chapter 5.

At the end of the thesis, the conclusion and future work will be presented in Chapter

Chapter 2 FFT Application in OFDM Communication System

2.1 Concept of OFDM

Orthogonal Frequency Division Multiplexing (OFDM) is based on frequency division multiplexing (FDM). FDM translates several message signals to different spectral locations. An example of bandwidth allocation of FDM is shown in Fig. 2.1.

Fig. 2.1 Bandwidth allocation for sub-channels in FDM system (frequency axis; sub-channels SC.1 to SC.8)

The FDM technique keeps the sub-channels from overlapping by inserting guard bands, which prevent adjacent sub-channels from producing inter-channel interference (ICI); however, guard bands reduce bandwidth efficiency, which is important in a communication system, because they carry no message signals. OFDM uses orthogonal sub-carriers so that the sub-channels can overlap, carrying more message signals in the same bandwidth than FDM, as shown in Fig. 2.2.

Fig. 2.2 Bandwidth allocation for sub-channels in OFDM system

A basic block diagram of an OFDM system is shown in Fig. 2.3. It shows that the transmitter in an OFDM system needs an IFFT module to modulate the message signal, called OFDM modulation, and the receiver needs an FFT module for OFDM demodulation; thus, the FFT processor is a key block in an OFDM transceiver system.

Fig. 2.3 Basic block diagram of an OFDM transceiver system
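The IFFT/FFT pairing in the transceiver can be illustrated with a minimal sketch. It assumes an ideal channel, a small hypothetical symbol count of N = 8, and an arbitrary QPSK pattern; the direct-sum `idft` and `dft` helpers are illustrative stand-ins for the IFFT and FFT modules, not the processor designed in this thesis.

```python
import cmath

def idft(X):
    """Inverse DFT: maps sub-carrier symbols X(k) to time samples x(n)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(x):
    """Forward DFT: recovers the sub-carrier symbols from the time samples."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Hypothetical QPSK symbols, one per sub-carrier (N = 8 for readability).
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]

tx_samples = idft(qpsk)        # OFDM modulation at the transmitter
rx_symbols = dft(tx_samples)   # OFDM demodulation at the receiver (ideal channel)

# Largest deviation between transmitted and recovered symbols.
max_error = max(abs(a - b) for a, b in zip(qpsk, rx_symbols))
print(max_error < 1e-9)
```

With no channel distortion or noise, the receiver recovers the transmitted symbols up to floating-point rounding.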

2.2 Introduction of IEEE 802.16e

IEEE 802.16 is a broadband wireless access (BWA) standard. The first IEEE 802.16 standard, approved in December 2001, is called IEEE 802.16-2001 [6]. It specifies transmission in 10-66 GHz with only line-of-sight (LOS) capability for Wireless Metropolitan Area Networks (WiMAN), and it uses a single-carrier (SC) physical layer (PHY). IEEE 802.16a is an extension of IEEE 802.16-2001. It transmits in 2-11 GHz with both LOS and non-line-of-sight (NLOS) capability, and it suffers less distortion from rain than IEEE 802.16-2001. IEEE 802.16-2004 (also called IEEE 802.16d) is a fixed BWA standard that combines the IEEE 802.16-2001 and IEEE 802.16a standards. IEEE 802.16-2004 describes the media access control (MAC) layer and the PHY in 2-66 GHz in more detail. It supports multiple PHY specifications, namely WiMAN-SC, WiMAN-OFDM, WiMAN-OFDMA, and WiMAN-SCa, operating in different frequency ranges. For operation in 10-66 GHz, the WiMAN-SC PHY, based on a single carrier, is specified; for operation below 11 GHz, IEEE 802.16-2004 transmitting in NLOS provides three alternative PHY specifications: WiMAN-OFDM (based on orthogonal frequency division multiplexing), WiMAN-OFDMA (based on orthogonal frequency division multiple access), and WiMAN-SCa (based on a single carrier). IEEE 802.16e, a fixed and mobile BWA standard, is an enhancement of the IEEE 802.16-2004 standard. It fills the gap between very high data rate local area networks and very high mobility cellular systems. An extended PHY specification called scalable OFDMA (SOFDMA), based on WiMAN-OFDMA, provides different FFT sizes for OFDMA: 128, 512, 1024, and 2048 points. Table 2-1 summarizes IEEE 802.16-2001, IEEE 802.16a, and IEEE 802.16e.

Table 2-1 Comparisons of IEEE 802.16 standards

| Items | IEEE 802.16-2001 | IEEE 802.16a | IEEE 802.16e |
|---|---|---|---|
| Spectrum | 10-66 GHz | 2-11 GHz | 2-6 GHz |
| Channel Bandwidth | 20, 25, 28 MHz | 1.5 to 20 MHz | 1.5 to 20 MHz |
| Carrier | Single Carrier | OFDM/OFDMA | OFDM/SOFDMA |
| FFT Size | N/A | 256 (OFDM), 2048 (OFDMA) | 256 (OFDM), 128/512/1024/2048 (SOFDMA) |
| Modulation | QPSK, 16QAM, 64QAM | QPSK, 16QAM, 64QAM | QPSK, 16QAM, 64QAM |
| Channel Conditions | LOS | Non-LOS | Non-LOS |
| Typical Cell Radius | 2-5 km | 7-10 km, max 50 km | 2-5 km |
| Application | Fixed | Fixed and portable | Fixed and mobile |

2.3 DFT-Based Channel Estimation

Channel estimation in a conventional OFDM system uses a simple one-tap equalizer, since the channel gain varies slowly between adjacent OFDM symbols. However, in a mobile wireless communication environment, such as the channel in IEEE 802.16e, the channel gain varies rapidly between adjacent OFDM symbols, so a one-tap equalizer is not suitable for the time-varying channel. The one-tap equalizer can be realized as a least-squares (LS) channel estimator; it has low hardware complexity but lower performance than the minimum-mean-square-error (MMSE) estimator. The MMSE estimator performs better, but its hardware complexity is too high. DFT-based channel estimation [7-9] has been presented to combine the LS and MMSE estimators, and it reduces the hardware complexity of the MMSE estimator. A simple block diagram of DFT-based channel estimation is shown in Fig. 2.4. R(k) is the received data on sub-carrier k after OFDM demodulation, X(k) is the decision data, which is determined using the channel estimate from the latest OFDM symbol, and H(k) is the channel estimate used for the next OFDM symbol.

Fig. 2.4 Block diagram of DFT-based channel estimation

DFT-based channel estimation can provide a more accurate channel gain with lower hardware complexity than the original MMSE estimator. However, it needs both an IFFT block and an FFT block to implement the algorithm. Thus, a suitable FFT or IFFT processor design can reduce the hardware cost of DFT-based channel estimation.
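Why the algorithm needs both an IFFT and an FFT can be seen in a minimal sketch of the commonly used form of DFT-based estimation: an LS estimate per sub-carrier, an IFFT to the time domain, truncation to the assumed channel length, and an FFT back to the sub-carriers. The channel taps, noise level, symbol pattern, and the name `dft_based_estimate` are assumptions for illustration only; the decision-feedback structure of Fig. 2.4 is not modeled, and X(k) is simply taken as known.

```python
import cmath
import random

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft_based_estimate(R, X, num_taps):
    """LS estimate per sub-carrier, then time-domain truncation to num_taps."""
    H_ls = [r / x for r, x in zip(R, X)]      # least-squares estimate R(k)/X(k)
    h = idft(H_ls)                            # IFFT: move to the time domain
    h = [tap if n < num_taps else 0j for n, tap in enumerate(h)]  # drop noise-only taps
    return H_ls, dft(h)                       # FFT: back to the sub-carriers

rng = random.Random(0)                        # fixed seed for repeatability
N, num_taps = 16, 4
h_true = [1.0, 0.5j, -0.25, 0.1] + [0j] * (N - num_taps)   # hypothetical channel
H_true = dft(h_true)
X = [rng.choice([1, -1]) + 0j for _ in range(N)]           # known unit-power symbols
R = [H * x + complex(rng.gauss(0, 0.1), rng.gauss(0, 0.1))
     for H, x in zip(H_true, X)]

H_ls, H_dft = dft_based_estimate(R, X, num_taps)
mse_ls = sum(abs(a - b) ** 2 for a, b in zip(H_ls, H_true)) / N
mse_dft = sum(abs(a - b) ** 2 for a, b in zip(H_dft, H_true)) / N
print(mse_dft <= mse_ls)
```

Because the true channel occupies only the first `num_taps` time-domain samples, zeroing the remaining samples removes noise without removing signal, so the refined estimate cannot be worse than the LS estimate in this setting.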

2.4 System Specification

For a mobile WMAN baseband transceiver using the IEEE 802.16e standard, we proposed a baseband transceiver [10]. A simple block diagram of the 2×1 multiple-input-single-output (MISO) IEEE 802.16e OFDM system is shown in Fig. 2.5. For chip implementation, we implement only the receiver part of Fig. 2.5. The key system specifications are listed in Table 2-2.

Fig. 2.5 Block diagram of baseband transceiver in IEEE 802.16e

Table 2-2 System specification of IEEE 802.16e transceiver system

| Items | Specification |
|---|---|
| Bandwidth | 10 MHz |
| PHY Layer Specification | WiMAN-SOFDMA |
| FFT Size | 1024 |
| Sample Rate | 11.2 MHz |
| Guard Interval | 1/8 |
| Constellation | QPSK, 16QAM |
| OFDM Symbol Time | 102.9 µs |
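The OFDM symbol time in Table 2-2 follows from the FFT size, sample rate, and guard interval. The short calculation below is a sketch of that arithmetic; the variable names are illustrative, and the sub-carrier spacing is a derived quantity that the table does not list.

```python
fft_size = 1024            # points per OFDM symbol
sample_rate_hz = 11.2e6    # sample rate from Table 2-2
guard_ratio = 1 / 8        # guard interval as a fraction of the useful time

useful_time_us = fft_size / sample_rate_hz * 1e6      # time for 1024 samples
guard_time_us = useful_time_us * guard_ratio          # guard interval duration
symbol_time_us = useful_time_us + guard_time_us       # total OFDM symbol time
subcarrier_spacing_khz = sample_rate_hz / fft_size / 1e3

print(round(symbol_time_us, 1), "us per OFDM symbol")
```

The useful part lasts about 91.4 µs and the guard interval about 11.4 µs, which together give the 102.9 µs listed in the table.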

The channel estimation block is a decision feedback (DF) DFT-based channel estimation [10], which combines the channel estimation and data detection as shown in Fig. 2.6. The system requirement for channel estimation will be introduced in the following sections.

Fig. 2.6 Block diagram of decision feedback DFT-based channel estimation

There are two kinds of FFT processor in the receiver part. FFT_dem, located after the synchronization block, is the OFDM demodulator. The FFT_ch and IFFT_ch blocks are required in the channel estimation block. The following sections introduce the system specifications of these two kinds of FFT processor.

2.4.1 Specification of FFT Processor on Demodulation Path

The FFT_dem processor in Fig. 2.5 receives data from the synchronization block and passes it to channel estimation and space-time decoding. The input data format of the FFT processor is the same as in other OFDM communication systems. However, the output side has to buffer 2 OFDM symbols, since we use a 2×1 MISO system with STBC coding and DF DFT-based channel estimation. For this reason, we design a conventional memory-based FFT processor [11] with 5 memory banks, shown in Fig. 2.7.

Fig. 2.7 FFT Processor with 5 shared memories

SYN_wr is the data from the synchronization block, and only one of the memory banks is written by the synchronization block in an OFDM symbol time. The written memory bank is then used by the FFT processor to compute the FFT. At the same time, the synchronization block writes data to another memory bank. After the two OFDM symbols in an STBC time slot have been calculated by the FFT processor, the memories that store the FFT results of these two OFDM symbols are read by channel estimation (CE_rd) over two OFDM symbol times.

The time chart of the 5 memory banks is shown in Fig. 2.8. At the first preamble symbol, the data from the synchronization block are written to MEM_1_0. At the second and third symbols, the data from the synchronization block are written to MEM_0_0 and MEM_1_1 while the data in MEM_1_0 are processed by the FFT processor and read by channel estimation. Furthermore, the memory operations for OFDM symbol index 12 are the same as for index 0; thus, the memory operations of the 5 memory banks repeat every 12 OFDM symbols.

Fig. 2.8 Time chart for the 5 memory banks

2.4.2 Specification of FFT Processor in Channel Estimation

The FFT_ch and IFFT_ch blocks in decision feedback DFT-based channel estimation (DF DFT-based CE) are shown in Fig. 2.9. Before introducing the system requirements, we briefly describe the DF DFT-based CE, which has two parts. The first part calculates the initial channel gain from the preamble signals. Its operational blocks are the preamble match block, two IFFT_ch blocks, the path selection block, the inverse Hessian matrix calculation, two FFT_ch blocks, and the channel estimator block. The channel gain should be calculated within 2 OFDM symbol times. The second part is the channel gain tracking loop. Its operational blocks are the gradient estimator, two IFFT_ch blocks, the search direction estimator calculation, two FFT_ch blocks, the channel estimator modification block, and the channel estimator block. The channel gain is calculated by the tracking loop in 2 iterations. In the first iteration, called global tracking, the channel gain variance for the channel estimator modification block is determined from the pilot signals, since the pilot signals have a higher SNR than the data signals. In the second iteration, called local tracking, the variance is determined from the data signals. Both parts can share the same IFFT_ch and FFT_ch blocks.

Fig. 2.9 FFT processor in decision feedback DFT-based channel estimation

Since the channel estimation includes a tracking loop, the channel gain should be calculated within 2 OFDM symbol times, before the data buffers for channel estimation in Fig. 2.7 are updated; thus, data latency is an important issue when implementing the channel estimation block in hardware. For this purpose, a parallel-in-parallel-out (PIPO) FFT/IFFT processor is necessary to increase not only the throughput rate of the FFT/IFFT processor but also that of the other blocks in the channel estimation block.

The DFT-based channel estimation has a special feature for the FFT_ch and IFFT_ch blocks. Only a subset of the output data is required at the IFFT_ch output ports. Also, the input data of the FFT_ch block may contain several zero points, which need not be computed together with the non-zero points. An FFT processor designed for only a subset of the input or output points is called a partial FFT [23]. The thesis introduces the idea of partial FFT processor design for DFT-based channel estimation.
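The saving behind the partial FFT idea can be illustrated with a direct-sum sketch that skips zero-valued inputs and unneeded outputs. This is not the partial FFT architecture of [23] or of this thesis; the function names, the 16-point size, and the choice of four non-zero inputs and four wanted outputs are assumptions made only to show that the skipped terms do not affect the results.

```python
import cmath

def full_dft(x):
    """Reference N-point DFT using all inputs and all outputs."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def partial_dft(x, nonzero_inputs, wanted_outputs):
    """Direct-sum DFT that visits only non-zero inputs and requested outputs."""
    N = len(x)
    return {k: sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                   for n in nonzero_inputs)
            for k in wanted_outputs}

N = 16
nonzero_inputs = range(4)                        # only x(0)..x(3) are non-zero
x = [complex(n + 1, -n) if n in nonzero_inputs else 0j for n in range(N)]
wanted_outputs = [0, 5, 10, 15]                  # hypothetical subset of outputs

partial = partial_dft(x, nonzero_inputs, wanted_outputs)
full = full_dft(x)

mults_partial = len(nonzero_inputs) * len(wanted_outputs)   # 4 * 4 products
mults_full = N * N                                          # 16 * 16 products
print(mults_partial, "of", mults_full, "products are actually needed")
```

In a real FFT, the same observation is applied by removing butterflies whose inputs are known to be zero or whose outputs are never read.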

In summary, there are two goals for the FFT processor design: one is an FFT processor with parallel-in-parallel-out in normal order, and the other is a partial FFT processor. The thesis focuses on the FFT processor design with parallel-in-parallel-out in normal order. The partial FFT processor design concept is introduced at the end of the next chapter.

Chapter 3 FFT Algorithms and Architectures

3.1 Concept of FFT Algorithms

The Discrete Fourier Transform (DFT) is a key block in OFDM communication systems, and it is widely used in many applications; however, its computational complexity is so high that implementing the DFT directly is not feasible under a low-cost design goal. Fortunately, early contributors, particularly Cooley and Tukey in 1965 [12], exploited the redundancy of DFT operations by iteratively decomposing the computation, known as the radix-2 FFT algorithm, to reduce the computational complexity from O(N²) to O(N log2 N). Based on Cooley and Tukey's FFT algorithm, various FFT algorithms were later developed, which provide flexible choices for implementation.

According to the ways of decomposing DFT, there are two types of FFT algorithms: one is the decimation-in-time (DIT) decomposition, which decomposes the time domain input sequence into successively smaller subsequences; the other is the decimation-in-frequency (DIF) decomposition, which alternately decomposes the frequency domain output sequence into smaller subsequences.

The basic N-point DFT equation is defined as

\[ X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1 \tag{3.1} \]

where $W_N^{kn} = \exp(-j 2\pi nk / N)$ is the DFT coefficient. Since multiplying a complex number by this coefficient is equivalent to a vector rotation, the DFT coefficient is also called the twiddle factor.
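Eq. (3.1) can be evaluated directly, which makes the N² complex multiplications of the plain DFT explicit. The sketch below is illustrative; the function names `twiddle` and `dft_direct` are chosen here and do not come from the thesis.

```python
import cmath

def twiddle(N, exponent):
    """Twiddle factor W_N^exponent = exp(-j*2*pi*exponent/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / N)

def dft_direct(x):
    """Direct evaluation of Eq. (3.1): N outputs, each a sum of N products."""
    N = len(x)
    return [sum(x[n] * twiddle(N, k * n) for n in range(N)) for k in range(N)]

impulse_spectrum = dft_direct([1, 0, 0, 0])    # unit impulse: flat spectrum
constant_spectrum = dft_direct([1, 1, 1, 1])   # constant input: energy only at k = 0

print([round(abs(v), 6) for v in impulse_spectrum])
print([round(abs(v), 6) for v in constant_spectrum])
```

For N = 1024 this direct form needs roughly a million complex multiplications per transform, which motivates the FFT decompositions discussed in this chapter.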

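The key feature of the FFT algorithm is that the symmetry of the twiddle factors halves the number of multiplications in Eq. (3.1), as shown in Eq. (3.2).

\[ W_N^{k + N/2} = -W_N^{k} \tag{3.2} \]

Similarly, the 90° phase relation between twiddle factors can reduce the computational complexity of Eq. (3.1) by using Eq. (3.3).

\[ W_N^{k + N/4} = -j\, W_N^{k} \tag{3.3} \]

Finally, the symmetry of a phase difference of ±45° is also common in FFT algorithms. Based on this symmetry, the equation can be reduced to

\[ W_N^{N/8} = \frac{\sqrt{2}}{2}\,(1 - j) \tag{3.4} \]

where the multiplication by the constant $\sqrt{2}/2$ may be customized to shifters and adders, as will be demonstrated in Chapter 4.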

According to the symmetry of the twiddle factors, the computational complexity of the DFT operation can be reduced to a fraction of the original. We take the radix-2 DIF FFT algorithm as an example in the following subsection.
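The twiddle-factor symmetries can be checked numerically. The sketch below tests three standard identities for a hypothetical N = 16: a half-turn shift negates the factor, a quarter-turn shift multiplies it by -j, and the eighth-turn factor equals (1 - j)·√2/2. The helper name `W` is illustrative.

```python
import cmath
import math

def W(N, k):
    """Twiddle factor W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / N)

N = 16

# Half turn: W_N^(k + N/2) = -W_N^k
half_turn_ok = all(abs(W(N, k + N // 2) + W(N, k)) < 1e-12 for k in range(N))

# Quarter turn: W_N^(k + N/4) = -j * W_N^k
quarter_turn_ok = all(abs(W(N, k + N // 4) + 1j * W(N, k)) < 1e-12 for k in range(N))

# Eighth turn: W_N^(N/8) = (1 - j) * sqrt(2)/2, a 45-degree rotation
eighth_turn_ok = abs(W(N, N // 8) - (1 - 1j) * math.sqrt(2) / 2) < 1e-12

print(half_turn_ok, quarter_turn_ok, eighth_turn_ok)
```

The first two identities turn multiplications into sign changes or real/imaginary swaps, and the third leaves only a constant scaling by √2/2.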

3.1.1 Radix-2 DIF FFT Algorithm

The DIF FFT algorithm decomposes the frequency-domain output sequence into smaller subsequences. Here we take the radix-2 DIF FFT algorithm as an example. The radix-2 DIF FFT algorithm divides the frequency-domain sequence into even and odd parts and uses the symmetry of the twiddle factor in Eq. (3.2), as shown below.
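The standard textbook form of this decomposition, written with the symbols of Eq. (3.1), is sketched here. It is the usual radix-2 DIF derivation and may differ in notation and numbering from the equations in the original thesis.

```latex
\begin{align*}
X(k) &= \sum_{n=0}^{N/2-1} x(n)\, W_N^{kn}
      + \sum_{n=0}^{N/2-1} x\!\left(n + \tfrac{N}{2}\right) W_N^{k(n + N/2)} \\
     &= \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k\, x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{kn},
        \qquad \text{since } W_N^{kN/2} = (-1)^k, \\
X(2r)   &= \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \tfrac{N}{2}\right) \right] W_{N/2}^{rn}, \\
X(2r+1) &= \sum_{n=0}^{N/2-1} \left\{ \left[ x(n) - x\!\left(n + \tfrac{N}{2}\right) \right] W_N^{n} \right\} W_{N/2}^{rn},
        \qquad r = 0, 1, \ldots, \tfrac{N}{2} - 1 .
\end{align*}
```

The even and odd outputs are thus each an N/2-point DFT of a sequence formed by 2-point sums and twiddled 2-point differences of the input.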

The DFT operation can be divided into 2 stages: one is a 2-point DFT, and the other is an N/2-point DFT, as shown below.

Fig. 3.1 Radix-2 DIF FFT algorithm architecture

After the first decomposition, the N-point DFT operation is divided into N/2 2-point DFT operations and two N/2-point DFT operations, where the 2-point DFT can be realized as a radix-2 butterfly (BF) module, shown in Fig. 3.2.

Fig. 3.2 Radix-2 butterfly module

Similar to the first decomposition, we can further decompose the N/2-point DFTs into even smaller DFTs until all DFTs are decomposed into 2-point DFTs.
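The repeated decomposition can be sketched as a recursive radix-2 DIF FFT. This is a software illustration under the assumption that the length is a power of two, not a model of the hardware architecture; `fft_dif` and `dft_direct` are names chosen here. The recursion interleaves the even and odd results so that the output appears in normal order.

```python
import cmath

def fft_dif(x):
    """Recursive radix-2 DIF FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Radix-2 butterflies: a sum branch and a twiddled difference branch.
    sums = [x[n] + x[n + half] for n in range(half)]
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]
    out = [0j] * N
    out[0::2] = fft_dif(sums)    # even-indexed outputs X(2r)
    out[1::2] = fft_dif(diffs)   # odd-indexed outputs X(2r+1)
    return out

def dft_direct(x):
    """Reference direct DFT used only to check the result."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

samples = [complex(n, -2 * n + 1) for n in range(8)]   # arbitrary 8-point input
fast = fft_dif(samples)
reference = dft_direct(samples)
print(max(abs(a - b) for a, b in zip(fast, reference)) < 1e-9)
```

Each recursion level performs N/2 butterflies, and there are log2 N levels, which gives the O(N log2 N) complexity.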

3.2 Concept of FFT Architectures

FFT processor architectures can be broadly divided into two types: pipeline-based FFT architectures [13-15] and memory-based FFT architectures [16-17]. The pipeline-based FFT architecture has the advantages of a high throughput rate and low data latency, but the disadvantage of high hardware cost; in contrast, the memory-based FFT architecture has low hardware cost but high data latency.

For both FFT processor architectures, raising the working clock rate is the simplest way to meet the throughput constraint; however, it also increases the hardware cost and power consumption of the FFT processor. In this chapter, we discuss different architectures for high-throughput FFT processors with multi-input-and-multi-output in normal order.

3.2.1 Pipeline-Based FFT Architecture

The pipeline-based FFT architectures are the most popular architectures in many applications because they are designed for high-speed performance and sequential data input; however, to deliver the output data in normal order, they usually need a reorder buffer at the output stage, which incurs a very high hardware cost. The best way to obtain the pipeline-based architecture is through vertical projection of the signal flow graph (SFG). Fig. 3.3 shows an example of the vertical projection mapping of an 8-point radix-2 DIF FFT.

Fig. 3.3 Vertical projection mapping of 8-point radix-2 DIF FFT
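The need for a reorder buffer comes from the output ordering of the DIF signal flow graph: with normal-order input, an in-place radix-2 DIF FFT delivers its results in bit-reversed index order. The sketch below lists that order for the 8-point case; the helper name `bit_reverse` is illustrative.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # shift in the lowest remaining bit
        index >>= 1
    return result

N = 8
bits = N.bit_length() - 1                      # 3 bits for an 8-point FFT

# Position i of the in-place DIF output holds X(bit_reverse(i)).
output_order = [bit_reverse(i, bits) for i in range(N)]
print(output_order)
```

Restoring normal order means that early results such as X(4) must wait until X(1), X(2), and X(3) have been produced, which is why a buffer on the order of the FFT size is typically required.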

Each stage obtained by vertical projection is called a processing element (PE), which contains a delay buffer (Buffer), a radix-2 butterfly unit (Radix-2 BF), and a complex multiplier. The delay buffer is used to reorder the input data for the butterfly unit of each stage. There are two types of delay buffer: delay-feedback (DF) and delay-commutator (DC). According to the structural difference, pipeline-based FFT architectures can be divided into three types: the single-path delay feedback (SDF) architecture, the single-path delay commutator (SDC) architecture, and the multi-path delay commutator (MDC) architecture. Since the SDC architecture provides only a single-path output data stream, like the SDF architecture, and its hardware cost lies between those of the SDF and MDC architectures, it cannot provide parallel data streams at the lowest hardware cost. We therefore focus only on the SDF and MDC architectures. In the following subsections, we introduce different radix-r SDF and MDC pipeline-based FFT architectures, where r is the radix of the decimation-in-time (DIT) or decimation-in-frequency (DIF) algorithm.

3.2.1.1 Radix-r Multi-Path Delay Commutator Architecture

The radix-r MDC architecture [18-19] uses commutators to break the input data into r parallel data streams that flow forward, with proper delays giving the correct ordering for the data entering the butterfly unit. Two examples introduce the MDC architecture in the following discussion.

(1). Radix-4 Multi-Path Delay Commutator (R4MDC) Architecture

Fig. 3.4 shows a 64-point FFT with the radix-4 multi-path delay commutator (R4MDC) architecture. In Fig. 3.4, the elements of the R4MDC architecture are commutators, shift registers, and radix-4 butterfly units. The butterfly unit is also called an arithmetic element (AE). At the beginning, the first 16 points of input data are delayed on the first line of AE1's inputs, the next 16 points are delayed on the second line, and the next 16 points are delayed on the third line. When the 49th point of input data arrives on the fourth line, the first butterfly is computed at AE1. With proper delays and commutation between each AE, the input data of each AE has correct ordering to
