Specification of FFT Processor in Channel Estimation

Chapter 2 FFT Application in OFDM Communication System

2.4 System Specification

2.4.2 Specification of FFT Processor in Channel Estimation

The FFT_ch and IFFT_ch blocks in decision feedback DFT-based channel estimation (DF DFT-based CE) are shown in Fig. 2.9. Before introducing the system requirements, we briefly describe the DF DFT-based CE, which has two parts. The first part calculates the initial channel gain from the preamble signals. Its operational blocks are the preamble match block, two IFFT_ch blocks, the path selection block, the inverse Hessian matrix calculation, two FFT_ch blocks, and the channel estimator block. The initial channel gain should be calculated within 2 OFDM symbol times. The second part is the channel gain tracking loop. Its operational blocks are the gradient estimator, two IFFT_ch blocks, the search direction estimator calculation, two FFT_ch blocks, the channel estimator modification block, and the channel estimator block. The tracking loop calculates the channel gain in 2 iterations. In the first iteration, called global tracking, the channel gain variance for the channel estimator modification block is determined by the pilot signals, since the pilot signals have higher SNR than the data signals. In the second iteration, called local tracking, the variance is determined by the data signals. Both parts can share the same IFFT_ch and FFT_ch blocks.

Fig. 2.9 FFT processor in decision feedback DFT-based channel estimation

Since the channel estimation includes a tracking loop, the channel gain should be calculated within 2 OFDM symbol times, before the data buffers for channel estimation in Fig. 2.7 are updated; thus, data latency is an important issue when implementing the channel estimation block in hardware. For this purpose, a parallel-in-parallel-out (PIPO) FFT/IFFT processor is necessary to increase the throughput rate not only of the FFT/IFFT processor but also of the other blocks in the channel estimation block.

The DFT-based channel estimation has a special feature for the FFT_ch and IFFT_ch blocks. Only a subset of output data is required for IFFT_ch output ports.

Also, the input data of the FFT_ch block may contain several zero points, which do not need to be computed together with the non-zero points. FFT processors designed for only a subset of the input or output points are called partial FFT processors [23]. This thesis will introduce the idea of partial FFT processor design for DFT-based channel estimation.
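As a minimal sketch of why these two features save work, the Python below evaluates a DFT directly while skipping zero inputs and computing only the requested outputs. The function name `partial_dft` and the example sizes are illustrative assumptions and do not come from the thesis; a real partial FFT processor would prune butterflies rather than evaluate the sum directly.

```python
import cmath

def partial_dft(x, wanted_outputs):
    """Direct DFT that skips zero inputs and computes only selected outputs.

    Returns (results, mults): results maps k -> X(k), and mults counts the
    complex multiplications actually performed.
    """
    n_points = len(x)
    nonzero = [(n, v) for n, v in enumerate(x) if v != 0]
    results, mults = {}, 0
    for k in wanted_outputs:
        acc = 0j
        for n, v in nonzero:
            acc += v * cmath.exp(-2j * cmath.pi * n * k / n_points)
            mults += 1
        results[k] = acc
    return results, mults

# Hypothetical example: 16-point transform, 4 nonzero inputs, 3 outputs needed.
x = [1, 2 - 1j, 0.5j, -1] + [0] * 12
out, mults = partial_dft(x, wanted_outputs=[0, 1, 2])
```

In this example the work scales with (nonzero inputs) × (wanted outputs) instead of N², which is the redundancy a partial FFT design tries to remove.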

Finally, the FFT processor design has two goals: one is an FFT processor with parallel-in-parallel-out in normal order, and the other is a partial FFT processor design. This thesis focuses on the FFT processor design with parallel-in-parallel-out in normal order. The partial FFT processor design concept is introduced at the end of the next chapter.

Chapter 3 FFT Algorithms and Architectures

3.1 Concept of FFT Algorithms

The discrete Fourier transform (DFT) is a key block in OFDM communication systems and is widely used in many applications; however, its computational complexity is so high that a direct implementation of the DFT is not feasible for a low-cost design goal. Fortunately, early contributors, particularly Cooley and Tukey in 1965 [12], exploited the redundancy of DFT operations by iteratively decomposing the computation. The resulting radix-2 FFT algorithm reduces the computational complexity from O(N²) to O(N log₂ N). Based on Cooley and Tukey’s FFT algorithm, various FFT algorithms were later developed, which provide flexible choices for implementation.

According to the ways of decomposing DFT, there are two types of FFT algorithms: one is the decimation-in-time (DIT) decomposition, which decomposes the time domain input sequence into successively smaller subsequences; the other is the decimation-in-frequency (DIF) decomposition, which alternately decomposes the frequency domain output sequence into smaller subsequences.

The basic N-point DFT equation is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{k \cdot n}, \quad k = 0, 1, \ldots, N-1 \qquad (3.1)$$

where $W_N^{k \cdot n} = \exp(-j2\pi nk/N)$ is the DFT coefficient. Since a complex number multiplied with a coefficient is equivalent to a vector rotation, the DFT coefficient is also called twiddle factor.
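For reference, Eq. (3.1) can be evaluated directly as in the short Python sketch below. The helper names `twiddle` and `dft_direct` are illustrative only. The double loop makes the N² complex multiplications of the direct form visible.

```python
import cmath

def twiddle(n_points, exponent):
    """Twiddle factor W_N^exponent = exp(-j*2*pi*exponent/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_points)

def dft_direct(x):
    """Direct N-point DFT: N outputs, each a sum of N products."""
    n_points = len(x)
    return [
        sum(x[n] * twiddle(n_points, n * k) for n in range(n_points))
        for k in range(n_points)
    ]

# An impulse has a flat spectrum; a constant has only a DC term.
flat = dft_direct([1, 0, 0, 0])
dc_only = dft_direct([1, 1, 1, 1])
```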

The key feature of the FFT algorithm is to cut the multiplications of the complete DFT operation in Eq. (3.1) by half, as shown in Eq. (3.2).

can reduce the computational complexity of Eq. (3.1) by using Eq. (3.3).

Finally, the symmetry of the twiddle factors with a phase difference of ±45° is also commonly used in FFT algorithms. Based on this symmetry, the equation can be reduced to

may be customized to shifters and adders, and will be demonstrated in Chapter 4.

According to the symmetry of the twiddle factors, the computational complexity of the DFT operation can be reduced to a fraction of the original. We take the radix-2 DIF FFT algorithm as an example in the following subsection.
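The twiddle-factor symmetries can be checked numerically. The sketch below tests the standard half-turn (180°), quarter-turn (90°), and eighth-turn (45°) identities; that these match the exact form of the equations used in this section is an assumption, and the function names are illustrative.

```python
import cmath
import math

def twiddle(n_points, exponent):
    """Twiddle factor W_N^exponent = exp(-j*2*pi*exponent/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_points)

def symmetries_hold(n_points):
    """Check three standard identities for every k (n_points divisible by 8):
       W^(k+N/2) = -W^k, W^(k+N/4) = -j*W^k, W^(k+N/8) = (sqrt(2)/2)(1-j)*W^k.
    """
    rot45 = (math.sqrt(2) / 2) * (1 - 1j)
    for k in range(n_points):
        w = twiddle(n_points, k)
        checks = [
            (twiddle(n_points, k + n_points // 2), -w),
            (twiddle(n_points, k + n_points // 4), -1j * w),
            (twiddle(n_points, k + n_points // 8), rot45 * w),
        ]
        if not all(cmath.isclose(a, b, abs_tol=1e-12) for a, b in checks):
            return False
    return True

ok = symmetries_hold(64)
```

The first two identities need no multiplier at all (a sign change, or a swap of real and imaginary parts with a sign change), and the third needs only a multiplication by the constant √2/2.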

3.1.1 Radix-2 DIF FFT Algorithm

The DIF FFT algorithm decomposes the frequency domain output sequence into smaller subsequences. Here we take the radix-2 DIF FFT algorithm as an example. The radix-2 DIF FFT algorithm divides the frequency domain sequence into even and odd parts and uses the symmetry of the twiddle factor in Eq. (3.2), as shown below.

The DFT operation can be divided into 2 stages: one is a 2-point DFT, and the other is an N/2-point DFT, as shown below.


Fig. 3.1 Radix-2 DIF FFT algorithm architecture

After the first decomposition, the N-point DFT operation is divided into N/2 2-point DFT operations and two N/2-point DFT operations. It is well known that the 2-point DFT can be realized as a radix-2 butterfly (BF) module, as shown in Fig. 3.2.

Fig. 3.2 Radix-2 butterfly module

Similar to the first decomposition, we can further decompose the N/2-point DFTs into even smaller DFTs until all DFTs are decomposed into 2-point DFTs.
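The repeated decomposition can be sketched as a recursive Python function. This is a minimal software model under the usual radix-2 DIF formulation, not the hardware architecture discussed later; the name `fft_dif_radix2` is illustrative. The sums of each butterfly feed the even-indexed outputs, and the twiddled differences feed the odd-indexed outputs.

```python
import cmath

def fft_dif_radix2(x):
    """Radix-2 DIF FFT; len(x) must be a power of two. Output in normal order."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    half = n_points // 2
    # Butterfly stage: N/2 two-point DFTs, with twiddles on the lower outputs.
    g = [x[n] + x[n + half] for n in range(half)]
    h = [
        (x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_points)
        for n in range(half)
    ]
    # Two N/2-point DFTs: g gives X(2r), h gives X(2r+1).
    out = [0j] * n_points
    out[0::2] = fft_dif_radix2(g)
    out[1::2] = fft_dif_radix2(h)
    return out

spectrum = fft_dif_radix2([1, 2, 3, 4, 0, -1, 2j, 0.5])
```

The interleaving in the last step performs in software the reordering that a hardware implementation needs a reorder buffer for.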

3.2 Concept of FFT Architectures

The FFT processor architecture design can be simply divided into two types: pipeline-based FFT architecture [13-15] and memory-based FFT architecture [16-17]. Pipeline-based FFT architecture has the advantages of high throughput rate and low data latency, but the disadvantage of high hardware cost; in contrast, memory-based FFT architecture has low hardware cost but high data latency.

For both FFT processor architectures, raising the working clock rate is the simplest way to meet the throughput constraint; however, it also increases the hardware cost and power consumption of the FFT processor. In this chapter, we discuss different architectures for high-throughput FFT processors with multi-input and multi-output in normal order.

3.2.1 Pipeline-Based FFT Architecture

The pipeline-based FFT architectures are the most popular architectures in many applications because they are designed for high-speed performance and sequential data input; however, in order to produce the output data in normal order, they usually need a reorder buffer at the output stage, which requires a very high hardware cost. The best way to obtain the pipeline-based architecture is through vertical projection of the signal flow graph (SFG). Fig. 3.3 shows an example that explains the vertical projection mapping of an 8-point radix-2 DIF FFT.

Fig. 3.3 Vertical projection mapping of 8-point radix-2 DIF FFT

Each stage obtained by vertical projection is called a processing element (PE), which contains a delay buffer (Buffer), a radix-2 butterfly unit (Radix-2 BF), and a complex multiplier. The delay buffer reorders the input data for the butterfly unit of each stage. There are two types of delay buffer: delay-feedback (DF) and delay-commutator (DC). According to the structural difference, pipeline-based FFT architectures can be divided into three types: single-path delay feedback (SDF) architecture, single-path delay commutator (SDC) architecture, and multi-path delay commutator (MDC) architecture. The SDC architecture provides only a one-path output data stream, like the SDF architecture, and its hardware cost lies between those of the SDF and MDC architectures; therefore, it cannot provide parallel data streams with the least hardware cost. Here we focus only on the SDF and MDC architectures. In the following subsections, we introduce different radix-r SDF and MDC pipeline-based FFT architectures, where r is the radix of the decimation-in-time (DIT) or decimation-in-frequency (DIF) algorithm.

3.2.1.1 Radix-r Multi-Path Delay Commutator Architecture

Radix-r MDC architecture [18-19] uses commutators to break the input data into r parallel data streams flowing forward, with proper delays so that the data enter the butterfly unit in the correct order. Two examples of the MDC architecture are discussed below.

(1). Radix-4 Multi-Path Delay Commutator (R4MDC) Architecture

Fig. 3.4 shows a 64-point FFT with radix-4 multi-path delay commutator (R4MDC) architecture. In Fig. 3.4, the elements of the R4MDC architecture are commutators, shift registers, and radix-4 butterfly units. The butterfly unit is also called an arithmetic element (AE). At the beginning, the first 16 points of input data are delayed on the first line of AE1’s inputs, the next 16 points are delayed on the second line, and the next 16 points are delayed on the third line. When the 49th point of input data arrives on the fourth line, AE1 computes the first butterfly. With proper delays and commutation between the AEs, the input data of each AE arrive in the correct order to compute a radix-4 butterfly. Finally, the output data of AE3 are in 2-bit-reversed order relative to the input data order.


Fig. 3.4 64-point FFT with R4MDC architecture

In order to revise the 64-point R4MDC FFT architecture for multi-input and multi-output in normal order, we have to replace the first stage and add a reorder stage at the output, as shown in Fig. 3.5.

Fig. 3.5 Modified input stage and output stage of 64-point R4MDC architecture

For multi-input in normal order, the first stage has to change the commutator from one input to four inputs. In order to write four data in one cycle to one of the input shift registers of AE1, the shift registers have to be changed into random access registers. Thus, the shift registers on the first to the third lines of AE1’s inputs are changed into

N random access registers. Furthermore, the fourth line has to add 4×4 random access registers to buffer the input data, because the fourth line has four input data and one output data.

For multi-output in normal order, a reorder stage has to be added at the output to reorder the output data from bit-reversed order to normal order. By using a delay commutator in a similar way, the output data order is changed into normal order.
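The index mapping that such a reorder stage has to realize can be sketched in Python. The sketch assumes that the "2-bit reverse" output order of the radix-4 pipeline is the base-4 digit-reversed order (each base-4 digit is a 2-bit group); the function names are illustrative, and the list-based reordering only models the mapping, not the delay-commutator hardware.

```python
def digit_reverse(index, radix, num_digits):
    """Reverse the base-`radix` digits of `index` (num_digits digits wide)."""
    reversed_index = 0
    for _ in range(num_digits):
        reversed_index = reversed_index * radix + index % radix
        index //= radix
    return reversed_index

def reorder_to_normal(stream, radix, num_digits):
    """stream[p] is assumed to carry X(digit_reverse(p)); return X in normal order."""
    normal = [None] * len(stream)
    for position, value in enumerate(stream):
        normal[digit_reverse(position, radix, num_digits)] = value
    return normal

# 64-point radix-4 case: three base-4 digits per index.
first_outputs = [digit_reverse(p, 4, 3) for p in range(6)]  # [0, 16, 32, 48, 4, 20]
```

Because X(1) is produced only at position 16 of the stream while X(16) is produced at position 1, a substantial part of the frame must be buffered before the normal-order output can proceed, which is why the reorder stage is costly.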

For an N-point FFT computation, R4MDC needs $\frac{11}{4}N-4$ registers, $3(\log_4 N-1)$ complex multipliers, and $8\log_4 N$ complex adders. The latency is $\frac{11}{16}N-5$ cycles.

(2). Radix-8 Multi-Path Delay Commutator (R8MDC) Architecture

Radix-8 multi-path delay commutator (R8MDC) architecture is similar to the R4MDC architecture. It uses the radix-8 algorithm with the MDC architecture, and with 8 parallel data streams it provides a higher throughput rate than the R4MDC architecture. However, it also needs more delay buffers and arithmetic elements. The 512-point FFT with R8MDC architecture is shown in Fig. 3.6.

Fig. 3.6 512-point FFT with R8MDC architecture

Also, for multi-input and multi-output in normal order, the first stage has to be revised and a reorder stage has to be added at the output stage, as shown in Fig. 3.7. The input stage and output stage are similar to those of the R4MDC architecture with multi-input and multi-output in normal order. As a result, the reorder stage needs more delay buffers when a higher radix is used in the MDC architecture.

For an N-point FFT computation, R8MDC needs $\frac{23}{8}N-8$ registers, $7(\log_8 N-1)$ complex multipliers, and $(24+2T)\log_8 N$ complex adders, where the parameter T indicates the number of adders required in the implementation of multiplications by constant values. The latency is $\frac{23}{64}N-9$ cycles.
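To make the MDC cost formulas concrete, the sketch below evaluates them for the two example sizes. It assumes that the register and latency coefficients read as 11/4 and 11/16 for R4MDC and as 23/8 and 23/64 for R8MDC. The function names are illustrative, and the value passed for T is a hypothetical placeholder, since T depends on the constant-multiplier implementation.

```python
from fractions import Fraction

def int_log(value, base):
    """Exact integer logarithm; raises if value is not a power of base."""
    exponent = 0
    while value > 1:
        if value % base:
            raise ValueError(f"{value} is not a power of {base}")
        value //= base
        exponent += 1
    return exponent

def r4mdc_cost(n_points):
    stages = int_log(n_points, 4)
    return {
        "registers": int(Fraction(11, 4) * n_points - 4),
        "multipliers": 3 * (stages - 1),
        "adders": 8 * stages,
        "latency_cycles": int(Fraction(11, 16) * n_points - 5),
    }

def r8mdc_cost(n_points, t_adders):
    stages = int_log(n_points, 8)
    return {
        "registers": int(Fraction(23, 8) * n_points - 8),
        "multipliers": 7 * (stages - 1),
        "adders": (24 + 2 * t_adders) * stages,
        "latency_cycles": int(Fraction(23, 64) * n_points - 9),
    }

r4 = r4mdc_cost(64)               # 64-point example of Fig. 3.4
r8 = r8mdc_cost(512, t_adders=2)  # 512-point example; T = 2 is a placeholder
```

Under these assumptions, the 64-point R4MDC needs 172 registers and has a latency of 39 cycles, while the 512-point R8MDC needs 1464 registers and has a latency of 175 cycles.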

Fig. 3.7 Modified input stage and output stage of 512-point R8MDC architecture

3.2.1.2 Radix-r Single-Path Delay Feedback Architecture

Unlike the multi-path delay commutator (MDC) architecture, the single-path delay feedback (SDF) architecture combines the commutator and the radix-r butterfly unit, and uses the delay feedback method to reuse the delay buffer of each stage to reorder the input data of the butterfly unit. The SDF architecture needs less hardware than the MDC architecture, but its data latency is longer. Moreover, since the SDF architecture has only one path between butterfly units, its throughput rate cannot be raised even with a higher-radix FFT algorithm. For input and output data in normal order, it needs a reorder buffer at the output stage, and the buffer size is about N/2 for an N-point DFT with SDF architecture. Three cases of the SDF architecture are discussed below.

(1). Radix-2 Single-Path Delay Feedback (R2SDF) Architecture

The radix-2 single-path delay feedback (R2SDF) architecture merges the commutator and the radix-2 butterfly unit of the radix-2 MDC architecture into R2SDF’s radix-2 butterfly unit, as shown in Fig. 3.8. Instead of sending 2 parallel data streams from the radix-2 butterfly unit to the next stage, R2SDF sends only one output to the next stage, and the other output is fed back and stored in the delay buffer; therefore, it is called the single-path delay feedback architecture.

For an N-point FFT computation, R2SDF needs $N-1$ registers, $\log_2 N-1$ complex multipliers, and $2\log_2 N$ complex adders. The latency is $N-1$ cycles without the reorder buffer.
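The delay-feedback behavior can be illustrated with a small stream-level model in Python. This is a behavioral sketch under simplifying assumptions (one sample per cycle, ideal arithmetic, one frame at a time); the names `sdf_stage` and `r2sdf_fft` are illustrative, and the model is not a description of the actual circuit in Fig. 3.8.

```python
import cmath
from collections import deque

def sdf_stage(samples, delay):
    """One radix-2 SDF stage with a feedback buffer of `delay` words."""
    buffer = deque([0j] * delay)
    out = []
    # Extra zeros at the end flush the last stored differences out of the buffer.
    for t, x in enumerate(list(samples) + [0j] * delay):
        stored = buffer.popleft()
        if (t // delay) % 2 == 0:
            # Fill phase: pass the buffered word on, store the incoming sample.
            out.append(stored)
            buffer.append(x)
        else:
            # Butterfly phase: send the sum forward, feed the twiddled
            # difference back into the delay buffer.
            w = cmath.exp(-2j * cmath.pi * (t % delay) / (2 * delay))
            out.append(stored + x)
            buffer.append((stored - x) * w)
    return out[delay:]  # drop the start-up latency of this stage

def r2sdf_fft(x):
    """Chain log2(N) stages with delays N/2, ..., 2, 1; output is bit-reversed."""
    stream = list(x)
    delay = len(x) // 2
    while delay >= 1:
        stream = sdf_stage(stream, delay)
        delay //= 2
    return stream

bit_reversed_spectrum = r2sdf_fft([1, 2, 3, 4, 0, -1, 2j, 0.5])
```

The buffer lengths N/2 + N/4 + ... + 1 add up to N–1, which matches both the register count and the latency quoted above, and the bit-reversed output order is the reason a reorder buffer is needed for normal-order output.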


Fig. 3.8 64-point FFT with radix-2 SDF architecture

(2). Radix-8 Single-Path Delay Feedback (R8SDF) Architecture

The block diagram of the 64-point radix-8 single-path delay feedback (R8SDF) architecture is shown in Fig. 3.9. It needs fewer multipliers than the R2SDF architecture: for a 64-point FFT, R8SDF saves 80% of the complex multipliers. However, it also has more register banks to store the data for the BF unit, which may increase power consumption.

For an N-point FFT computation, R8SDF needs $N-1$ registers, $\log_8 N-1$ complex multipliers, and $(24+2T)\log_8 N$ complex adders. The latency is $N-1$ cycles without the reorder buffer.
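The 80% figure follows directly from the multiplier formulas of the two SDF architectures, as the short calculation below shows. The function names are illustrative.

```python
def int_log(value, base):
    """Exact integer logarithm; raises if value is not a power of base."""
    exponent = 0
    while value > 1:
        if value % base:
            raise ValueError(f"{value} is not a power of {base}")
        value //= base
        exponent += 1
    return exponent

def sdf_multipliers(n_points, radix):
    """Complex multipliers of a radix-r SDF pipeline: log_r(N) - 1."""
    return int_log(n_points, radix) - 1

r2_mults = sdf_multipliers(64, 2)   # log2(64) - 1 = 5
r8_mults = sdf_multipliers(64, 8)   # log8(64) - 1 = 1
saving = (r2_mults - r8_mults) / r2_mults
```

For N = 64 the radix-2 pipeline has 5 complex multipliers and the radix-8 pipeline has 1, so 4 of 5 multipliers, or 80%, are saved.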


Fig. 3.9 64-point FFT with R8SDF architecture

(3). Radix-2/4/8 Single-Path Delay Feedback (R2³SDF) Architecture

Radix-2/4/8 single-path delay feedback (R2³SDF) architecture, shown in Fig. 3.10, is based on the R2SDF architecture with the radix-8 FFT algorithm; it replaces the radix-2 butterfly unit with the radix-8 FFT processing element shown in Fig. 3.11. It requires the same number of complex multipliers as the R8SDF architecture and fewer complex adders; moreover, its registers are split into fewer partitions than in the R8SDF architecture, which may reduce power consumption.

Fig. 3.10 64-point FFT with R2³SDF architecture

For N-point FFT computation, R2³SDF needs N–1 registers, (log8N–1) complex

Fig. 3.11 8-point FFT radix-2/4/8 SDF architecture

3.2.2 Memory-Based FFT Architecture

Memory-based FFT architecture, unlike pipeline-based FFT architecture, has only a few arithmetic elements (AEs), which are also called processing elements (PEs). There are two advantages of using memory-based FFT architecture: one is that the hardware area of the processing elements for an N-point DFT computation stays the same even when N is very large; the other is that the total number of memory banks is smaller than in pipeline-based FFT architecture, because it uses only a few PEs and needs fewer simultaneous read or write operations. Fig. 3.12 shows a radix-8 memory-based FFT architecture, which has only one radix-8 butterfly unit and 8 memory banks.

Fig. 3.12 Radix-8 memory-based (R8M) FFT architecture

For multi-input in normal order, the different input data arriving in one cycle should be written to different memory banks, but this requirement conflicts with the radix-r FFT algorithm for memory-based architecture. Similar to the MDC architecture, the radix-r memory-based FFT architecture can add a reorder stage at the input so that parallel data are written to different memory banks. Also, for multi-output in normal order, it needs a reorder stage at the output stage.

Another choice for memory-based FFT architecture with multi-input and multi-output in normal order is to rearrange the data in memory at the cost of higher control complexity. The next chapter presents the proposed FFT processor architecture based on this concept.
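The bank conflict, and one way a data rearrangement can remove it, can be sketched for N = 64 with 8 banks. The digit-sum bank mapping below is a well-known conflict-free addressing idea swapped in purely for illustration; it is an assumption here and is not claimed to be the arrangement used by the proposed architecture. All function names are hypothetical.

```python
def base_digits(index, radix, num_digits):
    """Base-`radix` digits of `index`, least significant first."""
    digits = []
    for _ in range(num_digits):
        digits.append(index % radix)
        index //= radix
    return digits

def bank_low_digit(index, radix=8, num_digits=2):
    """Simple mapping: the bank is the lowest base-8 digit of the index."""
    return index % radix

def bank_digit_sum(index, radix=8, num_digits=2):
    """Illustrative mapping: the bank is the sum of base-8 digits modulo 8."""
    return sum(base_digits(index, radix, num_digits)) % radix

def conflict_free(bank_fn, groups):
    """True if every group of simultaneous accesses hits distinct banks."""
    return all(len({bank_fn(i) for i in group}) == len(group) for group in groups)

N = 64
# Eight consecutive samples arrive per cycle in normal order.
input_groups = [list(range(start, start + 8)) for start in range(0, N, 8)]
# First-stage radix-8 DIF butterflies read indices spaced N/8 = 8 apart.
stage1_groups = [[n + 8 * m for m in range(8)] for n in range(8)]

low_digit_ok = (conflict_free(bank_low_digit, input_groups),
                conflict_free(bank_low_digit, stage1_groups))
digit_sum_ok = (conflict_free(bank_digit_sum, input_groups),
                conflict_free(bank_digit_sum, stage1_groups))
```

With the simple mapping, the parallel input writes are conflict-free but every first-stage butterfly reads all eight operands from the same bank. With the digit-sum mapping, both access patterns hit eight different banks, at the price of more complex address generation.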

For an N-point FFT computation, R8M needs $\frac{7}{8}N+56$ registers and N words of memory in 8 memory banks, 7 complex multipliers, and $(24+2T)$ complex adders.

The latency is $\frac{15}{64}N-8+\frac{N}{8}\log_8 N$ cycles.

3.3 Comparison of Different FFT Architecture

Table 3-1 Comparison of different FFT architecture

R2SDF R8SDF R2³SDF R4MDC R8MDC R8M

The comparison of different FFT architectures with multi-input and multi-output in normal order is shown in Table 3-1, where N is the FFT size and R is the internal clock rate of the FFT processor. Due to the FFT algorithms, all architectures need a reorder buffer at the input stage or output stage, and the hardware cost of the reorder buffer is so high that the conventional FFT architectures cannot efficiently produce the output sequence in normal order. For this reason, we have to develop an FFT processor that provides a high throughput rate with multi-input and multi-output in normal order at low hardware cost. With respect to the goal of low hardware cost, the radix-8 memory-based FFT architecture has the least hardware cost for a high throughput rate at the same working clock frequency. However, it also needs a very large reorder buffer. Therefore, the main issue of the FFT architecture with multi-input and multi-output in normal order is to reduce the reorder buffer. The proposed FFT architecture provides a high throughput rate with multi-input and multi-output in normal order and does not need any reorder buffer. It is introduced in the next chapter.

3.4 Partial FFT Design

3.4.1 Concept of Partial FFT

Partial FFT design studies the redundancies of the standard FFT algorithm that arise from a reduction in the number of input or output points. For most applications, the numbers of input and output points of the DFT operation are equal, but there are still some applications in which they differ, such as DFT-based channel estimation. Hence, many studies of partial FFT design have been presented to reduce the redundant operations of the FFT algorithm. This thesis introduces partial FFT design from two points of view in the following subsections: in one, only a single subset of the input or output points of the DFT operation is computed; in the other, multiple subsets of the input or output points are computed. Finally, we propose a partial FFT design suitable for DFT-based channel estimation, which combines the reduction methods for a single subset and for multiple subsets of input or output points.

3.4.2 DFT with only a Subset of Input or Output Points

There are two situations in which a partial FFT with only a subset of input or output points is needed: one is that only a narrow band of the spectrum is of interest but the resolution within the band has to be very high; the other is that a very high resolution spectrum is obtained by padding the input sequence with a large number of zeros. A regular FFT is usually used to compute the results, but if the number of nonzero inputs or the number of outputs of interest is small compared with the DFT length, it is very inefficient. The pruning algorithm [25][26] and transform decomposition [27] have been presented for efficient DFT computation with only a subset of input or output points. Because the transform decomposition method is not suitable for our application, we introduce only the pruning algorithm in the following.
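The idea behind input pruning can be sketched with a radix-2 DIF recursion that skips butterflies whose lower operand is known to be zero. This is a minimal illustration of the principle for the case in which only the first few inputs are nonzero; it is not a transcription of the flow graph in [25], and the name `pruned_dif_fft` is illustrative.

```python
import cmath

def pruned_dif_fft(x, num_nonzero):
    """Radix-2 DIF FFT that assumes x[num_nonzero:] are all zero.

    Returns (X in normal order, number of complex additions performed).
    """
    n_points = len(x)
    if n_points == 1:
        return [x[0]], 0
    half = n_points // 2
    tw = [cmath.exp(-2j * cmath.pi * k / n_points) for k in range(half)]
    if num_nonzero <= half:
        # The lower butterfly inputs are all zero, so the sum and the
        # difference both equal x[k]: no additions are needed in this stage.
        g = list(x[:half])
        h = [x[k] * tw[k] for k in range(half)]
        adds = 0
        child_nonzero = num_nonzero
    else:
        g = [x[k] + x[k + half] for k in range(half)]
        h = [(x[k] - x[k + half]) * tw[k] for k in range(half)]
        adds = n_points  # N/2 sums plus N/2 differences
        child_nonzero = half
    even, adds_even = pruned_dif_fft(g, child_nonzero)
    odd, adds_odd = pruned_dif_fft(h, child_nonzero)
    out = [0j] * n_points
    out[0::2], out[1::2] = even, odd
    return out, adds + adds_even + adds_odd

# 16-point transform with only the first 4 inputs nonzero.
x = [1, -2, 0.5j, 3] + [0] * 12
spectrum, pruned_adds = pruned_dif_fft(x, num_nonzero=4)
_, full_adds = pruned_dif_fft(x, num_nonzero=16)
```

In this example the first two stages need no additions at all, so the pruned transform uses 32 complex additions instead of the 64 of the unpruned 16-point FFT.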

The pruning algorithm was first developed by Markel [25] for computing only a subset of input or output points. An example of Markel’s pruned 16-point FFT with a subset of nonzero inputs is shown in Fig. 3.13, where Markel’s pruning algorithm is based on the radix-2 DIF FFT algorithm. We focus on the case in which the nonzero input points are the first L points of the input sequence because this case is similar to the
