Chapter 1 Introduction
1.3 Organization of Thesis
……
Figure 1.1 Generic OFDM block diagram (transmitter path: input data to RF; receiver path: from RF to output data)
On the other hand, it is desirable for a processor to perform FFTs of flexible size, which facilitates software adaptability when different formats and changing standards must be accommodated. Processors with high reconfigurability inevitably incur overhead. To minimize this overhead, the design of such reconfigurable processors must be considered at both the algorithm level and the architecture level.
This thesis aims to design a high-performance FFT/IFFT processor that meets modern high-speed criteria while maintaining low power consumption. The processor should be flexible enough to perform FFTs of different lengths and thus be suitable for various protocols and applications. The FFT length should be easily reconfigured by setting control registers, with the minimum possible hardware overhead.
1.3 Organization of Thesis
The rest of this thesis is organized as follows. Chapter 2 reviews general FFT algorithms and architectures. The basic concept of the FFT algorithm is explained and various FFT algorithms are introduced. The two popular FFT implementation architectures, memory-based and pipeline-based, are also described and compared in this chapter. The chapter concludes with the direction of algorithm and architecture that is most suitable for modern high-speed applications.
In this thesis, we propose an energy-aware reconfigurable mixed-radix (RMR) FFT/IFFT processor. The proposed processor can be easily reconfigured to perform FFTs/IFFTs from 16 points to 4096 points, with a proper mixed-radix algorithm assigned to each mode. In Chapter 3, we derive the proposed reconfigurable mixed-radix algorithm. The architecture design and the principle of each block are illustrated in Chapter 4.
In Chapter 5, the RMR FFT is implemented using TSMC 0.13 μm technology. As the proposed architecture will show, the internal storage block accounts for most of the FFT area and power in the cell-based synthesis flow. The implementation strategy for the internal storage blocks differs from that for the rest of the RMR FFT. The simulation results are analyzed and compared with other reconfigurable architectures. Finally, conclusions and future work are presented in Chapter 6.
Chapter 2
Review of FFT Algorithms and Architectures
2.1 Introduction
The discrete Fourier transform (DFT) is widely employed in the analysis, design, and implementation of signal processing algorithms and systems. However, the computational complexity of direct evaluation of an N-point DFT is $O(N^2)$, which results in a long computation time and excessive hardware cost. Fortunately, considerable symmetry exists in the operations and coefficients required to compute a DFT. Such symmetry can be exploited to reduce the number of operations required, thus reducing the time required for DFT computation. Collectively, the resulting efficient computation algorithms are called the fast Fourier transform (FFT).
Essentially, the FFT is a way of computing the DFT by decomposing the computation into successively smaller DFT computations. In this process, both the symmetry and the periodicity of the complex exponential are exploited.
Algorithms in which the decomposition is based on dividing the input sequence x[n] into successively smaller subsequences are called decimation-in-time (DIT) algorithms.
Alternatively, we can divide the output sequence X[k] into smaller subsequences; such algorithms are called decimation-in-frequency (DIF) algorithms.
\[ W_N^{nk} = e^{-j(2\pi/N)nk} \]
By far the most common FFT is the Cooley-Tukey algorithm [2.1], which is suitable for decomposing a DFT whose size is a power of 2. In this chapter we introduce some variants based on the Cooley-Tukey algorithm, which can be classified into fixed-radix algorithms and others. We also discuss the VLSI architectures for these algorithms. The two popular architectures, memory-based and pipeline-based, each have their own advantages and shortcomings.
2.2 Basic Concept of FFT Algorithms
The discrete Fourier transform of a complex data sequence x[n] of length N is defined as:
\[ X(k) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{2.1} \]
where the coefficient $W_N^{nk}$ is defined as
\[ W_N^{nk} = e^{-j 2\pi nk / N}, \]
which are called twiddle factors. The approach used to improve the efficiency in FFT is to exploit the symmetry and the periodicity properties of $W_N^{nk}$:
\[ W_N^{(N-n)k} = W_N^{-nk} = \left( W_N^{nk} \right)^{*} \quad \text{(symmetry property)} \tag{2.2} \]
\[ W_N^{nk} = W_N^{n(k+N)} = W_N^{(n+N)k} \quad \text{(periodicity in $n$ and $k$)} \tag{2.3} \]
As an illustration, using the periodicity property, we can group terms in Eq. (2.1) for n and (n+N):
\[ x[n]\, W_N^{nk} + x[n+N]\, W_N^{(n+N)k} = \left( x[n] + x[n+N] \right) W_N^{nk} \tag{2.4} \]
Similar groupings can be used for other terms in Eq. (2.1). In this way, the number of complex multiplications can be reduced by approximately a factor of 2. We can also take advantage of the fact that for certain factors the real and imaginary parts take on the value 1 or 0, which eliminates the need for multiplication. As a result, applying the above properties achieves a significant reduction in computation.
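The symmetry and periodicity properties of Eqs. (2.2) and (2.3) can be checked numerically. The following is a minimal Python sketch; the helper names `twiddle` and `check_twiddle_properties` are illustrative and do not come from the thesis.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def check_twiddle_properties(N, tol=1e-9):
    """Check Eq. (2.2) and Eq. (2.3) for all 0 <= n, k < N."""
    for n in range(N):
        for k in range(N):
            w = twiddle(N, n * k)
            # Symmetry: W_N^{(N-n)k} = W_N^{-nk} = (W_N^{nk})^*
            if abs(twiddle(N, (N - n) * k) - w.conjugate()) > tol:
                return False
            if abs(twiddle(N, -n * k) - w.conjugate()) > tol:
                return False
            # Periodicity in n and k: W_N^{nk} = W_N^{n(k+N)} = W_N^{(n+N)k}
            if abs(twiddle(N, n * (k + N)) - w) > tol:
                return False
            if abs(twiddle(N, (n + N) * k) - w) > tol:
                return False
    return True
```

Calling `check_twiddle_properties(8)` exercises every (n, k) pair of an 8-point transform. The same helper also shows a trivial factor: `twiddle(4, 1)` is numerically equal to -j, so multiplying by it needs no real multiplier.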
The Cooley-Tukey algorithm is the most common FFT algorithm. It recursively re-expresses the DFT of an arbitrary composite size $N = N_1 N_2$ in terms of smaller DFTs of sizes $N_1$ and $N_2$. FFT algorithms are based on the fundamental principle of decomposing the computation of the DFT of a length-N sequence into successively smaller DFTs. The manner in which this principle is implemented leads to a variety of different algorithms. In the following section, various FFT algorithms are introduced.
2.3 The FFT Algorithms
According to the manner of decomposition, the FFT algorithms can be classified as DIT and DIF algorithms. The difference lies in what is decomposed: the input sequence for DIT and the output sequence for DIF.
2.3.1 Decimation-in-Frequency (DIF) Fixed-Radix Algorithms
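The principle of the decimation-in-frequency algorithm is most conveniently illustrated by considering the N-point DFT where N is an integer power of 2, i.e., $N = 2^v$. Since N is an even integer, we can consider computing the even-numbered frequency samples and the odd-numbered frequency samples separately. Referring to Eq. (2.1), we can express X(k) as:
\[ X(k) = \sum_{n=0}^{N/2-1} x[n]\, W_N^{nk} + \sum_{n=0}^{N/2-1} x[n+N/2]\, W_N^{(n+N/2)k} \tag{2.5} \]
Based on the above equation, the even-numbered frequency samples are:
\[ X(2r) = \sum_{n=0}^{N/2-1} \left\{ x[n] + x[n+N/2] \right\} W_N^{2nr} = \sum_{n=0}^{N/2-1} \left\{ x[n] + x[n+N/2] \right\} W_{N/2}^{nr}, \qquad r = 0, 1, \ldots, N/2-1 \tag{2.6} \]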
The result of Eq. (2.6) can be seen as the N/2-point DFT of the sequence {x[n]+x[n+N/2]}, which is obtained by adding the first half and the second half of the input sequence. In the same way, the odd-numbered frequency points are:
\[ \begin{aligned} X(2r+1) &= \sum_{n=0}^{N/2-1} \left\{ x[n] + x[n+N/2]\, W_N^{(N/2)(2r+1)} \right\} W_N^{n(2r+1)} \\ &= \sum_{n=0}^{N/2-1} \left\{ x[n] - x[n+N/2] \right\} W_N^{n}\, W_{N/2}^{nr} \end{aligned} \tag{2.7} \]
Eq. (2.7) is then the N/2-point DFT of the sequence obtained by subtracting the second half of the input sequence from the first half and multiplying the resulting sequence by $W_N^n$. Therefore, the problem of computing an N-point DFT becomes that of computing two N/2-point DFTs. Recursively, we can further decompose the N/2-point DFTs in Eqs. (2.6) and (2.7) into smaller DFTs, proceeding until the only DFTs required are 2-point DFTs. The 2-point DFT takes the simple form of Eqs. (2.6) and (2.7), consisting only of multiplication and addition/subtraction operations. As a result, the computation of an N-point DFT requires no explicit DFT block, only multiplication and addition/subtraction operations.
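The recursion described above can be sketched in a few lines of Python. This is an illustrative software model only, assuming N is a power of 2; the function names `dft` and `fft_dif_radix2` are hypothetical, and the outputs are written back in normal order rather than in the order produced by the flow graph.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of Eq. (2.1), used as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dif_radix2(x):
    """Recursive radix-2 DIF FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Even-numbered outputs: N/2-point DFT of x[n] + x[n+N/2]
    a = [x[n] + x[n + half] for n in range(half)]
    # Odd-numbered outputs: N/2-point DFT of (x[n] - x[n+N/2]) * W_N^n
    b = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(half)]
    X = [0j] * N
    X[0::2] = fft_dif_radix2(a)
    X[1::2] = fft_dif_radix2(b)
    return X
```

Comparing `fft_dif_radix2(x)` with `dft(x)` on a 16-point input is a simple way to confirm that the two half-length transforms together reproduce Eq. (2.1).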
Figure 2.1, which is called a signal flow graph (SFG), illustrates the procedure of decomposing the 8-point DFT by the DIF algorithm. First, we decompose the 8-point DFT into a combination of two 4-point DFTs according to Eqs. (2.6) and (2.7), as shown in (a). The output frequency points are now separated into even-numbered and odd-numbered parts. We then divide each 4-point DFT into 2-point DFTs. Again, the output frequency points are separated. For the sequence {X(0),X(2),X(4),X(6)}, the even-numbered points are {X(0),X(4)} and the odd-numbered points are {X(2),X(6)}. The flow graph then becomes (b). Finally, we decompose the 2-point DFTs further and obtain the flow graph in (c). As we can see, no DFT block remains.
The basic computation unit in the flow graph of Figure 2.1, shown separately in Figure 2.2, is called a butterfly. In DIF algorithms, the butterfly outputs are multiplied by certain constants, which are called twiddle factors. This basic computation unit is effectively a 2-point DFT unit, as can be seen from (b) and (c) of Figure 2.1. Since the N-point DFT is always divided by 2 recursively, the above algorithm is called the radix-2 DIF algorithm.
Figure 2.1 Decomposition of the 8-point DFT step by step in DIF algorithm
Figure 2.2 The butterfly unit of radix-2 DIF FFT (inputs x[n] and x[n+N/2]; outputs X(n) = x[n]+x[n+N/2] and X(n+N/2) = x[n]-x[n+N/2])
Furthermore, the output ordering, as shown in the SFG, is not in the normal order of the time-domain input. The order in which the output data appear is referred to as bit-reversed order. The idea of bit-reversed order can be depicted by tree diagrams. Taking the 8-point DFT as an example, three binary digits are required to index the data. Figure 2.3 shows how the normal order and the bit-reversed order are derived, respectively. In (a), the normal order is obtained by sorting the data sequence through successive examination of the data index bits. In (b), the same procedure is used to obtain the bit-reversed order, except that the data index bits are examined in reverse.
Figure 2.3 Tree diagrams of (a) normal order and (b) bit-reversed order
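The bit-reversed ordering can also be generated directly from the index bits. The sketch below is illustrative (the names `bit_reverse` and `bit_reversed_order` are not from the thesis) and assumes N is a power of 2.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # append the current LSB
        index >>= 1
    return result

def bit_reversed_order(N):
    """List the indices 0..N-1 in bit-reversed order (N a power of 2)."""
    bits = N.bit_length() - 1  # number of index bits, log2(N)
    return [bit_reverse(i, bits) for i in range(N)]

# For the 8-point example: [0, 4, 2, 6, 1, 5, 3, 7]
print(bit_reversed_order(8))
```

For N = 8 the list starts with 0, 4, 2, 6, which matches the grouping {X(0),X(4)} and {X(2),X(6)} obtained from the even-numbered outputs in the decomposition above.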
Similar to the decomposition for even N, we can decompose the DFT into four parts if N is an integer power of 4, i.e., $N = 4^v$. We can divide the frequency samples into four parts and compute them separately. The equations representing these four frequency parts are thus:
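\[ X(4r) = \sum_{n=0}^{N/4-1} \left\{ x[n] + x[n+N/4] + x[n+N/2] + x[n+3N/4] \right\} W_{N/4}^{nr} \tag{2.8} \]
\[ X(4r+1) = \sum_{n=0}^{N/4-1} \left\{ x[n] - j\,x[n+N/4] - x[n+N/2] + j\,x[n+3N/4] \right\} W_N^{n}\, W_{N/4}^{nr} \tag{2.9} \]
\[ X(4r+2) = \sum_{n=0}^{N/4-1} \left\{ x[n] - x[n+N/4] + x[n+N/2] - x[n+3N/4] \right\} W_N^{2n}\, W_{N/4}^{nr} \tag{2.10} \]
\[ X(4r+3) = \sum_{n=0}^{N/4-1} \left\{ x[n] + j\,x[n+N/4] - x[n+N/2] - j\,x[n+3N/4] \right\} W_N^{3n}\, W_{N/4}^{nr} \tag{2.11} \]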
A decomposition of a $4^v$-point DFT can also be shown through a signal flow graph, similar to the one in Figure 2.1. This time, the basic computation unit is no longer a 2-point DFT butterfly but a 4-point DFT butterfly, as shown in Figure 2.4. The resulting algorithm, therefore, is called a radix-4 DIF algorithm.
Figure 2.4 The butterfly unit of radix-4 DIF FFT
Following the above decomposition procedure, we can derive even higher radix-r DIF algorithms by restricting N to be an integer power of r. The advantage of a higher-radix algorithm is that the number of complex multiplications is effectively lowered. As one radix-4 stage corresponds to two radix-2 stages in the SFG, the twiddle-factor multiplications between the two radix-2 stages are absorbed into the radix-4 stage. As shown in Figure 2.4, the complex multiplications inside the radix-4 butterfly, i.e., multiplications by $\{W_4^0, W_4^1, W_4^2, W_4^3\} = \{1, -j, -1, j\}$, are regarded as trivial multiplications. This means that these multiplications can be carried out without a true multiplier. Therefore, the effective number of complex multiplications required in the radix-4 algorithm is smaller than that in the radix-2 algorithm. Accordingly, algorithms with higher radix are more efficient than those with lower radix in terms of arithmetic. On the other hand, the butterfly of a higher-radix algorithm is more complicated. The trade-off is between additions/subtractions and multiplications. Since additions/subtractions have lower computational complexity than multiplications in complex-number computation, the higher-radix algorithms are usually preferred. However, the radix-r algorithm is only suitable for an $r^v$-point FFT. For a DFT whose length is not a power of r, a lower-radix algorithm must be used.
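To illustrate why the multiplications inside a radix-4 butterfly are trivial, the sketch below computes a 4-point DFT using only additions, subtractions, and a swap of real and imaginary parts. It is a software illustration under the standard definition of the 4-point DFT, not the hardware butterfly of this thesis, and the names `mul_minus_j` and `radix4_butterfly` are hypothetical.

```python
def mul_minus_j(z):
    """Multiply by -j without a multiplier: -j*(a + jb) = b - ja."""
    return complex(z.imag, -z.real)

def radix4_butterfly(x0, x1, x2, x3):
    """4-point DFT built from additions, subtractions, and real/imag swaps."""
    s02, d02 = x0 + x2, x0 - x2
    s13, d13 = x1 + x3, x1 - x3
    t = mul_minus_j(d13)  # -j * (x1 - x3)
    return (
        s02 + s13,  # X(0) = x0 + x1 + x2 + x3
        d02 + t,    # X(1) = x0 - j*x1 - x2 + j*x3
        s02 - s13,  # X(2) = x0 - x1 + x2 - x3
        d02 - t,    # X(3) = x0 + j*x1 - x2 - j*x3
    )
```

The butterfly needs eight complex additions/subtractions and no general complex multiplier, which is the trade-off between butterfly complexity and multiplication count described above.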
2.3.2 Decimation-in-Time (DIT) Fixed-Radix Algorithms
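To develop the DIT algorithm, let us again consider the N-point DFT where N is an integer power of 2, i.e., $N = 2^v$. Since N is an even integer, we can consider computing X(k) by separating x[n] into its even-numbered points and odd-numbered points. With X(k) given in Eq. (2.1), we can derive the following equation:
\[ \begin{aligned} X(k) &= \sum_{r=0}^{N/2-1} x[2r]\, W_N^{2rk} + \sum_{r=0}^{N/2-1} x[2r+1]\, W_N^{(2r+1)k} \\ &= \sum_{r=0}^{N/2-1} x[2r]\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, W_{N/2}^{rk} \end{aligned} \tag{2.12} \]
In the above equation, X(k) can be seen as a combination of the DFT of the even-numbered points and the DFT of the odd-numbered points of x[n]. Replacing them with G(k) and H(k), respectively, gives:
\[ X(k) = G(k) + W_N^{k} H(k) \tag{2.13} \]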
G(k) represents the N/2-point DFT of the even-numbered points in x[n] and H(k) represents the N/2-point DFT of the odd-numbered points in x[n]. We can then treat G(k) as an independent DFT and decompose it in the manner of Eq. (2.12). Recursively, G(k) is finally decomposed into 2-point DFTs, each of which is a multiply-and-add operation on two data. In the same way, H(k) can also be recursively decomposed into combinations of 2-point DFTs. A 2-point DFT, according to Eq. (2.13), is a multiply-and-add operation. Therefore, the N-point DFT can be calculated without any explicit DFT block.
Figure 2.5 Decomposition of the 8-point DFT step by step in DIT algorithm
Figure 2.5 shows how an 8-point DFT is decomposed by the DIT algorithm. First, we decompose the 8-point DFT into a combination of two 4-point DFTs according to Eqs. (2.12) and (2.13), as shown in (a). The time-domain input points are now separated into even-numbered and odd-numbered parts. We then divide each 4-point DFT into 2-point DFTs. Again, the input points are separated. For the sequence {x[0],x[2],x[4],x[6]}, the even-numbered points are {x[0],x[4]} and the odd-numbered points are {x[2],x[6]}. The flow graph then becomes (b). Finally, we decompose the 2-point DFTs further and obtain the flow graph in (c). At this point, no DFT block remains.
Similar to the DIF algorithm, the basic butterfly unit of the DIT algorithm is shown in Figure 2.6(a). However, be aware of the fact that:
\[ W_N^{r+N/2} = W_N^{r}\, W_N^{N/2} = -W_N^{r} \tag{2.14} \]
The butterfly is modified as in (b), which reduces the number of multiplications to 1.
This basic computation unit is also effectively a 2-point DFT unit, as can be seen from (b) and (c) of Figure 2.5. Therefore, the above algorithm is called a radix-2 DIT algorithm.
Figure 2.6 The butterfly unit of radix-2 DIT FFT: (a) X(k) = x[2r] + x[2r+1]·W_N^r and X(k+N/2) = x[2r] + x[2r+1]·W_N^(r+N/2); (b) X(k) = x[2r] + x[2r+1]·W_N^r and X(k+N/2) = x[2r] - x[2r+1]·W_N^r
As Figure 2.5 shows, the time-domain input of the DIT decomposition is in bit-reversed order while the frequency-domain output is in normal order. Overall, the SFG of the DIT algorithm is the reverse of the SFG of the DIF algorithm. We can also use the same methods as in the previous section to derive higher-radix decompositions of the DIT algorithm.
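The DIT recursion, together with the butterfly simplification of Eq. (2.14), can be sketched as follows. This is a minimal software model assuming N is a power of 2; the function names are illustrative, and the recursive slicing replaces the explicit bit-reversed input ordering of the flow graph.

```python
import cmath

def dft(x):
    """Direct O(N^2) evaluation of Eq. (2.1), used as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_dit_radix2(x):
    """Recursive radix-2 DIT FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    G = fft_dit_radix2(x[0::2])  # N/2-point DFT of even-numbered points
    H = fft_dit_radix2(x[1::2])  # N/2-point DFT of odd-numbered points
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * H[k]  # one multiplication
        X[k] = G[k] + t
        X[k + N // 2] = G[k] - t  # uses W_N^{k+N/2} = -W_N^k
    return X
```

Each butterfly in the loop performs a single complex multiplication and reuses the product for both outputs, which is the saving shown in Figure 2.6(b).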
2.3.3 Other FFT Algorithms
There are many other variations on the Cooley-Tukey algorithm. Mixed-radix implementations [2.2-2.5] handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the $O(N^2)$ algorithm for the prime base cases of the recursion. The idea of mixed-radix algorithms is straightforward. Whereas the fixed-radix algorithms recursively decompose the N-point DFT into N/r-point DFTs, we can also decompose the N-point DFT into $N/r_1$-point, $N/r_2$-point, …, and $N/r_m$-point DFTs as long as $N = r_1 \times r_2 \times \cdots \times r_m$.
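A generic mixed-radix recursion can be sketched as below. This is not the reconfigurable mixed-radix algorithm proposed in this thesis; it is a plain DIT Cooley-Tukey illustration that picks the smallest factor of N at each level and falls back to the direct DFT for prime lengths. All function names are hypothetical.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used for prime-length base cases."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def smallest_factor(N):
    """Smallest factor of N greater than 1 (returns N if N is prime or 1)."""
    f = 2
    while f * f <= N:
        if N % f == 0:
            return f
        f += 1
    return N

def fft_mixed_radix(x):
    """Mixed-radix DIT FFT for any length; radix chosen per recursion level."""
    N = len(x)
    r = smallest_factor(N)
    if r == N:  # prime length (or length 1): no further decomposition
        return dft(x)
    m = N // r
    # r interleaved subsequences x[r*q + p], each transformed with length m
    subs = [fft_mixed_radix(x[p::r]) for p in range(r)]
    # Recombine: X(k) = sum_p W_N^{pk} * S_p(k mod m)
    return [sum(cmath.exp(-2j * cmath.pi * p * k / N) * subs[p][k % m]
                for p in range(r))
            for k in range(N)]
```

For N = 12, for example, the recursion uses radices 2, 2, and 3 in turn, so a single routine covers lengths that no fixed radix can handle.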
Split radix [2.6-2.8] merges radices 2 and 4, exploiting the fact that the first transform of radix-2 requires no twiddle factor, in order to achieve the lowest known arithmetic operation count for power-of-two sizes. The DIF split-radix 2/4 algorithm decomposes the frequency sample as:
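\[ X(2r) = \sum_{n=0}^{N/2-1} \left\{ x[n] + x[n+N/2] \right\} W_{N/2}^{nr} \tag{2.15} \]
\[ X(4r+1) = \sum_{n=0}^{N/4-1} \left\{ \left( x[n] - x[n+N/2] \right) - j \left( x[n+N/4] - x[n+3N/4] \right) \right\} W_N^{n}\, W_{N/4}^{nr} \tag{2.16} \]
\[ X(4r+3) = \sum_{n=0}^{N/4-1} \left\{ \left( x[n] - x[n+N/2] \right) + j \left( x[n+N/4] - x[n+3N/4] \right) \right\} W_N^{3n}\, W_{N/4}^{nr} \tag{2.17} \]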
The SFG of the split-radix algorithm can be drawn in the same way as for the fixed-radix algorithms. Figure 2.7 shows the basic butterfly unit of the split-radix 2/4 algorithm. The split-radix algorithm features low computational complexity and is as flexible as the radix-2 algorithm.
Figure 2.7 The butterfly unit of split-radix 2/4 algorithm
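A minimal sketch of the standard DIF split-radix 2/4 recursion is given below, assuming N is a power of 2. The even-numbered outputs come from one N/2-point transform, while the outputs X(4r+1) and X(4r+3) come from two N/4-point transforms. The function names are illustrative and not taken from the thesis.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def fft_split_radix(x):
    """Recursive DIF split-radix 2/4 FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    h, q = N // 2, N // 4
    # Radix-2 part: even-numbered outputs X(2r)
    a = [x[n] + x[n + h] for n in range(h)]
    # Radix-4 part: odd-numbered outputs X(4r+1) and X(4r+3)
    b, c = [], []
    for n in range(q):
        d0 = x[n] - x[n + h]
        d1 = -1j * (x[n + q] - x[n + 3 * q])  # trivial multiplication by -j
        b.append((d0 + d1) * cmath.exp(-2j * cmath.pi * n / N))      # W_N^n
        c.append((d0 - d1) * cmath.exp(-2j * cmath.pi * 3 * n / N))  # W_N^{3n}
    X = [0j] * N
    X[0::2] = fft_split_radix(a)
    X[1::4] = fft_split_radix(b)
    X[3::4] = fft_split_radix(c)
    return X
```

The radix-2 branch carries no twiddle factors at all, and the non-trivial twiddle multiplications are confined to the two quarter-length branches.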
2.4 The FFT Architecture
The FFT architecture is the way the signal flow graph of an FFT algorithm is implemented in hardware. In this section, we introduce the FFT architectures that are common in VLSI implementation. Two architectures are popular for implementing FFT algorithms in real-time applications: the pipeline-based architecture and the memory-based architecture.
2.4.1 Pipeline-Based Architecture
The pipeline-based architecture is highly regular and can be easily scaled and parameterized in implementation [2.6, 2.8-2.15]. Compared to the memory-based architecture, it is characterized by a high throughput rate with moderate hardware complexity. An efficient method to obtain the pipeline architecture is to project the signal flow graph of the FFT algorithm onto the hardware data flow. Two common pipeline-based architectures are introduced next: the single-path delay feedback (SDF) architecture and the multi-path delay commutator (MDC) architecture.
2.4.1.1 Single-Path Delay Feedback (SDF) Architecture
The block diagram of the SDF architecture for the radix-2 DIF algorithm is shown in Figure 2.8. For the FFT length N = 16, there are 4 butterfly stages in the SFG. As we can see from the figure, a butterfly element is dedicated to each stage. The feedback registers are used to store the butterfly outputs. The butterfly element performs the butterfly operation when the required data are ready at the input ports; otherwise, it performs the swap operation to store data into the feedback registers.
The memory requirement of the SDF architecture is minimal. However, the utilization rate of the butterfly and multiplier units is only 50%.
Figure 2.8 Radix-2 DIF SDF architecture for N = 16 (four BF2 stages with feedback registers of length 8, 4, 2, and 1)
Similar to the radix-2 SDF architecture, the SDF architecture for the radix-4 algorithm can also be derived from the SFG. Figure 2.9 shows the case when the SDF architecture is applied to the radix-4 algorithm. Compared to the radix-2 architecture, the radix-4 architecture can implement the FFT with fewer computation stages.
However, the butterfly unit will be more complicated.
Figure 2.9 Radix-4 DIF SDF architecture for N = 64 (three BF4 stages with feedback registers of length 3x16, 3x4, and 3x1)
2.4.1.2 Multiple-Path Delay Commutator (MDC) Architecture
The MDC approach is even more straightforward than the SDF approach. Like the butterfly in the SFG, the architecture uses parallel data paths. Instead of delay feedback registers, delay elements are placed on the data paths. Between computation stages, a commutator is used to switch data to the correct positions.
Figure 2.10 shows the block diagram of the radix-2 DIF MDC architecture. The throughput rate of the radix-2 MDC architecture is twice that of the radix-2 SDF architecture due to the parallel data paths. However, the memory requirement is larger than that of the SDF architecture and also extra commutators are required.
Figure 2.10 Radix-2 MDC architecture for N = 16 (four BF2 stages with C2 commutators and delay elements on the two data paths)
The radix-4 MDC architecture follows the same principle as the radix-2 one. Figure 2.11 shows the block diagram of the radix-4 MDC architecture for N = 64. In the radix-4 MDC architecture, a higher throughput rate can be achieved due to the four parallel data paths. The price, however, is a larger memory requirement and higher hardware complexity.
Figure 2.11 Radix-4 DIF MDC architecture for N = 64 (three BF4 stages with C4 commutators and delay elements on the four data paths)
2.4.2 Memory-Based Architecture
The memory-based architecture is considered the most area-efficient way of implementing the FFT [2.2, 2.4-2.5, 2.16-2.19]. It usually consists of one computation block, a coefficient memory for twiddle factors, and memory to store I/O and internal data. The feature of such an architecture is that it usually uses only one or a few butterfly elements as the computation block. Since the butterflies and multipliers usually occupy most of the area and consume most of the power in the pipeline-based architecture, the memory-based architecture reduces this hardware cost and thus lowers the power consumption. Figure 2.12 shows the generic block diagram of the memory-based architecture. The hardware complexity of the memory-based architecture is concentrated in the control block. Since only one or a few butterfly elements are available, the execution proceeds stage by stage as in the SFG. The memory-based architecture usually uses one memory module to store the intermediate data. Since the data ordering differs from stage to stage, the order of the data stored in the memory must be managed after every stage of operation.
Figure 2.12 Block diagram of the memory-based architecture
Although the number of available butterfly units is reduced, the number of butterflies in the SFG remains the same. Therefore, the memory-based architecture results in a low throughput rate. In a radix-r algorithm, an N-point FFT requires $(N/r) \times \log_r N$ radix-r butterfly operations. Assume that the memory access bandwidth is K and the time for a butterfly operation is t. Then, the time to compute an N-point FFT can be expressed as:
\[ \text{Time for one FFT} = \frac{N}{r} \times \log_r N \times \frac{r}{K} \times t = \frac{N}{K} \times \log_r N \times t \tag{2.18} \]
From the above equation, it can be seen that the time for one FFT is inversely proportional to K and decreases with r through the factor $\log_r N = \log N / \log r$. Therefore, using high-radix algorithms is an efficient way to raise the throughput rate of a memory-based architecture.
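As a worked illustration of Eq. (2.18), the sketch below evaluates the butterfly count (N/r) × log_r N and the time (N/K) × log_r N × t for a few radices. The function names and the example values K = 2 and t = 1 are assumptions chosen for illustration, not figures from the thesis.

```python
def num_stages(N, r):
    """Return log_r(N), requiring N to be an integer power of r."""
    stages, size = 0, 1
    while size < N:
        size *= r
        stages += 1
    if size != N:
        raise ValueError("N must be an integer power of r")
    return stages

def butterfly_count(N, r):
    """Number of radix-r butterfly operations: (N/r) * log_r(N)."""
    return (N // r) * num_stages(N, r)

def fft_time(N, r, K, t):
    """Time for one FFT per Eq. (2.18): (N/K) * log_r(N) * t."""
    return (N / K) * num_stages(N, r) * t

# Example with assumed values K = 2 and t = 1 (arbitrary time units)
for r in (2, 4, 8):
    print(r, butterfly_count(4096, r), fft_time(4096, r, K=2, t=1.0))
```

For N = 4096 with these assumed values, moving from radix 2 to radix 4 halves the number of stages from 12 to 6 and therefore halves the computed time, while radix 8 needs 4 stages.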
2.4.3 Reconfigurable Architecture