National Chiao Tung University
Department of Electronics Engineering & Institute of Electronics
Master's Thesis
Energy-Aware Pipeline-based Reconfigurable Mixed-Radix
FFT/IFFT Processor Design
Student: Chi-Chen Lai
Advisor: Prof. Wei Hwang
A Thesis
Submitted to the Department of Electronics Engineering & Institute of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Electronics Engineering
June 2006
Hsinchu, Taiwan, Republic of China
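Energy-Aware Pipeline-based Reconfigurable Mixed-Radix
FFT/IFFT Processor Design
Student: Chi-Chen Lai
Advisor: Prof. Wei Hwang
Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University
ABSTRACT (translated from the Chinese)
This thesis presents an advanced reconfigurable mixed-radix FFT processor. The processor can be dynamically reconfigured to perform 16-point to 4096-point FFT/IFFT operations, with a different mixed-radix algorithm used for each length mode; the proposed architecture is also energy-aware. Unlike typical pipelined architectures, which use a larger internal wordlength to raise the signal-to-noise ratio (SNR), our architecture uses the same internal wordlength as the input data and applies the block-floating-point method to maintain the SNR. In addition, the pipelined architecture with eight parallel datapaths effectively reduces the number of computation cycles.
Simulation results show that the proposed FFT processor keeps the SNR above 110 dB across different data lengths. The processor is implemented in TSMC 0.13μm technology with a supply voltage of 1.2V. The maximum clock rate is 110MHz and the throughput reaches four times the clock rate, i.e., 440Msample/s. As the FFT length increases, the energy consumed per transform rises from 4.34nJ to 5.115μJ.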
Energy-Aware Pipeline-based Reconfigurable Mixed-Radix
FFT/IFFT Processor Design
Student:Chi-Chen Lai
Advisor:Prof. Wei Hwang
Department of Electronics Engineering & Institute of Electronics
National Chiao-Tung University
ABSTRACT
In this thesis, we present a novel FFT/IFFT processor, called the reconfigurable mixed-radix (RMR) FFT. It can be easily reconfigured for FFT/IFFT sizes from 16 to 4096 points, with a proper mixed-radix algorithm assigned to each mode. The proposed architecture is characterized by scalable energy dissipation for different FFT/IFFT sizes. Unlike general pipeline-based architectures, which use a larger internal wordlength to achieve a high signal-to-noise ratio (SNR), our processor keeps the internal wordlength the same as the wordlength of the input data and adopts the block-floating-point (BFP) approach to maintain the SNR. The pipeline-based architecture with an 8-parallel datapath results in a low number of computation cycles.
The simulation results show that the RMR FFT maintains the SNR above 110dB as the FFT size varies. The proposed RMR FFT processor is implemented using TSMC 0.13μm technology with a supply voltage of 1.2V. With the maximum clock rate of 110MHz, the throughput rate can reach 440Msample/s, which is 4 times the input clock rate. The energy dissipation per FFT ranges from 4.34nJ to 5.115μJ with increasing FFT sizes.
Acknowledgements
I would like to thank my advisor, Prof. Wei Hwang, who has provided me with a free research environment for the past two years. He has been supportive all the way, which freed me from worrying about outside disturbances. I was able to think and research independently on interesting topics. I have learned more from these experiences than books or papers could show.
My fellow laboratory members also helped a lot with my studies, and even more with life and many daily matters. I have learned a lot from them and overcame many difficulties with their help.
I would also like to thank my roommates and many schoolmates of the past few years. They have accompanied me for a long time, through everything that has happened. Life at NCTU would have been less colorful without them.
Table of Contents
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Organization of Thesis
Chapter 2 Review of FFT Algorithms and Architectures
2.1 Introduction
2.2 Basic Concept of FFT Algorithms
2.3 The FFT Algorithms
2.3.1 Decimation-in-Frequency (DIF) Fixed-Radix Algorithms
2.3.2 Decimation-in-Time (DIT) Fixed-Radix Algorithms
2.3.3 Other FFT Algorithms
2.4 The FFT Architecture
2.4.1 Pipeline-Based Architecture
2.4.1.1 Single-Path Delay Feedback (SDF) Architecture
2.4.1.2 Multiple-Path Delay Commutator (MDC) Architecture
2.4.2 Memory-Based Architecture
2.4.3 Reconfigurable Architecture
2.5 Conclusion
Chapter 3 Algorithm of Reconfigurable Mixed-Radix FFT
3.1 Introduction
3.2 Reconfigurable Mixed-Radix Algorithm
3.3 Data Ordering and Twiddle Factors
3.4 Finite Register Length Effect and Block-Floating-Point Method
3.5 Conclusion
Chapter 4 Architecture of Reconfigurable Mixed-Radix FFT
4.1 Introduction
4.2 Overall Architecture
4.3 Architecture Design
4.3.1 Butterfly (BF) Unit
4.3.1.1 General BF
4.3.1.2 Reconfigurable BF
4.3.2 Multiplier Stage
4.3.2.2 Complex Multiplier Approach
4.3.3 Register Banks (RB)
4.3.3.1 RB_64
4.3.3.2 RB_512 and RB_4096
4.3.3.3 Duplicate Module Insertion
4.3.4 BFP
4.3.5 Input/Output Buffer
4.4 Data Flow
4.5 Conclusion
Chapter 5 Implementation of RMR FFT/IFFT Processor
5.1 Introduction
5.2 Implementation Issue on Register Banks
5.3 Power Control
5.4 Simulation Result
5.4.1 Performance of the RMR FFT
5.4.2 Comparison
5.5 Layout Implementation
5.6 Conclusion
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
List of Tables
TABLE 3.1 Mixed-radix algorithms for different FFT sizes
TABLE 3.2 N-based twiddle factors required for each multiplier stage under different FFT size
TABLE 3.3 Storage elements required between each stage
TABLE 4.1 Truth table of control signals for reconfigurable BF
TABLE 4.2 Mapping table of the twiddle factors
TABLE 4.3 Implementation table of constants
TABLE 4.4 Scheduling of twiddle factors, $W_{64}^{p}$
TABLE 4.5 Scheduling of twiddle factors, $W_{4096}^{p}$
TABLE 4.6 Twiddle factors, $W_{4096}^{p}$, stored in 4th ROM
TABLE 4.7 Comparison of memory requirement (including the input buffer)
TABLE 4.8 Execution cycles required for different FFT length
TABLE 5.1 Truth table of the activated modules
TABLE 5.2 Execution cycles required for various FFT sizes
TABLE 5.3 Comparison with other reconfigurable architectures
List of Figures
Figure 1.1 Generic OFDM block diagram
Figure 2.1 Decomposition of the 8-point DFT step by step in DIF algorithm
Figure 2.2 The butterfly unit of radix-2 DIF FFT
Figure 2.3 Tree diagrams of (a) normal order and (b) bit-reversed order
Figure 2.4 The butterfly unit of radix-4 DIF FFT
Figure 2.5 Decomposition of the 8-point DFT step by step in DIT algorithm
Figure 2.6 The butterfly unit of radix-2 DIT FFT
Figure 2.7 The butterfly unit of split-radix 2/4 algorithm
Figure 2.8 Radix-2 DIF SDF architecture for N = 16
Figure 2.9 Radix-4 DIF SDF architecture for N = 64
Figure 2.10 Radix-2 MDC architecture for N = 16
Figure 2.11 Radix-4 DIF MDC architecture for N = 64
Figure 2.12 Block diagram of the memory-based architecture
Figure 2.13 Architecture of 1024-point radix-4 reconfigurable pipelined FFT processor
Figure 3.1 SFG of 8-point DIF FFT
Figure 3.2 SFG of 128-point FFT in radix-2 DIF algorithm
Figure 3.3 SFG of 128-point FFT in mixed-radix algorithm
Figure 3.4 SFG of 16-point DFT in (a) radix-2 algorithm, and (b) radix-8/2 algorithm
Figure 3.5 Extraction of radix-8 butterfly
Figure 3.6 Procedure of combining three radix-2 stages into one radix-8 stage
Figure 3.7 The twiddle factors for the mth radix-8 butterfly for N-point decomposition
Figure 3.8 Concept of block-floating-point
Figure 3.9 Blocks decomposition of 128-point FFT
Figure 4.1 FFT environment
Figure 4.2 Block diagram of the proposed RMR FFT
Figure 4.3 Circuit diagram of multiplication by 1/√2
Figure 4.4 Block diagram of general radix-8 butterfly
Figure 4.5 Block diagram of reconfigurable butterfly
Figure 4.6 Twiddle factors on the unit circle
Figure 4.7 Block diagram of a constant multiplier
Figure 4.8 Block diagram of CMULT stage
Figure 4.9 Block diagram of MULT stage
Figure 4.12 Block diagram of RB_64 for three different capacities, (a) 16-word, (b) 32-word, and (c) 64-word
Figure 4.13 Data flow in RB for 16-word mode
Figure 4.14 Block diagram of reconfigurable RB_64
Figure 4.15 Data flow in RB for 128-word mode
Figure 4.16 Control zones for the RB
Figure 4.17 Control signals for the 128-word RB
Figure 4.18 Block diagram of reconfigurable RB_512
Figure 4.19 Block diagram of BFP
Figure 4.20 Block diagram of reconfigurable input buffer
Figure 4.21 Data flow in the input buffer for N = 16
Figure 4.22 Flow of data path for 16, 32, 64-point FFT
Figure 4.23 Flow of data path for 128, 256, 512-point FFT
Figure 4.24 Control signal, PHASE, for duplicate RB modules
Figure 4.25 Flow of data path for 1024, 2048, 4096-point FFT
Figure 5.1 Circuit of synthesized scan D flip-flop
Figure 5.2 Block diagram of the two-input register array
Figure 5.3 Various structure of D flip-flops
Figure 5.4 Average current under low-clock-transition cases
Figure 5.5 Average current under high-clock-transition cases
Figure 5.6 SNR comparison
Figure 5.7 Power consumption for various FFT sizes (110MHz, 1.2V)
Figure 5.8 Power distribution characteristics
Figure 5.9 Energy dissipation per FFT operation, (a) in normal scale, (b) in log scale
Figure 5.10 Comparison of Energy dissipation between RMR FFT and the other reconfigurable pipeline-based architecture
Figure 5.11 Comparison of Energy dissipation between RMR FFT and the other reconfigurable memory-based architecture
Figure 5.12 Layout and schematic view of the 1-bit D flip-flop
Figure 5.13 Layout and schematic view of a basic block in RB_512
Figure 5.14 Layout and schematic view of RB_512
Figure 5.15 Layout and schematic view of RB_4096
Figure 5.16 Layout and schematic view of the 1-bit D flip-flop
Chapter 1 Introduction
1.1 Background
In discrete-time signal processing (DSP), engineers routinely move signals between the time domain and the frequency domain [1.1]. A sequence of samples from a measuring device gives a time-domain or spatial-domain representation, whereas the discrete Fourier transform (DFT) gives the frequency-domain information, that is, the frequency spectrum. Because much of communication theory is formulated in the frequency domain, the DFT is an important component.
However, mapping the DFT equation directly into a physical implementation results in unacceptable hardware overhead. The fast Fourier transform (FFT) was developed to make the implementation practical. FFTs became popular after J. W. Cooley of IBM and John W. Tukey of Princeton published a paper in 1965 [1.2] reinventing the algorithm and describing how to perform it conveniently on a computer. FFTs are of great importance to a wide variety of applications, from digital signal processing to solving partial differential equations to algorithms for quickly multiplying large integers.
The performance of the FFT is often the bottleneck of a DSP system, and the design of high-speed FFT processors has been an important topic for many years. Various architectures have been proposed to serve different applications. More recently, the popularity of portable systems has made low power consumption another serious design issue. The demand for low-power, high-speed FFT processors continues to grow.
1.2 Motivation
Many recent communication standards adopt orthogonal frequency-division multiplexing (OFDM) as the primary modulation method. A general block diagram of an OFDM system is shown in Figure 1.1. The FFT and inverse FFT (IFFT), which are essential for such modulation, are both computation-intensive and data-exchange-intensive. Many FFT algorithms and architectures have been proposed over the past decades to push performance further. However, modern communication standards require even faster FFT processors while power consumption remains critical. For example, in the popular OFDM-based UWB systems, the execution time of the 128-point FFT/IFFT is only 312.5 ns, equivalent to 409.6Msample/s [1.3].
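As a quick check of the UWB figure quoted above (a worked calculation added for illustration, not taken from [1.3]), dividing the transform length by the allowed execution time gives the required sample rate:

$$\frac{128\ \text{samples}}{312.5\ \text{ns}} = 0.4096\ \text{samples/ns} = 409.6\ \text{Msample/s}$$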
Figure 1.1 Generic OFDM block diagram. Transmitter blocks (input data to RF): base-band modulator, serial-to-parallel, IFFT, cyclic prefix, D/A converter. Receiver blocks (RF to output data): A/D converter, cyclic prefix remover, FFT, parallel-to-serial, base-band demodulator.
On the other hand, it is desirable for a processor to perform FFTs of flexible size, thereby facilitating software adaptability when different formats and changing standards must be accommodated. Highly reconfigurable processors inevitably incur overhead. To minimize it, the design of such processors must be considered at both the algorithm level and the architecture level.
This thesis aims to design a high-performance FFT/IFFT processor that meets modern high-speed criteria while maintaining low power consumption. The processor should be flexible enough to perform FFTs of different lengths and thus be suitable for various protocols and applications. The FFT length should be easily reconfigured by setting control registers, with the minimum possible hardware overhead.
1.3 Organization of Thesis
The rest of this thesis is organized as follows. Chapter 2 reviews general FFT algorithms and architectures. The basic concept of the FFT algorithm is explained and various FFT algorithms are introduced. The two popular FFT architectures, memory-based and pipeline-based, are also described and compared in that chapter. In its conclusion, we indicate the direction of algorithms and architectures most suitable for modern high-speed applications.
In this thesis, we propose an energy-aware reconfigurable mixed-radix FFT/IFFT processor. It can be easily reconfigured for FFT/IFFT sizes from 16 to 4096 points, with a proper mixed-radix algorithm assigned to each mode. Chapter 3 derives the proposed reconfigurable mixed-radix algorithm. The architecture design and the principle of each block are illustrated in Chapter 4.
In Chapter 5, the RMR FFT is implemented using TSMC 0.13μm technology. As will be shown for the proposed architecture, the internal storage blocks take up most of the FFT area and power in the cell-based synthesis flow. The implementation strategy for the internal storage blocks therefore differs from that of the rest of the RMR FFT. The simulation results are analyzed and compared with other reconfigurable architectures. Finally, conclusions and future work are presented in Chapter 6.
Chapter 2 Review of FFT Algorithms and Architectures
2.1 Introduction
The discrete Fourier transform (DFT) is widely employed in the analysis, design, and implementation of signal processing algorithms and systems. However, the computational complexity of direct evaluation of an N-point DFT is $O(N^2)$, which results in a long computation time and excessive hardware cost. Fortunately, considerable symmetry exists in the operations and coefficients required to compute a DFT. Such symmetry can be exploited to reduce the number of operations required, thus reducing the time required for DFT computation. Collectively, the resulting efficient computation algorithms are called the fast Fourier transform (FFT).
In essence, the FFT computes the DFT by decomposing the computation into successively smaller DFT computations. In this process, both the symmetry and the periodicity of the complex exponential $W_N^{nk} = e^{-j(2\pi/N)nk}$ are exploited. Algorithms in which the input sequence x[n] is decomposed into successively smaller subsequences are called decimation-in-time (DIT) algorithms. Alternatively, the output sequence X[k] can be divided into smaller subsequences; such algorithms are called decimation-in-frequency (DIF) algorithms.
By far the most common FFT is the Cooley-Tukey algorithm [2.1], which is suited to decomposing DFTs whose size is a power of 2. This chapter introduces some variants of the Cooley-Tukey algorithm, classified as fixed-radix algorithms and others. We also discuss the architectures used to implement these algorithms in VLSI. The two popular architectures, memory-based and pipeline-based, each have their advantages and certain shortcomings.
2.2 Basic Concept of FFT Algorithms
The discrete Fourier transform of a complex data sequence x[n] of length N is defined as:
$$X(k) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{2.1}$$
where the coefficients $W_N^{nk} = e^{-j 2\pi nk/N}$ are called twiddle factors. The approach used to improve the efficiency of the FFT is to exploit the symmetry and periodicity properties of $W_N^{nk}$:
$$W_N^{(N-n)k} = W_N^{-nk} = \left(W_N^{nk}\right)^{*} \quad \text{(symmetry property)} \tag{2.2}$$
$$W_N^{nk} = W_N^{n(k+N)} = W_N^{(n+N)k} \quad \text{(periodicity in n and k)} \tag{2.3}$$
As an illustration, using the periodicity property, we can group terms in Eq. (2.1) for n and (n+N):
$$x[n]\, W_N^{nk} + x[n+N]\, W_N^{(n+N)k} = \left(x[n] + x[n+N]\right) W_N^{nk} \tag{2.4}$$
Similar groupings can be used for other terms in Eq. (2.1). In this way, the number of complex multiplications can be reduced by approximately a factor of 2. We can also take advantage of the fact that, for certain twiddle factors, the real and imaginary parts take on the value 1 or 0, which eliminates the need for multiplication. As a result, applying the above properties achieves a significant reduction in computation.
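To make the definition and the two twiddle-factor properties concrete, the following minimal Python sketch evaluates Eq. (2.1) directly and checks Eqs. (2.2) and (2.3) numerically for one arbitrary choice of N, n, and k. The function names `twiddle`, `dft_direct`, and `close` are illustrative helpers introduced here, not names used in the thesis.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def dft_direct(x):
    """Direct evaluation of Eq. (2.1); the double loop costs N*N complex multiplications."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

def close(a, b, tol=1e-9):
    """Compare two complex numbers up to floating-point rounding."""
    return abs(a - b) < tol

# Arbitrary example values (assumption: any integers would do).
N, n, k = 8, 3, 5

# Symmetry, Eq. (2.2): W_N^{(N-n)k} equals the conjugate of W_N^{nk}.
symmetry_ok = close(twiddle(N, (N - n) * k), twiddle(N, n * k).conjugate())

# Periodicity, Eq. (2.3): adding N to either n or k leaves W_N^{nk} unchanged.
periodicity_ok = (close(twiddle(N, n * k), twiddle(N, n * (k + N)))
                  and close(twiddle(N, n * k), twiddle(N, (n + N) * k)))

print(symmetry_ok, periodicity_ok)
```

The direct form serves as a reference against which faster decompositions can be compared.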
The Cooley-Tukey algorithm is the most common FFT algorithm. It recursively re-expresses the DFT of an arbitrary composite size N = N1N2 in terms of smaller DFTs of sizes N1 and N2. FFT algorithms are based on the fundamental principle of decomposing the computation of the DFT of an N-length sequence into successively smaller DFTs. The manner in which this principle is implemented leads to a variety of different algorithms. In the following section, various FFT algorithms are introduced.
2.3 The FFT Algorithms
According to the manner of decomposition, the FFT algorithms can be classified as DIT and DIF algorithms. The difference is the object to be decomposed, input sequence for DIT and output sequence for DIF.
2.3.1 Decimation-in-Frequency (DIF) Fixed-Radix Algorithms
The principle of the decimation-in-frequency algorithm is most conveniently illustrated by considering the N-point DFT where N is an integer power of 2, i.e., $N = 2^v$. Since N is an even integer, we can consider computing the even-numbered frequency samples and odd-numbered frequency samples separately. Referring to Eq. (2.1), we can express X(k) as:
$$\begin{aligned}
X(k) &= \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{nk} + \sum_{n=N/2}^{N-1} x[n]\, W_N^{nk} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{nk} + \sum_{n=0}^{N/2-1} x\!\left[n+\tfrac{N}{2}\right] W_N^{(n+N/2)k} \\
&= \sum_{n=0}^{N/2-1} x[n]\, W_N^{nk} + W_N^{(N/2)k} \sum_{n=0}^{N/2-1} x\!\left[n+\tfrac{N}{2}\right] W_N^{nk} \\
&= \sum_{n=0}^{N/2-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{2}\right] W_N^{(N/2)k} \right\} W_N^{nk}
\end{aligned} \tag{2.5}$$
Based on the above equation, the even-numbered frequency samples are:
$$\begin{aligned}
X(2r) &= \sum_{n=0}^{N/2-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{2}\right] W_N^{(N/2)2r} \right\} W_N^{2rn} \\
&= \sum_{n=0}^{N/2-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{2}\right] \right\} W_{N/2}^{nr}
\end{aligned} \tag{2.6}$$
The result of Eq. (2.6) can be seen as the N/2-point DFT of the sequence obtained by adding the first half and the second half of the input sequence. In the same way, the odd-numbered frequency points are:
$$\begin{aligned}
X(2r+1) &= \sum_{n=0}^{N/2-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{2}\right] W_N^{(N/2)(2r+1)} \right\} W_N^{(2r+1)n} \\
&= \sum_{n=0}^{N/2-1} \left\{ x[n] - x\!\left[n+\tfrac{N}{2}\right] \right\} W_N^{n}\, W_{N/2}^{nr}
\end{aligned} \tag{2.7}$$
Eq. (2.7) is then the N/2-point DFT of the sequence obtained by subtracting the second half from the first half of the input sequence and multiplying the resulting sequence by $W_N^{n}$. Therefore, the problem of computing an N-point DFT becomes that of computing N/2-point DFTs. Recursively, we can further decompose the N/2-point DFTs in Eq. (2.6) and (2.7) into smaller DFTs. This decomposition proceeds until the only DFTs required are 2-point DFTs. The 2-point DFT takes the simple form of Eq. (2.6) and (2.7), which consists of multiplication and addition/subtraction operations. As a result, the computation of an N-point DFT requires no explicit DFT block but only multiplication and addition/subtraction operations.
Figure 2.1, which is called a signal flow graph (SFG), illustrates the procedure of decomposing the 8-point DFT by the DIF algorithm. First, we decompose the 8-point DFT into a combination of two 4-point DFTs according to Eq. (2.6) and (2.7), as shown in (a). The output frequency points have now been separated into even-numbered and odd-numbered parts. We then divide each 4-point DFT into 2-point DFTs. Again, the output frequency points are separated: for the sequence {X(0),X(2),X(4),X(6)}, the even-numbered points are {X(0),X(4)} and the odd-numbered points are {X(2),X(6)}. The flow graph then becomes (b). Finally, we decompose the 2-point DFTs further and obtain the flow graph in (c). As we can see, no DFT block remains.
The basic computation unit in the flow graph of Figure 2.1, shown separately in Figure 2.2, is called a butterfly. The butterfly outputs in DIF algorithms are multiplied by certain constants, the twiddle factors. This basic computation unit is effectively a 2-point DFT unit, as can be seen from (b) and (c) of Figure 2.1. Since the N-point DFT is always divided by 2 recursively, the above algorithm is called the radix-2 DIF algorithm.
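The radix-2 DIF recursion of Eqs. (2.6) and (2.7) can be sketched in a few lines of Python. This is a software illustration only, under the assumption that the input length is a power of 2; the function name `fft_dif_radix2` is introduced here for the example. The recursion interleaves the two half-size results so that the output is returned in normal order rather than in the order of the signal flow graph.

```python
import cmath

def fft_dif_radix2(x):
    """Recursive radix-2 DIF FFT following Eqs. (2.6) and (2.7).

    Assumes len(x) is a power of 2. Returns X(k) in normal order.
    """
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Butterflies: sums feed the even-numbered outputs (Eq. 2.6) ...
    sums = [x[n] + x[n + half] for n in range(half)]
    # ... and differences, multiplied by the twiddle factor W_N^n,
    # feed the odd-numbered outputs (Eq. 2.7).
    diffs = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
             for n in range(half)]
    out = [0j] * N
    out[0::2] = fft_dif_radix2(sums)   # X(2r)
    out[1::2] = fft_dif_radix2(diffs)  # X(2r+1)
    return out
```

Each recursion level performs N/2 butterflies, and there are log2(N) levels, which is where the familiar (N/2)·log2(N) butterfly count comes from.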
Figure 2.1 Decomposition of the 8-point DFT step by step in DIF algorithm: (a) two 4-point DFT blocks, (b) 2-point DFT blocks, (c) complete signal flow graph. The inputs x[0] to x[7] are in normal order; in (b) and (c) the outputs appear in the order X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7).
Figure 2.2 The butterfly unit of radix-2 DIF FFT: inputs x[n] and x[n+N/2]; outputs X(n) = x[n] + x[n+N/2] and X(n+N/2) = x[n] - x[n+N/2].
Furthermore, the output ordering, as shown in the SFG, is not the normal order of the time-domain input. The order in which the output data appear is referred to as bit-reversed order, and it is well depicted by tree diagrams. Taking the 8-point DFT as an example, three binary digits are required to index the data. Figure 2.3 shows how the normal order and the bit-reversed order are derived. In (a), the normal order is obtained by sorting the data sequence through successive examination of the data index bits. In (b), the same procedure yields the bit-reversed order, except that the index bits are examined in reverse.
Figure 2.3 Tree diagrams of (a) normal order and (b) bit-reversed order. In (a) the index bits n2, n1, n0 are examined in that order, giving x[000], x[001], x[010], x[011], x[100], x[101], x[110], x[111]; in (b) they are examined in reverse, giving X(000), X(100), X(010), X(110), X(001), X(101), X(011), X(111).
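The bit-reversed ordering described above can be generated with a short routine. The sketch below is illustrative (the helper names `bit_reverse` and `bit_reversed_order` are assumptions of this example); it reverses the binary index of each position, assuming N is a power of 2.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)  # move the current LSB to the other end
        index >>= 1
    return rev

def bit_reversed_order(N):
    """List the indices 0..N-1 in bit-reversed order (N a power of 2)."""
    bits = N.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(N)]

print(bit_reversed_order(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

For N = 8 this reproduces the output order X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7) of the DIF flow graph.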
Similar to the way of decomposing the even integer N, we can decompose N into four parts if N is an integer power of 4, i.e., $N = 4^v$. We can divide the frequency samples into four parts and consider computing them separately. The equations for these four frequency parts are:
$$X(4r) = \sum_{n=0}^{N/4-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{4}\right] + x\!\left[n+\tfrac{2N}{4}\right] + x\!\left[n+\tfrac{3N}{4}\right] \right\} W_{N/4}^{nr} \tag{2.8}$$
$$X(4r+1) = \sum_{n=0}^{N/4-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{4}\right] W_4^{1} + x\!\left[n+\tfrac{2N}{4}\right] W_4^{2} + x\!\left[n+\tfrac{3N}{4}\right] W_4^{3} \right\} W_N^{n}\, W_{N/4}^{nr} \tag{2.9}$$
$$X(4r+2) = \sum_{n=0}^{N/4-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{4}\right] W_4^{2} + x\!\left[n+\tfrac{2N}{4}\right] W_4^{4} + x\!\left[n+\tfrac{3N}{4}\right] W_4^{6} \right\} W_N^{2n}\, W_{N/4}^{nr} \tag{2.10}$$
$$X(4r+3) = \sum_{n=0}^{N/4-1} \left\{ x[n] + x\!\left[n+\tfrac{N}{4}\right] W_4^{3} + x\!\left[n+\tfrac{2N}{4}\right] W_4^{6} + x\!\left[n+\tfrac{3N}{4}\right] W_4^{9} \right\} W_N^{3n}\, W_{N/4}^{nr} \tag{2.11}$$
A decomposition of a $4^v$-point DFT can also be shown through a signal flow graph, similar to the one in Figure 2.1. This time, the basic computation unit is no longer a 2-point DFT butterfly but a 4-point DFT butterfly, as shown in Figure 2.4. The resulting algorithm, therefore, is called a radix-4 DIF algorithm.
Figure 2.4 The butterfly unit of radix-4 DIF FFT (inputs x[n], x[n+N/4], x[n+2N/4], x[n+3N/4]; the four outputs are the braced terms of Eqs. (2.8) to (2.11))
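The radix-4 butterfly of Eqs. (2.8) to (2.11) can be sketched in Python as follows. Since $W_4^1 = -j$, $W_4^2 = -1$, and $W_4^3 = +j$, the butterfly itself needs only additions, subtractions, and a swap of real and imaginary parts. The names `radix4_butterfly` and `fft_dif_radix4` are illustrative, and the recursion assumes the input length is a power of 4.

```python
import cmath

def radix4_butterfly(a, b, c, d):
    """4-point DFT of (x[n], x[n+N/4], x[n+2N/4], x[n+3N/4]) without a true multiplier."""
    def mul_neg_j(z):
        # z * (-j): (re + j*im) -> (im - j*re), i.e. a swap and a sign change
        return complex(z.imag, -z.real)
    t0 = a + c
    t1 = a - c
    t2 = b + d
    t3 = mul_neg_j(b - d)
    return (t0 + t2,  # braced term of Eq. (2.8)
            t1 + t3,  # braced term of Eq. (2.9)
            t0 - t2,  # braced term of Eq. (2.10)
            t1 - t3)  # braced term of Eq. (2.11)

def fft_dif_radix4(x):
    """Recursive radix-4 DIF FFT; assumes len(x) is a power of 4."""
    N = len(x)
    if N == 1:
        return list(x)
    q = N // 4
    branches = ([], [], [], [])
    for n in range(q):
        outs = radix4_butterfly(x[n], x[n + q], x[n + 2 * q], x[n + 3 * q])
        for l in range(4):
            # Non-trivial twiddle factor W_N^{l*n} applied after the butterfly
            branches[l].append(outs[l] * cmath.exp(-2j * cmath.pi * l * n / N))
    out = [0j] * N
    for l in range(4):
        out[l::4] = fft_dif_radix4(branches[l])  # X(4r + l)
    return out
```

Only the `W_N^{l*n}` products after the butterfly are general complex multiplications, which illustrates why a radix-4 stage needs fewer true multipliers than two cascaded radix-2 stages.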
Following the above decomposition procedure, we can derive even higher radix-r DIF algorithms by restricting N to an integer power of r. The advantage of a higher-radix algorithm is that the number of complex multiplications is effectively lowered. As one radix-4 stage corresponds to two radix-2 stages in the SFG, the twiddle-factor multiplications between the two radix-2 stages are absorbed into the radix-4 stage. As shown in Figure 2.4, the complex multiplications in the radix-4 butterfly, multiplications by {$W_4^0$, $W_4^1$, $W_4^2$, $W_4^3$}, are regarded as trivial multiplications, meaning that they can be carried out without a true multiplier. Therefore, the effective number of complex multiplications required in the radix-4 algorithm is smaller than in the radix-2 algorithm. Accordingly, higher-radix algorithms are more efficient than lower-radix ones in the arithmetic aspect. On the other hand, the butterfly of a higher-radix algorithm is more complicated. The trade-off is between additions/subtractions and multiplications. Since additions/subtractions are of lower computational complexity than multiplications in complex-number computation, higher-radix algorithms are usually preferred. However, the radix-r algorithm is only suitable for $r^v$-point FFTs. For a DFT sequence of length not power of r, lower
2.3.2 Decimation-in-Time (DIT) Fixed-Radix Algorithms
To develop the DIT algorithm, let us again consider the N-point DFT where N is an integer power of 2, i.e., $N = 2^v$. Since N is an even integer, we can consider computing X(k) by separating x[n] into the even-numbered points and odd-numbered points. With X(k) given in Eq. (2.1), we can derive the following equation:
$$\begin{aligned}
X(k) &= \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \\
&= \sum_{r=0}^{N/2-1} x[2r]\, W_N^{2rk} + \sum_{r=0}^{N/2-1} x[2r+1]\, W_N^{(2r+1)k} \\
&= \sum_{r=0}^{N/2-1} x[2r]\, W_N^{2rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, W_N^{2rk} \\
&= \sum_{r=0}^{N/2-1} x[2r]\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, W_{N/2}^{rk}
\end{aligned} \tag{2.12}$$
In the above equation, X(k) can be seen as a combination of the DFT of the even-numbered points and the DFT of the odd-numbered points of x[n]. Replacing them with G(k) and H(k), respectively:
$$X(k) = \sum_{r=0}^{N/2-1} x[2r]\, W_{N/2}^{rk} + W_N^{k} \sum_{r=0}^{N/2-1} x[2r+1]\, W_{N/2}^{rk} = G(k) + W_N^{k} H(k) \tag{2.13}$$
G(k) represents the N/2-point DFT of the even-numbered points in x[n] and H(k) represents the N/2-point DFT of the odd-numbered points in x[n]. We can then treat G(k) as an independent DFT and decompose it in the manner of Eq. (2.12). Recursively, G(k) is finally decomposed into 2-point DFTs, each of which is a multiply-and-add operation on two data. In the same way, H(k) can be recursively decomposed into combinations of 2-point DFTs. A 2-point DFT, according to Eq. (2.13), is a multiply-and-add operation. Therefore, the N-point DFT can be calculated without any explicit DFT block.
Figure 2.5 Decomposition of the 8-point DFT step by step in DIT algorithm: (a) two 4-point DFT blocks, (b) 2-point DFT blocks, (c) complete signal flow graph. The outputs X(0) to X(7) are in normal order; in (b) and (c) the inputs appear in the order x[0], x[4], x[2], x[6], x[1], x[5], x[3], x[7].
Figure 2.5 shows how an 8-point DFT is decomposed by the DIT algorithm. First, we decompose the 8-point DFT into a combination of two 4-point DFTs according to Eq. (2.12) and (2.13), as shown in (a). The time-domain input points have now been separated into even-numbered and odd-numbered parts. We then divide each 4-point DFT into 2-point DFTs. Again, the input points are separated: for the sequence {x[0],x[2],x[4],x[6]}, the even-numbered points are {x[0],x[4]} and the odd-numbered points are {x[2],x[6]}. The flow graph then becomes (b). Finally, we decompose the 2-point DFTs further and obtain the flow graph in (c). At this point, no DFT block remains.
Similar to the DIF algorithm, the basic butterfly unit of the DIT algorithm is shown in Figure 2.6(a). However, note that:
$$W_N^{r+N/2} = W_N^{r}\, W_N^{N/2} = -W_N^{r} \tag{2.14}$$
The butterfly is therefore modified as in (b), which reduces the number of multiplications to 1. This basic computation unit is also effectively a 2-point DFT unit, as can be seen from (b) and (c) of Figure 2.5. Therefore, the above algorithm is called a radix-2 DIT algorithm.
Figure 2.6 The butterfly unit of radix-2 DIT FFT: (a) $X(k) = x[2r] + x[2r+1]\,W_N^{r}$ and $X(k+N/2) = x[2r] + x[2r+1]\,W_N^{r+N/2}$; (b) $X(k) = x[2r] + x[2r+1]\,W_N^{r}$ and $X(k+N/2) = x[2r] - x[2r+1]\,W_N^{r}$
Observing Figure 2.5, the time-domain inputs of the DIT decomposition are in bit-reversed order while the frequency-domain outputs are in normal order. Overall, the SFG of the DIT algorithm is the reverse of the SFG of the DIF algorithm. The same method as in the previous section can be used to derive higher-radix decompositions for the DIT algorithm.
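The DIT recursion of Eq. (2.13), together with the single-multiplication butterfly that follows from Eq. (2.14), can be sketched as below. The function name `fft_dit_radix2` is an assumption of this example, and the input length is assumed to be a power of 2.

```python
import cmath

def fft_dit_radix2(x):
    """Recursive radix-2 DIT FFT following Eq. (2.13); len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    G = fft_dit_radix2(x[0::2])  # N/2-point DFT of the even-numbered inputs
    H = fft_dit_radix2(x[1::2])  # N/2-point DFT of the odd-numbered inputs
    half = N // 2
    out = [0j] * N
    for k in range(half):
        # One complex multiplication per butterfly, reused for both outputs
        t = cmath.exp(-2j * cmath.pi * k / N) * H[k]
        out[k] = G[k] + t
        out[k + half] = G[k] - t  # uses W_N^{k+N/2} = -W_N^k, Eq. (2.14)
    return out
```

Slicing the input into even and odd samples at every level is what produces the bit-reversed input order when the recursion is unrolled into a flow graph.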
2.3.3 Other FFT Algorithms
There are many other variations on the Cooley-Tukey algorithm. Mixed-radix implementations [2.2-2.5] handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the $O(N^2)$ algorithm for the prime base cases of the recursion. The idea of mixed-radix algorithms is straightforward. Just as the fixed-radix algorithms recursively decompose the N-point DFT into N/r-point DFTs, we can also decompose the N-point DFT into N/r1-point,
Split radix [2.6-2.8] merges radices 2 and 4, exploiting the fact that the first transform of radix-2 requires no twiddle factor, in order to achieve the lowest known arithmetic operation count for power-of-two sizes. The DIF split-radix 2/4 algorithm decomposes the frequency samples as:
$$X(2k) = \sum_{n=0}^{N/2-1} \left\{ x[n] + x\!\left[n+\tfrac{2N}{4}\right] \right\} W_{N/2}^{nk} \tag{2.15}$$
$$X(4k+1) = \sum_{n=0}^{N/4-1} \left\{ \left( x[n] - x\!\left[n+\tfrac{2N}{4}\right] \right) - j \left[ x\!\left[n+\tfrac{N}{4}\right] - x\!\left[n+\tfrac{3N}{4}\right] \right] \right\} W_N^{n}\, W_{N/4}^{nk} \tag{2.16}$$
$$X(4k+3) = \sum_{n=0}^{N/4-1} \left\{ \left( x[n] - x\!\left[n+\tfrac{2N}{4}\right] \right) + j \left[ x\!\left[n+\tfrac{N}{4}\right] - x\!\left[n+\tfrac{3N}{4}\right] \right] \right\} W_N^{3n}\, W_{N/4}^{nk} \tag{2.17}$$
The SFG of the split-radix algorithm can be drawn in the same way as for the fixed-radix algorithms. Figure 2.7 shows the basic butterfly unit for the split-radix 2/4 algorithm. The split-radix algorithm features low computational complexity and is as flexible as the radix-2 algorithm.
Figure 2.7 The butterfly unit of split-radix 2/4 algorithm (inputs x[n], x[n+N/4], x[n+2N/4], x[n+3N/4]; twiddle factors $W_4^{1}$, $W_N^{n}$, and $W_N^{3n}$)
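The split-radix decomposition of Eqs. (2.15) to (2.17) can be sketched recursively in Python. This is a minimal illustration assuming a power-of-2 length; `fft_split_radix` and the helper `W` are names introduced for this example only.

```python
import cmath

def W(N, e):
    """Twiddle factor W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * e / N)

def fft_split_radix(x):
    """Recursive DIF split-radix 2/4 FFT; len(x) must be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    if N == 2:
        return [x[0] + x[1], x[0] - x[1]]
    h, q = N // 2, N // 4
    # Eq. (2.15): even outputs come from one N/2-point DFT (radix-2 step)
    evens = [x[n] + x[n + h] for n in range(h)]
    b1, b3 = [], []
    for n in range(q):
        u = x[n] - x[n + h]
        v = 1j * (x[n + q] - x[n + 3 * q])
        b1.append((u - v) * W(N, n))      # Eq. (2.16), feeds X(4k+1)
        b3.append((u + v) * W(N, 3 * n))  # Eq. (2.17), feeds X(4k+3)
    out = [0j] * N
    out[0::2] = fft_split_radix(evens)
    out[1::4] = fft_split_radix(b1)
    out[3::4] = fft_split_radix(b3)
    return out
```

The even half is treated with a radix-2 step and the odd half with a radix-4 step, which is why the algorithm works for any power-of-2 length while still saving multiplications.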
2.4 The FFT Architecture
The FFT architecture is the way to implement the signal flow graph of the FFT algorithms. In this section, we will introduce the FFT architectures which are common for VLSI implementation. There are two popular architectures to implement the FFT algorithms for real time applications. They are pipeline-based architecture and memory-based architecture.
2.4.1 Pipeline-Based Architecture
The pipeline-based architecture is highly regular and can be easily scaled and parameterized in implementation [2.6, 2.8-2.15]. Compared to the memory-based architecture, it is characterized by a high throughput rate at moderate hardware complexity. An efficient way to obtain the pipeline architecture is to project the signal flow graph of the FFT algorithm onto the hardware data flow. Two common pipeline-based architectures are introduced next: the single-path delay feedback (SDF) and the multiple-path delay commutator (MDC) architectures.
2.4.1.1 Single-Path Delay Feedback (SDF) Architecture
The block diagram of the SDF architecture for the radix-2 DIF algorithm is shown in Figure 2.8. For the FFT length N = 16, there are 4 butterfly stages in the SFG. As we can see from the figure, a butterfly element is dedicated to each stage. The feedback registers store the output data of the butterfly. The butterfly element performs the butterfly operation when the required data are ready at the input ports; otherwise it performs the swap operation to store data into the feedback registers. The memory requirement of the SDF architecture is minimal. However, the utilization rate of the butterfly and multiplier units is only 50%.
Figure 2.8 Radix-2 DIF SDF architecture for N = 16
Similar to the radix-2 SDF architecture, the SDF architecture for the radix-4 algorithm can also be derived from the SFG. Figure 2.9 shows the case when the SDF architecture is applied to the radix-4 algorithm. Compared to the radix-2 architecture, the radix-4 architecture can implement the FFT with fewer computation stages. However, the butterfly unit will be more complicated.
Figure 2.9 Radix-4 DIF SDF architecture for N = 64
2.4.1.2 Multiple-Path Delay Commutator (MDC) Architecture
The MDC approach is even more straightforward than the SDF approach. Like the butterflies in the SFG, the architecture uses parallel data paths. Instead of delay feedback registers, delay elements are placed on the data paths. Between computation stages, a commutator switches the data to the correct positions. Figure 2.10 shows the block diagram of the radix-2 DIF MDC architecture. The throughput rate of the radix-2 MDC architecture is twice that of the radix-2 SDF architecture due to the parallel data paths. However, the memory requirement is larger than that of the SDF architecture, and extra commutators are required.
Figure 2.10 Radix-2 MDC architecture for N = 16
The radix-4 MDC architecture follows the same principle as the radix-2 one. Figure 2.11 shows the block diagram of the radix-4 MDC architecture for N = 64. In the radix-4 MDC architecture, a higher throughput rate can be achieved due to the four parallel data paths. The price is a larger memory requirement and higher hardware complexity.
Figure 2.11 Radix-4 DIF MDC architecture for N = 64
2.4.2 Memory-Based Architecture
The memory-based architecture is considered the most area-efficient way of implementing the FFT [2.2, 2.4-2.5, 2.16-2.19]. It usually consists of one computation block, a coefficient memory for the twiddle factors, and a memory to store I/O and internal data. The feature of such an architecture is that it usually uses only one or a few butterfly elements as the computation block. Since the butterflies and multipliers usually take up most of the area and consume much of the power in the pipeline-based architecture, the memory-based architecture reduces this hardware cost and thus lowers the power consumption. Figure 2.12 shows the generic block diagram of the memory-based architecture. The hardware complexity of the memory-based architecture is concentrated in the control block. Since only one or a few butterfly elements are available, the execution proceeds stage by stage as in the SFG. The memory-based architecture usually uses one memory module to store the intermediate data. Since the data ordering differs from stage to stage, the order of the data stored in the memory must be taken care of after every stage of operation.
Figure 2.12 Block diagram of the memory-based architecture
Although fewer butterfly units are available, the number of butterflies in the SFG remains the same. Therefore, the memory-based architecture results in a low throughput rate. In a radix-r algorithm, an N-point FFT requires $\frac{N}{r} \times \log_r N$ radix-r butterfly operations. Assume that the memory access bandwidth is K and the time for a butterfly operation is t. Then, the time to compute an N-point FFT can be expressed as:

$$\text{Time for one FFT} = \frac{N}{r} \times \log_r N \times \frac{r}{K} \times t = \frac{N \log_r N}{K} \times t \tag{2.18}$$

From the above equation, it can be seen that the time for one FFT is inversely proportional to K and decreases as r grows, since $\log_r N = \log_2 N / \log_2 r$. Therefore, using high radix algorithms is an efficient way to raise the throughput rate of a memory-based architecture.
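To make the effect of the radix concrete, the sketch below counts the radix-r butterfly operations of an N-point FFT, (N/r) × log_r N, and converts the count into a time estimate. The time model is an assumption: it reads Eq. (2.18) as charging r/K memory accesses of duration t to each butterfly. The function names are illustrative only.

```python
def log_radix(n_points, radix):
    """Return the integer log_r(N); N must be an exact power of r."""
    stages, size = 0, 1
    while size < n_points:
        size *= radix
        stages += 1
    if size != n_points:
        raise ValueError("N must be a power of the radix")
    return stages


def butterfly_count(n_points, radix):
    """Number of radix-r butterflies in an N-point FFT: (N/r) * log_r(N)."""
    return (n_points // radix) * log_radix(n_points, radix)


def fft_time(n_points, radix, bandwidth_k, t_butterfly):
    """Assumed time model: every butterfly costs (r/K) * t."""
    return butterfly_count(n_points, radix) * radix / bandwidth_k * t_butterfly
```

For N = 4096 the butterfly count drops from 24576 (radix-2) to 6144 (radix-4) and 2048 (radix-8), which is the reduction that motivates high-radix butterflies in a memory-based design.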
2.4.3 Reconfigurable Architecture
An FFT processor that can perform FFTs of various lengths is usually preferred. For the pipeline-based architecture, the reconfiguration can be easily achieved. Recall the principle of the FFT algorithms: the N-point DFT is broken into smaller DFTs recursively. Therefore, after a radix-r butterfly stage, the N-point FFT is decomposed into r FFTs of N/r points each. This relation can be observed from the SFG as previously shown in Figure 2.1 or Figure 2.5. Since the pipeline-based architecture is the projection of the SFG, the back-end stages actually calculate FFTs of smaller sizes. Therefore, the pipeline-based architecture can be reconfigured to calculate an FFT of smaller size by feeding the input data directly into later stages [2.3, 2.20].
However, such reconfiguration requires many multiplexers when higher flexibility in the FFT size is demanded. The multiplexers added between stages not only increase the area and power overhead, but also degrade the speed of the overall architecture. Figure 2.13 shows an example of the reconfigurable pipeline-based architecture. The 1024-point FFT architecture is divided into five stages (1024 = 4^5). The architecture can also be reconfigured as a 16, 64, or 256-point FFT. Reconfiguration is achieved by inserting three multiplexers, namely MUX I, MUX II, and MUX III. The FFT processor can act as a 256-point processor by feeding the input data directly into stage 2 and clocking down the first stage. In the same way, reconfiguration to a 64-point or 16-point FFT is achieved by feeding the input data directly into stage 3 or stage 4, respectively.
Figure 2.13 Architecture of 1024-point radix-4 reconfigurable pipelined FFT processor
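The stage-bypass rule of this example can be written down in a few lines. The sketch below is a hypothetical helper (`entry_stage` is not a name from the thesis) that returns the 1-based stage into which the input should be fed for a five-stage radix-4 pipeline; it models only the selection logic, not the multiplexers or clock control.

```python
def entry_stage(n_points, total_stages=5, radix=4):
    """Return the 1-based pipeline stage that receives the input data.

    An N-point FFT with N = radix**s needs the last s stages, so the
    preceding (total_stages - s) stages are bypassed.
    """
    stages_needed, size = 0, 1
    while size < n_points:
        size *= radix
        stages_needed += 1
    if size != n_points or not 1 <= stages_needed <= total_stages:
        raise ValueError("unsupported FFT size for this pipeline")
    return total_stages - stages_needed + 1
```

With the default arguments, sizes 1024, 256, 64, and 16 map to stages 1, 2, 3, and 4, in agreement with the description above.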
Alternatively, the memory-based architecture can also be modified into a reconfigurable architecture [2.21-2.22]. Unlike the pipeline-based architecture, not much hardware needs to be added since there is only one butterfly computation block. Reconfigurability is achieved by adding flexibility to the address generation block, the coefficient memory block, and the data memory block. The difficulty lies in the generation of the control signals and the data ordering in the memory.
2.5 Conclusions
In this chapter, we have reviewed the generic FFT algorithms and architectures. The fixed-radix algorithms are popular in VLSI implementation due to the regularity of their SFGs. However, while high-radix algorithms have lower computational complexity, they also limit the flexibility in FFT size. The mixed-radix algorithms are thus more suitable for decomposing various FFT sizes. The drawback is that their twiddle-factor multiplications are more irregular than those of the fixed-radix algorithms.
At the architecture level, the memory-based architecture, which uses only one or a few computation blocks, is considered the most area-efficient architecture. However, its low throughput rate makes it unsuitable for high-speed applications. The pipeline-based architecture is easy to scale and parameterize in hardware design. Although it is also easy to reconfigure for different FFT sizes, the data path may grow too long if higher flexibility is required.
Chapter 3
Algorithm of Reconfigurable Mixed-Radix FFT
3.1 Introduction
Our purpose is to design a reconfigurable FFT processor that can be dynamically configured to perform FFTs of lengths from 16 points to 4096 points. Among the fixed-radix algorithms, only the radix-2 FFT algorithm covers every size in this range. However, the radix-2 algorithm results in a large number of calculation cycles and a low throughput rate. Higher-radix algorithms are preferred for our high-throughput purpose, but they limit the flexibility of the FFT size. Therefore, the mixed-radix algorithm is adopted in our design to keep the architecture flexible while using a high-radix algorithm. Also, the algorithm should have certain common properties for decomposing FFTs of different sizes.
In this chapter, we derive a reconfigurable mixed-radix algorithm. We find regularities in the data ordering and the twiddle factors for FFTs of different sizes. These regularities facilitate the construction of the hardware architecture. In addition, a special block execution order for the RMR FFT is introduced in order to adopt the block-floating-point method.
3.2 Reconfigurable Mixed-Radix Algorithm
The Discrete Fourier Transform (DFT) of a complex data sequence x[n] of length N is defined as

$$X(k) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{3.1}$$

where the coefficient $W_N^{nk}$ is defined as $W_N^{nk} = e^{-j 2\pi nk / N}$.

A direct implementation of this equation requires large hardware and thus is impractical. By using the FFT algorithm, the computational complexity can be reduced. Let

$$N = 2^v = r_1 \times r_2, \qquad \begin{cases} n = r_2 n_1 + n_2, & n_1 = 0, 1, \ldots, r_1 - 1, \quad n_2 = 0, 1, \ldots, r_2 - 1 \\ k = k_1 + r_1 k_2, & k_1 = 0, 1, \ldots, r_1 - 1, \quad k_2 = 0, 1, \ldots, r_2 - 1 \end{cases} \tag{3.2}$$
Combining (3.1) and (3.2), the N-point FFT can be formulated as

$$\begin{aligned} X(r_1 k_2 + k_1) &= \sum_{n_2=0}^{r_2-1} \sum_{n_1=0}^{r_1-1} x[r_2 n_1 + n_2]\, W_N^{(r_2 n_1 + n_2)(r_1 k_2 + k_1)} \\ &= \underbrace{\sum_{n_2=0}^{r_2-1} \Bigg\{ \underbrace{\Bigg[ \sum_{n_1=0}^{r_1-1} x[r_2 n_1 + n_2]\, W_{r_1}^{n_1 k_1} \Bigg]}_{r_1\text{-point DFT}} \times \underbrace{W_N^{n_2 k_1}}_{\text{Twiddle factor}} \Bigg\} W_{r_2}^{n_2 k_2}}_{r_2\text{-point DFT}} \end{aligned} \tag{3.3}$$

From the above equations, we can divide any N-point (power of 2) FFT into a combination of r1-point and r2-point FFTs. The r2-point DFT in (3.3) can be further decomposed in the same manner. Therefore, when N is not prime, such that N = r1×r2×r3×…×rm, it is possible to divide the N-point DFT into a combination of r1, r2, r3, …, rm-point DFTs.
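The two-factor decomposition can be checked numerically. The following Python sketch computes the DFT in the three steps named above (r1-point DFTs, twiddle-factor multiplication, r2-point DFTs) using the index maps n = r2·n1 + n2 and k = k1 + r1·k2, and compares the result with the direct definition. All function names are illustrative, and the inner DFTs are evaluated as plain sums rather than recursively.

```python
import cmath


def twiddle(base, exponent):
    """Return W_base^exponent = exp(-j*2*pi*exponent/base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    size = len(x)
    return [sum(x[n] * twiddle(size, n * k) for n in range(size))
            for k in range(size)]


def two_factor_dft(x, r1, r2):
    """Compute an N = r1*r2 point DFT via r1-point and r2-point DFTs."""
    size = r1 * r2
    if len(x) != size:
        raise ValueError("len(x) must equal r1 * r2")
    # Step 1: for every n2, an r1-point DFT over n1, then the twiddle
    # factor W_N^(n2*k1) applied to output k1.
    inner = [[sum(x[r2 * n1 + n2] * twiddle(r1, n1 * k1) for n1 in range(r1))
              * twiddle(size, n2 * k1)
              for k1 in range(r1)]
             for n2 in range(r2)]
    # Step 2: for every k1, an r2-point DFT over n2.
    out = [0j] * size
    for k1 in range(r1):
        for k2 in range(r2):
            out[r1 * k2 + k1] = sum(inner[n2][k1] * twiddle(r2, n2 * k2)
                                    for n2 in range(r2))
    return out
```

Calling `two_factor_dft(x, 8, 2)` on a 16-point sequence corresponds to a radix-8 stage followed by a radix-2 stage; applying the same split again to the r2-point DFTs gives the multi-stage decompositions.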
The proposed RMR FFT is organized as a four-stage pipeline architecture. The idea is that, if each butterfly unit can act as a radix-2, radix-4, or radix-8 butterfly, then the processor is capable of performing FFTs of different sizes, ranging from 2×2×2×2 = 16 points to 8×8×8×8 = 4096 points. That is, the N-point FFT is decomposed into a combination of r1-, r2-, r3-, and r4-point DFTs, where N = r1 × r2 × r3 × r4. In this way, one FFT size may have several combinations of radices. Since such duplications are unnecessary for the hardware design, a specific mixed-radix algorithm is assigned to each FFT mode.
The higher radix is chosen first. Based on the radix-8 algorithm, smaller FFT sizes are realized by bypassing preceding stages. For example, the 512-point FFT can be decomposed by the radix-8 algorithm into three stages, so the four-stage pipeline becomes unnecessary. In such cases, we bypass one of the four stages, as the conventional reconfigurable pipeline architecture does, instead of assigning an 8×8×4×2 algorithm or another four-stage decomposition. A radix smaller than 8 is arranged at the last stage. In this way, only the last stage has to be a reconfigurable butterfly stage, while the other stages are radix-8 under all modes. The resulting radix arrangement is shown in TABLE 3.1. As the table shows, four butterfly stages are needed only when the FFT is of size {1024, 2048, 4096}. Meanwhile, FFTs of size {128, 256, 512} need three butterfly stages and FFTs of size {16, 32, 64} need only two.
TABLE 3.1 Mixed-radix algorithms for different FFT sizes
FFT size    Radix of each active butterfly stage, in order
16          8, 2
32          8, 4
64          8, 8
128         8, 8, 2
256         8, 8, 4
512         8, 8, 8
1024        8, 8, 8, 2
2048        8, 8, 8, 4
4096        8, 8, 8, 8
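The assignment rule behind TABLE 3.1 (use radix-8 wherever possible and place the single smaller radix last) can be expressed compactly. The sketch below is an illustrative helper, `radix_plan`, which is not part of the thesis; it only reproduces the radix sequence and does not decide which physical pipeline stage is bypassed.

```python
def radix_plan(n_points):
    """Return the radix of each active butterfly stage for N = 16..4096."""
    v = n_points.bit_length() - 1  # N = 2**v for a power of two
    if n_points != 1 << v or not 4 <= v <= 12:
        raise ValueError("N must be a power of two between 16 and 4096")
    # Leading radix-8 stages; the last stage takes the remaining 1 to 3 bits.
    num_radix8 = (v - 1) // 3
    last_radix = 1 << (v - 3 * num_radix8)
    return [8] * num_radix8 + [last_radix]
```

For example, `radix_plan(128)` gives `[8, 8, 2]` and `radix_plan(4096)` gives `[8, 8, 8, 8]`, matching the table rows.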
The basic butterfly units in our design are thus the radix-2, radix-4, and radix-8 butterflies. Based on the decimation-in-frequency decomposition, the SFG of the 8-point DFT is shown in Fig. 3.1. Notice that there is no explicit multiplication operation in the realization of an 8-point DFT. The trivial multiplications by ±j, (1-j)/√2, and -(1+j)/√2 can be realized using only shift-and-add operations. Another observation from the SFG is that the 8-point DFT is a combination of two parallel 4-point DFTs if we neglect the first stage, and a combination of four parallel 2-point DFTs if the first two stages are neglected. Therefore, the radix-8 butterfly can serve as a radix-4 and a radix-2 butterfly as well. A side advantage is that the width of the data path can stay at 8 data words when going from a radix-8 stage to a lower-radix stage.
Figure 3.1 SFG of 8-point DIF FFT
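The reuse of the radix-8 butterfly as two radix-4 or four radix-2 butterflies can be illustrated with a small behavioral model of the three-stage SFG. In the sketch below, the parameter `skip` (a hypothetical name) is the number of leading radix-2 stages that are neglected. Floating-point complex arithmetic stands in for the shift-and-add hardware, so this models only the data flow, not the implementation.

```python
import cmath


def twiddle(base, exponent):
    """Return W_base^exponent = exp(-j*2*pi*exponent/base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def dft(x):
    """Direct DFT used as a reference."""
    size = len(x)
    return [sum(x[n] * twiddle(size, n * k) for n in range(size))
            for k in range(size)]


def radix8_butterfly(x, skip=0):
    """Run stages skip..2 of the 8-point radix-2 DIF SFG.

    skip=0: one 8-point DFT; skip=1: two 4-point DFTs (upper and lower
    halves); skip=2: four 2-point DFTs. Outputs are in bit-reversed order
    within each sub-DFT.
    """
    y = list(x)
    for stage in range(skip, 3):
        span = 4 >> stage  # distance between butterfly inputs: 4, 2, 1
        nxt = y[:]
        for start in range(0, 8, 2 * span):
            for i in range(span):
                a, b = y[start + i], y[start + i + span]
                nxt[start + i] = a + b
                # W_(2*span)^i written as a power of W_8
                nxt[start + i + span] = (a - b) * twiddle(8, i * (4 // span))
        y = nxt
    return y
```

With `skip=1`, positions 0 to 3 hold the 4-point DFT of x[0..3] and positions 4 to 7 hold the 4-point DFT of x[4..7], so all eight data paths stay busy in the lower-radix mode.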
Moreover, consider the DFT and IDFT equations:

DFT: $$X(k) = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \tag{3.4}$$

IDFT: $$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk} \tag{3.5}$$

Let

$$W_N^{nk} = W_r + jW_i, \qquad W_N^{-nk} = W_r - jW_i \tag{3.6}$$

The differences between the DFT and the IDFT are (a) the scaling constant 1/N and (b) the twiddle factors, which are complex conjugates of each other. Now, consider the complex multiplications by the two conjugate twiddle factors, respectively:
$$(X_r + jX_i)(W_r + jW_i) = (X_r W_r - X_i W_i) + j(X_i W_r + X_r W_i) \tag{3.7}$$

$$(X_r + jX_i)(W_r - jW_i) = (X_r W_r + X_i W_i) + j(X_i W_r - X_r W_i) \tag{3.8}$$

If we swap the real and imaginary parts of the input variable, that is, changing $(X_r + jX_i)$ to $(X_i + jX_r)$, Eq. (3.7) becomes:

$$(X_i + jX_r)(W_r + jW_i) = (X_i W_r - X_r W_i) + j(X_r W_r + X_i W_i) \tag{3.9}$$
Comparing Eq. (3.8) and Eq. (3.9), the real part of Eq. (3.8) equals the imaginary part of Eq. (3.9), and the imaginary part of Eq. (3.8) equals the real part of Eq. (3.9). This means that the two results are equal once the real and imaginary parts of one of them are swapped. Therefore, there is a way to transform Eq. (3.7) into Eq. (3.8). Since Eq. (3.7) represents the multiplications in the DFT and Eq. (3.8) represents the multiplications in the IDFT, we are able to use the DFT to calculate the IDFT.
In summary, the IDFT can be performed by first swapping the real and imaginary parts of the input data. Then, after the DFT computation, the real and imaginary parts of the output data are swapped. By scaling the output with the constant 1/N, the IDFT result is obtained. From the hardware point of view, we only have to add a swap unit at the input and output data ports of the FFT processor in order to use the same processor to calculate the IFFT.
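The swap procedure can be verified in software. The sketch below, with illustrative names `swap_re_im` and `idft_via_dft`, computes the IDFT using only a forward DFT plus the two swaps and the 1/N scaling; a direct O(N^2) DFT stands in for the FFT processor.

```python
import cmath


def dft(x):
    """Direct forward DFT standing in for the FFT processor."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / size)
                for n in range(size))
            for k in range(size)]


def swap_re_im(z):
    """Exchange the real and imaginary parts of one complex sample."""
    return complex(z.imag, z.real)


def idft_via_dft(spectrum):
    """IDFT computed as swap -> forward DFT -> swap -> scale by 1/N."""
    size = len(spectrum)
    transformed = dft([swap_re_im(v) for v in spectrum])
    return [swap_re_im(v) / size for v in transformed]
```

The reason this works is that swapping the parts of z equals j times the conjugate of z, so the two swaps together conjugate the effective twiddle factors, which is exactly the difference between Eq. (3.7) and Eq. (3.8).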
3.3 Data Ordering and Twiddle Factors
As the proposed architecture can be reconfigured from a 16-point to a 4096-point FFT, the data ordering differs from mode to mode due to the dedicated mixed-radix algorithm. To make the architecture realizable, there must be rules that apply to all modes. The dedicated mixed-radix algorithms for the different modes are listed in TABLE 3.1. The approach we use here is first to decompose the N-point FFT by the radix-2 decimation-in-frequency algorithm. As mentioned in the previous section, a radix-8 stage can be decomposed into a combination of radix-2 stages or radix-4 stages. In other words, we can combine two radix-2 stages into one radix-4 stage and three radix-2 stages into one radix-8 stage, as shown in Figure 3.1. Based on the signal flow graph of the radix-2 DIF decomposition, we recompose the N-point FFT into the mixed-radix algorithms listed in TABLE 3.1. Since all the FFTs are decomposed by the radix-2 DIF algorithm, the data ordering follows the same rules. The order of the output data of the radix-2 flow graph is referred to as bit-reversed order. Figure 3.2 shows the example of the radix-2 decomposition of the 128-point FFT. As listed in TABLE 3.1, the mixed-radix algorithm assigned to the 128-point FFT is radix-8/8/2. Figure 3.3 shows the recomposed SFG, which is the desired SFG for our architecture. Notice that the black nodes in Figure 3.2 correspond to the nodes in Figure 3.3.
Figure 3.2 SFG of 128-point FFT in radix-2 DIF algorithm
[Figure 3.3: recomposed SFG of the 128-point FFT in the radix-8/8/2 algorithm]
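Because every mode is derived from the radix-2 DIF flow graph, the output order is the bit-reversed order for all sizes. A minimal sketch of that ordering rule, with illustrative function names, is:

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


def output_order(n_points):
    """Frequency index delivered at each output position of the SFG."""
    bits = n_points.bit_length() - 1
    if n_points != 1 << bits:
        raise ValueError("N must be a power of two")
    return [bit_reverse(position, bits) for position in range(n_points)]
```

For N = 128 the list starts with 0, 64, 32, 96, 16, 80, 48, 112, which is the output labelling of the 128-point SFG.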
After determining the data ordering for every butterfly stage, the next question is how the twiddle factors are arranged. Clearly, the twiddle factors cannot be mapped directly from the radix-2 SFG to our mixed-radix SFG. However, we have found a relation between them that is simple enough to derive a common rule.
We start with the example of the 16-point FFT. As mentioned before, we first draw the SFG using the radix-2 algorithm, as shown in Figure 3.4(a). According to TABLE 3.1, the 16-point FFT is to be recomposed into radix-8/2 butterfly stages, so the first three radix-2 stages should be combined into one radix-8 stage. Since there are 16 points, there are two radix-8 butterflies, and we extract them as in Figure 3.5. The first butterfly is already a radix-8 butterfly as shown in Figure 3.1. For the second butterfly, we must transform the internal twiddle factors in order to map it to Figure 3.1. The procedure is shown in Figure 3.6.
Figure 3.4 SFG of 16-point DFT in (a) radix-2 algorithm, and (b) radix-8/2 algorithm
Figure 3.5 Extraction of radix-8 butterfly
The transformation starts with the first stage. In order to map the twiddle factors in the first column to those of the radix-8 butterfly, the twiddle factors are multiplied by $W_{16}^{-1}$, as shown in (a). Since the radix-2 butterfly performs only addition/subtraction operations, the output must be multiplied by $W_{16}^{1}$ in order to compensate for the multiplication at the input. The procedure goes on through (b) and (c), and we obtain the resulting SFG in (d).
[Figure 3.6: transformation of the twiddle factors of the second radix-8 butterfly, panels (a) to (d)]
We can then conclude the formula for the twiddle factor scheduling. For the N-point radix-8 decomposition, there are N/8 butterflies. The twiddle factors for the mth radix-8 butterfly are {$W_N^{m \cdot 0}$, $W_N^{m \cdot 4}$, $W_N^{m \cdot 2}$, $W_N^{m \cdot 6}$, $W_N^{m \cdot 1}$, $W_N^{m \cdot 5}$, $W_N^{m \cdot 3}$, $W_N^{m \cdot 7}$}, where m is an integer from 0 to (N/8)-1. The relation is shown in Figure 3.7. We will later find that this relation greatly simplifies the control of the multiplier stage. Therefore, for the N-point radix-8 decomposition, the complex multiplications required use N-based twiddle factors. As our reconfigurable FFT performs at most four butterfly-stage operations, three multiplier stages are required. For these three multiplier stages, the possible N-based twiddle factors required are shown in TABLE 3.2.
Figure 3.7 The twiddle factors for the mth radix-8 butterfly for N-point decomposition
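The twiddle-factor rule for the mth radix-8 butterfly lends itself to a direct table generator. The sketch below lists the exponents of W_N on the eight outputs in their bit-reversed order (0, 4, 2, 6, 1, 5, 3, 7). The names `OUTPUT_ORDER`, `twiddle_exponents`, and `twiddle_values` are illustrative, and a hardware design would more likely read these values from a coefficient ROM than compute them.

```python
import cmath

# Output index k1 of the radix-8 butterfly at each of its eight ports.
OUTPUT_ORDER = [0, 4, 2, 6, 1, 5, 3, 7]


def twiddle_exponents(m, n_points):
    """Exponents e such that port i of butterfly m is scaled by W_N^e."""
    if not 0 <= m < n_points // 8:
        raise ValueError("m must lie in 0 .. N/8 - 1")
    return [(m * k1) % n_points for k1 in OUTPUT_ORDER]


def twiddle_values(m, n_points):
    """Complex twiddle factors W_N^(m*k1) for butterfly m."""
    return [cmath.exp(-2j * cmath.pi * e / n_points)
            for e in twiddle_exponents(m, n_points)]
```

For N = 16 the butterfly m = 0 needs only unity factors, while m = 1 needs W_16 raised to 0, 4, 2, 6, 1, 5, 3, 7, which agrees with the 16-point example worked out above.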
TABLE 3.2 N-based twiddle factors required for each multiplier stage under different FFT size
FFT size    Twiddle-factor base of each active multiplier stage, in order
16          16
32          32
64          64
128         128, 16
256         256, 32
512         512, 64
1024        1024, 128, 16
2048        2048, 256, 32
4096        4096, 512, 64
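The entries of TABLE 3.2 follow from the decomposition itself: each radix-8 stage reduces the remaining transform size by a factor of 8, and a multiplier stage is needed as long as more than one butterfly stage remains. A hedged sketch of this rule, using the illustrative name `twiddle_bases`, is:

```python
def twiddle_bases(n_points):
    """Twiddle-factor base used by each active multiplier stage, in order."""
    bits = n_points.bit_length() - 1
    if n_points != 1 << bits or not 16 <= n_points <= 4096:
        raise ValueError("N must be a power of two between 16 and 4096")
    bases = []
    remaining = n_points
    # A remaining size of 8 or less is handled by the final butterfly
    # stage alone, so no multiplier follows it.
    while remaining > 8:
        bases.append(remaining)
        remaining //= 8
    return bases
```

For example, `twiddle_bases(1024)` returns `[1024, 128, 16]`, so a 1024-point transform uses all three multiplier stages, whereas `twiddle_bases(64)` returns `[64]`.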