Chapter 3. Proposed architecture
3.1. Design issue
This section discusses several hardware design considerations. Table 3.1 lists the control code for each FFT size. To keep the control circuit simple, the control code uses four bits instead of three. The most significant bit (MSB) of the control code is the Radix-2_Flag. When the Radix-2_Flag is high, the radix-2/4 butterfly operator executes a radix-2 butterfly computation. The last three bits of the control code are related to the initial stage, as shown in eq. 3.1. According to Fig. 3.2, when N is equal to 128, 512, 2048, or 8192, the butterfly operator needs an extra radix-2 computation step first.
Table 3.1 Different size of FFT and the corresponding control code
(3.1)
Figure 3.2 Control of signal flow
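The rule behind the Radix-2_Flag can be sketched in Python as follows. This is only an illustration, assuming that the flag is set exactly when log2(N) is odd, which matches the sizes 128, 512, 2048, and 8192 named in the text. The function names are hypothetical, and the lower three bits of the control code (eq. 3.1) are not modeled here.

```python
def radix2_flag(n: int) -> int:
    """Return 1 if an n-point FFT needs one radix-2 stage, else 0.

    Assumption: n is a power of two. When log2(n) is odd, the transform
    cannot be built from radix-4 stages alone, so one radix-2 stage is needed.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    log2_n = n.bit_length() - 1
    return log2_n % 2


def radix4_stage_count(n: int) -> int:
    """Number of radix-4 stages left after the optional radix-2 stage."""
    log2_n = n.bit_length() - 1
    return (log2_n - radix2_flag(n)) // 2


if __name__ == "__main__":
    for size in (64, 128, 256, 512, 1024, 2048, 4096, 8192):
        print(size, radix2_flag(size), radix4_stage_count(size))
```

Under this assumption, every supported size decomposes into at most one radix-2 stage followed by radix-4 stages only.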
3.2. Radix-2/4 Butterfly
The proposed radix-2/4 butterfly operator is shown in Fig. 3.3. Radix-4 computation is the common mode. Radix-2 mode is enabled when N is equal to 128, 512, 2048, or 8192. The proposed architecture can execute two radix-2 computations at a time, so its throughput in radix-2 mode is double that of the conventional architecture.
Figure 3.3 Radix-2/4 butterfly operator
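The two operating modes can be illustrated with a small behavioral model. This is a sketch rather than the circuit of Fig. 3.3: twiddle factors are omitted, the function names are hypothetical, and the pairing of operands in radix-2 mode, (a, b) and (c, d), is an assumption.

```python
def radix4_butterfly(a, b, c, d):
    """4-point DFT of (a, b, c, d) without twiddle factors."""
    return (
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    )


def dual_radix2_butterfly(a, b, c, d):
    """Two independent 2-point DFTs on the pairs (a, b) and (c, d)."""
    return (a + b, a - b, c + d, c - d)


def butterfly(a, b, c, d, radix2_flag):
    """Select the mode from the Radix-2_Flag, as described in the text."""
    if radix2_flag:
        return dual_radix2_butterfly(a, b, c, d)
    return radix4_butterfly(a, b, c, d)
```

In both modes the operator consumes four inputs and produces four outputs per call, which is why two radix-2 butterflies fit into one operation.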
3.3. Multiplier module
Figure 3.4 The proposed multiplier module
Fig. 3.4 shows the proposed multiplier module. It consists of three complex multipliers and several multiplexers. As discussed in Section 2.2.1, it has no switching activity at stage 010, as shown in Fig. 3.5. Fig. 3.5(a), (b), (c), and (d) show the execution state for data sequences 1, 2, 3, and 4 at stage 010, respectively.
Figure 3.5 The behavior of the multiplier module at stage 010
3.4. Phase Compensator
Figure 3.6 Phase compensator
Fig. 3.6 shows the architecture of the phase compensator. Its purpose is to recover the phase of the outputs of the multiplier module, since modified coefficients are fed into the multiplier module. From eq. 2.3, we assume the modified coefficient,
$W_N^{C'}$, is fed into the multiplier module. We derive the output of the phase compensator as follows. Assume

\[
\begin{aligned}
input &= X \cdot W_N^{C'} \\
out_1 &= input \cdot 1 = X \cdot W_N^{C'} \\
out_2 &= input \cdot j = X \cdot W_N^{C'+3N/4} \\
out_3 &= input \cdot (-1) = X \cdot W_N^{C'+N/2} \\
out_4 &= input \cdot (-j) = X \cdot W_N^{C'+N/4} \\
output &= X \cdot \{ W_N^{C'},\ W_N^{C'+N/4},\ W_N^{C'+N/2},\ W_N^{C'+3N/4} \} \\
       &= X \cdot W_N^{C}
\end{aligned}
\]
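The derivation can be checked numerically with a short script. It is a sketch under the usual convention $W_N^k = e^{-j 2\pi k / N}$, which is assumed here rather than quoted from eq. 2.3. The helper names and the test value of X are hypothetical.

```python
import cmath


def twiddle(n: int, k: int) -> complex:
    """W_N^k = exp(-j*2*pi*k/N) (assumed convention)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def phase_compensate(value: complex):
    """Return value multiplied by 1, j, -1 and -j (out1 to out4)."""
    return (value, value * 1j, -value, value * -1j)


def check(n: int, c_mod: int, x: complex = 0.5 - 0.25j) -> bool:
    """Compare out1..out4 with X * W_N^(C' + offset) for each offset."""
    outs = phase_compensate(x * twiddle(n, c_mod))
    offsets = (0, 3 * n // 4, n // 2, n // 4)
    return all(
        abs(out - x * twiddle(n, c_mod + off)) < 1e-12
        for out, off in zip(outs, offsets)
    )
```

The compensator therefore needs only sign changes and real/imaginary swaps, with no additional multipliers, to reach the four rotated coefficients.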
3.5. Memory Address Assignment
We adopt the in-place memory addressing scheme for the radix-4 FFT algorithm [8]. To support concurrent read and write operations, the memory is partitioned into four banks. Table 3.2 shows the address assignment for a 16-point FFT.
Table 3.2 Address assignment for a 16-point FFT
According to this table, four inputs can be read from different banks and four outputs can be written to different banks for every butterfly computation of the 16-point FFT.
SEL selects the memory bank. In our design, it is implemented with several adders, as shown in eq. 3.2.
(3.2)
This memory assignment strategy can be extended to longer FFTs. Table 3.3 shows the address assignment for a 64-point FFT. Table 3.2 appears as a sub-block of Table 3.3, which means that the memory assignment strategy is suitable for a reconfigurable design.
Table 3.3 Address assignment for a 64-point FFT
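The conflict-free property described in this section can be checked with a short script. Because eq. 3.2 and the table entries are not reproduced here, the bank rule below is an assumption: it takes SEL as the sum of the base-4 digits of the data index, modulo 4, which is a common adder-based rule for conflict-free radix-4 addressing and may differ from the exact mapping in Tables 3.2 and 3.3. All function names are hypothetical.

```python
def base4_digits(index: int, num_digits: int):
    """Base-4 digits of index, least significant digit first."""
    return [(index >> (2 * i)) & 3 for i in range(num_digits)]


def bank_select(index: int, num_digits: int) -> int:
    """Assumed SEL rule: sum of the base-4 digits, modulo 4."""
    return sum(base4_digits(index, num_digits)) % 4


def is_conflict_free(n: int) -> bool:
    """Check every radix-4 butterfly of an n-point in-place FFT (n = 4**m)."""
    num_digits = (n.bit_length() - 1) // 2
    for stage in range(num_digits):
        stride = 4 ** stage
        for base in range(n):
            if (base >> (2 * stage)) & 3:
                continue  # only indices whose digit `stage` is zero start a butterfly
            operands = [base + k * stride for k in range(4)]
            banks = {bank_select(i, num_digits) for i in operands}
            if len(banks) != 4:
                return False
    return True
```

With this rule, the four operands of a butterfly differ in exactly one base-4 digit, so their digit sums are distinct modulo 4 and each operand falls in a different bank. The same argument holds for any N that is a power of four, which is consistent with the 16-point table being a sub-block of the 64-point table.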
Chapter 4. Simulation and Performance Analysis
In this chapter we discuss simulation and verification against an ideal model built in MATLAB, which provides a complete mathematical and simulation environment. The design flow is illustrated in Fig. 4.1. It is a waterfall-style flow, which works well for designs of up to about 100k gates.
After the RTL code is developed, we compare the ideal model with the RTL model to check that they have the same function. After functional verification there are two implementation paths: synthesis for ASIC and FPGA prototyping. FPGA prototyping is commonly used to verify a design because an FPGA exhibits the behavior of real hardware. After FPGA prototyping, we synthesize the RTL design to a gate-level netlist under reasonable design constraints. The synthesis report gives the timing and area. We also run gate-level simulation to double-check the function, and then run PrimePower on the waveform from the gate-level simulation to analyze power consumption.
Figure 4.1 Design and verification flow
4.1. Simulation
Figure 4.2 Simulation environment
Fig. 4.2 shows the RTL simulation environment, which determines the signal to quantization noise ratio (SNQR) between the ideal FFT and a fixed-point FFT model. The ideal FFT is built in MATLAB, and the practical FFT is our RTL design. The input data are 100 random patterns with a 16-bit word length for each FFT size. The definition of SNQR is
Table 4.1 shows the mean square error for each FFT size, and Fig. 4.3 is plotted from Table 4.1. The mean square error increases as the FFT size increases. The SNQR leads to the same conclusion, as shown in Fig. 4.4.
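For reference, the two metrics can be sketched in Python. The formulas are the commonly used ones and are an assumption here, since the defining equation does not appear in this text: SNQR is taken as 10*log10 of the ideal signal power over the error power, and the mean square error as the average squared magnitude of the difference. The function names and the `quantize` helper are hypothetical.

```python
import math


def mean_square_error(ideal, fixed):
    """Average of |ideal - fixed|^2 over all output samples."""
    return sum(abs(a - b) ** 2 for a, b in zip(ideal, fixed)) / len(ideal)


def snqr_db(ideal, fixed):
    """10*log10(signal power / quantization-noise power), in dB."""
    signal = sum(abs(a) ** 2 for a in ideal)
    noise = sum(abs(a - b) ** 2 for a, b in zip(ideal, fixed))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


def quantize(value: complex, frac_bits: int) -> complex:
    """Round real and imaginary parts to a grid of 2**-frac_bits."""
    scale = 1 << frac_bits
    return complex(round(value.real * scale) / scale,
                   round(value.imag * scale) / scale)
```

In such a setup, `ideal` would hold the floating-point FFT outputs and `fixed` the outputs of the fixed-point model for the same input pattern.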
Table 4.1 Mean square error for each size of FFT
Figure 4.3 Mean square error versus different size of FFT
Figure 4.4 SNQR of each size of FFT (horizontal axis: FFT size, 64 to 8192; vertical axis: SNQR in dB)
4.2. FPGA prototyping
Figure 4.5 XILINX VIRTEX-4 FPGA
Fig. 4.5 shows the FPGA used. It is convenient for verifying the design because it connects to a computer through USB. Patterns are fed to the FPGA by the computer, and the computation results are returned for display on the monitor. Fig. 4.6 shows the waveform of the design in the FPGA. The first four signals are inputs generated by MATLAB, and the rest are outputs delivered to the computer by the FPGA.
Figure 4.6 Waveform of the proposed design in FPGA
Figure 4.7 Verification flow of FPGA
Fig. 4.7 shows the FPGA verification flow. A sine wave sampled at N points is fed to the FPGA, and the output file is then exported from the FPGA to MATLAB. MATLAB converts the output file to a waveform so that it can be verified easily. The following figures show the verification for each FFT size according to Fig. 4.7. We feed a sine wave into the imaginary part and observe the output from the FPGA. For long FFTs such as 4096 and 8192 points, we decreased the magnitude of the sine wave to avoid overflow, so those figures show a little distortion. Fig. 4.16 shows the synthesis report of the proposed architecture from Xilinx ISE.
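The expected shape of these outputs can be reproduced with a floating-point reference model. The sketch below uses a direct DFT in place of the FPGA, and the values of N, the number of sine cycles, and the amplitude are assumptions chosen only for illustration.

```python
import cmath
import math


def dft(samples):
    """Direct O(N^2) DFT, used here only as a reference model."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, x in enumerate(samples))
        for k in range(n)
    ]


def sine_in_imaginary_part(n: int, cycles: int, amplitude: float):
    """N samples of j*amplitude*sin(2*pi*cycles*i/N); the real part is zero."""
    return [complex(0.0, amplitude * math.sin(2 * math.pi * cycles * i / n))
            for i in range(n)]


if __name__ == "__main__":
    n, cycles, amplitude = 64, 4, 0.01
    spectrum = dft(sine_in_imaginary_part(n, cycles, amplitude))
    peaks = [k for k, v in enumerate(spectrum) if abs(v) > 1e-6]
    print(peaks)
```

For such an input the ideal spectrum has two real-valued peaks of opposite sign, at bins `cycles` and `n - cycles`, each of magnitude `amplitude * n / 2`, while the imaginary part of the output is zero. Any small nonzero imaginary output in a fixed-point implementation would then come from quantization.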
Figure 4.8 Verification of 64-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.9 Verification of 128-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.10 Verification of 256-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.11 Verification of 512-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.12 Verification of 1024-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.13 Verification of 2048-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.14 Verification of 4096-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.15 Verification of 8192-point FFT (panels: input signal, real part of output, imaginary part of output)
Figure 4.16 The synthesis report of the proposed architecture from Xilinx ISE
4.3. Synthesis Reports and Power Analysis
In this section, we discuss the implementation of the proposed FFT design. The proposed design is synthesized to the UMC 0.18 um CMOS standard cell library with Synopsys Design Compiler. The gate count of the proposed architecture is shown in Table 4.2.
Table 4.2 Synthesis report of proposed architecture
Module                                 Gate count    Size
Proposed FFT (not including memory)    53306         x
RAM                                    4*483248      4*32*2048
Coefficient Rom1                       6201          32*1024
Coefficient Rom2                       10417         32*2048
Coefficient Rom3                       6502          32*1024
Table 4.3 shows the power consumption for each FFT size. The power report is obtained by feeding the waveform from gate-level simulation to PrimePower. The power measurement interval covers the time to write data into memory, the computation time, and the time to read data from memory to the output. According to Table 2.8, the reduction of switching activity decreases as the FFT size increases. Table 4.3 confirms the conclusion of Section 2.3. Fig. 4.17 is plotted from Table 4.3.
Table 4.3 Power consumption for each Size of FFT
Figure 4.17 Power consumption versus each size of FFT
4.4. Comparisons
The following tables show comparisons of gate count, power consumption, and latency.
Table 4.4 Comparisons of gate count and power consumption
Table 4.5 Comparisons of latency
Chapter 5. Conclusions and Future Work
In this thesis, we propose a low-power reconfigurable FFT/IFFT processor. The proposed memory-based FFT processor can be configured for sizes from 64 points to 8192 points. A low-power design with minimum switching activity is proposed. Section 4.3 shows that it is efficient for short FFTs. The maximum power consumption is 75.82 at a 1.8 V power supply. The gate count of the proposed architecture without memory is 53306 under Synopsys Design Compiler with the UMC 0.18 um library.
A low-power reconfigurable FFT is presented in this thesis. Minimizing the switching activity effectively reduces the dynamic power consumption. However, this technique restricts the flexibility of the FFT processor. In the future, this technique could be implemented on a digital signal processor.
Bibliography
[1] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 2nd Edition, New Jersey: Prentice-Hall, 1999.
[2] J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comp., vol. 19, pp. 297-301, April 1965.
[3] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, 1975.
[4] E. H. Wold and A. M. Despain, "Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementation," IEEE Trans. Comput., vol. 33, no. 5, pp. 414-426, May 1984.
[5] Jen-Ming Wu and Yang-Chun Fan, "Coefficient Ordering Based Pipelined FFT/IFFT with Minimum Switching Activity for Low Power OFDM Communication," IEEE Int'l Symposium on Consumer Electronics, St. Petersburg, Russia, 2006.
[6] "A low-power VLSI architecture for a shared-memory FFT processor with a mixed-radix algorithm and a simple memory control scheme," Proceedings of the 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), 2006.
[7] M. Hasan, T. Arslan, and J. S. Thompson, "A Novel Coefficient Ordering based Low Power Pipelined Radix-4 FFT Processor for Wireless LAN Applications," IEEE Transactions on Consumer Electronics, vol. 49, no. 1, Feb. 2003.
[8] L. G. Johnson, "Conflict free memory addressing for dedicated FFT hardware," IEEE Trans. Circuits Syst. II, vol. 39, pp. 312-316, May 1992.
[9] Xiaojin Li, Zongsheng Lai, and Jianmin Cui, "A Low Power and Small Area FFT Processor for OFDM Demodulator," IEEE Transactions on Consumer Electronics, vol. 53, no. 2, May 2007.
[10] Chin-Long Wey, Wei-Chien Tang, and Shin-Yo Lin, "Efficient Memory-Based FFT Architectures for Digital Video Broadcasting (DVB-T/H)," VLSI Design, Automation and Test, 2007. VLSI-DAT 2007.
[11] Guihua Liu and Quanyuan Feng, "ASIC Design of Low-power Reconfigurable FFT Processor," ASIC, 2007. ASICON '07.
Vita
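Name: 黃謙若    Gender: Male
Place of birth: Taipei City
Date of birth: August 20, 1984
Address: 3F, No. 99, Fuxing 4th Rd., Beitou Dist., Taipei City
Education: M.S. program, Institute of Electronics, National Chiao Tung University, 2006/09~2008/06
Department of Electrical Engineering, National Chung Hsing University, 2002/09~2006/06
The Affiliated Senior High School of National Taiwan Normal University, 1999/09~2002/06
Thesis title: A Low Power Reconfigurable FFT Processor with Minimum Switching Activity
(Chinese title: 可重組態之低功率快速傅立葉轉換處理器)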