Simulation Results - CORDIC-based Processing Element

Chapter 4 Processing Elements of FFT Processor

4.2 CORDIC-based Processing Element

4.2.5 Simulation Results

Based on the structures shown in Fig. 4.3, we performed fixed-point hardware simulations in Matlab. Exhaustive simulations were conducted for all rotation angles in the range 0˚ to 45˚. The simulation results are shown in Fig. 4.4, and the detailed results for 8-bit, 12-bit and 16-bit accuracy (including a 1-bit integer part) are given in Table 4.6.

Fig. 4.4 Fixed-point simulation results of the new CORDIC algorithm

Table 4.6 Detailed simulation results with 8-bit, 12-bit and 16-bit output accuracy

Output accuracy                 8 bits    12 bits   16 bits

Average iteration number
  Angle decomposition           1.835     2.727     3.644
  Scale factor compensation     1.786     3.092     4.153
  Overall iteration             2.437     3.482     4.424

Worst-case iteration number
  Angle decomposition           3         4         5
  Scale factor compensation     4         5         6
  Overall iteration             4         5         6

Let Tr denote the iteration number of the x-y rotations (angle decomposition) and Ts denote the iteration number of the scale factor compensation. In a general CORDIC algorithm, the compensation operations for the variable scale factor are performed only after all the x-y rotations have been completed, so the overall number of iterations is Tr + Ts. Our new CORDIC algorithm, however, not only uses the leading-one bit detection scheme to speed up convergence but also incorporates the on-line scale factor compensation scheme. Therefore, the overall number of iterations of our algorithm is the maximum of Tr and Ts, i.e., overall iteration number = max(Tr, Ts).
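The difference between the two iteration counts can be illustrated with a small sketch. The function names are hypothetical, and the (Tr, Ts) pairs are the worst-case values of Table 4.6; pairing the two worst cases is an assumption that gives an upper bound for the sequential scheme, since they need not occur for the same angle.

```python
def sequential_iterations(tr: int, ts: int) -> int:
    """General CORDIC: compensation starts after all x-y rotations, so Tr + Ts."""
    return tr + ts


def overlapped_iterations(tr: int, ts: int) -> int:
    """On-line compensation runs alongside the rotations, so max(Tr, Ts)."""
    return max(tr, ts)


# Worst-case (Tr, Ts) pairs taken from Table 4.6, keyed by output accuracy in bits.
# Assumption: both worst cases are paired, which bounds the sequential count from above.
WORST_CASE = {8: (3, 4), 12: (4, 5), 16: (5, 6)}

if __name__ == "__main__":
    for bits, (tr, ts) in WORST_CASE.items():
        print(bits, sequential_iterations(tr, ts), overlapped_iterations(tr, ts))
```

For the 12-bit case, for example, the overlapped count is 5 iterations, against an upper bound of 9 when compensation is deferred.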

The occurrence versus iteration number for the 12-bit simulation is shown in Fig. 4.5, and the occurrence percentage of the dominant factor (Tr or Ts) versus iteration number at 12-bit precision is shown in Table 4.7.

Fig. 4.5 The total occurrence percentage versus iteration number, 12-bit precision

Table 4.7 Occurrence percentage of the dominant factor between Tr or Ts, 12-bit precision

Overall iteration = max(Tr, Ts)   1        2        3        4        5

max(Tr, Ts) = Tr                  73.33%   84.44%   31.24%   0.88%    (no Tr = 5 case)
max(Tr, Ts) = Ts                  0%       0.74%    27.11%   96.35%   100%
Tr = Ts                           26.67%   14.81%   41.65%   2.78%    (no Tr = 5 case)

To further reduce the iteration number of our CORDIC algorithm, we consider the internal datapath word length. The simulation results in Table 4.8 show that the convergence rate improves as the word length of the residue angle is increased. However, the angle table that stores the values of tan^-1(2^-m) becomes larger when the internal word length is increased, and the improvement becomes less significant with each additional bit. From the simulations, the optimal internal word length is roughly 2 bits more than the target output word length.
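The growth of the angle table with the word length can be illustrated with a minimal sketch. It assumes that angles are stored in radians as unsigned fixed-point values with `frac_bits` fractional bits and that an entry is dropped once it rounds to zero; the function name and these conventions are illustrative assumptions, not the table format of the actual design.

```python
import math


def angle_table(frac_bits: int) -> list[int]:
    """Quantized tan^-1(2^-m) for m = 0, 1, 2, ... until an entry rounds to zero.

    Assumption: angles are in radians, scaled by 2**frac_bits and rounded.
    """
    table = []
    m = 0
    while True:
        entry = round(math.atan(2.0 ** -m) * (1 << frac_bits))
        if entry == 0:
            break  # further entries carry no information at this word length
        table.append(entry)
        m += 1
    return table


if __name__ == "__main__":
    for bits in (7, 9, 11, 13):
        print(bits, len(angle_table(bits)))
```

Under these assumptions every additional fractional bit adds a non-zero table entry, which is the size penalty weighed against the faster convergence.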

Table 4.8 Simulation results for different residue angle word lengths

(a) 8-bit input and output accuracy

Internal word length of residue angle   8 bits    9 bits    10 bits

Average iteration number
  Angle decomposition                   1.835     1.767     1.728
  Scale factor compensation             1.786     1.786     1.786
  Overall iteration                     2.437     2.398     2.388

Worst-case iteration number (all word lengths)
  Angle decomposition                   3
  Scale factor compensation             4
  Overall iteration                     4

(b) 12-bit input and output accuracy

Internal word length of residue angle   12 bits   13 bits   14 bits

Average iteration number
  Angle decomposition                   2.727     2.681     2.62
  Scale factor compensation             3.092     3.095     3.093
  Overall iteration                     3.482     3.469     3.45

Worst-case iteration number (all word lengths)
  Angle decomposition                   4
  Scale factor compensation             5
  Overall iteration                     5

Similarly, we can analyze the required data word length of equation (4.16). According to Fig. 4.6, the SNR does not improve significantly once the internal word length exceeds the input precision by more than three bits. For comparison, if complex multipliers are used to perform the vector rotation, the SNR values for 8-bit and 12-bit accuracy are about 44 dB and 68 dB, respectively. Therefore, the optimal word length is 3 bits more than the target output accuracy.
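An SNR figure of this kind can be estimated along the following lines for the complex-multiplier reference. This is only a sketch under stated assumptions: inputs uniformly distributed in [-0.7, 0.7], angles uniform in [0, π/4], rounding to `frac_bits` fractional bits at the input, coefficient, and output, and a floating-point rotation as the reference. None of these choices is taken from the original simulation setup, so the numbers it produces are not expected to reproduce the quoted values exactly.

```python
import math
import random


def quantize(value: float, frac_bits: int) -> float:
    """Round a value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def rotation_snr_db(frac_bits: int, trials: int = 5000, seed: int = 1) -> float:
    """SNR of a fixed-point multiplier-based rotation against a float reference."""
    rng = random.Random(seed)
    signal = 0.0
    noise = 0.0
    for _ in range(trials):
        x = rng.uniform(-0.7, 0.7)
        y = rng.uniform(-0.7, 0.7)
        theta = rng.uniform(0.0, math.pi / 4)
        c, s = math.cos(theta), math.sin(theta)
        # Floating-point reference rotation.
        xr, yr = x * c - y * s, x * s + y * c
        # Fixed-point rotation: quantized data, coefficients, and results.
        xq, yq = quantize(x, frac_bits), quantize(y, frac_bits)
        cq, sq = quantize(c, frac_bits), quantize(s, frac_bits)
        xf = quantize(xq * cq - yq * sq, frac_bits)
        yf = quantize(xq * sq + yq * cq, frac_bits)
        signal += xr * xr + yr * yr
        noise += (xr - xf) ** 2 + (yr - yf) ** 2
    return 10.0 * math.log10(signal / noise)


if __name__ == "__main__":
    for bits in (7, 11):
        print(bits, round(rotation_snr_db(bits), 1))
```

Each additional bit of word length contributes roughly 6 dB in such a model, which is why the curve flattens once the internal datapath is no longer the dominant noise source.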

(a) 8-bit output accuracy

(b) 12-bit output accuracy

Fig. 4.6 SNR performance vs. internal datapath word length of the new CORDIC processor

4.2.6 Comparison

We compare the new design with several notable efficient designs in terms of speed and area. We do not analyze pipelined architectures and consider only the word-serial architecture here. Since the very-high-radix CORDIC algorithm [39], [40] depends heavily on the circuit designer's expertise, as mentioned in Section 4.2.1, we do not include it in the comparison. The CORDIC algorithm with the close-to-optimal angle recoding scheme [43] can reduce the iteration number to n/3 on average (excluding the complicated variable scale factors it introduces). However, it has to perform O(n²) comparison operations. Since this is a huge overhead compared with the other CORDIC algorithms, we also exclude it from the comparison. The trellis-based searching scheme [44] needs an enormous ROM table to store not only the results of the angle decompositions but also the variable scale factor compensation sequences. Therefore, it is not suited to hardware implementation or SoC design. Because of its long initial delay of n units of time, the differential CORDIC algorithm [50] is designed for efficient parallel pipelined operation, not for serial computation. In addition, it is still based on the conventional CORDIC algorithm, which needs n iterations for the micro-rotations plus O(n) shift-and-add iterations for the constant scale factor compensation. Therefore, we also exclude it from the comparison.

Table 4.9 lists the iteration counts and the required major area for the serial implementation of these algorithms [35-38], [41-42], [45-48]. In order to roughly quantify the comparison, we focus on the key circuit modules in the critical paths of those designs.

Table 4.9 Area and speed comparison of the new and several notable CORDIC processors (n-bit accuracy)

Algorithm          Total iteration number   Main area (adders, barrel shifters, and angle table)

Conventional       4n/3                     3 adders + n-word ROM + 2 barrel shifters
Takagi [35]        n                        3 adders + n-word ROM + 2 barrel shifters
Timmermann [36]    n                        3 adders + n-word ROM + 2 barrel shifters
Antelo [37]        4n/5                     5 adders + 2n/3-word ROM + 4 barrel shifters
Rao [38]           n/2 + 3                  5 adders + 2n/3-word ROM + 4 barrel shifters + CSD coding + distributed multiplication
Hsiao [41]         n                        3 adders + n-word ROM + 2 barrel shifters
Li [45]            2n                       4 adders + 2n-word ROM + 2 barrel shifters
Li [46]            4n/5                     4 adders + 2n-word ROM + 2 barrel shifters
Lin [47]           n + 1                    3 adders + n/2-word ROM + 2 barrel shifters
Our new CORDIC     n/3                      8 adders + n/3-word ROM + 6 barrel shifters

Based on the above discussion, although our newly proposed CORDIC algorithm requires more adders and barrel shifters, it has a small table size and the lowest iteration number.
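To make the comparison concrete, the iteration formulas of Table 4.9 can be evaluated for a given accuracy n. The sketch below simply transcribes the formulas from the table; the dictionary and function names are illustrative, and the n/3 entry for our design is an average figure, as reported by the simulations.

```python
# Iteration-count formulas transcribed from Table 4.9 (n-bit accuracy).
ITERATION_FORMULAS = {
    "Conventional": lambda n: 4 * n / 3,
    "Takagi [35]": lambda n: n,
    "Timmermann [36]": lambda n: n,
    "Antelo [37]": lambda n: 4 * n / 5,
    "Rao [38]": lambda n: n / 2 + 3,
    "Hsiao [41]": lambda n: n,
    "Li [45]": lambda n: 2 * n,
    "Li [46]": lambda n: 4 * n / 5,
    "Lin [47]": lambda n: n + 1,
    "Our new CORDIC": lambda n: n / 3,
}


def iteration_counts(n: int) -> dict[str, float]:
    """Evaluate every formula of Table 4.9 for n-bit accuracy."""
    return {name: float(formula(n)) for name, formula in ITERATION_FORMULAS.items()}


if __name__ == "__main__":
    for name, count in sorted(iteration_counts(12).items(), key=lambda kv: kv[1]):
        print(f"{name:18s} {count:5.1f}")
```

For n = 12 this gives 4 iterations for the proposed design, against 9 for the next best entry and 16 for the conventional algorithm.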

4.2.7 The Proposed CORDIC-based FFT PE

Instead of using the conventional complex multiplier, we can apply the new CORDIC algorithm to the processing element design of the FFT processor. In addition, we consider the special cases in which the input angle is an odd multiple of 45˚, that is, when the twiddle factor is W^{N/8} with N odd. These factors can be realized by the following equation:

$$W_N^{N/8} = \frac{\sqrt{2}}{2}(1 - j), \qquad \frac{\sqrt{2}}{2} \approx 0.7071068 \approx 2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-8} = (1 + 2^{-2})(2^{-1} + 2^{-4}) + 2^{-8} \tag{4.17}$$

Therefore, we can use our CORDIC datapath to execute this particular rotation easily.

In this case, we need only one cycle to obtain the result by using the datapath shown in Fig. 4.7, which requires only minor modifications to the datapath of our CORDIC processor shown in Fig. 4.3(b). However, it increases the control complexity of the FFT processor, because these special angles have to be distinguished from all the other rotation angles. Hence, we must analyze how often the special twiddle factors, namely (√2/2)(1 ± j) and (√2/2)(−1 ± j), occur, and make sure that they are frequent enough to justify the added control. For the adopted radix-2² algorithm and the variable-length FFT processor, which supports both power-of-4 and non-power-of-4 FFT operations as discussed in Section 3.1, Table 4.10 shows the results of this analysis.

Table 4.10 Percentages of twiddle factors which are equal to odd multiples of 45˚ in using the adopted radix-2² algorithm and the radix-2²/2 PE

FFT points                          256      512      1024     2048     8192

Percentage of W^{N/8}, N is odd     24.6%    24.6%    19.9%    19.9%    16.7%

Based on Table 4.10, we use the datapath of Fig. 4.7 to perform the butterfly operation with the twiddle factor W^{N/8} (for odd N), which reduces the total iteration number of the FFT computation. The block diagram of the shared hardware is shown in Fig. 4.8.
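The shift-and-add form of equation (4.17), √2/2 ≈ (1 + 2^-2)(2^-1 + 2^-4) + 2^-8, can be sketched in integer arithmetic as follows. The function names are hypothetical, Python's arithmetic right shift stands in for the hardware shifters, and the ordering of the shifts is one possible choice rather than the exact structure of Fig. 4.7.

```python
def times_inv_sqrt2(value: int) -> int:
    """Approximate value * sqrt(2)/2 as (1 + 2^-2)(2^-1 + 2^-4) + 2^-8, shifts and adds only."""
    partial = (value >> 1) + (value >> 4)           # value * (2^-1 + 2^-4)
    return partial + (partial >> 2) + (value >> 8)  # apply (1 + 2^-2), then add value * 2^-8


def rotate_minus_45(x: int, y: int) -> tuple[int, int]:
    """Multiply x + jy by (sqrt(2)/2)(1 - j).

    (x + jy)(1 - j) = (x + y) + j(y - x), so one add, one subtract,
    and two constant scalings are enough.
    """
    return times_inv_sqrt2(x + y), times_inv_sqrt2(y - x)


if __name__ == "__main__":
    print(times_inv_sqrt2(4096), 4096 * 0.7071068)
    print(rotate_minus_45(4096, 0))
```

The other three special factors differ only in the signs applied to the two sums, so the same scaling path can be reused for them.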

[Figure: shift-and-add datapath with adders producing x′ and y′, a barrel shifter, a >>1 shifter, and a 2^-8 term]

Fig. 4.7 Implementation of the twiddle factor W^{N/8} (odd N)

[Figure: block diagram with the input data (real and imaginary parts) and x(i), y(i) feeding multiplexers (MUX), barrel shifters (BS) and >>1 shifters; a leading-one bit detector producing s, t, c; an angle recoding table and angle ROM; the initial angle and residue angle z(i+1); the error term cj; and the outputs x(i+1), y(i+1)]

Fig. 4.8 Block diagram of the proposed CORDIC-based FFT PE

In Fig. 4.8, when the control signals of the multiplexers are set to 0, the PE executes the vector rotation and the on-line scale factor compensation. When the control signals are set to 1, the PE computes the trivial multiplication by the twiddle factor W^{N/8} with odd N. This shared hardware design does not increase the number of required arithmetic units.

4.3 Comparison of FFT Processing Elements

Table 4.11 compares the multiplier-based PE and the CORDIC-based PE. The 8192-point radix-2² FFT PE, with 12-bit accuracy, is synthesized with the UMC 0.18 μm standard cell library using Synopsys Design Analyzer. The multiplier-based PE includes three 1024-word twiddle factor ROM tables.

The proposed CORDIC-based PE performs the front add/sub of a butterfly operation in the first cycle and then executes the rotation operations to carry out the butterfly complex multiplications. The average number of operation cycles is about 4.76 per butterfly computation for an 8192-point FFT.

Table 4.11 Comparison of the multiplier-based PE and CORDIC-based PE

                                          Proposed CORDIC-based PE      Multiplier-based PE
                                          (word serial architecture)

Gate counts                               5163                          34591 (single complex multiplier: 5746)
Path delay                                2.15 ns                       9.76 ns
Required operation cycles per             4.76 (averaged)               1
butterfly computation
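The figures of Table 4.11 can be combined into a rough time-per-butterfly estimate. The sketch below assumes that each PE is clocked at a period equal to its reported path delay, which is a simplification made only for illustration; the dictionary and function names are hypothetical.

```python
# Values taken from Table 4.11.
CORDIC_PE = {"gates": 5163, "path_delay_ns": 2.15, "cycles_per_butterfly": 4.76}
MULTIPLIER_PE = {"gates": 34591, "path_delay_ns": 9.76, "cycles_per_butterfly": 1.0}


def time_per_butterfly_ns(pe: dict) -> float:
    """Assumption: the clock period equals the reported path delay."""
    return pe["path_delay_ns"] * pe["cycles_per_butterfly"]


def gate_ratio(pe: dict, reference: dict) -> float:
    """Gate count of one PE relative to a reference PE."""
    return pe["gates"] / reference["gates"]


if __name__ == "__main__":
    print(time_per_butterfly_ns(CORDIC_PE))      # about 10.2 ns
    print(time_per_butterfly_ns(MULTIPLIER_PE))  # 9.76 ns
    print(gate_ratio(CORDIC_PE, MULTIPLIER_PE))  # about 0.15
```

Under this assumption the two PEs need a similar time per butterfly, while the CORDIC-based PE uses roughly 15% of the gates.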

Chapter 5 EDA Realization of the New Multi-Standard CORDIC-Based FFT Processor

5.1 Design Overview

The proposed design is an in-place memory-based FFT processor. The processor needs a four-bank memory matched to the in-place memory address generator for high-bandwidth data access. To meet the specifications of 802.16, DAB, and DVB-T, we employ a variable-length data address generator that covers five FFT lengths: 256, 512, 1024, 2048, and 8192 points. Correspondingly, the processing element is based on the radix-2² DIF FFT algorithm and also supports non-power-of-4 FFT computation, as discussed in Chapter 3. Since we replace the conventional complex multipliers of the PE with a CORDIC processor, the ROM table that stores the twiddle factors can be eliminated. The block diagram of our design is shown in Fig. 5.1.

[Figure: SRAM Bank 0 to SRAM Bank 3, read commutator, CORDIC-based PE, write commutator, rotation angle generator, and data address generator]

Fig. 5.1 Block diagram of the proposed FFT processor

5.2 Components of FFT Processor

5.2.1 The Data Memory

The memory block of our FFT processor design is a 4-bank synchronous SRAM. Each bank has 2048 words of 24 bits and is generated by the Artisan™ UMC™ 0.18 µm SRAM generator. Each memory word holds 12 bits for the real part and 12 bits for the imaginary part of an FFT data sample, 24 bits in total.

The details of the memory partition scheme and the address generation method are presented in Chapter 3. The data address within each memory bank is obtained by shifting the one-dimensional data address right by two bits, which is easy to implement. The bank index, in turn, is obtained by a summation over the one-dimensional data address followed by a modulo-4 operation, as mentioned in Section 3.1.
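The address mapping can be sketched as follows. The exact summation is defined in Section 3.1, which is not reproduced here; the sketch ASSUMES that the bank index is the sum of the base-4 digits of the address taken modulo 4, a common choice for conflict-free in-place radix-4 access. The function names are hypothetical.

```python
def bank_address(addr: int) -> int:
    """Address inside a bank: the one-dimensional address shifted right by two bits."""
    return addr >> 2


def bank_index(addr: int) -> int:
    """ASSUMPTION: sum of the base-4 digits of the address, modulo 4."""
    total = 0
    while addr:
        total += addr & 0b11  # lowest base-4 digit
        addr >>= 2
    return total % 4


if __name__ == "__main__":
    # Four operands that differ in a single base-4 digit land in four different banks.
    operands = [5 | (digit << 4) for digit in range(4)]
    print([(bank_index(a), bank_address(a)) for a in operands])
```

With this assumed mapping, every address of an 8192-point transform maps to a distinct (bank, address) pair, and each bank holds 2048 words, matching the bank size given above.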

For a general FFT processor with a multiplier-based PE, the data required by the butterfly unit need to be read from and written to main memory simultaneously in order to avoid stalls and increase throughput. Since a continuous stream of simultaneous read and write operations is not possible with a single-port SRAM, this becomes a serious problem when continuous memory access is assumed and preferred. One solution is to use a dual-port SRAM. The disadvantage of a dual-port memory is that it has a larger area than a single-port memory, because it contains two read/write ports, two sense amplifiers, and two address generators. Power consumption is also a problem. According to Table 5.1, the power consumption per MHz of a dual-port memory is larger than that of a single-port memory of the same size.

Table 5.1 Power consumption of SRAM in a 0.18 µm process

SRAM size = 2048 × 24    Power (mW/MHz)

Single port              0.21
Dual port                0.49

In our design with the CORDIC-based PE, the butterfly unit needs at least two cycles to execute the required operation. Since continuous read or write operation is avoided, the single-port memory, which has a smaller area and lower power consumption, can be adopted.

5.2.2 The Processing Element

We replace the complex multipliers of the PE shown in Fig. 3.6 with the CORDIC processor shown in Fig. 4.8. The resulting CORDIC-based PE structure is shown in Fig. 5.2. When the control signals of MUX_1 are set to 0, the PE executes a radix-2² FFT butterfly. When they are set to 1, the PE is reconfigured to process two radix-2 butterflies simultaneously. The rotation angles required by the variable-length CORDIC-based FFT processor can be generated by hardware similar to the coefficient address generator described in Section 3.1.

[Figure: butterfly adders/subtractors and a −j path selected by MUX_1 (radix-2²/2 select); inputs Data in 0 to Data in 3; three CORDIC processors (Fig. 4.8) with rotation angles 1 to 3, each followed by MUX_2 and a register; outputs Data out 0 to Data out 3]

Fig. 5.2 The CORDIC-based PE structure

5.2.3 Controller

By combining the trivial (√2/2)(±1 ± j) multiplications and the front add/sub of a butterfly operation with the basic CORDIC rotation operation (for the butterfly complex multiplications), we can design a flexible CORDIC processor that executes all three sub-operations of a butterfly operation. The operation flow chart and the timing diagram of the CORDIC-based FFT processor are shown in Fig. 5.3 and Fig. 5.4, respectively.

Fig. 5.3 The flow chart of the butterfly operations with the proposed CORDIC-based PE

[Figure: timing diagram with the signals SRAM address (alternating read and write addresses for Data 0 to Data 3), Write_enable, write back flag, data address generator (address values for Data 2 to Data 4), angle value generator (angle values for Data 2 to Data 4), and the PE activity (butterfly, CORDIC iterations 1 to 3, angle decomposition, residue angle = 0) for successive data]

Fig. 5.4 Timing diagram of the CORDIC-based FFT processor

5.3 Design of Data Interface

Since a practical FFT processor receives its input data serially, the N input samples have to be temporarily stored in a buffer before the FFT operation starts. Similarly, the FFT output data should be stored in a memory buffer for the subsequent channel equalization or demodulation operations.

Fig. 5.5 shows three popular memory arrangement schemes that properly handle these input and output data. The first scheme inserts an input RAM buffer that performs serial-to-parallel conversion and an output RAM buffer that preserves the previous FFT results. The second uses three identical memory blocks, where one of them acts as the PE's data memory and the remaining two act as the input buffer and the output buffer, with the roles alternating. The third scheme [51] reads the input RAM buffer and performs the first-stage FFT before the guard interval has passed. Furthermore, during the final-stage FFT operation, the computational results are written to the output RAM for the subsequent demodulation operations, instead of to the main RAM used for intermediate data reads and writes.

In the structure of Fig. 5.5(a), a clock rate difference exists between the front-end function modules and the FFT processor, because of the mismatch between the input data rate N and the total operation count O(N log_r N). That is, the intermediate data memory is accessed at the faster PE clock rate, while the input buffer is accessed at the slower front-end system clock rate. Similarly, the output buffer is accessed at the back-end system clock rate. However, when an FFT computation has been completed, we have to transfer the N-point output data directly from the intermediate data memory to the output buffer in a short time and then load the next N-point

rate is a critical issue during the input and output data transfers, and the input (output) buffer has to be driven by another clock rate that is faster than the front-end (back-end) system clock rate. This kind of clock difference is not too hard to handle with state-of-the-art VLSI technology, but direct input and output data transfers without memory remapping are inefficient.

[Figure: input data loaded through an input buffer (N-word RAM) into the intermediate data memory (N-word RAM), which is accessed by the PE; results loaded into an output buffer (N-word RAM) that delivers the output data]

(a) The 1st data interface structure for FFT PE

[Figure: RAM_1, RAM_2, and RAM_3 (N-word RAM each) connected through switches to the input data (load), the PE (access), and the output data (load)]

(b) The 2nd data interface structure for FFT PE

[Figure: input buffer (N-word RAM) read by the PE, intermediate data memory (N-word RAM) accessed by the PE, and output buffer (N-word RAM) written by the PE to deliver the output data]

In the interface structure of Fig. 5.5(b), three identical memory blocks take turns serving as the input buffer, the PE's data memory, and the output buffer. That is, while one memory block is loading the next N-point input data, another provides the current N-point FFT data processed by the PE, and the third holds the previous FFT result for the back-end function module. When the next symbol period begins, the memory blocks change roles and repeat the process. For instance, the memory block that stores the input data will act as the PE's data memory in the next period. However, the clock of a memory block is synchronous to the front-end function modules when it works as the input buffer, whereas it must be synchronous to the faster FFT processor when it works as the PE's data memory. As a result, the memory blocks have to be driven by different clocking systems. This situation is similar to the first interface structure, but without the direct data transfers.

In the interface structure of Fig. 5.5(c), the N input data collected in the input buffer are read by the PE to perform the first-stage FFT operation and then written back to the PE's intermediate data memory before the guard interval has passed. Therefore, no copy operations between the data memory and the input buffer are needed. Similarly, the results of the last-stage FFT operation are written to the output buffer instead of the PE's data memory. However, the proposed CORDIC-based FFT PE needs more operation cycles than the multiplier-based FFT PE. Consequently, to complete the required computation within the guard interval, we would have to raise the operating clock rate of the CORDIC-based PE, especially for DVB-T and 802.16. Therefore, we do not adopt this structure.

With the interface structure of Fig. 5.5(b), the total number of CORDIC iteration operations required by each OFDM communication system is shown in Table 5.2. Among the standards in this table, 802.16 is the most demanding in terms of speed. If

cover all the OFDM communication systems listed in Table 5.2.

Table 5.2 The required operation counts and clock rates of the proposed CORDIC-based PE for various OFDM communication specifications (output precision is 12-bit)

Standard   Symbol duration        Total PE operation cycles   Cycle duration (ns)      Clock rate (MHz)

DVB-T      8K mode (924 µs)       68252                       924/68252 = 13.5         73.8
DVB-T      2K mode (231 µs)       14472                       231/14472 = 15.9         62.6
DAB        2048 (1246 µs)         14472                       1246/14472 = 86.1        11.6
DAB        1024 (623 µs)          5204                        623/5204 = 119.7         8.3
DAB        512 (312 µs)           3092                        312/3092 = 101           9.9
DAB        256 (156 µs)           1232                        156/1232 = 126.6         7.9
802.16     2048 (105.6 µs)        14472                       105.6/14472 = 7.3        137
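The clock rates of Table 5.2 follow directly from the symbol durations and cycle counts, as the short sketch below shows. The symbol durations and cycle counts are transcribed from the table; the dictionary keys and function names are illustrative.

```python
# (symbol duration in microseconds, total PE operation cycles), from Table 5.2.
MODES = {
    "DVB-T 8K": (924.0, 68252),
    "DVB-T 2K": (231.0, 14472),
    "DAB 2048": (1246.0, 14472),
    "DAB 1024": (623.0, 5204),
    "DAB 512": (312.0, 3092),
    "DAB 256": (156.0, 1232),
    "802.16 2048": (105.6, 14472),
}


def cycle_duration_ns(symbol_us: float, cycles: int) -> float:
    """Time available per PE cycle, in nanoseconds."""
    return symbol_us * 1000.0 / cycles


def required_clock_mhz(symbol_us: float, cycles: int) -> float:
    """Minimum clock rate: cycles per microsecond equals MHz."""
    return cycles / symbol_us


if __name__ == "__main__":
    for name, (symbol_us, cycles) in MODES.items():
        print(f"{name:12s} {cycle_duration_ns(symbol_us, cycles):7.1f} ns "
              f"{required_clock_mhz(symbol_us, cycles):6.1f} MHz")
```

The largest requirement comes from the 802.16 mode at roughly 137 MHz, so a clock that satisfies it also covers the other modes.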

Chapter 6 Conclusion

In this thesis, we propose an in-place memory-based variable-length FFT processor architecture that is suited to multi-mode and multi-standard OFDM systems, including 802.16a, DAB, and DVB-T. The design features a variable-length data address generator that replaces the original area-consuming barrel-shifter-based designs with a few simpler multiplexer-based addressing functions. Furthermore, we propose an efficient twiddle factor generator, which has the merit of low
