New Angle Decomposition and Table Size Reduction Schemes

Chapter 4 Processing Elements of FFT Processor

4.2 CORDIC-based Processing Element

4.2.2 New Angle Decomposition and Table Size Reduction Schemes

The method that detects the leading-one bit position of the residual angle zi in the i-th iteration is employed in order to speed up the convergence rate. This operation can skip the unnecessary rotations required by conventional CORDIC algorithms.

In [48], the rotation signs δ_{s_i} and δ_{t_i}, stored in advance in a look-up table, are optimized according to z_{i,r}, where z_{i,r} denotes the most significant r bits counted from the leading-one bit of z_i, and the optimal combined rotation angle that best matches z_i is performed. When the residue angle z_i is negative, the leading-one bit detection is replaced by leading-zero bit detection.

In general, one may include more than two non-zero δ_m to speed up the convergence further. However, this significantly increases the computational complexity. Therefore, we only investigate the case of two combined parameters here; similar techniques can be extended to the general case.

Since δ_m ∈ {+1, −1}, the look-up table must store both the index values (s and t) and the sign values (+1 or −1) of δ_m. However, a similar convergence rate can be obtained when the δ_m set is replaced by δ_m ∈ {1}. Consequently, only the index values (s and t) of δ_m have to be recorded in the look-up table, and the iteration operation can be simplified as in equation (4.9). For a positive residue angle z_i, the rotation direction d_i is assigned +1, which executes a counterclockwise rotation. Conversely, d_i is assigned −1, which performs a clockwise rotation, when the residue angle z_i is negative. For simplicity, we detect the leading-one bit position of the absolute value of the residual angle z_i to obtain z_{i,r} and the parameters s_i and t_i.

i formed by full search. We can record the residue rotation angles approximately, as in a radix-2^r algorithm, by examining the MSB part z_{i,r} of the residual rotation angle z_i. The scheme has two design parameters: r and the number of non-zero δ_m. In our design, we pick r = 4 and use two δ_m.

Since the look-up table depends on the iteration index i, it would be best to have an optimized look-up table for each i. However, this would increase the table size significantly. For simplicity, we separate the tables into two cases, k ∈ {0, 1} and k > 1, according to the analysis of the Taylor expansion of θ_k, where θ_k = tan⁻¹(2^−k) and k is the leading-one bit position of the residue rotation angle z_i. Computer simulations show that two tables are sufficient, so a separate table for each i is not needed. The optimized s_i and t_i of the δ_{s_i} and δ_{t_i} parameters corresponding to z_{i,r} are shown in Table 4.2. By the symmetry of the complex plane, the convergence range of [−π, +π] can be achieved by performing rotations within 0 ~ π/4. Therefore, the table only covers the input angle range from 0 to π/4.

In the design of the new angle encoding scheme, we must include an angle table for the terms tan⁻¹(2^−m), where m = 1, 2, …, n. The angle table size can be reduced according to the Taylor expansion shown in equation (4.11) with x = 2^−m.

$$\tan^{-1}x = x - \frac{x^{3}}{3} + \frac{x^{5}}{5} - \cdots, \qquad |x| \le 1 \tag{4.11}$$

If the accuracy of the output data is n bits, the second term of equation (4.11) can be ignored when m > n/3, because x³/3 = 2^−3m/3 is then smaller than 2^−n. With this method, only about n/3 words of tan⁻¹(2^−m) need to be stored instead of n words. Moreover, when m > n/3, the value of tan⁻¹(2^−m) can be obtained by shifting the value of tan⁻¹(2^−(m−1)) to the right by 1 bit. Table 4.3 shows the 12-bit tan⁻¹(2^−m) values in both radians and degrees.
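The table-size reduction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the thesis implementation: the function name `build_angle_table`, the use of `ceil(n/3)` for the number of stored words, and the assumption of one integer bit (so n − 1 fractional bits) are choices made here for the example.

```python
import math

def build_angle_table(n_bits=12):
    """Return {m: fixed-point tan^-1(2^-m)} for m = 1 .. n_bits - 1.

    Assumption: 1 integer bit, so values carry n_bits - 1 fractional bits.
    """
    scale = 1 << (n_bits - 1)
    stored = math.ceil(n_bits / 3)  # words that are actually kept in ROM
    table = {m: round(math.atan(2.0 ** -m) * scale) for m in range(1, stored + 1)}
    # For m > n/3 the cubic term of the Taylor series is below one LSB,
    # so each further entry is the previous one shifted right by 1 bit.
    for m in range(stored + 1, n_bits):
        table[m] = table[m - 1] >> 1
    return table

table = build_angle_table(12)
print([table[m] for m in range(1, 12)])
# -> [950, 502, 255, 128, 64, 32, 16, 8, 4, 2, 1]
```

Dividing these integers by 2^11 gives 0.463867, 0.245117, 0.124512, 0.062500, and so on, which agrees with the radian column of Table 4.3; only the first four words come from the arctangent itself.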

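Table 4.2 Recoding table for the decomposition of the residual rotation angle, r = 4 (optimized parameters s_i, t_i of δ_{s_i} and δ_{t_i})

| z_{i,r} (2^−k ~ 2^−k−3) | s_i (k = 0, 1) | t_i (k = 0, 1) | s_i (k > 1) | t_i (k > 1) |
|---|---|---|---|---|
| 1000 | k | k+3 | k | k+4 |
| 1001 | k | k+2 | k | k+3 |
| 1010 | k | k+2 | k | k+2 |
| 1011 | k | k+1 | k | k+1 |
| 1100 | k | k+1 | k | k+1 |
| 1101 | Unused, for θ_0 = 0 ~ π/4 | Unused, for θ_0 = 0 ~ π/4 | k | k+1 |
| 1110 | Unused, for θ_0 = 0 ~ π/4 | Unused, for θ_0 = 0 ~ π/4 | k−1 | k+5 |
| 1111 | Unused, for θ_0 = 0 ~ π/4 | Unused, for θ_0 = 0 ~ π/4 | k−1 | k+3 |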

Table 4.3 Angle table of tan⁻¹(2^−m) values for 12-bit accuracy

| m | tan⁻¹(2^−m) (radian) | binary | tan⁻¹(2^−m) (degree) | binary |
|---|---|---|---|---|
| 1 | 0.463867 | 001110110110₂ | 26.565051 | 110101001000₂ |
| 2 | 0.245117 | 000111110110₂ | 14.036243 | 111000001001₂ |
| 3 | 0.124512 | 000011111111₂ | 7.125016 | 111001000000₂ |
| 4 | 0.062500 | 000010000000₂ | 3.576334 | 111001001110₂ |
| 5 | 0.031250 | 000001000000₂ | 1.789911 | 011100101000₂ |
| 6 | 0.015625 | 000000100000₂ | 0.895174 | 001110010100₂ |
| 7 | 0.007813 | 000000010000₂ | 0.447614 | 000111001010₂ |
| 8 | 0.003906 | 000000001000₂ | 0.223811 | 000011100101₂ |
| 9 | 0.001953 | 000000000100₂ | 0.111906 | 000001110010₂ |
| 10 | 0.000977 | 000000000010₂ | 0.055953 | 000000111001₂ |
| 11 | 0.000488 | 000000000001₂ | 0.027976 | 000000011100₂ |
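To make the use of Tables 4.2 and 4.3 concrete, the following Python sketch decomposes a 12-bit residual angle by leading-one detection and table look-up. It is an illustration under stated assumptions, not the thesis implementation: the names `recode` and `decompose` are hypothetical, angles are integers with 11 fractional bits, the sign is handled through the absolute value of the residual angle, tan⁻¹(2^−m) is treated as 0 for m > 11, and the input is assumed to lie within 0 ~ π/4.

```python
# Radian column of Table 4.3 as integers with 11 fractional bits.
ANGLE = {1: 950, 2: 502, 3: 255, 4: 128, 5: 64, 6: 32,
         7: 16, 8: 8, 9: 4, 10: 2, 11: 1}
FRAC_BITS = 11
R = 4  # number of bits examined after the leading one

# Table 4.2: z_{i,r} -> (offset of s_i from k, offset of t_i from k).
RECODE_K01 = {0b1000: (0, 3), 0b1001: (0, 2), 0b1010: (0, 2),
              0b1011: (0, 1), 0b1100: (0, 1)}
RECODE_KGT1 = {0b1000: (0, 4), 0b1001: (0, 3), 0b1010: (0, 2), 0b1011: (0, 1),
               0b1100: (0, 1), 0b1101: (0, 1), 0b1110: (-1, 5), 0b1111: (-1, 3)}

def recode(z):
    """Return (s, t) for a non-zero residual angle z."""
    mag = abs(z)
    msb = mag.bit_length() - 1       # bit index of the leading one
    k = FRAC_BITS - msb              # leading one has weight 2^-k
    if msb >= R - 1:
        z_r = mag >> (msb - (R - 1))  # keep the top R bits
    else:
        z_r = mag << (R - 1 - msb)    # pad with zeros on the right
    ds, dt = (RECODE_K01 if k <= 1 else RECODE_KGT1)[z_r]
    return k + ds, k + dt

def decompose(z, max_iter=16):
    """Return the list of (d, s, t) steps and the final residual angle."""
    steps = []
    while z != 0 and len(steps) < max_iter:
        d = 1 if z > 0 else -1
        s, t = recode(z)
        z -= d * (ANGLE.get(s, 0) + ANGLE.get(t, 0))
        steps.append((d, s, t))
    return steps, z

print(decompose(1608))  # 1608 / 2048 is approximately pi/4
# -> ([(1, 1, 2), (1, 4, 7), (1, 8, 9)], 0)
```

In this example the angle π/4 is resolved in three iterations, each combining two elementary angles, whereas a conventional CORDIC would step through every bit position.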

4.2.3 A New On-line Variable Scale Factor Compensation Scheme

For low-complexity decomposition and compensation of variable scale factors, our lab previously proposed an on-line variable scale factor compensation method [45], [46], briefly introduced below. The variable scale factor K is first transformed to T as shown in equation (4.12), where the total iteration number is denoted as I. The value of T can be accumulated simultaneously with the rotation iterations, where the terms to be accumulated can be obtained from a lookup table. At the end of all the rotation operations, T is decomposed into a sequence of shift-and-add terms as shown in equation (4.13), where the total number of required compensation operations is denoted as J.

Here we propose a new on-line variable scale factor compensation algorithm, with which the T value no longer has to be accumulated and decomposed into shift-and-add terms. The algorithm is detailed below. From the Taylor expansion of the scale factor shown in equation (4.14), we obtain a simple, approximate on-line scale factor compensation term (1 − 2^−(2s_i+1)) based on the parameter s_i. The compensation by this term can also be realized by a shift-and-add operation.

$$\cos\theta_i = \cos\left(\tan^{-1}2^{-s_i}\right) = \left(1+2^{-2s_i}\right)^{-1/2} = \underbrace{\left(1-2^{-(2s_i+1)}\right)}_{\text{on-line compensation term}} + \underbrace{\tfrac{3}{8}\,2^{-4s_i} - \cdots}_{\text{error term}} \tag{4.14}$$

From this, there are three conditions for realizing the scale factor compensation with n-bit accuracy:

1. When s_i > (n/2) − 1, the scale factor cos θ_i is approximately 1, and no compensation is required.

2. When s_i < n/4, besides performing the compensation operation (1 − 2^−(2s_i+1)), we have to store the error term and recode its value into the form ∏ (1 + 2^−c_j), where the number of required compensation operations is denoted as J. The error term can then be compensated by performing the operations (1 + 2^−c_j) right after the on-line compensation operations are completed.

3. Otherwise, we only need to perform the (1 − 2^−(2s_i+1)) operation according to s_i, and no compensation error will occur within n-bit accuracy.

Thus, the compensation operation of the variable scale factor can be implemented by equation (4.15). Table 4.4 shows the 12-bit example of the expansion of cos(tan⁻¹(2^−s_i)) and its recoding values c_j.

Table 4.4 Error term of the on-line scale factor compensation algorithm

| s_i | cos(tan⁻¹(2^−s_i)) | (1 − 2^−(2s_i+1)) | multiple | c_j |
|---|---|---|---|---|
| 1 | 0.8945313 | 0.8750000 | 1.02232 | 6 and 7 |
| 2 | 0.9702148 | 0.9687500 | 1.0015 | 10 and 11 |
| 3 | 0.9921875 | 0.9921875 | 1 | none |
| 4 | 0.9980469 | 0.9980469 | 1 | none |
| 5 | 0.9995117 | 0.9995117 | 1 | none |
| 6~11 | cos(tan⁻¹(2^−s_i)) = 1, no scale factor compensation | | | |
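The first three numeric columns of Table 4.4 can be reproduced with a short Python check. This is a sketch under the assumption that the cosine is rounded to 11 fractional bits (12-bit words with one integer bit); the function name `scale_factor_row` is introduced here only for illustration.

```python
import math

def scale_factor_row(s, n_bits=12):
    """Return (quantized cos(tan^-1 2^-s), on-line term, their ratio)."""
    scale = 1 << (n_bits - 1)  # assumption: 1 integer bit
    exact = round(math.cos(math.atan(2.0 ** -s)) * scale) / scale
    online = 1.0 - 2.0 ** -(2 * s + 1)
    return exact, online, exact / online

for s in range(1, 6):
    exact, online, ratio = scale_factor_row(s)
    print(s, f"{exact:.7f}", f"{online:.7f}", f"{ratio:.5f}")
```

For s = 3, 4, 5 the quantized cosine and the on-line term coincide, so the ratio is exactly 1 and no error term is needed, while s = 1 and s = 2 leave the residual multiples 1.02232 and 1.0015 that the c_j terms have to remove.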

In order to reduce the computational complexity, we should not merge many on-line scale factor compensations with the elementary rotation operation. Furthermore, only about half of the rotation operations need scale factor compensation, because cos θ_i = 1 when s_i > (n/2) − 1. Here, we combine just one shift-and-add compensation operation with each iterative rotation operation, although two rotation parameters s_i and t_i are obtained per rotation iteration. The iteration operation is given by equation (4.16).

Since only a single scale factor compensation is performed per rotation, the parameters s_i and t_i that are smaller than n/2 must be kept for the later compensation operations. The error term (1 + 2^−c_j) is compensated right after all the compensation operations based on s or t have been completed.
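As a hedged illustration of the on-line compensation idea, the Python sketch below applies one elementary rotation by tan⁻¹(2^−s) followed by its shift-and-add compensation term (1 − 2^−(2s+1)). It is not equation (4.16), which combines two elementary angles per iteration; the single-angle form, the floating-point arithmetic, and the name `micro_rotate` are simplifying assumptions made here.

```python
import math

def micro_rotate(x, y, s, d):
    """Rotate (x, y) by d * tan^-1(2^-s) and apply the on-line term."""
    x_r = x - d * y * 2.0 ** -s         # shift-and-add rotation
    y_r = y + d * x * 2.0 ** -s
    comp = 1.0 - 2.0 ** -(2 * s + 1)    # approximates cos(tan^-1 2^-s)
    return x_r * comp, y_r * comp

# s = 3 with n = 12 falls in the range n/4 <= s <= n/2 - 1,
# where no additional error term is expected.
x1, y1 = micro_rotate(1.0, 0.0, s=3, d=1)
theta = math.atan(2.0 ** -3)
print(abs(x1 - math.cos(theta)), abs(y1 - math.sin(theta)))
```

Both printed deviations stay below 2^−11, which is consistent with the third condition above for 12-bit accuracy.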

4.2.4 The Overall Operation Flow and Architecture

By combining the leading-one bit detection scheme, the residual angle recoding technique, and the on-line variable scale factor compensation, we obtain a CORDIC algorithm with the following steps:

(1) Set the initial iteration number i = 0, the initial residual angle z_0 = θ, and the initial rotation vector (x_0, y_0) = (x, y).

(2) If θ = 0, then (x’, y’) = (x, y), and exit the rotation iteration. Otherwise, check the leading-one bit position k and obtain z_{0,r} of z_0, then use z_{0,r} to retrieve the optimized s_0 and t_0, and compute z_1 = z_0 − d_0 (tan⁻¹(2^−s_0) + tan⁻¹(2^−t_0)).

(3) Perform equation (4.16), and store the unused parameter (s_i or t_i) and the correction parameters c_j for later operations.

(4) If z_{i+1} ≠ 0, perform leading-one bit detection and get the optimized s_{i+1} and t_{i+1} according to |z_{i+1,r}|; compute z_{i+2} = z_{i+1} − d_{i+1} (tan⁻¹(2^−s_{i+1}) + tan⁻¹(2^−t_{i+1})); go to step (5). Otherwise, go to step (6).

(5) If the values of s_{i+1} and t_{i+1} obtained in step (4) are larger than n/2, no compensation of the scale factor is required. Set i = i+1 and go to step (3). Otherwise, proceed to step (6).

(6) If all the compensation operations based on s or t have been completed, go to step (7). Otherwise, read the parameter m_i from the {s, t} stored in previous angle decompositions to execute the compensation operation (1 − 2^−m_i), set i = i+1, and go to step (3).

(7) If all the compensation operations based on c_j have been completed, set (x’, y’) = (x, y) and exit the rotation iteration. Otherwise, read the parameter m_i from {c}, execute the compensation (1 + 2^−m_i), set i = i+1, and go to step (3).

The whole operation flow is shown in Fig. 4.2.

Fig. 4.2 Flow chart of the new on-line scale factor compensation CORDIC algorithm

Fig. 4.3 shows the two architectures of our new CORDIC processor, and Table 4.5 compares the two structures. For high-speed operation, they can also be cascaded in a pipelined structure. The pipelined structure is particularly efficient for applications that require intensive and sustained vector rotation operations.

Fig. 4.3 Architectures of the CORDIC algorithm: (a) the 1st new CORDIC processor; (b) the 2nd new CORDIC processor. Labeled blocks: barrel shifters, adders/subtractors, leading-one bit detector, angle recoding table (Table 4.2), angle ROM (Table 4.3), error term c_j ROM (Table 4.4), and input multiplexers for the initial angle and the real and imaginary input data.

Table 4.5 Comparison of the two new CORDIC processors

| | 1st | 2nd |
|---|---|---|
| Hardware complexity | 8 barrel shifters + 10 adders | 6 barrel shifters + 8 adders |
| Critical path | 2T_BS + 3T_adder | 3T_BS + 3T_adder |
| Basic operations | | |

4.2.5 Simulation Results

Based on the structures shown in Fig. 4.3, we performed fixed-point hardware simulations in MATLAB. Exhaustive simulations were conducted for all rotation angles in the range of 0˚ ~ 45˚. The simulation results are shown in Fig. 4.4, and the detailed results for 8-bit, 12-bit, and 16-bit accuracy (including a 1-bit integer part) are given in Table 4.6.

Fig. 4.4 Fixed-point simulation results of new CORDIC algorithm

Table 4.6 Detailed simulation results with 8-bit, 12-bit and 16-bit output accuracy

| | Output accuracy | 8 bits | 12 bits | 16 bits |
|---|---|---|---|---|
| Average iteration number | Angle decomposition | 1.835 | 2.727 | 3.644 |
| | Scale factor compensation | 1.786 | 3.092 | 4.153 |
| | Overall iteration | 2.437 | 3.482 | 4.424 |
| Worst-case iteration number | Angle decomposition | 3 | 4 | 5 |
| | Scale factor compensation | 4 | 5 | 6 |
| | Overall iteration | 4 | 5 | 6 |

Let T_r denote the iteration number of the x-y rotations (angle decomposition) and T_s denote the iteration number of the scale factor compensation. In a general CORDIC algorithm, the compensation operations of the variable scale factor are performed after all the x-y rotations have been completed, so the number of overall iterations is T_r + T_s. Our new CORDIC algorithm, however, not only uses the leading-one bit detection scheme to speed up the convergence but also combines it with the on-line scale factor compensation scheme. Therefore, the number of overall iterations of our algorithm is the maximum of T_r and T_s, i.e., overall iteration number = max(T_r, T_s).
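The effect of overlapping the two phases can be read off the average values in Table 4.6. The short Python sketch below compares the reported overall iteration count with T_r + T_s, which is what a hypothetical sequential schedule using the same decomposition would need on average; that sequential baseline and the name `saving` are assumptions introduced here for illustration, not a result from the simulations.

```python
# Average iteration numbers from Table 4.6: (T_r, T_s, overall).
AVG = {8: (1.835, 1.786, 2.437),
       12: (2.727, 3.092, 3.482),
       16: (3.644, 4.153, 4.424)}

def saving(bits):
    """Return (sequential average T_r + T_s, relative saving of max scheme)."""
    t_r, t_s, overall = AVG[bits]
    sequential = t_r + t_s
    return sequential, 1.0 - overall / sequential

for bits in AVG:
    sequential, frac = saving(bits)
    print(bits, round(sequential, 3), f"{100 * frac:.1f}%")
```

For 12-bit accuracy the sequential average would be 5.819 iterations, so the reported 3.482 corresponds to a reduction of roughly 40%.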

The occurrence versus iteration number for the 12-bit simulation is shown in Fig. 4.5, and the occurrence percentage of the dominant factor (T_r or T_s) versus iteration number for 12-bit precision is shown in Table 4.7.

Fig. 4.5 The total occurrence percentage versus iteration number, 12-bit precision

Table 4.7 Occurrence percentage of the dominant factor between T_r and T_s, 12-bit precision

| Overall iteration = max(T_r, T_s) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| max(T_r, T_s) = T_r | 73.33% | 84.44% | 31.24% | 0.88% | No T_r = 5 case |
| max(T_r, T_s) = T_s | 0% | 0.74% | 27.11% | 96.35% | 100% |
| T_r = T_s | 26.67% | 14.81% | 41.65% | 2.78% | No T_r = 5 case |

In order to further reduce the iteration number of our CORDIC algorithm, we consider the internal datapath word length. From the computer simulation results in Table 4.8, we observe that the convergence rate improves as the word length of the residue angle is increased. However, the angle table that stores the values of tan⁻¹(2^−m) becomes larger as the internal word length increases, and the improvement diminishes as the word length of the residue angle grows. From the simulations, the optimal internal word length is roughly 2 bits more than the target output word length.

Table 4.8 Simulation results of different residue angle word lengths

(a) 8-bit input and output accuracy

| | Internal word length of residue angle | 8 bits | 9 bits | 10 bits |
|---|---|---|---|---|
| Average iteration number | Angle decomposition | 1.835 | 1.767 | 1.728 |
| | Scale factor compensation | 1.786 | 1.786 | 1.786 |
| | Overall iteration | 2.437 | 2.398 | 2.388 |
| Worst-case iteration number | Angle decomposition | 3 | 3 | 3 |
| | Scale factor compensation | 4 | 4 | 4 |
| | Overall iteration | 4 | 4 | 4 |

(b) 12-bit input and output accuracy

| | Internal word length of residue angle | 12 bits | 13 bits | 14 bits |
|---|---|---|---|---|
| Average iteration number | Angle decomposition | 2.727 | 2.681 | 2.62 |
| | Scale factor compensation | 3.092 | 3.095 | 3.093 |
| | Overall iteration | 3.482 | 3.469 | 3.45 |
| Worst-case iteration number | Angle decomposition | 4 | 4 | 4 |
| | Scale factor compensation | 5 | 5 | 5 |
| | Overall iteration | 5 | 5 | 5 |

Similarly, we can analyze the required data word length of equation (4.16). According to Fig. 4.6, the SNR does not improve significantly once the internal word length exceeds the input precision by three bits. Furthermore, if complex multipliers are used to perform the vector rotation, the SNR values for 8-bit and 12-bit accuracy are about 44 dB and 68 dB, respectively. Therefore, the optimal word length is 3 bits more than the target output accuracy.

(a) 8-bit output accuracy

(b) 12-bit output accuracy

Fig. 4.6 SNR performance vs. internal datapath word length of the new CORDIC processor

4.2.6 Comparison

We compare the new design with some notable efficient designs in terms of speed and area. We do not analyze pipelined architectures and only consider the word-serial architecture here. Since the very high-radix CORDIC algorithm [39], [40] is highly dependent on the circuit designer’s expertise, as mentioned in Section 4.2.1, we do not include it in the comparison. The CORDIC algorithm with the close-to-optimal angle recoding scheme [43] can reduce the iteration number to n/3 on average (excluding the complicated variable scale factors it introduces). However, it has to perform O(n²) comparison operations. As this is a huge overhead compared to the other CORDIC algorithms, we also exclude it from the comparison. The trellis-based searching schemes [44] need an enormous ROM table to store not only the results of the angle decompositions but also the variable scale factor compensation sequences. Therefore, they are not suited for hardware implementation or SoC design. Because of its long initial delay of n units of time, the differential CORDIC algorithm [50] is designed for efficient parallel pipelined operation, not for serial computation. In addition, it is still based on the conventional CORDIC algorithm, which needs n iterations for micro-rotations plus O(n) shift-and-add iterations for constant scale factor compensation. Therefore, we do not include it in the comparison either.

Table 4.9 lists the iteration counts and the required major area for the serial implementation of these algorithms [35-38], [41-42], [45-48]. In order to roughly quantify the comparison, we focus on the key circuit modules in the critical paths of those designs.

Table 4.9 Area and speed comparison of the new and several notable CORDIC processors (n-bit accuracy)

| Algorithm | Total iteration number | Main area (adders, barrel shifters, and angle table) |
|---|---|---|
| Conventional | 4n/3 | 3 adders + n-word ROM + 2 barrel shifters |
| Takagi [35] | n | 3 adders + n-word ROM + 2 barrel shifters |
| Timmermann [36] | n | 3 adders + n-word ROM + 2 barrel shifters |
| Antelo [37] | 4n/5 | 5 adders + 2n/3-word ROM + 4 barrel shifters |
| Rao [38] | n/2 + 3 | 5 adders + 2n/3-word ROM + 4 barrel shifters + CSD coding + distributed multiplication |
| Hsiao [41] | n | 3 adders + n-word ROM + 2 barrel shifters |
| Li [45] | 2n | 4 adders + 2n-word ROM + 2 barrel shifters |
| Li [46] | 4n/5 | 4 adders + 2n-word ROM + 2 barrel shifters |
| Lin [47] | n + 1 | 3 adders + n/2-word ROM + 2 barrel shifters |
| Our new CORDIC | n/3 | 8 adders + n/3-word ROM + 6 barrel shifters |

Based on the above discussion, although our newly proposed CORDIC algorithm uses more adders and barrel shifters, it has a small table size and the lowest iteration number.

4.2.7 The Proposed CORDIC-based FFT PE

Instead of using a conventional complex multiplier, we can apply the new CORDIC algorithm to the processing element design of the FFT processor. In addition, we consider the special case in which the input angle is an odd multiple of 45˚, that is, when the twiddle factor is W^{N/8} with N odd. These factors have the form (√2/2)(±1 ± j), and the constant √2/2 can be realized by the following shift-and-add approximation:

$$\frac{\sqrt{2}}{2} = 0.7071068 \approx 2^{-1}+2^{-3}+2^{-4}+2^{-6}+2^{-8} = \left(1+2^{-2}\right)\left(2^{-1}+2^{-4}\right)+2^{-8} \tag{4.17}$$

Therefore, our CORDIC datapath can execute this particular rotation easily. In this case, only one cycle is needed to obtain the result using the datapath shown in Fig. 4.7, which is obtained by a minor modification of the datapath of our CORDIC processor in Fig. 4.3(b). However, this increases the control complexity of the FFT processor, because the special angles have to be distinguished from all other rotation angles. Hence, we must analyze the proportion of the special twiddle factors, (√2/2)(1 ± j) and (√2/2)(−1 ± j), and make sure that these special cases occur often enough. For the adopted radix-2² algorithm and the variable-length FFT processor, which supports power-of-4 and non-power-of-4 FFT operations as discussed in Section 3.1, Table 4.10 shows the analysis results.
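The shift-and-add multiplication by √2/2 used for these special twiddle factors can be sketched in Python as follows. The sketch assumes integer fixed-point data with truncating right shifts and shows only the factor (√2/2)(1 + j); the function names `mul_half_sqrt2` and `rotate_45` are hypothetical and do not describe the actual datapath of Fig. 4.7.

```python
import math

def mul_half_sqrt2(v):
    """Approximate v * sqrt(2)/2 as (1 + 2^-2)(2^-1 + 2^-4) v + 2^-8 v."""
    a = v + (v >> 2)                    # (1 + 2^-2) * v
    return (a >> 1) + (a >> 4) + (v >> 8)

def rotate_45(x, y):
    """Multiply x + jy by (sqrt(2)/2)(1 + j) using shifts and adds only."""
    # (x + jy)(1 + j) = (x - y) + j(x + y)
    return mul_half_sqrt2(x - y), mul_half_sqrt2(x + y)

approx = (1 + 2 ** -2) * (2 ** -1 + 2 ** -4) + 2 ** -8
print(approx, math.sqrt(2) / 2 - approx)
print(rotate_45(4096, 0))
# -> (2896, 2896)
```

The approximated constant is 0.70703125, which differs from √2/2 by about 7.6 × 10⁻⁵, below one LSB of a 12-bit word with one integer bit.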

Table 4.10 Percentages of twiddle factors equal to odd multiples of 45˚ when using the adopted radix-2² algorithm and the radix-2²/2 PE

| FFT points | 256 | 512 | 1024 | 2048 | 8192 |
|---|---|---|---|---|---|
| Percentage of W^{N/8}, N odd | 24.6% | 24.6% | 19.9% | 19.9% | 16.7% |

According to Table 4.10, we use the datapath of Fig. 4.7 to perform the butterfly operation with twiddle factor W^{N/8} (for odd N), which reduces the total iteration number of the FFT computation. The block diagram of the shared hardware is shown in Fig. 4.8.

Fig. 4.7 Implementation of the twiddle factor W^{N/8} (odd N)

Fig. 4.8 Block diagram of the proposed CORDIC-based FFT PE

In Fig. 4.8, when the control signals of the multiplexers are set to 0, the PE executes the vector rotation and on-line scale factor compensation. When the control signals are set to 1, the PE computes the trivial multiplication by the twiddle factor W^{N/8} with odd N. This shared hardware design does not increase the number of required arithmetic units.

4.3 Comparison of FFT Processing Elements

Table 4.11 compares the multiplier-based PE and the CORDIC-based PE. The 8192-point radix-2² FFT PE, with 12-bit accuracy, is synthesized with the UMC 0.18 μm standard cell library using Synopsys Design Analyzer. The multiplier-based PE includes three 1024-word twiddle factor ROM tables.

The proposed CORDIC-based PE performs the front add/sub of a butterfly operation in the first cycle, and then executes the rotation operations to carry out the butterfly complex multiplications. The average number of operation cycles is about 4.76 per butterfly computation for an 8192-point FFT.

Table 4.11 Comparison of the multiplier-based PE and CORDIC-based PE

| | Proposed CORDIC-based PE (word serial architecture) | Multiplier-based PE |
|---|---|---|
| Gate counts | 5163 | 34591 (single complex multiplier: 5746) |
| Path delay | 2.15 ns | 9.76 ns |
| Required operation cycles per butterfly computation | 4.76 (averaged) | 1 |
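A rough way to read Table 4.11 is to multiply the path delay by the number of cycles per butterfly. The Python sketch below does this under the assumption, made here and not stated in the synthesis results, that each PE is clocked at a period equal to its reported path delay; the dictionary layout and function names are illustrative only.

```python
# Figures taken from Table 4.11.
PE = {
    "cordic": {"gates": 5163, "delay_ns": 2.15, "cycles": 4.76},
    "multiplier": {"gates": 34591, "delay_ns": 9.76, "cycles": 1.0},
}

def time_per_butterfly_ns(name):
    """Path delay times cycles, assuming clock period = path delay."""
    pe = PE[name]
    return pe["delay_ns"] * pe["cycles"]

def gate_ratio():
    """Gate count of the multiplier-based PE relative to the CORDIC PE."""
    return PE["multiplier"]["gates"] / PE["cordic"]["gates"]

print(time_per_butterfly_ns("cordic"), time_per_butterfly_ns("multiplier"))
print(round(gate_ratio(), 2))
```

Under this assumption the CORDIC-based PE needs about 10.2 ns per butterfly against 9.76 ns for the multiplier-based PE, while using roughly 6.7 times fewer gates.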

Chapter 5 EDA Realization of the New Multi-Standard CORDIC-Based FFT Processor

5.1 Design Overview

The proposed design is an in-place memory-based FFT processor. The processor needs a four-bank memory that matches the in-place memory address generator for high-bandwidth data access. In order to meet the specifications of 802.16, DAB, and DVB-T, we employ a variable-length data address generator that covers five different FFT lengths: 256, 512, 1024, 2048, and 8192 points. Correspondingly, the processing element is based on the radix-2² DIF FFT algorithm and also supports non-power-of-4 FFT computation, as discussed in Chapter 3. Since we replace the conventional complex multipliers of the PE with a CORDIC processor, the ROM table that stores the twiddle factors can be eliminated. The block diagram of our design is shown in Fig. 5.1.

Fig. 5.1 Block diagram of the proposed FFT processor (blocks: SRAM banks 0 to 3, read commutator, CORDIC-based PE, write commutator, rotation angle generator, data address generator)

5.2 Components of FFT Processor

5.2.1 The Data Memory

The memory block of our FFT processor design is a 4-bank synchronous SRAM. Each SRAM bank has 2048 words of 24 bits and is generated by the Artisan™ UMC™ 0.18 µm SRAM generator. The memory word length is 12-bit for both

Source document: Design of Fast Fourier Transform Processors for Orthogonal Frequency-Division Multiplexing Systems