
2. DISCUSSION AND RESULTS

2.2 CORDIC-Based Processing Element of FFT Processor

2.2.1 The CORDIC Algorithm and Architecture

In many FFT applications, the butterfly processing element (PE) is realized with complex multipliers, which are highly complex and occupy a large chip area.

Furthermore, the twiddle factors needed for the twiddle factor multiplications must be stored in advance in a look-up table, which is generally implemented as a ROM. However, since long FFTs are common in modern applications, such as the 8192-point FFT in DVB-T, the look-up table approach becomes inefficient because of its enormous chip area cost. For example, even if we exploit the symmetry of the sinusoid function, the total ROM requirement for an 8192-point FFT with 12-bit accuracy is 2*12*8192/8 = 24576 bits ≈ 3 KB.

For this reason, the CORDIC (Coordinate Rotation Digital Computer) algorithm is proposed here as a substitute for the conventional complex multiplier and look-up table approach.

The CORDIC algorithm, developed by Volder [7] in 1959, is a generalized algorithm that can perform vectoring and rotation operations on a two-dimensional vector. The rotation operation computes the target vector from the initial vector and a given rotation angle θ, while the vectoring operation computes the angle between the start vector and the end vector. Furthermore, there are three kinds of coordinate systems: linear, circular, and hyperbolic. Walther [8] extended the algorithm to compute multiplication, division, and hyperbolic functions. Applications of CORDIC include 3-D graphics [9], [10], adaptive filters [11], floating-point units, DSP processors, and so on.

When applying the CORDIC algorithm to the FFT PE, we only consider the most popular case: the circular coordinate system with rotation mode operations. The basic theory of the CORDIC algorithm is reviewed below.

The rotation operation is approximated by a sequence of micro-rotations (elementary angles) using only shift-and-add operations, and it is therefore well suited to VLSI implementation and DSP applications. Numerous improved CORDIC algorithms and structures have been proposed since its introduction. Most CORDIC algorithms assume a constant scale factor for ease of scale factor compensation. However, they have to keep rotating even when the residual rotation angle has already converged [12], [13], [14], [15]. In some cases, they have to either perform accurate but slow decisions on the rotation directions or make rough direction decisions at the expense of extra compensation operations [12], [13]. To speed up CORDIC operations, the following techniques are widely used: (1) using a carry-free redundant addition scheme [12], [13], [16-19]; (2) deciding rotation directions quickly from only a few most significant digits (MSDs) of the control parameters [12], [13], [16-19]; (3) skipping unnecessary rotations; (4) recoding rotation angles effectively to save rotation iterations [20]; (5) applying radix-4 rotation schemes [17], [21], [22], [23] to reduce the number of iterations; and (6) predicting the rotation sequence for parallel and pipelined processing.
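As a point of reference for the shift-and-add micro-rotations described above, the following is a minimal floating-point sketch of a conventional radix-2, constant-scale-factor CORDIC in rotation mode. It is not the proposed algorithm; the function name `cordic_rotate` and the default of 16 iterations are assumptions made for illustration, and the multiplications by `2.0 ** -i` stand in for the right shifts a hardware implementation would use.

```python
import math

def cordic_rotate(x, y, theta, n_iter=16):
    """Rotate (x, y) by theta radians (|theta| up to about 1.74) with radix-2 CORDIC."""
    # Elementary angles atan(2^-i) and constant scale factor K = prod 1/sqrt(1 + 2^-2i).
    angles = [math.atan(2.0 ** -i) for i in range(n_iter)]
    k = 1.0
    for i in range(n_iter):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    z = theta
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0      # direction chosen from the residual angle
        shift = 2.0 ** -i                # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]               # update the residual rotation angle
    return x * k, y * k                  # constant scale-factor compensation

xr, yr = cordic_rotate(1.0, 0.0, 0.5)
print(xr, yr)  # close to (cos 0.5, sin 0.5)
```

Note that this baseline always performs all `n_iter` rotations, even after the residual angle has converged, which is the inefficiency the techniques listed above try to remove.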

Some of these techniques result in variable scale factors, which require complicated scale factor computation followed by costly compensation [18], [19]. Because of the considerable overhead of a variable scale factor, most existing radix-4 CORDIC algorithms resort to a constant scale factor [17], [22]. However, these constant-scale-factor CORDICs are basically hybrid radix-2 and radix-4 algorithms, so their iteration numbers are not fully reduced. Recently, we proposed CORDIC algorithms with variable scale factors [21] that skip unnecessary rotations and at the same time perform low-complexity on-line decomposition and compensation of the variable scale factors.

Specifically, the radix-4 algorithm requires fewer iterations (including rotations and compensations) than the existing radix-4 algorithms. The radix-4 CORDIC algorithm proposed in [23] is similar to the one in [21], except for the way it handles variable scale factors. Both designs share the same low iteration number of 0.8n. Although the very-high-radix CORDIC algorithm has an extremely small iteration number, its realization is irregular and needs multiplication-and-accumulation circuits, so its efficiency is highly dependent on practical circuit optimization.

To reduce the shift-and-add operations of both the rotation iterations and the scale factor compensations, we present a new table-lookup recoding scheme for rotation angles and variable scale factors. The new method speeds up both the convergence of the residual rotation angles and our fast variable scale factor decomposition and compensation algorithm [21]. To reduce the iteration number further, the new CORDIC algorithm also applies leading-one bit detection to both the residual rotation angles and the decomposition of the variable scale factors.

2.2.2 The New Angle Decomposition Scheme

To speed up convergence, we first detect the leading-one (leading-zero) bit position of the positive (negative) residual angle zi in the i-th iteration. This avoids the unnecessary rotations required by conventional CORDIC algorithms. Then the most significant r bits (denoted zi,r), counted from the leading-one (or leading-zero) bit of zi, are used to access the δm and δn information from a table. These two retrieved parameters correspond to a combined rotation angle tan^-1(2^-m) + tan^-1(2^-n) that best matches zi,r in a least-square error sense, that is, it makes zi,r – (tan^-1(2^-m) + tan^-1(2^-n)) as close to zero as possible. This approach corresponds to the iteration operation of equation (2.5), and the iteration results in the variable scale factor described by equation (2.6).
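Equations (2.5) and (2.6) do not appear in this copy of the text. As a hedged reconstruction only, assuming standard circular-mode micro-rotations and direction parameters δm, δn ∈ {−1, +1} (the sign convention is an assumption; for a positive residual angle both are +1), a form consistent with the residual update described above and with equation (2.8) would be:

$$
\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix}
=
\begin{bmatrix} 1 & -\delta_n 2^{-n} \\ \delta_n 2^{-n} & 1 \end{bmatrix}
\begin{bmatrix} 1 & -\delta_m 2^{-m} \\ \delta_m 2^{-m} & 1 \end{bmatrix}
\begin{bmatrix} x_i \\ y_i \end{bmatrix},
\qquad
z_{i+1} = z_i - \left(\delta_m \tan^{-1} 2^{-m} + \delta_n \tan^{-1} 2^{-n}\right)
$$

$$
K_i = \cos\left(\tan^{-1} 2^{-m}\right)\cos\left(\tan^{-1} 2^{-n}\right)
$$

Each 2×2 matrix is an unnormalized micro-rotation that lengthens the vector by sqrt(1 + 2^-2m) or sqrt(1 + 2^-2n), so K_i is the factor needed to restore the original length. Because m and n change from iteration to iteration, the scale factor is variable.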



In general, we may include more than two δ parameters to speed up the convergence. However, the computational complexity then increases significantly, so we only investigate the case of two combined direction parameters here; similar techniques can be extended to the general case. Based on equation (2.5), lookup tables for the residual rotation angles can be constructed by a computer search for the closest match, as mentioned before. In a sense, examining the MSB part zi,r of the residual rotation angle zi approximately amounts to a radix-2^r CORDIC algorithm. Since an optimal table depends on the iteration index i, it is better to have an optimized lookup table for each i; however, this increases the table size accordingly.
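The computer search mentioned above can be illustrated with a small brute-force sketch. It searches for the pair (m, n) whose combined angle tan^-1(2^-m) + tan^-1(2^-n) is closest to a single target angle. The function name `best_pair`, the search bound `max_shift`, and the use of one target value (rather than a least-square fit over the whole interval covered by a bit pattern, as the table construction would require) are simplifying assumptions.

```python
import math

def best_pair(target, max_shift=15):
    """Return (m, n), m <= n, minimizing |target - (atan(2^-m) + atan(2^-n))|."""
    best, best_err = None, float("inf")
    for m in range(max_shift + 1):
        for n in range(m, max_shift + 1):
            combined = math.atan(2.0 ** -m) + math.atan(2.0 ** -n)
            err = abs(target - combined)
            if err < best_err:
                best, best_err = (m, n), err
    return best

print(best_pair(0.3))  # (2, 4): atan(1/4) + atan(1/16) is about 0.3074
```

Repeating such a search for every r-bit pattern and every leading-one position k would yield one table per k, which is why sharing tables across k matters for table size.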

From the Taylor expansion, tan^-1(2^-i) ≈ 2^-i for large i. In computer simulations, we then find that two different tables are enough to give good results, as shown below.

Here, we take r = 4 bits (i.e., radix-2^4) as a design example. Table 2.6 shows the stored optimized m and n of the δm and δn patterns corresponding to zi,r. From the Taylor expansion of θk = tan^-1(2^-k), we find that the binary patterns of θ0 and θ1 are noticeably different from those of θk for k > 1. Therefore, two different tables are used for the cases k = {0, 1} and k > 1, respectively, where k is the leading-one (or leading-zero) bit position of the residual rotation angle zi.

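Table 2.6 Recoding table for the decomposition of the residual rotation angle (optimized patterns of m and n)

| θi bits (2^-k ~ 2^-(k+3)) | m (k = 0, 1) | n (k = 0, 1) | m (k > 1) | n (k > 1) |
|---|---|---|---|---|
| 1000 | k | k+3 | k | k+4 |
| 1001 | k | k+2 | k | k+3 |
| 1010 | k | k+2 | k | k+2 |
| 1011 | k | k+1 | k | k+1 |
| 1100 | k | k+1 | k | k+1 |
| 1101 | unused | unused | k | k+1 |
| 1110 | unused | unused | k-1 | k+5 |
| 1111 | unused | unused | k-1 | k+3 |

Unused entries: not needed for θ0 = 0 ~ π/4.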
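To show how Table 2.6 might be used, the sketch below decomposes a rotation angle by repeated leading-one detection and table lookup. Several details are assumptions rather than the authors' implementation: the word format (11 fractional bits), the handling of negative residuals through sign and magnitude instead of leading-zero detection, and the helper names `atan_fix` and `decompose`.

```python
import math

FRAC = 11  # assumed number of fractional bits in the angle word

# Table 2.6 as (m - k, n - k) offsets, indexed by the 4-bit pattern z_{i,r}.
TABLE_K01 = {0b1000: (0, 3), 0b1001: (0, 2), 0b1010: (0, 2),
             0b1011: (0, 1), 0b1100: (0, 1)}
TABLE_KGT1 = {0b1000: (0, 4), 0b1001: (0, 3), 0b1010: (0, 2), 0b1011: (0, 1),
              0b1100: (0, 1), 0b1101: (0, 1), 0b1110: (-1, 5), 0b1111: (-1, 3)}

def atan_fix(j):
    """atan(2^-j) rounded to FRAC fractional bits."""
    return round(math.atan(2.0 ** -j) * (1 << FRAC))

def decompose(theta, max_iter=16):
    """Return ([(sign, m, n), ...], final residual) for theta in [0, pi/4]."""
    z = round(theta * (1 << FRAC))
    steps = []
    while z != 0 and len(steps) < max_iter:
        sign = 1 if z > 0 else -1
        mag = abs(z)
        p = mag.bit_length() - 1          # bit index of the leading one
        k = FRAC - p                      # the leading one has weight 2^-k
        # Four bits starting at the leading one (zero-padded below the LSB).
        pattern = (mag >> (p - 3)) & 0xF if p >= 3 else (mag << (3 - p)) & 0xF
        dm, dn = (TABLE_K01 if k <= 1 else TABLE_KGT1)[pattern]
        m, n = k + dm, k + dn
        z -= sign * (atan_fix(m) + atan_fix(n))
        steps.append((sign, m, n))
    return steps, z

print(decompose(0.3))  # ([(1, 2, 5), (1, 6, 7)], 0)
```

For 0.3 rad the residual reaches zero after two table lookups, whereas a conventional radix-2 CORDIC at this word length would perform one rotation per bit.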

2.2.3 Table Reduction Scheme

In the above description of the new angle encoding scheme, the given table size does not include the n*p term for {tan^-1(2^-i), i = 0, 1, 2, …, n-1} required by conventional CORDIC algorithms. From the Taylor expansion, we obtain equation (2.7):

$$
\tan^{-1}(x) = x - \frac{x^{3}}{3} + \frac{x^{5}}{5} + O(x^{7}), \quad \text{where } O(x^{7}) \text{ is the maximum error range} \qquad (2.7)
$$

Table 2.7 Values of tan^-1(2^-i) for 12-bit accuracy

| i | tan^-1(2^-i) (radians) | 12-bit binary | tan^-1(2^-i) (degrees) | 12-bit binary |
|---|---|---|---|---|
| 1 | 0.463867 | 001110110110₂ | 26.565051 | 110101001000₂ |
| 2 | 0.245117 | 000111110110₂ | 14.036243 | 111000001001₂ |
| 3 | 0.124512 | 000011111111₂ | 7.125016 | 111001000000₂ |
| 4 | 0.062500 | 000010000000₂ | 3.576334 | 111001001110₂ |
| 5 | 0.031250 | 000001000000₂ | 1.789911 | 011100101000₂ |
| 6 | 0.015625 | 000000100000₂ | 0.895174 | 001110010100₂ |
| 7 | 0.007813 | 000000010000₂ | 0.447614 | 000111001010₂ |
| 8 | 0.003906 | 000000001000₂ | 0.223811 | 000011100101₂ |
| 9 | 0.001953 | 000000000100₂ | 0.111906 | 000001110010₂ |
| 10 | 0.000977 | 000000000010₂ | 0.055953 | 000000111001₂ |
| 11 | 0.000488 | 000000000001₂ | 0.027976 | 000000011100₂ |

In equation (2.7), if we need n-bit output precision and x = 2^-i, the second term can be ignored when i ≥ n/3. The value of tan^-1(2^-i) can then be obtained by shifting tan^-1(2^-(i-1)). With this method, only n/3 words are needed to store the angles tan^-1(2^-i), instead of the traditional n words. For instance, for 12-bit accuracy the numbers of tan^-1(2^-i) values that have to be stored in ROM are 4 and 5 for the radian and degree representations, respectively.
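The i ≥ n/3 rule can be checked numerically. The sketch below finds the smallest index from which tan^-1(2^-i) is indistinguishable from 2^-i at a given word length. The word format (1 integer bit, n-1 fractional bits, inferred from the radian column of Table 2.7), the half-LSB threshold, and the function name `first_shiftable_index` are assumptions.

```python
import math

def first_shiftable_index(n_bits):
    """Smallest i such that |atan(2^-j) - 2^-j| < half an LSB for all j >= i."""
    frac = n_bits - 1                      # assumed: 1 integer bit, n_bits-1 fractional bits
    half_lsb = 2.0 ** -(frac + 1)
    i = n_bits
    # Walk down from the smallest angle while the next larger one still qualifies.
    while i > 0 and abs(math.atan(2.0 ** -(i - 1)) - 2.0 ** -(i - 1)) < half_lsb:
        i -= 1
    return i

print(first_shiftable_index(12))  # 4, matching n/3 for n = 12
```

Under these assumptions only the entries below index 4 need their own ROM words in the radian representation; from index 4 on, each entry is the previous one shifted right by one bit, as the 0.062500 entry of Table 2.7 suggests.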

2.2.4 On-Line Variable Factor Compensation

For the low-complexity on-line compensation of the variable scale factor described by equation (2.6), we further improve and speed up our previous efficient variable scale factor algorithm by using an on-line variable factor compensation. The whole improved algorithm is detailed below.

Rewriting equation (2.6), K can first be transformed to

$$
K_i = \frac{1}{\sqrt{1 + 2^{-2m}}} \cdot \frac{1}{\sqrt{1 + 2^{-2n}}} \qquad (2.8)
$$

Similarly, from the Taylor expansion, we find K_i = (1 - 2^-(2m+1))(1 - 2^-(2n+1)) + O(2^-(4m+1)). From this expansion, K_i approximates 1 when i > (n/2) - 1, where n here is the output precision in bits. Therefore, the most suitable scale factor compensation values are available as soon as the rotation terms δm and δn are obtained, and the compensation can also be computed by shift-and-add operations. In each scale factor compensation, an error term O(2^-(4m+1)) remains when i < (n/4) - 1. This error is stored and then compensated just after the rotation operations, when i > (n/2) - 1.
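The first-order approximation of each scale factor term and the size of its error can be checked with a short numerical sketch. The function name `scale_factor_error` is hypothetical, and floating-point arithmetic is used in place of the fixed-point datapath.

```python
import math

def scale_factor_error(m):
    """Difference between the exact factor 1/sqrt(1 + 2^-2m) and 1 - 2^-(2m+1)."""
    exact = 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * m))   # equals cos(atan(2^-m))
    approx = 1.0 - 2.0 ** -(2 * m + 1)               # one shift and one subtraction
    return exact - approx

for m in range(6):
    # The error stays positive and below the 2^-(4m+1) bound quoted in the text.
    print(m, scale_factor_error(m), 2.0 ** -(4 * m + 1))
```

The error is roughly 3 * 2^-(4m+3), so it drops below the word-length resolution quickly as m grows, which is why only the early, coarse rotations leave an error worth storing.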

2.2.5 The Overall Operation Flow

In summary, by combining the leading-one bit detection scheme, the residual recoding technique, and the on-line variable scale factor compensation, we obtain a CORDIC algorithm with the following steps:

(1) Set the initial iteration number i = 0, initial residual angle z0 = θ, initial rotation vector (x0, y0) = (x, y), and initial exponent residual T0 = 0. If θ = 0, then (x’, y’) = (x, y), and exit the rotation iteration. Otherwise, proceed to step (2).

(2) Check the leading-one bit position k and obtain zi,r from zi. If zi ≠ 0, go to step (3). Otherwise, zi = 0: the rotation operations are completed; set the total iteration number I = i-1 and go to step (5).

(3) Using zi,r, retrieve the optimized m and n of δm and δn, and then get the values of tan^-1(2^-m) and tan^-1(2^-n) from the lookup tables. Perform the iteration shown in equation (2.5) and update zi+1 = zi – (tan^-1(2^-m) + tan^-1(2^-n)). Then apply the variable scale factor compensation.

If i < (n/4 - 1):

$$
x'_{i+1} = x_{i+1}\left(1 - 2^{-(2l+1)}\right), \qquad y'_{i+1} = y_{i+1}\left(1 - 2^{-(2l+1)}\right),
$$

with l taking the values m and n, and the compensation error is stored as e_i = m_i.

If i > (n/4 - 1): the update of x_{i+1} and y_{i+1} has the same shift-and-add form, and the stored errors e are compensated as well.

(4) Set i=i+1, go to step (2).

(5) The calculation is complete, and the output values are (xi+1, yi+1).
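The vector update in steps (3) to (5) can be sketched as follows for a given list of double micro-rotations. This is a simplified floating-point illustration: the function name `rotate_with_pairs` is hypothetical, and the variable scale factor is compensated with the exact value 1/sqrt(1 + 2^-2j) after each micro-rotation, which is an assumption standing in for the shift-and-add approximation and stored-error correction described in the text.

```python
import math

def rotate_with_pairs(x, y, steps):
    """Apply double micro-rotations (sign, m, n) with variable scale-factor compensation."""
    for sign, m, n in steps:
        for j in (m, n):
            t = sign * 2.0 ** -j                  # a barrel shift in hardware
            x, y = x - t * y, y + t * x           # unnormalized micro-rotation
            # Assumption: exact compensation instead of the shift-and-add form.
            k = 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * j))
            x, y = x * k, y * k
    return x, y

# Example pairs for a 0.3 rad rotation, following Table 2.6:
# pattern 1001 with k = 2 gives (2, 5); pattern 1100 with k = 6 gives (6, 7).
steps = [(1, 2, 5), (1, 6, 7)]
xr, yr = rotate_with_pairs(1.0, 0.0, steps)
print(xr, yr)  # close to (cos 0.3, sin 0.3)
```

With exact compensation the result is a pure rotation by the sum of the four elementary angles, about 0.29965 rad, so the remaining error comes only from the angle quantization.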

Fig. 2.13 shows the architecture of our new CORDIC processor. For high-speed operation, the operations can also be arranged as a cascaded pipelined structure. The pipelined structure is particularly efficient for applications that require intensive and sustained vector rotation operations.

[Figure: block diagram. Labeled blocks: barrel shifters with adders/subtractors (±), ROM (tan^-1(2^-i)), leading-one detection circuit, ROM (angle encode table), scale factor error accumulator T(i), and ROM (ln(cos(tan^-1(2^-i))) − ln(1 − 2^-(2i+1))). Labeled signals: x(i), y(i), m(i), n(i), residue angle θ(i+1), x(i+1), y(i+1), m(i+1), n(i+1), θ(i+2), T(i+1).]

Fig. 2.13 The structure of the new CORDIC algorithm

2.2.6 Simulation Results

Based on the structure shown in Fig. 2.13, we performed fixed-point hardware simulations using Matlab and the Verilog hardware description language, assuming 8-bit, 12-bit, and 16-bit accuracy (including a 2-bit integer part). Exhaustive simulations were conducted for all rotation angles in the range 0° to 45°. The simulation results are shown in Table 2.8.

Table 2.8 Simulation results for different output precisions with our new CORDIC algorithm

| | Output precision (bits) | 8 | 12 | 16 |
|---|---|---|---|---|
| Average | Angle composition | 1.835 | 2.727 | 3.644 |
| Average | Scale factor composition | 1.786 | 3.092 | 4.153 |
| Average | Overall iteration | 2.437 | 3.482 | 4.424 |
| Worst case | Angle composition | 3 | 4 | 5 |
| Worst case | Scale factor composition | 4 | 5 | 6 |
| Worst case | Overall iteration | 4 | 5 | 6 |
