

Chapter 3 Data Address Generation and Twiddle Factor Generation

3.2 Twiddle Factor Generation

3.2.2 The Proposed Recursive Twiddle Factor Generator

The twiddle factors can be efficiently generated with polynomial-based DDFS designs [22-28], which are based on the concept of piecewise polynomial approximation to a function value. They ordinarily need memory space to store polynomial coefficients and/or some function values, plus several multiplication and addition operations per output function point. However, the complexity of polynomial-based algorithms grows significantly with the output precision: if a higher output precision is desired, a higher computation complexity is required.

The twiddle factor generator can also be realized with a recursive function. The recursive sine function generator [29] is based on the well-known recursive feedback difference equation for computing sine and cosine functions. The approach's key advantage is low complexity: for combined sine and cosine generation, it needs only two multiplications and two additions per output point, which is less complicated than the mentioned table-lookup and polynomial designs. However, its serious drawback is the error propagation introduced by finite-precision computation, due to its error feedback character. Al-Ibrahim [30] proposed a solution to alleviate the error-propagation problem, but this approach needs two extra multiplications and the corresponding area to correct the propagated error.

By considering the low-complexity merit of the recursive sine function generator, we attempt to reduce its main drawback, the error-propagation problem, with low overhead [21]. We first analyze the error propagation bound of this approach and derive two error correction tables (one each for the sine and cosine functions) to compensate for the propagated finite-precision error. In all, the new design needs two constant multipliers and two adders to calculate these two functions. The correction table entries are at most three bits long, regardless of the output word length. As a result, the proposed design has a smaller area than the mentioned designs in most cases. We detail it as follows.

Let us first consider the generation of the imaginary part of a twiddle factor, that is, the sine function x[n] = (sin nθ)u[n]. Taking the z-transform, one gets equation (3.10) [31]:

X(z) = (sinθ · z^-1) / (1 − 2cosθ · z^-1 + z^-2)   (3.10)

From this result, the corresponding difference equations for generating the sine function, and similarly the cosine function, are:

sin nθ = 2cosθ · sin (n−1)θ − sin (n−2)θ
cos nθ = 2cosθ · cos (n−1)θ − cos (n−2)θ   (3.11)

Error propagation arises in finite-precision calculation because the previously generated outputs feed back into the computation of the present output.
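As a floating-point illustration of these difference equations (the function name is ours, not part of the design):

```python
import math

def recursive_sin_cos(theta, n_points):
    """Generate sin(n*theta) and cos(n*theta) with the recursion
    x[n] = 2*cos(theta)*x[n-1] - x[n-2].  A hardware version would use
    fixed-point arithmetic, which is where error propagation appears."""
    c = 2.0 * math.cos(theta)
    sins = [0.0, math.sin(theta)]   # initial values sin(0), sin(theta)
    coss = [1.0, math.cos(theta)]   # initial values cos(0), cos(theta)
    for n in range(2, n_points):
        sins.append(c * sins[-1] - sins[-2])   # one multiply, one add
        coss.append(c * coss[-1] - coss[-2])
    return sins[:n_points], coss[:n_points]

s, co = recursive_sin_cos(2 * math.pi / 16, 16)
```

Each output point costs only one multiplication and one addition per function, which is the low-complexity merit noted above.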

The reduction method for finite-precision error propagation in [30] needs extra multipliers, adders, and computation. To achieve low-complexity error correction, here we include two error correction tables to recover the full-precision outputs. To decide the word length of the correction tables, we first analyze the upper bound of the finite-precision error propagation, as expressed in equation (3.12).

Therefore, the maximum error is confined to the two least significant bits of a B-bit output; that is, the analysis shows a maximum two-bit error. A similar analysis result can be obtained for the cosine function.

From this result, we can conclude that an error correction word of three bits per output point suffices. Specifically, in every iteration cycle, we replace the 3 LSBs of the currently generated output with the correction value stored in the correction tables. As a result, there is no error propagation problem, and all the generated twiddle factors are guaranteed full B-bit precision. In all, we need two tables (one for cosine and the other for sine values) of 3×N/8 bits each for a total of N twiddle factor values in the angle range of 0 to 2π. The factor 1/8 comes from table redundancy reduction by exploiting the symmetry of the trigonometric functions, which greatly reduces the required table size.
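A functional model of the corrected fixed-point recursion may help. Here the 3-bit table entry is derived inline from the full-precision reference value, whereas the hardware would store precomputed entries (with symmetry reduction); all names are ours, and Python integers sidestep the overflow handling a real B-bit datapath needs:

```python
import math

B = 16                       # output word length in bits
SCALE = 1 << (B - 1)         # fixed-point scale (B-1 fractional bits)

def to_fixed(v):
    return int(round(v * SCALE))

def corrected_sine_sequence(N):
    """Model of the corrected generator for sin(2*pi*n/N), n = 0..N-1.
    Each iteration uses one multiply and one subtract; the 3-bit table
    entry then restores the full-precision LSBs of the output."""
    theta = 2 * math.pi / N
    c = to_fixed(2 * math.cos(theta))            # quantized 2cos(theta)
    out = [to_fixed(0.0), to_fixed(math.sin(theta))]
    for n in range(2, N):
        # finite-precision recursion with a truncating multiply
        x = ((c * out[-1]) >> (B - 1)) - out[-2]
        # 3-bit correction: the table entry encodes the signed difference
        # to the full-precision value (bounded by the two-bit error analysis)
        ref = to_fixed(math.sin(n * theta))
        d = ((x - ref + 4) & 7) - 4              # signed error in [-4, 3]
        out.append(x - d)                        # now exactly full precision
    return out
```

Because each output is corrected before it feeds back, the recursion inputs are always full precision and the per-cycle error stays within the 3 corrected LSBs.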

Figure 3.13(a) shows the architecture of the proposed generator for the real part (the cosine function) of a twiddle factor. The same structure applies to the imaginary part (the sine function), except that an additional table for the initial sine values sinθ must be included. Since the sine and cosine generators are exactly the same, they can also be combined into one interleaving structure by adding simple control, a few registers, and a MUX after the correction-table stage, as shown in Fig. 3.13(b).

(a) Architecture of the proposed generator for real part (cosine) of a twiddle factor

(b) Combined twiddle factor generator architecture for interleaving computation of real and imaginary parts

Fig. 3.13 Architecture of the proposed twiddle factor generator

Considering the radix-2 DIF N-point FFT as an example, where N = 2^K, the operation steps of the proposed design can be summarized as:

(1) Set the FFT stage number k=0.

(2) Set the FFT subgroup number i=0, within the k-th stage.

(3) Set the initial values sin nθ = 0 and cos nθ = 1, with n = 0, and output the twiddle factor values to the FFT processing element.

(4) Load cosθ = cos(2π/M) and sinθ = sin(2π/M) from a small table of size K−2 = (log2 N)−2 words, where M = N/2^k. Output the twiddle factor values to the FFT processing element. Set the twiddle factor sequence number n = 2.

(5) Compute the finite-precision sin nθ and cos nθ according to equation (3.11). Replace the 3 LSBs of sin nθ and cos nθ with the corresponding 3 correction bits retrieved from the correction tables.

(6) Output the corrected sin nθ and cos nθ values to the FFT processing element for twiddle factor multiplication.

(7) If n < M−1, then set n = n+1 and go to step (5); otherwise proceed to the next step.

(8) If i < 2^k − 1, set i = i+1 and go to step (3); otherwise set k = k+1.

(9) If k < K, go to step (2); otherwise the twiddle factor computation ends.
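Under the assumption that stage k contains 2^k subgroups, the control flow of steps (1)-(9) can be sketched in floating point (names are illustrative; the real datapath is fixed-point with the correction tables):

```python
import math

def dif_fft_twiddles(N):
    """Enumerate twiddle factor values (cos, sin) stage by stage for a
    radix-2 DIF N-point FFT, mirroring steps (1)-(9)."""
    K = int(math.log2(N))
    schedule = []                            # (stage k, subgroup i, n, cos, sin)
    for k in range(K):                       # steps (1)/(9): loop over stages
        M = N >> k                           # step (4): M = N / 2**k
        c1, s1 = math.cos(2 * math.pi / M), math.sin(2 * math.pi / M)
        for i in range(1 << k):              # steps (2)/(8): subgroups of stage k
            cs = [(1.0, 0.0), (c1, s1)]      # steps (3)-(4): n = 0 and n = 1
            for n in range(2, M):            # steps (5)-(7): recursive outputs
                cs.append((2 * c1 * cs[-1][0] - cs[-2][0],
                           2 * c1 * cs[-1][1] - cs[-2][1]))
            schedule.extend((k, i, n, c, s) for n, (c, s) in enumerate(cs))
    return schedule
```

Only the two initial values per subgroup come from the small cos(2π/M), sin(2π/M) table; all later points are produced by the recursion.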

3.2.3 Comparison

Table 3.8 shows the simulation results of the proposed twiddle factor generator for N-point FFT operations with 16-bit output precision, where N = 64, 256, 512, 1024, 2048, 4096, and 8192. The synthesized total gate counts and delays are based on the UMC 0.18 µm standard cell library, subject to a maximum delay constraint of 5 ns.

From this table, since the areas and delay times increase very slowly with increasing FFT length, we can conclude that both area and delay are dominated by the multipliers. The area taken by the correction tables is relatively insignificant. The small synthesized table sizes are partly due to the efficient EDA optimization process when synthesizing the tables with combinational logic cells. Moreover, since our design requires the fewest multipliers among the existing designs (except for the table-lookup approach), we expect good area performance compared with the other designs, as shown in Table 3.9 for 16-bit precision generators.

The table summarizes the simulated performance results of our design, the polynomial-based designs [23-25], the ROM-based design, and the improved recursive twiddle factor generator [30], in terms of gate count and maximum operating frequency for FFT lengths of 64, 4096, and 8192. The proposed design has better area performance and comparable speed to the existing designs in most cases, from small lengths to large lengths. Compared with the conventional ROM-based twiddle factor generators, the proposed design obviously costs much less area, especially for moderate to high output word lengths. From the simulation results, we find that the ROM-based design is suitable only for smaller FFT lengths, roughly below 512.

Table 3.8 Area and speed simulation results of the proposed twiddle factor generator, with word length B=16 bits

FFT length Gate counts Delay(ns) Max. operation freq

64 2316 2.51 398MHz

256 2437 2.71 369MHz

512 2798 2.83 353MHz

1024 3267 2.95 338MHz

2048 3612 3.01 332MHz

4096 4512 3.15 317MHz

8192 8983 3.21 311MHz

Table 3.9 Performance comparison of twiddle factor generators, B=16

Design            64-point                 4096-point               8192-point
                  Gate counts  Delay (ns)  Gate counts  Delay (ns)  Gate counts  Delay (ns)

Proposed 2316 3.01 4512 3.65 8983 3.71

Polynomial [23] 9812 4.12 9812 4.12 9812 4.12

Polynomial [24] 9125 4.23 9125 4.23 9125 4.23

Polynomial [25] 9274 4.57 9274 4.57 9274 4.57

ROM based 534 2.21 18198 2.40 36742 2.83

Recursive [30] 9879 2.89 9879 2.89 9879 2.89

Chapter 4

Processing Elements of FFT Processor

4.1 Multiplier-based Processing Element

The butterfly processing element is mostly realized with complex multipliers, and the complex multipliers dominate the chip area of the combinational circuit in the FFT processor. The general equation for complex multiplication is shown in equation (4.1), where Xr and Xi are the real and imaginary parts of the input data from the previous stage, respectively, and W is the twiddle factor.

outr + j outi = (Xr×Wr − Xi×Wi) + j (Xr×Wi + Xi×Wr)   (4.1)

We need two real additions and four real multiplications to perform the complex multiplication of equation (4.1). However, the equation can be rewritten as equation (4.2), which reduces the complexity to five real additions and three real multiplications at the cost of more algorithm steps. Another implementation scheme for equation (4.2), which also has lower cost but a longer critical path than that of (4.1), stores the values of Wr−Wi and Wr+Wi in the look-up table instead of Wr and Wi.

outr + j outi = [(Xr − Xi)×Wi + Xr×(Wr − Wi)] + j [(Xr − Xi)×Wi + Xi×(Wr + Wi)]   (4.2)

From the above discussion, three complex multiplier-based PE structures for equations (4.1) and (4.2) are shown in Fig. 4.1, and their arithmetic complexity and critical paths are compared in Table 4.1. Choosing among these structures for a target FFT processor is a tradeoff between cost and performance. However, the maximum FFT length supported by the memory-based FFT processor for DVB-T is 8192 points, so the area of the 8192-entry main data memory is much larger than that of the complex multipliers. Therefore, in our variable-length FFT processor, we still utilize equation (4.1) for high-speed performance.

Fig. 4.1 (a) Complex multiplier structure of equation (4.1)
(b) Complex multiplier structure Type I of equation (4.2)
(c) Complex multiplier structure Type II of equation (4.2)

Table 4.1 Comparison of complexity and critical path of complex multipliers

Implementation                                Real adders  Real multipliers  Critical path
Complex multiplier of equation (4.1)               2              4          1Tmul + 1Tadd
Complex multiplier Type I of equation (4.2)        5              3          1Tmul + 2Tadd
Complex multiplier Type II of equation (4.2)       4              3          1Tmul + 2Tadd
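As a cross-check of the operation counts above, the two schemes can be modeled behaviorally (a sketch, not the thesis datapath; function names are ours):

```python
def cmul_4mul(xr, xi, wr, wi):
    """Equation (4.1): four real multiplies, two real adds."""
    return xr * wr - xi * wi, xr * wi + xi * wr

def cmul_3mul(xr, xi, wr_plus_wi, wr_minus_wi, wi):
    """Equation (4.2), Type II: the sums wr+wi and wr-wi come from the
    look-up table, leaving three multiplies but a longer critical path
    (one multiply plus two dependent adds)."""
    shared = (xr - xi) * wi              # product shared by both outputs
    out_r = shared + xr * wr_minus_wi
    out_i = shared + xi * wr_plus_wi
    return out_r, out_i

a = cmul_4mul(3.0, 2.0, 0.6, -0.8)
b = cmul_3mul(3.0, 2.0, 0.6 + (-0.8), 0.6 - (-0.8), -0.8)
```

Expanding the Type II expressions shows the shared product cancels the cross terms, so both routines compute the same complex product.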

4.2 CORDIC-based Processing Element

In most FFT applications, the butterfly processing elements (PE) are often realized with complex multipliers, which have high complexity. Further, the twiddle factors must be stored in advance in a look-up table, generally implemented as a ROM. Since long FFTs are common in modern applications, such as the 8192-point FFT in DVB-T, the look-up table approach becomes inefficient because of its enormous chip area. For example, even if we exploit the symmetry of the sinusoid function, the total ROM requirement for an 8192-point FFT with 12-bit accuracy is 2×12×8192/8 = 24576 bits ≈ 3 KB. For this reason, the CORDIC (Coordinate Rotation Digital Computer) algorithm is proposed here to replace the conventional complex multiplier and look-up table approach.

The CORDIC algorithm, developed by Volder [32] in 1959, is a generalized algorithm that can perform vectoring and rotation operations on a two-dimensional vector. The rotation operation computes the target vector from the initial vector and a given rotation angle θ, while the vectoring operation computes the angle between the start vector and the end vector. Furthermore, there are three different coordinate systems: linear, circular, and hyperbolic. Walther [33] extended the algorithm to compute multiplication, division, and hyperbolic functions. CORDIC-based designs are applied in 3-D graphics, adaptive filters, floating-point units, DSP processors, and so on.

When employing the CORDIC algorithm in FFT PE design, we investigate only the most popular circular coordinate system and the rotation-mode operations. The basic theory of the rotation mode of the CORDIC algorithm is reviewed in the following section.

4.2.1 Review of CORDIC Algorithm and Implementation

Given a vector p(x, y) to be rotated by an elementary angle αi = tan^-1 2^-i, the rotated vector p'(x', y') can be expressed as equation (4.3), where δi ∈ {+1, −1}:

x' = cos αi · (x − δi y tan αi) = Ki (x − δi y 2^-i)
y' = cos αi · (y + δi x tan αi) = Ki (y + δi x 2^-i)   (4.3)

with Ki = cos αi = 1/√(1 + 2^-2i). Hence, the computation of p(x, y) can be separated into two parts: the shift-and-add operations xi ∓ δi yi 2^-i and yi ± δi xi 2^-i, and the compensation of the scale factor K = ∏ Ki. Separating the scale factor, equation (4.5) can be rewritten as equation (4.6), where zi+1 is the residual rotation angle in the i-th iteration and the total number of shift-and-add iterations is n for n-bit accuracy:

xi+1 = xi − δi yi 2^-i
yi+1 = yi + δi xi 2^-i
zi+1 = zi − δi tan^-1 2^-i   (4.6)

We have to perform the scale factor compensation of equation (4.7) after the iterative rotation operations:

xout = K·xn,  yout = K·yn,  K = ∏ 1/√(1 + 2^-2i)   (4.7)

Since CORDIC approximates rotation by a sequence of micro-rotations (elementary angles) using only shift-and-add operations, it is well suited to VLSI implementation. Recently, several improved CORDIC algorithms and structures have been proposed, focusing on reducing the iteration count and speeding up the convergence rate.
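A minimal floating-point sketch of the conventional rotation-mode iterations, with the scale factor compensated by a final multiply (the function name and iteration count n are illustrative):

```python
import math

def cordic_rotate(x, y, angle, n=20):
    """Rotation-mode CORDIC in the circular coordinate system:
    n shift-and-add micro-rotations, then one scale-factor multiply."""
    z = angle                                # residual rotation angle
    for i in range(n):
        d = 1 if z >= 0 else -1              # rotation sign delta_i
        x, y = x - d * y * 2.0**-i, y + d * x * 2.0**-i
        z -= d * math.atan(2.0**-i)          # shrink the residual angle
    K = math.prod(1.0 / math.sqrt(1.0 + 2.0**(-2 * i)) for i in range(n))
    return K * x, K * y                      # compensated outputs

cx, sy = cordic_rotate(1.0, 0.0, 0.5)        # (cos 0.5, sin 0.5)
```

Rotating (1, 0) by θ yields (cos θ, sin θ) for angles within the conventional convergence range (|θ| up to about 1.74 rad), which is why a CORDIC PE can replace both the complex multiplier and the twiddle ROM.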

Since the carry-propagation delay of adders accounts for a large portion of the circuit delay, it is advantageous to replace the binary adders with carry-save adders, especially at high bit accuracy [34]. As the carry delay of a carry-save adder is independent of the word length, the complexity of the basic cycle is O(1) and the total computing time is O(n) for n-bit CORDIC operations. Unfortunately, if redundant arithmetic (carry-free addition) is employed and only a few MSBs of the redundant adder result are examined, the rotation signs δi fall in the set {+1, 0, −1} instead of {+1, −1}, and the overall scale factor K is no longer constant.

Many schemes exist to solve the compensation problem of a variable scale factor. The double rotation method proposed in [35] can not only reduce the iteration count but also keep the scale factor constant. In [36], the prediction rotation method was proposed; it also combined signed-digit and Wallace-tree concepts to optimize the micro-rotations and the total hardware complexity. In [37], Antelo proposed a radix-4 CORDIC algorithm to reduce the iteration count with a zero-skipping scheme. In [38], Rao improved Antelo's radix-4 CORDIC algorithm and sped up the convergence rate twofold. Although very-high-radix CORDIC [39], [40] can improve performance noticeably, its realization is irregular and needs multiplication-and-accumulation circuits, so its efficiency depends strongly on practical circuit optimization. In [41], a redundant constant-factor CORDIC with merged scaling operations is employed to reduce hardware complexity. A differential CORDIC algorithm for constant-scale-factor redundant implementation without correction iterations is proposed in [42]. The angle encoding methods proposed in [43] and [44] reduce the iteration count and find the scale factor compensation sequences easily using a semi-greedy algorithm.

However, the convergence rate and hardware complexity of all the above methods are still unsatisfactory. Analyzing these methods leads to two conclusions:

1. If the scale factor is not constant, extra effort is needed for its computation and compensation.

2. If the scale factor is constant, the CORDIC convergence rate is slow.

Our lab previously proposed a fast variable scale factor compensation method [45] that achieves a good convergence rate. Moreover, the radix-4 CORDIC algorithm [46] with an on-line variable scale factor compensation method not only handles the variable scale factors easily but also compensates them efficiently. In [47], we also proposed a fast prediction CORDIC algorithm with a small iteration count. Furthermore, a radix-16 CORDIC algorithm that needs only about 3n/5 iterations on average for n-bit output accuracy is proposed in [48].

Considering the high throughput required of FFT processors in the mentioned OFDM applications, the number of CORDIC iterations should be reduced as much as possible. If the iteration count is not small enough, expensive pipelined FFT architectures may be needed to meet the computation requirement.

Therefore, to reduce the iteration count, we propose a new on-line scale factor compensation method combined with a new angle recoding scheme that outperforms all the mentioned methods. To reduce the iteration count significantly, the new algorithm also applies leading-one bit detection to the residual rotation angles, so that redundant iterations can be skipped.

4.2.2 New Angle Decomposition and Table Size Reduction Schemes

We employ leading-one bit detection on the residual angle zi in the i-th iteration to speed up the convergence rate. This operation skips the unnecessary rotations that conventional CORDIC algorithms would perform.

In [48], the rotation signs δsi and δti, stored in advance in a look-up table, are optimized according to zi,r, where zi,r denotes the most significant r bits counted from the leading-one bit of zi, and the combined rotation angle that best matches zi is performed. When the residual angle zi is negative, the leading-one bit detection is replaced by leading-zero bit detection.


In general, one may include more than two non-zero δm terms to speed up the convergence rate further. However, this significantly increases the computational complexity. Therefore, we investigate only the case of two combined parameters here; similar techniques can be extended to the general case.

Since δm ∈ {+1, −1}, the look-up table must store both the index values (s and t) and the sign value (+1 or −1) of δm. However, with a similar convergence rate, we can replace the δm set by δm ∈ {1}. Consequently, we only have to record the index values (s and t) of δm in the look-up table, and the iteration operation can be simplified as in equation (4.9). For a positive residual angle zi, the rotation direction di is assigned +1, executing a counterclockwise rotation; when the residual angle zi is negative, di is assigned −1, performing a clockwise rotation. For simplicity, we detect the leading-one bit position of the absolute value of the residual angle zi to get zi,r and the parameters si and ti.

The optimized parameters are found by full search. By examining only the MSB part zi,r of the residual rotation angle zi, we reduce the residual angle approximately as a radix-2^r algorithm would. The scheme has two design parameters: r and the number of δm terms. In our design, we pick r = 4 and two δm terms.
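The decomposition can be illustrated with a small software search standing in for the recoding table (the function name and search bounds are our assumptions, not the thesis hardware):

```python
import math

def leading_one_rotation_pair(z, r=4):
    """Find the leading-one bit position k of a residual angle z in
    (0, pi/4), then pick exponents (s, t) so the combined angle
    atan(2**-s) + atan(2**-t) best matches z.  Hardware would read
    (s, t) from the r-bit recoding table; here we search directly."""
    k = -int(math.floor(math.log2(z)))       # leading-one bit position of z
    best = None
    for s in range(max(k - 1, 0), k + 2):    # table rows allow s = k-1 and k
        for t in range(s, s + 8):
            err = abs(z - (math.atan(2.0**-s) + math.atan(2.0**-t)))
            if best is None or err < best[0]:
                best = (err, s, t)
    return best[1], best[2]

s, t = leading_one_rotation_pair(0.3)
```

One combined micro-rotation chosen this way removes several bits of the residual angle at once, which is how redundant iterations are skipped.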

Since the look-up table depends on the iteration index i, it would be best to have an optimized look-up table for each i; however, this would increase the table size significantly. For simplicity, we separate the tables into only two cases, k = {0, 1} and k > 1, according to the Taylor expansion of θk, where θk = tan^-1 2^-k and k is the leading-one bit position of the residual rotation angle zi. Computer simulations show that using only these two tables, instead of a table for each i, gives sufficiently good results. The optimized si and ti of the δsi and δti parameters corresponding to the zi,r information are shown in Table 4.2. According to the symmetry of the complex-number coordinates, we can achieve the convergence range of [+π, −π] by performing rotations in 0 ~ π/4; therefore, the table only covers the input angle range from 0 to π/4.

In the design of the new angle encoding scheme, we must include an angle table for the terms tan^-1 2^-m, where m = 1, 2, …, n. The angle table size can be reduced according to the Taylor expansion in equation (4.11) with x = 2^-m:

tan^-1 x = x − x^3/3 + x^5/5 − …   (4.11)

If the accuracy of the output data is n bits, we can ignore the second term of equation (4.11) when m > n/3. By this method, we only need to store about n/3 words of the angles tan^-1 2^-m instead of n words. Moreover, for m > n/3 we can get the tan^-1 2^-m value by shifting the tan^-1 2^-(m-1) value right by 1 bit. Table 4.3 shows the 12-bit tan^-1 2^-m values in both radians and degrees.
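The table-size claim can be checked numerically; the sketch below (variable names ours) verifies at n = 12 bits that for m > n/3 the atan values collapse to plain right shifts:

```python
import math

n = 12                                   # output accuracy in bits

# For m > n/3, the cubic Taylor term of atan(2**-m) is below one ulp
# (2**-n), so tan^-1(2**-m) ~= 2**-m: each further table entry is just
# the previous one shifted right by one bit and need not be stored.
for m in range(1, n + 1):
    taylor_remainder = abs(math.atan(2.0**-m) - 2.0**-m)
    if m > n / 3:
        assert taylor_remainder < 2.0**-n

stored_words = math.ceil(n / 3)          # roughly n/3 entries suffice
```

The remainder behaves like 2^-3m/3, so it drops below the n-bit ulp exactly when 3m > n, matching the n/3 storage estimate above.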

Table 4.2 Recoding table for the decomposition of the residual rotation angle, r = 4

                          Optimized parameters si, ti of δsi and δti
zi,r (2^-k ~ 2^-(k+3))         k = 0, 1              k > 1
                              si      ti            si      ti

1000 k k+3 k k+4

1001 k k+2 k k+3

1010 k k+2 k k+2

1011 k k+1 k k+1

1100 k k+1 k k+1

1101 Unused, for θ0=0 ~ π/4 k k+1

1110 Unused, for θ0=0 ~ π/4 k-1 k+5

1111 Unused, for θ0=0 ~ π/4 k-1 k+3

Table 4.3 Angle table of tan^-1 2^-m values for 12-bit accuracy

m   tan^-1 2^-m (radian)           tan^-1 2^-m (degree)
1   0.463867 (001110110110₂)       26.565051 (110101001000₂)
2   0.245117 (000111110110₂)       14.036243 (111000001001₂)
3   0.124512 (000011111111₂)       7.125016 (111001000000₂)
4   0.062500 (000010000000₂)       3.576334 (111001001110₂)
5   0.031250 (000001000000₂)       1.789911 (011100101000₂)
6   0.015625 (000000100000₂)       0.895174 (001110010100₂)
7   0.007813 (000000010000₂)       0.447614 (000111001010₂)
8   0.003906 (000000001000₂)       0.223811 (000011100101₂)
9   0.001953 (000000000100₂)       0.111906 (000001110010₂)
10  0.000977 (000000000010₂)       0.055953 (000000111001₂)
11  0.000488 (000000000001₂)       0.027976 (000000011100₂)

4.2.3 A New On-line Variable Scale Factor Compensation Scheme

For low-complexity decomposition and compensation of variable scale factors, our lab previously proposed an on-line variable scale factor compensation method [45], [46], briefly introduced below. The variable scale factor K can first be transformed to T as shown in equation (4.12), where the total iteration number is denoted as I:

T = log2 K = −(1/2) ∑ (i = 0 to I−1) log2(1 + 2^-2i)   (4.12)

The T value can be accumulated simultaneously with the rotation iterations, where the required terms can be obtained from a lookup table. At the end of all the rotation operations, we can decompose T into a sequence of