
Chapter 1 Introduction

1.3 Organization of the Thesis

In Chapter 2, we review fundamental theory, including some key FFT algorithms based on Cooley-Tukey decomposition and popular FFT architectures.

Chapter 3 discusses the address generation of memory-based FFT processors for both fixed-length and variable-length FFTs. Designs of the data address generator, coefficient index generator, and processing element adapted to variable-length FFT applications will be proposed and explained in this chapter. Moreover, a new twiddle factor generator with smaller area and lower complexity will be detailed as well.

In Chapter 4, a new CORDIC algorithm that combines a new on-line scale factor compensation method with a new rotation angle recoding scheme is proposed; it is the key contribution of this thesis. The architecture of a CORDIC-based FFT PE built on the new CORDIC algorithm is also presented in this chapter.

Finally, we synthesize a memory-based variable-length FFT processor with the proposed CORDIC-based PE in Chapter 5.

At the end of this thesis, brief conclusions and ideas on future work will be presented in Chapter 6.

Chapter 2

Review of FFT Algorithms and Architectures

2.1 Introduction

The Discrete Fourier Transform (DFT) is widely applied in communications and digital signal processing. However, the computational complexity of a direct implementation of the DFT is too high to meet high-performance and low-cost design goals. Therefore, a fast DFT algorithm is indispensable. Since the seminal paper by Cooley and Tukey in 1965 [1], numerous improvements have been made to the FFT algorithm. The Cooley-Tukey based FFT algorithms reduce the computational complexity from O(N^2) to O(N log N). Generally, these types of FFT algorithms can be described as radix-2^n FFT algorithms, which provide flexible implementation.

There are two basic classes of FFT algorithms. One is the decimation-in-time (DIT) decomposition, which rearranges the DFT equation by decimating and regrouping the time-domain signals. The other is the decimation-in-frequency (DIF) decomposition, which rearranges the DFT equation by decimating and regrouping the frequency-domain signals. Both DIF and DIT algorithms exploit the symmetry and periodicity of the complex exponential W_N^{kn} = e^{-j(2π/N)kn}, combined with the idea of divide-and-conquer, to reduce the multiplicative computation of the DFT. Since DIF and DIT algorithms have the same computational complexity and signal flow, we focus only on DIF FFT algorithms.

2.2 Key FFT Algorithms

The basic N-point DFT of a sequence x(n) is defined as:

X(k) = Σ_{n=0}^{N-1} x(n) W_N^{nk},  k = 0, 1, ..., N-1,  where W_N = e^{-j(2π/N)}   (2.1)
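As a point of reference for the fast algorithms that follow, equation (2.1) can be evaluated directly. The following Python sketch is purely illustrative (not part of the processor design) and makes the O(N^2) cost of the direct DFT explicit:

```python
import cmath

def dft(x):
    """Direct N-point DFT of equation (2.1): X(k) = sum_n x(n) * W_N^{nk}."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N))
            for k in range(N)]
```

Each of the N outputs sums over all N inputs, hence the O(N^2) multiplicative cost that motivates the FFT.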

2.2.1 Radix-2 DIF FFT Algorithm

The radix-2 DIF FFT algorithm is developed by partitioning the frequency samples into an even part and an odd part. By dividing the DFT outputs into even-indexed and odd-indexed parts and utilizing the symmetry of the twiddle factor, the even and odd frequency samples can be expressed as

X(2r) = Σ_{n=0}^{N/2-1} [x(n) + x(n+N/2)] W_{N/2}^{nr}   (2.2)

X(2r+1) = Σ_{n=0}^{N/2-1} [x(n) - x(n+N/2)] W_N^{n} W_{N/2}^{nr}   (2.3)

As shown, equations (2.2) and (2.3) are composed of two (N/2)-point DFTs. It is well known that these two equations can be combined into one basic butterfly (BF) module as shown in Fig. 2.1, where x(n) and x(n+N/2) are the input data.


Fig. 2.1 The butterfly signal flow graph of radix-2 DIF FFT

By recursive decomposition, we can further partition these two smaller DFTs into even smaller DFTs, and so on. Finally, the complete N-point radix-2 DIF FFT algorithm is obtained. The signal flow graph of a 16-point radix-2 DIF FFT example is shown in Fig. 2.2.
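The recursive decomposition can be sketched in Python. This illustrative version (a behavioral model, not the hardware) applies equations (2.2) and (2.3) at each level and re-interleaves the half-size results so that, for clarity, the output comes out in natural order:

```python
import cmath

def fft_dif_radix2(x):
    """Recursive radix-2 DIF FFT (natural-order output for clarity).

    Even outputs come from an (N/2)-point DFT of x(n)+x(n+N/2);
    odd outputs from an (N/2)-point DFT of (x(n)-x(n+N/2)) * W_N^n.
    """
    N = len(x)
    if N == 1:
        return x[:]
    half = N // 2
    a = [x[n] + x[n + half] for n in range(half)]
    b = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(half)]
    even = fft_dif_radix2(a)   # X(2r)
    odd = fft_dif_radix2(b)    # X(2r+1)
    out = [0j] * N
    out[0::2] = even
    out[1::2] = odd
    return out
```

An in-place hardware implementation omits the final interleaving, which is exactly why the output order is bit-reversed.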

Fig. 2.2 Radix-2 DIF FFT signal flow graph of a 16-point example

As shown in Fig. 2.2, the order of the output frequency coefficients is bit-reversed.
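Bit reversal means that output slot k of the last stage holds X(rev(k)), where rev reverses the log2(N)-bit binary representation of k. A small illustrative helper:

```python
def bit_reverse(k, nbits):
    """Reverse the nbits-bit binary representation of index k."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r
```

For the 16-point example, slot 1 (binary 0001) holds X(8) (binary 1000).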

2.2.2 Radix-4/Radix-2^2 DIF FFT Algorithm

Similar to the radix-2 algorithm, the frequency-domain data indices of a DFT can be divided into four groups: 4r, 4r+1, 4r+2, and 4r+3. Then equation (2.1) can be rewritten in terms of four partial sums as shown in equation (2.4). The corresponding basic butterfly processing block is shown in Fig. 2.3.


Fig. 2.3 The basic butterfly processing block of the radix-4 DIF algorithm

The relationship between the inputs and outputs of Fig. 2.3 is given in equation (2.5).

By replacing the index 4r+l with 4r+2l_2+l_1, equation (2.4) can be rewritten as (2.6). Equation (2.6) is called the radix-2^2 FFT algorithm, and its butterfly signal flow graph is shown in Fig. 2.4. The complete signal flow graph of a 16-point radix-2^2 DIF FFT is shown in Fig. 2.5.


Fig. 2.4 Butterfly signal flow graph of the radix-2^2 DIF FFT algorithm

The output data order of the radix-4 approach is digit-reversed, while that of the radix-2^2 approach is bit-reversed (the same as the radix-2 algorithm mentioned in the previous section). In fact, the radix-2^2 FFT algorithm can be regarded as a modification of the radix-2 FFT algorithm that rearranges the twiddle factors of two radix-2 butterfly stages. However, the pure radix-4/radix-2^2 FFT algorithm is only suitable for 4^n-point FFTs. To deal with a DFT whose length is not a power of 4, a radix-2 butterfly can be used at the first or the last stage.


Fig. 2.5 16-point signal flow graph of the radix-2^2 DIF FFT

2.2.3 Radix-8/Radix-2^3 DIF FFT Algorithm

For the radix-8 DIF FFT algorithm, the index k of equation (2.1) is divided into eight subsets indexed by 8r+l, where l runs from 0 to 7. With the properties of the twiddle factor, equation (2.1) can be rewritten as equation (2.7). The basic transformation unit is an 8-point DFT, as shown in Fig. 2.6.

As with the radix-4 FFT algorithm, we can derive a radix-2^3 FFT algorithm from the radix-8 FFT algorithm by replacing the index 8r+l with 8r+4l_3+2l_2+l_1, as shown in equation (2.8). The basic butterfly blocks of the radix-2^3 DIF FFT algorithm are shown in Fig. 2.7.


Fig. 2.6 The butterfly signal flow graph of the radix-8 DIF FFT algorithm


Fig. 2.7 The butterfly signal flow graph of the radix-2^3 DIF FFT algorithm

The relationship between the inputs and outputs of Fig. 2.6 is that of an 8-point DFT.

Similar to the radix-4/radix-2^2 case, the output data ordering of the radix-8 case is digit-reversed (where each digit contains three bits), while that of the radix-2^3 case is bit-reversed.
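The distinction between digit reversal and bit reversal can be made concrete with a small illustrative sketch of base-r digit reversal of an index:

```python
def digit_reverse(k, n_digits, r):
    """Reverse the base-r digit representation of index k (n_digits digits)."""
    out = 0
    for _ in range(n_digits):
        out = out * r + (k % r)
        k //= r
    return out
```

For a 64-point radix-8 FFT (two octal digits), index 11 (octal 13) maps to 25 (octal 31); with r = 2 the function reduces to ordinary bit reversal.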

2.3 Classification of FFT Architectures

Generally, FFT architectures can be classified into two categories: pipeline-based architectures [2], [3], [4], [5] and memory-based architectures [6], [7], [8], [9], [10], [11]. These two types of architectures have their own advantages, disadvantages, and application purposes. The pipeline-based architecture is usually used in high-throughput applications because of its deeply pipelined computation.

On the other hand, a memory-based architecture generally includes a single butterfly PE (or more than one to enhance computation power), a centralized memory block to store input, output, or intermediate data, and a control unit to handle memory accesses and data flow direction. Therefore, the hardware cost of a memory-based architecture is lower than that of pipeline-based architectures, but the side effect is a loss of throughput rate.

We will introduce these two categories of FFT architectures in the following subsections.

2.3.1 Pipeline-Based FFT Architectures

Pipeline-based architectures are designed with an emphasis on speed performance and the regularity of the data path. The most direct way to obtain a pipeline-based FFT architecture is through vertical projection of the signal flow graph (SFG) of the selected algorithm. Fig. 2.8 shows the projection mapping of a 16-point radix-2 DIF FFT example.

The pipelined structure contains four butterfly processing elements (PEs, denoted as "BF" in Fig. 2.8) for the addition and subtraction of the two input data of each stage, three complex multipliers with coefficient look-up tables, and four blocks of buffers to store and reorder data and provide a smooth data flow for the butterfly PE of the next stage.

There are two types of data buffering strategies [5], [12], [13], [14] for pipeline-based FFT processors. One is the delay-commutator (DC) and the other is the delay-feedback (DF).

A multi-path structure is generally applied with the delay-commutator type, and a single-path structure with the delay-feedback type; these are called multi-path delay-commutator (MDC) and single-path delay-feedback (SDF), respectively.


Fig. 2.8 Projection mapping of the radix-2 DIF FFT signal flow graph

The SDF pipeline architecture is adaptable to various FFT algorithms such as radix-2, radix-4, radix-2^2, and higher-radix algorithms, using (N-1) registers. The final output data order of the SDF structure is always bit-reversed, and the utilization rate of the registers is 100%. Although an FFT processor based on an SDF architecture provides high throughput with a simple control method, it suffers from some disadvantages. For instance, the utilization rate of the processing elements is not 100%. The number of required arithmetic units depends on the number of pipeline stages. Furthermore, an extra buffer is needed to reorder the output data because of the bit-reversed output order.

MDC is also a flexible architecture that is adaptable to various FFT algorithms.

Unlike the SDF architecture, in a multi-path delay-commutator (MDC) architecture intermediate data are output directly to the next stage or coefficient multiplier instead of being written back. However, the MDC architecture suffers from disadvantages similar to those of the SDF architecture.

Compared with the MDC structure, the SDF has lower hardware complexity and a higher utilization rate, but a longer FFT computation time, assuming the same operating clock rate. The total time to compute an N-point FFT is expressed in equation (2.9), where T_clk is the clock cycle time, r is the radix of the algorithm, and N is the length of the FFT.

SDF: T_FFT = N × T_clk
MDC: T_FFT = (N/r) × T_clk   (2.9)

From these equations, one has the following conclusions:

1. The computation time of an N-point FFT using the SDF pipeline architecture is independent of r. Therefore, a higher-radix algorithm does not imply a higher throughput rate when realized by an SDF structure.

2. The computation time of an N-point FFT in the MDC pipeline architecture decreases linearly with r. However, it consumes a large hardware area with a lower utilization rate when a high-radix algorithm is used.
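These two conclusions follow directly from equation (2.9); a trivial illustrative sketch of the cycle counts:

```python
def sdf_cycles(N):
    # SDF accepts one sample per clock: N cycles per N-point FFT,
    # independent of the radix r (conclusion 1).
    return N

def mdc_cycles(N, r):
    # r-path MDC accepts r samples per clock: N/r cycles (conclusion 2).
    return N // r
```

For N = 1024, the SDF needs 1024 cycles regardless of radix, while a radix-4 MDC needs only 256.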

2.3.2 Memory-Based FFT Architectures

A general memory-based FFT processor structure mainly consists of a Butterfly PE, a main memory, ROM for twiddle factor storage, and a controller.

The butterfly PE is responsible for the butterfly operations required by the FFT. Moreover, the architecture of the PE depends on the FFT algorithm used and generally dominates the performance of the whole processor. The main memory stores the processed data. The controller contains three functional units: the data memory address generator, the coefficient index generator, and the operation state controller. The data memory address generator follows a regular pattern to generate addresses; according to these addresses, the main memory provides input data to the butterfly PE and stores its output data. The coefficient index generator provides indices to select coefficients from the coefficient ROM, or maps them to coefficients through a twiddle factor generator.

The data memory address generator and twiddle factor generator of memory-based FFT processors will be discussed in Chapter 3. Furthermore, the architecture designs of the multiplier-based PE and the CORDIC-based PE will be investigated in Chapter 4.

Chapter 3

Data Address Generation and Twiddle Factor Generation

3.1 Data Address Generation

Memory-based FFT processor architectures are designed to increase the utilization rate of the butterfly PEs. Different from the pipeline-based architectures, a memory-based FFT processor often has one or two large memory blocks accessed by all other PE components, instead of many buffers distributed among pipelined local arithmetic units.

The main memory allocation and access strategy of a memory-based FFT processor can be classified into two types: in-place [6], [7], [8], [9] and out-of-place [10], [11], [15], [16], [17]. In an in-place architecture, the output data of the butterfly PE are written back to the original memory bank at the same addresses from which the input data were previously loaded. Alternatively, if the output data are written to another memory block without overwriting the input data, the design is called out-of-place. Consequently, the memory size of an out-of-place design is generally twice that of an in-place design.

In an in-place memory-based FFT processor, the data address generator design is based on the butterfly execution order, while the coefficient address also depends on the FFT algorithm and the butterfly processing order. A conventional processing order and control scheme was proposed early by Cohen (1976) [18]. The algorithm was then extended and generalized by Johnson [6], Lo [7], and Ma [8], [9].

3.1.1 Fixed-length Data Address Generator

In Cohen's scheme [18], the data addresses needed for the i-th butterfly of the k-th radix-2 FFT stage can be described by equation (3.1), where the operator ROTATE_n(X, m) circularly rotates X left by m bits out of n bits.

Since Cohen's original scheme was based on the DIT FFT, we can modify it to match the general DIF FFT algorithm as described by equation (3.2), where the operator ROTATE_n(X, m) circularly rotates X right by m digits out of n digits.


To realize equation (3.1), Cohen proposed an efficient address generator architecture based on barrel shifters. Similarly, an architecture for executing equation (3.2) is shown in Fig. 3.1, which is modified from Cohen's design [18]. The main idea is to append each of the digits 0, 1, ..., r-1 to the MSB of the butterfly counter content, and then use barrel shifters to achieve the circular rotation. The circular shift amount is equal to the content of the stage counter.
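For the radix-2 case, the append-then-rotate idea can be sketched as follows. This is an illustrative behavioral model under the assumption that the two addresses of a butterfly come from prefixing bit 0 or 1 to the butterfly counter and rotating right by the stage count:

```python
def rotate_right(x, m, nbits):
    """Circularly rotate the nbits-bit value x right by m bits."""
    m %= nbits
    mask = (1 << nbits) - 1
    return ((x >> m) | (x << (nbits - m))) & mask

def cohen_dif_pair(i, k, nbits):
    """Address pair for the i-th radix-2 butterfly of DIF stage k:
    prepend bit 0 or 1 as the MSB of the butterfly counter, then
    rotate right by the stage count (radix-2 sketch of equation (3.2))."""
    top = 1 << (nbits - 1)
    return (rotate_right(i, k, nbits),
            rotate_right(i | top, k, nbits))
```

For N = 16 (4 address bits), stage 1 and butterfly 1 this yields the pair <8, 12>, matching Cohen's ordering in Table 3.2.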

Fig. 3.1 Data address generator for radix-r N-point DIF FFT

Memory access bandwidth is a critical issue in memory-based FFT processor design. An N-point memory-based FFT processor based on a radix-r algorithm needs (N/r)·log_r N PE operations to transform one N-point symbol. Further, each PE operation needs 2r memory accesses to read data from memory and write results back, so each N-point symbol requires 2N·log_r N memory accesses. To handle this requirement, there are two solutions:

1. Use a higher-radix algorithm to reduce total memory accesses.

2. Increase memory access bandwidth by using multiple memory ports.

However, doing so is expensive. Further, the number of memory ports is controlled not by the architecture designer but by the cell library provider and device vendor. In addition, an in-place design should ideally deliver r data words to the radix-r butterfly PE and write the r output data of the butterfly PE back to the data memory simultaneously.

Therefore, the practical solution to increase memory access bandwidth is to partition the memory into r banks, and then store the data properly in the memory banks for conflict-free memory access.

There have been several efforts on memory partitioning and addressing methods to achieve conflict-free memory access [18], [6], [7]. The conflict-free memory partition scheme [6], which translates the sequential data address Data_count into a bank index and a data index within each memory bank, is described in equation (3.3).

Data_count = [d_{n-1} d_{n-2} ... d_2 d_1 d_0],  n = ⌈log_r N⌉
Bank_index = (d_{n-1} + d_{n-2} + ... + d_1 + d_0) mod r
Data_index = [d_{n-1} d_{n-2} ... d_2 d_1]   (3.3)

where d_t denotes the t-th radix-r digit of the data address.

A similar result can also be found in Lo's scheme, derived by a vertex-coloring rule [7].

The original data address Data_count can be generated from the contents of the butterfly counter and the stage counter of the FFT process. Data_index is the new address assigned within each memory bank. For a radix-r butterfly PE, this allocation algorithm can access r data words from r different banks simultaneously at the proper addresses derived from the original data addresses. The data address generator with conflict-free memory access is often implemented with the hardware [6], [7] shown in Fig. 3.2.
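A behavioral sketch of the bank mapping of equation (3.3) (illustrative Python, assuming the digit-sum-mod-r bank index and dropping the least-significant radix-r digit for the in-bank index):

```python
def bank_and_index(addr, r):
    """Map a flat data address to (bank, in-bank index) per equation (3.3):
    bank = sum of the radix-r digits of addr, modulo r;
    index = addr with its least-significant radix-r digit removed."""
    digit_sum, a = 0, addr
    while a:
        digit_sum += a % r
        a //= r
    return digit_sum % r, addr // r
```

For the radix-2 butterflies of Table 3.1, the two addresses of every pair always land in different banks, so both operands can be fetched in a single cycle.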

Fig. 3.2 Block diagram of the fixed-length data address generator

The coefficient index can also be generated from the butterfly counter and the stage counter. Its implementation [18] is generally simpler than that of the data address generator. The hardware diagram is shown in Fig. 3.3.

Fig. 3.3 Block diagram of the fixed-length coefficient generator

3.1.2 Variable-length Data Address Generator

In order to meet the specifications of multi-mode and multi-standard OFDM communication systems, an FFT processor must support length-independent computation and meet the worst-case hardware requirement. Consequently, the FFT processor design must contain an efficient processing element, a variable-length data address generator, and a variable-length twiddle factor generator.

However, Cohen's scheme turns out to be unsuitable for a variable-length FFT when we analyze a sub-segment of the signal flow graph for a shorter-length FFT [19]. As an example, a 16-point radix-2 DIF FFT is shown in Fig. 3.4. In direct order, we process butterflies from top to bottom and from the left stage to the right stage, as marked by the numbers on the right-hand sides of the butterfly ellipses. On the other hand, since the main idea of Cohen's processing order is to group butterflies associated with the same twiddle factor together, Cohen's scheme processes butterflies in decimation-in-butterfly (DIB) order, as marked by the numbers on the left-hand sides of the butterfly ellipses in Fig. 3.4. In this example, the data addresses needed by the butterfly PE in direct processing order and in Cohen's processing order are listed in Table 3.1 and Table 3.2, respectively. In the tables, <s, t> denotes the data address pair for both the input and output data of the radix-2 butterfly PE, where s and t are indices into a one-dimensional memory array. Note that the address translation and mapping from the one-dimensional index to a multi-bank memory system are considered later.


Fig. 3.4 Butterfly processing sequence for memory-based DIF FFT processors

Table 3.1 Data addresses needed for butterfly PE in direct processing order

BF 0 BF 1 BF 2 BF 3 BF 4 BF 5 BF 6 BF 7

Stage 0 <0, 8> <1, 9> <2, 10> <3, 11> <4, 12> <5, 13> <6, 14> <7, 15>

Stage 1 <0, 4> <1, 5> <2, 6> <3, 7> <8, 12> <9, 13> <10, 14> <11, 15>

Stage 2 <0, 2> <1, 3> <4, 6> <5, 7> <8, 10> <9, 11> <12, 14> <13, 15>

Stage 3 <0, 1> <2, 3> <4, 5> <6, 7> <8, 9> <10, 11> <12, 13> <14, 15>

Table 3.2 Data address pairs for butterfly PE in Cohen’s scheme

BF 0 BF 1 BF 2 BF 3 BF 4 BF 5 BF 6 BF 7

Stage 0 <0, 8> <1, 9> <2, 10> <3, 11> <4, 12> <5, 13> <6, 14> <7, 15>

Stage 1 <0, 4> <8, 12> <1, 5> <9, 13> <2, 6> <10, 14> <3, 7> <11, 15>

Stage 2 <0, 2> <4, 6> <8, 10> <12, 14> <1, 3> <5, 7> <9, 11> <13, 15>

Stage 3 <0, 1> <2, 3> <4, 5> <6, 7> <8, 9> <10, 11> <12, 13> <14, 15>
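The direct processing order of Table 3.1 follows a simple closed form; as an illustrative sketch (radix-2 only), the address pairs of one stage can be generated as:

```python
def direct_order_pairs(N, stage):
    """Radix-2 DIF butterfly address pairs in direct (top-to-bottom)
    processing order for one stage; reproduces Table 3.1 for N = 16."""
    span = N >> (stage + 1)          # distance between the two inputs
    pairs = []
    for block in range(0, N, 2 * span):
        for j in range(span):
            pairs.append((block + j, block + j + span))
    return pairs
```

For example, direct_order_pairs(16, 1) reproduces the Stage 1 row of Table 3.1: (0, 4), (1, 5), ..., (11, 15).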

As seen in Fig. 3.4, when we isolate the sub-SFG of a shorter-length FFT from the longer SFG, Cohen's DIB execution order does not match the variable-length FFT design concept, whereas the direct processing order is suited for variable-length FFT operations. Therefore, Cohen's data address generation scheme must be modified to deal with operations of different FFT lengths. The direct-ordered addressing scheme suitable for variable-length FFT design can be described by equation (3.4). In this equation, the variables k and i are the contents of the stage counter and the butterfly counter, respectively, and N is the longest supported length of the radix-r FFT.

(3.4)

Chang [19] proposed a variable-length data address generator modified from Cohen's fixed-length data address generator. The design includes an additional barrel shifter in order to first rotate the content of the butterfly counter circularly left. The subsequent operations, which perform digit appending and circular right rotation, are similar to Fig. 3.1. The block diagram is shown in Fig. 3.5.

Fig. 3.5 Chang’s variable-length DIF data address generator

Observing the results of equation (3.4) in digit representation, we find that it is equivalent to dynamically inserting a digit 0, 1, ..., or r-1 into a middle position of the butterfly counter content. Equation (3.5) expresses the data address representation in digits, and the content of the stage counter, denoted by the variable k, determines the
