
Chapter 2 Review of FFT Algorithms and Architectures

2.3 Classification of FFT Architectures

2.3.2 Memory-Based FFT Architectures

A general memory-based FFT processor structure mainly consists of a Butterfly PE, a main memory, ROM for twiddle factor storage, and a controller.

The butterfly PE performs the butterfly operations required by the FFT. Its architecture depends on the FFT algorithm used and generally dominates the performance of the whole processor. The main memory stores the processed data. The controller contains three functional units: a data memory address generator, a coefficient index generator, and an operation state controller. The data memory address generator follows a regular pattern to generate several addresses; according to these addresses, the main memory provides input data to the butterfly PE and stores its output data. The coefficient index generator provides indices that either select coefficients from the coefficient ROM or are mapped to coefficients by a twiddle factor generator.

The data memory address generator and the twiddle factor generator of a memory-based FFT processor are discussed in Chapter 3. The architecture designs of the multiplier-based PE and the CORDIC-based PE are investigated in Chapter 4.

Chapter 3

Data Address Generation and Twiddle Factor Generation

3.1 Data Address Generation

Memory-based FFT processor architectures are designed to increase the utilization rate of the butterfly PEs. Unlike pipeline-based architectures, a memory-based FFT processor usually has one or two large memory blocks that are accessed by all PE components, instead of memory distributed among many pipelined local arithmetic units.

The main memory allocation and access strategies of a memory-based FFT processor can be classified into two types: in-place [6], [7], [8], [9] and out-of-place [10], [11], [15], [16], [17]. In an in-place architecture, the output data of the butterfly PE are written back to the same memory bank and the same addresses from which the input data were loaded. Alternatively, if the output data are written to another memory block without overwriting the input data, the design is called out-of-place. Consequently, the memory size of an out-of-place design is generally twice that of an in-place design.

In an in-place memory-based FFT processor, the data address generator design is based on the butterfly execution order, and the coefficient address likewise depends on the FFT algorithm and the butterfly processing order. A conventional processing order and control scheme was proposed by Cohen (1976) [18]. The algorithm was then extended and generalized by Johnson [6], Lo [7], and Ma [8], [9].

3.1.1 Fixed-length Data Address Generator

In Cohen’s scheme [18], the data addresses needed for the i-th butterfly of the k-th radix-2 FFT stage are described by equation (3.1), where the operator ROTATE_n(X, m) circularly rotates X left by m bits out of n bits.

Since Cohen’s original scheme was based on the DIT FFT, we modify it to match the general DIF FFT algorithm, as described by equation (3.2), where the operator ROTATE_n(X, m) circularly rotates X right by m digits out of n digits.


To realize equation (3.1), Cohen proposed an efficient address generator architecture based on barrel shifters. Similarly, an architecture that executes equation (3.2) is shown in Fig. 3.1; it is modified from Cohen’s design [18]. The main idea is to place the digits 0, 1, ..., r-1 at the MSB of the butterfly counter content and then use barrel shifters to perform the circular rotation. The circular shift amount equals the content of the stage counter.

Fig. 3.1 Data address generator for radix-r N-point DIF FFT
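The digit placement and rotation described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the hardware of Fig. 3.1: the function names are hypothetical, the butterfly counter is taken to hold n-1 radix-r digits, and the rotation amount is taken to be the stage counter value k. For r = 2 and N = 16 the sketch reproduces the address pairs listed in Table 3.2.

```python
def rotate_right_digits(x, m, n, r):
    """Circularly rotate the n-digit radix-r number x right by m digits."""
    m %= n
    low = x % (r ** m)             # the m lowest digits wrap around to the top
    high = x // (r ** m)           # the remaining digits move down by m places
    return low * r ** (n - m) + high


def cohen_dif_addresses(i, k, n, r=2):
    """Addresses of butterfly i in stage k of an N = r**n point DIF FFT.

    Assumption: digit j (0..r-1) is placed above the (n-1)-digit butterfly
    counter i, and the resulting n-digit word is rotated right by k digits.
    """
    return [rotate_right_digits(j * r ** (n - 1) + i, k, n, r) for j in range(r)]


if __name__ == "__main__":
    for stage in range(4):
        print(stage, [cohen_dif_addresses(i, stage, 4) for i in range(8)])
```

The barrel shifters of the hardware correspond to `rotate_right_digits`; everything else is a concatenation of the inserted digit with the counter content.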

The memory access bandwidth is a critical issue in memory-based FFT processor design. An N-point memory-based FFT processor based on a radix-r algorithm needs (N/r)·log_r N PE operations to transform one N-point symbol. Each PE operation needs 2r memory accesses (r reads and r write-backs), so each N-point symbol requires 2N·log_r N memory accesses. There are two ways to handle this requirement:

1. Use a higher-radix algorithm to reduce total memory accesses.

2. Increase memory access bandwidth by using multiple memory ports.

However, adding memory ports is expensive. Moreover, the number of memory ports is determined not by the architecture designer but by the cell library provider and the device vendor. In addition, an in-place design should ideally deliver r data words to the radix-r butterfly PE and write back the r outputs of the butterfly PE to the data memory simultaneously.

Therefore, the practical way to increase memory access bandwidth is to partition the memory into r banks and then store the data in the banks such that memory accesses are conflict-free.
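As a quick numerical check of the access counts quoted above, the following sketch evaluates (N/r)·log_r N PE operations and 2r accesses per operation. The function names are illustrative, and N is assumed to be an exact power of r.

```python
def stage_count(N, r):
    """Return log_r(N), assuming N is an exact power of r."""
    stages, size = 0, 1
    while size < N:
        size *= r
        stages += 1
    if size != N:
        raise ValueError("N must be a power of r")
    return stages


def memory_accesses(N, r):
    """Return (PE operations, memory accesses) for one N-point symbol."""
    pe_ops = (N // r) * stage_count(N, r)   # (N/r) * log_r(N) butterfly operations
    return pe_ops, 2 * r * pe_ops           # r reads plus r writes per operation


if __name__ == "__main__":
    for radix in (2, 4):
        print(radix, memory_accesses(1024, radix))
```

For a 1024-point symbol, moving from radix-2 to radix-4 halves the total number of accesses, which is the effect of the first solution in the list above.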

There are several efforts on memory partition and addressing methods to achieve conflict-free memory access [18], [6], [7]. The conflict-free memory partition scheme [6] that translates sequential data address data_count into bank index and data index of each memory bank is described in equation (3.3).

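\[
\begin{aligned}
\text{Data\_count} &= [\, d_{n-1}\, d_{n-2} \cdots d_2\, d_1\, d_0 \,]_r, \qquad n = \lceil \log_r N \rceil \\
\text{Bank\_index} &= \Bigl( \sum_{t=0}^{n-1} d_t \Bigr) \bmod r \\
\text{Data\_index} &= [\, d_{n-1}\, d_{n-2} \cdots d_2\, d_1 \,]_r
\end{aligned}
\tag{3.3}
\]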

A similar result can also be found in Lo’s scheme, which is derived from a vertex coloring rule [7].
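A minimal sketch of the bank mapping may help. It assumes one reading of equation (3.3): the bank index is the sum of the radix-r digits of the data address modulo r, and the address inside the bank is the data address with its least significant digit dropped. The function names are hypothetical.

```python
def to_digits(x, n, r):
    """Radix-r digits of x, least significant digit first."""
    return [(x // r ** p) % r for p in range(n)]


def bank_and_index(data_count, n, r):
    """Map a sequential address to (bank index, address inside the bank).

    Assumed reading of equation (3.3): bank = digit sum mod r,
    in-bank address = data_count without its least significant digit.
    """
    bank = sum(to_digits(data_count, n, r)) % r
    index = data_count // r
    return bank, index


if __name__ == "__main__":
    # The four operands of one radix-4 butterfly of a 16-point FFT
    # fall into four different banks.
    print([bank_and_index(a, 2, 4) for a in (0, 4, 8, 12)])
```

Because the operands of one butterfly differ in exactly one digit, their digit sums differ modulo r, so the r operands always land in r distinct banks.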

The original data address data_count can be generated from the contents of the butterfly counter and the stage counter. The data_index is the new address assigned within each memory bank. For a radix-r butterfly PE, this allocation algorithm allows r data to be accessed from r different banks simultaneously, at the proper addresses derived from the original data addresses. The data address generator with conflict-free memory access is often implemented by the hardware [6], [7] shown in Fig. 3.2.

Fig. 3.2 Block diagram of the fixed-length data address generator

The coefficient index can also be generated from the butterfly counter and the stage counter. Its implementation [18] is generally simpler than that of the data address generator.

The hardware diagram is shown in Fig. 3.3.

Fig. 3.3 Block diagram of the fixed-length coefficient generator

3.1.2 Variable-length Data Address Generator

To meet the specifications of multi-mode and multi-standard OFDM communication systems, an FFT processor must support length-independent computation and meet the worst-case hardware requirement. Consequently, the FFT processor design must contain an efficient processing element, a variable-length data address generator, and a variable-length twiddle factor generator.

However, analyzing the sub-segment of the signal flow graph that corresponds to a shorter-length FFT shows that Cohen’s scheme is not suitable for a variable-length FFT [19]. As an example, a 16-point radix-2 DIF FFT is shown in Fig. 3.4. In direct order, butterflies are processed from top to bottom and from the left stage to the right stage, as marked by the numbers on the right-hand sides of the butterfly ellipses. On the other hand, since the main idea of Cohen’s processing order is grouping butterflies associated with the same twiddle factor together to [...]tion in butterfly (DIB) order, as marked by the numbers on the left-hand sides of the butterfly ellipses in Fig. 3.4. For this example, the data addresses needed by the butterfly PE in direct processing order and in Cohen’s processing order are listed in Table 3.1 and Table 3.2, respectively. In the tables, <s, t> denotes the address pair for both the input and the output data of the radix-2 butterfly PE, where s and t are indices into a one-dimensional memory array. The translation from the one-dimensional index to the multi-bank memory system is considered later.

[Figure: signal flow graph of a 16-point radix-2 DIF FFT (1st to 4th stage) with the 8-point and 4-point sub-graphs marked; the eight butterflies of each stage are numbered 1 to 8 on both sides of the butterfly ellipses.]

Fig. 3.4 Butterfly processing sequence for memory-based DIF FFT processors

Table 3.1 Data addresses needed for butterfly PE in direct processing order

         BF 0      BF 1      BF 2      BF 3      BF 4      BF 5      BF 6      BF 7
Stage 0  <0, 8>    <1, 9>    <2, 10>   <3, 11>   <4, 12>   <5, 13>   <6, 14>   <7, 15>
Stage 1  <0, 4>    <1, 5>    <2, 6>    <3, 7>    <8, 12>   <9, 13>   <10, 14>  <11, 15>
Stage 2  <0, 2>    <1, 3>    <4, 6>    <5, 7>    <8, 10>   <9, 11>   <12, 14>  <13, 15>
Stage 3  <0, 1>    <2, 3>    <4, 5>    <6, 7>    <8, 9>    <10, 11>  <12, 13>  <14, 15>

Table 3.2 Data address pairs for butterfly PE in Cohen’s scheme

         BF 0      BF 1      BF 2      BF 3      BF 4      BF 5      BF 6      BF 7
Stage 0  <0, 8>    <1, 9>    <2, 10>   <3, 11>   <4, 12>   <5, 13>   <6, 14>   <7, 15>
Stage 1  <0, 4>    <8, 12>   <1, 5>    <9, 13>   <2, 6>    <10, 14>  <3, 7>    <11, 15>
Stage 2  <0, 2>    <4, 6>    <8, 10>   <12, 14>  <1, 3>    <5, 7>    <9, 11>   <13, 15>
Stage 3  <0, 1>    <2, 3>    <4, 5>    <6, 7>    <8, 9>    <10, 11>  <12, 13>  <14, 15>

In Fig. 3.4, when the sub-SFG of a shorter-length FFT is isolated from the longer SFG, Cohen’s DIB execution order does not match the variable-length FFT design concept, whereas the direct processing order suits variable-length FFT operations. Therefore, Cohen’s data address generation scheme has to be modified to handle different FFT lengths. The direct-ordered addressing scheme suitable for variable-length FFT design is described by equation (3.4), where the variables k and i are the contents of the stage counter and the butterfly counter, respectively, and N is the longest supported length of the radix-r FFT.

(3.4)

Chang [19] proposed a variable-length data address generator modified from Cohen’s fixed-length data address generator. The design includes an additional barrel shifter that first rotates the content of the butterfly counter circularly left. The subsequent operations, which append the digits and rotate circularly right, are similar to those of Fig. 3.1. The block diagram is shown in Fig. 3.5.

Fig. 3.5 Chang’s variable-length DIF data address generator
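The rotate-left, append, rotate-right sequence described for this generator can be modelled as follows. The sketch is an assumption-laden illustration rather than the circuit of Fig. 3.5: the names are hypothetical, the butterfly counter is assumed to hold n-1 radix-r digits, and both rotation amounts are assumed to equal the stage counter value k. For r = 2 and N = 16 it reproduces Table 3.1.

```python
def rotate_left(x, m, n, r):
    """Circularly rotate the n-digit radix-r number x left by m digits."""
    if n == 0:
        return x
    m %= n
    high = x // r ** (n - m)       # the m highest digits wrap to the bottom
    low = x % r ** (n - m)         # the remaining digits move up by m places
    return low * r ** m + high


def rotate_right(x, m, n, r):
    """Circularly rotate the n-digit radix-r number x right by m digits."""
    return rotate_left(x, n - (m % n), n, r)


def direct_order_addresses(i, k, n, r=2):
    """Addresses of butterfly i in stage k, direct processing order."""
    pre = rotate_left(i, k, n - 1, r)                  # extra barrel shifter
    return [rotate_right(j * r ** (n - 1) + pre, k, n, r) for j in range(r)]


if __name__ == "__main__":
    for stage in range(4):
        print(stage, [direct_order_addresses(i, stage, 4) for i in range(8)])
```

The only difference from a Cohen-style generator in this model is the pre-rotation of the counter, which matches the additional barrel shifter mentioned in the text.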

Observing the results of equation (3.4) in digit representation, we find that it is equivalent to dynamically inserting the digits 0, 1, ..., r-1 into the middle of the butterfly counter content. Equation (3.5) expresses the data address in digits: the content of the stage counter, denoted by the variable k, determines the insertion location within the butterfly counter content, which is denoted by the variable i. In the data address, the digits to the right of the insertion location equal the corresponding digits of the butterfly counter, while the digits to the left are the butterfly counter digits shifted left by log_2 r bits.

Based on the property described above, multiplexer arrays can perform the required shift and insertion operations instead of barrel shifters.
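The digit-insertion view can be written directly, without any rotation. The sketch below is a hedged illustration: the function name is hypothetical, and the insertion position is assumed to be digit n-1-k (counted from the least significant digit) for stage counter value k, which is the choice that reproduces Table 3.1 for r = 2 and N = 16.

```python
def insert_digit_addresses(i, k, n, r=2):
    """Addresses of butterfly i in stage k, built by digit insertion.

    Assumption: the digits 0..r-1 are inserted at digit position n-1-k.
    """
    pos = n - 1 - k
    low = i % r ** pos             # digits right of the insertion point: unchanged
    high = i // r ** pos           # digits left of it: moved up by one digit
    return [high * r ** (pos + 1) + j * r ** pos + low for j in range(r)]


if __name__ == "__main__":
    print([insert_digit_addresses(i, 1, 4) for i in range(8)])   # radix-2, stage 1
    print([insert_digit_addresses(i, 0, 2, r=4) for i in range(4)])  # radix-4, stage 0
```

Since each output digit is either a counter digit, the counter digit one place lower, or the inserted digit, a three-way selection per digit is sufficient, which is what makes a multiplexer array an alternative to barrel shifters.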

Hung [20] proposed a variable-length data address generator that replaces the original area-consuming barrel-shifter-based designs with simpler multiplexer-based addressing functions. To suit 802.16a, DAB and DVB-T, the design covers seven different FFT lengths including 256, 512, 1024, 2048 and 8192 points. Furthermore, it is based on the radix-2^2 DIF FFT algorithm but can still compute a general power-of-2 FFT through minor modifications to the original radix-2^2 architecture. The detailed block diagram is shown in Fig. 3.6.

(a) The shift-insert-bypass multiplexer and the control signals

(b) Shift-insert-bypass MUX array architecture

(c) Full block diagram of DIF FFT data address generator

Fig. 3.6 Block diagram of Hung’s variable-length data address generator

In Hung’s design, shown in Fig. 3.6, the stage counter and the butterfly counter are designed to process the longest N-point FFT. When computing a shorter FFT of length N/2^c, the butterfly counter simply changes its count step from 1 to 2^c instead of changing its counting limit from N/2 to N/2^(c+1). This method results in scattered memory addresses. For example, the accessed memory locations of a 1024-point FFT are 0, 8, 16, ..., 8184 of the 8192-entry memory block. To access the data memory at contiguous locations, an additional shifter has to be used to convert the scattered data addresses to contiguous values.
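The scattering effect and the role of the extra shifter can be illustrated numerically. The sketch below models only the set of accessed locations and the compaction shift, not Hung's counter logic; the function names are hypothetical, and the compaction is assumed to be a right shift by c bits.

```python
def scattered_locations(N, c):
    """Locations touched when an N/2**c point FFT is run with count step 2**c."""
    return list(range(0, N, 1 << c))


def compact(address, c):
    """Map a scattered address to a contiguous one (assumed right shift by c bits)."""
    return address >> c


if __name__ == "__main__":
    locations = scattered_locations(8192, 3)        # 1024-point FFT, 8192-entry memory
    print(locations[:3], locations[-1], len(locations))
    print([compact(a, 3) for a in locations[:4]])
```

With N = 8192 and c = 3 this gives the 1024 locations 0, 8, 16, ..., 8184 quoted in the text, which the shift maps onto 0 to 1023.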

Besides this requirement on the butterfly counter, the stage counter is designed with two different modes to support radix-2 and radix-2^2 operation and to match the butterfly counter operations and the multiplexer-based addressing functions. In non-power-of-4 FFT operations, the first stage must perform radix-2 computation, so the radix-2^2 butterfly is reconfigured as two radix-2 butterflies, and the data addresses <s, t, u, v> required by the butterfly PE are generated by equation (3.6).

(3.6)

In addition, the data addresses <s, t, u, v> are described by equation (3.7) when processing the various stages of a power-of-4-point FFT except the first stage.

(3.7)

Hence, a specially designed stage counter, denoted as K, is increased by one each time a radix-2 butterfly operation is performed, or by two each time all the radix-2^2 butterfly operations of an FFT stage are completed. For example, the stage counter sequence for a power-of-4 operation is 0, 2, 4, ..., 2a, while the sequence for a power-of-2 operation is 0, 1, 3, ..., 2a+1.

However, Hung’s stage counter design has a high control complexity because it has two different modes. Furthermore, equation (3.7) is a radix-2 formulation, although the design is based on the radix-2^2 DIF FFT algorithm and the generalized data address equation (3.8) is radix-2^2. Therefore, the number of multiplexers required by the SIB multiplexer array in Fig. 3.6(b) to realize equation (3.7) is double that required to realize equation (3.8). The look-up table that stores the control signals for the multiplexer array, shown in Fig. 3.6(a), is also doubled.

(3.8)

To reduce the complexity of a variable-length data address generator, there are several design considerations. One is to centralize the data memory locations by generating successive data addresses <s, t, u, v>. Another is to simplify the control of the stage counter and to realize the multiplexer array using equation (3.8) instead of equation (3.7). Moreover, since the four SIB MUX arrays in Fig. 3.6(c) are symmetric, they can be reduced to a single multiplexer array.

3.1.3 The Proposed Variable-length Data Address Generator

Based on the above discussion, we propose a variable-length data address generator. The main control is divided into two modes. One is “radix-2-mode”, which corresponds to the first stage of a non-power-of-4 FFT operation and performs radix-2 computation. The other, “radix-2^2-mode”, executes radix-2^2 computation and covers all stages except the first stage of a non-power-of-4 FFT and all stages of a power-of-4 FFT. The counting step of the butterfly counter is 1 for “radix-2^2-mode” and 2 for “radix-2-mode”. The counting limit of the butterfly counter for a P-point FFT is P/4 in “radix-2^2-mode” and 2*P/4 in “radix-2-mode”. Furthermore, the initial value of the stage counter for a P-point FFT is log_4(N/P) and the maximum stage number is log_4 N, where N is the longest supported FFT length. Regardless of the mode, the stage counter increases by one each time the butterfly operations of a stage are completed. The data addresses are generated according to equation (3.8), so the number of multiplexers in the multiplexer array is only log_r(N/r) + 1. In “radix-2-mode”, however, there is an extra operation that shifts the data addresses right by 1 bit.

To fit both “radix-2^2-mode” and “radix-2-mode”, the butterfly PE based on the radix-2^2 algorithm is modified to process two radix-2 butterflies simultaneously in “radix-2-mode”. The block diagram of each mode is shown in Fig. 3.7. If n is the content of the butterfly counter shifted left by the amount indicated by the stage counter, the twiddle factors of “radix-2^2-mode” are W_N^n, W_N^(2n) and W_N^(3n), while those of “radix-2-mode” are W_N^n and W_N^(n+N/4). This shared PE design does not increase the complexity of the data address generator.

[Figure: datapath of the butterfly processing element, built from adders and multiplexers, with twiddle index n.]

(a) The datapath configuration of “radix-2^2-mode”

(b) The datapath configuration of “radix-2-mode”

Fig. 3.7 Data path of the radix-2^2/2 butterfly processing element

Examples of 16-, 8- and 4-point FFTs are shown in Fig. 3.8 and Table 3.3. In stage 0 of the 8-point FFT operation, the butterfly counter increases by 2 each time and the output addresses must be shifted right by 1 bit. In the 4-point FFT, the initial value of the stage counter is 1, and the maximum butterfly counter value is 0 instead of 3.

[Figure: SFG with input x[0] at the top, annotated with the stage counter values (0 and 1), the butterfly counter content, and the 16-point twiddle factors, with the 8-point twiddle factors in parentheses.]

Fig. 3.8 SFG of 16-, 8- and 4-point DIF FFTs

Table 3.3 Computing flows of the stage and butterfly counters for 16-, 8- and 4-point DIF FFTs

(a) 16 points

Mode            Stage counter  Butterfly counter  Data addresses <s, t, u, v>
radix-2^2-mode  0              0                  <0, 4, 8, 12>
radix-2^2-mode  0              1                  <1, 5, 9, 13>
radix-2^2-mode  0              2                  <2, 6, 10, 14>
radix-2^2-mode  0              3                  <3, 7, 11, 15>
radix-2^2-mode  1              0                  <0, 1, 2, 3>
radix-2^2-mode  1              1                  <4, 5, 6, 7>
radix-2^2-mode  1              2                  <8, 9, 10, 11>
radix-2^2-mode  1              3                  <12, 13, 14, 15>

(b) 8 points

Mode            Stage counter  Butterfly counter  Data addresses <s, t, u, v>
radix-2-mode    0              0                  (0, 4, 8, 12)>>1: <0, 4>, <2, 6>
radix-2-mode    0              1                  Skip
radix-2-mode    0              2                  (2, 6, 10, 14)>>1: <1, 5>, <3, 7>
radix-2-mode    0              3                  Skip
radix-2^2-mode  1              0                  <0, 1, 2, 3>
radix-2^2-mode  1              1                  <4, 5, 6, 7>
radix-2^2-mode  1              2, 3               Skip

(c) 4 points

Mode            Stage counter  Butterfly counter  Data addresses <s, t, u, v>
radix-2^2-mode  0              0 to 3             Skip
radix-2^2-mode  1              0                  <0, 1, 2, 3>
radix-2^2-mode  1              1, 2, 3            Skip
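The counter rules of the proposed generator can be traced with a short simulation. This is a sketch under stated assumptions, not the author's implementation: the function names are hypothetical, N is assumed to be a power of 4, the initial stage is taken as the floor of log_4(N/P) when that value is not an integer, and the digit 0..3 is assumed to be inserted at bit position 2(log_4 N - 1 - k) for stage counter value k. With N = 16 the non-skipped entries of Table 3.3 are reproduced.

```python
def ilog2(x):
    """Integer base-2 logarithm of a power of two."""
    return x.bit_length() - 1


def schedule(P, N):
    """Yield (mode, stage, butterfly counter, addresses) for a P-point FFT.

    N is the longest supported length (assumed to be a power of 4).
    """
    n = ilog2(N) // 2                    # log4(N) stages in total
    first = ilog2(N // P) // 2           # floor(log4(N/P)); the floor is an assumption
    radix2_first = ilog2(P) % 2 == 1     # non-power-of-4 length
    for k in range(first, n):
        radix2 = radix2_first and k == first
        step, limit = (2, 2 * P // 4) if radix2 else (1, P // 4)
        pos = 2 * (n - 1 - k)            # bit position of the inserted digit
        for i in range(0, limit, step):
            s = ((i >> pos) << (pos + 2)) | (i & ((1 << pos) - 1))
            addresses = [s | (j << pos) for j in range(4)]   # s, t, u, v
            if radix2:
                # extra 1-bit right shift; per Table 3.3(b) the two radix-2
                # butterflies then use the pairs <s, u> and <t, v>
                addresses = [a >> 1 for a in addresses]
            yield ("radix-2" if radix2 else "radix-2^2", k, i, addresses)


if __name__ == "__main__":
    for length in (16, 8, 4):
        print(length, list(schedule(length, 16)))
```

In this model a shorter FFT simply starts at a later stage and counts to a smaller limit, so the generated addresses stay contiguous in the range 0 to P-1.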

When the multiplexer array is used to perform equation (3.8), a decoder is needed that determines the control signals of the SIB multiplexer array from the content of the stage counter; the control function is shown in Table 3.4. Furthermore, the four addresses of equation (3.8) are similar, so only the first address <s> needs to be computed, and the other addresses are obtained by a bit-wise OR as shown in equation (3.9). Since the digit-insertion position depends on the content of the stage counter, the decoder that determines the control signals of the MUX array can also generate t_OR, u_OR and v_OR by a similar function.

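\[
t = s \;\text{OR}\; t\_OR, \qquad u = s \;\text{OR}\; u\_OR, \qquad v = s \;\text{OR}\; v\_OR \tag{3.9}
\]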

Table 3.4 Control signals of the SIB MUX array

Stage
000    Insert        Bypass        Bypass        Bypass        Bypass        Bypass        Bypass
001    Shift 2 bits  Insert        Bypass        Bypass        Bypass        Bypass        Bypass
010    Shift 2 bits  Shift 2 bits  Insert        Bypass        Bypass        Bypass        Bypass
011    Shift 2 bits  Shift 2 bits  Shift 2 bits  Insert        Bypass        Bypass        Bypass
100    Shift 2 bits  Shift 2 bits  Shift 2 bits  Shift 2 bits  Insert        Bypass        Bypass
101    Shift 2 bits  Shift 2 bits  Shift 2 bits  Shift 2 bits  Shift 2 bits  Insert        Bypass
110    Shift 2 bits  Shift 2 bits  Shift 2 bits  Shift 2 bits  Shift 2 bits  Shift 2 bits  Insert
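The decoder function of Table 3.4 and the bit-wise OR shortcut can be modelled together. The sketch makes several assumptions that the text does not confirm: the first column of Table 3.4 is taken to drive the most significant 2-bit digit of the address, "Shift 2 bits" is taken to select the counter digit one place lower, "Bypass" the counter digit at the same place, and the OR masks are taken to be the digits 1, 2 and 3 at the insertion position. All names are hypothetical.

```python
def sib_controls(stage, num_mux=7):
    """Control word of Table 3.4: 'stage' shifts, one insert, then bypasses."""
    return ["shift"] * stage + ["insert"] + ["bypass"] * (num_mux - 1 - stage)


def apply_sib(counter, stage, num_mux=7):
    """Form address s from the butterfly counter, one 2-bit digit per multiplexer."""
    s = 0
    for m, control in enumerate(sib_controls(stage, num_mux)):
        out_pos = num_mux - 1 - m                         # assumed: column 0 is the top digit
        if control == "insert":
            digit = 0                                     # s carries the inserted digit 0
        elif control == "shift":
            digit = (counter >> (2 * (out_pos - 1))) & 3  # counter digit one place lower
        else:
            digit = (counter >> (2 * out_pos)) & 3        # counter digit at the same place
        s |= digit << (2 * out_pos)
    return s


def or_masks(stage, num_mux=7):
    """Assumed t_OR, u_OR, v_OR: digits 1, 2, 3 at the insertion position."""
    pos = 2 * (num_mux - 1 - stage)
    return [j << pos for j in (1, 2, 3)]


if __name__ == "__main__":
    s = apply_sib(2, 1, num_mux=2)                        # 16-point case, stage 1
    print([s] + [s | mask for mask in or_masks(1, num_mux=2)])
```

Only one multiplexer array is evaluated per butterfly in this model; the other three addresses cost one OR each, which reflects the symmetry argument made in the text.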

The proposed architecture of the variable-length data address generator is shown in Fig. 3.9. The multiplexer array not only realizes the generalized equation (3.8) but also takes advantage of the symmetry of the addressing functions. The design supports FFT lengths of 8192, 4096, 2048, ..., 512 and 256 points, which meets the requirements of 802.16a, DAB and DVB-T systems.

[Figure: the 3-bit stage counter drives a decoder (ROM) that issues MUX_control_signal (Insert, Bypass or Shift) to the 2-bit multiplexers MUX_0 to MUX_6; the multiplexers take the 2-bit fields [1:0] to [11:10] of the 12-bit butterfly counter and the constant 2'b00 and form the 14-bit address s[13:0]; three bit-wise OR blocks produce t[13:0], u[13:0] and v[13:0]. The counter control comprises a stage finish controller, a radix-2^2/radix-2 mode selector, and the reset, initial stage and symbol length inputs.]

Fig. 3.9 Variable-length data address generator

In this design, the data must be allocated to a four-bank memory block. Therefore, the conflict-free memory access method discussed in Section 3.1.1 is also implemented here. The new data address in each memory bank is obtained by shifting the original data address right by two bits, which is easy to implement. On the other hand, the required bank index should be obtained by
