GDA-based Variable Length DFT Design and Evaluation

Chapter 4 Long-length DSST’s designs

4.3 Variable-length DFT Design to Communication System Application

4.3.3 GDA-based Variable Length DFT Design and Evaluation

Fig. 4.8: Delay-area product of the FFT versus the proposed GDA-based DFT


Exploiting the Cooley-Tukey decomposition algorithm, we first decompose the long-length 1-D DFT into a 2-D array of short-length DFTs and formulate the short DFTs in each dimension as cyclic convolutions. Then, with the pseudocirculant factorization algorithm, we factorize each cyclic convolution into a sum of shorter cyclic convolutions and apply the proposed GDA design to realize the short-length cyclic convolutions, which yields a hardware-efficient long-length DFT design. Table 4.10 shows that the proposed design can flexibly compute the 1-D 64/128/256/512/1024/2048/4096-point DFT by cascading the decomposed short-length DFTs.

Table 4.10: Length of 1-D DFT constructed by the decomposed short length DFTs

| Length of the decomposed DFT | 8 | 16 | 32 | 64 |
|---|---|---|---|---|
| 8 | 64 | 128 | 256 | 512 |
| 16 | 128 | 256 | 512 | 1024 |
| 32 | 256 | 512 | 1024 | 2048 |
| 64 | 512 | 1024 | 2048 | 4096 |
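The decomposition behind Table 4.10 can be checked numerically. The following is a minimal sketch, assuming the index mapping x(N₂n₁ + n₂) for the input and Y(k₁ + N₁k₂) for the output; the function names `dft` and `cooley_tukey_2d` are illustrative only, and the short DFTs are computed directly rather than by GDA-based cyclic convolution.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used for the short transforms and as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def cooley_tukey_2d(x, n1, n2):
    """(n1*n2)-point DFT from n1-point row DFTs, twiddles, and n2-point column DFTs."""
    n = n1 * n2
    assert len(x) == n
    # Row stage: for each b = n2 index, an n1-point DFT over x[n2*a + b], a = 0..n1-1.
    g = [dft([x[n2 * a + b] for a in range(n1)]) for b in range(n2)]  # g[b][k1]
    # Twiddle stage: multiply G(n2, k1) by W_N^(n2*k1).
    for b in range(n2):
        for k1 in range(n1):
            g[b][k1] *= cmath.exp(-2j * cmath.pi * b * k1 / n)
    # Column stage: for each k1, an n2-point DFT over b; outputs land at Y(k1 + n1*k2).
    y = [0j] * n
    for k1 in range(n1):
        col = dft([g[b][k1] for b in range(n2)])
        for k2 in range(n2):
            y[k1 + n1 * k2] = col[k2]
    return y


if __name__ == "__main__":
    data = [complex(t % 5, t % 3) for t in range(128)]
    out = cooley_tukey_2d(data, 8, 16)  # 128 = 8 x 16, one entry of Table 4.10
    err = max(abs(a - b) for a, b in zip(out, dft(data)))
    print("max error vs direct DFT:", err)
```

Any pair of short lengths from Table 4.10 can be passed as `n1` and `n2`; the result should agree with the direct DFT up to floating-point rounding.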

Architecture design

Fig. 4.9 shows the block diagram of the proposed variable-length GDA-based DFT architecture built on the Cooley-Tukey decomposition. The architecture consists of two configurable GDA units that compute the row and column 1-D 8/16/32/64-point DFTs, respectively, a multiplier that performs the twiddle-factor multiplications serially, and a transpose memory for data transposition. Fig. 4.10 shows the block diagram in more detail for real input data and complex output data. To realize the twiddle-factor multiplications efficiently, a serial complex multiplier, such as a CORDIC processor or a set of serial multipliers, is a suitable choice to combine with the DA-based design. In the cyclic convolution formulation, the architecture in Fig. 4.10 can be redrawn as Fig. 4.11. It is composed of serial multiplication for pre-processing, GDA computation of Tij( ), and serial multiplication for post-processing. Each Tij( ) block can be configured for 1-D DFT computation of a different length, where the subscripts i, j select the computation of the real part of the input data with the real part of the DFT coefficient (RR), the imaginary part of the input data with the imaginary part of the DFT coefficient (II), the real part of the input data with the imaginary part of the DFT coefficient (RI), or the imaginary part of the input data with the real part of the DFT coefficient (IR). In Fig. 4.11, the output data of Tij( ) is multiplied in sequence by the post-processing coefficient of the row 1-D DFT, the twiddle factor, and the pre-processing coefficient of the column 1-D DFT. We can therefore combine the three multiplications and replace them with a single multiplication. Depending on the tradeoff between the word-length of the transpose memory and the word-length of the multiplier, this multiplication can be placed either in front of or behind the transpose memory, as shown in Fig. 4.12 and Fig. 4.13.
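The role of the four Tij( ) blocks can be illustrated with a small sketch. It assumes the usual complex-arithmetic combination, namely real output = T_RR − T_II and imaginary output = T_RI + T_IR; the function names are hypothetical and each real cyclic convolution is computed directly here, not with a GDAU.

```python
def cyclic_conv(h, x):
    """Real N-point cyclic convolution y[k] = sum_j h[(k - j) mod N] * x[j]."""
    n = len(h)
    return [sum(h[(k - j) % n] * x[j] for j in range(n)) for k in range(n)]


def complex_cyclic_conv(h_re, h_im, x_re, x_im):
    """Complex cyclic convolution assembled from four real ones (RR, II, RI, IR)."""
    t_rr = cyclic_conv(h_re, x_re)  # real input with real coefficient
    t_ii = cyclic_conv(h_im, x_im)  # imaginary input with imaginary coefficient
    t_ri = cyclic_conv(h_im, x_re)  # real input with imaginary coefficient
    t_ir = cyclic_conv(h_re, x_im)  # imaginary input with real coefficient
    y_re = [a - b for a, b in zip(t_rr, t_ii)]  # assumed combination: RR - II
    y_im = [a + b for a, b in zip(t_ri, t_ir)]  # assumed combination: RI + IR
    return y_re, y_im


if __name__ == "__main__":
    y_re, y_im = complex_cyclic_conv([1, 2, 3, 4], [0, 1, 0, -1],
                                     [2, 0, 1, 3], [1, 1, 0, -2])
    print(y_re, y_im)
```

For real input data, as in the row stage of Fig. 4.10, `x_im` is zero and only the T_RR and T_RI results contribute.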

[Fig. 4.9 signal labels: input x(N₂n₁ + n₂), intermediate result G(n₂, k₁), twiddle-factor index n₂k₁, output Y(k₁ + N₁k₂)]

Fig. 4.9: Block diagram of the proposed variable-length DFT architecture.

Fig. 4.10: Architecture of 2-D DFT with real input.

[Fig. 4.11 block labels: real input; row 1-D DFT built from CORDIC/SMUL pre-processing, GDA-based variable-length T_RR( ), T_II( ), T_RI( ), and T_IR( ) blocks, and CORDIC/SMUL post-processing; serial multiplier for the twiddle factor; transpose memory; column 1-D DFT with the same block structure.]

Fig. 4.11: Architecture design of the 2-D DFT in cyclic convolution formulation.

[Fig. 4.12 block labels: real input; row 1-D DFT with CORDIC/SMUL and GDA-based variable-length T_RR( ), T_II( ), T_RI( ), and T_IR( ) blocks; serial multiplier; merged CORDIC/SMUL; transpose memory; column 1-D DFT with the same four GDA-based blocks.]

Fig. 4.12: Version 1 of the reduced architecture of 2-D DFT in cyclic convolution formulation.

[Fig. 4.13 block labels: merged CORDIC/SMUL, CORDIC/SMUL]

Fig. 4.13: Version 2 of the reduced architecture of 2-D DFT in cyclic convolution formulation.

To perform the variable-length DFT computation with identical hardware, we adopt the pseudocirculant matrix factorization algorithm to factorize the cyclic convolution Tij( ) of the 1-D DFT of each length into a composition of 8-point cyclic convolutions. For the 64-point cyclic convolution, as shown in Fig. 4.14, the input data matrix can be decomposed into an eight-by-eight block matrix. Since each block in the matrix retains the form of an 8-point cyclic convolution, we can allocate the computation of the eight blocks in a block row to eight 8-point GDAUs and sum the GDAU outputs to obtain eight outputs of the 64-point cyclic convolution. Observing the matrix form on the left side of Fig. 4.14, the computation of each block row, taken in rotated order, can be folded onto the same eight 8-point GDAUs. In total, eight iterations are needed to compute all the outputs of the 64-point cyclic convolution. The 32-point cyclic convolution is composed of a four-by-four pseudocirculant block matrix, as shown in Fig. 4.15, so every eight outputs of the 32-point cyclic convolution are obtained by summing the results of four 8-point cyclic convolutions. With the same amount of GDA hardware, two iterations are needed to compute all 32 outputs. In the same way, the 16-point cyclic convolution is composed of a two-by-two pseudocirculant block matrix. In the proposed design, the hardware is built from eight 8-point cyclic convolution modules. This hardware computes the 64 outputs of a 64-point cyclic convolution in eight iterations, the 32 outputs of a 32-point cyclic convolution in two iterations, the 16 outputs of a 16-point cyclic convolution in one half of an iteration, and the 8 outputs of an 8-point cyclic convolution in one eighth of an iteration. Thus, for the 64/256/1024/4096-point 1-D DFT, the row and column DFT lengths are 8/16/32/64, respectively, and the number of iterations on the identical hardware is 1/8/64/512.
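The block structure described above can be sketched in software. The following uses one standard polyphase form of the pseudocirculant factorization, which is an assumption here: the block ordering of Fig. 4.14 and Fig. 4.15 may differ, and the names `blocked_cyclic_conv` and `rotate` are illustrative. Each 8-point cyclic convolution stands in for one GDAU computation, and wrapped blocks receive a one-position in-block rotation.

```python
def cyclic_conv(h, x):
    """Direct N-point cyclic convolution y[k] = sum_j h[(k - j) mod N] * x[j]."""
    n = len(h)
    return [sum(h[(k - j) % n] * x[j] for j in range(n)) for k in range(n)]


def rotate(v, r):
    """Cyclic delay by r positions: out[k] = v[(k - r) mod len(v)]."""
    n = len(v)
    return [v[(k - r) % n] for k in range(n)]


def blocked_cyclic_conv(h, x, block=8):
    """N-point cyclic convolution as an m-by-m array of `block`-point ones (m = N/block)."""
    n = len(h)
    m = n // block
    hp = [h[i::m] for i in range(m)]  # polyphase components of the coefficients
    xp = [x[j::m] for j in range(m)]  # polyphase components of the input
    yp = [[0] * block for _ in range(m)]
    count = 0                         # number of 8-point cyclic convolutions used
    for i in range(m):
        for j in range(m):
            part = cyclic_conv(hp[i], xp[j])
            count += 1
            k = i + j
            if k >= m:                # wrapped block: rotate by one inside the block
                k -= m
                part = rotate(part, 1)
            yp[k] = [a + b for a, b in zip(yp[k], part)]
    y = [0] * n
    for k in range(m):
        y[k::m] = yp[k]               # interleave the phases back together
    return y, count


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        h = [(3 * i + 1) % 7 for i in range(n)]
        x = [(5 * i + 2) % 11 for i in range(n)]
        y, count = blocked_cyclic_conv(h, x)
        print(n, y == cyclic_conv(h, x), count, "blocks,", count / 8, "iterations")
```

With eight GDAUs per iteration, the block counts 1, 4, 16, and 64 correspond to one eighth, one half, two, and eight iterations, matching the figures quoted in the text.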

Fig. 4.14: Folding of the computation of each eight row blocks in 64-point cyclic convolution.

[Fig. 4.15 labels: coefficient vector 1, output vector 1]

Fig. 4.15: Folding of the computation of each four row blocks in 32-point cyclic convolution.

Because the number of iterations differs with the DFT length, the variable-length DFT design running on identical hardware must operate with different control states. Since the hardware in the proposed design can compute eight 8-point 1-D DFTs in each iteration, the 64-point 1-D DFT needs only one iteration to compute all the output data of the row or column DFT. For the 256-point 1-D DFT, each iteration computes two 16-point DFTs in each dimension, so the sixteen 16-point DFTs need eight iterations in total. Likewise, 64 iterations are needed for the 1024-point 1-D DFT and 512 iterations for the 4096-point 1-D DFT. Because the coefficients of the 8-, 16-, 32-, and 64-point DFTs are different, we use RAM instead of ROM so that the memory contents can be replaced for each DFT length. The partial products stored in this memory for the DA computation are downloaded from the main frame in the initialization phase. Since the 8-point GDAUs have thirty-six memory entries, each initial phase consumes thirty-six write cycles. Because the data rate and the DFT length in a communication system remain fixed as long as the environment condition is unchanged, the coefficients of the DFT with the chosen length need to be loaded into the memory of the variable-length DFT core only once. However, if the DFT length is larger than 64, then 4, 16, and 64 initial phases are required for the 256-, 1024-, and 4096-point DFT, respectively. All the coefficients of the DFTs with different lengths can be stored in advance in the low-cost memory of the main frame.
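The control parameters quoted above can be tabulated with a short sketch. The closed forms used here are assumptions chosen to reproduce the stated values: the iteration count is taken as (number of short DFTs) × (8-point blocks per DFT) / 8 GDAUs, and the number of initial phases as the number of blocks per short DFT. The function name `control_parameters` is hypothetical.

```python
GDAUS = 8                # eight 8-point GDAUs, as stated in the text
WRITES_PER_PHASE = 36    # thirty-six write cycles per initial phase, as stated


def control_parameters(short_len):
    """Iterations per stage and coefficient-loading cost for an N = short_len^2 DFT."""
    m = short_len // 8
    blocks_per_dft = m * m                      # 8-point cyclic convolutions per short DFT
    iterations = short_len * blocks_per_dft // GDAUS
    init_phases = blocks_per_dft                # assumption: one phase per coefficient block
    return {
        "dft_length": short_len * short_len,
        "iterations": iterations,
        "init_phases": init_phases,
        "write_cycles": init_phases * WRITES_PER_PHASE,
    }


if __name__ == "__main__":
    for short_len in (8, 16, 32, 64):
        print(control_parameters(short_len))
```

The write-cycle totals are simply thirty-six times the number of initial phases and are not stated in the text.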

[Fig. 4.16 block labels: (a) row stage: input data, pre-processing input buffer (IBUF groups), CORDIC/SMUL groups, PISO groups, 1-bit 3-D rotator, and the computing stage of GDA-based variable-length T_RR( ), T_II( ), T_RI( ), and T_IR( ) blocks producing Re[Output data] and Im[Output data]; (b) column stage: Re[Input data] and Im[Input data], PISO groups, 1-bit 3-D rotator, the same computing stage, and post-processing with CORDIC/SMUL groups and OBUF groups producing Re[Output data] and Im[Output data].]

Fig. 4.16: Detailed architecture of (a) the row 1-D DFT with input buffer and (b) the column 1-D DFT with output buffer.

Fig. 4.16 shows the row stage and the column stage of the proposed variable-length DFT design in more detail, including the input buffer (IBUF), serial multiplier (SMUL), parallel-in serial-out (PISO) module, 1-bit three-dimension (3-D) rotator, variable-length GDA-based module, and output buffer (OBUF). The DFT length in each stage can be configured as 8, 16, 32, or 64 points. In the following, we describe the detailed design of the modules in the proposed variable-length DFT.

As in most DA-based designs, the proposed design uses the input buffer of Fig. 4.17 (a) to store the input data serially, the parallel-in serial-out (PISO) module of Fig. 4.17 (b) to issue the DA input data in a word-parallel, bit-serial manner, and the output buffer of Fig. 4.17 (c) to output the data serially.

[Fig. 4.17 labels: (a) IBUF0 to IBUF7 fed by the input data; (b) PISO group 0 to PISO group 7; (c) output buffer groups]

Fig. 4.17: Detailed design of (a) input buffer groups, (b) PISO groups, and (c) output buffer groups in the proposed 1-D DFT architecture.

Regarding the input data permutation for the GDAUs, according to the formulation of the any-length cyclic convolution in (2.6), the input data of the eight 8-point GDAUs is block-rotated and in-block-rotated in each iteration. A 1-bit rotator is therefore needed to present the correct data at the GDAU inputs. Since the rotator must work with different lengths for the variable-length DFT, a dedicated 1-bit 3-D barrel rotator is designed as shown in Fig. 4.18 (a). The mode of the 1-bit 3-D rotator is determined by three variables: how many bits are rotated within a block, how many blocks are rotated in the cyclic convolution for the chosen DFT length, and which DFT length is chosen. Stage 1 performs the in-block rotation with 8-bit barrel rotators (BR). The block rotation is performed in stages 2 to 4. In stage 2, the barrel rotator group (BRG) with eight 2-bit barrel rotators is used for the 16-point DFT in each dimension of the 256-point DFT. In stage 3, the BRG with eight 4-bit barrel rotators is used for the 32-point DFT in each dimension of the 1024-point DFT. In stage 4, the BRG with eight 8-bit barrel rotators is used for the 64-point DFT in each dimension of the 4096-point DFT. Table 4.11 shows the condition of the BR in each stage for DFT lengths of 64, 256, 1024, and 4096. This dedicated 1-bit 3-D rotator permutes the data into the order required at the GDAU inputs for the proposed variable-length DFT design.

Table 4.11: Condition of BR in each stage for DFT with the lengths of 64, 256, 1024, and 4096.

| Length of DFT | Stage 1 | Stage 2 | Stage 3 | Stage 4 |
|---|---|---|---|---|
| 64 | P | P | P | P |
| 256 | R | R | P | P |
| 1024 | R | P | R | P |
| 4096 | R | P | P | R |

Note:

1. P denotes that the BR works in bypass mode.

2. R denotes that the BR works in rotation mode.
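The mode selection of Table 4.11 can be expressed as a behavioral sketch. Only the mode table is taken from the text; the lane ordering (eight blocks of eight 1-bit lanes), the rotation direction, and the grouping of adjacent blocks for the block rotation are assumptions, and this model does not represent the gate-level routing of Fig. 4.18. The name `rotator_3d` is hypothetical.

```python
# Stage modes from Table 4.11: 'P' = bypass, 'R' = rotate (stages 1 to 4).
STAGE_MODES = {64: "PPPP", 256: "RRPP", 1024: "RPRP", 4096: "RPPR"}
# Assumed number of 8-lane blocks rotated together for each DFT length.
BLOCKS_PER_GROUP = {64: 1, 256: 2, 1024: 4, 4096: 8}


def rotate(seq, r):
    """Rotate a list right by r positions."""
    r %= len(seq)
    return seq[-r:] + seq[:-r] if r else list(seq)


def rotator_3d(lanes, dft_len, bit_shift, block_shift):
    """Behavioral model: in-block rotation (stage 1), then block rotation (stage 2, 3, or 4)."""
    modes = STAGE_MODES[dft_len]
    blocks = [list(lanes[8 * i: 8 * i + 8]) for i in range(8)]
    if modes[0] == "R":                       # stage 1: rotate inside every block
        blocks = [rotate(b, bit_shift) for b in blocks]
    if "R" in modes[1:]:                      # one of stages 2-4: rotate whole blocks
        g = BLOCKS_PER_GROUP[dft_len]
        rotated = []
        for start in range(0, 8, g):
            rotated.extend(rotate(blocks[start:start + g], block_shift))
        blocks = rotated
    return [lane for b in blocks for lane in b]


if __name__ == "__main__":
    lanes = list(range(64))                   # lane labels instead of bit values
    print(rotator_3d(lanes, 256, 1, 1)[:16])
```

Using lane labels rather than bit values makes the resulting permutation visible; in hardware each lane carries one bit per cycle.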


Fig. 4.18: (a) design of the 1-bit 3-D rotator and the routing for (b) 2-bit BRG in stage 2, (c) 4-bit BRG in stage 3, and (d) 8-bit BRG in stage 4.

Following the 1-bit 3-D barrel rotator, the module of GDAUs computes, with identical hardware, all or part of the output data in each iteration of the variable-length DFT. As shown in Fig. 4.19, each GDAU performs the computation of an 8-point cyclic convolution. In the following stage, shown in Fig. 4.20, an adder-group tree sums the partial outputs of these GDAUs for the shortened cyclic convolutions when the length of the row or column DFT is larger than eight, where the different dashed lines denote the data-flows of the row or column DFT with different lengths. In each iteration, the numbers of output data computed by the identical computation resource for the 1-D DFT with lengths of 64/256/1024/4096 are 64/32/16/8, respectively. Because the number of GDAUs is limited, we place multiplexers of different widths to select the appropriate number of output data for each DFT length.
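The summation performed by the adder-group tree can be sketched as follows. The sketch assumes that consecutive GDAUs belong to the same shortened cyclic convolution, which may not match the wiring of Fig. 4.20; the function name `adder_group_tree` is illustrative.

```python
def adder_group_tree(gdau_outputs, short_len):
    """Sum the eight GDAU output vectors in groups of short_len / 8.

    gdau_outputs: eight lists of eight partial results, one list per GDAU.
    Returns 8 / (short_len / 8) vectors of eight final outputs each.
    """
    m = short_len // 8                      # GDAUs contributing to each output vector
    groups = [gdau_outputs[s:s + m] for s in range(0, 8, m)]
    return [[sum(column) for column in zip(*group)] for group in groups]


if __name__ == "__main__":
    partial = [[10 * g + k for k in range(8)] for g in range(8)]
    for short_len in (8, 16, 32, 64):
        result = adder_group_tree(partial, short_len)
        print(short_len, "->", sum(len(v) for v in result), "outputs per iteration")
```

The output counts 64, 32, 16, and 8 for short lengths 8, 16, 32, and 64 agree with the numbers given in the text for the 64/256/1024/4096-point DFT.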

Fig. 4.19: Detailed design of the variable-length GDA-based module used for the computation of Tij( ) in the proposed 1-D DFT architecture.

[Fig. 4.20 labels: the outputs from the eight GDAUs feed a tree of adder-groups and multiplexers (MUX) whose outputs go to SMUL group 0 through SMUL group 7]

Fig. 4.20: Data-flow of the adder-group tree following the GDAUs in the proposed variable-length DFT design.

As formulated in Chapter 3, the 1-D DFT in cyclic convolution form requires multiplications for pre- and post-processing. To reduce the hardware cost of these multiplications, we combine the post-processing multiplication of the row DFT and the pre-processing multiplication of the column DFT with the twiddle-factor multiplication, so that only one multiplier remains between the row DFT and the column DFT. Since the DA computation is serial in nature, a serial complex multiplier is a suitable choice for this multiplication.
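To illustrate why a serial multiplier matches the bit-serial DA data flow, the following is a generic shift-and-add sketch in which the data word is consumed one bit per step, least significant bit first. It is not the SMUL design of Fig. 4.21; the function names, the integer (fixed-point) operands, and the four-multiplication complex product are assumptions made for illustration.

```python
def serial_multiply(coeff, x, bits):
    """Multiply coeff by a two's-complement word x, consuming one bit of x per step."""
    ux = x & ((1 << bits) - 1)          # two's-complement bit pattern of x
    acc = 0
    for i in range(bits):               # one loop pass models one clock cycle
        if (ux >> i) & 1:
            term = coeff << i
            # The sign bit of a two's-complement word carries negative weight.
            acc += -term if i == bits - 1 else term
    return acc


def serial_complex_multiply(c_re, c_im, x_re, x_im, bits):
    """Complex product (c_re + j c_im)(x_re + j x_im) from four serial multiplications."""
    re = serial_multiply(c_re, x_re, bits) - serial_multiply(c_im, x_im, bits)
    im = serial_multiply(c_im, x_re, bits) + serial_multiply(c_re, x_im, bits)
    return re, im


if __name__ == "__main__":
    print(serial_multiply(3, -5, 8))
    print(serial_complex_multiply(3, -2, -5, 7, 8))
```

In the merged arrangement described above, `c_re` and `c_im` would hold the single precomputed coefficient that replaces the three cascaded multiplications.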

[Fig. 4.21 labels: SMUL group 0 to SMUL group 7, each SMUL group containing SMUL0 to SMUL7]

Fig. 4.21: Detailed design of the serial multiplier groups in the proposed 1-D DFT architecture.

Since the output data of the row DFT is out of order, as shown in Fig. 4.22, we reorder the data for use by the column DFT while writing it into the transpose memory by means of a dedicated address generator.

Fig. 4.22: The transpose memory with the specific address generator

Design evaluation

Based on the proposed GDA-based 1-D variable-length DFT architecture in Fig. 4.16, the number of cycles consumed for computing the 64/256/1024/4096-point DFT with the 8/16/32/64-point DFT in two dimensions is proportional to $O(8^{\log_2 N - 3} \times L)$, where N denotes the length of the 1-D DFT in each dimension (8, 16, 32, or 64) and L denotes the word-length of the GDA input data. Referring to the simulation results of the DFTs with lengths of 8, 16, 32, and 64, we can further evaluate the DFT designs with lengths of 128 (i.e., 8 × 16), 512 (i.e., 16 × 32), and 2048 (i.e., 32 × 64). However, since the cycle counts of the two stages differ in these designs, for example the 8-point and 16-point stages of the 128-point DFT, we must take the larger of the two.
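The stage-wise evaluation can be sketched with a simple counting model. The model is an assumption inferred from the iteration counts quoted earlier (eight GDAUs per iteration, (n/8)² blocks per n-point DFT); it estimates iterations only and does not reproduce the simulated cycle counts. The function names are hypothetical.

```python
def stage_iterations(short_len, num_dfts, gdaus=8):
    """Iterations for num_dfts short DFTs of length short_len on `gdaus` 8-point units."""
    blocks_per_dft = (short_len // 8) ** 2
    total_blocks = num_dfts * blocks_per_dft
    return -(-total_blocks // gdaus)        # ceiling division


def dft_iterations(n1, n2):
    """Take the larger of the two stages of an (n1 * n2)-point DFT."""
    row = stage_iterations(n1, n2)          # n2 row DFTs of length n1
    col = stage_iterations(n2, n1)          # n1 column DFTs of length n2
    return max(row, col)


if __name__ == "__main__":
    for n1, n2 in [(8, 8), (8, 16), (16, 16), (16, 32), (32, 32), (32, 64), (64, 64)]:
        print(n1 * n2, dft_iterations(n1, n2))
```

For the square cases the model reduces to 8^(log₂N − 3) with N the short length, which is the growth rate quoted above; the mixed-length values are estimates only.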

We have evaluated the proposed design with the UMC 0.18 um cell library. For a fair comparison with the existing long-length and variable-length FFT designs [67]-[71], we eliminate the effect of different technologies by normalizing all design areas with the normalized index [72] in (4.36). In addition to the advantages of short latency and high hardware utilization efficiency of the GDA-based design, which agree with the hardware cost analysis above, the simulation results in Table 4.12 reveal that the power-of-two variable-length DFT realized with the proposed decomposition approach and GDA design achieves competitive hardware cost under the same throughput rate, especially when the DFT length is between 64 and 512. Thus the proposed variable-length DFT can be a more efficient dedicated design for ADSL system applications.

\[
\text{Normalized Area} = \frac{\text{Area}}{(\text{Technology} / 0.18\,\mu\text{m})^{2}} \tag{4.36}
\]

Table 4.12: Comparison of the existing FFT designs and our DFT design

| | Bidet [67] | Jia [68] | Kuo [69] | Pao [70] | Lin [71] | Ours |
|---|---|---|---|---|---|---|
| DFT size | 8192 | 8192 | 64 ~ 2048 | 512 ~ 8192 | 512 ~ 2048 | 64 ~ 4096 |
| Algorithm | Radix-4 FFT | Radix-2/4/8 FFT | Cached FFT | Radix-4 DHT-based FFT | Radix-2/4/8 FFT | Cooley-Tukey / cyclic convolution / pseudocirculant factorization / GDA DFT |
| Word-length (bit) | 12 | 12 | 16 | 22 | 12 | 20 |
| Process (um) | 0.5 | 0.6 | 0.35 | 0.25 | 0.35 | 0.18 |
| Clock rate (MHz) | 20 | 20 | 60 | 35 | 45 | 85 |
| Throughput (sample/cycle) | 1 | 1 | 1 | 1 | 1 | 5.33 ~ 0.67 |
| Latency (cycle) | N | N | N | N | N | 60 |
| Area (mm²) | 100 | 107 | 12.25 | 25 | 13.05 | 7.79 |
| Normalized area | 12.96 | 13.87 | 3.24 | 12.96 | 3.45 | 7.79 |
| Normalized area/throughput | 12.96 | 13.87 | 3.24 | 12.96 | 3.45 | 1.46 ~ 11.62 |
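The normalization of the area figures can be reproduced with a short sketch. It assumes that area scales with the square of the feature size relative to 0.18 um, which reproduces the tabulated values for the entries below; the Jia [68] entry is left out because its tabulated value is not reproduced by this formula with the listed 0.6 um process. The function name `normalized_area` is illustrative.

```python
def normalized_area(area_mm2, process_um, ref_um=0.18):
    """Scale an area to the reference process, assuming quadratic scaling with feature size."""
    return area_mm2 / (process_um / ref_um) ** 2


# (area in mm^2, process in um) taken from Table 4.12
DESIGNS = {
    "Bidet [67]": (100.0, 0.5),
    "Kuo [69]": (12.25, 0.35),
    "Pao [70]": (25.0, 0.25),
    "Lin [71]": (13.05, 0.35),
    "Ours": (7.79, 0.18),
}

if __name__ == "__main__":
    for name, (area, process) in DESIGNS.items():
        print(f"{name}: {normalized_area(area, process):.2f}")
    # Normalized area per unit throughput for the proposed design at 5.33 samples/cycle.
    print(f"Ours at 5.33 samples/cycle: {normalized_area(7.79, 0.18) / 5.33:.2f}")
```

Dividing the normalized area by the throughput in samples per cycle gives the last row of the table.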

Chapter 5 Conclusion

In this chapter, we summarize the main results and contributions presented in this dissertation and point out some future research directions.

5.1 Contributions

In this dissertation, a complete bit-level, hardware-efficient group distributed arithmetic (GDA) design approach has been presented for the discrete sinusoidal transforms (DSST's). The proposed DA design approach for the long-length cyclic convolution of the DSST's involves a new hardware-efficient GDA datapath and the essential partitioning schemes, where the Agarwal-Cooley algorithm and the pseudocirculant matrix factorization algorithm are adopted for cyclic convolutions of prime length and non-prime length, respectively. Furthermore, for the long-length DSST's designs, we combine the proposed design approach with fast transform algorithms, such as the Cooley-Tukey algorithm and the prime factor algorithm, to achieve low hardware cost.

In the proposed bit-level design approach, we adopt distributed arithmetic (DA) computation and exploit the favorable features of the cyclic convolution to facilitate an efficient DA realization of 1-D N-point DSST's using a very small memory module, a barrel shifter, and N accumulators. The proposed GDA design re-arranges the contents of the memory into a few groups such that all the elements in a group can be accessed simultaneously while accumulating all the DSST's outputs, which increases the memory utilization. This design improves the complexity of the DA design from $O(2^{N})$ to $O(2^{N-\log_2 N} + N + 2)$.

To further reduce the hardware cost of the DSST's design, we exploit the symmetry property of the DFT coefficients with the proposed GDA design approach so that only half of the contents need to be stored for the DFT, which further reduces the memory size by a factor of two. For the DCT design, we exploit the symmetry property of the DCT coefficients, merge the elements in the matrix of the DCT kernel, separate the DCT kernel into two perfect cyclic forms, and partition the memory contents into groups to facilitate an efficient realization of the 1-D N-point DCT kernel using (N-1)/2 adders or subtractors, one small memory module, an (N-1)/2-bit barrel shifter, and (N-1)/2+1 accumulators. Compared with the existing systolic array designs and DA-based designs, the realizations of the 1-D DFT, DHT, and DCT with the proposed GDA-based design approach reduce the delay-area product by more than 29% according to the Avanti 0.35 um CMOS cell library. However, for the DCT and DHT of non-prime length in the cyclic convolution formulation, there is an inherent overhead for handling numerical instability, so the proposed design approach is not efficient in this case.

Finally, combining the proposed GDA design approach with the suggested long-length transform decomposition methodology, a variable-length DFT design has been proposed and implemented for the popular application of power-of-two-length DFTs in communication systems. The proposed design can flexibly compute the 1-D 64/128/256/512/1024/2048/4096-point DFT by cascading the 1-D short-length DFTs and summing the partitioned short-length cyclic convolutions in each stage of the cascaded DFT. In addition, the proposed hardware-efficient design approach can be applied to designs of lengths other than powers of two. Compared with the existing long-length and variable-length FFT designs, in addition to the advantages of short latency and high hardware utilization efficiency, the proposed power-of-two variable-length DFT design achieves competitive hardware cost under the same throughput rate.

5.2 Future Research Directions

The presented GDA design approach involves cyclic convolution, its
