The Comparison Discussion of MIMO-FFT Architecture


4.4 The Comparison Discussion of MIMO-FFT Architecture

Considering the most efficient pipeline FFT processor for single-input single-output OFDM (SISO-OFDM) WLAN applications, He et al. presented several reliable architectures and a detailed comparison of their hardware costs [31]. The comparison indicates that the radix-2² single-path delay feedback (R2²SDF) architecture has the highest butterfly utilization (50%) and the lowest hardware resource consumption [31, 34].

However, the radix-2² based algorithm has a higher complex multiplicative complexity than high-radix and other mixed-radix FFT algorithms, as revealed in Table 1. Furthermore, the SDF based architecture has the lowest throughput rate, R, which cannot meet the requirements of MIMO-OFDM applications. For MIMO-OFDM WLAN applications, the comparison results of [26] indicate that the R4MDC architecture yields the most efficient 64-point FFT/IFFT processor for the 4×4 MIMO-OFDM WLAN system. Although several R4MDC based 64-point FFT chips have been discussed [40, 61, 72], only the design of Swartzlander et al. [40] can operate at the data sampling frequency in 4×4 MIMO-OFDM systems. Notably, Hui et al. [56] proposed a digit-serial architecture based on radix-4 decomposition, with higher hardware utilization (100%) than the R4MDC based design [40] in the SISO-OFDM system, and made good tradeoffs between digit size and throughput rate in the SISO system. However, the radix-4 based design also has a higher complex multiplicative complexity than high-radix and other mixed-radix FFT algorithms. This work focuses on a high-throughput design with low multiplicative complexity to fit the requirements of 2×2 and 4×4 MIMO-OFDM systems.

This section presents detailed comparisons among the two proposed architectures, R28MDF and R28MDC, and several well-known FFT architectures in the 2×2 and 4×4 MIMO-OFDM systems. An effective design is dictated by considerations of area, timing, power consumption and ease of reuse. In this investigation, the systems were compared using five indices (MIMO-FFT architecture, complex multiplicative complexity, throughput rate, utilization and cost) to assess the effectiveness of the FFT/IFFT processors. To estimate the area index of the different architectures, the conventional comparative methodology [26], with the unit of equivalent adders, was adopted. Based on the implementation results of our process, one complex multiplier is equivalent to 50 complex adders if it uses 16-bit precision and the scheme of three real multiplications and five real additions. One 16-bit complex memory word was converted to 1.3 complex adders. The area report of the logic synthesis tool demonstrates that one proposed MU is equivalent to 3.2 complex multipliers. Furthermore, the area of the proposed constant multiplier is one-eighth of that of the proposed MU; that is, one constant multiplier is approximately equivalent to 0.4 complex multipliers.
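As a rough illustration of the equivalent-adder bookkeeping, the following Python sketch converts component counts into an area index. The two weights (50 adders per complex multiplier, 1.3 adders per complex memory word) are taken from the text; the helper name `area_index` and its arguments are hypothetical, and the sketch deliberately covers only architectures without constant multipliers.

```python
# Weights taken from the text (16-bit precision), in equivalent complex adders.
ADDERS_PER_COMPLEX_MULTIPLIER = 50.0
ADDERS_PER_MEMORY_WORD = 1.3


def area_index(complex_multipliers, complex_adders, memory_words=0):
    """Return the area in equivalent complex adders (hypothetical helper)."""
    logic = complex_multipliers * ADDERS_PER_COMPLEX_MULTIPLIER + complex_adders
    return logic + memory_words * ADDERS_PER_MEMORY_WORD


# Component counts as listed for R2SDF [42] and R4MDC [40] in Table 6.
r2sdf_logic = area_index(20, 48)
r2sdf_total = area_index(20, 48, memory_words=252)
r4mdc_logic = area_index(6, 24)
r4mdc_total = area_index(6, 24, memory_words=348)

print(round(r2sdf_logic, 1), round(r2sdf_total, 1))
print(round(r4mdc_logic, 1), round(r4mdc_total, 1))
```

With these counts the sketch yields 1048 (1375.6) and 324 (776.4), which agree with the corresponding area entries in Table 6.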

4.4.1 2×2 MIMO-OFDM WLAN Application

Table 5: Comparison results of the 64-point FFT/IFFT chip designs in the 2×2 MIMO-OFDM system.

| Architecture | MIMO-FFT architecture (frequency, MHz) | Complex multiplication # | Throughput rate | Butterfly utilization | Cost: ROM # | Cost: complex multipliers # | Cost: constant multipliers # | Cost: area without memory (area with memory) |
|---|---|---|---|---|---|---|---|---|
| Modified R2²SDF [34] | Parallel Multi-Path (20) | 76 | R | 50% | 2 | 4 | 0 | 224 (390.4) |
| R28SDF [32] | Parallel Multi-Path (20) | 48 | R | 25% | 4 | 4 | 8 | 304 (470.4) |
| R2MDC [67] | Serial Blockwise (20) | 98 | 2R | 100% | 4 | 4 | 0 | 424 (834.8) |
| R4MDC [40] | Serial Blockwise (20) | 76 | 4R | 50% | 6 | 6 | 0 | 324 (776.4) |
| Modified R4MDC [61] | Serial Multi-Stream (80) | 76 | 4R | 50% | 4 | 4 | 0 | 340 (1120) |
| Modified R8MDC [41] | Serial Blockwise (20) | 48 | 5.33R | 25% | 0 | 3.2 | 4 | 228 (709) |
| Proposed R28MDF | Serial Blockwise (20) | 48 | 2R | 100% | 0 | 3.2 | 1 | 197 (446.6) |

Table 5 presents the comprehensive comparison results of seven existing 64-point FFT/IFFT processors and the proposed R28MDF design in terms of MIMO-FFT architecture, complex multiplicative complexity, throughput rate, butterfly utilization, the number of ROMs, complex multipliers and constant multipliers, and the area index. Table 5 shows that the proposed R28MDF and the R8MDC designs achieve the lowest complex multiplicative complexity among the tested designs. In terms of butterfly utilization, the proposed R28MDF design achieves the highest butterfly utilization (100%) among those tested. The R28SDF [32] and R2²SDF [34] designs clearly have the lowest throughput rate, R, of all the designs.

Significantly, the R8MDC-based FFT/IFFT architecture in [41] has two butterfly stages, which need only 12 and 11 clock cycles, respectively. Based on the serial blockwise architecture, the parallel input data for each butterfly stage in [41] can be provided simultaneously to achieve a higher throughput rate. Table 5 shows that the modified R8MDC [41] and R4MDC [40] designs attain higher throughput rates of 5.33R and 4R, respectively, but both have lower butterfly utilization and higher chip cost than the proposed R28MDF design. Sansaloni et al. [26] indicated that MIMO-FFT processors with throughput rates of 2R and 4R and the least amount of hardware are more appropriate than other architectures for 2×2 and 4×4 MIMO-OFDM applications, respectively.

Based on the serial blockwise architecture, the proposed R28MDF design incurs a small cost penalty of two IUs and one DFM memory in the 2×2 MIMO-OFDM system. When the memory area is considered, the area index of the R28MDF design is 14.4% higher than that of the R2²SDF [34] design. However, the R2²SDF design increases the multiplicative complexity by 58.3% and reduces the butterfly utilization to 50% of that of the R28MDF design. Furthermore, the proposed design and that of Maharatna et al. [41], which adopt only one parallel type multiplier unit, do not require any coefficient ROM.
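The two percentages quoted for the R28MDF and R2²SDF designs follow directly from the Table 5 entries. The short Python check below recomputes them; the helper `percent_increase` is a hypothetical name introduced only for this illustration.

```python
def percent_increase(value, reference):
    """Percentage by which `value` exceeds `reference`."""
    return (value / reference - 1.0) * 100.0


# Table 5 entries: area with memory and number of complex multiplications.
area_r28mdf, area_r22sdf = 446.6, 390.4
mults_r22sdf, mults_r28mdf = 76, 48

area_penalty = percent_increase(area_r28mdf, area_r22sdf)
mult_penalty = percent_increase(mults_r22sdf, mults_r28mdf)
print(f"area penalty of R28MDF: {area_penalty:.1f}%")
print(f"extra multiplications of R2^2SDF: {mult_penalty:.1f}%")
```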

Following a comprehensive comparison between the different architectures, this investigation demonstrates that the proposed R28MDF implementation mitigates the chip cost problem associated with the R8MDC and R4MDC architectures, the low throughput rate problem of the R2²SDF and R28SDF architectures, and the high multiplicative complexity problem of the R2²SDF and R2MDC architectures. Thus, the proposed R28MDF design makes an effective tradeoff between complex multiplicative complexity, throughput rate, butterfly utilization and cost for the 2×2 MIMO-OFDM application.

4.4.2 4×4 MIMO-OFDM WLAN Application

For a 4×4 MIMO-OFDM system, Table 6 presents the comprehensive comparison results of several pipeline FFT/IFFT architectures in terms of the MIMO-FFT architecture, throughput rate, complex multiplicative complexity, the utilization of all components, the number of complex multipliers and complex adders, the memory size, and the area index of the entire system. Table 6 shows that the proposed R28MDC design achieves the lowest complex multiplicative complexity among the tested designs. Furthermore, the proposed R28MDC and the R4MDC [40] achieve the highest utilization (100%) for all components; thus the R28MDC and R4MDC designs are the best among all pipeline architectures tested for the 4×4 MIMO-OFDM application. Although the R4MDC architecture [40] achieves 100% utilization for all components, its chip area is 25.6% larger than that of the R28MDC architecture when the memory cost is considered. Regardless of whether the memory cost is considered, the proposed R28MDC architecture has the smallest chip area among all pipeline architectures tested in the 4×4 MIMO-OFDM system. The R28MDC architecture does not require any coefficient ROM, which is a further improvement over the R4MDC architecture. Thus, the R28MDC architecture achieves the lowest complex multiplicative complexity, an appropriate throughput of 4R, the highest utilization for all components and the lowest chip cost, making it very suitable for the 4×4 WLAN MIMO-OFDM application.

Table 6: Comparison results of the 64-point pipelined FFT/IFFT architecture in the 4×4 MIMO-OFDM system.

| Pipeline architecture | MIMO-FFT architecture | Complex multiplication # | Throughput rate | Complex multiplier # (utilization) | Complex adder # (butterfly utilization) | Memory size (utilization) | Area without memory (area with memory) |
|---|---|---|---|---|---|---|---|
| R2SDF [42] | Parallel Multi-Path | 98 | R | 20 (50%) | 48 (50%) | 252 (100%) | 1048 (1375.6) |
| R2²SDF [34] | Parallel Multi-Path | 76 | R | 8 (75%) | 48 (50%) | 252 (100%) | 448 (775.6) |
| R2³SDF [31] | Parallel Multi-Path | 48 | R | 8 (87.5%) | 48+16T (50%) | 252 (100%) | 528 (855.6) |
| R24SDF [68] | Parallel Multi-Path | 76 | R | 8 (75%) | 48 (50%) | 252 (100%) | 448 (775.6) |
| R4SDF [69] | Parallel Multi-Path | 76 | R | 8 (75%) | 96 (25%) | 252 (100%) | 496 (823.6) |
| R4SDC [70] | Parallel Multi-Path | 76 | R | 8 (75%) | 36 (25%) | 504 (100%) | 436 (1091.2) |
| R28SDF [32] | Parallel Multi-Path | 48 | R | 8 (12.5%) | 64+8T (25%) | 252 (100%) | 504 (831.6) |
| R2MDC [67] | Parallel Multi-Path | 98 | 2R | 8 (100%) | 24 (100%) | 316 (100%) | 424 (834.8) |
| R2³MDC [36] | Parallel Multi-Path | 48 | 2R | 8 (87%) | 24+8T (100%) | 316 (100%) | 464 (874.8) |
| R24MDC [71] | Parallel Multi-Path | 76 | 2R | 16 (75%) | 56 (71.2%) | 380 (100%) | 856 (1350) |
| R4MDC [40] | Serial Blockwise | 76 | 4R | 6 (100%) | 24 (100%) | 348 (100%) | 324 (776.4) |
| Modified R4MDC [61] | Serial Multi-Stream | 76 | 4R | 4 (100%) | 80+12T (100%) | 600 (100%) | 340 (1120) |
| Modified R8MDC [41] | Serial Blockwise | 48 | 5.33R | 3.2 (75%) | 48+4T (75%) | 370 (75%) | 228 (709) |
| Proposed R28MDC | Serial Blockwise | 48 | 4R | 3.2 (100%) | 32+2T (100%) | 320 (100%) | 202 (618) |

4.5 Summary

This work proposes a hardware-oriented approach that minimizes the complex multiplicative complexity and area cost and achieves 100% butterfly utilization with an appropriate throughput rate. By adopting the proposed R8-FFT unit combined with the MAW method, two efficient serial blockwise 64-point FFT/IFFT processors are constructed for the 2×2 and 4×4 MIMO-OFDM WLAN systems. For the 2×2 MIMO-OFDM system, the proposed R28MDF design has the best performance in terms of the lowest complex multiplicative complexity, an appropriate throughput rate of 2R, the highest butterfly utilization and the fewest complex multipliers, when compared with other existing 64-point FFT/IFFT processor architectures. For the 4×4 MIMO-OFDM system, the proposed R28MDC outperforms existing FFT/IFFT pipeline processor architectures and has the lowest complex multiplicative complexity, an appropriate throughput rate of 4R, the highest utilization rate (100%) of all components and the lowest hardware cost. According to the IEEE 802.11n standard [23], the 128-point and 64-point FFT/IFFT computations with 1–4 simultaneous data sequences must be completed within 3.6 or 4.0 µs. In total, eight operational modes of the FFT/IFFT processor are required by the IEEE 802.11n standard. An effective reconfigurable FFT/IFFT processor [73] that supports the eight operational modes of the IEEE 802.11n standard, consumes little hardware and power, and is easily reused is an important topic for future work.

Chapter 5 Long-Length based Effective Pipeline FFT/IFFT Processor

To demonstrate high efficiency for long-length FFT/IFFT computations, the proposed architecture focuses on the design of a 4096-point FFT/IFFT processor that ensures reasonable operating times at low chip cost and features a high hardware utilization rate. In this chapter, two highly effective 4096-point pipeline FFT/IFFT processors, namely the R4²SDF and R4³SDF designs, are presented; they achieve multiplicative complexities as low as those of the radix-16 and radix-64 based algorithms while using only radix-4 butterflies. The results of a comprehensive comparison further indicate that the proposed R4²SDF and R4³SDF based pipeline processors achieve higher utilization with a smaller hardware requirement than R2²SDF [31, 34] and other pipeline processors in the 4096-point FFT/IFFT computation, and thus have higher hardware efficiency. The proposed architectures are therefore very appropriate for long-length FFT/IFFT systems. This chapter is organized as follows. New R4²SDF and R4³SDF FFT/IFFT algorithms are given in Section 5.1. Section 5.2 presents the proposed R4²SDF and R4³SDF VLSI architectures. The finite word-length analysis is given in Section 5.3 and indicates that the proposed architectures achieve satisfactory system performance. Section 5.4 tabulates the comparison results in terms of hardware utilization and cost to demonstrate the high cost-efficiency of the proposed architectures. The chip implementation is discussed in Section 5.5. Section 5.6 draws conclusions.

5.1 New Radix-4² and Radix-4³ based FFT/IFFT Algorithm

where the butterfly structure of the first stage takes the form

4) Following a similar decomposition procedure, Eq. (78) can be decomposed as

Meanwhile, the butterfly structure of the second stage can be obtained as



written in (81). Three full complex multipliers from the second butterfly stage can be simplified as one single constant multiplier in the proposed R4²SDF architecture. The constant multiplier cost can be further reduced by applying the subexpression elimination algorithm. The detailed hardware structure of the constant multiplier is described in the next section. The second radix-4 butterfly structure in (81) is the same as the first radix-4 butterfly structure in (79) after simplification of the common factor of the constant multiplier. The complete radix-4² decimation-in-frequency (DIF) FFT algorithm is obtained by applying the CFA procedure recursively to the remaining FFTs of length N/16 in (80), as illustrated in Fig. 21. Figure 21 indicates that the proposed radix-4² algorithm decomposes the N-point FFT computation into a cascade of log₁₆N radix-16 based butterfly (R16-BF) computations, each of which can be split into two cascaded radix-4 based butterfly (R4-BF) computations as depicted in (79) and (81). When the variables k₁, k₂ and k₃ are treated as constants for each single output X[k₁ + 4k₂ + 16k₃] as depicted in (78) and (80), the summation ranges indicate that the required computation results of the first and second radix-4 butterfly stages are N/4 and N/16, respectively, as depicted in Fig. 21. The radix-4² algorithm has the same multiplicative complexity as the radix-16 algorithm, but still retains the radix-4 butterfly structure. Significantly, the radix-16 algorithm clearly has a lower multiplicative complexity than other low-radix algorithms, such as the radix-2² algorithm. For instance, the numbers of complex multiplications of the 256-point FFT computation adopting the radix-2² and radix-4² algorithms are 1539 and 224, respectively. Thus, the proposed design based on the new radix-4² algorithm has a multiplication complexity 85.4% lower than that of the R2²SDF design [31][34]. Furthermore, as mentioned above, the radix-4² algorithm does not require any multiplication in the single butterfly structure.

Fig. 21: The CFA decomposition procedure of the proposed radix-4² based N-point FFT algorithm.
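The idea that one radix-16 butterfly can be split into two cascaded radix-4 butterflies with only W₁₆ constant multiplications in between can be checked numerically. The Python sketch below is a minimal illustration for a single 16-point block, assuming the index maps n = 4n₁ + n₂ and k = k₁ + 4k₂; the function names are hypothetical and the code is not the derivation of (78) to (81) itself.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def radix4_butterfly(a, b, c, d):
    """4-point DFT; its twiddles are only +1, -1, +j and -j."""
    return [a + b + c + d,
            a - 1j * b - c + 1j * d,
            a - b + c - d,
            a + 1j * b - c - 1j * d]


def radix16_butterfly(x):
    """16-point DFT as two cascaded radix-4 stages (n = 4*n1 + n2, k = k1 + 4*k2)."""
    assert len(x) == 16
    w16 = [cmath.exp(-2j * cmath.pi * e / 16) for e in range(16)]
    # First radix-4 stage: one butterfly per n2 over inputs spaced by 4.
    first = [radix4_butterfly(x[n2], x[4 + n2], x[8 + n2], x[12 + n2])
             for n2 in range(4)]  # indexed as first[n2][k1]
    out = [0j] * 16
    for k1 in range(4):
        # Constant multiplications by W16^(n2*k1) between the two stages.
        scaled = [first[n2][k1] * w16[n2 * k1] for n2 in range(4)]
        # Second radix-4 stage produces X[k1 + 4*k2] for k2 = 0..3.
        second = radix4_butterfly(*scaled)
        for k2 in range(4):
            out[k1 + 4 * k2] = second[k2]
    return out


samples = [complex(0.5 * i - 1.0, (i * i) % 5) for i in range(16)]
max_error = max(abs(a - b) for a, b in zip(radix16_butterfly(samples), dft(samples)))
print(max_error < 1e-9)
```

Only the exponents n₂k₁ with n₂ = 1, 2, 3 lead to non-unity factors, which is why three twiddle multiplications per radix-16 butterfly can share one constant multiplier in a serial data stream.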

5.1.2 Radix-4² based IFFT Formula

Following the similar procedure, the radix-4² IFFT algorithm can be obtained as below. The IFFT of the N-point input X[k] is given by

the IFFT derivation results can be written as


where the butterfly structure of the first and second stage has the form


Notably, the only differences between the FFT and IFFT algorithms are the sign bits, as given in (79), (81), (84a) and (84b). Therefore, the pipeline FFT/IFFT processor can be easily implemented with a single module by controlling the sign coefficient. Additionally, the proposed pipeline IFFT processor has a butterfly structure and a single constant multiplier structure similar to those of the proposed pipeline FFT processor; the constant multiplier replaces the three multipliers W₁₆⁻ⁿ¹, W₁₆⁻²ⁿ² and W₁₆⁻³ⁿ³.

5.1.3 Radix-4³ based FFT/IFFT Formula

Applying another 4-dimensional linear index map in (76), the parameters n and k could be expressed as the combinations of n₁, n₂, n₃, n₄ and k₁, k₂, k₃, k₄, respectively.

where the butterfly structure of each stage takes the form

The first butterfly stage:

The second butterfly stage:

The third butterfly stage:

The radix-4³ algorithm has as few complex multiplications as the radix-64 algorithm, but still retains the simple radix-4 butterfly structure. For instance, the numbers of complex multiplications in the 4096-point FFT computation adopting the radix-2², radix-4² and radix-4³ algorithms are 13996, 7425 and 3969, respectively. Thus, the proposed radix-4³ algorithm has a multiplication complexity 71.6% lower than that of the radix-2² algorithm [31, 34]. Significantly, the radix-4³ algorithm clearly has a lower multiplicative complexity than the proposed radix-4² algorithm and other low-radix algorithms. Using the same radix-4 based butterfly architecture with only some sign inversions, the radix-4³ DIF IFFT computation can be obtained.
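The 4096-point figures for the radix-16 and radix-64 equivalents can be reproduced with a simple recursive count. The Python sketch below assumes a counting convention that is not stated in the text: every inter-stage twiddle factor with a nonzero exponent costs one complex multiplication. The function name is hypothetical, and the radix-2² figure of 13996 is taken from the text rather than recomputed.

```python
def twiddle_multiplications(n, radix):
    """Count inter-stage twiddle multiplications of an n-point FFT.

    Assumed convention: the n-point FFT is split into `radix` sub-FFTs of
    n/radix points, and each twiddle W_n^(a*b) with a != 0 and b != 0 costs
    one complex multiplication.
    """
    if n <= radix:
        return 0
    sub = n // radix
    return (radix - 1) * (sub - 1) + radix * twiddle_multiplications(sub, radix)


radix16_count = twiddle_multiplications(4096, 16)  # same complexity as radix-4^2
radix64_count = twiddle_multiplications(4096, 64)  # same complexity as radix-4^3
reduction = (1.0 - radix64_count / 13996) * 100.0  # 13996: radix-2^2 figure from the text
print(radix16_count, radix64_count, f"{reduction:.1f}%")
```

Under this assumption the counts are 7425 and 3969, and the reduction relative to 13996 is 71.6%, in agreement with the values quoted above.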

5.2 Pipeline 4096-Point R4²SDF and R4³SDF based FFT/IFFT VLSI Architecture

Based on the newly proposed radix-4² and radix-4³ DIF FFT algorithms, the novel R4²SDF and R4³SDF architectures supporting the 4096-point FFT/IFFT computations are shown in Fig. 22 and 23, respectively. Both proposed architectures require six butterfly stages with 4095-word shift registers. The R4²SDF based 4096-point FFT/IFFT pipeline processor requires three constant multipliers and two complex multipliers. The R4³SDF based 4096-point FFT/IFFT pipeline processor requires four constant multipliers and one complex multiplier. Compared with the R4²SDF design, the R4³SDF design replaces one complex multiplier with one constant multiplier in the 4096-point FFT/IFFT computation. The detailed operations of each element are described as follows.

[Figure 22: six cascaded R4-BF stages, Stage I to Stage VI, each with three feedback shift registers of 1024, 256, 64, 16, 4 and 1 words, respectively.]

Fig. 22: Block diagram of the R4²SDF-based 4096-point FFT/IFFT VLSI architecture.

[Figure 23: six cascaded R4-BF stages, Stage I to Stage VI, each with three feedback shift registers of 1024, 256, 64, 16, 4 and 1 words, respectively.]

Fig. 23: Block diagram of the R4³SDF-based 4096-point FFT/IFFT VLSI architecture.

5.2.1 Radix-4 Butterfly

The derivation results of the radix-4² and radix-4³ algorithms reveal that the FFT/IFFT butterfly computations in both (78) and (86) can be computed with the same radix-4 butterfly architecture. Notably, the radix-4 butterfly structure requires only trivial multiplications, which involve real-imaginary swapping and sign inversion, and does not require any complex multiplication. Figure 24 illustrates the proposed radix-4 butterfly structure, which includes only four four-input complex adders. Without any complex multiplier, the radix-4 based butterfly structure is more cost-efficient than higher-radix based butterfly structures.

Moreover, the proposed radix-4² algorithm has the same complex multiplication complexity as the radix-16 algorithm, and the radix-4³ algorithm has as few complex multiplications as the radix-64 algorithm. Thus, the two proposed pipeline architectures combine the low multiplicative complexity of high-radix algorithms with the cost efficiency of lower-radix architectures.

Fig. 24: Block diagram of the radix-4 butterfly architecture.
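The trivial multiplications of the radix-4 butterfly can be illustrated in a few lines of Python. This is a behavioural sketch only, assuming floating-point complex samples; the helper names and the `inverse` flag (standing in for the sign control that selects FFT or IFFT) are hypothetical and do not describe the circuit of Fig. 24 in detail.

```python
import cmath


def times_minus_j(z):
    """Multiply by -j with a real/imaginary swap and one sign inversion."""
    return complex(z.imag, -z.real)


def times_plus_j(z):
    """Multiply by +j with a real/imaginary swap and one sign inversion."""
    return complex(-z.imag, z.real)


def radix4_butterfly(x0, x1, x2, x3, inverse=False):
    """Radix-4 butterfly built from four four-input complex additions."""
    if inverse:
        rot, rot_back = times_plus_j, times_minus_j
    else:
        rot, rot_back = times_minus_j, times_plus_j
    return [x0 + x1 + x2 + x3,
            x0 + rot(x1) - x2 + rot_back(x3),
            x0 - x1 + x2 - x3,
            x0 + rot_back(x1) - x2 + rot(x3)]


# Reference check against the 4-point DFT definition.
data = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 2 + 2j]
reference = [sum(data[n] * cmath.exp(-2j * cmath.pi * n * k / 4) for n in range(4))
             for k in range(4)]
forward = radix4_butterfly(*data)
print(all(abs(a - b) < 1e-12 for a, b in zip(forward, reference)))
```

Switching between the forward and inverse butterfly changes only which of the two swap-and-negate helpers is applied, which mirrors the single-module FFT/IFFT sign control mentioned in Section 5.1.2.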

5.2.2 Memory Structure

The memory structure of each butterfly stage is well known to be an important issue for an effective pipeline FFT/IFFT processor. In this work, the delay feedback based memory structure is adopted. To compute the radix-4 based butterfly, the input data and the intermediate results have to be reordered into four concurrent data streams using memory, as shown in Fig. 24. In the radix-4 butterfly structure, the four proposed operation modes complete the data reordering and the butterfly computation, as shown in Fig. 25(a). Operation modes 0–2 are used for the data reordering, and operation mode 3 is used for the FFT/IFFT computation. Each radix-4 butterfly unit applies three parallel First-In First-Out (FIFO) shift registers to store the serial data input and the butterfly output in the feedback paths, as presented in Fig. 25(a). The timing sequence of the N-point FFT/IFFT computation can be divided into four stages, each of which contains N/4 clock cycles, as presented in Fig. 25(b). The required number of memory cells for the kth stage is 3×N/(4^k). Significantly, the SDF based pipeline FFT/IFFT structure is highly regular and therefore has a highly effective memory structure with simple routing complexity [31, 32, 34, 35, 42, 43].
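The memory requirement and the mode schedule can be tabulated with a short Python sketch. The formula 3×N/4^k is taken from the text; the function names are hypothetical, and the mode schedule assumes an idealized clock index that ignores the pipeline latency of the preceding stages.

```python
def fifo_words(n_points, stage):
    """Feedback memory of butterfly stage `stage` (1-based): 3 * N / 4**stage words."""
    return 3 * n_points // 4 ** stage


def operation_mode(clock, n_points, stage=1):
    """Operation mode at a given clock index of a stage.

    Modes 0-2 reorder data and mode 3 computes the butterfly; each mode is
    assumed to last N / 4**stage cycles, with no pipeline latency offset.
    """
    quarter = n_points // 4 ** stage
    return (clock // quarter) % 4


N = 4096
per_stage = [fifo_words(N, k) for k in range(1, 7)]
total_words = sum(per_stage)
print(per_stage, total_words)
print([operation_mode(t, N) for t in (0, 1024, 2048, 3072, 4096)])
```

For N = 4096 the six stages need 3072, 768, 192, 48, 12 and 3 words, which sum to the 4095-word shift registers stated in Section 5.2.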

(a) The proposed 4 operation modes in the radix-4 based butterfly stages. [Panels: Mode 0, Mode 1, Mode 2, Mode 3; FIFO contents x(0 : N/4-1), x(N/4 : N/2-1), x(N/2 : 3N/4-1) and X[N/4 : N/2-1], X[N/2 : 3N/4-1], X[3N/4 : N-1].]

(b) The timing sequences of 4 operation modes in the proposed pipeline architecture. [Rows: clock #, input, operation mode #; the modes cycle through 0, 1, 2, 3, 0 in consecutive intervals of N/4 clock cycles.]

Fig. 25: The proposed 4 operation modes of the radix-4 butterfly stage in the R4²SDF and R4³SDF based 4096-point FFT/IFFT VLSI architecture.

Dual port memory is well known to be an intuitive implementation of the FIFO shift register. However, each cell in a dual port memory takes an area 33% larger than the corresponding single port RAM cell. Furthermore, dual port memory consumes more power than single port memory [31]. In this study, the memories of stages I and II are realized by single port SRAM. The proposed FIFO shift register architecture in butterfly stage I is depicted in Fig. 26(a), where the notations of the input/output ports denote the respective operators in (79) for the proposed operation mode 3. Because they require few memory cells, stages III, IV, V and VI adopt synchronous flip-flops to implement the FIFO shift registers at a small chip cost. Together with the six-word synchronous flip-flops, the proposed FIFO architecture has a wide data width of six words to provide a six-word read at a time, as shown in Fig. 26(a). Based on the proposed FIFO shift registers depicted in Fig. 26(a), the proposed memory architecture can concurrently provide three operators for the radix-4 based butterfly unit in the current and subsequent cycles. Therefore, the sizes of the single port SRAMs are 512×6 and 128×6 words in stages I and II, respectively. Together with the word selection control signals, the proposed single port SRAM adopts simple word-control circuits to provide independent-word writing at the same address, as shown in Fig. 26(b). That is, the proposed wide single port SRAM can easily achieve independent-word writing for the data reordering in operation modes 0–2, as shown in Fig. 25(a). The detailed data arrangement in the proposed single port memory is listed in Fig. 26(c).

In Fig. 26(c), the notations A(n) and B(n) denote the combined data sets of three input data and of three butterfly results after data reordering and butterfly computation, respectively. In butterfly stage I, A(n) and B(n) can be expressed as {x(n), x(n+N/4), x(n+N/2)} and {B⁰N/4(n), B¹N/4(n), B²N/4(n)}, respectively. Notably, each radix-4 butterfly unit stores the input data and the output results in the same SRAM for the highest memory utilization rate. The read and write operations are interleaved, and each of them is active every other clock cycle, as shown in Fig. 26(d), which prevents read/write conflicts. Figure 26(d) shows the detailed timing sequence of the proposed memory architecture in operation mode 3.
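The stated SRAM organizations follow from packing the three FIFOs of a stage into six-word rows. The Python sketch below is a back-of-the-envelope check under that reading of the text; the function name is hypothetical, and the bandwidth figure simply assumes one six-word access every other clock cycle for reads and likewise for writes.

```python
WORDS_PER_ROW = 6  # six-word-wide single port SRAM row, as described in the text


def sram_rows(n_points, stage):
    """Number of six-word rows needed for the 3 * N / 4**stage FIFO words."""
    words = 3 * n_points // 4 ** stage
    rows, remainder = divmod(words, WORDS_PER_ROW)
    assert remainder == 0, "FIFO size must fill whole rows"
    return rows


stage1_rows = sram_rows(4096, 1)
stage2_rows = sram_rows(4096, 2)
# One six-word read every second cycle supplies three operands per cycle.
operands_per_cycle = WORDS_PER_ROW // 2
print(stage1_rows, stage2_rows, operands_per_cycle)
```

The row counts of 512 and 128 match the 512×6 and 128×6 word SRAMs of stages I and II, and the average of three words per cycle matches the three operators that the radix-4 butterfly unit takes from the feedback memory.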

(a) The proposed FIFO shift registers architecture on the butterfly stage I. [512 × 6-word single port SRAM.]

(b) The proposed single port SRAM with independent word control.

(c) [Data arrangement in the single port memory; labels: length 512, two halves each 3 words wide.]

(d) The timing sequence of proposed memory architecture in the operation mode 3. [Signals: CLK and Write_EN; accessed data pairs A(n-3) & A(n-2), A(n-1) & A(n), B(n-5) & B(n-4), B(n-3) & B(n-2), B(n-1) & B(n).]

Fig. 26: The proposed memory architecture of the butterfly stage I and II in the R4²SDF and R4³SDF based 4096-point FFT/IFFT VLSI architecture.

5.2.3 Constant Multiplier

Based on the derivation results in Section 5.1, the radix-4² algorithm requires some complex multiplications, namely W₁₆^(k₁), W₁₆^(2k₁) and W₁₆^(3k₁), in the 4096-point FFT/IFFT computation in (81). In the SDF based architecture depicted in Fig. 22, a single data stream passes through the constant multipliers and complex multipliers, so only one of the complex multiplications in (81) is computed during each cycle. The three full complex multipliers can therefore be simplified into a single constant multiplier. This subsection follows three steps to reduce the complex multipliers to the most economical constant multipliers in the R4²SDF and R4³SDF architectures. The implementation of the constant multiplier in the R4²SDF architecture is presented below.

First, the multiplication by the twiddle factors from Eq. (81) is realized as a constant multiplier, which contains only shifters and adders, as shown in Fig. 27. Second, the complex conjugate symmetry rule is applied to decrease the number of complex multiplications to only two constant multiplications per stage with some shuffle circuits, as shown in Fig. 27, thus achieving a constant multiplier cost reduction of 83%. Finally, the subexpression elimination algorithm [65] is adopted to reduce the number of shift circuits by more than 20% and the number of complex adders by 50% in one constant multiplier, as depicted in Fig. 27. The most economical constant multipliers are obtained in the proposed architectures by following these three steps, and the cost penalty of the constant multiplier is thus minimized. Similarly, the radix-4³ algorithm has two reduced constant multipliers, as depicted in (78). The constant multiplier of the second stage in the R4³SDF design is the same as the constant multiplier in the R4²SDF design. Following similar reduction steps, the constant multiplier of the third stage in the R4³SDF based design requires eight constant multiplications with a cost reduction of 83%. Regarding the chip cost of the R4³SDF design, the constant multiplier in the third stage has a slightly higher control complexity than the constant multiplier in the second stage.

[Figure 27: shift-and-add constant multiplier with real and imaginary inputs and outputs, two's-complement units and select signals S0, S1 and S2.]

[Real] Constant 1: 0.923828 = 1-2^-4-2^-7-2^-8-2^-9; Constant 2: 0.707092 = 2^-1+2^-3+2^-4+2^-6+2^-8+2^-12

[Imaginary] Constant 1: 0.382629 = 2^-2+2^-3+2^-7-2^-13+2^-12; Constant 2: 0.707092 = 2^-1+2^-3+2^-4+2^-6+2^-8+2^-12

Complex adders: 5 (reduced by 50%); shifts: 17 (reduced by 20%)

Fig. 27: Block diagram of the proposed constant multiplier in R4²SDF design.
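The shifter-and-adder realization can be illustrated for the first real constant listed with Fig. 27, 1-2^-4-2^-7-2^-8-2^-9, which approximates cos(π/8). The Python sketch below operates on plain integers and is only an arithmetic illustration; the function name is hypothetical, and the word length, rounding and sharing of subexpressions in the actual circuit are not modelled.

```python
import math


def times_cos_pi_over_8(x):
    """Multiply an integer sample by 1 - 2^-4 - 2^-7 - 2^-8 - 2^-9 using shifts."""
    return x - (x >> 4) - (x >> 7) - (x >> 8) - (x >> 9)


sample = 1 << 12  # 4096, chosen so that every shift is exact
product = times_cos_pi_over_8(sample)
approximation = product / sample
print(product, approximation, abs(approximation - math.cos(math.pi / 8)))
```

For this input the result is 3784, that is 0.923828125 after scaling, which equals the listed constant and differs from cos(π/8) by less than 1e-4.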

5.2.4 Eight-Folded Complex Multiplier

The proposed 4096-point R4³SDF design has only one complex multiplier and one coefficient ROM to realize the complex multiplication by the twiddle factors W_N^(n₄(k₁+4k₂+16k₃)) in (86). However, the proposed 4096-point R4²SDF design requires two complex multipliers and two coefficient ROMs to realize the twiddle factors W_N^(n₃(k₁+4k₂)) in (79). To decrease the ROM size, the complex conjugate symmetry rule and subexpression elimination [65] are applied to devise one eight-folded complex multiplier, as shown in Fig. 28. The proposed eight-folded complex multiplier reduces the storage size of each coefficient ROM by 87.5%. In the proposed R4²SDF design, the first and second coefficient ROMs store 31 and 511 words, respectively.
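The 87.5% saving corresponds to storing only one eighth of the unit circle and recovering the other seven octants by swapping and negating the stored cosine and sine values. The Python sketch below illustrates this folding in floating point; the function names are hypothetical, and for simplicity the table keeps both end points of the first octant (N/8 + 1 entries), whereas the 511-word figure in the text presumably excludes trivial entries.

```python
import cmath
import math


def build_octant_rom(n):
    """Store (cos, sin) only for the first octant, k = 0 .. n/8."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n // 8 + 1)]


def twiddle(k, n, rom):
    """Rebuild W_n^k = exp(-j*2*pi*k/n) from the first-octant table."""
    k %= n
    half, quarter, eighth = n // 2, n // 4, n // 8
    negate = k >= half          # second half of the circle: negate both parts
    if negate:
        k -= half
    if k <= eighth:             # angle in [0, pi/4]
        c, s = rom[k]
    elif k <= quarter:          # angle = pi/2 - phi: swap cos and sin
        s, c = rom[quarter - k]
    elif k <= quarter + eighth: # angle = pi/2 + phi: swap, negate cos
        s, c = rom[k - quarter]
        c = -c
    else:                       # angle = pi - phi: negate cos
        c, s = rom[half - k]
        c = -c
    if negate:
        c, s = -c, -s
    return complex(c, -s)


n = 4096
rom = build_octant_rom(n)
worst = max(abs(twiddle(k, n, rom) - cmath.exp(-2j * cmath.pi * k / n))
            for k in range(n))
saving = 1.0 - (n // 8) / n
print(len(rom), worst < 1e-12, saving)
```

In hardware the swaps and negations would be applied by the shuffle and sign circuits around the multiplier, so that only the first-octant coefficients need to be stored.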

However, the proposed R4³SDF design has only one complex multiplier, which stores 511 words in its coefficient ROM. Compared with the R4³SDF design, the R4²SDF design requires a larger chip cost of two complex multipliers and two coefficient ROMs to complete
