Constant Multiplier - Pipeline 256-Point FFT/IFFT/8×8 2D-DCT Processor Architecture

5 Long-Length based Effective Pipeline FFT/IFFT Processor

6.2 Pipeline 256-Point FFT/IFFT/8×8 2D-DCT Processor Architecture

6.2.4 Constant Multiplier

Based on the derivation results in Section II, the radix-4² algorithm requires some complex multiplications, namely $W_{16}^{k_1}\,(W_{16}^{-n_1})$, $W_{16}^{2k_1}\,(W_{16}^{-2n_1})$ and $W_{16}^{3k_1}\,(W_{16}^{-3n_1})$ in the 256-point FFT/IFFT mode in (81) and (84), and $W_{8}^{k_{12}}$, $W_{8}^{2k_{12}}$, $W_{8}^{3k_{12}}$ and $W_{8}^{k_2}$ in the 8×8 2D DCT mode in (98). Due to the finite range of $k_1$ and $n_1$ in Eqs. (81) and (84b), namely 0–3, the three complex multiplications $W_{16}^{k_1}\,(W_{16}^{-n_1})$, $W_{16}^{2k_1}\,(W_{16}^{-2n_1})$ and $W_{16}^{3k_1}\,(W_{16}^{-3n_1})$ can be written as $\{W_{16}^{0}\,(W_{16}^{-0}),\ W_{16}^{1}\,(W_{16}^{-1}),\ W_{16}^{2}\,(W_{16}^{-2}),\ W_{16}^{3}\,(W_{16}^{-3})\}$, $\{W_{16}^{0}\,(W_{16}^{-0}),\ W_{16}^{2}\,(W_{16}^{-2}),\ W_{16}^{4}\,(W_{16}^{-4}),\ W_{16}^{6}\,(W_{16}^{-6})\}$ and $\{W_{16}^{0}\,(W_{16}^{-0}),\ W_{16}^{3}\,(W_{16}^{-3}),\ W_{16}^{6}\,(W_{16}^{-6}),\ W_{16}^{9}\,(W_{16}^{-9})\}$. Following a similar procedure, $W_{8}^{k_{12}}$, $W_{8}^{2k_{12}}$, $W_{8}^{3k_{12}}$ and $W_{8}^{k_2}$ in (98) can be expanded as $\{W_{8}^{0}, W_{8}^{1}\}$, $\{W_{8}^{0}, W_{8}^{2}\}$, $\{W_{8}^{0}, W_{8}^{3}\}$ and $\{W_{8}^{0}, W_{8}^{1}, W_{8}^{2}, W_{8}^{3}, W_{8}^{4}, W_{8}^{5}, W_{8}^{6}, W_{8}^{7}\}$. The system has in total 38 different twiddle factor values, which could be implemented as 38 different constant multipliers using only shifters and adders.

Because of the SDF-based architecture, the proposed design only has to calculate one complex multiplication in Eqs. (81), (84) and (97) during each clock cycle. Using the complex conjugate symmetry rule, the 38 twiddle factor values can thus be derived from just two values, W₁₆¹ and W₁₆². Accordingly, the other 36 twiddle factor values can be expressed through real-imaginary swapping or sign inversion of these two constant values.
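The symmetry argument above can be checked numerically. The following is a minimal Python sketch, not the thesis implementation: it enumerates the 38 twiddle factor values listed in the text and confirms that each is either trivial (±1 or ±j, needing no multiplier) or obtainable from W₁₆¹ or W₁₆² by swapping the real and imaginary parts and flipping signs. The helper names `w`, `variants` and `classify` are hypothetical.

```python
import cmath
import math

def w(n, e):
    """Twiddle factor W_n^e = exp(-j*2*pi*e/n)."""
    return cmath.exp(-2j * math.pi * e / n)

# FFT-mode exponents of W16 listed in the text; the IFFT uses the negated exponents.
fft_exps = [0, 1, 2, 3, 0, 2, 4, 6, 0, 3, 6, 9]
values = [w(16, e) for e in fft_exps] + [w(16, -e) for e in fft_exps]
# DCT-mode sets of W8 listed in the text.
for exps in ([0, 1], [0, 2], [0, 3], range(8)):
    values += [w(8, e) for e in exps]

def variants(z):
    """Values reachable from z by real/imaginary swapping and sign inversion."""
    out = []
    for re, im in ((z.real, z.imag), (z.imag, z.real)):
        for sr in (1, -1):
            for si in (1, -1):
                out.append(complex(sr * re, si * im))
    return out

def classify(z, tol=1e-12):
    """Name the base constant that z can be derived from."""
    if any(abs(z - t) < tol for t in (1, -1, 1j, -1j)):
        return "trivial"
    for name, base in (("W16^1", w(16, 1)), ("W16^2", w(16, 2))):
        if any(abs(z - v) < tol for v in variants(base)):
            return name
    return "unknown"

kinds = {classify(v) for v in values}
reduction = 100 * 36 / 38  # share of constant multipliers removed, in percent
```

Here `kinds` contains only `"trivial"`, `"W16^1"` and `"W16^2"`, and `reduction` evaluates to roughly 94.7, in line with the cost reduction quoted in the text.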

Moreover, the repeated shifters and adders of the two constant multipliers can be merged using the subexpression elimination algorithm [65], as illustrated in Fig. 34. According to our implementation results, the small cost penalty of the multiplexer control (i.e., S0, S1 and S2) shown in Fig. 34 is negligible.

The three steps that reduce the complex multipliers to the most economical constant multipliers are summarized below. First, the twiddle factors from Eqs. (81), (84) and (98) are realized as constant multipliers, which contain only shifters and adders, as shown in Fig. 31. Second, the complex conjugate symmetry rule is applied to decrease the number of complex multiplications (90) to only two constant multiplications per stage with some shuffle circuits, as shown in Fig. 5, thus achieving a constant multiplier cost reduction of 94.7%. Finally, the subexpression elimination algorithm [65] is adopted to reduce the number of shift circuits by more than 20% and the number of complex adders by 50% within the single constant multiplier, as depicted in Fig. 34. Following these three steps yields the most compact constant multiplier in the proposed architecture, so the cost penalty of the constant multiplier is minimized.

[Figure: shift-and-add datapath from the real and imaginary inputs to the real and imaginary outputs, with right-shift blocks, adders, two's complement blocks and multiplexers controlled by S0, S1 and S2. Annotations in the figure:]

Real part:
Constant 1: 0.923828 = 1-2^-4-2^-7-2^-8-2^-9
Constant 2: 0.707092 = 2^-1+2^-3+2^-4+2^-6+2^-8+2^-12

Imaginary part:
Constant 1: 0.382629 = 2^-2+2^-3+2^-7-2^-13+2^-12
Constant 2: 0.707092 = 2^-1+2^-3+2^-4+2^-6+2^-8+2^-12

Complex adders: 5 (reduced by 50%). Shifts: 17 (reduced by 20%).

Fig. 34: Block diagram of the proposed constant multiplier architecture.
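As an illustration of how a constant multiplier can be built from shifters and adders only, the following Python sketch applies the first real-part constant annotated in Fig. 34, 1-2^-4-2^-7-2^-8-2^-9 = 0.923828125, to an integer sample. It is a behavioral sketch under assumptions: the function name `mul_const1` and the 12-bit fractional scaling in the usage lines are hypothetical, and the multiplexing and sharing between the two constants shown in the figure are not modeled.

```python
import math

def mul_const1(x: int) -> int:
    """Multiply integer x by 1 - 2^-4 - 2^-7 - 2^-8 - 2^-9 using shifts and subtractions.

    Each right shift truncates, as a plain hardware shifter would.
    """
    return x - (x >> 4) - (x >> 7) - (x >> 8) - (x >> 9)

# Hypothetical fixed-point scaling: 4096 represents 1.0 (12 fractional bits).
one = 1 << 12
product = mul_const1(one)
approx = product / one                       # value of the constant realized by the shifts
error = abs(approx - math.cos(math.pi / 8))  # distance from the ideal cos(pi/8)
```

With this scaling the product is 3784, i.e. 0.923828125, which differs from cos(π/8) by less than 1e-4.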

6.2.5 Eight-Folded Complex Multiplier

The proposed architecture has only one complex multiplier and one coefficient ROM to realize the complex multiplication of the twiddle factors $W_N^{n_3(k_1+4k_2)}$ in (78), $W_N^{-k_3(n_1+4n_2)}$ in (83) and the corresponding DCT-mode twiddle factor in (98). Significantly, the implementation of the time-domain shift for the 8×8 2D-DCT computation needs one feedback path. To decrease the ROM size, the complex conjugate symmetry rule and subexpression elimination [65] are applied to devise one eight-folded complex multiplier, as shown in Fig. 35. The proposed eight-folded complex multiplier only has to store 32 words in the coefficient ROM, reducing the ROM size by 87.5%. The ROM address and data control circuit are also easily realized by the 8-bit counter controller given in Table 12.

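[Figure: the unit circle divided into eight octants, with boundaries at the twiddle factors $W_{256}^{0}$, $W_{256}^{32}$, $W_{256}^{64}$, $W_{256}^{96}$, $W_{256}^{128}$, $W_{256}^{160}$, $W_{256}^{192}$ and $W_{256}^{224}$.]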

Fig. 35: The block diagram of eight-folded algorithm in the coefficient ROM.

Table 12 The Data Control of the Coefficient ROM.

| H = n3(k1+4k2) | Address Mode (H[5]) | ROM address | Data Mode (H[7:5]) | ROM data |
|---|---|---|---|---|
| 0~32 | 0 | Two's complement of H[5:0] | 0 | a+jb |
| 33~63 | 1 | H[5:0] | 1 | b+ja |
| 64~95 | 0 | Two's complement of H[5:0] | 2 | -b+ja |
| 96~127 | 1 | H[5:0] | 3 | -a+jb |
| 128~159 | 0 | Two's complement of H[5:0] | 4 | -a-jb |
| 160~191 | 1 | H[5:0] | 5 | -b-ja |
| 192~223 | 0 | Two's complement of H[5:0] | 6 | b-ja |
| 224~255 | 1 | H[5:0] | 7 | a-jb |
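The eight-fold symmetry behind the coefficient ROM can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the address and data control of Table 12: it assumes W₂₅₆^H = exp(-j2πH/256), stores (a, b) = (cos, sin) of the first octant, and uses its own sign and swap convention per octant. For simplicity it stores 33 entries (indices 0 to 32) instead of the 32 words reported in the text. The names `ROM`, `OCTANT` and `twiddle` are hypothetical.

```python
import cmath
import math

N = 256
# First-octant table: (a, b) = (cos, sin) of 2*pi*m/N for m = 0..32 (assumed layout).
ROM = [(math.cos(2 * math.pi * m / N), math.sin(2 * math.pi * m / N))
       for m in range(33)]

# (real, imag) of W_N^H in each octant, expressed with the stored pair (a, b).
OCTANT = [
    lambda a, b: (a, -b),
    lambda a, b: (b, -a),
    lambda a, b: (-b, -a),
    lambda a, b: (-a, -b),
    lambda a, b: (-a, b),
    lambda a, b: (-b, a),
    lambda a, b: (b, a),
    lambda a, b: (a, b),
]

def twiddle(h: int) -> complex:
    """Return W_256^h using only the first-octant table plus swaps and sign flips."""
    h %= N
    octant, r = h >> 5, h & 31           # upper 3 bits select the octant
    idx = (32 - r) if (octant & 1) else r  # odd octants read the table mirrored
    a, b = ROM[idx]
    re, im = OCTANT[octant](a, b)
    return complex(re, im)

max_err = max(abs(twiddle(h) - cmath.exp(-2j * math.pi * h / N)) for h in range(N))
```

The upper three bits of the index pick the data mode and the lower five bits form the table address, which mirrors the role of H[7:5] in Table 12. Storing one octant in place of all 256 values is where the 87.5% ROM saving comes from.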

6.2.6 Post Computation

Clearly, the 256-point FFT/IFFT modes only require 1×3 word shift registers at the fourth butterfly stage of the proposed R4²SDF architecture. However, the 8×8 2D DCT mode has to implement the post-computation at the fourth butterfly stage in (95a) and (95b). As described in Subsection 6.2.1, the proposed architecture follows the specific linear mapping in (97) to minimize the number of shift registers at the fourth stage. Figure 36(a) depicts the analysis of the order of the fourth butterfly results following the specific linear mapping. Notably, the gray solid line in Fig. 36(a) marks the input data whose order does not follow the required sequence. For instance, {Y_s[17], Y_s[23]}, {Y_s[18], Y_s[22]} and {Y_s[19], Y_s[21]} should be regarded as three groups for the fourth butterfly computation. However, the sequence of the input data at the fourth butterfly stage is Y_s[17], Y_s[18], Y_s[19], Y_s[21], Y_s[22], Y_s[23], so Y_s[23] and Y_s[21] must be re-ordered. The proposed overturn shift register (OSR) structure at the fourth butterfly stage performs this simple re-ordering without any performance degradation, as depicted in Fig. 36(b). The desired ordering is obtained with the OSR structure at the fourth butterfly stage, along with the input re-ordering operation at the first butterfly stage discussed in Subsection 6.2.1. The full-pipeline R4²SDF architecture can then easily produce the two concurrent 8×8 2D DCT outputs.
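The re-ordering described above can be illustrated with a small behavioral sketch in Python. It only models the effect on the data order (reversing the second half of a block so that the required pairs line up), not the register-level OSR circuit of Fig. 36(b). The function names `overturn` and `butterfly_pairs` are hypothetical.

```python
def overturn(block):
    """Reverse the second half of a block, keeping the first half in place."""
    half = len(block) // 2
    return block[:half] + block[half:][::-1]

def butterfly_pairs(block):
    """Pair element i with element i + half, as a delay-feedback stage would."""
    half = len(block) // 2
    return list(zip(block[:half], block[half:]))

# Indices of Y_s in arrival order at the fourth butterfly stage (from the text).
arrival = [17, 18, 19, 21, 22, 23]
pairs_without_osr = butterfly_pairs(arrival)
pairs_with_osr = butterfly_pairs(overturn(arrival))
```

Without the overturn step the pairs are (17, 21), (18, 22), (19, 23). With it they become (17, 23), (18, 22), (19, 21), which are the three groups named in the text.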

(a) The data context of the fourth butterfly stage in the 8×8 2D DCT mode.

(b) The OSR structure of the fourth butterfly stage. [Diagram: D flip-flops clocked by CLK0 to CLK5, with multiplexers selected by the 2D DCT mode signal, placed between the input of the fourth stage and the radix-4 computation.]

Fig. 36: Block diagram of the proposed fourth butterfly stage in the R4²SDF-based 256-point FFT/IFFT and 8×8 2D-DCT architecture.

6.3 Finite Wordlength Analysis

The next generation mobile-multimedia system can handle high-quality multimedia operations with embedded 256-point FFT/IFFT and 8×8 2D DCT pipeline processor [3]-[5].

The system performance should then satisfy the relevant specifications. Higher system performance implies a larger chip cost and greater power consumption, owing to the wider internal wordlength. Since chip cost and system performance trade off against each other, this study performed a finite wordlength analysis to estimate the appropriate wordlength for both the 256-point FFT/IFFT and the 8×8 2D DCT system requirements.

6.3.1 Pipeline 256-Point FFT/IFFT

In the 256-point FFT/IFFT modes, the output signal-to-noise ratio (SNR) was estimated under different noise environments. First, double-precision floating-point input data were generated from the ideal IFFT (FFT) model and passed through an additive white Gaussian noise (AWGN) channel model at five noise levels: 20 dB, 40 dB, 60 dB, 80 dB and 100 dB. The noisy input data were then sent into the proposed R4²SDF pipeline FFT/IFFT architecture, which was modeled at different fixed-point wordlengths. The output SNR was obtained by comparing the original input data with the fixed-point model output. The results were averaged over 100,000 iterations and are depicted in Fig. 37, where the x-axis and y-axis represent the data wordlength and the overall system output SNR, respectively. These results demonstrate that the output SNR saturates as the data wordlength increases. Before saturation, the output SNR increases by 20 dB for each additional three bits. A 13-bit internal wordlength for each function unit produced satisfactory results under the 40 dB noise environment, satisfying the IEEE 802.16e WiMAX [44] standard.

[Plot: output SNR (dB, 0 to 120) versus internal wordlength (5 to 25 bits), with one curve for each input SNR of 20 dB, 40 dB, 60 dB, 80 dB and 100 dB.]

Fig. 37: Finite wordlength analysis of the proposed pipeline R4²SDF-based 256 points FFT/IFFT architecture.
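The dependence of SNR on wordlength can be illustrated with a much simpler experiment than the one in the thesis. The Python sketch below only quantizes a random test signal to a given wordlength and measures the resulting SNR. It does not model the R4²SDF pipeline, the AWGN channel or the internal rounding of each stage, and every name in it is hypothetical.

```python
import math
import random

def quantize(x: float, bits: int) -> float:
    """Round x in [-1, 1) to a signed fixed-point word of `bits` bits (one sign bit)."""
    scale = 1 << (bits - 1)
    q = round(x * scale)
    q = max(-scale, min(scale - 1, q))  # saturate to the representable range
    return q / scale

def snr_db(ref, test):
    """SNR in dB of `test` relative to the reference signal `ref`."""
    signal = sum(r * r for r in ref)
    noise = sum((r - t) ** 2 for r, t in zip(ref, test))
    return 10 * math.log10(signal / noise)

random.seed(0)
samples = [random.uniform(-1, 1) for _ in range(20000)]
snr = {b: snr_db(samples, [quantize(s, b) for s in samples]) for b in (10, 13, 16)}
gain_per_3_bits = snr[13] - snr[10]
```

For pure quantization noise the gain is about 6 dB per bit, so `gain_per_3_bits` comes out near 18 dB. That is of the same order as the increase of 20 dB per three bits read from Fig. 37.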

6.3.2 Pipeline 8×8 2-D DCT

In the 8×8 2-D DCT mode, the performance of the proposed R4²SDF pipeline architecture was measured against common video compression standards, including the high-quality DV standard [75]. The DV standard defines tolerances within which the 8×8 2-D DCT computation maintains its accuracy and, consequently, an acceptable reconstructed video quality [75][76].

The DV standard applies four error criteria, namely the probability of occurrence of error, the mean square error (MSE), the peak mean square error (PMSE) and the steady AC coefficients [76]. Following the procedure in the preceding subsection, the double-precision floating-point result is taken as the exact reference for the fixed-point computation. Zero-mean white input sequences are generated by a random-number generator in the range [−128, 127]. After 100,000 iterations, the probability of an error greater than 1 occurring is less than 1×10⁻¹⁵. Moreover, the steady AC coefficients of the proposed fixed-point 8×8 2D DCT model are all zero under equal-valued input. Figures 38(a) and 38(b) depict the MSE and PMSE simulation results, respectively. Notably, the proposed architecture satisfies the MSE and PMSE limits of the DV standard when the internal wordlength is greater than 12 bits. Thus, a 13-bit internal wordlength for each function unit qualifies for the DV standard. Figures 38(c) and 38(d) indicate that the overall mean error (OME) is below 0.01 and that the peak signal-to-noise ratio (PSNR) is close to 60 dB, which provides the required video compression quality with the 13-bit internal wordlength [77]. According to the finite wordlength analysis of the proposed R4²SDF 256-point FFT/IFFT pipeline architecture, a 13-bit internal wordlength also achieves satisfactory results under the 40 dB noise environment, thus satisfying the IEEE 802.16e standard.

The 13-bit internal wordlength was thus chosen for the proposed R4²SDF 256-point FFT/IFFT/2-D DCT RSoC IP to meet the requirements of next-generation handheld applications.

Fig. 38: Finite wordlength analysis of the proposed pipeline R4²SDF-based 8×8 2D DCT architecture. (a) Overall mean square error analysis. (b) Peak mean square error analysis. (c) Overall mean error analysis. (d) Peak mean error analysis.

6.4 Comparison and Chip Implementation

6.4.1 Comparison between R4²SDF and R2²SDF

He et al. presented an efficient pipeline FFT processor, several reliable architectures and a detailed comparison of their hardware costs [31]. A comparison of these architectures indicates that R2²SDF has the highest butterfly utilization of 50%, the highest complex multiplier utilization of 75%, and the lowest hardware resource requirement [31][34]. Additionally, the SDF-based design has the structural merits of high regularity and modularity with simple wiring complexity, making it very appropriate for the VLSI implementation of pipeline FFT processors [31, 32, 34]. This section presents a comprehensive comparison of several well-known pipeline FFT/IFFT architectures to demonstrate the high cost efficiency of the proposed R4²SDF FFT/IFFT architecture. The architectures were compared on two indices, namely cost and utilization, as listed in Tables 13 and 14. Table 13 lists the required hardware resources, where T denotes the number of complex adders required in the implementation of the constant multiplier.

Significantly, the proposed constant multiplier is minimized using the complex conjugate symmetry rule and the subexpression elimination algorithm. The area of the complex multiplier is known to be a dominant cost index in pipeline FFT/IFFT design. The comparison results in Table 14 clearly demonstrate that the proposed R4²SDF-based FFT/IFFT architecture requires the fewest complex multipliers among the pipeline architectures. The 256-point FFT/IFFT architecture only needs one complex multiplier, which is 67% and 95% below the requirement of the R2²SDF and R8MDC FFT/IFFT architectures, respectively. Additionally, the proposed architecture applies the feedback-type memory structure to keep the shift register requirement at its minimum. Although the proposed R4²SDF-based architecture needs slightly more complex adders than the R2²SDF-based architecture, this small cost penalty is acceptable.

To estimate the total chip cost of the 256-point FFT/IFFT architectures, which includes the number of complex multipliers, complex adders and memory size, the conventional comparative methodology [26, 32] with the unit of equivalent adders was adopted. Based on the implementation results in our process, the area of each complex multiplier and complex memory is converted to 50 and 1.3 complex adders, respectively, when adopting 13-bit precision and a complex multiplication scheme with three real multiplications and five real additions. The rightmost column of Table 13 lists the resulting equivalent-adder area index of each 256-point FFT/IFFT architecture. Clearly, the proposed R4²SDF-based 256-point FFT/IFFT architecture has the lowest hardware requirements, with a 16% lower cost than the R2²SDF-based 256-point FFT/IFFT architecture. Significantly, the cost advantage of the proposed architecture becomes more evident as the transform length grows. Thus, the proposed R4²SDF-based architecture has a lower hardware cost than R2²SDF and other well-known pipeline FFT/IFFT architectures in terms of the number of ROMs, complex multipliers, complex adders, constant multipliers and shift registers.
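For readers unfamiliar with the scheme of three real multiplications and five real additions, the Python sketch below shows one common formulation of it, together with the equivalent-adder cost metric using the weights 50 and 1.3 quoted above. Whether the thesis uses this exact arrangement of the three products is an assumption. The function names and the argument values in the usage lines are hypothetical and are not taken from Table 13.

```python
def cmul_3m5a(a, b, c, d):
    """Compute (a + jb)(c + jd) with three real multiplications and five real additions."""
    k1 = c * (a + b)   # addition 1, multiplication 1
    k2 = a * (d - c)   # addition 2, multiplication 2
    k3 = b * (c + d)   # addition 3, multiplication 3
    real = k1 - k3     # addition 4: ac - bd
    imag = k1 + k2     # addition 5: ad + bc
    return real, imag

def equivalent_adders(n_mult, n_mem, n_add, mult_weight=50.0, mem_weight=1.3):
    """Area index in complex-adder units: weighted multipliers and memory plus adders."""
    return mult_weight * n_mult + mem_weight * n_mem + n_add

# Usage with made-up operands and counts.
product = cmul_3m5a(3, 4, 5, 6)        # (3 + 4j) * (5 + 6j)
index = equivalent_adders(1, 10, 20)   # 1 multiplier, 10 memory units, 20 adders
```

The trade is one real multiplication for two extra real additions compared with the direct four-multiplication form, which pays off when a multiplier is far larger than an adder.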

Table 14 shows a comprehensive comparison of the hardware utilization rate in terms of complex multipliers, complex adders and complex memory. Clearly, the proposed architecture achieves the highest complex multiplier utilization rate among the pipeline architectures (87.5%). Additionally, the proposed architecture maintains the maximum complex memory utilization rate of 100%. Furthermore, the proposed architecture, including the constant multipliers, has the highest complex adder utilization rate of 56.25%. Thus, the proposed architecture achieves a higher hardware utilization rate than R2²SDF and other well-known pipeline FFT/IFFT architectures in terms of complex multipliers, complex adders, constant multipliers and complex memory. Although the R2MDC, R4MDC and R8MDC architectures have higher throughput rates (outputs per cycle) of 2, 4 and 8 than the SDF-based architecture, these approaches require far more hardware, such as complex multipliers, adders and memory, as shown in Table 13. Therefore, this investigation focuses on the "hardware-oriented" architecture, in which the arithmetic operations can be tightly scheduled for efficient hardware utilization. This study demonstrates that the proposed R4²SDF-based pipeline FFT/IFFT architecture has the lowest hardware cost and the highest hardware utilization. Consequently, the proposed R4²SDF-based pipeline FFT/IFFT architecture is the most cost-efficient.

6.4.2 8×8 2-D DCT Comparison

Many DCT implementations exist spanning a broad spectrum of architectures, focusing on different applications. Lee et al. [78] presented a highly parallel approach with high arithmetic cost and high power consumption for the high-performance application. The systolic implementation of Lee et al. [78] employs the row-column decomposition to derive the configurable 2D N×N DCT in three steps with each step implemented in systolic form. This work concentrates on high-speed FFT/IFFT/2D DCT architectures with a throughput rate of at least one output sample per cycle, targeted for applications in next-generation handheld devices needing a high data-processing rate.

Moreover, the proposed architecture has high cost efficiency and low cost in a portable consumer device. This subsection lists the hardware requirement comparison between six different implementations in terms of the number of real (complex) multipliers, real (complex) adders, twiddle factors realization, total transistor count, hardware complexity, throughput, internal wordlength, interconnect complexity and support for triple-mode, as shown in Table 15. Clearly, the proposed pipeline R4²SDF-based FFT/IFFT/2D-DCT processor has the fewest complex multipliers and lowest hardware complexity, an acceptable throughput rate and moderate interconnect complexity. Although the number of the complex adders in the proposed processor is greater than the designs in [79] and [80], the total area including complex multiplier is still lower than others. The total number of transistors indicates that the proposed design achieves the smallest chip cost among architectures supporting FFT/IFFT mode.

Table 13 Hardware Cost Comparisons of the Pipelined FFT/IFFT Architecture.

Table 14 Hardware Utilization Rate Comparisons of the Pipelined FFT/IFFT Architecture.

| Pipeline architecture | Utilization rate of complex multipliers | Utilization rate of complex adders (including constant multipliers) | Utilization rate of complex memory |

Table 15 Hardware Requirement Comparison of 8×8 2D DCT Architecture.

ROM based LUT ROM based LUT Hardwired Multiplier

| 8×8 DCT | Lee et al. [78] (parallel) | Chang & Wang [81] (2D systolic) | Hsiao and Shiue [79] (linear-array) | Ruetz et al. [80] (linear-array) | Madisetti et al. [82] | Proposed |
|---|---|---|---|---|---|---|
| Internal wordlength | 18 | 16 | 16 | 14 | 22 | 13 |
| Interconnect complexity | Complex | Simple | Moderate | Moderate | Simple | Moderate |
| FFT/IFFT/2-D DCT triple modes | No | No | No | No | No | Yes |

1 A gate count was determined and the number of transistors was determined by assuming four transistors per gate.

2 An unknown gate count was indicated by “N/A”

6.4.3 Chip Implementation

Following functional verification in the Matlab environment, the 256-point FFT/IFFT/2-D DCT architecture, with an internal wordlength of 13 bits throughout the design, was synthesized by Design Compiler with TSMC 0.13 µm CMOS technology. The floorplan and post-layout were performed by Astro. Post-layout simulation was run with NC-Simulator to verify the functionality after back-annotation from the Star-RC extractor. Static timing was signed off with PrimeTime. Finally, the power analysis and DRC were conducted using Astro Rail and Dracula, respectively.

The core area of the post-layout design was 0.6 mm². The reported equivalent gate count is 60086 gates, or about 60k gates. The gate count usage of each building block is listed in Table 16. Clearly, the 264-word shift register dominates the chip cost at 54.58%. The implementation result without the 2D DCT indicates that the total gate count decreases to 55.2k. The implementation reports in this study reveal that the routing cost penalty incurred by the additional 8×8 2D DCT mode is small. The chip operated at 100 MHz, thus satisfying the high throughput requirement. After the conversion, the proposed R4²SDF design in 8×8 2D DCT mode could provide high frame rates of 505 kfps and 1042 kfps for frame sizes of 176×144 and 128×96 (pixel²), respectively.

Concerning the speed performance, because the pipelined multiplier operation is easy to design at a clock rate of 100 MHz or even higher, the proposed architecture can achieve a high clock rate by simple pipelining techniques for the involved arithmetic components.

The chip properties shown in Fig. 6.9 demonstrate that the average power dissipation of the 256-point FFT/IFFT/2-D DCT design was 22.37 mW at 100 MHz with a 1.2 V supply voltage. The layout view shown in Fig. 39 has 64 I/O pins, of which eight are power supply pins. The proposed R4²SDF-based 256-point FFT/IFFT/2-D DCT implementation not only satisfies the system performance of the DV standard in 8×8 2D DCT mode, but also achieves satisfactory results with 40 dB performance in the 256-point FFT/IFFT modes. Additionally, the proposed implementation has a low power consumption (22.37 mW) and the lowest hardware requirement of all the pipeline architectures compared. These findings indicate that the proposed design is suitable as a highly cost-efficient FFT/IFFT/2-D DCT triple-mode RSoC IP for next-generation handheld devices.

Mode Selection: 256-point FFT/IFFT and 8×8 2D-DCT
Architecture: R4²SDF pipeline
Technology: 0.13 µm CMOS
Core Size: 807 µm × 754 µm = 0.6 mm²
Power Consumption / Freq.: 22.37 mW / 100 MHz
Accuracy / Internal Wordlength: 40dB in DV standard / 13 bits
Input / Output / Power Pins: 29 / 27 / 8

Fig. 39: The layout view and design characteristics of proposed pipeline 256-point FFT/IFFT/8×8 2D DCT processor.

Table 16 The Gate Count Usage of Each Building Block.

| Categories | Control | Butterfly Cores | Complex Multiplier | Constant Multipliers | Shift Registers |
|---|---|---|---|---|---|
| Area | 1.3 % | 21.74 % | 18.9 % | 3.48 % | 54.58 % |

6.5 Summary

This investigation develops a triple-mode reconfigurable pipeline R4²SDF VLSI architecture that supports the 256-point FFT/IFFT and 8×8 2-D DCT computations. The comparison results demonstrate that the proposed R4²SDF pipeline FFT/IFFT architecture has a lower hardware cost and higher utilization than R2²SDF and other pipeline architectures. Following the fixed-point analysis, the proposed 256-point FFT/IFFT/8×8 2-D DCT chip design was successfully implemented in 0.13 µm CMOS technology with an internal wordlength of 13 bits. This design has a power consumption of 22.37 mW at 100 MHz with a 1.2 V supply voltage. These features make the proposed reconfigurable processor design well suited to next-generation mobile communications.

The upcoming fourth-generation wireless system requires the simultaneous application of many computing algorithms including MPEG-4 AVC [83] and Walsh transform [84], in the same handheld device. The reconfigurable hardware core for supporting more transforms is a significant topic for future work.

Chapter 7 Conclusion and Future Work

In this thesis, we focus on the specific ASIC design of an effective pipeline FFT/IFFT processor. By adopting a hardware-oriented architecture for maximum efficiency, the specific FFT/IFFT processor not only minimizes the computation complexity and area cost, but also increases the hardware utilization rate with an appropriate throughput rate for different applications. For the purpose of demonstrating the effective computations in

From the document: Design and Implementation of a High-Performance Pipelined Fourier Transform Processor (pages 114-0)