Summary - Bitwidth-Aware Subexpression Sharing in FIR Filter

Chapter 2 Bitwidth-Aware Subexpression Sharing in FIR Filter

2.5 Summary

In this dissertation, we present an ILP-based bitwidth-aware area minimization algorithm for MCM designs. We first point out that the total adder bit count, rather than the total adder count, better estimates the hardware cost of a real implementation. Then, for a given MCM design, the target constants are first represented in a specified number format (MSD is used in this dissertation). Next, a subexpression graph is created to record all feasible decompositions of every target constant. The graph also keeps track of the required adder bitwidth as well as the two subexpressions of every decomposition. Finally, the area minimization problem is formulated as a set of ILP constraints derived from the subexpression graph and solved optimally within an acceptable runtime. The experimental results demonstrate that the proposed algorithm achieves an average reduction of more than 7% in both the adder bit count and the actual gate count. Therefore, we are confident that the proposed approach can outperform the existing state-of-the-art techniques and should be regarded as a better alternative for area minimization in MCM designs.

Table 6 Logic depth comparisons for 12-bit 32-tap FIR filters

          Depth = 3                           Depth = 4
          Ho et al. [28]    Ours              Ho et al. [28]    Ours
Filters   #adders  #bits    #adders  #bits    #adders  #bits    #adders  #bits
lp32-1    16       282      16       271      16       296      16       271
lp32-2    21       363      21       323      21       354      21       323
lp32-3    15       259      15       241      15       250      15       241
hp32-1    22       405      22       373      22       400      22       373
hp32-2    17       294      17       284      17       300      17       284
hp32-3    19       339      19       318      19       334      19       318
bp32-1    25       421      26       382      24       388      24       378
bp32-2    22       399      23       393      21       384      22       373
bp32-3    23       395      23       366      23       417      23       366
bs32-1    23       405      23       346      23       393      23       346
bs32-2    24       419      24       368      23       381      24       368
bs32-3    21       360      21       328      21       356      21       328

Chapter 3 Expandable MDC-Based FFT

3.1 Overview of the Pipelined FFT Architecture

Pipelined architectures can be divided into two major categories according to the datapath structure: Single-path Delay Feedback (SDF) based architectures and Multi-path Delay Commutator (MDC) based architectures. SDF-based architectures use properly sized local delay feedback loops to correctly schedule the input data for the butterfly units. They typically offer a higher hardware utilization rate and a lower hardware cost. In contrast, MDC-based architectures first separate the input sequence into two parallel data streams by properly controlled switches/FIFOs and then direct them into the correct butterfly units. As a result, MDC-based architectures generally demand slightly more hardware resources and a larger memory bandwidth but provide higher throughput in return. Nevertheless, although a fixed MDC-based architecture can generally provide good throughput at a reasonable hardware cost, it may still fail to meet the target performance requirement in some throughput-hungry designs.

In [54] and [55], a foldable structure is proposed to provide various design tradeoffs between area and throughput based on the base (Pease) architecture. Since the Pease architecture is highly regular, it is extremely easy to fold the butterfly units in its implementation either horizontally or vertically. Figure 10 illustrates the fully-expanded 16-point foldable Pease FFT implementation. The 4 butterfly columns are identical and thus can easily be 2- or 4-folded horizontally. Similarly, the 8 identical butterfly rows can be 2/4/8-folded vertically. With this folding technique, an area-optimized architecture can be tailored to meet a given throughput constraint. However, this customized architecture still requires more area than conventional pipelined architectures when delivering the same throughput (shown later). Furthermore, a matrix factorization of the FFT computation is also developed in [44] and [55]. Each element in the factored matrix can be expressed as a specific hardware component, so that area/performance evaluation can easily be done at the architecture exploration stage. However, [44] and [55] do not consider MDC-based pipelined architectures.

Figure 10 A foldable Pease architecture for 16-point FFT (butterfly (BF) units arranged in Column 1 to Column 4)

Figure 11 The generic template of R2²EMDC architecture (rows of alternating BFI and BFII units)

In this dissertation, we propose an area-efficient high-throughput Expandable MDC (EMDC) based FFT architecture. It can be easily applied to conventional MDC-based FFT architectures (such as R2²MDC, R2³MDC, and R2⁴MDC). Here, we only demonstrate the Radix-2² Expandable MDC (R2²EMDC) architecture.

3.2.1 The proposed R2²EMDC architecture

The generic template of the proposed R2²EMDC architecture is presented in Figure 11. Three key parameters are described as follows:

N: the FFT size, where N = 2^m, and m is a positive integer.

t: the degree of parallelism obtained from expansion, where t = 1, 2, 2², …, 2^(m-1).

I_n: the n×n interconnection permutation matrix (IPM), where n = 2², 2³, …, 2^m.

The proposed architecture is composed of two stages. In addition to butterfly units, the front stage, named the data reordering stage, employs FIFOs of specific sizes and properly controlled switches to align the data in the correct order, while the back stage, named the data shuffling stage, deploys a set of precisely organized IPMs to shuffle the data among different rows to their correct positions (i.e., bit reversing). Note that two types of butterfly structures, BFI and BFII [41], are in use, as shown in Figure 12.

Figure 12 Two types of butterfly structures

BFI is basically composed of two complex adders/subtractors for two complex inputs, a and b, while BFII contains additional multiplexing logic to implement an optional multiplication by –j; that is, the trivial twiddle factor multiplication by –j can be accomplished by a simple real-imaginary swap plus suitably multiplexed addition/subtraction instead of a costly complex multiplier.
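The following Python sketch illustrates the behavior of the two butterfly types described above. It is a behavioral model only, not the hardware structure of [41]. The function names are hypothetical, and applying the –j rotation to input b before the add/subtract is an assumption made for illustration.

```python
from typing import Tuple


def mul_neg_j(z: complex) -> complex:
    """Multiply by -j using only a real/imaginary swap and one negation."""
    # (x + jy) * (-j) = y - jx, so no real multiplier is needed.
    return complex(z.imag, -z.real)


def bf1(a: complex, b: complex) -> Tuple[complex, complex]:
    """BFI: one complex addition and one complex subtraction."""
    return a + b, a - b


def bf2(a: complex, b: complex, rotate: bool) -> Tuple[complex, complex]:
    """BFII: like BFI, with an optional multiplication by -j.

    Assumption: the rotation is applied to input b before the add/subtract.
    """
    if rotate:
        b = mul_neg_j(b)
    return a + b, a - b
```

With `rotate` set to False, `bf2` reduces to `bf1`, which mirrors the statement that BFII only adds multiplexing logic on top of BFI.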

As a result, an IPM is simply a signal wiring network and thus hardly consumes logic resources from a hardware implementation perspective.

3.2.2 Hardware cost and throughput evaluation

For the conventional R2²MDC architecture, the numbers of complex multipliers and adders are 2log₄N − 2 and 2log₂N, respectively. Besides, two output values are produced at every cycle, so the throughput is 2/N (in N-point transforms per cycle). Furthermore, since R2²EMDC allows its datapath to be expanded in exchange for better performance, the number of multipliers, the number of adders, and the overall throughput are all proportional to the degree of parallelism t. For example, Figure 14 illustrates four instances of the 16-point R2²EMDC architecture with different t settings. Figure 14(a) shows the case with t = 1 (no expansion applied), which is identical to the original R2²MDC architecture.

Figure 13 Interconnection configuration of I4 and I8

Meanwhile, Figures 14(b) and 14(c) give the cases with t = 2 and t = 4, where the numbers of multipliers and adders are doubled and quadrupled, respectively, and so is the throughput. Notice that the number of FIFO entries actually decreases as t rises. Figure 14(d) depicts the case with t = 8, namely, the fully-expanded implementation, which provides the maximum throughput of the 16-point R2²EMDC architecture and also demands the most hardware resources.

Similarly, for the foldable Pease architecture illustrated in Figure 10, the numbers of multipliers and adders are both halved, and so is the throughput, each time the datapath is vertically folded once. However, the number of FIFO entries remains unchanged regardless of the parallelism of the datapath.

Figure 14 Different instances of 16-point R2²EMDC architecture

Table 7 gives several theoretical comparisons between the existing foldable Pease architecture and the proposed R2²EMDC one. Since multipliers take a major

Table 7 Comparisons between foldable Pease and R2²EMDC

Architecture              #multipliers     #adders    #FIFOs
Foldable Pease [54, 55]   tlog₂N           2tlog₂N    N
R2²EMDC                   t(2log₄N − 2)    2tlog₂N    N − 2t

parameterizable FFT generator in Perl script. By indicating the size (N) of the FFT core and the degree of datapath parallelism (t), our generator can output the specified hardware design in synthesizable Verilog HDL in just a few seconds. Furthermore, we also use MATLAB in our testbench environment for extensive pattern generation/simulation and SQNR analysis (larger than 80 dB) to verify the correctness of the generator.
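As a rough aid for reading Table 7, the sketch below evaluates its cost expressions for given N and t. It rests on two assumptions: the R2²EMDC multiplier entry is read as t(2log₄N − 2), and the throughput is taken as 2t/N transforms per cycle (two outputs per cycle for t = 1, scaled by t). The names `Cost`, `foldable_pease_cost`, `r2_2_emdc_cost`, and `emdc_throughput` are hypothetical and unrelated to the Perl generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    multipliers: int
    adders: int
    fifo_entries: int


def _check(n: int, t: int) -> int:
    """Validate N and t, and return m = log2(N)."""
    if n < 4 or n & (n - 1):
        raise ValueError("N must be a power of two, at least 4")
    if t < 1 or t & (t - 1) or t > n // 2:
        raise ValueError("t must be a power of two between 1 and N/2")
    return n.bit_length() - 1


def foldable_pease_cost(n: int, t: int) -> Cost:
    """Table 7, foldable Pease row: t*log2(N), 2t*log2(N), N."""
    m = _check(n, t)
    return Cost(multipliers=t * m, adders=2 * t * m, fifo_entries=n)


def r2_2_emdc_cost(n: int, t: int) -> Cost:
    """Table 7, R2^2 EMDC row, under the assumed multiplier expression."""
    m = _check(n, t)
    # 2*log4(N) - 2 equals log2(N) - 2, so integer arithmetic suffices.
    return Cost(multipliers=t * (m - 2), adders=2 * t * m, fifo_entries=n - 2 * t)


def emdc_throughput(n: int, t: int) -> float:
    """Transforms per cycle: 2t outputs per cycle over N outputs per transform."""
    _check(n, t)
    return 2 * t / n
```

For N = 16 and t = 8, `r2_2_emdc_cost` gives zero FIFO entries, which agrees with the observation that the FIFO count shrinks as t rises.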

In addition to comparing the foldable Pease architecture and R2²EMDC analytically (Table 7), it is also interesting to compare the generated hardware cores in terms of logic gate count (NAND2-equivalent), throughput, and power consumption. To this end, we have also implemented the generator proposed in [43], which always completely folds the core in the horizontal direction first and then v-folds the core in the vertical direction, where v is parameterizable. The generated hardware cores are then synthesized in UMC 0.18 μm technology using Synopsys Design Compiler with a 100 MHz timing constraint.

Figure 15(a) and (b) report the area-throughput design tradeoff solutions offered by R2²EMDC and the foldable Pease architecture [44] for 256-point and 1024-point FFT, respectively. The throughput ratio is the ratio of the throughput of a given architecture to that of the fully parallel Pease architecture. As expected, the newly proposed architecture is always more area-efficient than the existing one when delivering the same throughput. For example, the area reduction is about 37% when N = 256 and t = 16. Moreover, under a fixed FFT size N, the area gap between the two architectures widens as the target throughput increases. That is, according to the theoretical analyses and experimental results, the proposed R2²EMDC architecture is indeed capable of providing a smaller hardware implementation than the existing foldable Pease architecture under the same throughput constraint.

Figure 15 Area vs. throughput in 256/1024-point FFT: (a) N = 256; (b) N = 1024 (axis label: Area (gate count))

Table 8 compares the power consumption of four different instance pairs derived from the two architectures with the same target throughput ratio (using the UMC 0.18 μm process and measured by Synopsys PrimePower). R2²EMDC achieves a power reduction of 11.0% to 24.7% compared to the foldable Pease architecture, roughly 20% in three of the four cases. Here, the smaller hardware implementation should be the major reason for the results.

Table 8 Power consumption in 256-point FFT (mW)

Ratio     Foldable Pease    R2²EMDC    Power Reduction (%)
0.0078    14.03             12.48      11.0
0.0156    27.59             22.09      20.0
0.0313    55.90             42.07      24.7
0.0625    97.67             79.41      18.7

3.4 Summary

In this dissertation, we propose an expandable multi-path delay commutator (EMDC) based FFT architecture. We show that the proposed architecture can be easily and flexibly expanded to satisfy throughput-hungry applications. In addition, a parameterizable hardware generator is developed to automatically produce the specified HDL code so that the design cost and time can be drastically reduced.

Finally, the theoretical analyses and experimental results demonstrate that the proposed architecture consumes less area and power than the existing foldable Pease architecture under the same throughput constraint.

Chapter 4 Probability-Based Static Scaling Optimization for Fixed Wordlength FFT Processors

4.1 Introduction of Scaling Optimization

Recently, high-data-rate wireless communication systems have become a primary focus of both research and development. For example, OFDM technology is one of the favored choices for future broadband systems and is highly suitable for video transmission and mobile Internet applications. The FFT processor is one of the key components in OFDM-based wireless systems [4-6, 45]. Beyond OFDM, the FFT processor also serves as an essential element in many other modern DSP systems. Consequently, improving FFT processor designs has been the focus of a large number of studies over the past decade.

The architectures of FFT processors can be roughly divided into two major categories: memory-based and pipeline-based. A memory-based architecture usually consists of only one butterfly unit. In general, it provides area-efficient solutions for low-throughput applications. Alternatively, a pipelined architecture consists of multiple butterfly units processing concurrently so that it can provide higher throughput at the cost of more hardware resources. In general, memory-based architectures are suitable for FFT processors where the hardware cost is an issue and the FFT size is at least 512 [3]. Pipeline-based architectures are typically feasible for applications with smaller FFT sizes. In this dissertation, we focus on the scaling optimization of FFT designs with a fixed wordlength at each stage; that is, the output wordlength of every stage is the same as its input wordlength. Memory-based designs naturally fulfill this requirement, while a considerable portion of pipeline-based designs also choose to meet it, since a fixed wordlength is still preferred due to hardware cost and critical-path delay considerations [46].

When crafting a practical FFT hardware design, the output precision in terms of Signal-to-Quantization-Noise Ratio (SQNR) is regarded as a key design requirement. In practice, FFT algorithms are commonly implemented using fixed-point arithmetic instead of floating-point arithmetic to reduce hardware cost. That is, only a limited number of bits are available to represent a signal or coefficient value. As a result, rounding and truncation operations inevitably introduce noise, which is referred to as quantization noise. Besides, addition and subtraction operations may also cause overflow errors (noise) during computation. Although extending the wordlength can relieve the accuracy loss, the hardware cost and the critical-path delay increase accordingly.
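To make the link between wordlength and quantization noise concrete, the following sketch truncates samples to a fixed-point grid and measures the resulting SQNR. It is illustrative only: the number format (one sign bit plus `int_bits` integer bits and `frac_bits` fractional bits), the use of saturation on overflow, and the sinusoidal test signal are all assumptions, and the function names are hypothetical.

```python
import math
from typing import Sequence


def quantize(x: float, int_bits: int, frac_bits: int) -> float:
    """Truncate x to a signed fixed-point grid and saturate on overflow.

    Assumed format: 1 sign bit + int_bits integer bits + frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    raw = math.floor(x * scale)              # truncation toward minus infinity
    top = (1 << (int_bits + frac_bits)) - 1  # largest representable code
    raw = max(-top - 1, min(top, raw))       # saturate instead of wrapping
    return raw / scale


def sqnr_db(reference: Sequence[float], quantized: Sequence[float]) -> float:
    """SQNR in dB: signal power over quantization-noise power."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


# Arbitrary test signal inside (-1, 1), so no saturation occurs here.
samples = [0.9 * math.sin(0.1 * k) for k in range(1, 200)]
sqnr_short = sqnr_db(samples, [quantize(s, 0, 7) for s in samples])
sqnr_long = sqnr_db(samples, [quantize(s, 0, 11) for s in samples])
```

Here `sqnr_long` exceeds `sqnr_short`, reflecting the trade-off stated above: a longer fractional part reduces truncation noise at the price of a wider datapath.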

Therefore, several number scaling methods, either static or dynamic, have been proposed to improve the output SQNR [1, 45-69]. Oppenheim et al. [49] proposed a simple static scaling method that always increases the integer part by one bit for each radix-2 stage to prevent overflows. However, this method suffers significant quantization errors if the wordlength is fixed. In addition, methods based on dynamic scaling have also been proposed for SQNR improvement. Block Floating Point (BFP)-based methods employ intermediate buffers to store and analyze a block of output values, and then dynamically determine an appropriate number format for that block to achieve a better SQNR [1, 45, 46, 56, 57]. However, all of these dynamic methods suffer a notable increase in area, power, and latency, and need a more complicated control unit. Consequently, most FFT designers rely on static instead of dynamic scaling optimization techniques to determine a proper number format for each stage [50].

Previous static scaling techniques can be roughly classified into three major categories: simulation-based approaches [58, 59], analytical approaches [47-50, 60-68], and hybrids of the two [48, 69, 70]. The simulation-based approaches try to find a good number format through lengthy iterations. In contrast, the analytical methods can determine a good number format very efficiently through a static numeric analysis without invoking time-consuming simulation. However, the analysis results are generally too pessimistic and lead to a larger wordlength than required. Therefore, the hybrid approaches are proposed to determine the number format while shortening the simulation time. Meanwhile, the works mentioned above [47-50, 58-70] all assume that the wordlength at each stage can be chosen arbitrarily. However, in memory-based FFT designs, the wordlength (the width of the memory block) is always fixed. To the best of our knowledge, the problem of static scaling under a fixed wordlength constraint has not been well addressed yet.

In this dissertation, we propose a scaling optimization technique based on static probability analysis that can rapidly determine the best number format at each butterfly stage under a fixed wordlength constraint. Given a probability distribution of input signals, a selected FFT algorithm, and a wordlength constraint, the proposed technique can maximize the overall output SQNR through static number format analysis and optimization, stage by stage. Compared to previous works, our method offers the following three contributions: 1) providing a probability model that can abstract the behavior of fixed-point arithmetic logic; 2) avoiding the use of time-consuming and pattern-dependent simulation throughout the entire optimization process; 3) minimizing the required wordlength in a hardware implementation under a given SQNR target without demanding extra hardware components or complicating the control logic compared to other existing static approaches [49, 50].

4.2 Number Scaling

4.2.1 Related works

It requires n+1 bits to accurately preserve the result of an n-bit fixed-point addition/subtraction operation. Hence, one solution for avoiding an overflow generated by a butterfly operation is to make the output wordlength one bit larger than the input one [48]. However, increasing the wordlength induces a number of drawbacks in FFT hardware implementation. First, a larger data storage unit (memory block or register file) is required, which increases both chip area and power consumption. Second, a longer wordlength results in a worse critical-path delay in the arithmetic logic, which is unsuitable for high-throughput FFT designs. Most importantly, the wordlength is fixed in a memory-based FFT architecture, meaning that it is not possible to vary the wordlength from stage to stage. Consequently, many number scaling approaches have been proposed to prevent a wordlength increase at the cost of a minor accuracy loss. They can be roughly divided into two categories: static scaling approaches and dynamic ones.
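The n+1 bit requirement can be checked with a small sketch. It assumes two's-complement integers (the fractional point does not affect the bit count), and the helper names are hypothetical.

```python
from typing import Tuple


def signed_range(bits: int) -> Tuple[int, int]:
    """Smallest and largest values of a two's-complement word of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def bits_needed(value: int) -> int:
    """Smallest two's-complement width that can hold value."""
    bits = 1
    while not (signed_range(bits)[0] <= value <= signed_range(bits)[1]):
        bits += 1
    return bits


def worst_case_result_bits(n: int) -> int:
    """Width needed for the extreme sums and differences of two n-bit operands."""
    lo, hi = signed_range(n)
    extremes = [lo + lo, hi + hi, lo - hi, hi - lo]
    return max(bits_needed(v) for v in extremes)
```

For example, with n = 8 the sum of the two most negative operands is -256, which needs 9 bits, so `worst_case_result_bits(8)` evaluates to 9.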

Oppenheim et al. [49] proposed a static scaling procedure that is widely adopted in today’s FFT hardware implementations. Since the maximum magnitude of the result increases by no more than a factor of 2 per butterfly stage, incorporating an attenuation of 1/2 at both inputs of a radix-2 butterfly unit (that is, increasing the integer part by one bit and decreasing the fractional part by one bit in a fixed-length word) can completely eliminate output overflows. However, this approach degrades the output SQNR due to the larger truncation errors caused by a fractional part that becomes shorter stage by stage. The above scaling method can be improved with only a slight modification. Instead of performing number scaling at the input, incorporating an attenuation of 1/2 at the output of each stage, as shown in Figure 16, can achieve a better overall SQNR.
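The effect of output scaling can be sketched on a single real component. This is a simplified model, not the full complex butterfly: the twiddle-factor product is assumed to have been formed already and to fit in the same word, so the sketch only shows that halving the sum and the difference keeps both results within the original width. All names are hypothetical.

```python
from typing import Tuple


def fits(value: int, bits: int) -> bool:
    """True if value is representable in a two's-complement word of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def scaled_butterfly(p: int, q_tw: int) -> Tuple[int, int]:
    """Add/subtract at full precision, then scale both outputs by 1/2.

    q_tw stands for the already twiddled second operand (an assumption).
    Python's >> on ints is an arithmetic shift, i.e. truncation by flooring.
    """
    return (p + q_tw) >> 1, (p - q_tw) >> 1


def never_overflows(bits: int) -> bool:
    """Exhaustively check every pair of operands of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    for p in range(lo, hi + 1):
        for q in range(lo, hi + 1):
            upper, lower = scaled_butterfly(p, q)
            if not (fits(upper, bits) and fits(lower, bits)):
                return False
    return True
```

For 6-bit operands, the unscaled sum 31 + 31 = 62 would overflow, whereas the scaled outputs always fit. The bit dropped by the shift is the truncation error that accumulates stage by stage.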

In [50], Ramakrishnan et al. concentrated on FFT designs for OFDM receivers. The authors exploit the fact that the input samples of OFDM follow a normal distribution to predict the possible output value range at each stage and then determine the scaling strategy accordingly. They suggest increasing the integer part by one bit every two stages instead of every stage for FFT designs used in OFDM. However, the input can vary from application to application, and is mostly assumed to be uniformly distributed in a typical FFT analysis [48]. Furthermore, our experimental results show that the approach presented in [50] works well only if the standard deviation of the normal distribution is within a specific range.

Therefore, instead of adopting the methods proposed in [49] and [50] directly, most designers try to find the optimized number format of output for each stage through simulation if a better SQNR is expected. Typically, there are two options for determining the number format of a radix-2 butterfly stage: keeping it unchanged as at the previous stage, or moving one bit from the fractional part to the integer part.
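Because each stage has the two options just listed, a k-stage radix-2 FFT has 2^k possible number format schedules. The sketch below enumerates them as sequences of integer-part widths. The representation of a schedule and the function name are assumptions made for illustration.

```python
from itertools import product
from typing import Iterator, Tuple


def format_schedules(initial_int_bits: int, stages: int) -> Iterator[Tuple[int, ...]]:
    """Yield the integer-part width after each stage for every configuration.

    At each stage the format is either kept (0) or one bit is moved from
    the fractional part to the integer part (1).
    """
    for choices in product((0, 1), repeat=stages):
        int_bits = initial_int_bits
        schedule = []
        for grow in choices:
            int_bits += grow
            schedule.append(int_bits)
        yield tuple(schedule)
```

A 1024-point radix-2 FFT has 10 stages and therefore 1024 schedules, each of which would need its own simulation run in a purely simulation-based search.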

Figure 16 A radix-2 butterfly unit with scaling by 1/2 at the output (inputs X_{m-1}[p] and X_{m-1}[q]; outputs X_m[p] and X_m[q])

However, when the number of stages (k) is large due to a large FFT size, it is virtually impossible to evaluate all feasible configurations (2^k) and then pick the best one through simulation. Consequently, designers usually select a limited set of "better" candidate configurations empirically, and still choose the best one among them through extensive, time-consuming simulation.

On the other hand, a dynamic scaling approach improves the output SQNR by means of a shared exponent. The BFP algorithm [1], one of the dynamic scaling methods, employs an intermediate buffer to store a block of output data, detects the maximum value, and then determines the exponent for that block of data. Though this method does achieve a better result than common static scaling approaches, the extra data buffer implies a notable increase in area. In addition, buffer access and exponent detection operations require longer processing latency and consume more power. Therefore, static scaling approaches are still much more commonly preferred for typical FFT hardware implementations.
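The shared-exponent idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the BFP algorithm of [1]: the block holds plain integers, the exponent is found by repeated right shifts, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def shared_exponent(block: Sequence[int], mantissa_bits: int) -> int:
    """Smallest right shift that lets every value fit in a signed mantissa."""
    lo, hi = -(1 << (mantissa_bits - 1)), (1 << (mantissa_bits - 1)) - 1
    shift = 0
    while any(not (lo <= (v >> shift) <= hi) for v in block):
        shift += 1
    return shift


def bfp_encode(block: Sequence[int], mantissa_bits: int) -> Tuple[List[int], int]:
    """Store the whole block with one exponent; low-order bits are truncated."""
    exponent = shared_exponent(block, mantissa_bits)
    return [v >> exponent for v in block], exponent


def bfp_decode(mantissas: Sequence[int], exponent: int) -> List[int]:
    """Restore the original scale; the truncated bits are not recovered."""
    return [m << exponent for m in mantissas]
```

The whole block must be buffered before the exponent is known, which is the source of the extra area and latency noted above.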

In this dissertation, we propose a fast probability-based static scaling optimization technique that provides a better output SQNR than existing static techniques and needs no simulation at all. It is as area-efficient as other static methods, since none of them requires a dynamic scaling unit; nevertheless, our technique can still achieve roughly the same level of output quality as dynamic scaling approaches. For every butterfly stage, the proposed method precisely estimates the accuracy loss of each candidate number format due to possible saturation and truncation errors via static probability-based analysis and then picks the best one. Furthermore, our method can work with various FFT sizes, FFT algorithms, wordlengths, and input signal distributions.

4.2.2 Motivation and problem definition

As mentioned, the approach proposed in [49] suggests increasing the bitwidth of the integer part by one at every radix-2 butterfly stage to avoid overflows. In this
