FFT Architectures - Review of FFT - A Study on Wordlength Optimization of Pipelined Fast Fourier Transform Processors via a Hybrid Method

Chapter 2 Review of FFT

2.2 FFT Architectures

The FFT is one of the most widely used digital signal processing algorithms. Recently, real-time FFT processors have again attracted attention in many communication systems. There are many architecture choices for these processors. Among them, pipelined architectures are particularly suitable for real-time applications because they match the sequential nature of sampling. They are also popular for large FFT VLSI realizations owing to their high regularity.

In this section, we introduce the pipeline-based architectures. The architectures discussed here implement DIF FFT algorithms; similar structures can be designed for DIT FFT algorithms.

Several architectures for pipelined FFT processors have been proposed. These include the Radix-2 Multi-path Delay Commutator (R2MDC) [6], Radix-2 Single-path Delay Feedback (R2SDF) [7], Radix-2² Single-path Delay Feedback (R2²SDF) [8][9], Radix-4 Single-path Delay Feedback (R4SDF) [6], and Radix-4 Multi-path Delay Commutator (R4MDC) [6]. They are introduced in this section.

• R2MDC

This is the most straightforward way to reorganize the data for FFT algorithms. At each stage, half of the data stream is delayed via the memory and processed together with the second half of the data stream. A 16-point R2MDC is shown in Figure 2.6.

Figure 2.6 R2MDC Architecture (N=16)

• R2SDF

Since the memory in R2MDC is idle 50% of the time, it can be reused as shown in Figure 2.7. This scheme exploits the different arrival times of the input data and the processed data. The utilization of the memory is 100%.

• R4MDC

It is similar to R2MDC, but its memory is utilized only 25% of the time. A 256-point R4MDC is shown in Figure 2.8.

Figure 2.8 R4MDC Architecture (N=256)

• R4SDF

It is a radix-4 version of R2SDF. It is as efficient as R2SDF in terms of memory utilization, and the utilization of the multipliers increases from 50% to 75%, at the cost of only 25% utilization of the butterfly (BF) element. A 64-point R4SDF is shown in Figure 2.9.

Figure 2.9 R4SDF Architecture (N=64)

• R2²SDF

It breaks one radix-4 BF operation into two radix-2 BF operations with trivial multiplications by ±1 and ±j. With a feedback mechanism, the memory is fully utilized, as in R2SDF and R4SDF. A 64-point R2²SDF is shown in Figure 2.10.

Table 2.1 Summary of N Point Pipelined FFT Architectures

A summary of these architectures is shown in Table 2.1 [5]. The delay feedback approaches are always more efficient than the corresponding delay commutator approaches in terms of memory requirements. The single-path architectures based on the radix-4 algorithm have fewer multipliers than those based on the radix-2 algorithm. However, the radix-2 based architectures are simple and regular. The radix-2² algorithm has the same multiplicative complexity as the radix-4 algorithm while still retaining the radix-2 butterfly structure. In this thesis we focus on the R2SDF and R2²SDF architectures.

The detailed architecture of R2SDF and R2²SDF, including the control unit, is shown in Figure 2.11(a). The butterfly processing element (PE) has two operation modes, shown in Figure 2.11(b). In mode 1 the PE stores the incoming data in the shift register, where they wait several cycles for the computation, and the stored results are multiplied by the twiddle factors; in mode 2 the PE performs the butterfly computation.

Figure 2.11 Units of R2SDF and R2²SDF (N=16): (a) detailed architecture with control unit; (b) operation modes of the butterfly PE
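The two PE modes can be made concrete with a small cycle-by-cycle model. The following is only a sketch under stated assumptions, not the thesis design: the class `R2SDFStage`, the function `r2sdf_fft`, the floating-point data, and the placement of the twiddle multiplication at the output of mode 1 are all illustrative choices.

```python
import cmath
from collections import deque


class R2SDFStage:
    """One radix-2 single-path delay-feedback stage (DIF), stepped per cycle."""

    def __init__(self, depth):
        self.depth = depth                      # length of the feedback shift register
        self.fifo = deque([0j] * depth)
        self.cycle = 0

    def step(self, x):
        phase = self.cycle % (2 * self.depth)
        delayed = self.fifo.popleft()
        if phase < self.depth:
            # Mode 1: store the input; emit the stored difference times its twiddle.
            self.fifo.append(x)
            out = delayed * cmath.exp(-2j * cmath.pi * phase / (2 * self.depth))
        else:
            # Mode 2: butterfly between the delayed sample and the current input.
            self.fifo.append(delayed - x)       # difference goes back into the register
            out = delayed + x                   # sum goes to the next stage
        self.cycle += 1
        return out


def r2sdf_fft(x):
    """Stream x through log2(N) stages and return X in natural order."""
    n = len(x)
    bits = n.bit_length() - 1
    stages = [R2SDFStage(n >> (s + 1)) for s in range(bits)]
    stream = []
    for sample in list(x) + [0j] * (n - 1):     # trailing zeros flush the pipeline
        for stage in stages:
            sample = stage.step(sample)
        stream.append(sample)
    valid = stream[n - 1:]                      # pipeline latency is N-1 cycles
    out = [0j] * n
    for i, value in enumerate(valid):           # results arrive in bit-reversed order
        out[int(format(i, f"0{bits}b")[::-1], 2)] = value
    return out
```

The shift-register depths are N/2, N/4, ..., 1, so the total memory is N-1 words and every register position is occupied in both modes, which is the 100% memory utilization mentioned for R2SDF.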

Chapter 3 Error Analysis

Fixed-point arithmetic is popular for FFT hardware implementation for its simplicity.

Because of the finite wordlength in the computation, the results must be truncated or rounded whenever an addition or multiplication produces more bits than the wordlength can hold; thus, errors are produced.

Statistical error analysis and simulation-based error analysis are the two most popular methods for FFT error analysis. Many papers on statistical and simulation-based error analysis of the fixed-point FFT have been published [10-14]. The previous statistical error analyses are not sufficient for our purpose of choosing the required wordlength stage by stage, so we derive a simplified statistical error model that meets this requirement.

In this chapter, we first briefly review quantization error analysis. Second, we introduce statistical error models in which the wordlength can be freely chosen stage by stage. Third, we briefly review the simulation environment. Finally, the accuracy of our error models is evaluated by comparing their results with those of the simulation-based error analysis.

3.1 Error Analysis of Quantization

Let X denote a finite-length sequence {x(n)}, n = 0, 1, 2, ..., N−1. The expected value of X is shown in equation (3.1); X is a zero-mean random sequence at the quantizer input. The variance of X is denoted by σ_x² and is shown in equation (3.2).

A quantizer maps X into the discrete-valued Y, and the quantization error is the difference between the two. Given the decision boundaries and the reconstruction levels of the quantizer, its output is shown in equation (3.3), and the quantization error variance is then given by equation (3.4).

Finally, the SQNR is given by equation (3.5).

As an example, consider a quantizer whose output is 2+1-bit signed-fraction discrete-valued data. The input-output mapping is shown in Figure 3.1(a). All input data that fall in the same interval produce the same output value: the inputs in one interval are all mapped to 0, the inputs in the next interval are all mapped to 0.25, and so on. The related quantization error mapping is shown in Figure 3.1(b).

Figure 3.1 Information of 2+1 Bits Quantizer: (a) input-output mapping; (b) quantization error mapping
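A 2+1-bit quantizer and its SQNR can be tried out numerically. The sketch below rests on assumptions that are not taken from the text: the quantizer truncates (rounds toward minus infinity) to a step of 0.25, the input is uniform on [−1, 1), and the SQNR is computed as the ratio of signal variance to error variance. The names `quantize` and `sqnr_db` are illustrative.

```python
import math
import random


def quantize(x, frac_bits=2):
    """Truncate x to a signed fraction with 1 sign bit and `frac_bits` fractional bits."""
    step = 2.0 ** -frac_bits
    q = math.floor(x / step) * step
    return min(max(q, -1.0), 1.0 - step)        # saturate to the representable range


def sqnr_db(samples, frac_bits=2):
    """Empirical SQNR: 10*log10(signal variance / quantization-error variance)."""
    errors = [quantize(s, frac_bits) - s for s in samples]
    mean_s = sum(samples) / len(samples)
    mean_e = sum(errors) / len(errors)
    var_s = sum((s - mean_s) ** 2 for s in samples) / len(samples)
    var_e = sum((e - mean_e) ** 2 for e in errors) / len(errors)
    return 10 * math.log10(var_s / var_e)


random.seed(0)
samples = [random.uniform(-1.0, 1.0) for _ in range(20000)]
print(quantize(0.1), quantize(0.3), round(sqnr_db(samples), 1))
```

Under these assumptions the error is roughly uniform over one step, so its variance is close to step²/12 and each additional fractional bit should add about 6 dB of SQNR.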

3.2 Statistical Error Models of FFT

An error analysis and model of the DIF radix-2 FFT algorithm was presented by Sundaramurthy et al. [12]. They assume that the wordlength of all PE stages is the same. This is insufficient for applications that allow different wordlengths between PE stages.

Because of the finite wordlength in the computation, the results must be truncated or rounded after each calculation. In addition, the FFT computation is an iterative process in which the values grow in magnitude, so overflow must be taken into account.

In order to prevent overflow and to ensure output accuracy, data need to be scaled.

There are two scaling methods to prevent the FFT from overflowing: overall scaling and stage-by-stage scaling [2]. The input constraint of the FFT with overall scaling is $|x(n)| < \frac{1}{N}$, and there is no need to divide the input of each butterfly by two. The input constraint of the FFT with stage-by-stage scaling is $|x(n)| < 1$, and the input data should be divided by 2 for each butterfly. Owing to noise considerations [14], stage-by-stage scaling is used in this thesis.
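The effect of stage-by-stage scaling can be checked with a small floating-point model. This is a sketch under assumptions: the function name `dif_fft_scaled` is illustrative, the division by 2 is applied to both butterfly outputs, and the input samples are drawn so that their magnitude stays below 1. It records the largest magnitude after every stage to show that no value leaves the unit circle.

```python
import cmath
import math
import random


def dif_fft_scaled(x):
    """Radix-2 DIF FFT with every butterfly output divided by 2.

    Returns (X/N in natural order, list of peak magnitudes after each stage).
    """
    a = list(x)
    n = len(a)
    peaks = []
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                u, v = a[start + j], a[start + j + span]
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                a[start + j] = (u + v) / 2              # scaled sum
                a[start + j + span] = (u - v) / 2 * w   # scaled difference times twiddle
        peaks.append(max(abs(z) for z in a))
        span //= 2
    bits = n.bit_length() - 1
    out = [0j] * n
    for i, z in enumerate(a):                           # undo the bit-reversed ordering
        out[int(format(i, f"0{bits}b")[::-1], 2)] = z
    return out, peaks


random.seed(1)
lim = 1 / math.sqrt(2)
x = [complex(random.uniform(-lim, lim), random.uniform(-lim, lim)) for _ in range(32)]
result, stage_peaks = dif_fft_scaled(x)
print([round(p, 3) for p in stage_peaks])
```

Because each scaled butterfly output has a magnitude no larger than the larger of its two inputs, the peaks never grow from stage to stage, and the final result equals the DFT divided by N.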

In this section we aim to derive statistical FFT error models for the DIF radix-2 and radix-4 algorithms with the stage-by-stage scaling scheme. These models are usable when the wordlength differs from stage to stage.

3.2.1 Definitions and Constraints

In these analyses, we assume fixed-point signed-fraction arithmetic with a wordlength of $b_k$ bits, where k is the index of the PE stage. The input of the N-point FFT, denoted by x(n), consists of N complex numbers, i.e., 2N real random variables, which are mutually uncorrelated and uniformly distributed in $(-1/\sqrt{2},\ 1/\sqrt{2})$. The quantization error of the twiddle factor, $W^p$, is not treated here. The truncation operations are all modeled as mutually uncorrelated.

3.2.2 Expected Noise Sources

Figure 3.2 shows the error model of a PE stage with stage-by-stage scaling by 2. Several noise sources are considered: the quantization error due to the wordlength difference between PE stages, the quantization error of scaling, the quantization error of the complex multiplication by the twiddle factor, and the error due to insufficient output wordlength.

The wordlength-difference error is produced when the wordlength of stage k−1 is greater than that of stage k; its variance is the variance of the bits truncated in going from the former wordlength to the latter. The scaling error is produced by the scaling at each stage. A complex scaling consists of two real scalings, i.e., the real and imaginary parts of the number are scaled separately. A complex multiplication is carried out as real multiplications, and each real multiplication is truncated separately. The complex multiplication error variance is equal to the variance of the truncated bits of the multiplication result; it is shown in equation (3.7). Finally, if the output wordlength is insufficient, a quantization error is produced at the FFT output. Its variance is shown in equation (3.8), which involves the wordlength of the last PE stage and the FFT output wordlength.

3.2.3 Output Signal to Quantization Noise Ratio (SQNR)

Since all the noise sources are assumed to be uncorrelated, the variance of the noise at an output node of the SFG of Figure 2.5 is the sum of the contributions from all the individual noise sources that propagate to that output node. The part of the output-node noise variance contributed by $\sigma_{s_k}^2$ is denoted by $\sigma_S^2$, and the contribution of $\sigma_{m_k}^2$ is denoted by $\sigma_M^2$.

Figure 3.3 shows the propagation of $\sigma_{s_k}^2$ in the 8-point DIF Radix-2 SFG. The numbers of error sources propagating to any output node from the first, second, and third stages are 8, 4, and 2, respectively. The expression for $\sigma_S^2$ is shown in equation (3.9), where the total number of stages n is equal to $\log_2 N$, and the factor $(1/4)^{n-k}$ is the effect of scaling on the error propagating from stage k.

$$\sigma_S^2 \approx N\left(\tfrac{1}{4}\right)^{n-1}\sigma_{s_1}^2 + \frac{N}{2}\left(\tfrac{1}{4}\right)^{n-2}\sigma_{s_2}^2 + \cdots + \frac{N}{2^{n-1}}\left(\tfrac{1}{4}\right)^{n-n}\sigma_{s_n}^2 \tag{3.9}$$

The $\sigma_S^2$ of the DIF Radix-2² algorithm is the same as that of DIF Radix-2.

Figure 3.3 Propagating Flow of Quantization and Scaling Errors
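The structure of equation (3.9) can be evaluated directly. The sketch below assumes the reading of (3.9) in which stage k contributes N/2^(k−1) sources, each attenuated by (1/4)^(n−k); the function name `scaling_noise_variance` and the per-stage input variances are illustrative and not taken from the thesis.

```python
def scaling_noise_variance(stage_vars):
    """Sum the per-stage variances as in the assumed form of equation (3.9).

    stage_vars[k-1] is the noise variance injected at PE stage k (k = 1..n).
    """
    n = len(stage_vars)
    n_points = 2 ** n
    total = 0.0
    for k, var_k in enumerate(stage_vars, start=1):
        sources = n_points / 2 ** (k - 1)       # 8, 4, 2 for the 8-point example
        attenuation = 0.25 ** (n - k)           # effect of the remaining scalings
        total += sources * attenuation * var_k
    return total


# 8-point example with unit variance at every stage: 8/16 + 4/4 + 2/1
print(scaling_noise_variance([1.0, 1.0, 1.0]))  # 3.5
```

With equal per-stage variances the last stage contributes the most, because its errors are no longer attenuated by later scalings.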

For convenience in deriving $\sigma_M^2$, it is first assumed that all the complex multiplications in the SFG are noisy. In the 8-point case there are four complex multiplications (half of 8) in each stage, and each noise source from the first (k=1), second (k=2), and third (k=3) stage propagates to 4, 2, and 1 output nodes, respectively. Hence $\sigma_M^2$ can be expressed as shown in equation (3.10). The corresponding expression of $\sigma_M^2$ for the Radix-2² algorithm is shown in equation (3.11).

In obtaining equations (3.10) and (3.11), it is assumed that all complex multiplications are noisy. However, multiplications associated with the twiddle factors $W^p = \pm 1$ or $W^p = \pm j$ introduce no errors. Figure 3.5 shows the positions of the noiseless twiddle factors in the 8-point Radix-2 algorithm. The propagation of these noise sources is identical to that in $\sigma_M^2$. Thus, the noise variance contribution of these multiplications is denoted by $\sigma_C^2$, and its expression is shown in equation (3.12). The corresponding expression for the Radix-2² algorithm is shown in equation (3.13).

Figure 3.5 Propagating Flow of Noiseless Multiplication

The average output signal variance is given in equation (3.14) [2].

$$\sigma_x^2 = \frac{1}{3N} \tag{3.14}$$

Finally, the SQNR expression is shown in equation (3.15).
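The value 1/(3N) in equation (3.14) can be checked with a short derivation. It is a sketch under two explicit assumptions: the real and imaginary parts of each input sample are uncorrelated and uniform on $(-1/\sqrt{2},\ 1/\sqrt{2})$, and the $\log_2 N$ scalings by 1/2 make the processor compute the DFT divided by N. The symbols $x_r$, $x_i$, and $X(k)$ are introduced only for this illustration.

```latex
\begin{align*}
\operatorname{Var}(x_r) = \operatorname{Var}(x_i) &= \frac{(2/\sqrt{2})^2}{12} = \frac{1}{6},
  \qquad E\,|x(m)|^2 = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}, \\
X(k) &= \frac{1}{N}\sum_{m=0}^{N-1} x(m)\,W_N^{mk}, \\
E\,|X(k)|^2 &= \frac{1}{N^2}\sum_{m=0}^{N-1} E\,|x(m)|^2
  = \frac{1}{N^2}\cdot N\cdot\frac{1}{3} = \frac{1}{3N}.
\end{align*}
```

The cross terms vanish because the input samples are zero-mean and uncorrelated, and $|W_N^{mk}| = 1$.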

3.3 Simulation-Based Error Analysis of FFT

Many papers on simulation-based error analysis have been published.

Johansson et al. published a paper on simulation-based error analysis [17] in 1999. A C model is used to perform the simulation, so users can obtain suitable results under their own constraints. The wordlength of each stage, rounding or truncation for each stage, the number of stages at which scaling is applied, and the number of bits are parameters that can be chosen by the user.

Figure 3.6 shows the simulation environment of SQNR. It compares the outputs of fixed-point FFT and floating-point FFT to calculate the SQNR. The SQNR calculation expression is shown in equation (3.16).

Figure 3.6 Simulation Environment of SQNR
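A simulation environment of this kind can be sketched in a few lines. The model below is an illustration under assumptions, not the C model of [17]: each wordlength is interpreted as a number of fractional bits, every stage output is truncated, twiddle factors are left unquantized, and the SQNR is taken as the ratio of the floating-point output energy to the energy of the difference between the two outputs. The names `truncate`, `dif_fft`, and `sqnr_db` are hypothetical.

```python
import cmath
import math
import random


def truncate(z, frac_bits):
    """Truncate the real and imaginary parts to `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return complex(math.floor(z.real * scale) / scale,
                   math.floor(z.imag * scale) / scale)


def dif_fft(x, wordlengths=None):
    """Radix-2 DIF FFT with 1/2 scaling per stage (output in bit-reversed order).

    With `wordlengths` (one entry per stage) every stage output is truncated;
    without it the computation stays in floating point.
    """
    a = list(x)
    n = len(a)
    span, stage = n // 2, 0
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                u, v = a[start + j], a[start + j + span]
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                s, d = (u + v) / 2, (u - v) / 2 * w
                if wordlengths is not None:
                    s = truncate(s, wordlengths[stage])
                    d = truncate(d, wordlengths[stage])
                a[start + j], a[start + j + span] = s, d
        span //= 2
        stage += 1
    return a


def sqnr_db(reference, measured):
    """SQNR in dB from the reference output and the fixed-point output."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    return 10 * math.log10(signal / noise)


random.seed(0)
lim = 1 / math.sqrt(2)
x = [complex(random.uniform(-lim, lim), random.uniform(-lim, lim)) for _ in range(64)]
reference = dif_fft(x)
for bits in (12, 16):
    print(bits, round(sqnr_db(reference, dif_fft(x, [bits] * 6)), 1))
```

Passing a list such as `[10, 11, 12, 13, 14, 15]` gives a different wordlength to each stage, which is the case the statistical models of Section 3.2 are meant to cover.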


3.4 Verifications

Since the SQNR can be calculated by simulation-based error analysis, the simulation setup can also be used to verify our new error models.

Wordlengths of 8 to 32 bits are the popular choices for implementing fixed-point FFT architectures. In this section, we calculate the SQNR by the statistical and simulation-based methods for 8-, 16-, …, 8192-point DIF Radix-2 FFTs and 16-, 64-, …, 4096-point DIF Radix-2² FFTs, with the wordlength of each PE stage freely chosen from 8 to 32 bits. Then we compare the results to verify the statistical error models.

First, we choose the wordlength of each PE stage randomly from 8 to 32 bits. Second, we compare all wordlength sets within a restricted range.

3.4.1 Random Verification

For example, we randomly generate 20 wordlength sets for the 1024-point DIF Radix-2 FFT. The input wordlength is equal to that of the first PE stage, and the output wordlength is equal to that of the last PE stage. The SQNR is then calculated by the statistical and simulation-based methods, respectively, and the SQNR difference between them is computed. Table 3.1 shows the results. The first column shows the index of the wordlength set, the second column shows the wordlength of each PE stage, column 3 shows the SQNR from the simulation-based error analysis, column 4 shows the SQNR from the statistical error analysis, and the last column is the difference between the two SQNR values.

Table 3.1 Examples of Random Verification (N=1024)
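The random verification procedure can be outlined as follows. This is a sketch with hypothetical names: `random_wordlength_sets` draws one wordlength per radix-2 PE stage (assumed to be log₂N stages), and `sqnr_differences` takes the two SQNR estimators as caller-supplied functions, since neither model is defined in this block.

```python
import random


def random_wordlength_sets(n_points, count, low=8, high=32, seed=None):
    """Draw `count` wordlength sets, one wordlength per PE stage."""
    rng = random.Random(seed)
    stages = n_points.bit_length() - 1          # log2(N) stages for radix-2
    return [[rng.randint(low, high) for _ in range(stages)] for _ in range(count)]


def sqnr_differences(sets, simulated_sqnr, statistical_sqnr):
    """Difference (simulation minus statistical model) for every wordlength set."""
    return [simulated_sqnr(s) - statistical_sqnr(s) for s in sets]


sets = random_wordlength_sets(1024, 20, seed=1)
print(len(sets), len(sets[0]))
```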

We compared 10000 wordlength sets for the 1024-point FFT with the Radix-2 and Radix-2² algorithms. For Radix-2, the difference is almost always within ±1 dB; for Radix-2², it is almost always within ±1.1 dB. Fig. 3.7(a) shows the distribution of the difference for the 1024-point Radix-2 FFT, and Fig. 3.7(b) shows that for the 1024-point Radix-2² FFT.

Figure 3.7 Results of Random Verification of Radix-2 and Radix-2²: (a) 1024-point Radix-2 FFT; (b) 1024-point Radix-2² FFT

3.4.2 Partial Exhaustive Verification

Exhaustively comparing all wordlength sets from 8 to 32 bits is not practical because the simulation time would be prohibitive. However, we can perform an exhaustive comparison over a restricted range, i.e., a part of the solution space. We chose wordlengths of 11 to 18 bits for a partially exhaustive comparison of the 64-point DIF Radix-2 and Radix-2² FFTs. The comparison took about 130 hours, and the results are shown in Figure 3.8. The difference is within ±1.1 dB.
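The size of the search space explains why only a partial exhaustive comparison is feasible. The sketch below assumes one free wordlength per radix-2 PE stage, i.e., log₂N stages; the names `wordlength_set_count` and `all_wordlength_sets` are illustrative.

```python
import itertools


def wordlength_set_count(n_points, low, high):
    """Number of wordlength sets when each of the log2(N) stages is chosen freely."""
    stages = n_points.bit_length() - 1
    return (high - low + 1) ** stages


def all_wordlength_sets(n_points, low, high):
    """Lazily enumerate every wordlength set in the given range."""
    stages = n_points.bit_length() - 1
    return itertools.product(range(low, high + 1), repeat=stages)


print(wordlength_set_count(64, 11, 18))     # 8 choices, 6 stages
print(wordlength_set_count(64, 8, 32))      # 25 choices, 6 stages
print(wordlength_set_count(1024, 8, 32))    # 25 choices, 10 stages
```

Under this assumption the 11-to-18-bit range for 64 points contains 8⁶ = 262144 sets, whereas the full 8-to-32-bit range already contains 25⁶ sets for 64 points and 25¹⁰ for 1024 points.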

Figure 3.8 Results of Partial Exhaustive Verification of Radix-2 and Radix-2², panels (a) and (b)

Section 3.4.1 and 3.4.2 clearly show that the result obtained from the statistical error

Chapter 4 Wordlength Optimization

The wordlength is an important design parameter that affects both performance and complexity. A longer wordlength is preferred for good precision, but increasing the wordlength increases the complexity: it enlarges the memory and the computational units and thereby increases power consumption and decreases performance. Hence, the wordlength requires careful optimization.

In this chapter, we first briefly review the design flow of an FFT processor. Then we describe our approach, the hybrid wordlength optimization method. Finally, two examples are shown.

4.1 FFT Processor Design Flow

Many factors have to be considered when designing an FFT processor. Figure 4.1 shows the overall design flow of an FFT processor. First, the system requirements need to be specified, such as the number of FFT points, SQNR, throughput, area, and power. Then a suitable FFT algorithm and FFT architecture need to be chosen. Finally, the wordlength of the architecture needs to be analyzed.

Figure 4.1 Design Flow of FFT Processors

When the FFT is implemented as a fully custom ASIC, the wordlength of each stage can be freely chosen, except for the input and output wordlengths of the FFT processor, which are specified by the system. The internal wordlengths of the FFT processor determine its precision and complexity. In general, a longer wordlength is preferred for better numerical precision. On the other hand, increasing the wordlength increases the complexity: it raises the hardware cost and power consumption and decreases the speed. The optimization is therefore a trade-off between precision and complexity.

To reduce the overall system design time, an automatic wordlength optimization solution is preferred. A simulation-based method for pipelined FFTs was presented by Lin [3]. We present a faster hybrid method in this thesis. Figure 4.2 outlines the automation flow. It has four sequential steps: upper bound wordlength evaluation, lower bound wordlength evaluation, optimized wordlength candidate searching, and optimized wordlength selection. In addition, some tables and libraries are built offline to speed up this flow.

Figure 4.2 Overall Flow of Wordlength Optimization

4.2 Wordlength Generation

The items in Fig. 4.2 are introduced in this section. The flow optimizes the area under the input constraints, which include the number of FFT points, SQNR, throughput, FFT input and output wordlengths, SQNR simulation confidence interval, and SQNR simulation error. The output is the wordlength of each PE stage.

4.2.1 Library and Table

Since we optimize hardware cost, the relevant hardware library needs to be chosen. The adder, multiplier, multiplexer, read-only memory (ROM), and shift register are the five basic elements of the FFT. The hardware library determines the table that maps wordlength to area and critical path for these components [3].

PE stages are the hardware blocks in the wordlength generation flow, and they are built from the basic components. To speed up the automation flow, we need a table that stores the area and critical path of each PE stage, the PE stage table [3].

In Figure 4.2, the mean of SQNR variance table is used to calculate the number of simulation runs required for different simulation confidence levels [3].

4.2.2 Upper Bound Wordlength Evaluation

Throughput is one of the input constraints. Satisfying the throughput constraint implies that the critical path must be short enough to meet equation (4.1). In other words, a stage violates the pipeline timing if its critical path is greater than 1/throughput.

$$\text{critical path} < \frac{1}{\text{throughput}} \tag{4.1}$$

The upper bound wordlength (UBW) is defined as the largest possible wordlength such that the critical path of the corresponding PE stage satisfies equation (4.1). The upper bound wordlength set ($\mathbf{UBW}$) is defined as the set of the wordlengths of all PE stages in which each wordlength is the UBW of its stage. Note that we use bold print to denote a set and light print to denote an element of a set. For example, if the $\mathbf{UBW}$ of a 1024-point FFT (10 PE stages) is {14 15 15 16 17 18 18 18 19 20}, then the UBW of stage 1 (UBW₁) is 14, UBW₂ is 15, UBW₃ is 15, and so on.

Figure 4.3 Flow of Upper Bound Wordlength Evaluation

Fig. 4.3 shows the flow of UBW evaluation. There are three conditions that stop the evaluation. Condition 1: the UBW is found if the SQNR and throughput constraints are both met. Condition 2: the optimization fails if the SQNR constraint cannot be met; the maximum possible SQNR is reported before stopping. Condition 3: the optimization fails if the throughput constraint cannot be met; the maximum possible throughput is reported before stopping.
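The UBW definition and the three stop conditions can be sketched as a small search. Everything below is an assumption made for illustration: the function `evaluate_ubw`, the status strings, the delay model `toy_critical_path` (a stand-in for the PE stage table), and the SQNR model `toy_sqnr_db` (a stand-in for the statistical error model) are hypothetical and not the flow of Fig. 4.3 itself.

```python
import math


def toy_critical_path(stage, wordlength):
    """Hypothetical critical path in seconds; later stages are slightly faster."""
    return (2.05 + 0.4 * wordlength - 0.1 * stage) * 1e-9


def toy_sqnr_db(wordlengths):
    """Hypothetical SQNR model in which every stage adds noise 2^(-2b)."""
    return -10 * math.log10(sum(2.0 ** (-2 * b) for b in wordlengths))


def evaluate_ubw(n_stages, critical_path, throughput, candidates, sqnr, sqnr_target):
    """Return (status, wordlength set) following the three stop conditions."""
    period = 1.0 / throughput
    ubw = []
    for stage in range(1, n_stages + 1):
        feasible = [w for w in candidates if critical_path(stage, w) < period]
        if not feasible:
            return "throughput constraint cannot be met", None     # Condition 3
        ubw.append(max(feasible))               # largest wordlength meeting (4.1)
    if sqnr(ubw) < sqnr_target:
        return "SQNR constraint cannot be met", ubw                 # Condition 2
    return "found", ubw                                             # Condition 1


print(evaluate_ubw(6, toy_critical_path, 100e6, range(8, 33), toy_sqnr_db, 60.0))
```

Since the UBW set uses the longest wordlength that timing allows in every stage, its SQNR is the best achievable, which is why a failure at this point means the SQNR constraint cannot be met at all.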

4.2.3 Lower Bound Wordlength Evaluation

The lower bound wordlength (LBW) of a PE stage is defined such that, if the wordlength of that stage is set equal to its LBW, the SQNR of the new set is just smaller than the SQNR of the input constraint. The lower bound wordlength set ($\mathbf{LBW}$) is defined as $\{LBW_x \mid x \in \mathbb{N},\ 1 \le x \le n\}$, where x denotes the x-th PE stage. Based on the definition of LBW, it is easy to see that the SQNR of $\mathbf{LBW}$ is smaller than the SQNR of the input constraint.

Fig. 4.4 shows the flow of LBW evaluation. The inputs are N (the number of FFT points), the SQNR, the input and output wordlengths, and $\mathbf{UBW}$. The output is $\mathbf{LBW}$.

Figure 4.4 Flow of Lower Bound Wordlength Evaluation

Fig. 4.5 shows an example of LBW evaluation, where iSQNR is the input SQNR constraint. The steps in Fig. 4.5 proceed from top to bottom and from left to right, and the arrows show the detailed steps. The symbols “>” (greater than) and “<” (smaller than) denote the results of comparing the SQNR from the statistical error analysis with the SQNR of the input constraint.

Figure 4.5 Example of Lower Bound Wordlength Evaluation (N=64)
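One way to read the LBW definition is as a per-stage search that starts from the UBW set. The sketch below is an interpretation rather than the flow of Fig. 4.4: it lowers one stage at a time, keeps the other stages at their UBW, and stops at the first wordlength whose SQNR falls below the constraint. The SQNR model `toy_sqnr_db` is a hypothetical stand-in for the statistical error model, and `lower_bound_wordlengths` is an illustrative name.

```python
import math


def toy_sqnr_db(wordlengths):
    """Hypothetical SQNR model in which every stage adds noise 2^(-2b)."""
    return -10 * math.log10(sum(2.0 ** (-2 * b) for b in wordlengths))


def lower_bound_wordlengths(ubw, sqnr, target, floor=1):
    """For each stage, find the largest wordlength that just violates the target."""
    lbw = []
    for x in range(len(ubw)):
        trial = list(ubw)                       # all other stages stay at their UBW
        while trial[x] > floor and sqnr(trial) >= target:
            trial[x] -= 1                       # shorten only stage x
        lbw.append(trial[x])
    return lbw


ubw = [16] * 6
print(lower_bound_wordlengths(ubw, toy_sqnr_db, 80.0))
```

Under this reading, any wordlength set that meets the SQNR constraint must have every stage strictly above its LBW and at most at its UBW, which bounds the candidate search that follows.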

4.2.4 Optimized Wordlength Candidate (OWC) Searching

4.2.4.1 Optimization Format

The FFT processor uses large memories, especially in the early stages. Figure 4.6 shows the area increment of each PE stage when its wordlength is increased by 1 bit. Therefore, keeping the wordlength short in the early stages is a good choice for area optimization.

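The output SQNR of the pipelined FFT processor has the property shown in equation (4.2): the SQNR is approximately a $10\log_{10}$ expression of a sum of n terms, one per PE stage, in which the term of stage k is the constant $a_k$ multiplied by a power of two whose exponent is set by the wordlength $b_k$. Here $a_n$ is the constant of PE stage n and $b_n$ is the wordlength of PE stage n. It is easy to see that if the term of one stage is much larger than those of the others, the SQNR is dominated by that stage. So the wordlengths are used efficiently when they are close to one another.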

Owing to the properties above, the expected optimized wordlength set is sorted in ascending order from stage 1 to stage n, and the wordlengths of adjacent stages are close to each other, for example {11 11 12 13 13 14} and {14 14 14 14 15 16}. For simplicity, we refer to wordlength sets of this form as the optimization format in the remainder of this section.
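The optimization format can be expressed as a simple predicate. The sketch assumes that "close" means adjacent stages differ by at most `max_step` bits, with a default of 1 chosen to match the two example sets; both the threshold and the name `is_optimization_format` are illustrative.

```python
def is_optimization_format(wordlengths, max_step=1):
    """True if the set is non-decreasing and adjacent stages differ by <= max_step."""
    pairs = zip(wordlengths, wordlengths[1:])
    return all(0 <= later - earlier <= max_step for earlier, later in pairs)


print(is_optimization_format([11, 11, 12, 13, 13, 14]))
print(is_optimization_format([14, 14, 14, 14, 15, 16]))
print(is_optimization_format([16, 12, 13, 13, 14, 14]))
```

A check of this kind could be used to discard candidate sets before any SQNR or area evaluation is attempted.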
