FFT Processor - A Study on Design Optimization of Network-on-Chip System Platforms

The FFT is one of the most widely used algorithms for calculating the Discrete Fourier Transform (DFT) owing to its efficiency in reducing computation time [23]. Recently, real-time FFT processing has played a significant role in many communication systems based on Orthogonal Frequency Division Multiplexing (OFDM) technology, such as HDTV, xDSL modems and wideband mobile terminals.

Pipelined FFT implementations are well suited to real-time applications because a pipelined FFT matches the sequential nature of sampling. Several FFT architectures have been developed, such as Radix-2 Multi-path Delay Commutator (R2MDC) [24], Radix-2 Single-path Delay Feedback (R2SDF) [25], Radix-2² Single-path Delay Feedback (R2²SDF) [26][27], Radix-4 Single-path Delay Feedback (R4SDF) and Radix-4 Multi-path Delay Commutator (R4MDC) [24]. Among these architectures, delay feedback approaches are always more efficient than the corresponding delay commutator approaches in terms of required memory size [26] [28]. The R4SDF requires fewer multipliers than the R2SDF; however, the R2SDF architecture is simple and regular. The R2²SDF architecture is a compromise that combines the R2SDF structure with the multiplicative complexity of the R4SDF. This study focuses on the R2SDF and R2²SDF architectures.

Since the pipeline FFT architecture is memory-consuming, reducing its memory requirement will save a significant amount of chip area. Several studies have employed regular module implementations and have attempted to reduce the area-consuming elements in the FFT design. The design of [29] reduces the amount of memory used to store the twiddle factors by employing canonic signed digit (CSD) constant multipliers. A new FFT architecture, the radix-2 single deep delay feedback (R2SD²SF) presented in [30], has smaller complex multipliers and adders than other FFT designs. Both the designs of [29] and [30] have fixed wordlengths for data and coefficients in each pipeline stage.

The possibility of using varying wordlengths for these stages is frequently ignored in the pursuit of modularized solutions. However, the increasing use of intellectual property (IP) makes non-modular implementations viable, allowing further exploitation of pipelined architectures.

In general, an FFT cannot be implemented exactly. Each multiplier and adder in the pipelined FFT architecture can introduce errors due to rounding or truncation of arithmetic results. Errors typically accumulate successively over FFT stages. That is, errors from early stages can affect performance in later stages. The wordlengths of data and coefficients chiefly affect precision, quantization errors, and hardware complexity. Increased wordlengths increase the precision and reduce quantization error at the cost of area and power. Conversely, to maintain a lower hardware cost, a shorter wordlength can be chosen at the sacrifice of precision. Therefore, identifying an optimized wordlength solution is necessary.

Two conventional methods for analyzing FFT error in terms of signal-to-quantization-noise ratio (SQNR) and wordlengths are statistical error analysis and simulation-based analysis. Although the SQNR can be calculated efficiently by employing statistical models [31] [32] [33], the accuracy of the calculated result depends heavily on the model used. A more precise model yields more accurate results. The simulation-based method evaluates the FFT by comparing simulation results of the fixed-point computations with those obtained using floating-point arithmetic [34]. Although simulation increases the accuracy of the evaluation results, it is time-consuming.

According to error analysis, optimizing the wordlengths of pipeline stages in FFT processors for given specifications is feasible. Optimization of an 8192-point FFT processor using the simulation method has shown that progressive wordlengths and scaling in the early stages can achieve a good compromise between SQNR and hardware cost [35]. However, this approach requires a long time to run the simulation.

This work presents a statistical model for error analysis at the stage level with varying wordlengths in the pipeline FFT processor. Furthermore, a hybrid method for reducing the required simulation time is introduced. The optimized wordlength parameters at each stage are generated automatically according to design specifications of FFT processors, such as the length of FFT, SQNR and the real-time processing requirements. Finally, the optimization flow using the proposed error model and the hybrid method is demonstrated.

The rest of this chapter is organized as follows. Section 3.1 gives a brief review of the FFT. Section 3.2 then introduces statistical and simulation-based error analyses and demonstrates the effectiveness of these methods. Section 3.3 describes the proposed method for wordlength optimization step by step, while Section 3.4 summarizes the experimental results. Conclusions are finally drawn in Section 3.5.

3.1 Overview of FFT

An FFT based on structuring the DFT computation by forming increasingly smaller subsequences of the input sequence x[n] is called a decimation-in-time (DIT) FFT. Alternatively, an FFT can also be decomposed using a first-half/second-half approach that divides the output sequence X(r) into increasingly smaller subsequences; this procedure is called a decimation-in-frequency (DIF) FFT [36]. Since both of these schemes are similar in nature, their performance cannot be exactly compared without a given architecture [33].
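As a minimal sketch of the first-half/second-half decomposition described above, the following Python code implements a recursive radix-2 DIF FFT and checks it against a direct DFT. The function names `dft` and `fft_dif` are illustrative assumptions and do not come from this work; a pipelined processor evaluates the same butterflies stage by stage rather than recursively.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference (illustrative helper)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * r * t / n) for t in range(n))
        for r in range(n)
    ]


def fft_dif(x):
    """Recursive radix-2 decimation-in-frequency FFT.

    len(x) must be a power of two. The output is returned in natural order.
    """
    n = len(x)
    if n == 1:
        return list(x)
    half = n // 2
    # Butterflies between the first and second halves of the input.
    top = [x[i] + x[i + half] for i in range(half)]
    # In DIF, the twiddle factor is applied after the subtraction.
    bottom = [
        (x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
        for i in range(half)
    ]
    out = [0j] * n
    out[0::2] = fft_dif(top)      # even-indexed outputs X(0), X(2), ...
    out[1::2] = fft_dif(bottom)   # odd-indexed outputs X(1), X(3), ...
    return out
```

The recursion splits the output sequence into even and odd halves at every level, which is the sense in which the output X(r) is divided into increasingly smaller subsequences.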


Figure 3.1: Conventional R2SDF and R2²SDF DIF implementations. (a) R2SDF architecture. (b) R2²SDF architecture.

In this work, the DIF algorithm is used to illustrate the architectural implementations.

This work examines the R2SDF and R2²SDF architectures for the fixed-point DIF pipeline FFT processor to demonstrate the effectiveness of the proposed optimization method. Their block diagrams are shown in Fig. 3.1, where N is the FFT length, b_k is the wordlength of stage k, k ∈ {1, 2, · · · , P}, and P = log₂N. Owing to spatial regularity, both controllers in these architectures can be implemented using simple P-bit counters [25] [27]. Since the valid output range of the add/subtract operation of the FFT butterfly is double the valid input range, scaling by 1/2 is applied to eliminate overflow. Without scaling by 1/2, overflow would cause excessive error. Therefore, scaling by 1/2 is employed at each stage in this work.
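The effect of the per-stage scaling can be illustrated with a small sketch. The helper names `butterfly_scaled` and `in_range` are hypothetical, and the range (−1, 1) for the real and imaginary parts is assumed here only as a stand-in for a signed-fraction format.

```python
def butterfly_scaled(a, b):
    """Radix-2 butterfly (sum and difference) followed by scaling by 1/2."""
    return (a + b) / 2, (a - b) / 2


def in_range(z):
    """True if both parts of z fit an assumed signed-fraction range (-1, 1)."""
    return abs(z.real) < 1 and abs(z.imag) < 1


# Two in-range inputs whose unscaled sum overflows the assumed range.
a = 0.9 + 0.9j
b = 0.8 - 0.9j
unscaled_sum = a + b                      # real part exceeds 1
scaled_sum, scaled_diff = butterfly_scaled(a, b)
```

Because the sum or difference of two in-range values can be at most twice the range, halving the result always brings it back inside the range, at the cost of one truncated bit.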

The area and power consumption of these pipeline architectures are dominated by memory (FIFOs) and multipliers, so progressively adjusting the wordlengths of the pipeline stages can reduce the area of both and hence the overall power consumption. Adjusting wordlengths while maintaining the SQNR requirement requires error analysis.

3.2 Error Analysis

3.2.1 Statistical Analysis

This section introduces the statistical error model for varying wordlengths of pipeline stages. The precision of the FFT processor is then discussed according to the SQNR derivation. This derivation has two major steps. The first step involves finding error sources based on the architectures adopted and the fixed-point arithmetic schemes, e.g., truncation and whether scaling by 1/2 is applied. The next step entails searching the paths of error propagation and combining all of these errors along the paths to evaluate the variance of the errors propagating to the output. Finally, the SQNR of the FFT output is given.

In the following analyses, two architectures, R2SDF and R2²SDF, are used to derive the statistical error models. Assume fixed-point arithmetic with (b_k + 1)-bit wordlengths and a signed fraction, where k is the index of the PE stage. The input to an N-point FFT, denoted by x[n], where n = 0, 1, 2, · · · , N − 1, is a sequence of finite-valued complex numbers. These numbers consist of 2N real random variables that are uncorrelated and uniformly distributed in (−1/√2, 1/√2). One dB is added to the SQNR constraint (iSQNR) to allow for the SQNR error in the statistical model. The effect of inaccuracy in the twiddle factor, W^p, is not addressed here. The truncation operations are modeled as uncorrelated.

Error Sources

Four major error sources must be addressed in an FFT processor. The first is the quantization error caused by the wordlength difference between PE stages, whose variance is σ²_q,k. The second occurs during scaling, and its variance is represented by σ²_s,k. The third results from the complex multiplication by the twiddle factor, and its variance is denoted by σ²_m,k. The last, the insufficient-output-wordlength error σ²_q,o, is considered only for the last stage of the FFT processor.
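To give a feel for the size of such truncation variances, the sketch below computes the exact variance of the bits discarded when a fraction with `b_from` fractional bits is truncated to `b_to` bits, and compares it with the standard closed form (2^(−2·b_to) − 2^(−2·b_from))/12. It assumes that every discarded bit pattern is equally likely. The function names are hypothetical, and this is a generic illustration, not necessarily the exact expression used for σ²_q,k, σ²_s,k, σ²_m,k or σ²_q,o in this work.

```python
from fractions import Fraction


def truncation_error_variance(b_from, b_to):
    """Exact variance of the discarded bits when truncating b_from -> b_to bits.

    Assumes all 2**(b_from - b_to) discarded bit patterns are equally likely.
    """
    if b_to >= b_from:
        return Fraction(0)  # nothing is discarded
    lsb = Fraction(1, 2 ** b_from)
    errors = [j * lsb for j in range(2 ** (b_from - b_to))]
    mean = sum(errors) / len(errors)
    return sum((e - mean) ** 2 for e in errors) / len(errors)


def truncation_error_variance_closed_form(b_from, b_to):
    """Closed form (2^-2b_to - 2^-2b_from) / 12, valid for b_to < b_from."""
    return (Fraction(1, 4 ** b_to) - Fraction(1, 4 ** b_from)) / 12
```

For complex data the real and imaginary parts are truncated separately, so under the same assumption the variance of a complex truncation error is twice the value returned here.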

Figure 3.2: Error model of a PE stage.

Fig. 3.2 shows the error model of a PE stage with stage-by-stage scaling by 1/2. σ²_q,k occurs when the wordlength, b_k−1, of stage k − 1 is longer than that of stage k; it is the variance of the bits truncated in going from b_k−1 to b_k. Scaling error is produced when b_k < b_k−1 + 1. Scaling by a factor of 1/2 involves a one-bit right shift and truncation of the least significant bit (LSB). σ²_s,k is defined as the variance of this truncated bit. Both σ²_q,k and σ²_s,k can be combined into a single error that results from directly scaling the output data of stage k − 1 and truncating the scaled result to b_k bits. Since complex scaling can be achieved by separately scaling the real and imaginary parts of the data, the combined error σ′²_s,k (Fig. 3.2) can be expressed as

Suppose that a complex multiplication is implemented using four real multiplications and that the results of the real multiplications are truncated individually. σ²_m,k is defined as the variance of the truncated bits of the result of one complex multiplication. This variance can be expressed in terms of the wordlengths b_k−1 and b_k, and is obtained as

σ²_m,k =


Figure 3.3: Propagation of quantization and scaling errors.

If the required output wordlength of the FFT processor, b_o, is too short, the output of the last PE stage must be truncated. A quantization error is then generated, and its variance, σ²_q,o, can be described by b_o and the wordlength of the last stage, b_P; the formula is expressed as

Output Signal-to-Quantization Noise Ratio (SQNR)

Since all error sources are assumed uncorrelated, the variance of the errors at the FFT output can be obtained by summing all contributions from the individual error sources that propagate to the output. For an N-point FFT processor employing either the Radix-2 algorithm or the Radix-2² algorithm, the scaling error σ′²_s,k propagating to any output node can be given by

$$\sigma_S^2 = \sum_{k=1}^{P} 2^{P-k+1} \left(\tfrac{1}{4}\right)^{P-k} \sigma'^2_{s,k} \tag{3.4}$$

where 2^{P−k+1} is the number of stage-k scaling errors that reach an output node and (1/4)^{P−k} is the effect of scaling on the error propagating from stage k. The propagation of σ′²_s,k can be illustrated using the signal flow graph (SFG) of an 8-point DIF Radix-2 algorithm (Fig. 3.3). The numbers of scaling errors σ′²_s,k propagating to any output node, e.g., X(0), from the first, second, and third stages are 8, 4, and 2, respectively. Thus, the error variance of X(0) can be obtained from (3.4), and is given by σ²_S|X(0) = (1/2)σ′²_s,1 + σ′²_s,2 + 2σ′²_s,3.
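The weights in the X(0) example can be reproduced with a short sketch. It assumes that the counts 8, 4, 2 of the 8-point example generalize to 2^(P−k+1) scaling errors from stage k per output node, each attenuated by (1/4)^(P−k). The function name `scaling_error_weights` is hypothetical.

```python
from fractions import Fraction


def scaling_error_weights(n):
    """Weight of the stage-k combined scaling error at one output node.

    Returns a list indexed by stage (k = 1 first) for an n-point FFT,
    where n must be a power of two and at least 2.
    """
    p = n.bit_length() - 1
    if n < 2 or (1 << p) != n:
        raise ValueError("n must be a power of two and at least 2")
    weights = []
    for k in range(1, p + 1):
        count = 2 ** (p - k + 1)                 # assumed: errors reaching one output node
        attenuation = Fraction(1, 4) ** (p - k)  # later 1/2 scalings act on the variance
        weights.append(count * attenuation)
    return weights
```

For n = 8 the function returns the weights 1/2, 1 and 2, which match the coefficients of σ′²_s,1, σ′²_s,2 and σ′²_s,3 in the expression for X(0).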

Figure 3.4: Propagation of multiplication errors.

To derive the variance of the FFT output due to multiplication errors, all multiplications are assumed to be noisy. Fig. 3.4 shows the SFG of an 8-point DIF Radix-2 algorithm. There are 4, i.e., half of 8, σ²_m,k sources in each stage, and each σ²_m,k of the first (k = 1), second (k = 2), and last (k = 3) stage propagates to 4, 2, and 1 output nodes, respectively. In the general case of an N-point FFT, there are N/2 σ²_m,k error sources in each stage, and each σ²_m,k from the first stage to the last (P-th) stage propagates to N/2, N/4, · · ·, N/2^P output nodes, respectively. Hence, one can easily derive the output variance caused by multiplication errors, σ²_M, with the Radix-2 algorithm; the expression of σ²_M is given by

Figure 3.5: Propagation of noiseless multiplications.

For the Radix-2² algorithm, the corresponding expression of σ²_M is modified as

σ_M² ≈ 1

Equations (3.5) and (3.6) are derived by assuming that all the multiplications are noisy. However, the multiplications associated with twiddle factors W^p = ±1 or W^p = ±j introduce no errors. Fig. 3.5 shows the positions of the noiseless multiplications in the SFG of an 8-point Radix-2 algorithm. The variances of these noiseless multiplications, denoted as σ²_C, in the Radix-2 algorithm SFG and the Radix-2² algorithm SFG are individually re-derived and


According to the assumption of no correlations in these error sources, the total output error variance can be obtained by summing the variance of each error propagating to the output; this summation is expressed as

$$\sigma_T^2 = \sigma_S^2 + \sigma_M^2 - \sigma_C^2 + \sigma_{q,o}^2 \tag{3.9}$$

Furthermore, the variance of output data in an N-point FFT processor is given in (3.10) [36].

$$\sigma_X^2 = \frac{1}{3N} \tag{3.10}$$

Hence, the output SQNR is obtained by

$$\mathrm{SQNR}_1 = 10 \log_{10} \frac{\sigma_X^2}{\sigma_T^2} \tag{3.11}$$
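A minimal sketch of how (3.9) to (3.11) combine is given below. The function and parameter names are assumptions made for illustration. The individual variances are taken as inputs rather than computed, since their closed forms depend on the stage wordlengths.

```python
import math


def output_signal_variance(n):
    """Variance of the output data of an n-point FFT, as in (3.10)."""
    return 1.0 / (3.0 * n)


def sqnr1_db(n, var_scaling, var_mult, var_noiseless, var_output_quant):
    """Statistical SQNR in dB from (3.9) and (3.11).

    var_scaling      : sigma_S^2, propagated scaling/quantization errors
    var_mult         : sigma_M^2, multiplication errors (all assumed noisy)
    var_noiseless    : sigma_C^2, correction for noiseless multiplications
    var_output_quant : sigma_{q,o}^2, output truncation error
    """
    total = var_scaling + var_mult - var_noiseless + var_output_quant
    if total <= 0:
        raise ValueError("total error variance must be positive")
    return 10.0 * math.log10(output_signal_variance(n) / total)
```

As a sanity check on the formula, a total error variance equal to the signal variance gives 0 dB, and reducing the error variance by a factor of ten raises the SQNR by 10 dB.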

Figure 3.6: Block diagram of simulation analysis. [Blocks: Floating Point FFT, Fixed Point FFT, Calculate SQNR; signals: x[n], x_q[n], X_q(r), X′_q(r), SQNR.]

This SQNR₁ model is used as the performance index whenever statistical error analysis is applied to the FFT processor.

3.2.2 Simulation-Based Method

Fig. 3.6 presents a conceptual block diagram of the simulation-based analysis. To perform this simulation, floating-point and fixed-point C models with a given FFT algorithm were developed. According to system constraints, e.g., wordlength of each stage, rounding or truncation of stages, number of scaling stages, input/output wordlength, the C models obtain the proper fixed-point results. Then, SQNR can be evaluated by comparing the fixed-point output with the floating-point output; the formula of the calculation is given by

$$\mathrm{SQNR}_2 = 10 \log_{10} \frac{\sum_{r=0}^{N-1} |X_q(r)|^2}{\sum_{r=0}^{N-1} |X_q(r) - X'_q(r)|^2} \tag{3.12}$$

During simulation, random patterns are generated as inputs, and the resulting SQNR₂ values are averaged to estimate the mean of the SQNR distribution. In simulation analysis, there is a trade-off between the accuracy of SQNR₂ and the required number of simulation runs. The required number of runs is determined as follows. First, the random variable S̄ is used as an estimate of SQNR₂. Then, according to the central limit theorem, the sampling distribution of S̄ is approximately normal. Therefore, we can be (1 − α)100% confident that the SQNR error will not exceed a specified amount, e, when the number of SQNR₂ values equals

$$\left( \frac{z_{\alpha/2}\, \sigma}{e} \right)^2 \tag{3.13}$$

where σ is the standard deviation of the distribution of SQNR₂, and z_α/2 satisfies the probability equation, Prob(z_α/2 < Z)= α/2, when Z is a random variable with a standard normal distribution [37]. In this work, e is the constraint of SQNR error (SQNR Error).
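The two calculations of this subsection can be sketched as follows. The function names are hypothetical, the reference and fixed-point outputs are passed in as plain sequences of complex numbers, and the run count is rounded up to the next integer, which is an assumption on top of (3.13).

```python
import math
from statistics import NormalDist


def sqnr2_db(reference, fixed):
    """Simulation-based SQNR in dB, following the form of (3.12).

    reference : sequence of reference outputs (floating-point FFT)
    fixed     : sequence of fixed-point outputs at the same indices
    """
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - f) ** 2 for r, f in zip(reference, fixed))
    if noise == 0:
        return math.inf  # outputs identical: no measurable quantization noise
    return 10.0 * math.log10(signal / noise)


def required_runs(sigma, max_error, confidence=0.95):
    """Number of SQNR samples so the mean is within max_error, as in (3.13).

    sigma      : standard deviation of the SQNR distribution (dB)
    max_error  : allowed SQNR error e (dB)
    confidence : 1 - alpha
    """
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    return math.ceil((z * sigma / max_error) ** 2)
```

Because the run count grows with the square of σ/e, halving the allowed SQNR error roughly quadruples the number of simulations, which is the trade-off between accuracy and simulation time noted above.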

3.2.3 Demonstration of Statistical and Simulation-Based Analysis

The proposed error model was verified using simulation-based error analysis, and the result was compared with that obtained by statistical error analysis. In this analysis, the SQNR is calculated using the statistical method and is also evaluated by simulation for 8-, 16-, · · · , 8192-point DIF Radix-2 FFTs and for 16-, 64-, · · · , 4096-point DIF Radix-2² FFTs, with the wordlength of each stage freely chosen from 8 to 32 bits.

Table 3.1 shows a summary of the comparison between the two methods with 20 randomly generated wordlength sets in a 1024-point DIF Radix-2 FFT, where the input wordlength is set equal to that of the first stage, and the output wordlength is set to be the same as that of the last stage. The SQNR difference is obtained by subtracting the SQNR of simulation analysis from that of statistical analysis.

Table 3.1: Example of Random Verification

Set    Wordlength of PE stage (bits)              SQNR (dB)
no.     1   2   3   4   5   6   7   8   9  10     Simulation   Statistical   Difference
  1    29  32  22  32  30  21  10  10  20  18     21.57151     21.225869     -0.345643
  2    17  26  21  32  23  12  31  21  26  23     40.64704     40.568971     -0.078069
  3     8  28  14  16  13  32  29  32  10  11     20.73324     20.836938      0.103696
  4     9  12  11  11  12  12  16  16  10  22     20.55134     20.906414      0.355071
  5    19  30  27  15  22  19  16  21  31  18     58.45886     59.02069       0.561833
  6    29  15  23  25  13  32  17  14  11   8     5.981925     5.987078       0.005153
  7    17  22  16  26  30  23  15  31  18  30     55.56245     55.547552     -0.014897
  8    29  23  23  30  10  22  16  31  18  29     31.52884     31.443187     -0.085655
  9    23  16  15  31  28  28  24   8  20  25     11.22512     11.082254     -0.142871
 10    29  20  21  23  32  17  14   8  21  20     11.20543     11.081993     -0.123438
 11    11  21  22  22  12  15  25  16  13  21     36.91202     37.570486      0.658464
 12    28  13  13  24  12  27  12  10  30  14     22.67044     22.880391      0.209947
 13    20  17  14  22  15   8  11  15  11  21     16.35159     16.105621     -0.24597
 14    20  21  21  16  12  11  29  32  17  32     33.87208     34.053531      0.181455
 15    27  20  21  10  19  13  29  25  18   9     12.01716     12.014605     -0.002551
 16    32  15  25  24   8   8   9   8  21  11     9.187271     9.280194       0.092923
 17    26  23  12  11  22  29  13  30  26  12     29.20325     29.517037      0.313784
 18    20  29  16  28  31  13  25  20  22  14     40.35284     40.808381      0.455539
 19    22  18   9  17  23  20  30  25   8  16     8.961997     9.005713       0.043716
 20    11  25  27  19  24  14   8  29  31   9     9.633909     9.774141       0.140233

Figure 3.7: Histogram of SQNR difference with randomly generated wordlengths.

Fig. 3.7 shows the histogram of the SQNR difference with 10⁴ randomly generated wordlength sets for the 1024-point FFT using the Radix-2 and Radix-2² algorithms. The difference is within ±1.0 dB for the Radix-2 FFT and within ±1.1 dB for the Radix-2² FFT.

Exhaustively comparing all wordlength sets of 8 to 32 bits is impractical because the simulation time would be prohibitive. Therefore, partial exhaustive verification for wordlengths of 11 to 18 bits was employed in the comparison of the 64-point Radix-2 and Radix-2² FFTs. This comparison required 130 hours. Fig. 3.8 shows the results of the comparison; the SQNR difference is within ±1.1 dB.

Figure 3.8: Histogram of SQNR difference with partial exhaustive verification.

Both Fig. 3.7 and Fig. 3.8 exhibit a bias shift in the SQNR difference. This shift arises because the noise model of the multipliers in the statistical analysis is an approximation of their actual noise distribution. However, this is not an important issue, as the shift is much smaller than the maximum SQNR difference. A parameter ∆ is introduced to indicate the maximum SQNR difference for the optimization process. The SQNR difference has no analytical expression and is obtained experimentally. Thus, according to the experimental results, the value of ∆ is suggested to be 1 dB for the R2SDF architecture and 1.1 dB for the R2²SDF architecture when statistical analysis is mixed with simulation-based analysis.

3.3 Wordlength Optimization

Fig. 3.9 presents the flow of the proposed automatic wordlength optimization in the pipelined FFT processor. There are four major steps in the process. First, the upper bound wordlength (UBW) for each PE stage is evaluated based on the operating frequency requirement and the SQNR constraint (iSQNR) of the processor. Next, the UBWs of the stages are fed into the lower bound wordlength (LBW) evaluation as an additional constraint for determining the LBWs. Both UBW and LBW evaluations employ the statistical analysis. Then, the statistical analysis is used to determine optimized wordlength candidates (OWCs) based on iSQNR−∆. Finally, the optimized wordlength (OW) evaluation is performed based on two primary procedures: (a) If the SQNR Error is ≤ 1 dB, a simulation analysis is used to select a solution with the smallest area. As the candidates are arranged in ascending order of area, the algorithm terminates after finding the first solution. (b) If the SQNR Error is > 1 dB, a benefit function is introduced and the candidate with the best benefit is selected.

Figure 3.9: Wordlength optimization flow of a PE stage.
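Procedure (a) of the final step can be sketched as a first-fit search over candidates sorted by area. Everything named here is an assumption made for illustration: the candidate representation as `(area, wordlengths)` tuples, the `simulate_sqnr` callback standing in for the simulation analysis, and the acceptance test against iSQNR. Procedure (b) is not sketched because the benefit function is not defined in this part of the text.

```python
def select_optimized_wordlength(candidates, simulate_sqnr, isqnr_db):
    """Return the smallest-area candidate whose simulated SQNR meets isqnr_db.

    candidates    : iterable of (area, wordlengths) tuples (hypothetical format)
    simulate_sqnr : callable mapping a wordlength tuple to a simulated SQNR in dB
    isqnr_db      : SQNR constraint in dB
    Returns the first passing (area, wordlengths) tuple, or None if none passes.
    """
    # Ascending area order lets the search stop at the first passing candidate.
    for area, wordlengths in sorted(candidates, key=lambda c: c[0]):
        if simulate_sqnr(wordlengths) >= isqnr_db:
            return area, wordlengths
    return None
```

Stopping at the first passing candidate matters because each call to the simulation is expensive, so the number of simulated candidates is kept as small as the ordering allows.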

A hardware library and two tables, a PE stage table and a mean of SQNR variance table, are prepared before the optimization process starts. To optimize the hardware cost, a hardware library, such as the TSMC 0.25 µm cell library, is chosen to determine the area and critical timing delay of a PE stage in the FFT processor by synthesizing the stage at different wordlengths. The obtained data are recorded in the PE stage table to speed up automation. The mean of SQNR variance table is used to derive the number of simulation runs at different confidence levels according to (3.13). This table is established by calculating the mean of 100 simulated SQNR variances of a PE stage with wordlengths of 8 to 32 bits for distinct FFT lengths (N).

Figure 3.10: Evaluation of the upper bound wordlength.

3.3.1 Evaluation of Upper Bound Wordlength

The UBW of the k-th PE stage, denoted b_U,k, is defined as the maximum possible wordlength such that the critical path satisfies the timing constraint, which is the inverse of the operating frequency of the FFT processor. Since the upper bound is obtained based on the operating frequency and throughput, a lower throughput constraint and a faster hardware library both increase the UBW. Conversely, an increased wordlength requires more time for the operation of the PE stage. That is, increasing the wordlength reduces the allowed operating frequency, and thus the operating frequency requirement can be violated. Additionally, a short wordlength results in a poor SQNR.

Fig. 3.10 shows the process of UBW evaluation. First, the UBW corresponding to the operating frequency requirement is evaluated stage by stage. When the operating frequency requirement is achieved for each stage, the values {b_U,k} are obtained. Otherwise, the maximum allowable operating frequency is reported. When all {b_U,k} are given, they are used to analyze the SQNR of the FFT processor. If the evaluated SQNR meets the SQNR constraint, the UBW set, denoted by B_U and comprising these {b_U,k}, is output; otherwise, the maximum achievable SQNR is reported.
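The timing part of this evaluation can be sketched as a table lookup. The layout of the PE stage table assumed here (a dictionary from stage number to a dictionary from wordlength to critical delay in seconds) and the function names are hypothetical. The subsequent SQNR check on B_U is omitted.

```python
def upper_bound_wordlengths(pe_stage_table, operating_frequency_hz):
    """Largest wordlength per stage whose critical delay fits the clock period.

    pe_stage_table : {stage: {wordlength_bits: critical_delay_seconds}}
    Returns a list of upper bounds ordered by stage, or None if some stage
    cannot meet the timing constraint at any tabulated wordlength.
    """
    period = 1.0 / operating_frequency_hz
    bounds = []
    for stage in sorted(pe_stage_table):
        feasible = [b for b, delay in pe_stage_table[stage].items() if delay <= period]
        if not feasible:
            return None
        bounds.append(max(feasible))
    return bounds


def max_operating_frequency(pe_stage_table):
    """Highest frequency at which every stage has at least one feasible wordlength."""
    slowest_best_case = max(min(delays.values()) for delays in pe_stage_table.values())
    return 1.0 / slowest_best_case
```

When `upper_bound_wordlengths` returns None, `max_operating_frequency` gives the value that would be reported in place of the upper bounds.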

3.3.2 Evaluation of Lower Bound Wordlength

When the UBW set is obtained, it is used to support the evaluation of the lower bound wordlength (LBW). The LBW set, B_L, is derived from B_U, such that the optimized solution must be above B_L. That is, if the solution is not above B_L, the solution will not meet
