Random Verification - Error Analysis - A Study on Wordlength Optimization of Pipelined Fast Fourier Transform Processors via a Hybrid Method

Chapter 3 Error Analysis

3.4 Verifications

3.4.1 Random Verification

As an example, we randomly generate 20 wordlength sets for a 1024-point DIF Radix-2 FFT. The input wordlength is equal to that of the first PE stage, and the output wordlength is equal to that of the last PE stage. We then calculate the SQNR by the statistical method and by the simulation-based method, and take the difference between the two results. Table 3.1 shows the results. The first column gives the index of the wordlength set, the second column the wordlength of each PE stage, the third column the SQNR from simulation-based error analysis, the fourth column the SQNR from statistical error analysis, and the last column the difference between the two SQNR values.

Table 3.1 Examples of Random Verification (N=1024)

We compared 10,000 wordlength sets for the 1024-point FFT with the Radix-2 and Radix-2² algorithms. The maximum difference is approximately ±1 dB for Radix-2 and approximately ±1.1 dB for Radix-2². Fig. 3.7(a) shows the distribution of the difference for the 1024-point Radix-2 FFT, and Fig. 3.7(b) shows that for the 1024-point Radix-2² FFT.

(a)

(b)

Figure 3.7 Results of Random Verification of Radix-2 and Radix-2²

3.4.2 Partial Exhaustive Verification

Exhaustively comparing all wordlength sets from 8 to 32 bits is not practical because the simulation time would be prohibitive. However, we can carry out an exhaustive comparison over a restricted range, that is, over part of the solution space. We chose wordlengths from 11 to 18 bits for a partially exhaustive comparison of the 64-point DIF Radix-2 and Radix-2² FFTs. The comparison took about 130 hours, and the results are shown in Figure 3.8.
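To make the size of the search space concrete, the short Python calculation below counts the wordlength sets involved. It is only a sketch: it assumes one independently chosen wordlength per PE stage and log2(N) PE stages, and the helper name `count_wordlength_sets` is illustrative rather than taken from the thesis.

```python
def count_wordlength_sets(n_points: int, wl_min: int, wl_max: int) -> int:
    """Count wordlength sets when every PE stage may take any value in [wl_min, wl_max].

    Assumption: the pipeline has log2(n_points) PE stages, one wordlength each.
    """
    stages = n_points.bit_length() - 1   # log2 for a power-of-two FFT size
    choices = wl_max - wl_min + 1
    return choices ** stages


full_space = count_wordlength_sets(1024, 8, 32)    # 25 choices over 10 stages
partial_space = count_wordlength_sets(64, 11, 18)  # 8 choices over 6 stages
print(full_space, partial_space)
```

Under these assumptions the restricted 64-point comparison covers 8^6 = 262,144 sets, while the full 8-to-32-bit range for a 1024-point FFT would need 25^10 sets, which is why only a partial exhaustive check is feasible.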

The difference is within ±1.1dB.

(a)

(b)

Figure 3.8 Results of Partial Exhaustive Verification of Radix-2 and Radix-2²

Section 3.4.1 and 3.4.2 clearly show that the result obtained from the statistical error

Chapter 4 Wordlength Optimization

The wordlength is an important design parameter that affects both performance and complexity. A longer wordlength gives better precision, but it also increases complexity: the memory and computational units grow, which raises power consumption and lowers performance.

Hence, the wordlength requires careful optimization.

In this chapter, we first briefly review the design flow of an FFT processor. Then we describe our approach, the hybrid wordlength optimization method. Finally, two examples are shown.

4.1 FFT Processor Design Flow

Many factors have to be considered when designing an FFT processor. Figure 4.1 shows the overall design flow. First, the system requirements need to be specified, including the number of FFT points, SQNR, throughput, area, and power. Then, a suitable FFT algorithm and FFT architecture need to be chosen. Finally, the wordlengths of the architecture need to be analyzed.

Figure 4.1 Design Flow of FFT Processors

When the FFT is implemented as a fully custom ASIC, the wordlength of each stage can be freely chosen, except for the input and output wordlengths of the FFT processor, which are specified by the system. The internal wordlengths determine the precision and complexity. In general, a longer wordlength gives better numerical precision. On the other hand, increasing the wordlength increases the complexity: hardware cost and power consumption go up, and speed goes down. The optimization is therefore a trade-off between precision and complexity.

To reduce the overall system design time, an automatic wordlength optimization solution is preferred. A simulation-based method for pipelined FFTs was presented by Lin [3]. We present a faster hybrid method in this thesis. Figure 4.2 outlines the automation flow. It has four sequential steps: upper bound wordlength evaluation, lower bound wordlength evaluation, optimized wordlength candidate searching, and optimized wordlength selection. In addition, some tables and libraries are built offline to speed up the flow.

Figure 4.2 Overall Flow of Wordlength Optimization

4.2 Wordlength Generation

The items in Fig. 4.2 are introduced in this section. The flow optimizes the area under the input constraints, which include the number of FFT points, SQNR, throughput, FFT input and output wordlengths, SQNR simulation confidence interval, and SQNR simulation error. The outputs are the wordlengths of each PE stage.

4.2.1 Library and Table

Since we optimize hardware cost, the relevant hardware library needs to be chosen.

The adder, multiplier, multiplexer, read-only memory (ROM), and shift register are the five basic elements of the FFT. The hardware library determines the table that maps wordlength to area and critical path for these components [3].

PE stages are the hardware blocks in the wordlength generation flow, and they are built from the basic components. To speed up the automation flow, we need a table that stores the area and critical path of each PE stage: the PE stage table [3].

In Figure 4.2, the mean of SQNR variance table is used to calculate the number of simulation runs required for different simulation confidence levels [3].

4.2.2 Upper Bound Wordlength Evaluation

Throughput is one of the input constraints. Satisfying the throughput constraint implies that the critical path must be short enough to meet equation (4.1). In other words, a stage violates the pipeline timing if its critical path is greater than 1/throughput.

$$\text{critical path} < \frac{1}{\text{throughput}} \qquad (4.1)$$

The upper bound wordlength (UBW) is defined as the largest possible wordlength such that the critical path of the corresponding PE stage satisfies equation (4.1). The upper bound wordlength set (UBW) is defined as the set that contains the UBW of every PE stage. Note that we use bold print to denote a set and light print to denote an element of a set. For example, if the UBW of a 1024-point FFT (10 PE stages) is {14 15 15 16 17 18 18 18 19 20}, then the UBW of stage 1 (UBW₁) is 14, UBW₂ is 15, UBW₃ is 15, and so on.
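The per-stage UBW lookup can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: the function name `upper_bound_wordlength` and the linear delay table are hypothetical stand-ins for the PE stage table.

```python
def upper_bound_wordlength(critical_path_ns, throughput_hz):
    """Return the largest wordlength whose critical path is shorter than 1/throughput.

    critical_path_ns maps wordlength (bits) -> critical path (ns) for one PE stage.
    Returns None when no wordlength in the table meets the timing constraint.
    """
    period_ns = 1e9 / throughput_hz
    feasible = [wl for wl, delay in critical_path_ns.items() if delay < period_ns]
    return max(feasible) if feasible else None


# Hypothetical table: the critical path grows linearly with the wordlength.
stage_table = {wl: 8.0 + 0.75 * wl for wl in range(8, 33)}

ubw = upper_bound_wordlength(stage_table, 50e6)      # 50 MHz gives a 20 ns period
no_fit = upper_bound_wordlength(stage_table, 200e6)  # 5 ns period, nothing fits
print(ubw, no_fit)
```

Repeating the lookup for every PE stage gives the UBW set. A `None` result corresponds to the case in which the throughput constraint cannot be met.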

Figure 4.3 Flow of Upper Bound Wordlength Evaluation

Fig. 4.3 shows the flow of UBW evaluation. There are three conditions for stopping the evaluation. Condition 1: the UBW is found if the SQNR and throughput constraints are both met. Condition 2: the optimization fails if the SQNR constraint cannot be met; the maximum possible SQNR is reported before stopping. Condition 3: the optimization fails if the throughput constraint cannot be met; the maximum possible throughput is reported before stopping.

4.2.3 Lower Bound Wordlength Evaluation

The lower bound wordlength (LBW) of a PE stage is defined such that if the wordlength of that stage is set to LBW, the SQNR of the resulting set is just smaller than the SQNR of the input constraint. The lower bound wordlength set (LBW) is defined as {LBW_x | x ∈ N, 1 ≤ x ≤ n}, where x denotes the x-th PE stage. Based on the definition of LBW, it is easy to see that the SQNR of LBW is smaller than the SQNR of the input constraint.

Fig. 4.4 shows the flow of LBW evaluation. The inputs are N (the number of FFT points), SQNR, the input and output wordlengths, and UBW. The output is LBW.

Figure 4.4 Flow of Lower Bound Wordlength Evaluation

Fig. 4.5 shows an example of LBW evaluation, where iSQNR is the input SQNR constraint. The steps in Fig. 4.5 proceed from top to bottom and from left to right, and the arrows show the detailed steps. The greater-than sign, “>”, and the less-than sign, “<”, denote the result of comparing the SQNR from statistical error analysis with the SQNR of the input constraint.

Figure 4.5 Example of Lower Bound Wordlength Evaluation (N=64)
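One possible reading of the LBW definition is sketched below in Python. Several things here are assumptions made for illustration: the search starts from the UBW set and shrinks one stage at a time, `toy_sqnr` is a stand-in noise model rather than the statistical error analysis of the thesis, and the names `lower_bound_set` and `min_wl` are hypothetical.

```python
import math


def toy_sqnr(wordlengths):
    """Stand-in SQNR model (assumption): each stage adds noise power 2^(-2b)."""
    noise = sum(2.0 ** (-2 * b) for b in wordlengths)
    return -10.0 * math.log10(noise)


def lower_bound_set(ubw, i_sqnr, sqnr_fn, min_wl=2):
    """For each stage, shrink only that stage from its UBW value until SQNR < i_sqnr.

    The wordlength at which the SQNR first falls below i_sqnr is taken as LBW_x.
    If the SQNR never falls below i_sqnr, min_wl is returned for that stage.
    """
    lbw = []
    for x in range(len(ubw)):
        trial = list(ubw)  # all other stages stay at their UBW values
        while trial[x] > min_wl and sqnr_fn(trial) >= i_sqnr:
            trial[x] -= 1
        lbw.append(trial[x])
    return lbw


ubw_set = [12, 12, 12, 12, 12, 12]  # hypothetical UBW set for N=64 (6 stages)
lbw_set = lower_bound_set(ubw_set, i_sqnr=45.0, sqnr_fn=toy_sqnr)
print(lbw_set, toy_sqnr(lbw_set))
```

With this toy model every stage stops at the first wordlength whose SQNR drops below 45 dB, so the SQNR of the resulting LBW set is below the constraint, as the definition requires.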

4.2.4 Optimized Wordlength Candidate (OWC) Searching

4.2.4.1 Optimization Format

The FFT processor uses large memories, especially in the early stages. Figure 4.6 shows the area increment of each PE stage when the wordlength of that stage is increased by 1 bit. Therefore, keeping the wordlength short in the early stages is a good choice for area optimization.

The property of the output SQNR of a pipelined FFT processor is shown in equation (4.2).

$$\mathrm{SQNR} \approx -10\log_{10}\left(a_1 2^{-2b_1} + a_2 2^{-2b_2} + \cdots + a_n 2^{-2b_n}\right) \qquad (4.2)$$

where $a_n$ is the constant of PE stage n and $b_n$ is the wordlength of PE stage n. It is easy to see that if there exists one $b_x$ that is much smaller than the others, the SQNR is dominated by $b_x$. So the wordlengths of the stages are used efficiently when they are close to each other.

Owing to the above properties, the expected optimized wordlength set is sorted in ascending order from stage 1 to stage n, and the wordlengths of adjacent stages are close. Examples are {11 11 12 13 13 14} and {14 14 14 14 15 16}. For simplicity, we refer to wordlength sets of this form as being in optimization format in the remainder of this section.
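A minimal Python check for this format might look as follows. The one-bit limit on the step between adjacent stages (`max_step=1`) is an assumption inferred from the two example sets, since the text only says the wordlengths are close; the function name is illustrative.

```python
def in_optimization_format(wordlengths, max_step=1):
    """True if the set is non-decreasing and adjacent stages differ by at most max_step bits.

    max_step=1 is an assumption inferred from the example sets in the text.
    """
    pairs = zip(wordlengths, wordlengths[1:])
    return all(0 <= later - earlier <= max_step for earlier, later in pairs)


print(in_optimization_format([11, 11, 12, 13, 13, 14]))  # example set from the text
print(in_optimization_format([14, 14, 14, 14, 15, 16]))  # example set from the text
print(in_optimization_format([14, 12, 13, 13, 14, 15]))  # not ascending
```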

4.2.4.2 OWC Searching Flow

The optimized wordlength set candidates (OWC) have three properties. (1) Each candidate lies between LBW and UBW. (2) It is in optimization format. (3) The SQNR of FFT processor

OWC.

To search for the OWC, we scan the wordlength sets from LBW to UBW and compare the SQNR of each set with the input SQNR constraint. Figure 4.7 shows the flow of OWC searching. The output of this flow is the OWC Array, which contains all the information on the OWC and is sorted by area.

Figure 4.7 Flow of Optimized Wordlength Candidate Searching
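The scan from LBW to UBW can be sketched in Python as below. This is a brute-force illustration, not the flow of Figure 4.7: `toy_sqnr` and `toy_area` are assumed stand-ins for the statistical error analysis and the PE stage table, the one-bit closeness rule is inferred from the example sets, and all names are hypothetical.

```python
import itertools
import math


def toy_sqnr(wls):
    # Assumed stand-in for the statistical analysis: stage noise power 2^(-2b).
    return -10.0 * math.log10(sum(2.0 ** (-2 * b) for b in wls))


def toy_area(wls):
    # Assumed cost model: earlier stages hold more memory, so they weigh more.
    n = len(wls)
    return sum(b * 2 ** (n - i) for i, b in enumerate(wls))


def search_owc(lbw, ubw, i_sqnr):
    """Return candidates as (area, wordlength set) tuples, smallest area first."""
    ranges = [range(lo, hi + 1) for lo, hi in zip(lbw, ubw)]
    found = []
    for wls in itertools.product(*ranges):
        # Optimization format: ascending, adjacent stages at most 1 bit apart.
        ascending_and_close = all(0 <= b - a <= 1 for a, b in zip(wls, wls[1:]))
        if ascending_and_close and toy_sqnr(wls) >= i_sqnr:
            found.append((toy_area(wls), wls))
    return sorted(found)


owc_array = search_owc(lbw=[7, 7, 7, 7], ubw=[10, 10, 10, 10], i_sqnr=45.0)
print(owc_array[0])
```

Enumerating the full product is only practical for a handful of stages. A real implementation would generate sets that are already in optimization format instead of filtering afterwards.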

4.2.5 Optimized Wordlength (OW) Selection

The OW is the OWC that has the smallest area and a good SQNR. There are two methods for obtaining the optimized wordlength from the OWC Array. Method 1: the optimized wordlength set is found by the simulation-based method if the user's SQNR error constraint is below ±1 dB. Method 2: the optimized wordlength set is found by the statistical method if the user's SQNR error constraint is more than ±1 dB.

Figure 4.8 shows the flow of OW selection. In Method 1, we simulate the OWC in the OWC Array one by one, starting from the one with the smallest area, until the simulated SQNR meets the SQNR of the input constraint. In Method 2, we evaluate every OWC in the OWC Array with a benefit function and take the OWC with the best benefit. The benefit function is shown in equation (4.3).

$$\text{Benefit} = \frac{\text{SQNR increment}}{\text{area size increment}} \qquad (4.3)$$

where the increment is the difference between the SQNR or area size of LBW and that of the OWC.

Figure 4.8 Flow of Optimized Wordlength Selection
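Both selection methods can be sketched in a few lines of Python. The tuple layout `(wordlengths, sqnr_db, area)`, the function names, and the numbers in the example are hypothetical; the `simulate` callback stands in for the simulation-based error analysis, which is not reproduced here.

```python
def select_by_simulation(owc_array, i_sqnr, simulate):
    """Method 1: simulate candidates from the smallest area upward until one meets i_sqnr."""
    for candidate in sorted(owc_array, key=lambda c: c[2]):
        if simulate(candidate[0]) >= i_sqnr:
            return candidate
    return None  # no candidate passed the simulation


def select_by_benefit(owc_array, lbw_sqnr, lbw_area):
    """Method 2: pick the candidate with the largest SQNR gain per unit of added area.

    Assumes every candidate has a larger area than the LBW set.
    """
    def benefit(candidate):
        _, sqnr_db, area = candidate
        return (sqnr_db - lbw_sqnr) / (area - lbw_area)
    return max(owc_array, key=benefit)


# Hypothetical OWC Array entries: (wordlengths, statistical SQNR in dB, area).
candidates = [
    ((8, 9, 9, 9), 45.7, 254),
    ((8, 9, 9, 10), 46.0, 256),
    ((9, 9, 9, 9), 47.5, 270),
]

best_stat = select_by_benefit(candidates, lbw_sqnr=40.0, lbw_area=210)
# Pretend the simulation shows the smallest candidate falling just short of 45 dB.
best_sim = select_by_simulation(
    candidates, 45.0, lambda wls: 44.9 if wls == (8, 9, 9, 9) else 46.0
)
print(best_stat[0], best_sim[0])
```

Method 1 stops at the first candidate that passes simulation, so its cost depends on how many candidates have to be simulated, whereas Method 2 needs no simulation at all.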

4.3 Examples of Wordlength Optimization

4.3.1 Hybrid Method

Input constraints of this example are {N=1024(n=10), SQNR=45 dB, input_wordlength=output_wordlength=18, throughput=50MHz, and SQNR_error=0.1 dB}.

Since the SQNR_error constraint is smaller than ±1 dB, the hybrid method will be used. Figure 4.9 shows the steps of this example. Here “sim_SQNR” denotes the simulation result and “iSQNR” denotes the SQNR of the input constraint.

Figure 4.9 Example of Hybrid Wordlength Optimization Method

4.3.2 Pure Statistical Method

Input constraints of this example are {N=1024 (n=10), SQNR=45 dB, input_wordlength=output_wordlength=18, throughput=50MHz, and SQNR_error=1.1 dB}. Since the SQNR_error constraint is more than ±1 dB, the pure statistical method will be used. Figure 4.10 shows the steps of this example.

Figure 4.10 Example of Pure Statistical Method

Chapter 5 Experimental Results

5.1 Introduction

We implement two FFT architectures, DIF R2SDF and DIF R2²SDF. N can be adjusted from 8 to 8192 points, and the wordlength of each stage from 8 to 32 bits. We pipeline each PE stage of the FFT architectures and apply stage-by-stage scaling.

In order to compare the performance with previous work [3], the same hardware libraries are used here.

The logic gate model includes the adder, multiplier, and multiplexer. We synthesize without any constraints using Synopsys Design Analyzer [19], with the TSMC 0.25 µm cell library and Synopsys DesignWare [18]. We adopt the fast carry look-ahead synthesis model for the adder, the Booth-encoded Wallace tree synthesis model for the multiplier, and the universal multiplexer synthesis model for the multiplexer, and we take the area and timing of these models from the Synopsys Design Analyzer reports. The memory model, which includes the shift register and ROM, also uses the TSMC 0.25 µm cell library.

SQNR values between 40 and 60 dB are used in most systems, and we use the same range in our experiments. Two common FFT design specifications typically used in OFDM systems [22] are summarized in Table 5.1.

Complex, word-sequential

Table 5.1 Specification of Common FFT for OFDM

The proposed flow is implemented in C++ with the SystemC library. The SystemC fixed-point types are used to model the behavior of fixed-point hardware.

In our experiments, the quantization mode is always truncation (SC_TRN) and the overflow mode is always saturation (SC_SAT).
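The effect of these two modes can be illustrated with a small Python model. It is a sketch of what truncation and saturation do to a signed fixed-point value with `wl` total bits and `iwl` integer bits (sign bit included), roughly mirroring a SystemC fixed-point type configured with SC_TRN and SC_SAT. The function name is illustrative and this is not the thesis code.

```python
import math


def quantize_trn_sat(value, wl, iwl):
    """Quantize to a signed fixed-point grid with wl total bits and iwl integer bits.

    Truncation drops the bits below the LSB (rounds toward minus infinity);
    saturation clamps the result to the representable range.
    """
    frac_bits = wl - iwl
    scaled = math.floor(value * 2 ** frac_bits)  # truncate
    lowest = -(2 ** (wl - 1))
    highest = 2 ** (wl - 1) - 1
    scaled = max(lowest, min(highest, scaled))   # saturate
    return scaled / 2 ** frac_bits


print(quantize_trn_sat(0.3, 8, 1))   # truncated down to a multiple of 2^-7
print(quantize_trn_sat(-0.3, 8, 1))  # negative values also move toward minus infinity
print(quantize_trn_sat(1.5, 8, 1))   # overflow clamps to the largest code
```

Because truncation always rounds downward, it introduces a small negative bias in addition to the quantization noise, which is one reason the choice of quantization mode matters in an error analysis.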

Finally, the platform is a PC with a 2.4 GHz Intel CPU and 768 MB of memory. The operating system is Microsoft Windows 2000, and Visual C++ 6.0 is used as the compiler.

5.2 Results

The experimental results of wordlength optimization for R2SDF and R2²SDF are shown in this section.

5.2.1 Optimization of Different Constraint

Results of experiments with different constraints are presented in this sub-section.

5.2.1.1 FFT Point Constraint

The experimental results of area optimization for FFT sizes from 8 to 8192 points are presented in Table 5.2. Table 5.2(a) is for DIF R2SDF and Table 5.2(b) is for DIF R2²SDF.

The constraints are as follows: the SQNR is 45 dB, the SQNR error is 0.1 dB, the SQNR simulation confidence level is 95%, the throughput is 50 MHz, and the input and output wordlengths are 18 bits. Since the maximum allowable SQNR error is smaller than 1 dB, the hybrid method is used. In these tables, the first column, “Point”, gives the number of points of the FFT processor. In the “Pre-Post” column, rows marked “Pre” hold the parameters of the traditional design without optimization, and rows marked “Post” hold the optimized parameters.

(a)

(b)

Table 5.2 Area Optimization of Different FFT Point (IO Wordlength=18)

The “Area Reduction” column gives the area reduction rate, calculated as

$$\frac{\text{pre\_area} - \text{post\_area}}{\text{pre\_area}} \times 100\%$$

The last column, “Time”, shows the computation time of the optimization. It can be seen that, in general, the greater N is, the greater the area reduction rate.

The maximum and minimum area reduction rates are 24% and 9% for DIF R2SDF, and 23% and 6% for DIF R2²SDF.

5.2.1.2 Input Wordlength and Output Wordlength

Table 5.3 presents the experimental results when the input and output wordlength constraints differ from those of Table 5.2: the input wordlength is 14 bits and the output wordlength is 14 bits. The area reduction rate remains the same when the number of points ranges from 8 to 1024. There is no solution when the number of points is greater than 1024.

(a)

(b)

Table 5.3 Area Optimization of Different FFT Point (IO Wordlength=14)

Figure 5.1 shows the difference of area reduction rate between these two input and output wordlengths.

(a)

(b)

Figure 5.1 Area Reduction Rate of IO Wordlength=18 and 14 Bits

5.2.1.3 SQNR

Figure 5.2 presents the area reduction rate for different SQNR constraints for DIF R2SDF and DIF R2²SDF. The SQNR error constraint is 0.1 dB, the SQNR simulation confidence level is 95%, the throughput is 50 MHz, and the input and output wordlengths are 18 bits. The SQNR of the traditional design increases by 6 dB if every wordlength increases by 1 bit. Correspondingly, the area reduction rate repeats with a period of 6 dB in the SQNR constraint. The area reduction rate ranges from 12% to 20%.

Figure 5.2 Area Reduction Rate vs. SQNR Constraint
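The 6 dB figure follows from the usual fixed-point noise model. As an assumption for this worked step (it is not a result stated in this section), take the quantization noise power $\sigma_e^2(b)$ of a datapath with wordlength b to be proportional to $2^{-2b}$. Adding one bit to every wordlength then divides the noise power by four while the signal power is unchanged:

```latex
\Delta\mathrm{SQNR}
  = 10\log_{10}\frac{\sigma_e^2(b)}{\sigma_e^2(b+1)}
  = 10\log_{10}\frac{2^{-2b}}{2^{-2(b+1)}}
  = 10\log_{10}4
  \approx 6.02\ \mathrm{dB}
```

An SQNR constraint that moves by about 6 dB therefore shifts the whole traditional design by one bit, which is consistent with the area reduction rate repeating every 6 dB.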

5.2.1.4 SQNR Error

Table 5.4 shows the experimental results under the same constraints as Table 5.2(a), except that the SQNR error is 1.1 dB. Since the allowable SQNR error is greater than 1 dB, the pure statistical error analysis method is used. The SQNR of these optimized wordlength sets was verified by the simulation-based method; the results are listed in the “Post-SQNR” column. The maximum SQNR shortfall is 0.18 dB, that is, 0.4% of the SQNR constraint.

Table 5.4 Area Optimization of Different FFT Point (SQNR Error = 1.1dB)

5.2.2 Special Cases of Optimization

5.2.2.1 Absolute Constraint Over

For these cases, the only advice is to scale the constraint down to meet the limits of the hardware library. There are two such cases. First, the throughput constraint is greater than the maximum throughput of the hardware library. The maximum throughput of the hardware library is the throughput of the wordlength set in which every stage uses the minimum wordlength of the hardware library. If 2 is the minimum wordlength of the hardware library, then {2 2 2 2 2 2 …} is the wordlength set of maximum throughput. Second, the SQNR constraint is greater than the maximum SQNR of the hardware library. The maximum SQNR of the hardware library is the SQNR of the wordlength set in which every stage uses the maximum wordlength of the hardware library. If 32 is the maximum wordlength of the hardware library, then {32 32 32 32 32 32 …} is the wordlength set of maximum SQNR.

Figure 5.3 shows the output messages. Figure 5.3(a) is the output message when the user constraints are N=1024, SQNR=45 dB, input and output wordlengths of 18, and a throughput constraint of 200 MHz. The throughput constraint of 200 MHz exceeds the maximum throughput of the hardware library, 171 MHz. Figure 5.3(b) is the output message when the user constraints are N=1024, SQNR=80 dB, input and output wordlengths of 18, and a throughput constraint of 50 MHz. The SQNR constraint of 80 dB exceeds the maximum SQNR of the hardware library, 69 dB.

(a)

(b)

Figure 5.3 Output Message of Generator when There is No Solution
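The two checks can be sketched in Python as follows. The limits 171 MHz and 69 dB are the values quoted above for N=1024; the function name and the wording of the messages are hypothetical, since the actual generator output in Figure 5.3 is not reproduced in the text.

```python
def check_absolute_constraints(throughput_mhz, sqnr_db, max_throughput_mhz, max_sqnr_db):
    """Return advice messages for constraints that exceed the hardware library limits."""
    advice = []
    if throughput_mhz > max_throughput_mhz:
        advice.append(
            f"throughput {throughput_mhz} MHz exceeds the library maximum "
            f"{max_throughput_mhz} MHz; lower the throughput constraint"
        )
    if sqnr_db > max_sqnr_db:
        advice.append(
            f"SQNR {sqnr_db} dB exceeds the library maximum "
            f"{max_sqnr_db} dB; lower the SQNR constraint"
        )
    return advice  # an empty list means neither limit is exceeded


# The two cases described for Figure 5.3 (N=1024).
case_a = check_absolute_constraints(200, 45, max_throughput_mhz=171, max_sqnr_db=69)
case_b = check_absolute_constraints(50, 80, max_throughput_mhz=171, max_sqnr_db=69)
print(case_a)
print(case_b)
```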

5.2.2.2 Partial Constraint Over

This case occurs when every constraint is individually within the limits of the hardware library but the constraints cannot all be met together. The feasible ranges are then presented so that the user can make a trade-off. Figure 5.4 is the output message when the user constraints are N=1024, SQNR=68 dB, input and output wordlengths of 18, and a throughput constraint of 77 MHz. The SQNR constraint of 68 dB cannot be met together with the throughput constraint of 77 MHz. The output message advises the user on how to trade off.

Figure 5.4 Output Message of Generator when There is No Solution

5.2.3 Methods Comparison

The area reduction and the computation time of the optimization are compared in this sub-section. First, previous work [3] is compared with our hybrid method. Then, our hybrid method is compared with the pure statistical method.

5.2.3.1 Previous Work vs. Our Work

The previous work [3] optimizes the wordlength by a pure simulation-based method, whereas our hybrid method combines the simulation-based and statistical methods. Figure 5.5 presents the post-optimization area and computing time of these methods. The optimized areas of the two methods are equal, but our method computes much faster, especially when the FFT length is long.

Figure 5.5 Comparison Result between Pure Simulation-Based and Hybrid Method

5.2.3.2 Our Hybrid Method vs. Our Pure Statistical Method

There are two optimization methods in our work. The first is the hybrid method, used whenever the maximum allowable SQNR error constraint is less than 1 dB.

The second is the pure statistical method, used whenever the maximum allowable SQNR error constraint is greater than 1 dB. Figure 5.6 compares the area reduction rate and computing time of these methods. The area reduction rates of the two methods are equal, but the pure statistical method computes much faster.

It is interesting to note that the pure statistical method achieves a better area reduction rate in the cases where an SQNR shortfall occurs, namely the optimizations of the 128-, 512-, and 2048-point FFTs in Table 5.4.

Figure 5.6 Comparison Result between Hybrid and Pure Statistical Method

Chapter 6 Conclusions and Future Works

In this thesis, a statistical error analysis method relating the SQNR to the wordlength of each PE stage of pipelined FFT processors is presented. A new hybrid wordlength optimization method for area reduction of pipelined FFT processors, based on statistical and simulation-based error analysis, is introduced; it is faster than the pure simulation-based method. We also present a pure statistical wordlength optimization method, which generates the optimized wordlengths of an FFT processor in just a few seconds even when the FFT has 8192 points. With our generator, advice is still given even when there is no solution under the user constraints.

Increasing the wordlength of an FFT processor increases its power consumption. Therefore, wordlength optimization for power consumption is another attractive topic.

Actually, the accuracy of our optimization method depends on the accuracy of the given hardware library. And to build a precise hardware library for area or power is a difficult
