Problem Formulation - Area Minimization Technique for Fast Fourier Transform Processors Using Multi-Region Conditional Block Scaling

Chapter 3 Motivation

3.4 Problem Formulation

Given an FFT size and a required SQNR, our goal is to minimize the area of a memory-based radix-2 FFT processor under that SQNR constraint by applying the proposed MRCBS method.

Chapter 4 The Proposed Approach

In this chapter, we present the proposed MRCBS method for memory-based FFT, which combines the benefits of conditional scaling and convergent block scaling to improve SQNR performance. The first section describes how the butterfly computations are scheduled so that overflow can be predicted precisely without additional storage. The second section presents MRCBS and its architecture. Finally, the third section discusses MRCBS with different numbers of blocks and the relationship between the number of blocks and the resulting area and SQNR. MRCBS generates many candidate solutions for improving SQNR, and the purpose of this thesis is to find the scaling architecture for FFT that meets the SQNR requirement with the smallest area.

4.1 Scheduling of Butterfly Computation

To predict overflow precisely and avoid unnecessary scaling, we must examine the magnitudes of the two data that form the inputs of the same butterfly in the next stage.

As Fig. 20 shows, where BU stands for butterfly unit, BU₁ and BU₂ are in the current stage, and BU₃ and BU₄ are in the next stage. Once BU₁ and BU₂ are finished, X₀ and X₁, the two inputs of BU₃, are both available, so we can predict overflow and determine the scaling flag for BU₃. Fortunately, X₂ and X₃ are both available as well, so we can predict overflow for BU₃ and BU₄ simultaneously. That is, whenever two butterflies finish in the current stage, we can immediately make the predictions for two butterflies in the next stage. As a result, only four registers are required to hold the results of BU₁ and BU₂ for overflow prediction. Once the predictions for BU₃ and BU₄ are finished and the scaling flags are determined, those four registers can be reused for the results of other butterflies.

Furthermore, compared with the original order, only a small amount of extra control circuitry is required to schedule the butterfly computations in this way.

Fig. 20 Detection of the two data in the same butterfly of next stage
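The pairing described above can be sketched in a few lines. This is a minimal illustration and not the control logic of the thesis: it assumes an in-place radix-2 decimation-in-frequency data flow, where the two inputs of a butterfly in stage s are N/2^s apart, and the names `paired_schedule` and `next_stage_butterflies` are hypothetical. Under that assumption, computing the butterflies at offsets j and j + span/2 back to back yields exactly the four results X₀ to X₃ that feed two butterflies of the next stage.

```python
def paired_schedule(n, stage):
    """Yield (BU1, BU2) address pairs for `stage` (1-based) of an n-point FFT.

    Assumes a radix-2 decimation-in-frequency flow (an assumption, not stated
    in the text): a butterfly in this stage reads addresses `span` apart.
    """
    span = n >> stage      # distance between the two inputs in this stage
    half = span // 2       # distance between the two inputs in the next stage
    if half == 0:
        raise ValueError("the last stage has no next stage to predict")
    for base in range(0, n, 2 * span):
        for j in range(half):
            bu1 = (base + j, base + j + span)                # produces X0, X2
            bu2 = (base + j + half, base + j + half + span)  # produces X1, X3
            yield bu1, bu2


def next_stage_butterflies(bu1, bu2):
    """Return the two next-stage butterflies (BU3, BU4) fed by BU1 and BU2."""
    (x0, x2), (x1, x3) = bu1, bu2
    return (x0, x1), (x2, x3)


for bu1, bu2 in paired_schedule(8, 1):
    print(bu1, bu2, "->", next_stage_butterflies(bu1, bu2))
```

For an 8-point transform, the first pair is (0, 4) and (2, 6), which feeds the next-stage butterflies (0, 2) and (4, 6), so only the four results of one pair need to be held at a time.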

4.2 Multi-Region Conditional Block Scaling

Since the idea of conditional scaling is to predict overflow and determine the scaling flag for the next stage in advance, no intermediate buffers are needed to hold the output data while the scaling flag of the current stage is decided. We therefore develop the architecture shown in Fig. 21 for our scaling method. The memory block is the same as in the traditional memory-based FFT architecture shown in Fig. 6, and the PE block contains one butterfly unit in our work. The detector determines the region information, and the predictor predicts possible overflow. The shared exponents and the scaling flags of each block are stored in the exponent array.

Fig. 21 The architecture of the proposed MRCBS

When evaluating the FFT, two data are read from the memory in each cycle and computed in the BU. The scaling flag determined in the previous stage is read from the exponent array to scale the butterfly results of the current stage. After the butterfly computation is finished, the results are written directly back to the memory. Meanwhile, the results are passed to the detector, which determines their region information from their magnitudes. The predictor then receives the region information from the detector and judges whether overflow will occur in the next stage. After the prediction is finished, the scaling flags and the shared exponents for the next stage are stored in the exponent array. Moreover, the detector and the predictor work in parallel with the butterfly unit, because the butterfly results can be written back to the memory without waiting for them. Thus, this architecture does not add significant processing latency. The details of the detector, predictor, and exponent array are described in the following subsections.

4.2.1 Region Detector

Because we divide the complex plane into several regions, the region detector consists of comparators that determine the region information of the data by comparing the butterfly outputs with several thresholds. A detector that divides the complex plane into circular regions with different radii is called a circular-type detector, and one that divides it into square regions with different side lengths is called a square-type detector. Because each square region is the maximal cyclic quadrilateral of the corresponding circular region (the largest square inscribed in the circle), it covers a smaller area and its prediction is more conservative. As a result, the square-type detector improves SQNR less than the circular-type one but also adds less area.

The purpose of multi-region detection is to handle the situation shown in Fig. 18, where X_m(p) is outside the internal region but X_m(q) is deep inside it, so that the pair is actually overflow-free in the next stage. We therefore define an additional pair of regions, one larger and one smaller than the internal region. As a result, a case with a larger X_m(p) and a smaller X_m(q), or vice versa, can be judged overflow-free. This is why we divide the complex plane into an even number of regions. In our work, we divide the complex plane into two, four, and six regions and implement both circular-type and square-type detectors for each. That is, there are six detectors in total, each with a different area overhead and SQNR performance.

After detection is finished, the detector outputs the region information, which indicates the region in which the data is located.

4.2.1.1 Circular-Type Detector

First, we discuss the region detector that divides the complex plane into two regions, as Fig. 22(a) shows. This type of detector is named “C2”. The internal region R₀ is defined as the circle of radius 0.5, so the threshold t₀, the radius of R₀, is equal to 0.5.

Next, we divide the complex plane into four regions. This type of detector is named “C4”. Additional regions R₋₁ and R₊₁ are defined as shown in Fig. 22(b). R₋₁ is the circle of radius t₋₁, and R₊₁ is the annulus with inner radius t₀ and outer radius t₊₁, where we must ensure that t₋₁ plus t₊₁ does not exceed one. Because the magnitude of X_m(q) is smaller than t₋₁ and the magnitude of X_m(p) is smaller than t₊₁, the sum and the difference of these two complex data cannot exceed one in magnitude and cause overflow. Therefore, once X_m(p) is outside R₀ but inside R₊₁ while X_m(q) is inside R₋₁, the butterfly that computes X_m(p) and X_m(q) in the next stage is judged overflow-free.

Fig. 22 The regions of the circular-type detectors (a) C2 (b) C4 (c) C6

Finally, we further divide the complex plane into six regions, as shown in Fig. 22(c), and this type of detector is named “C6”. In addition to the four existing regions of the C4 detector, the regions R₋₂ and R₊₂ are defined. In the C6 detector, R₋₂ is the circle of radius t₋₂, R₊₂ is the annulus with inner radius t₊₁ and outer radius t₊₂, and R₋₁ becomes the annulus with inner radius t₋₂ and outer radius t₋₁. As in C4, the sum of t₋₂ and t₊₂ should also not exceed one.

As mentioned above, t_{-k} plus t_k, where k = 1 or 2, must not exceed one to avoid overflow. Consequently, if t_{-k} is made larger, t_k must become smaller, and the area of R_{-k} grows as the area of R_k shrinks. Since our purpose is to avoid unnecessary scaling as accurately as possible, the two regions should be as large as possible, and the probability that data falls in R_{-k} should equal the probability that it falls in R_k. As a result, we use the two conditions (4.1) and (4.2) to determine the thresholds t_k of the detectors. The resulting thresholds are shown in Table 1.

$$t_{-k} + t_k = 1 \tag{4.1}$$

$$\mathrm{Area}(R_{-k}) = \mathrm{Area}(R_k) \tag{4.2}$$

Table 1 The value of the thresholds in circular-type detectors
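As a check on these two conditions, the C4 thresholds can be worked out by hand. The derivation below is a sketch that assumes the sum condition is taken at its limit, t₋₁ + t₊₁ = 1, and uses the region definitions given above: R₋₁ is the disc of radius t₋₁, and R₊₁ is the annulus between t₀ = 0.5 and t₊₁.

```latex
\begin{aligned}
\mathrm{Area}(R_{-1}) &= \mathrm{Area}(R_{+1}) \\
\pi\, t_{-1}^{2} &= \pi \left( t_{+1}^{2} - t_{0}^{2} \right)
                  = \pi \left( (1 - t_{-1})^{2} - \tfrac{1}{4} \right) \\
t_{-1}^{2} &= 1 - 2\,t_{-1} + t_{-1}^{2} - \tfrac{1}{4} \\
t_{-1} &= \tfrac{3}{8} = 0.375, \qquad t_{+1} = 1 - t_{-1} = 0.625
\end{aligned}
```

These are the same values of t₋₁ and t₊₁ at which the simulated SQNR is reported to be close to its maximum.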

Here we sweep t₋₁ from 0 to 0.5 and simulate the SQNR performance; the result is shown in Fig. 23. As we can see, the SQNR is close to its highest value when t₋₁ is equal to 0.375 and t₊₁ is equal to 0.625, as we expect.

Fig. 23 The simulation result of SQNR with different t₋₁

Because the regions are all circles in the complex plane, we must calculate the magnitude of each complex datum from the sum of the squares of its real part and imaginary part. As a result, multipliers and adders are introduced in addition to the comparators that compare the result with the thresholds, as shown in Fig. 24. Intuitively, C6 has the best performance and the largest comparator area, since it has the most thresholds to compare, while C2 has the smallest comparator area and the worst performance.

Fig. 24 The block diagram of the circular-type detector

However, the bit width BW of the multipliers, adders, and comparators influences both the area and the accuracy. That is, arithmetic units with a larger bit width give better accuracy but cost more area. In our work, we implement 10-bit comparators, BW-bit multipliers, and 2*BW-bit adders, where BW is an integer between 5 and 10.
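The behaviour of a circular-type detector can be sketched in floating point as follows. This is only an illustration of the comparison against squared radii, not the fixed-point hardware: the finite bit width BW is ignored, the function name `circular_region` is hypothetical, and the label `OUTSIDE = 3` for data beyond the outermost threshold is our assumption, since the region labels of Table 3 are not reproduced here. The C4 radii are the values quoted in the text.

```python
# C4 radii quoted in the text: t_-1 = 0.375, t_0 = 0.5, t_+1 = 0.625.
C4_RADII = [(-1, 0.375), (0, 0.5), (1, 0.625)]

# Assumed label for data outside every region (not taken from Table 3).
OUTSIDE = 3


def circular_region(re, im, radii=C4_RADII):
    """Return the region index k of the innermost circle containing (re, im)."""
    # Squared magnitude: two multiplications and one addition, no square root.
    mag_sq = re * re + im * im
    for k, t in radii:              # radii are listed from innermost to outermost
        if mag_sq < t * t:          # compare against the squared threshold
            return k
    return OUTSIDE


print(circular_region(0.3, 0.0))    # inside R_-1
print(circular_region(0.3, 0.3))    # between t_-1 and t_0, so R_0
print(circular_region(0.6, 0.0))    # between t_0 and t_+1, so R_+1
```

Comparing squared quantities is what introduces the multipliers and adders mentioned above; the comparators themselves only see the sum of squares.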

4.2.1.2 Square-Type Detector

Although the circular-type region detectors make precise predictions, they cost considerable area because of the multipliers and adders. To reduce the hardware cost, the square-type region detectors [14] are an alternative: we simplify the circular regions to their maximal cyclic quadrilaterals. The square regions are depicted in Fig. 25. As with the circular-type detectors, “S2” is the square-type detector that divides the complex plane into two square regions, “S4” divides it into four square regions, and “S6” divides it into six.

Because each square region shown in Fig. 25 is the maximal cyclic quadrilateral of the corresponding circular region shown in Fig. 22, the thresholds of the square-type detectors are defined by (4.3), where k = -2, -1, 0, 1, and 2. The thresholds h_k of the square-type detectors are listed in Table 2.

$$h_k = \frac{\sqrt{2}}{2}\, t_k \tag{4.3}$$

Table 2 The value of the thresholds in square-type detectors

Fig. 25 The regions of the square-type detectors (a) S2 (b) S4 (c) S6

We also sweep h₋₁ from 0 to 0.354 to simulate the SQNR performance of the S4-type detector, and the result is shown in Fig. 26. The SQNR is close to its highest value when h₋₁ is equal to 0.265 and h₊₁ is equal to 0.442, as we expect.

Fig. 26 The simulation result of SQNR with different h₋₁

To determine the region information in the square-type detectors, we only need to compare the absolute values of the real part and the imaginary part with half of the side length of each square. The block diagram of the square-type detector is shown in Fig. 27. The only difference between the two types is that the square type does not need the multipliers and adders that calculate the magnitude. As a result, the circuits of the square-type detectors are much simpler than those of the circular-type detectors. However, as mentioned above, the bit width of the comparators influences the accuracy. Thus, in our work, we implement BW-bit comparators, where BW is an integer between 5 and 10.
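A square-type detector can be sketched in the same way. The snippet below derives the half side lengths h_k from the circular radii by multiplying with √2/2 and then classifies a datum with comparisons only. It is a floating-point illustration under our own assumptions: the names `square_thresholds` and `square_region` are hypothetical, the finite comparator width BW is ignored, and `OUTSIDE = 3` is an assumed label for data beyond the outermost square.

```python
import math

# C4 radii quoted in the text; S4 uses the squares inscribed in these circles.
C4_RADII = [(-1, 0.375), (0, 0.5), (1, 0.625)]

# Assumed label for data outside every region (not taken from Table 3).
OUTSIDE = 3


def square_thresholds(radii):
    """Scale each radius t_k by sqrt(2)/2 to get the half side length h_k."""
    return [(k, t * math.sqrt(2) / 2) for k, t in radii]


def square_region(re, im, thresholds):
    """Return the region index k of the innermost square containing (re, im)."""
    # A point is inside the square when both |re| and |im| are below h_k,
    # which is the same as comparing the larger of the two with h_k.
    peak = max(abs(re), abs(im))
    for k, h in thresholds:         # innermost to outermost
        if peak < h:
            return k
    return OUTSIDE


S4 = square_thresholds(C4_RADII)
print([round(h, 3) for _, h in S4])     # half side lengths h_-1, h_0, h_+1
print(square_region(0.36, 0.0, S4))     # just outside the h_0 square
```

The datum 0.36 + 0j has a magnitude below t₋₁ = 0.375, so a C4 detector would place it in R₋₁, whereas the S4 detector places it in the region of index +1 because 0.36 exceeds h₀ ≈ 0.354. This illustrates why the square-type prediction is the more conservative of the two.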

Fig. 27 The block diagram of the square-type detector

4.2.2 Overflow Predictor

With the region information from the region detector, we predict overflow for the butterflies of the next stage. As shown in Fig. 28, X₀ and X₂ are computed by BU₁, while X₁ and X₃ are computed by BU₂. After the computations of BU₁ and BU₂ are finished, we have the four results X₀ to X₃. We then predict whether X₄ to X₇ may overflow. Here we define two variables, P and Q, to represent the region information. For the prediction of BU₃, P_BU3 is the region information of X₀ and Q_BU3 is the region information of X₁. For the prediction of BU₄, P_BU4 is the region information of X₂ and Q_BU4 is the region information of X₃. The values of P and Q are decided by the locations of X₀ to X₃: the region information is equal to k when the data is inside the region R_k, where k = -2, -1, 0, 1, and 2, as Table 3 shows.

Fig. 28 Overflow Prediction based on the region information of the inputs

Table 3 The value of region information P and Q according to the data locations

Taking the prediction of BU₃ with the C6-type detector as an example, P_BU3 is set to -2 when X₀ is inside the region R₋₂, and Q_BU3 is set to 2 when X₁ is inside the region R₊₂. X₄ and X₅ can overflow only if the sum of the magnitudes of X₀ and X₁ is larger than one. We therefore add P_BU3 and Q_BU3 and compare the sum with the constant zero. If P_BU3 plus Q_BU3 is larger than 0, the magnitude of X₀ plus the magnitude of X₁ may be larger than one, and the outputs of BU₃ should be scaled to avoid overflow. Table 4 shows the scaling decisions based on the sum of P and Q.

Table 4 Scaling decision according to the summation of P and Q
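The decision rule described above reduces to one addition and one comparison per butterfly. The following sketch applies it to the four region values of X₀ to X₃; the function name `predict_pair` is hypothetical, and the region values are assumed to be the integers k defined in the text.

```python
def predict_pair(r0, r1, r2, r3):
    """Return the scaling decisions (BU3, BU4) from the regions of X0..X3.

    BU3 consumes X0 and X1, and BU4 consumes X2 and X3. A butterfly is
    scaled when the sum of its two region values is larger than zero.
    """
    p_bu3, q_bu3 = r0, r1
    p_bu4, q_bu4 = r2, r3
    return (p_bu3 + q_bu3 > 0, p_bu4 + q_bu4 > 0)


# X0 in R_-2 and X1 in R_+2 sum to 0, so BU3 is not scaled;
# X2 in R_0 and X3 in R_+1 sum to 1, so BU4 is scaled.
print(predict_pair(-2, 2, 0, 1))
```

A sum of exactly zero means that one input lies in an inner region and the other in its matching outer region, which is the case that the multi-region detection is designed to accept.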

The block diagram of the predictor is shown in Fig. 29. There are four registers that temporarily store the region information. After the computation of BU₁ in Fig. 28 is finished, we store P_BU3 and P_BU4 and wait for the results of BU₂. After BU₂ is finished, we obtain Q_BU3 and Q_BU4 and store them in the registers. Once the four variables are ready, we calculate P_BU3 plus Q_BU3 and P_BU4 plus Q_BU4 and compare each result with zero.

Fig. 29 The block diagram of the overflow predictor

In addition, the predictor has two special flags that record the scaling flags for the next stage. They are needed because the convergent block scaling method splits each data block of the current stage into two smaller blocks in the next stage. Once P plus Q is larger than zero for any butterfly in one of the new smaller blocks, the corresponding special flag is set and held. After all computations on the data of the current block are finished, the two flags determine the scaling flags of the two new blocks and are stored in the exponent array.
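The set-and-hold behaviour of the two special flags can be sketched as a small state holder. This is an illustration only: the class name `BlockFlagTracker` and its methods are hypothetical, and the caller is assumed to know which of the two new sub-blocks (index 0 or 1) each predicted butterfly belongs to, which the text does not spell out.

```python
class BlockFlagTracker:
    """Two sticky flags, one for each sub-block the current block splits into."""

    def __init__(self):
        self.flags = [False, False]

    def observe(self, sub_block, p, q):
        """Record one prediction; `sub_block` is 0 or 1 (supplied by the caller)."""
        # Set and hold: a later overflow-free butterfly never clears the flag.
        if p + q > 0:
            self.flags[sub_block] = True

    def finish_block(self):
        """Return both flags for the exponent array and clear them."""
        result = tuple(self.flags)
        self.flags = [False, False]
        return result


tracker = BlockFlagTracker()
tracker.observe(0, -1, 1)    # sum is 0, flag 0 stays clear
tracker.observe(1, 0, 1)     # sum is 1, flag 1 is set
tracker.observe(1, -2, -2)   # sum is -4, flag 1 is held
print(tracker.finish_block())
```

A single butterfly that may overflow is therefore enough to mark its whole sub-block for scaling in the next stage.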

4.2.3 Exponent Unit

The block scaling method needs exponent units to store the shared exponents and the scaling flags of the blocks. As shown in Fig. 30, the exponent units are stored in the exponent array. Each exponent unit consists of a k-bit shared exponent and a one-bit scaling flag, where k depends on the FFT size N and is equal to log₂ log₂ N.

Fig. 30 The exponent array with the exponent unit

The shared exponent is shared by all data of a block, and the scaling flag decides whether to scale the butterfly results while the data in this block are being computed. After all computations in one block are finished, the two new scaling flags and shared exponents are stored in the corresponding exponent units, as shown in Fig. 31. If the flag is set, the shared exponent is increased by one; otherwise, it keeps its original value.

Fig. 31 The block diagram of the exponent unit

Each block has its own exponent unit. Here we define B_n as the tag of a block and E_n as the tag of the corresponding exponent unit, where n is an integer. If there are m blocks, n ranges from 0 to m-1. During convergent block scaling, each block in the current stage is divided into two blocks in the next stage. Therefore, after the computations of block B_x in stage s are finished, the two new shared exponents and scaling flags are stored in the exponent units E_x and E_y, where y = x + m / 2^s. The usage of the exponent array in each stage is shown in Fig. 32.
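The indexing rule y = x + m / 2^s can be traced with a short sketch. It simply applies the formula stage by stage, starting from the single block B₀, and assumes that m is a power of two; the function names `exponent_slots` and `usage_by_stage` are hypothetical, and the output is what the formula yields rather than a transcription of Fig. 32.

```python
def exponent_slots(m, stage, x):
    """Return (x, y): the exponent units written when block B_x ends `stage`."""
    step = m >> stage                   # m / 2^s, assuming m is a power of two
    if step == 0:
        raise ValueError("no further split: every exponent unit is in use")
    return x, x + step


def usage_by_stage(m):
    """List, for each stage, the (E_x, E_y) pairs written in that stage."""
    blocks = [0]                        # the first stage starts with block B_0
    usage = []
    stage = 1
    while (m >> stage) >= 1:
        pairs = [exponent_slots(m, stage, x) for x in blocks]
        usage.append(pairs)
        # Every written unit tags a block of the next stage.
        blocks = sorted(i for pair in pairs for i in pair)
        stage += 1
    return usage


for stage, pairs in enumerate(usage_by_stage(8), start=1):
    print(stage, pairs)
```

With m = 8, the first stage writes E₀ and E₄, the second writes (E₀, E₂) and (E₄, E₆), and the third fills all eight units, so no unit is overwritten by a block other than the one it was split from.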

Fig. 32 The usage of the exponent array for convergent block scaling with 8 blocks

4.3 Restricted Number of Blocks

As mentioned for the convergent block scaling method, dynamic scaling scales only when it is necessary, which avoids loss of accuracy. Grouping the data into several blocks improves the SQNR further, since many exponents are available to represent data with different dynamic ranges. It is therefore natural to expect that a larger number of blocks yields higher precision. However, the convergent block scaling method divides each block into two blocks at every stage from the first to the last. That is, the number of blocks and the area of the exponent storage double with every stage. For an N-point FFT, there are N/2 blocks in the last stage, so N/2 exponent units are required.

As a result, the storage cost is large. We therefore define B_max = 2^(s-1) as the total number of blocks in convergent block scaling when the number of blocks is doubled only until stage s. Fig. 33 shows convergent block scaling with different B_max.

Fig. 33 The convergent block scaling with (a) Bmax = 1 (b) Bmax = 2 (c) Bmax = 4

Taking as an example the 8192-point FFT with a 16-bit wordlength and MRCBS using the S2-type detector with 10-bit comparators, the SQNR and area are shown in Fig. 34. It can be observed that the storage area keeps increasing while the SQNR saturates as B_max grows. This implies that in the deeper stages we fail to obtain the expected SQNR gain even though we double the area of the exponent storage. If we divide the blocks only until stage 11, which requires only 1024 exponent units, the area overhead of the exponent storage is only 1/4 of that needed when we divide until stage 13, yet the SQNR is just 0.13 dB lower. Thus, by doubling the number of blocks only until a certain stage rather than until the last stage, we can obtain the desired SQNR improvement with much less exponent storage.

Although the SQNR is not the highest achievable when we restrict the number of blocks, we still obtain an acceptable SQNR and reduce the area cost.
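The storage figures in this example follow directly from B_max = 2^(s-1). The sketch below reproduces the count of exponent units and a rough bit count per stopping stage. The function names are hypothetical, and rounding the exponent width log₂ log₂ N up to a whole number of bits is our assumption; the actual area depends on the memory implementation and is not modelled here.

```python
import math


def exponent_units(stop_stage):
    """Number of exponent units when blocks are doubled only until `stop_stage`."""
    return 2 ** (stop_stage - 1)


def exponent_storage_bits(n, stop_stage):
    """Rough storage in bits: each unit holds a k-bit exponent and a 1-bit flag."""
    # Exponent width log2(log2(N)), rounded up to whole bits (an assumption).
    k = math.ceil(math.log2(math.log2(n)))
    return exponent_units(stop_stage) * (k + 1)


print(exponent_units(11))    # stopping at stage 11
print(exponent_units(13))    # doubling until the last stage of an 8192-point FFT
print(exponent_storage_bits(8192, 11) / exponent_storage_bits(8192, 13))
```

Stopping at stage 11 gives 1024 units against 4096 at stage 13, which is the factor of 1/4 quoted above.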

Fig. 34 The SQNR and area cost with different B_max

Chapter 5 Experimental Results

The proposed MRCBS method generates many hardware solutions for SQNR improvement and finds the one that meets the SQNR constraint with the minimum area cost.

Here we define the performance pair (PP): (SQNR, AREA), which gives the SQNR performance together with the corresponding area cost. Each solution obtained by MRCBS has its own PP, denoted PP^T: (SQNR^T, AREA^T), where SQNR^T represents the total SQNR performance and AREA^T represents the minimized total area cost.
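Selecting a solution from a set of performance pairs can be sketched as a filter followed by a minimum. This is a generic illustration of the selection criterion, not the search procedure of the thesis; the names `PerformancePair` and `pick_min_area` are hypothetical, and the units of the area field are left to the caller.

```python
from typing import Iterable, NamedTuple, Optional


class PerformancePair(NamedTuple):
    """One candidate solution: its SQNR in dB and its area cost."""
    sqnr_db: float
    area: float


def pick_min_area(candidates: Iterable[PerformancePair],
                  required_sqnr_db: float) -> Optional[PerformancePair]:
    """Return the smallest-area candidate that meets the SQNR requirement."""
    # Keep only the candidates that satisfy the constraint.
    feasible = [pp for pp in candidates if pp.sqnr_db >= required_sqnr_db]
    # Among those, choose the one with minimum area; None if nothing qualifies.
    return min(feasible, key=lambda pp: pp.area, default=None)
```

Returning `None` when no candidate qualifies makes an unreachable SQNR requirement explicit instead of silently returning the closest miss.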

The PP^T is determined by the quintuple (N, WL, Type, BW, Bmax) where N is the given FFT size and WL is the wordlength of storage from 14 bits to 18 bits. The Type indicates different
