Chapter 3 Motivation
3.4 Problem Formulation
Given the FFT size and the required SQNR, our goal is to minimize the area of the memory-based radix-2 FFT under the SQNR constraint by applying our MRCBS method.
Chapter 4
The Proposed Approach
In this chapter, we present the proposed MRCBS method for memory-based FFT, which combines the benefits of conditional scaling and convergent block scaling to improve SQNR performance. The first section describes how the butterfly computations are scheduled so that overflow can be predicted precisely while saving additional storage. The second section presents MRCBS and its architecture. Finally, the third section discusses MRCBS with different numbers of blocks and the relationship between the number of blocks and the resulting area and SQNR. MRCBS generates many candidate solutions for improving SQNR; the purpose of this thesis is to find the scaling architecture for FFT that meets the SQNR requirement with the smallest area.
4.1 Scheduling of Butterfly Computation
To predict overflow precisely and avoid unnecessary scaling, we must detect the magnitudes of the two data that are the inputs of the same butterfly in the next stage.
As Fig. 20 shows where BU is abbreviated from butterfly unit, BU1 and BU2 are in current
we can predict overflow and determine the scaling flag for BU3. Fortunately, X2 and X3 are available at the same time, so we can predict overflow for BU3 and BU4 simultaneously. That is, once two butterflies of the current stage are finished, we can make the predictions for two butterflies of the next stage. As a result, only four registers are required to hold the results of BU1 and BU2 for overflow prediction. When the predictions for BU3 and BU4 are finished and the scaling flags are determined, those four registers can be reused for the results of other butterflies.
Furthermore, compared to the original order, only a small amount of extra control circuitry is required to schedule the butterflies in this order.
Fig. 20 Detection of the two data in the same butterfly of next stage
4.2 Multi-Region Conditional Block Scaling
Because conditional scaling predicts overflow and predetermines the scaling flag for the next stage, it needs no intermediate buffers to hold the output data while the scaling flag of the current stage is determined. We therefore develop the architecture shown in Fig. 21 for our scaling method. The memory block is the same as in the traditional memory-based FFT architecture shown in Fig. 6, and the PE block in our work contains one butterfly unit. The detector determines the region information and the predictor predicts possible overflow. The shared exponents and the scaling flags of each block are stored in the exponent array.
Fig. 21 The architecture of the proposed MRCBS
When the FFT is evaluated, two data are read from the memory in each cycle and computed in the BU. The scaling flag predetermined in the previous stage is read from the exponent array to scale the butterfly results of the current stage. After the butterfly computation is finished, the results are written directly back to the memory. Meanwhile, the results are passed to the detector, which determines their region information from their magnitudes. The predictor then uses this region information to judge whether overflow will occur in the next stage. After the prediction is finished, the predetermined scaling flags and the shared exponents of the next stage are stored in the exponent array. Moreover, the detector and predictor work in parallel with the butterfly unit, since the butterfly results can be written back to the memory without waiting for the detector and predictor outputs. Thus, this architecture does not add significant processing latency. The details of the detector, predictor, and exponent array are described in the following subsections.
4.2.1 Region Detector
Because we divide the complex plane into several regions, the region detector consists of comparators that determine the region information of the data by comparing the butterfly outputs with several thresholds. A detector that divides the complex plane into circular regions with different radii is called a circular-type detector, whereas one that divides the complex plane into square regions with different side lengths is called a square-type detector. Because each square region is the maximal cyclic quadrilateral (the inscribed square) of the corresponding circular region, its area is smaller and the prediction is stricter. As a result, the square-type detector improves SQNR less than the circular-type one but also adds less area.
The purpose of multi-region detection is to handle the situation shown in Fig. 18, where Xm(p) is outside the internal region but Xm(q) is deep inside it, so that the pair is actually overflow-free in the next stage. We therefore define an additional pair of regions, one larger and one smaller than the internal region. As a result, a case with a larger Xm(p) and a smaller Xm(q), or vice versa, can be judged overflow-free. This is why we divide the complex plane into an even number of regions. In our work, we divide the complex plane into two, four, and six regions and implement both circular-type and square-type detectors for each. That is, there are six different detectors in total, each with a different area overhead and SQNR performance.
After the detection is finished, the detector outputs the region information, which indicates the region in which the data is located.
4.2.1.1 Circular-Type Detector
First we discuss the region detector which divides the complex plane into two regions as Fig. 22(a) shows. This type of detector is named “C2”. The internal region R0 is defined as the circle of radius 0.5 and the threshold t0 representing the radius of R0 is equal to 0.5.
Next we divide the complex plane into four regions. This type of detector is named “C4”.
Additional regions R-1 and R+1 are defined as shown in Fig. 22(b). R-1 is the circle of radius t-1 and R+1 is the annulus with inner radius t0 and outer radius t+1, and we have to ensure that t-1 plus t+1 does not exceed one. Because the magnitude of Xm(q) is smaller than t-1 and that of Xm(p) is smaller than t+1, the magnitude of their sum or difference is smaller than t-1 plus t+1 and therefore cannot exceed one and cause overflow. Therefore, once Xm(p) is outside R0 but inside R+1 while Xm(q) is inside R-1, the butterfly that computes Xm(p) and Xm(q) in the next stage is judged to be overflow-free.
Fig. 22 The regions of the circular-type detectors (a) C2 (b) C4 (c) C6
Finally, we further divide the complex plane into six regions as shown in Fig. 22(c); this type of detector is named “C6”. In addition to the four regions of the C4 detector, the regions R-2 and R+2 are defined. In the C6 detector, R-2 is the circle of radius t-2, R+2 is the annulus with inner radius t+1 and outer radius t+2, and R-1 becomes the annulus with inner radius t-2 and outer radius t-1. As in C4, the sum of t-2 and t+2 must not exceed one.
As mentioned above, t-k plus tk, where k = 1 or 2, must not exceed one to avoid overflow, so if t-k is larger, tk must be smaller. Meanwhile, the area of R-k grows as the area of Rk shrinks. Since our purpose is to avoid unnecessary scaling as accurately as possible, the two regions should be as large as possible, and the probability that data falls in R-k should equal the probability that it falls in Rk. This gives the two conditions (4.1) and (4.2), which determine the thresholds tk of the detectors. The resulting thresholds are shown in Table 1.
$t_{-k} + t_{k} = 1$ (4.1)
$\mathrm{Area}(R_{-k}) = \mathrm{Area}(R_{k})$ (4.2)
Table 1 The value of the thresholds in circular-type detectors
Here we sweep t-1 from 0 to 0.5 and simulate the SQNR performance; the result is shown in Fig. 23. As we can see, the SQNR is close to its maximum when t-1 is equal to 0.375 and t1 is equal to 0.625, as expected.
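The values 0.375 and 0.625 can be reproduced with a short calculation. The sketch below is only an illustration and assumes that (4.1) reads t-1 + t1 = 1 and that (4.2) equates the area of the disc R-1 with the area of the annulus R+1 between t0 and t1; the function name `c4_thresholds` is hypothetical.

```python
import math


def c4_thresholds(t0=0.5):
    """Solve the two threshold conditions for a C4-type detector.

    Assumed conditions (one reading of (4.1) and (4.2)):
      t_m1 + t_p1 = 1
      pi * t_m1**2 = pi * (t_p1**2 - t0**2)
    Substituting t_p1 = 1 - t_m1 gives t_m1 = (1 - t0**2) / 2.
    """
    t_m1 = (1.0 - t0 ** 2) / 2.0
    t_p1 = 1.0 - t_m1
    return t_m1, t_p1


t_m1, t_p1 = c4_thresholds()
area_inner = math.pi * t_m1 ** 2                 # disc R-1
area_outer = math.pi * (t_p1 ** 2 - 0.5 ** 2)    # annulus R+1
print(t_m1, t_p1)                                # 0.375 0.625
print(math.isclose(area_inner, area_outer))      # True
```

Under these assumptions the closed form gives t-1 = 0.375 and t1 = 0.625 for t0 = 0.5, which agrees with the location of the SQNR peak observed in the sweep.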
Fig. 23 The simulation result of SQNR with different t-1
Because the regions are all circles in the complex plane, we must calculate the magnitude of each complex datum by summing the squares of its real and imaginary parts. As a result, multipliers and adders are needed in addition to the comparators that compare the result with the thresholds, as shown in Fig. 24. Intuitively, C6 has the best performance and the largest comparator area, since it compares against the most thresholds, while C2 has the smallest comparator area and the worst performance.
Fig. 24 The block diagram of the circular-type detector
However, the bit width BW of the multipliers, adders, and comparators also influences both area and accuracy. That is, an arithmetic unit with a larger bit width produces better accuracy but costs more area. In our work, we implement 10-bit comparators, BW-bit multipliers, and 2*BW-bit adders, where BW is an integer chosen from 5 to 10.
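The behaviour of a circular-type detector can be sketched in software as follows. This is a floating-point illustration only and does not model the BW-bit arithmetic. It assumes that the squared magnitude is compared with squared thresholds, that "inside" means strictly smaller than the threshold, and that data outside the outermost region receives an index one larger than that region; the name `circular_region`, the constant `C4_RADII`, and the index used for outside data are hypothetical and are not taken from Table 3.

```python
# Region index k paired with the outer radius t_k of region R_k (C4 detector).
C4_RADII = [(-1, 0.375), (0, 0.5), (1, 0.625)]


def circular_region(re, im, radii=C4_RADII):
    """Return the region index of the complex sample re + j*im."""
    mag_sq = re * re + im * im          # multipliers and adder in Fig. 24
    for k, t in radii:                  # comparators, innermost region first
        if mag_sq < t * t:
            return k
    # Assumption: outside every region, use an index above the outermost one.
    return radii[-1][0] + 1


print(circular_region(0.1, 0.1))   # deep inside, region R-1
print(circular_region(0.4, 0.4))   # between t0 and t+1, region R+1
```

Comparing squared quantities avoids a square root, which matches the multiplier and adder structure described above.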
4.2.1.2 Square-Type Detector
Although the circular-type region detectors make precise predictions, the multipliers and adders they introduce cost a lot of area. For hardware efficiency, an alternative is the square-type region detector [14]. That is, we simplify the circular regions to their maximal cyclic quadrilaterals. The square regions are shown in Fig. 25. As with the circular-type detectors, “S2” is the square-type detector that divides the complex plane into two square regions, “S4” divides it into four square regions, and “S6” divides it into six.
Because each square region shown in Fig. 25 is the maximal cyclic quadrilateral of the corresponding circular region shown in Fig. 22, the thresholds of the square-type detectors are defined by (4.3), where k = -2, -1, 0, 1, and 2. The thresholds hk of the square-type detectors are listed in Table 2.
$h_{k} = \frac{\sqrt{2}}{2}\, t_{k}$ (4.3)
Table 2 The value of the thresholds in square-type detectors
Fig. 25 The regions of the square-type detectors (a) S2 (b) S4 (c) S6
We also sweep h-1 from 0 to 0.354 and simulate the SQNR performance of the S4-type detector. The result is shown in Fig. 26. The SQNR is close to its maximum when h-1 is equal to 0.265 and h1 is equal to 0.442, as expected.
Fig. 26 The simulation result of SQNR with different h-1
To determine the region information in the square-type detectors, we only need to compare the absolute values of the real part and the imaginary part with half of the side lengths of the squares. The block diagram of the square-type detector is shown in Fig. 27. The only difference between the two types is that the square type does not need multipliers and adders to calculate the magnitude. As a result, the circuits of the square-type detectors are much simpler than those of the circular-type detectors. However, as mentioned above, the bit width of the comparators influences the accuracy. Thus, in our work we implement BW-bit comparators, where BW is an integer chosen from 5 to 10.
Fig. 27 The block diagram of the square-type detector
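A square-type detector can be sketched in the same way. The example assumes that hk is the half side length obtained by multiplying the circular threshold by sqrt(2)/2, uses the C4 radii 0.375, 0.5, and 0.625, and again assigns a hypothetical index to data outside the outermost square; `square_region` and `S4_HALF_SIDES` are illustrative names, and the BW-bit quantization is not modelled.

```python
import math

# Circular radii of the C4 detector and the derived S4 half side lengths.
C4_RADII = [(-1, 0.375), (0, 0.5), (1, 0.625)]
S4_HALF_SIDES = [(k, t * math.sqrt(2) / 2) for k, t in C4_RADII]


def square_region(re, im, half_sides=S4_HALF_SIDES):
    """Return the region index using only absolute values and comparisons."""
    largest = max(abs(re), abs(im))     # inside a square iff both parts fit
    for k, h in half_sides:
        if largest < h:
            return k
    # Assumption: outside every square, use an index above the outermost one.
    return half_sides[-1][0] + 1


print([round(h, 3) for _, h in S4_HALF_SIDES])   # [0.265, 0.354, 0.442]
print(square_region(0.3, 0.0))                   # 0
```

The sample 0.3 + 0j has magnitude 0.3 and would lie inside the circle of radius 0.375, yet it falls outside the innermost square, which illustrates why the square-type prediction is stricter.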
4.2.2 Overflow Predictor
Using the region information from the region detector, we predict overflow for the butterflies of the next stage. As shown in Fig. 28, X0 and X2 are computed by BU1, and X1 and X3 are computed by BU2. After the computations of BU1 and BU2 are finished, we have the four results X0 to X3. We then predict whether X4 to X7 may overflow. Here we define two variables P and Q to represent the region information.
For the prediction of BU3, PBU3 is the region information of X0 and QBU3 is the region information of X1. For the prediction of BU4, PBU4 is the region information of X2 and QBU4 is the region information of X3. The values of P and Q are decided by the locations of X0 to X3. The region information is equal to k when the data is inside the region Rk, where k = -2, -1, 0, 1, and 2, as Table 3 shows.
Fig. 28 Overflow Prediction based on the region information of the inputs
Table 3 The value of region information P and Q according to the data locations
Taking the prediction of BU3 with the C6-type detector as an example, PBU3 is set to -2 when X0 is inside the region R-2 and QBU3 is set to 2 when X1 is inside the region R+2. X4 and X5 can overflow only if the sum of the magnitudes of X0 and X1 is larger than one. We therefore add PBU3 and QBU3 and compare the sum with zero. If PBU3 plus QBU3 is larger than 0, the magnitude of X0 plus the magnitude of X1 may be larger than one, and the outputs of BU3 should be scaled to avoid overflow. Table 4 shows the scaling decisions based on the sum of P and Q.
Table 4 Scaling decision according to the summation of P and Q
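The decision rule described above, scaling when P plus Q is larger than zero, can be written as a one-line check. The function name `needs_scaling` is hypothetical; the rule itself is the one stated in the text.

```python
def needs_scaling(p, q):
    """Return True when the next-stage butterfly fed by the two data
    with region information p and q should be scaled."""
    return p + q > 0


# C6 example from the text: X0 in R-2 (P = -2) and X1 in R+2 (Q = 2).
print(needs_scaling(-2, 2))   # False, judged overflow-free
print(needs_scaling(1, 0))    # True, the outputs are scaled
```

Because the region indices are small signed integers, the hardware counterpart needs only a narrow adder and a sign comparison.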
The block diagram of the predictor is shown in Fig. 29. Four registers temporarily store the region information. After the computation of BU1 in Fig. 28 is finished, we store PBU3 and PBU4 and wait for the results of BU2. After BU2 is finished, we obtain QBU3 and QBU4 and store them in the registers. Once the four variables are ready, we calculate PBU3 plus QBU3 and PBU4 plus QBU4 and compare the results with zero.
Fig. 29 The block diagram of the overflow predictor
In addition, the predictor contains two special flags that record the scaling flags for the next stage. They are needed because the convergent block scaling method splits each data block of the current stage into two smaller blocks in the next stage. Once P plus Q is larger than zero for any butterfly in one of the new smaller blocks, the corresponding special flag is set and held. After all computations for the data in the current block are finished, the two flags determine the scaling flags of the two new blocks and are stored in the exponent array.
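The set-and-hold behaviour of the two special flags can be sketched as follows. The class `StickyFlags` and its methods are hypothetical names, and the sketch assumes that the caller already knows which of the two new blocks (index 0 or 1) a predicted butterfly belongs to, since that mapping depends on the addressing scheme.

```python
class StickyFlags:
    """Two flags that are set once and held until the current block ends."""

    def __init__(self):
        self.flags = [False, False]

    def update(self, sub_block, p, q):
        # Set and hold: a later overflow-free pair never clears the flag.
        if p + q > 0:
            self.flags[sub_block] = True

    def flush(self):
        """Return the scaling flags of the two new blocks and reset."""
        result = tuple(self.flags)
        self.flags = [False, False]
        return result


flags = StickyFlags()
flags.update(0, -1, 1)   # overflow-free
flags.update(0, 1, 0)    # sets the flag of new block 0
flags.update(0, -2, 0)   # flag of new block 0 stays set
flags.update(1, 0, 0)    # new block 1 remains overflow-free
print(flags.flush())     # (True, False)
```

A single predicted overflow is therefore enough to scale the whole new block, which is what makes the shared exponent valid for all of its data.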
4.2.3 Exponent Unit
The block scaling method needs exponent units to store the shared exponents and the scaling flags of the blocks. As shown in Fig. 30, the exponent units are stored in the exponent array. Each exponent unit consists of a k-bit shared exponent and a one-bit scaling flag, where k depends on the FFT size N and is equal to $\lceil \log_2 \log_2 N \rceil$.
Fig. 30 The exponent array with the exponent unit
The shared exponent is shared by all data of a block, and the scaling flag decides whether to scale the butterfly results while the data in this block are being computed. After all computations in one block are finished, the two new scaling flags and shared exponents are stored in the corresponding exponent units, as shown in Fig. 31. If the flag is set, the shared exponent is increased by one; otherwise, the shared exponent keeps its original value.
Fig. 31 The block diagram of the exponent unit
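The update rule of the exponent unit can be illustrated with a small model. It is a sketch under one possible reading of the text, namely that each new block inherits the shared exponent of its parent block and that the exponent is increased by one at the moment a set scaling flag is stored; `ExponentUnit` and `split_block` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ExponentUnit:
    exponent: int = 0    # shared exponent of the block
    flag: bool = False   # scaling flag used while the block is computed


def split_block(parent, flag_a, flag_b):
    """Build the exponent units of the two new blocks of the next stage."""
    def child(flag):
        # A set flag increases the shared exponent by one, otherwise the
        # exponent keeps the value of the parent block.
        return ExponentUnit(parent.exponent + (1 if flag else 0), flag)

    return child(flag_a), child(flag_b)


unit_a, unit_b = split_block(ExponentUnit(exponent=3, flag=True), True, False)
print(unit_a)   # ExponentUnit(exponent=4, flag=True)
print(unit_b)   # ExponentUnit(exponent=3, flag=False)
```

After the split the two new blocks carry different shared exponents, which is how block scaling tracks data with different dynamic ranges.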
Each block has its own exponent unit. We define Bn as the tag of a block and En as the tag of the corresponding exponent unit, where n is an integer. If there are m blocks, n ranges from 0 to m-1. During convergent block scaling, each block of the current stage is divided into two blocks in the next stage. Therefore, after the computations of block Bx in stage s are finished, the two new shared exponents and scaling flags are stored in the exponent units Ex and Ey, where $y = x + m / 2^{s}$. The usage of the exponent array in each stage is shown in Fig. 32.
Fig. 32 The usage of the exponent array for convergent block scaling with 8 blocks
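The index rule for the exponent units can be traced with a short script. The sketch assumes that the rule reads y = x + m / 2^s with the stage index s starting at 1, and that m is a power of two; the function names are hypothetical.

```python
def child_indices(x, s, m):
    """Exponent units written after block B_x of stage s is finished."""
    return x, x + m // (2 ** s)


def usage_by_stage(m):
    """List the exponent units in use after each stage until all m are used."""
    active = [0]                 # a single block B_0 in the first stage
    history = []
    s = 1
    while len(active) < m:
        nxt = []
        for x in active:
            nxt.extend(child_indices(x, s, m))
        active = sorted(nxt)
        history.append(active)
        s += 1
    return history


for stage, units in enumerate(usage_by_stage(8), start=1):
    print(stage, units)
# 1 [0, 4]
# 2 [0, 2, 4, 6]
# 3 [0, 1, 2, 3, 4, 5, 6, 7]
```

Under this reading each block keeps its own index x and its new sibling takes a unit that is still unused, so no exponent unit is overwritten while it is still needed.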
4.3 Restricted Number of Blocks
As mentioned for the convergent block scaling method, dynamic scaling scales only when necessary, which avoids loss of accuracy. Grouping the data into several blocks improves the SQNR further, since many exponents are available to represent data with different dynamic ranges. It is therefore natural to expect that a larger number of blocks gives higher precision. However, the convergent block scaling method divides each block into two from the first stage to the last. That is, the number of blocks, and with it the area of the exponent storage, doubles at every stage. For an N-point FFT, there are N/2 blocks in the last stage, so N/2 exponent units are required.
This costs a large amount of storage. Therefore, we define $B_{max} = 2^{s-1}$ as the total number of blocks in convergent block scaling, where the number of blocks is doubled only until stage s. Fig. 33 shows convergent block scaling with different Bmax.
Fig. 33 The convergent block scaling with (a) Bmax = 1 (b) Bmax = 2 (c) Bmax = 4
Taking as an example the 8192-point FFT with 16-bit wordlength and MRCBS, using the S2-type detector with 10-bit comparators, the SQNR and area are shown in Fig. 34. The storage area keeps increasing while the SQNR saturates as Bmax grows. This implies that in deeper stages we fail to obtain the expected SQNR gain even if we double the area of the exponent storage. If we divide the blocks only until stage 11, which requires only 1024 exponent units, the area overhead of the exponent storage is only 1/4 of that required when dividing until stage 13, yet the SQNR is just 0.13 dB lower. Thus, by doubling the number of blocks only until a certain stage rather than all the way to the last stage, we can economize on exponent storage while still obtaining the SQNR improvement we want.
Although the SQNR is not the highest achievable when we restrict the number of blocks, we still obtain an acceptable SQNR and reduce the area cost.
Fig. 34 The SQNR and area cost with different Bmax
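The storage figures quoted for the 8192-point example follow directly from the definition of Bmax. The sketch below assumes Bmax = 2^(s-1) for a split that stops at stage s and counts one exponent unit per block; the function names are hypothetical, and the SQNR values themselves come from simulation and are not reproduced here.

```python
def exponent_units(stop_stage):
    """Number of blocks, and exponent units, when splitting stops at stop_stage."""
    return 2 ** (stop_stage - 1)


def blocks_per_stage(n_stages, b_max):
    """Number of blocks in each stage when the doubling is capped at b_max."""
    return [min(2 ** (s - 1), b_max) for s in range(1, n_stages + 1)]


n = 8192
n_stages = n.bit_length() - 1            # 13 radix-2 stages
print(exponent_units(11))                # 1024
print(exponent_units(13) == n // 2)      # True
print(exponent_units(11) / exponent_units(13))   # 0.25
print(blocks_per_stage(n_stages, 1024)[-3:])     # [1024, 1024, 1024]
```

Stopping two stages early therefore cuts the number of exponent units to one quarter, which matches the ratio stated in the text.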
Chapter 5
Experimental Results
The proposed MRCBS method generates many hardware solutions for SQNR improvement and finds the one that meets the SQNR constraint with minimum area cost.
Here we define the performance pair (PP): (SQNR, AREA), which gives the SQNR performance together with the corresponding area cost. Each solution obtained by MRCBS has its own PP, defined as PPT: (SQNRT, AREAT), where SQNRT represents the total SQNR performance and AREAT represents the minimized total area cost.
The PPT is determined by the quintuple (N, WL, Type, BW, Bmax) where N is the given FFT size and WL is the wordlength of storage from 14 bits to 18 bits. The Type indicates different