Conditional Scaling - Scaling Method - Area Minimization Technique for Fast Fourier Transform Processors Using Multi-Region Conditional Block Scaling

Chapter 2 Preliminaries

2.4 Scaling Method

2.4.3 Conditional Scaling

Instead of storing results in buffers and detecting the largest value to decide the scaling flag, the conditional scaling method uses prediction to avoid overflow while saving the area overhead of the buffers. Fig. 16 shows the architecture of the conditional scaling method. In detail, conditional scaling predetermines the scaling flag of the current stage from the detections made in the previous stage; the detections in the current stage then predetermine the scaling flag of the next stage. Therefore, conditional scaling does not need buffers to store intermediate results. Moreover, conditional scaling has only one shared exponent because it uses the fixed-point representation. As a result, the alignment unit is unnecessary.

We can perform the additions and subtractions directly on the two inputs of the butterfly.

Fig. 16 The architecture of the conditional scaling method

The criterion for deciding the scaling flag of the next stage is whether any value in the current stage lies outside a particular region of the complex plane [13].

Besides, (2.10) tells us that $|X_{m+1}(p)|,\ |X_{m+1}(q)| \le |X_m(p)| + |X_m(q)|$. If $|X_m(p)|$ and $|X_m(q)|$ are both smaller than 0.5, $|X_{m+1}(p)|$ and $|X_{m+1}(q)|$ will be smaller than one.

Therefore, the region with which to compare the data in current stage is the circle of radius 0.5. As shown in Fig. 17, the circle of radius 0.5 is the idealized threshold for deciding the scaling flag of next stage.

Fig. 17 The particular region of the complex plane in conditional scaling method

The magnitudes of the results are checked during the butterfly computations. If all output data in the current stage are inside the region of radius 0.5, no data in the next stage can have a magnitude larger than one and cause overflow. Thus, the scaling flag is not set and the exponent is kept. Conversely, if at least one datum lies outside the circular region of radius 0.5, the butterfly computation in the next stage may overflow. As a result, when the butterflies of the next stage are computed, the results are scaled and the exponent is increased by one.
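The prediction loop described above can be sketched in a few lines of Python. This is a minimal behavioral model under stated assumptions, not the hardware of [13]: it uses floating-point values instead of fixed-point words, assumes a radix-2 decimation-in-frequency flow, and assumes the inputs are checked once to set the flag of the first stage. The function name `conditional_scaling_fft` and its return convention are hypothetical.

```python
import cmath

def conditional_scaling_fft(x, threshold=0.5):
    """Radix-2 DIF FFT with one shared exponent and a predicted scaling flag.

    Returns (data, exponent, peak). The spectrum is data * 2**exponent,
    stored in bit-reversed order; peak is the largest magnitude produced.
    """
    n = len(x)  # assumed to be a power of two
    data = list(x)
    exponent = 0
    peak = 0.0
    # Assumption: the inputs are inspected to predict the first stage.
    flag = any(abs(v) >= threshold for v in data)
    half = n // 2
    while half >= 1:
        scale = 0.5 if flag else 1.0   # flag predicted in the previous stage
        exponent += 1 if flag else 0
        flag = False                   # now predict the next stage
        stride = n // (2 * half)       # twiddle index stride of this stage
        for start in range(0, n, 2 * half):
            for j in range(half):
                p, q = start + j, start + j + half
                w = cmath.exp(-2j * cmath.pi * j * stride / n)
                a, b = data[p], data[q]
                data[p] = (a + b) * scale
                data[q] = (a - b) * w * scale
                for out in (data[p], data[q]):
                    peak = max(peak, abs(out))
                    if abs(out) >= threshold:  # outside the radius-0.5 circle
                        flag = True
        half //= 2
    return data, exponent, peak
```

No intermediate buffer is needed in this model: each output is written back immediately, and only the one-bit flag and the exponent carry information between stages.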

Because conditional scaling predicts the necessity of scaling, its SQNR performance is much higher than that of forced scaling but a little lower than that of BFP scaling. However, calculating the magnitude of a complex number requires squaring the real part and the imaginary part, and the required multipliers and adders cost area.

Alternatively, to reduce hardware cost, the maximal cyclic quadrilateral of the circle, which is the dotted square in Fig. 17, is chosen to define the particular region [14]. As a result, only comparators are needed to detect the region information of the data.
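The difference between the two region tests can be illustrated with a short sketch. The function names are hypothetical, and the half side length is assumed to be that of the square inscribed in the radius-0.5 circle, that is 0.5·√2/2 ≈ 0.354.

```python
import math

# Half side of the square inscribed in the radius-0.5 circle (assumption).
H0 = 0.5 * math.sqrt(2) / 2

def outside_circle(z, radius=0.5):
    # Needs two squarings and one addition before the comparison.
    return z.real * z.real + z.imag * z.imag >= radius * radius

def outside_square(z, half_side=H0):
    # Needs only comparisons on the absolute real and imaginary parts.
    return abs(z.real) >= half_side or abs(z.imag) >= half_side
```

For example, 0.3 + 0.3j is inside both regions, while 0.45 + 0j is inside the circle but outside the square, so the square test is the more conservative of the two.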

In this thesis, we combine the concepts of the BFP scaling method and the conditional scaling method to exploit the advantages of both.

Chapter 3 Motivation

To satisfy the required SQNR performance, we choose a scaling approach that produces a sufficiently high SQNR with less area. However, once the constraint becomes tighter and the original design no longer satisfies the requirement, using a longer wordlength is the only way to further increase the accuracy. Based on our simulation experience, increasing the wordlength by one bit yields about 6 dB of SQNR improvement at the cost of about 6% additional area.

However, sometimes we do not need to increase SQNR by that much to meet the constraint. Thus, by improving conditional scaling and modifying block floating point scaling, we obtain just the SQNR improvement that is needed, with a correspondingly smaller area overhead.

3.1 Multi-Region Detection

With the approaches of [13, 14], the complex plane is divided into two regions to detect the region information of the butterfly outputs. The traditional conditional scaling method avoids overflow in the current stage by ensuring that all data in the previous stage lie in the internal region of radius 0.5. However, overflow comes from the addition and subtraction operations in the butterfly, which increase the data magnitude, and each butterfly computation involves only its two input data. That is to say, restricting all data to the same region to avoid overflow is excessively strict. To avoid overflow, we only need to ensure that $|X_m(p)| + |X_m(q)|$ is smaller than one, as (2.10) says.

We assume that we are now computing butterflies in stage m-1 and that the complex plane is divided into two regions, as Fig. 18(a) shows. $X_m(q)$ is inside $R_0$, $X_m(p)$ is outside $R_0$, and both are inputs of the same butterfly in stage m. The previous conditional scaling method judges that overflow will occur in stage m and sets the scaling flag of stage m. However, overflow can also be avoided as long as $|X_m(q)|$ is small enough.

Therefore, we try to divide the complex plane into more regions. Fig. 18(b) shows the idea with four regions. In this case, $X_m(p)$ is inside $R_{+1}$ and $X_m(q)$ is inside $R_{-1}$, where the radius of $R_{+1}$ is 0.7 and the radius of $R_{-1}$ is 0.2. By ensuring that the sum of the radii of $R_{+1}$ and $R_{-1}$ is less than unity, we can judge that the butterfly in stage m that operates on these two data will not cause overflow. That is to say, dividing the complex plane into more regions prevents more unnecessary scaling operations and produces better SQNR performance. We can expect that the more regions the complex plane is divided into, the higher the precision that can be obtained.

Fig. 18 The complex plane divided into (a) two regions and (b) four regions
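The four-region example can be checked with the triangle inequality. The worked bound below is a sketch that assumes the twiddle factor has unit magnitude, so that each butterfly output is bounded by the sum of the input magnitudes as in (2.10), and that the quoted radii 0.7 and 0.2 bound $|X_m(p)|$ and $|X_m(q)|$ respectively.

```latex
\begin{align}
|X_{m+1}(p)|,\ |X_{m+1}(q)| &\le |X_m(p)| + |X_m(q)| \\
                            &< 0.7 + 0.2 = 0.9 < 1
\end{align}
```

The two-region test would set the scaling flag here whenever $|X_m(p)| \ge 0.5$, although the bound shows that this butterfly cannot overflow.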

3.2 Convergent Block Scaling

The hardware for floating point arithmetic is more complicated than that for fixed point arithmetic. As mentioned in 2.4.2, one part of the area overhead of the BFP scaling method is the alignment unit, which is needed because the floating point representation is used. Since the inputs of a butterfly may come from different blocks with different exponents, we cannot operate on the two mantissas directly without alignment. However, finding the larger exponent and shifting the smaller mantissa costs area and processing latency. If we want to save the hardware of the alignment unit, we must ensure that the two inputs of a butterfly come from the same block, which means that they always share the same exponent.

It can be observed in Fig. 5 that, during the decomposition of the FFT algorithm, a k-point DFT in stage m is separated into two k/2-point DFTs in stage m+1, and the computation of the first k/2 data in stage m+1 depends only on the first k/2 data in stage m. Thus we group the data into blocks in the convergent way described in [25]; the idea is shown in Fig. 19. In the first stage, all data are grouped into one block, so the number of blocks and the number of shared exponents are both equal to one. Afterwards, the number of blocks and shared exponents doubles from stage to stage as the block size is halved. In this way, the inputs of every butterfly in each stage are guaranteed to come from the same block and share the same exponent.

Fig. 19 An example of 8-point FFT with convergent block scaling
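The convergent grouping can be expressed as a simple index mapping. The sketch below assumes 0-based stage numbers and an in-place decimation-in-frequency data flow, in which the butterflies of stage s pair indices that are n / 2^(s+1) apart; the helper names `block_index` and `butterfly_pairs` are hypothetical.

```python
def block_index(i, stage, n):
    """Block that holds data index i at a 0-based stage of an n-point FFT."""
    block_size = n >> stage  # n, n/2, n/4, ...
    return i // block_size

def butterfly_pairs(stage, n):
    """Index pairs (p, q) combined by the butterflies of a stage (assumed DIF flow)."""
    half = n >> (stage + 1)
    return [(s + j, s + j + half)
            for s in range(0, n, 2 * half)
            for j in range(half)]
```

For an 8-point FFT this gives 1, 2, and 4 blocks in the three stages, and every pair returned by `butterfly_pairs` falls inside a single block, which is the property that removes the alignment unit.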

With the convergent block scheme, the data are represented in different dynamic ranges with different exponents, so the SQNR is higher than that of the forced scaling scheme, where all data are in the same dynamic range. Besides, the number of blocks is a key factor for SQNR improvement: a larger number of blocks results in better SQNR performance. However, such block scaling methods require additional storage area for the shared exponents. In addition, since conditional scaling assumes that all data are in the same dynamic range with the same exponent, the convergent block scheme is naturally suitable for implementation with conditional scaling in the fixed point representation.

3.3 Our Strategy

We have seen that the SQNR performance can be improved in two ways: multi-region conditional scaling and convergent block scaling. Therefore, we propose the multi-region conditional block scaling (MRCBS) method, which combines these two methods to obtain many hardware architecture solutions for SQNR improvement. By searching those solutions, we can find the one that has the minimum area cost while meeting the required SQNR performance.

3.4 Problem Formulation

Given the FFT size and the required SQNR, our goal is to minimize the area of a memory-based radix-2 FFT processor under the given SQNR constraint by applying our MRCBS method.

Chapter 4 The Proposed Approach

In this chapter, we present the proposed MRCBS method for memory-based FFT, which exploits the advantages of conditional scaling and convergent block scaling to improve SQNR performance. The first section describes the scheduling of the butterfly computations, which allows overflow to be predicted precisely and saves additional storage. The second section illustrates MRCBS and its architecture. Finally, the third section discusses MRCBS with different numbers of blocks and the relationship between the number of blocks and the area and SQNR performance. MRCBS generates many solutions for improving SQNR, and the purpose of this thesis is to find the scaling architecture for FFT that meets the SQNR requirement with the smallest area.

4.1 Scheduling of Butterfly Computation

In order to predict overflow precisely and prevent unnecessary scaling, we should detect the magnitudes of the two data that are the inputs of the same butterfly in the next stage.

As Fig. 20 shows, where BU abbreviates butterfly unit, BU₁ and BU₂ are in the current [...] we can predict overflow and determine the scaling flag for BU₃. Fortunately, $X_2$ and $X_3$ are both available as well, so we can predict overflow for BU₃ and BU₄ simultaneously. That is, once two butterflies are finished in the current stage, we can make the predictions for two butterflies in the next stage. As a result, only four registers are required to store the results of BU₁ and BU₂ for overflow prediction. When the predictions for BU₃ and BU₄ are finished and the scaling flags are determined, those four registers can be reused to store the results of other butterflies.

Furthermore, compared with the original order, only a small amount of extra control circuitry is required to schedule the butterfly computations in this way.

Fig. 20 Detection of the two data in the same butterfly of next stage
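One way to obtain such an order is sketched below. It assumes an in-place radix-2 decimation-in-frequency flow with 0-based stages, which may differ from the indexing used in Fig. 20; the function name `schedule` and the tuple layout are hypothetical. Each group lists two butterflies of the current stage followed by the two butterflies of the next stage whose inputs they complete.

```python
def schedule(stage, n):
    """Group the butterflies of a stage in pairs (BU1, BU2) whose four outputs
    are exactly the inputs of two next-stage butterflies (BU3, BU4).

    Valid for every stage except the last one (assumed DIF index pattern).
    """
    half = n >> (stage + 1)   # index distance inside a current-stage butterfly
    quarter = half // 2       # index distance inside a next-stage butterfly
    groups = []
    for start in range(0, n, 2 * half):
        for j in range(quarter):
            p = start + j
            bu1 = (p, p + half)
            bu2 = (p + quarter, p + quarter + half)
            bu3 = (p, p + quarter)                 # consumes the upper outputs
            bu4 = (p + half, p + half + quarter)   # consumes the lower outputs
            groups.append(((bu1, bu2), (bu3, bu4)))
    return groups
```

For an 8-point FFT, the first group of stage 0 is the butterflies on indices (0, 4) and (2, 6), which together supply the inputs of the stage-1 butterflies on (0, 2) and (4, 6). Only the four values of one group have to be held for prediction at any time.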

4.2 Multi-Region Conditional Block Scaling

Since the idea of conditional scaling is to predict overflow and predetermine the scaling flag for the next stage, it does not need intermediate buffers to store the output data in order to determine the scaling flag of the current stage. Therefore, we develop the architecture for our scaling method shown in Fig. 21. The memory block is the original part of the traditional memory-based FFT architecture shown in Fig. 6, and there is one butterfly unit in the PE block in our work. The detector detects the region information and the predictor predicts possible overflow. The shared exponents and the scaling flags of each block are stored in the exponent array.

Fig. 21 The architecture of the proposed MRCBS

When the FFT is evaluated, two data are read from the memory in each cycle and computed in the BU. The scaling flag predetermined in the previous stage is read from the exponent array to scale the butterfly results in the current stage. After the butterfly computation is finished, the results are written directly back to the memory. Meanwhile, the results are passed to the detector, which determines their region information from their magnitudes. The predictor then receives the region information from the detector and judges whether overflow will occur in the next stage. After the prediction is finished, the predetermined scaling flags and the shared exponents of the next stage are stored in the exponent array. Moreover, the detector and the predictor work in parallel with the butterfly unit, since the butterfly results can be written back to the memory without waiting for their outcome. Thus, this architecture does not introduce a large processing latency. The details of the detector, the predictor, and the exponent array are described in the following subsections.

4.2.1 Region Detector

Because we divide the complex plane into many regions, the overflow detector consists of comparators that determine the region information of the data by comparing the butterfly outputs with several thresholds. A detector that divides the complex plane into circular regions with different radii is called a circular-type detector. A detector that divides the complex plane into square regions with different side lengths is called a square-type detector. Because each square region is the maximal cyclic quadrilateral of the corresponding circular region, it covers a smaller area and its prediction is stricter. As a result, the square-type detector improves SQNR less than the circular-type one but also adds less area.

The purpose of multi-region detection is to handle the situation shown in Fig. 18, where $X_m(p)$ is outside the internal region but $X_m(q)$ is deep inside it, so that the pair is actually overflow-free in the next stage. Therefore, we define an additional pair of regions, one larger and one smaller. As a result, the case with a larger $X_m(p)$ and a smaller $X_m(q)$, or vice versa, can be judged to be overflow-free. That is why we divide the complex plane into an even number of regions. In our work, we divide the complex plane into two, four, and six regions and implement circular-type and square-type detectors for each. That is, there are six different detectors in total, with different area overhead and different SQNR performance.

After the detection is finished, the detector outputs the region information, which indicates the region in which the data is located.

4.2.1.1 Circular-Type Detector

First we discuss the region detector that divides the complex plane into two regions, as Fig. 22(a) shows. This type of detector is named “C2”. The internal region $R_0$ is defined as the circle of radius 0.5, so the threshold $t_0$ representing the radius of $R_0$ is equal to 0.5.

Next we divide the complex plane into four regions. This type of detector is named “C4”. Additional regions $R_{-1}$ and $R_{+1}$ are defined as shown in Fig. 22(b). $R_{-1}$ is the circle of radius $t_{-1}$ and $R_{+1}$ is the annulus with inner radius $t_0$ and outer radius $t_{+1}$, and we have to ensure that $t_{-1}$ plus $t_{+1}$ does not exceed one. Because $t_{-1}$ is strictly larger than the magnitude of $X_m(q)$ and $t_{+1}$ is strictly larger than that of $X_m(p)$, the sum and the difference of those two complex data cannot reach magnitude one and cause overflow. Therefore, once $X_m(p)$ is outside $R_0$ but inside $R_{+1}$ while $X_m(q)$ is inside $R_{-1}$, the butterfly that computes $X_m(p)$ and $X_m(q)$ in the next stage is judged to be overflow-free.

Fig. 22 The regions of the circular-type detectors (a) C2 (b) C4 (c) C6
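The C4 decision can be sketched as a region classifier followed by a pair test. The thresholds 0.375, 0.5, and 0.625 are the C4 values reported later in this section; the integer labels, the `OUTSIDE` label, and the rule "the labels must sum to at most zero" are assumptions of this sketch rather than the predictor logic of the thesis. The rule follows from the thresholds being increasing with $t_{-1} + t_{+1} = 1$ and $t_0 + t_0 = 1$.

```python
# C4 thresholds t_-1, t_0, t_+1 as reported in this section.
C4_THRESHOLDS = {-1: 0.375, 0: 0.5, 1: 0.625}
OUTSIDE = 2  # hypothetical label for data beyond t_+1

def region(z, thresholds=C4_THRESHOLDS):
    """Label of the innermost region whose outer radius exceeds |z|."""
    for label in sorted(thresholds):
        if abs(z) < thresholds[label]:
            return label
    return OUTSIDE

def overflow_free(zp, zq):
    """Predict whether the butterfly on zp and zq cannot overflow.

    Labels a and b bound the magnitudes by t_a and t_b; a + b <= 0
    implies t_a + t_b <= 1 for the C4 thresholds above.
    """
    return region(zp) + region(zq) <= 0
```

With these thresholds, the pair (0.6, 0.2) is judged overflow-free although 0.6 lies outside $R_0$, whereas the two-region test would have set the scaling flag. The pair (0.7, 0.1) is still flagged because 0.7 lies beyond $t_{+1}$, which shows that the prediction stays conservative.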

Finally, we further divide the complex plane into six regions as shown in Fig. 22(c); this type of detector is named “C6”. In addition to the four existing regions of the C4 detector, the regions $R_{-2}$ and $R_{+2}$ are defined. In the C6 detector, $R_{-2}$ is the circle of radius $t_{-2}$, $R_{+2}$ is the annulus with inner radius $t_{+1}$ and outer radius $t_{+2}$, and $R_{-1}$ becomes the annulus with inner radius $t_{-2}$ and outer radius $t_{-1}$. As in C4, the sum of $t_{-2}$ and $t_{+2}$ should not exceed one.

As mentioned above, $t_{-k}$ plus $t_{+k}$, where k = 1 or 2, should not exceed one to avoid overflow. If $t_{-k}$ is larger, $t_{+k}$ must be smaller; meanwhile, the area of $R_{-k}$ becomes larger as the area of $R_{+k}$ becomes smaller. Since our purpose is to avoid unnecessary scaling as accurately as possible, the two regions should be as large as possible, and the probability of data falling in $R_{-k}$ should equal the probability of data falling in $R_{+k}$. As a result, we use the two conditions (4.1) and (4.2) to determine the thresholds $t_k$ of the detectors. The resulting thresholds are shown in Table 1.

$t_{-k} + t_{+k} = 1$ (4.1)

$\mathrm{Area}(R_{-k}) = \mathrm{Area}(R_{+k})$ (4.2)
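For the C4 detector the two conditions can be solved by hand. The derivation below is a sketch that reads (4.1) as the equality $t_{-1} + t_{+1} = 1$ and uses the region definitions given above: $R_{-1}$ is the disc of radius $t_{-1}$ and $R_{+1}$ is the annulus between $t_0 = 0.5$ and $t_{+1}$.

```latex
\begin{align}
\pi t_{-1}^2 &= \pi \left( t_{+1}^2 - t_0^2 \right) \\
t_{-1}^2 &= (1 - t_{-1})^2 - 0.25 \\
0 &= 0.75 - 2\,t_{-1} \\
t_{-1} &= 0.375, \qquad t_{+1} = 1 - t_{-1} = 0.625
\end{align}
```

These values agree with the thresholds at which the simulated SQNR in Fig. 23 is reported to be almost the highest.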

Table 1 The value of the thresholds in circular-type detectors

Here we sweep $t_{-1}$ from 0 to 0.5 and simulate the SQNR performance; the result is shown in Fig. 23. As we can see, SQNR is almost the highest when $t_{-1}$ is equal to 0.375 and $t_{+1}$ is equal to 0.625, as we expect.

Fig. 23 The simulation result of SQNR with different $t_{-1}$

Because the regions are all circles in the complex plane, we have to calculate the magnitude of each complex datum by summing the squares of its real part and its imaginary part. As a result, multipliers and adders are introduced in addition to the comparators that compare the result with the thresholds, as shown in Fig. 24. Intuitively, C6 has the best performance and the largest comparator area since it compares against the largest number of thresholds, while C2 has the smallest comparator area and the worst performance.

Fig. 24 The block diagram of the circular-type detector

However, the bit width BW of the multipliers, adders, and comparators influences both the area and the accuracy. That is to say, an arithmetic unit with a longer bit width produces better accuracy but costs more area. In our work, we implement 10-bit comparators, BW-bit multipliers, and 2*BW-bit adders, where BW is an integer that can be chosen from 5 to 10.
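The effect of the bit width on the detection can be illustrated with a small model. This is a sketch under assumptions that are not taken from the thesis: the absolute real and imaginary parts are rounded up to BW fractional bits before squaring so that the detector never underestimates the magnitude, the squared sum is compared with the squared C4 thresholds so that no square root is needed, and the names `quantize_up` and `circular_region` are hypothetical.

```python
import math

def quantize_up(v, bw):
    """Round |v| up to a multiple of 2**-bw (assumed conservative rounding)."""
    scale = 1 << bw
    return math.ceil(abs(v) * scale) / scale

def circular_region(re, im, bw, thresholds=(0.375, 0.5, 0.625)):
    """Index of the innermost circle containing (re, im); len(thresholds) if none.

    Compares re^2 + im^2 with t^2, using BW-bit operands for the squarings.
    """
    r = quantize_up(re, bw)
    i = quantize_up(im, bw)
    mag2 = r * r + i * i
    for idx, t in enumerate(thresholds):
        if mag2 < t * t:
            return idx
    return len(thresholds)
```

For the point 0.36 + 0.1j, whose magnitude is just below 0.375, the model with BW = 10 still places the point in the innermost circle, while BW = 5 rounds the operands far enough up to push it into the next region. This is one way a shorter bit width can cost prediction accuracy while saving multiplier area.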

4.2.1.2 Square-Type Detector

Although the circular-type region detectors make precise predictions, they cost a lot of area because of the multipliers and adders. To reduce hardware cost, the square-type region detectors [14] offer an alternative: we simplify the circular regions to their maximal cyclic quadrilaterals. The square regions are shown in Fig. 25. As with the circular-type detectors, “S2” is the square-type detector that divides the complex plane into two square regions, “S4” is the detector that divides it into four square regions, and “S6” is the detector that divides it into six square regions.

Because each square region shown in Fig. 25 is the maximal cyclic quadrilateral of the corresponding circular region shown in Fig. 22, the thresholds of the square-type detectors are defined as (4.3), where k = -2, -1, 0, 1, and 2. The thresholds $h_k$ of the square-type detectors are listed in Table 2.

$h_k = \frac{\sqrt{2}}{2}\, t_k$ (4.3)

Table 2 The value of the thresholds in square-type detectors
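As a quick check of (4.3), the square thresholds for the three C4 radii quoted in the text can be computed directly. The dictionary names and the helper `square_region` are hypothetical, and only the thresholds for k = -1, 0, 1 are shown because those are the circular values stated in this section.

```python
import math

# Circular thresholds t_k for k = -1, 0, +1 as quoted in the text.
circular = {-1: 0.375, 0: 0.5, 1: 0.625}

# (4.3): h_k = t_k * sqrt(2) / 2, rounded to three decimals.
square = {k: round(t * math.sqrt(2) / 2, 3) for k, t in circular.items()}

def square_region(z, thresholds=square):
    """Label of the innermost square containing z, or max label + 1 if none."""
    side = max(abs(z.real), abs(z.imag))  # comparators only, no multipliers
    for label in sorted(thresholds):
        if side < thresholds[label]:
            return label
    return max(thresholds) + 1
```

The computed values 0.265, 0.354, and 0.442 match the sweep limit and the best-performing $h_{-1}$ and $h_{+1}$ reported for the S4 detector.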

Fig. 25 The regions of the square-type detectors (a) S2 (b) S4 (c) S6

We also sweep $h_{-1}$ from 0 to 0.354 and simulate the SQNR performance of the S4-type detector; the result is shown in Fig. 26. The SQNR is almost the highest when $h_{-1}$ is equal to 0.265 and $h_{+1}$ is equal to 0.442, as we expect.

Fig. 26 The simulation result of SQNR with different $h_{-1}$

To detect the region information for the data in square-type detectors, we only need to compare the absolute value of the real part and imaginary part with the half of the side lengths of those squares. The block diagram of square-type detector is shown in Fig. 27. The only difference between circular type and square type is that square type does not need the
