Function Evaluation for Non-uniform Segmentation

after the second ACC(2) is needed to ensure one valid noise sample is fed to the multiplexor every clock cycle, rather than two valid samples every two clock cycles.

Two further remarks about this architecture can be made. First, it is possible to speed up the output rate further by having multiple noise generators running in parallel, provided that the LFSRs are initialized with different random seeds. Second, the periodicity can be increased by using larger LFSRs, and higher σ values can be obtained by using more bits for u₁, both with little increase in complexity.

Figure 6.2: The f function. The asterisks indicate the boundaries of the linear approximations. (Plot not reproduced; axes: u from 0 to 1, f(u) from 0 to 5.)

look at the entire function over the given domain, and therefore we do not need to have two stages. As shown in Figure 6.2, the greatest non-linearities of the f function occur in the regions close to zero and one. If uniform segments are used, a large number of small segments would be required to get accurate approximations in the non-linear regions. However, in the middle part of the curve where it is relatively linear, accurate approximation can be obtained using relatively few segments. It would be efficient to use small segments for the non-linear regions, and large segments for linear regions. Arbitrary-sized segments would enable us to have the least error for a given number of segments; however, the hardware to calculate the segment address for a given input can be complex. Our objective is to provide near arbitrary-sized segments with a simple circuit to find the segment address for a given input.

We have developed a novel method for constructing piecewise linear approximations. The main features of our proposed method are: (a) the segment lengths used in a given region depend on the local linearity, with more segments deployed in regions of higher non-linearity; and (b) the boundaries between segments are chosen such that the segment to use for a given input can be identified rapidly. The method is based on early ideas that led to the hierarchical segmentation method (HSM) described in Chapter 5. It is not as sophisticated as HSM, but is sufficient to generate high-quality Gaussian noise samples.

As an example to illustrate our approach, consider approximating f with an 8-bit input. In the traditional approach, the most-significant bits of u are used to index uniform segments. For instance, if the four most-significant bits are used, 16 uniform segments approximate the function. With our approach, it is possible to adopt small segments for non-linear regions (near 0 and 1) and large segments for linear regions (around 0.5). The idea is to use segments that grow by a factor of two from 0 to 0.5, and segments that shrink by a factor of two from 0.5 to 1, along the horizontal axis of Figure 6.2.
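To make the comparison concrete, the following Python sketch contrasts 16 uniform segments with power-of-two segments for an 8-bit input. It rests on several assumptions that are not part of the text: f is taken to be sqrt(−ln u), a stand-in with the same qualitative shape as Figure 6.2; each 8-bit code x is mapped to u = (x + 0.5)/256 so that u never reaches 0 or 1; each segment is approximated by the chord through its first and last sample; and the smallest segment spans two codes. All function names are illustrative.

```python
import math

N_BITS = 8
SIZE = 1 << N_BITS  # 256 input codes


def f(u):
    # Assumed stand-in for the f of Figure 6.2 (not defined in this section).
    return math.sqrt(-math.log(u))


def code_to_u(x):
    # Assumption: map a code to the centre of its bin so u is never 0 or 1.
    return (x + 0.5) / SIZE


def uniform_boundaries(msb_bits=4):
    # Segment edges when the top msb_bits of the input index the segments.
    step = SIZE >> msb_bits
    return list(range(0, SIZE + 1, step))


def power_of_two_boundaries():
    # Segments grow by factors of two up to 0.5, then shrink towards 1.
    growing = [0] + [1 << k for k in range(1, N_BITS)]            # 0, 2, ..., 128
    shrinking = [SIZE - (1 << k) for k in range(N_BITS - 2, 0, -1)]  # 192, ..., 254
    return growing + shrinking + [SIZE]


def max_chord_error(boundaries):
    # Worst absolute error over all codes when each segment uses the chord
    # through its first and last sample.
    worst = 0.0
    for lo, hi in zip(boundaries, boundaries[1:]):
        u_lo, u_hi = code_to_u(lo), code_to_u(hi - 1)
        f_lo, f_hi = f(u_lo), f(u_hi)
        slope = (f_hi - f_lo) / (u_hi - u_lo) if hi - lo > 1 else 0.0
        for x in range(lo, hi):
            approx = f_lo + slope * (code_to_u(x) - u_lo)
            worst = max(worst, abs(f(code_to_u(x)) - approx))
    return worst
```

Under these assumptions, `max_chord_error(power_of_two_boundaries())` comes out smaller than `max_chord_error(uniform_boundaries())`, even though the non-uniform scheme uses 14 segments rather than 16.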

We use segment boundaries at locations 2ⁿ⁻⁸ and 1 − 2⁻ⁿ, where 0 ≤ n < 8. Up to 14 segments can be formed this way. A circuit based on prefix computation (Figure 6.3, the same circuit as used for HSM in Chapter 5) calculates the segment address for a given input x. It counts the leading zeros and ones of x to work out the segment address. A cascade of OR gates is used for segments that grow by factors of two, and a cascade of AND gates is used for segments that shrink by factors of two; these circuits can be pipelined, and a circuit with a shorter critical path but requiring more area can be used [80]. Note that the segments do not have to grow or shrink by factors of two; larger factors can be used. The appropriate taps are taken from the cascades depending on the choice of segments, and are added to work out the segment address. In Figure 6.3, the maximum available taps are taken, giving 14 segment addresses. Some taps would not be taken if the segments grow or shrink by more than a factor of two. It can be seen that the critical path of this circuit is the path from x₆ or x₇ to the output of the adder. By introducing pipeline registers between the gates, higher throughput can easily be achieved.

Figure 6.3: Circuit to calculate the segment address for a given input x. The adder counts the number of ones in the output of the two prefix circuits. Note that the least-significant bit x₀ is not required. (Diagram not reproduced; inputs x₇ to x₁ feed the two prefix circuits, whose outputs are summed by the adder to give the segment address.)
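The behaviour of the segment address circuit can be modelled in software. The Python sketch below is one reading of the description, not a transcription of Figure 6.3: it assumes seven OR-cascade taps (over x₇ down to x₁) and six AND-cascade taps (over x₆ down to x₁, each ANDed with all higher bits), which together yield addresses 0 to 13. Because x₀ is never examined, the boundary at 2⁻⁸ is not resolved in this model. The names `segment_address` and `reference_address` are illustrative.

```python
from bisect import bisect_right


def segment_address(x, n_bits=8):
    """Bit-level model: two prefix cascades whose taps are summed."""
    or_prefix = 0    # running OR of x_{n-1} .. x_k
    and_prefix = 1   # running AND of x_{n-1} .. x_k
    address = 0
    for k in range(n_bits - 1, 0, -1):   # x_0 is never examined
        bit = (x >> k) & 1
        or_prefix |= bit
        and_prefix &= bit
        address += or_prefix             # OR-cascade tap
        if k < n_bits - 1:
            # AND-cascade tap; the top bit alone is already counted above.
            address += and_prefix
    return address


def reference_address(x, n_bits=8):
    """Same address obtained by searching the boundary list directly."""
    size = 1 << n_bits
    growing = [1 << k for k in range(1, n_bits)]                    # 2, 4, ..., 128
    shrinking = [size - (1 << k) for k in range(n_bits - 2, 0, -1)]  # 192, ..., 254
    return bisect_right(growing + shrinking, x)
```

For every 8-bit input the two functions agree, which suggests that counting ones in the prefix outputs is equivalent to a search over the power-of-two boundaries, without any comparators.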

When approximating f with 32-bit inputs based on polynomials of the form

p(u) = c₁ × u + c₀    (6.6)

the gradient of the steepest part of the curve is on the order of 10⁸, so large multipliers would be required. To overcome this problem, we use scaling factors that are powers of two to reduce the magnitude of the gradient, essentially trading precision for range. This is appropriate since the larger the gradient, the less important precision becomes. The scaling factors give the user control over the precision of both c₁ and c₀, and hence over the size of the multiplier and adder. For each segment, four coefficients are therefore stored: c₁ and its scaling factor, and c₀ and its scaling factor. Note that the precision of the approximation p(x) depends on the maximum error desired between p(x) and the actual function.
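The trade between precision and range can be illustrated with a small Python sketch. It assumes, purely for illustration, that each coefficient is stored as a signed integer mantissa of a chosen width together with a power-of-two exponent; the actual coefficient encoding is not given in this section, and `quantize_scaled` and `reconstruct` are hypothetical names.

```python
import math


def quantize_scaled(c, mantissa_bits=16):
    """Return (m, s) such that c is approximately m * 2**s.

    m is a signed integer with |m| < 2**(mantissa_bits - 1); s is the
    power-of-two scaling factor (a shift amount in hardware).
    """
    if c == 0.0:
        return 0, 0
    # frexp gives |c| = frac * 2**e with 0.5 <= frac < 1.
    e = math.frexp(abs(c))[1]
    s = e - (mantissa_bits - 1)
    m = round(math.ldexp(c, -s))
    if abs(m) == 1 << (mantissa_bits - 1):  # rounding spilled into an extra bit
        m, s = m // 2, s + 1
    return m, s


def reconstruct(m, s):
    """Value represented by mantissa m and scaling factor s."""
    return math.ldexp(m, s)
```

For example, `quantize_scaled(-4.56e8, 16)` returns `(-27832, 14)`: a gradient of order 10⁸ is held in a 16-bit mantissa, so the multiplier stays small, at the cost of an absolute error that grows with the scaling factor. The value −4.56e8 is only an illustrative magnitude.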

It is also possible to divide the input interval into uniform or non-uniform intervals, and have uniform or non-uniform segments inside each interval. In this case, the most-significant bits are used to address the intervals, and the least-significant bits are used to address the segments inside each interval. It can be seen that one can have any number of nested combinations of uniform and non-uniform segments. This hybrid combination of nested uniform and non-uniform segments provides a flexible way to choose the segment boundaries.
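A two-level version of this nesting can be sketched in Python as follows. The sketch assumes uniform intervals selected by the top bits of the input and, inside each interval, either uniform segments or power-of-two segments derived from the leading zeros and ones of the remaining bits. The `layout` parameter and all function names are hypothetical and serve only to illustrate the addressing.

```python
def nonuniform_address(r, width):
    """Address from the leading zeros and ones of a width-bit value.

    The least-significant bit of r is ignored.
    """
    # Number of OR-cascade taps that are set: position of the highest
    # set bit among bits width-1 .. 1.
    or_taps = (r >> 1).bit_length()
    # Leading ones of r, counted over bits width-1 .. 1.
    inverted = ~r & ((1 << width) - 1)
    leading_ones = min(width - inverted.bit_length(), width - 1)
    # The first leading one is already covered by the OR taps.
    and_taps = max(leading_ones - 1, 0)
    return or_taps + and_taps


def uniform_address(r, width, seg_bits):
    """Address given by the top seg_bits of a width-bit value."""
    return r >> (width - seg_bits)


def hybrid_address(x, n_bits, interval_bits, layout):
    """Return (interval, segment) for input x.

    layout[i] is None when interval i uses power-of-two segments, or the
    number of bits used to index uniform segments inside interval i.
    """
    rest_bits = n_bits - interval_bits
    interval = x >> rest_bits               # most-significant bits
    r = x & ((1 << rest_bits) - 1)          # least-significant bits
    seg_bits = layout[interval]
    if seg_bits is None:
        return interval, nonuniform_address(r, rest_bits)
    return interval, uniform_address(r, rest_bits, seg_bits)
```

With `layout = [None, 1, 1, None]`, an 8-bit input and two interval bits, the two outer intervals each receive ten power-of-two segments and the two inner intervals two uniform segments each. Deeper nesting would repeat the same split on the remaining bits.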

The architecture of our function evaluator, shown in Figure 6.4, is based on first-order polynomials. The most-significant bits are used to select the interval, and the least-significant bits are passed through the segment address calculator, which calculates the segment address within the interval. The ROM outputs the four coefficients for the chosen interval and segment. c₁ is multiplied by the input x, and the scaling factor of c₁ is used to scale the product. The scaling circuit consists of shifters, which increase or decrease the value by powers of two. The scaled product is then added to the scaled c₀ coefficient to produce the final result.
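The datapath described above (ROM lookup, multiply, scale, add) can be modelled end to end in Python. This is a sketch under stated assumptions rather than the design of Figure 6.4: it uses an 8-bit input with a single interval of 14 power-of-two segments, takes f to be sqrt(−ln u) as a stand-in, maps code x to u = (x + 0.5)/256, fits each segment with a chord, stores coefficients as 16-bit mantissas with power-of-two scaling factors, and replaces the segment address circuit with a boundary search. All names and widths are illustrative.

```python
import math
from bisect import bisect_right

N_BITS = 8
MANTISSA_BITS = 16  # assumed coefficient mantissa width

# Interior boundaries of the 14 power-of-two segments, in input codes.
BOUNDS = [1 << k for k in range(1, N_BITS)] + \
         [(1 << N_BITS) - (1 << k) for k in range(N_BITS - 2, 0, -1)]


def f(u):
    return math.sqrt(-math.log(u))  # assumed stand-in for f


def to_u(x):
    return (x + 0.5) / (1 << N_BITS)  # assumed code-to-u mapping


def quantize(c):
    """Return (mantissa, scale) with c approximately mantissa * 2**scale."""
    if c == 0.0:
        return 0, 0
    s = math.frexp(abs(c))[1] - (MANTISSA_BITS - 1)
    m = round(math.ldexp(c, -s))
    if abs(m) == 1 << (MANTISSA_BITS - 1):  # rounding spilled into an extra bit
        m, s = m // 2, s + 1
    return m, s


def build_rom():
    """One (m1, s1, m0, s0) entry per segment, from a chord fit in x."""
    edges = [0] + BOUNDS + [1 << N_BITS]
    rom = []
    for lo, hi in zip(edges, edges[1:]):
        last = hi - 1
        c1 = (f(to_u(last)) - f(to_u(lo))) / (last - lo)
        c0 = f(to_u(lo)) - c1 * lo
        rom.append(quantize(c1) + quantize(c0))
    return rom


def evaluate(x, rom):
    """p(x) = (m1 * x) * 2**s1 + m0 * 2**s0 for the segment containing x."""
    m1, s1, m0, s0 = rom[bisect_right(BOUNDS, x)]  # stand-in for the address circuit
    product = m1 * x                               # integer multiplier
    return math.ldexp(product, s1) + math.ldexp(m0, s0)  # shift, then add
```

Calling `evaluate(x, build_rom())` for each code x and comparing against `f(to_u(x))` gives a quick check of how the segment choice and the mantissa width together determine the approximation error.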

Source: Hardware Designs for Function Evaluation and LDPC Coding (pages 178-182)