
The Hierarchical Segmentation Method

Table 5.1: The ranges for P2S addresses for Λ1 = P2S, n = 8, v0 = 5 and v1 = 3.

The five P2S address bits δ0 are highlighted in bold.

P2S address   range

0             0 0 0 0 0 | 0 0 0    ∼  0 0 0 0 0 | 1 1 1
1             0 0 0 0 1 | 0 0 0    ∼  0 0 0 0 1 | 1 1 1
2             0 0 0 1 | 0 0 0 0    ∼  0 0 0 1 | 1 1 1 1
3             0 0 1 | 0 0 0 0 0    ∼  0 0 1 | 1 1 1 1 1
4             0 1 | 0 0 0 0 0 0    ∼  0 1 | 1 1 1 1 1 1
5             1 0 | 0 0 0 0 0 0    ∼  1 0 | 1 1 1 1 1 1
6             1 1 0 | 0 0 0 0 0    ∼  1 1 0 | 1 1 1 1 1
7             1 1 1 0 | 0 0 0 0    ∼  1 1 1 0 | 1 1 1 1
8             1 1 1 1 0 | 0 0 0    ∼  1 1 1 1 0 | 1 1 1
9             1 1 1 1 1 | 0 0 0    ∼  1 1 1 1 1 | 1 1 1

Figure 5.6. The appropriate taps are taken from the cascades, depending on the choice of the segments, and are added to work out the P2S address. For P2S that increases and decreases by powers of two, the full circuit is used; for P2S that decreases only to the left side (P2SL), just the AND gates are used. Similarly, for P2S that decreases to the right side (P2SR), the cascade of OR gates is used. These circuits can be pipelined, and a circuit with a shorter critical path but requiring more area can be used [80]. Note that in the last partition, δ_λ is not used as an address. If Λ_i = US, then δ_{i+1} uses the next set of v_{i+1} bits. However, if Λ_i = P2S, then the location of δ_{i+1} depends on the value of δ_i. Let j denote the P2S address, where j = 0, ..., s_i − 1. From the vertical lines in Table 5.1, we observe that δ_{i+1} should be placed after a_0 for j = 0 and j = s_i − 1, after a_{j−1} for j = 1 to j = (s_i/2) − 1, and after a_{s_i−2−j} for j = s_i/2 to j = s_i − 2.

[Figure 5.6: circuit diagram with inputs a_{v−1}, a_{v−2}, a_{v−3}, a_{v−4}, ..., a_0 and output "P2S Address"]

Figure 5.6: Circuit to calculate the P2S address for a given input δ_i, where δ_i = a_{v−1} a_{v−2} ... a_0. The adder counts the number of ones in the output of the two prefix circuits.
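The address rule can be checked in software against Table 5.1. The following Python sketch is illustrative only: the function names are hypothetical, and the exact choice of taps (every OR-cascade output, plus every AND-cascade output except the first) is an assumption chosen because it reproduces the addresses in Table 5.1, not a transcription of the circuit in Figure 5.6.

```python
def p2s_address(bits):
    """P2S address for the v bits of delta_i, most significant bit first.

    Assumption: the address is the number of ones on the OR-cascade taps
    plus the number of ones on the AND-cascade taps, skipping the first
    AND tap. This matches Table 5.1 for v = 5.
    """
    or_run, and_run = 0, 1
    ones = 0
    for k, a in enumerate(bits):
        or_run |= a          # OR cascade: 1 once any bit so far is 1
        and_run &= a         # AND cascade: 1 while every bit so far is 1
        ones += or_run       # tap from the OR cascade
        if k > 0:
            ones += and_run  # tap from the AND cascade (first tap skipped)
    return ones


def next_field_after(j, s):
    """Index k such that delta_{i+1} is placed after bit a_k for address j."""
    if j == 0 or j == s - 1:
        return 0
    if j <= s // 2 - 1:
        return j - 1
    return s - 2 - j


# Example: the five bits 0 0 1 1 0 fall in the range that starts with 0 0 1.
address = p2s_address([0, 0, 1, 1, 0])
split_after = next_field_after(address, 10)
```

For v = 5 and s = 10 segments, `next_field_after` gives the position of the vertical line in each row of Table 5.1, counted from the least significant bit a_0 of δ_i.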

In principle it is possible to have any number of levels of nested Λ, as long as Σ_{i=0}^{λ} v_i ≤ n. The more levels are used, the closer the total number of segments m will be to the optimum. However, as λ (the number of levels) increases, the partitioning problem becomes more complex and the cascade of look-up tables gets longer, increasing the delay to find the final segment. There is therefore a tradeoff between the partitioning complexity, delay and m. Our tests with the functions we consider in this chapter show that the rate of reduction of m decreases rapidly as λ increases. λ = 2 gives an m very close to the optimum with acceptable partitioning complexity and delay, whereas λ > 2 yields only small further improvements in m at the cost of high partitioning complexity and long delays. Therefore, in this work we limit ourselves to λ = 2, which consists of one outer segment Λ0 and one inner segment Λ1. P2S is used as the outer segment if the function varies exponentially at the beginning and the end of the interval. P2SL and P2SR are used as the outer segment when the function varies exponentially at the beginning or at the end respectively. US is used if the function is non-linear in arbitrary regions. Although we limit ourselves to λ = 2, more levels of hierarchy could be useful for certain functions.
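To make the bit budget concrete, the Python sketch below splits an n-bit operand into fixed-width fields, most significant field first. It is a minimal illustration under stated assumptions: the function name and argument layout are hypothetical, and fixed field positions apply only when every level uses US, since for P2S the position of the next field depends on the current one.

```python
def split_operand(x, n, widths):
    """Split an n-bit operand into fields of the given widths, MSB first.

    `widths` holds the address widths (for example v0 and v1). The last
    returned value holds the remaining low-order bits, which are not used
    as a segment address. Assumes uniform segmentation at every level.
    """
    if sum(widths) > n:
        raise ValueError("field widths exceed the operand size")
    fields, remaining = [], n
    for v in widths:
        remaining -= v
        fields.append((x >> remaining) & ((1 << v) - 1))
    fields.append(x & ((1 << remaining) - 1))
    return fields


# Example with n = 8, v0 = 3 and v1 = 2: the operand 101 11 010.
delta0, delta1, rest = split_operand(0b10111010, 8, [3, 2])
```

Here `delta0` selects the outer segment, `delta1` selects the inner segment within it, and `rest` is what remains for evaluating the polynomial inside the chosen segment.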

In Section 6.4 (Chapter 6), we approximate the functions √(−log(x)) with P2S and cos(πx/2) with US(P2S), which are needed by the Box-Muller algorithm. These two schemes are found to be sufficient to generate high quality noise samples. However, these schemes are perhaps inappropriate for applications that require high accuracies: when P2S is used as the innermost segmentation, the segments in the middle regions are large, causing large errors. Moreover, P2S needs the address calculation circuit, so P2S should be avoided if it offers only a small improvement over US. US(P2S(US)) could be useful for cases where there are highly non-linear regions in the middle parts of the function. The hierarchy schemes we have chosen are H = {P2S(US), P2SL(US), P2SR(US), US(US)}. These four schemes cover most of the non-linear functions of interest.

We have implemented the hierarchical segmentation method (HSM) in MATLAB, which deals with the four schemes. The program, called HFS (hierarchical function segmenter), takes the following inputs: the function f to be approximated, the input range, the operand size n, the hierarchy scheme H, the number of bits for the outer segment v0, the requested output error emax, and the precision of the polynomial coefficients and the data paths. HFS divides the input interval into outer segments whose boundaries are determined by H and v0. For each outer segment, HFS finds the minimum number of bits v1 for the inner segments that meets the requested output error constraint. It starts with v1 = 0 and computes the error e of the approximation. If e > emax, then v1 is incremented and the error e is computed for each inner segment; that is, the number of inner segments is doubled in every iteration. If e > emax for any inner segment, v1 is incremented again. This process is repeated until e ≤ emax for all inner segments of the current outer segment, at which point HFS has obtained the minimum number of bits for the current outer segment. HFS performs this process for all outer segments. The main MATLAB code for finding the hierarchical boundaries and their polynomial coefficients is shown in Figure 5.7. Note that minimax2 takes the precisions of the polynomial coefficients and data paths into account.

The outer boundaries are determined by H and v0.

Experiments are carried out to find the value of v0 that minimizes the total number of segments. Figure 5.8 shows how the total number of segments varies with v0 for a 16-bit second order approximation to f3. The curve is U-shaped, and the number of segments reaches its minimum at v0 = 5 bits in this particular case.

When v0 is too small, there are not enough outer segments to cater to local non-linearities. When v0 is too large, there are too many unnecessary outer segments.

Note that when v0 = 0, it is equivalent to using standard uniform segmentation. Figure 5.9 shows the segmented functions obtained from HFS for 16-bit second order approximations to the four functions. It can be seen that the segments produced by HFS closely resemble the optimum segments in Figure 5.2.
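The search for v1 and the effect of v0 on the total number of segments can be reproduced on a small scale. The Python sketch below is not HFS: it handles only US(US) on the interval [0, 1), replaces the minimax polynomial of minimax2 with a straight line through the segment endpoints, estimates the error on a sample grid, and ignores finite precision. All function names and the test function (a square root) are assumptions made for illustration.

```python
import math


def chord_error(f, x1, x2, samples=64):
    """Largest sampled error of the line through (x1, f(x1)) and (x2, f(x2)).

    Stand-in for a minimax fit; the error is sampled, not exact.
    """
    y1, y2 = f(x1), f(x2)
    worst = 0.0
    for k in range(samples + 1):
        t = k / samples
        x = x1 + t * (x2 - x1)
        worst = max(worst, abs(f(x) - (y1 + t * (y2 - y1))))
    return worst


def total_segments(f, v0, e_max, v1_limit=20):
    """Total segments for US(US) on [0, 1) with 2**v0 uniform outer segments."""
    total = 0
    outer = 1.0 / 2 ** v0
    for i in range(2 ** v0):
        lo = i * outer
        # Increase v1 until every inner segment meets the error bound.
        for v1 in range(v1_limit + 1):
            size = outer / 2 ** v1
            if all(chord_error(f, lo + j * size, lo + (j + 1) * size) <= e_max
                   for j in range(2 ** v1)):
                total += 2 ** v1
                break
        else:
            raise ValueError("e_max not reached within v1_limit bits")
    return total


# Sweep v0 for a function that is strongly non-linear near x = 0.
counts = {v0: total_segments(math.sqrt, v0, 0.01) for v0 in range(7)}
```

With v0 = 0 the sketch reduces to plain uniform segmentation, so `counts[0]` is the uniform baseline against which the other entries of `counts` can be compared.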

Table 5.2 shows a comparison in terms of numbers of segments for various second order approximations for uniform, HSM, and the optimum number of segments.

Double precision is used for the data paths and the output for this comparison.

We can see that HSM is significantly more efficient than uniform segmentation for the first three functions, and that its segment counts are around a factor of two above the optimum. However, for f4 the improvement over uniform segments is small because the function is very linear. Looking at the results for the 24-bit approximation to f1, we can see that HSM performs worse than average. This is because insufficient bits are left for δ (19 bits are already used for δ0).

Figure 5.10 shows our design flow for approximating functions. First, the user supplies the following to HFS: f, input range, H, n, v0, emax, and the precision of the polynomial coefficients and the data paths. HFS computes the segment boundaries and the polynomial coefficients and stores the data in a file. It also provides the user with a report, which contains the total number of segments m, the maximum error, the percentage of exactly rounded results, and the sizes of the multipliers, adders and look-up tables. There is a parameterizable reference design template library for the four hierarchy schemes defined by H, for first and second order approximations. A design generator instantiates the relevant reference design templates with information from the data file and generates the hardware design in VHDL.

An interesting aspect of our approach is that it could be used to accelerate applications that involve pure floating-point calculations, such as software applications. This is because our method computes compound functions at once using polynomial approximations, instead of decomposing the compound functions into sub-functions and computing the sub-functions one by one. Versions of FastMath [44] used P2S to approximate the non-linear functions in logarithmic number systems (LNS) to speed up software applications without the use of a coprocessor.

% Inputs: d, f, e_max, ulp, v0, H, n, precisions
% Output: hier_boundaries_table, poly_coeffs_table
hier_boundaries_table = [];
poly_coeffs_table = [];
for i = 1:(length(outer_boundaries)-1)
    x1 = outer_boundaries(i);
    x2 = outer_boundaries(i+1);
    hier_boundaries = x1;
    % First try one polynomial over the whole outer segment (v1 = 0)
    [e, poly_coeffs] = minimax2(f,d,x1,x2,ulp);
    if (e > e_max)
        outer_seg_size = x2-x1;
        v1 = 1;
        while (e > e_max)
            % Split the outer segment into 2^v1 uniform inner segments
            inner_seg_size = outer_seg_size/(2^v1);
            hier_boundaries = [];
            poly_coeffs = [];
            e = 0;
            for j = 1:2^v1
                x1 = outer_boundaries(i) + (inner_seg_size*j) - inner_seg_size;
                x2 = x1 + inner_seg_size;
                [e_j, coeffs_j] = minimax2(f,d,x1,x2,ulp);
                hier_boundaries(j,:) = x1;
                poly_coeffs(j,:) = coeffs_j;
                % Keep the worst error over all inner segments
                e = max(e, e_j);
            end
            v1 = v1 + 1;
        end
    end
    hier_boundaries_table = [hier_boundaries_table; hier_boundaries];
    poly_coeffs_table = [poly_coeffs_table; poly_coeffs];
end

Figure 5.7: Main MATLAB code for finding the hierarchical boundaries and their polynomial coefficients.

[Plot: number of segments (vertical axis, 100 to 550) against v0 (horizontal axis, 0 to 10)]

Figure 5.8: Variation of total number of segments against v0 for a 16-bit second order approximation to f3.

[Four plots: f1(x), f2(x), f3(x) and f4(x) against x, with the segment boundaries drawn as vertical lines]

Figure 5.9: The segmented functions generated by HFS for 16-bit second order approximations. f1, f2, f3 and f4 employ P2S(US), P2SL(US), US(US) and US(US) respectively. The black and grey vertical lines are the boundaries for the outer and inner segments respectively.

Table 5.2: Number of segments for second order approximations to the four functions. Results for uniform, HSM and optimum are shown.

function  order  operand  uniform      HSM       optimum   HSM
                 width    segments     segments  segments  /optimum

f1        1      8        64           13        7         1.86
                 12       4,096        78        35        2.23
                 16       65,536       395       161       2.45
                 20       1,048,576    1,876     723       2.59
                 24       33,554,432   8,608     2,302     3.74
          2      8        8            5         4         1.25
                 12       1,024        23        15        1.53
                 16       32,768       72        44        1.64
                 20       524,288      218       126       1.73
                 24       16,777,216   742       287       2.59

f2        1      8        32           19        11        1.73
                 12       512          93        45        2.07
                 16       8,192        381       181       2.10
                 20       131,072      1,533     724       2.12
                 24       2,097,152    6,141     2,896     2.12
          2      8        8            5         4         1.25
                 12       128          15        10        1.50
                 16       2,048        44        26        1.69
                 20       32,768       124       66        1.88
                 24       524,228      315       167       1.89

f3        1      8        256          36        20        1.80
                 12       1,024        172       81        2.12
                 16       4,096        683       303       2.25
                 20       16,384       2,723     1,296     2.10
                 24       65,536       10,609    5,182     2.05
          2      8        64           20        10        2.00
                 12       256          41        24        1.71
                 16       512          107       59        1.81
                 20       1,024        234       151       1.55
                 24       2,048        573       379       1.51

f4        1      8        8            7         5         1.40
                 12       32           27        20        1.35
                 16       128          110       77        1.43
                 20       512          435       307       1.42
                 24       2,048        1,739     1,228     1.42
          2      8        4            3         2         2.00
                 12       8            7         4         1.71
                 16       16           15        10        1.81
                 20       64           45        23        1.55
                 24       128          111       58        1.51

[Flow diagram with blocks: User Input, Hierarchical Function Segmenter, Data File, Report, Reference Design Library, Design Generator, Synthesis, Place and Route, Hardware]

Figure 5.10: Design flow of our approach.
