Results - Hardware Designs for Function Evaluation and LDPC Coding

After applying the method in Section 4.4.2, 1000 distinct designs are placed and routed on a Xilinx Virtex-II XC2V6000-6 device, yielding over 150 graphs. We summarize the results in two matrices of Pareto-optimal solutions: Figure 4.10 for area and Figure 4.11 for latency. In essence, these matrices tell us, for each combination of input range and precision, which method to use for each of the three functions (sin, log and sqrt). Note that we use the term range reduction to also include range reconstruction.

The remaining result figures show a sample of the graphs that we used to arrive at the decisions presented in the matrices above.
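As a minimal sketch of how such a matrix can be assembled from place-and-route results, the Python snippet below picks, for each function and each (range, precision) cell, the method with the smallest value of a chosen metric. The record layout, the name best_method_matrix and all numbers are hypothetical placeholders for illustration, not measurements from this chapter.

```python
def best_method_matrix(results, metric):
    """Return {(function, range_bits, precision_bits): method} minimizing `metric`."""
    best = {}
    for rec in results:
        key = (rec["function"], rec["range_bits"], rec["precision_bits"])
        if key not in best or rec[metric] < best[key][metric]:
            best[key] = rec
    return {key: rec["method"] for key, rec in best.items()}


# Hypothetical place-and-route records (illustrative values only).
results = [
    {"function": "sin", "method": "po", "range_bits": 8, "precision_bits": 8,
     "area_luts": 1500, "latency_ns": 90.0},
    {"function": "sin", "method": "tp2", "range_bits": 8, "precision_bits": 8,
     "area_luts": 900, "latency_ns": 110.0},
    {"function": "sin", "method": "tp3", "range_bits": 8, "precision_bits": 8,
     "area_luts": 1100, "latency_ns": 130.0},
]

area_matrix = best_method_matrix(results, "area_luts")
latency_matrix = best_method_matrix(results, "latency_ns")
print(area_matrix[("sin", 8, 8)])     # tp2 for these made-up numbers
print(latency_matrix[("sin", 8, 8)])  # po for these made-up numbers
```

Running the same selection once per metric gives one matrix for area and one for latency, mirroring the two summary figures.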

Figures 4.12, 4.13, 4.14 and 4.15 show the area cost of range reduction for sin(x) and log(x) implemented using the po and tp3 methods. The lower part of each bar shows the LUTs used for function evaluation, and the small upper part shows the LUTs used for range reduction. These figures show that the percentage of area used by range reduction increases with precision and range. Comparing sin(x) with log(x), the cost of range reduction grows much faster with range for sin(x), because its range reduction uses a modulus operation, which incorporates a divider. In contrast, log(x) uses a barrel shifter to do the range reduction.
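The difference between the two range reduction schemes can be sketched in software. The Python snippet below is an illustration under assumptions, not the hardware design evaluated here: the reduced intervals ([0, pi/2] for sin and [0.5, 1) for log), the function names, and the use of math.sin and math.log as stand-ins for the polynomial or table-based core are all choices made for this example.

```python
import math

HALF_PI = math.pi / 2.0


def sin_with_range_reduction(x, core=math.sin):
    """Evaluate sin(x) using a core that is only trusted on [0, pi/2]."""
    # Range reduction: modulus by 2*pi (the step that needs a divider in hardware).
    t = math.fmod(x, 2.0 * math.pi)
    if t < 0.0:
        t += 2.0 * math.pi
    quadrant = int(t // HALF_PI)
    r = t - quadrant * HALF_PI
    quadrant %= 4  # guards the rounding case where t lands exactly on 2*pi
    # Core evaluation on the reduced interval, then reconstruction by symmetry.
    value = core(r) if quadrant % 2 == 0 else core(HALF_PI - r)
    return value if quadrant < 2 else -value


def log_with_range_reduction(x, core=math.log):
    """Evaluate ln(x) for x > 0 using a core that is only trusted on [0.5, 1)."""
    # Range reduction: x = m * 2**e with m in [0.5, 1). This normalization is
    # the kind of operation a barrel shifter can implement.
    m, e = math.frexp(x)
    # Range reconstruction: ln(x) = ln(m) + e * ln(2).
    return core(m) + e * math.log(2.0)


print(sin_with_range_reduction(7.5), math.sin(7.5))
print(log_with_range_reduction(1000.0), math.log(1000.0))
```

The sine path needs a general modulus, whereas the logarithm path only needs a shift and an addition, which is consistent with the area trend described above.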

To decide when to use range reduction, we consider Figures 4.16, 4.17, 4.18 and 4.19, which show the area and latency results for sin(x) and log(x) evaluated with range reduction (WRR) and without range reduction (WOR). In the WOR case, we approximate the function over the entire user-defined range with the given methods (tp2 to tp4).

Considering the area for sin(x), WOR has a lower LUT usage than WRR when the range is less than six bits. In the case of log(x), we observe that even for ranges as low as two bits, the LUT usage for WOR is significantly higher than for WRR, and this gap increases with range. This is due to the non-linear region of log(x) near zero, which requires more segments to approximate with WOR. Considering the latency results for sin(x) and log(x), WOR is always faster than the corresponding WRR method. This is due to the absence of the range reduction step.
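The effect of the non-linear region near zero can be illustrated with a rough estimate. The sketch below is not the segmentation used by the tp2 to tp4 methods: it assumes uniform segments with linear interpolation, whose error on a segment of width h is bounded by h^2 * max|f''| / 8, an error target of 2^-9, and an input interval of [2^-k, 1] as a stand-in for a k-bit range. All of these are assumptions made for illustration.

```python
import math


def uniform_segments_linear(max_abs_f2, a, b, eps):
    """Uniform segments needed so linear interpolation error <= eps on [a, b].

    Uses the classical bound h**2 * max|f''| / 8 for a segment of width h.
    """
    h = math.sqrt(8.0 * eps / max_abs_f2)
    return math.ceil((b - a) / h)


eps = 2.0 ** -9  # assumed error target (half an LSB at 8 fractional bits)

# Without range reduction: ln(x) on [2**-k, 1], where |f''| = 1/x**2 peaks at the left end.
for k in (2, 3, 4):
    a = 2.0 ** -k
    print("WOR, k =", k, "segments =", uniform_segments_linear(1.0 / a**2, a, 1.0, eps))

# With range reduction: ln(x) only on [0.5, 1), regardless of k.
print("WRR segments =", uniform_segments_linear(4.0, 0.5, 1.0, eps))
```

Under these assumptions the segment count without range reduction grows from 24 to 56 to 120 as k goes from 2 to 4, while the reduced interval needs 8 segments regardless of k.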

Figures 4.20 and 4.21 highlight the area and latency tradeoffs: the area increase with precision is smaller for area-optimized designs, and the latency increase is smaller for latency-optimized designs. Figures 4.22 and 4.23 show a similar tradeoff when we vary the range while keeping the precision fixed.

By looking at these figures along with other figures, we are able to create the resulting matrices in Figures 4.10 and 4.11. From the two matrices, we observe that tp2 is the most attractive solution in most cases. This result is not too surprising, since second order polynomials are known to give good tradeoffs between table size and circuit complexity for the bitwidths we target in this chapter. But when the precision requirement is high (such as 16 bits in Figure 4.10), we see that tp3 gives the smallest area. This is because at low precision requirements, table sizes are manageable with low order polynomials. However, table sizes increase rapidly with increasing precision, to the point where higher order polynomials result in significantly smaller tables.
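The interaction between precision, polynomial order and table size can be sketched with a standard error bound. The snippet below is an estimate under assumptions, not the design flow used in this chapter: it takes sin(x) on [0, pi/2] with uniform segments, uses the Chebyshev-node interpolation bound (h/2)^(d+1) / (2^d (d+1)!) for a degree-d polynomial on a segment of width h (all derivatives of sin are bounded by 1), targets half an LSB of error, and ignores coefficient quantization.

```python
import math


def cheb_error_bound(h, degree, max_deriv=1.0):
    """Chebyshev-node interpolation error bound on a segment of width h."""
    n = degree + 1
    return max_deriv * (h / 2.0) ** n / (2 ** degree * math.factorial(n))


def segments_needed(degree, precision_bits, width=math.pi / 2.0):
    """Smallest number of uniform segments whose bound meets half an LSB."""
    eps = 2.0 ** -(precision_bits + 1)
    n = 1
    while cheb_error_bound(width / n, degree) > eps:
        n += 1
    return n


for p in (8, 16):
    for d in (2, 3):
        n = segments_needed(d, p)
        print(f"precision={p} degree={d} segments={n} coefficients={n * (d + 1)}")
```

Under these assumptions, at 8 bits the two orders need about the same number of stored coefficients (9 for degree 2, 8 for degree 3), whereas at 16 bits degree 2 needs 14 segments (42 coefficients) against 5 segments (20 coefficients) for degree 3.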

Figure 4.10: Area matrix which tells us for each input range/precision combination which design to use for minimum area.
[Matrix axes: Range [bits] and Precision [bits], each marked 4, 8, 12, 16. Cell entries in extracted order: (sin: tp2, log: tp2, sqrt: tp2); (sin: tp2, log: tp2, sqrt: po); (sin: tp2, log: tp2, sqrt: tp2); (sin: po, log: tp2, sqrt: tp2); (sin: tp2, log: tp2, sqrt: tp3); (sin: tp2, log: tp2, sqrt: tp2); (sin: tp2, log: tp2, sqrt: tp3); (sin: tp2, log: tp2, sqrt: tp2); (sin: po, log: po, sqrt: tp2).]

Figure 4.11: Latency matrix which tells us for each input range/precision combination which design to use for minimum latency.
[Matrix axes: Range [bits] and Precision [bits], each marked 4, 8, 12, 16. Cell entries in extracted order: (sin: tp2, log: tp2, sqrt: tp2); (sin: po, log: po, sqrt: po); (sin: tp2, log: tp2, sqrt: tp2); (sin: tp2, log: po, sqrt: tp2); (sin: tp2, log: tp2, sqrt: tp2); (sin: po, log: tp2, sqrt: tp2); (sin: tp2, log: tp2, sqrt: tp2); (sin: tp2, log: po, sqrt: tp2).]

Figure 4.12: Area cost of range reduction (upper part) for sin(x) implemented using po with the designs optimized for area.
[Bar chart, sin(x) with po: Area [4-input LUTs] versus Range [bits] (4, 8, 12, 16), one series per precision (4, 8, 12 and 16 bits).]

Figure 4.13: Area cost of range reduction (upper part) for sin(x) implemented using tp3 with the designs optimized for area.
[Bar chart, sin(x) with tp3: Area [4-input LUTs] versus Range [bits] (4, 8, 12, 16), one series per precision (4, 8, 12 and 16 bits).]

Figure 4.14: Area cost of range reduction (upper part) for log(x) implemented using po with the designs optimized for area.
[Bar chart, log(x) with po: Area [4-input LUTs] versus Range [bits] (4, 8, 12, 16), one series per precision (4, 8, 12 and 16 bits).]

Figure 4.15: Area cost of range reduction (upper part) for log(x) implemented using tp3 with the designs optimized for area.
[Bar chart, log(x) with tp3: Area [4-input LUTs] versus Range [bits] (4, 8, 12, 16), one series per precision (4, 8, 12 and 16 bits).]

Figure 4.16: Area for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.
[Line plot, sin(x): Area [4-input LUTs] versus Range [bits] (4, 6, 8); series tp2, tp3 and tp4, each for WOR and WRR.]

Figure 4.17: Latency for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.
[Line plot, sin(x): Latency [ns] versus Range [bits] (4, 6, 8); series tp2, tp3 and tp4, each for WOR and WRR.]

Figure 4.18: Area for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.
[Line plot, log(x): Area [4-input LUTs] versus Range [bits] (2, 3, 4); series tp2, tp3 and tp4, each for WOR and WRR.]

Figure 4.19: Latency for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.
[Line plot, log(x): Latency [ns] versus Range [bits] (2, 3, 4); series tp2, tp3 and tp4, each for WOR and WRR.]

Figure 4.20: Area versus precision for sin(x) using tp3 for different ranges and optimization.
[Plot, sin(x) with tp3: Area [4-input LUTs] versus Precision [bits] (4, 8, 12, 16); series for ranges of 4, 8, 12 and 16 bits, each area-optimized and latency-optimized.]

Figure 4.21: Latency versus precision for sin(x) using tp3 for different ranges and optimization.
[Plot, sin(x) with tp3: Latency [ns] versus Precision [bits] (4, 8, 12, 16); series for ranges of 4, 8, 12 and 16 bits, each area-optimized and latency-optimized.]

Figure 4.22: Area versus range for all three functions using different methods with the precision fixed at eight bits optimized for area.
[Plot, area optimization at 8-bit precision: Area [4-input LUTs] versus Range [bits] (4, 8, 12, 16); series sin, sqrt and log, each with tp2, tp3, tp4 and po.]

Figure 4.23: Latency versus range for all three functions using different methods with the precision fixed at eight bits optimized for latency.
[Plot, latency optimization at 8-bit precision: Latency [ns] versus Range [bits] (4, 8, 12, 16); series sin, sqrt and log, each with tp2, tp3, tp4 and po.]

Figure 4.24: Area versus range for all three functions using po for different precisions optimized for area.
[Plot, area optimization with po: Area [4-input LUTs] versus Range [bits] (4, 8, 12, 16); series sin, sqrt and log, each at precisions of 4, 8, 12 and 16 bits.]

Figure 4.25: Latency versus range for all three functions using po for different precisions optimized for latency.
[Plot, latency optimization with po: Latency [ns] versus Range [bits] (4, 8, 12, 16); series sin, sqrt and log, each at precisions of 4, 8, 12 and 16 bits.]

Figure 4.26: Area versus range for all three functions using tp3 for different precisions optimized for area.
[Plot, area optimization with tp3: Area [4-input LUTs] versus Range [bits] (4, 8, 12, 16); series sin, sqrt and log, each at precisions of 4, 8, 12 and 16 bits.]

Figure 4.27: Latency versus range for all three functions using tp3 for different precisions optimized for latency.
[Plot, latency optimization with tp3: Latency [ns] versus Range [bits] (4, 8, 12, 16); series sin, sqrt and log, each at precisions of 4, 8, 12 and 16 bits.]
