We demonstrate our approach with three elementary functions: sin(x), log(x + 1) and 2^x. Five bitwidths are considered: 8, 12, 16, 20 and 24 bits. In this chapter, we implement designs with n-bit inputs and n-bit outputs; however, the position of the binary point in the input and output formats can differ, in order to maximize the precision that can be represented. All results are post place-and-route, and are implemented on a Xilinx Virtex-II XC2V6000-6 device [187].

In the algorithmic space explored in MATLAB, there are three methods, three functions and five bitwidths, resulting in 45 designs. These designs are generated by the user with hand-optimized coefficient and operation bitwidths. ASC takes the 45 algorithmic designs and generates a large number of implementations in the hardware space with different optimization metrics. With the aid of the automatic design exploration features of ASC (Section 3.4), we are able to generate all the implementation results in one go with a single ‘make’ file, as sketched below. The whole run takes around twelve hours on a dual Athlon MP 2.13 GHz PC with 2 GB of DDR SDRAM.
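
Conceptually, the batch exploration amounts to enumerating every combination of method, function, bitwidth and optimization goal and invoking the ASC flow once per combination. The Python sketch below only illustrates that enumeration; the target names are hypothetical, and the real flow is driven by ASC's make-based automation (Section 3.4).

```python
import itertools

# A minimal, illustrative batch driver: the actual flow is driven by ASC's
# make-based automation, and the target names here are hypothetical.
methods   = ["TABLE", "POLY", "TABLE+POLY"]
functions = ["sin", "log1p", "exp2"]
bitwidths = [8, 12, 16, 20, 24]
opt_goals = ["AREA", "LATENCY", "THROUGHPUT"]

for method, func, bits, goal in itertools.product(methods, functions,
                                                  bitwidths, opt_goals):
    # One hardware implementation per (algorithmic design, optimization metric) pair.
    target = f"{func}_{method.replace('+', '_')}_{bits}b_OPT-{goal}"
    print("would build:", target)
    # subprocess.run(["make", target], check=True)  # hypothetical make target
```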

The following graphs are a subset of the full design space exploration, which we show for demonstration purposes. Figures 3.4 to 3.15 show a set of FPGA implementations resulting from a 2D cut of the multidimensional design space.

In Figures 3.4 to 3.6, we fix the function and approximation method to sin(x) and TABLE+POLY, and obtain area, latency and throughput results for various bitwidths and optimization methods. Degree two polynomials are used for all TABLE+POLY experiments in this chapter.
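
To make the TABLE+POLY scheme concrete, the following Python sketch models a degree-two piecewise approximation in floating point: the top bits of the input select a segment, whose three coefficients are read from a table and evaluated in Horner form. The segment count and the least-squares fits are illustrative assumptions; the hardware designs use fixed-point arithmetic and minimax coefficients.

```python
import math
import numpy as np

NUM_SEGMENTS = 64   # assumed segment count; chosen per bitwidth in practice
DEGREE = 2          # degree-two polynomials, as in the experiments

# Build the coefficient table: one degree-2 fit per segment
# (least-squares here; the real designs use minimax coefficients).
table = []
for s in range(NUM_SEGMENTS):
    lo, hi = s / NUM_SEGMENTS, (s + 1) / NUM_SEGMENTS
    xs = np.linspace(lo, hi, 32)
    table.append(np.polyfit(xs, np.sin(xs), DEGREE))

def sin_table_poly(x):
    """Approximate sin(x) on [0, 1): the top bits of x index the table,
    then the stored polynomial is evaluated in Horner form."""
    seg = min(int(x * NUM_SEGMENTS), NUM_SEGMENTS - 1)
    c2, c1, c0 = table[seg]
    return (c2 * x + c1) * x + c0

print(sin_table_poly(0.3), math.sin(0.3))
```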

Figure 3.4 shows how the area (in terms of the number of 4-input LUTs) varies with bitwidth. The lower part of each bar shows LUTs used for logic, while the small top part shows LUTs used for routing. We observe that designs optimized for area are significantly smaller than the other designs, and, as one would expect, the area increases with the bitwidth. Designs optimized for throughput have the largest area; this is due to the registers used for pipelining. Figure 3.5 shows that designs optimized for latency have significantly lower delay, and their delay grows more slowly with the bitwidth than that of the other designs. Designs optimized for area have the longest delay, because hardware is shared in a time-multiplexed manner. Figure 3.6 shows that designs optimized for throughput perform significantly better than the others, while designs optimized for area perform worst, which is again due to the hardware sharing. We note that the throughput is rather unpredictable with increasing bitwidth; this is because the throughput is determined solely by the critical path, which does not necessarily increase with bitwidth (circuit area).

Figures 3.7 to 3.9 show various metric-against-metric scatter plots of 12-bit approximations to sin(x) with different methods and optimizations. For TABLE, only results with area optimization are shown, because the results for the other optimizations are identical (such optimizations are not possible for TABLE).

With the aid of such plots, one can rapidly decide which method to use to meet specific requirements in area, latency or throughput.
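
The Pareto-optimal points highlighted in Figures 3.7 to 3.9 can be extracted from the raw metric pairs with a simple dominance test. The sketch below assumes both metrics are to be minimized (as for latency and area); for throughput, which is to be maximized, the values would be negated first. The design points shown are made up for illustration.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (both axes minimized)."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical (latency ns, area LUTs) pairs, for illustration only.
designs = [(900, 400), (120, 2600), (300, 1500), (150, 2800), (600, 900)]
print(pareto_front(designs))
```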

Figure 3.4: Area versus bitwidth for sin(x) with TABLE+POLY. OPT indicates the metric for which the design is optimized. Lower part of each bar: LUTs for logic; small top part: LUTs for routing.

Figure 3.5: Latency versus bitwidth for sin(x) with TABLE+POLY. Shows the impact of latency optimization.

Figure 3.6: Throughput versus bitwidth for sin(x) with TABLE+POLY. Shows the impact of throughput optimization.

Figure 3.7: Latency versus area for 12-bit approximations to sin(x). The Pareto-optimal points [124] in the latency-area space are shown.

Figure 3.8: Latency versus throughput for 12-bit approximations to sin(x). The Pareto-optimal points in the latency-throughput space are shown.

Figure 3.9: Area versus throughput for 12-bit approximations to sin(x). The Pareto-optimal points in the throughput-area space are shown.

In Figures 3.10 to 3.12, we fix the approximation method to TABLE+POLY, and obtain area, latency and throughput results for all three functions at various bitwidths. The corresponding optimization is applied in each experiment (e.g. designs are optimized for area to obtain the area results).

From Figure 3.10, we observe that sin(x) requires the most area and 2^x the least. The difference becomes more apparent as the bitwidth increases. This is because 2^x is the most linear of the three functions and hence requires fewer segments for the approximation, which leads to fewer entries in the coefficient table and hence less area on the device.
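
A rough way to see why the more linear 2^x needs fewer segments is to double the number of uniform segments until a degree-two fit meets a target error on every segment, as in the sketch below. The target error of 2^-16 and the least-squares fits are illustrative assumptions and will not match the minimax segment counts of the actual designs.

```python
import numpy as np

def segments_needed(f, target_err=2.0**-16):
    """Smallest power-of-two segment count for which a degree-2 fit on each
    uniform segment of [0, 1) stays within target_err (least-squares, illustrative)."""
    n = 1
    while True:
        worst = 0.0
        for s in range(n):
            xs = np.linspace(s / n, (s + 1) / n, 64)
            coeffs = np.polyfit(xs, f(xs), 2)
            worst = max(worst, np.max(np.abs(np.polyval(coeffs, xs) - f(xs))))
        if worst <= target_err:
            return n
        n *= 2

for name, f in [("sin(x)", np.sin), ("log(1+x)", np.log1p), ("2^x", np.exp2)]:
    print(name, segments_needed(f))
```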

Figure 3.11 shows how the latency varies with the bitwidth. We observe that all three functions behave similarly. In Figure 3.12, we observe that the three functions again behave similarly, with 2^x performing slightly better than the others for bitwidths higher than 16 bits. We suspect that this is because of the lower area requirement of 2^x, which leads to less routing delay.

Figures 3.13 to 3.15 show the main emphasis and contribution of this chapter, illustrating which approximation method to use for the best area, latency or throughput performance. We fix the function to sin(x) and obtain results for all three methods at various bitwidths. Again, the corresponding area, latency or throughput optimization is performed for each experiment. For experiments involving TABLE, we have managed to obtain results up to 12 bits only, due to the memory limitations of our PCs.

From Figure 3.13, we observe that TABLE has the least area at 8 bits, but its area increases rapidly, making it less desirable at higher bitwidths. The reason for this is the exponential growth of the table size with the input size for full look-up tables.

The TABLE+POLY approach yields the least area for precisions higher than eight bits. This is due to the efficiency of using multiple segments with minimax coefficients. We have observed that for POLY, roughly one more polynomial term (i.e. one more multiply-and-add module) is needed every four bits; hence the POLY curve shows linear behavior. We are unable to generate TABLE results beyond 12 bits, due to device size restrictions.
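
These two observations give a back-of-the-envelope cost model: a full look-up table grows as 2^n entries of n bits (for n-bit inputs and outputs), while POLY needs roughly n/4 multiply-and-add terms. The constants in the sketch below are illustrative only.

```python
# Illustrative growth comparison based on the observations above.
for n in (8, 12, 16, 20, 24):
    table_bits = (2 ** n) * n      # full table: exponential in the input width
    poly_terms = max(2, n // 4)    # POLY: roughly one extra term every four bits
    print(f"{n:2d} bits: table = {table_bits:>12,} bits, poly terms ~ {poly_terms}")
```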

Figure 3.14 shows that TABLE has significantly smaller latency than the others, and we expect this to remain the case for bitwidths higher than 12 bits. POLY has the worst delay, due to computations involving high-degree polynomials, with the number of polynomial terms growing with the bitwidth. The latency of TABLE+POLY is relatively low across all bitwidths, because the number of memory accesses and the polynomial degree are fixed.

In Figure 3.15, we observe how the throughput varies with bitwidth. For low bitwidths, TABLE designs give the best throughput, due to the short delay of a single memory access. However, the performance degrades quickly, and we predict that at bitwidths higher than 12 bits TABLE will perform worse than the other two methods, due to the rapid increase in routing congestion.

The throughput of TABLE+POLY is better than that of POLY below 15 bits and worse above; this is due to the increase in the table size with precision, which leads to longer delays for the memory accesses.

Figure 3.10: Area versus bitwidth for the three functions with TABLE+POLY. Lower part of each bar: LUTs for logic; small top part: LUTs for routing.

Figure 3.11: Latency versus bitwidth for the three functions with TABLE+POLY.

Figure 3.12: Throughput versus bitwidth for the three functions with TABLE+POLY. Throughput is similar across functions, as expected.

Figure 3.13: Area versus bitwidth for sin(x) with the three methods. Note that the TABLE method already becomes too large at 14 bits.

Figure 3.14: Latency versus bitwidth for sin(x) with the three methods.

Figure 3.15: Throughput versus bitwidth for sin(x) with the three methods.
