
This chapter is organized as follows. First, we compare the performance of SA8051 with the Intel 8051 under various clock rates. Then, we compare the power consumption of SA8051 with the synchronous 8051. Finally, we compare the area cost with the synchronous version.

5-1 Performance

The performance of the SA8051 is compared with that of an Intel 8051 model, called I8051, developed at the University of California [18]. The I8051 models the actual Intel implementation closely; for example, it is 100% instruction compatible. It is written in VHDL that is synthesizable at least by the Synopsys and Xilinx tools. To compare the SA8051 with it fairly, we modify it slightly by removing the MUL, DIV and MOVX operations.

The Xilinx Spartan IIE 300 ft256 FPGA device is chosen to estimate the performance.

Timing simulation is performed with ModelSim. We run 6 test programs under different clock rates.

There is a clock in the interface between the SA8051 and the memory. Figure 32 shows the speedup of SA8051 versus I8051. The SpeedUp is defined as

SpeedUp = Execution Time of I8051 / Execution Time of SA8051
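The definition can be applied directly to a pair of measured execution times. The following is a minimal Python sketch; the function name `speedup` and the sample times are hypothetical and are not measurements from this work.

```python
def speedup(time_i8051: float, time_sa8051: float) -> float:
    """SpeedUp = execution time of I8051 / execution time of SA8051."""
    if time_i8051 <= 0 or time_sa8051 <= 0:
        raise ValueError("execution times must be positive")
    return time_i8051 / time_sa8051


# Hypothetical execution times in milliseconds (illustration only).
example = speedup(time_i8051=12.0, time_sa8051=10.0)
print(example)  # a value above 1 means the SA8051 finished first
```

A SpeedUp above 1 means the SA8051 is faster, and a SpeedUp of exactly 1 marks the clock rate at which both processors deliver the same performance.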

The maximum clock rate of the I8051 is 12 MHz. The performance depends on the clock rate and on the test program. For the sort.c program, the SA8051 runs faster than the I8051 when the clock rate is below 8 MHz. For the other five programs, the SA8051 runs faster than the I8051 when the clock rate is below 6 MHz.

When the clock rate is above 8 MHz, the SpeedUp is below 1 for all 6 test programs.

The bottleneck is the interface between the asynchronous processor and the synchronous memory. In the worst case, fetching data from memory incurs a delay of 2 clock cycles. The same holds for writing data.
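To give a feeling for the size of this penalty, the worst-case wait of 2 clock cycles can be converted into absolute time. The sketch below only performs this unit conversion; the function name and the clock rates chosen for the printout are illustrative assumptions.

```python
def worst_case_access_delay_ns(clock_mhz: float, cycles: int = 2) -> float:
    """Worst-case wait, in nanoseconds, for `cycles` periods of the interface clock."""
    if clock_mhz <= 0:
        raise ValueError("clock rate must be positive")
    period_ns = 1000.0 / clock_mhz  # one period of a 1 MHz clock lasts 1000 ns
    return cycles * period_ns


# Illustrative clock rates; 2 cycles is the worst case stated in the text.
for mhz in (2, 4, 8):
    print(mhz, "MHz:", worst_case_access_delay_ns(mhz), "ns")
```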

Figure 32: SpeedUp for SA8051 versus I8051


5-2 Power Consumption

Power consumption is estimated with Xilinx XPower. It can analyze total device power and power per net for routed, partially routed or unrouted designs, and it can be driven either from a graphical interface or in command-line batch mode. It reads VCD simulation data from the ModelSim family of HDL simulators to set the stimulus for the estimate.

There are two main components to power consumption: static and dynamic. Static or quiescent power is mainly dominated by transistor leakage current. Dynamic or active power has components from both the switching power of the core of the FPGA and the I/O being switched. The dynamic power consumption is determined by the node capacitance, supply voltage, and switching frequency.
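The dependence of dynamic power on these three quantities is commonly summarized by the first-order CMOS switching-power relation P = a * C * V^2 * f, where a is the switching activity factor. The sketch below evaluates this relation; it is not the model used by XPower, and all numeric values are hypothetical.

```python
def dynamic_power_watts(activity: float, capacitance_f: float,
                        supply_v: float, frequency_hz: float) -> float:
    """First-order CMOS switching power: P = a * C * V^2 * f."""
    return activity * capacitance_f * supply_v ** 2 * frequency_hz


# Hypothetical values: 10% activity, 1 nF switched capacitance, 1.8 V, 8 MHz.
p_8mhz = dynamic_power_watts(0.1, 1e-9, 1.8, 8e6)
p_4mhz = dynamic_power_watts(0.1, 1e-9, 1.8, 4e6)
print(p_8mhz, p_4mhz)  # halving the frequency halves the switching power
```

In this relation the power is linear in the switching frequency and quadratic in the supply voltage, so removing a continuously toggling clock net lowers the activity term directly.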

The same 6 test programs as in section 5-1 are run to estimate the power consumption. Figure 33 depicts the total power consumption of the asynchronous and synchronous 8051. The total power consumption consists of the dissipation of the processor core, the memory and the interface. We compare the two implementations at equal performance. When the clock rate is 8 MHz, the SpeedUp for the sort.c program is 1, and the asynchronous 8051 shows a total power advantage of a factor of 2 over the synchronous implementation. For the other 5 test programs, the SpeedUp is 1 when the clock rate is 6 MHz, and the asynchronous 8051 shows a total power advantage of a factor of 1.5 over the synchronous implementation.

The static power consumption of the FPGA is a significant portion of the total power consumption. For example, the static power consumption is 28.2 mW for the FPGA device Spartan IIE 300 ft256. Figure 34 shows the dynamic power consumption of the asynchronous and synchronous 8051. The asynchronous 8051 shows a dynamic power advantage of a factor of 3 over the synchronous implementation at the same performance. There are several reasons for the power saving. First, the asynchronous implementation has no clock power and automatically turns off the unused portions of the circuit. Second, the handshake interface also plays an important role, because the memory is active only when the processor wants to access it.
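Because the static power is paid by both implementations, the advantage in total power is necessarily smaller than the advantage in dynamic power. The sketch below illustrates this dilution. Only the 28.2 mW static figure comes from the text; the dynamic power values are hypothetical, and the sketch assumes that both designs run on the same device with the same static power.

```python
STATIC_MW = 28.2  # static power of the Spartan IIE 300 ft256 quoted in the text


def total_power_ratio(sync_dynamic_mw: float, async_dynamic_mw: float,
                      static_mw: float = STATIC_MW) -> float:
    """Ratio of synchronous to asynchronous total power (static + dynamic)."""
    return (static_mw + sync_dynamic_mw) / (static_mw + async_dynamic_mw)


# Hypothetical dynamic power values with a 3x dynamic advantage.
sync_dyn, async_dyn = 60.0, 20.0
ratio = total_power_ratio(sync_dyn, async_dyn)
print(round(sync_dyn / async_dyn, 2), round(ratio, 2))
```

With these illustrative numbers, a factor of 3 in dynamic power shrinks to a factor below 2 in total power, which is the same kind of gap as the one between the dynamic and total power results reported in this section.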

We can also compare the core of the asynchronous 8051 with that of the synchronous 8051. Figure 35 shows the results. The asynchronous core shows a dynamic power advantage of a factor of 2 over the synchronous core at the same performance. The detailed energy dissipation is depicted in Figures 36 and 37. The asynchronous implementation needs less dynamic power than the synchronous implementation because it dissipates no clock energy. Although the asynchronous implementation needs no clock power, it needs extra signal power for the handshake implementation.

Figure 33: Total Power Consumption for test programs

Figure 34: Dynamic Power Consumption for test programs

Figure 35: The dynamic power consumption of the asynchronous processor core versus synchronous

Figure 36: The detailed dynamic power consumption (a) The left side is asynchronous processor (b) The right side is synchronous processor

Figure 37: The detailed dynamic power consumption (a) The left side is asynchronous processor (b) The right side is synchronous processor

5-3 Area Cost

We remove the multiplier and the divider from the synchronous 8051 in order to compare the cost fairly. The area cost is shown in Table 8. The results show that the asynchronous implementation is about 2 times larger than the synchronous implementation. The area overhead comes mainly from the handshake circuit in each handshake component. Hazard-free circuits are employed to ensure correct operation. The completion-detection circuits on the control path, which need large C-elements, also add to the area overhead. Extra buffers are added to ensure that the timing constraints are met.

Another reason lies in the CAD tools. There are no commercial CAD tools for asynchronous circuits. Synchronous CAD tools can apply optimization techniques for speed and area, such as logic minimization and retiming. The asynchronous tool Balsa, however, performs only a transparent compilation and does not optimize the asynchronous circuits.

Implementation    Slices    Gate Count (NAND)
Synchronous       990       13251
Asynchronous      2245      23590 (no added buffers), 25780 (with added buffers)

Table 8: The area cost for the synchronous and asynchronous 8051
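The area ratios implied by Table 8 can be checked with a few divisions. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# Values copied from Table 8.
SYNC_SLICES, SYNC_GATES = 990, 13251
ASYNC_SLICES = 2245
ASYNC_GATES_NO_BUFFERS, ASYNC_GATES_WITH_BUFFERS = 23590, 25780

slice_ratio = ASYNC_SLICES / SYNC_SLICES
gate_ratio_no_buffers = ASYNC_GATES_NO_BUFFERS / SYNC_GATES
gate_ratio_with_buffers = ASYNC_GATES_WITH_BUFFERS / SYNC_GATES
# Share of gates that the added buffers contribute on top of the unbuffered design.
buffer_overhead = (ASYNC_GATES_WITH_BUFFERS - ASYNC_GATES_NO_BUFFERS) / ASYNC_GATES_NO_BUFFERS

print(f"slices: {slice_ratio:.2f}x")
print(f"gates without buffers: {gate_ratio_no_buffers:.2f}x")
print(f"gates with buffers: {gate_ratio_with_buffers:.2f}x")
print(f"buffer overhead: {buffer_overhead:.1%}")
```

The ratios lie between roughly 1.8 and 2.3 depending on the metric, which is consistent with the statement that the asynchronous implementation is about 2 times larger.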

5-4 Concluding Remarks

In this chapter we compared the asynchronous 8051 with the synchronous 8051 in terms of performance, power consumption and area cost. The simulation results show that the asynchronous 8051 outperforms the synchronous 8051 by a factor of 3 in dynamic power consumption at the same performance. The performance depends on the executed instructions, which take different numbers of machine cycles. At low clock rates the asynchronous implementation outperforms the synchronous one because the SA8051 avoids the unnecessary operations in the original machine cycles. The area cost of the asynchronous processor is about 2 times that of the synchronous one.
