Dynamic Approaches - Improving the Energy Efficiency of Arithmetic Units Using Data-Dependent Delay

In this category, the supply voltage and the switching activity/capacitance are adjusted dynamically to suit different applications and throughput requirements, which makes it more flexible than the static approaches. This section introduces techniques that reduce energy by lowering the supply voltage and by reducing the switching activity/capacitance, respectively.

2.3.1 Supply Voltage

• Dynamic voltage and frequency scaling (DVFS)

The gap between high performance and low power can be bridged through the use of dynamic voltage scaling, where periods of low processor utilization are exploited by lowering the clock frequency to the minimum required level, allowing corresponding reduction in the supply voltage [22][23].
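
The benefit of lowering the voltage together with the frequency can be seen from the standard CMOS switching-power model. The symbols below (α for switching activity, C for switched capacitance, N for the number of clock cycles a task needs) are notation introduced here for illustration and do not come from the original text:

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{DD}^{2} \, f,
\qquad
E_{\mathrm{task}} = P_{\mathrm{dyn}} \cdot \frac{N}{f} = \alpha \, C \, V_{DD}^{2} \, N
```

Under this model, lowering f alone leaves the energy per task unchanged; the saving arises because a lower f permits a lower V_DD, and energy falls with the square of V_DD.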

Figure 2-8 shows the overall architecture of a DVFS system. The performance manager uses a software interface to predict performance requirements. Once the performance requirement for the next task is determined, the performance manager sets the voltage and frequency to the minimum needed to accomplish the task. The target frequency is sent to the phase-locked loop (PLL), which performs the frequency scaling. Based on the target voltage, the voltage regulator scales the supply voltage to meet the performance target.

Figure 2-8 Architecture of the DVFS system

A robust system should meet its deadlines under any voltage, process, and temperature condition. The conventional approach to voltage scaling assigns a target operating voltage to each required operating frequency. To guarantee robust operation, the frequency-voltage relationship is determined by characterizing the chip under worst-case conditions. This technique is used in open-loop DVFS systems, where the frequency-voltage relationship is stored in a look-up table (LUT). Because the LUT is pre-loaded with fixed voltage-frequency points, such DVFS systems cannot adapt to process variations or environmental conditions.
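
The open-loop scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any system cited here: the table entries, the function names, and the energy model (dynamic energy proportional to cycles times Vdd squared) are all hypothetical.

```python
# Hypothetical worst-case characterized operating points: (frequency in MHz, Vdd in V).
# The numbers are illustrative only and are not taken from any real chip.
VF_TABLE = [(100, 0.8), (200, 0.9), (300, 1.0), (400, 1.2)]


def select_operating_point(cycles, deadline_us, table=VF_TABLE):
    """Return the lowest-frequency (MHz, V) entry that finishes `cycles` within `deadline_us`."""
    required_mhz = cycles / deadline_us  # cycles per microsecond equals MHz
    for freq_mhz, vdd in sorted(table):
        if freq_mhz >= required_mhz:
            return freq_mhz, vdd
    raise ValueError("deadline cannot be met even at the highest frequency")


def relative_energy(cycles, vdd):
    """Dynamic energy up to a constant factor (assumed model: cycles * Vdd^2)."""
    return cycles * vdd ** 2


if __name__ == "__main__":
    cycles = 150_000
    freq, vdd = select_operating_point(cycles, deadline_us=1000)
    top_freq, top_vdd = max(VF_TABLE)
    ratio = relative_energy(cycles, vdd) / relative_energy(cycles, top_vdd)
    print(f"run at {freq} MHz / {vdd} V, energy ratio vs. {top_freq} MHz: {ratio:.4f}")
```

Because the table is fixed at characterization time, the selected voltage carries the full worst-case margin even on a chip that could run the same frequency at a lower voltage, which is the limitation described above.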

2.3.2 Switching Activity and Capacitance

• Clock gating & operand isolation

Clock gating is a common method for reducing unnecessary signal transitions. [24] proposes a technique to automatically synthesize gated clocks for finite-state machines to reduce power dissipation. Figure 2-9 shows a clock-gated D flip-flop.

Figure 2-9 Clock-gated D flip-flop

The gated flip-flop has an additional signal named “Enable”. In a D flip-flop without clock gating, the input is passed to the output at every rising clock edge. In the gated D flip-flop, the input is passed to the output at the rising clock edge only if the enable signal is “1”.

The enable signal can be controlled dynamically according to the current requirements, which reduces signal transitions in both the registers and the combinational logic.

If the input registers of a circuit are clock-gated, the circuit sees the same inputs as in the previous cycle, so all of its internal nodes remain unchanged. Without gated input registers, new input values may cause transitions and glitches in that cycle, which also consume power.

Hence, latches (or flip-flops) can also be inserted at the inputs of the functional units.

When the output of a functional unit is not needed, its input data can be isolated by these latches (or flip-flops).
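
The effect of gating on switching activity can be illustrated with a small behavioural model. This is a sketch under stated assumptions: the class and function names are hypothetical, and counting output bit flips is only a rough proxy for switching activity, not a power estimate.

```python
def count_bit_flips(old, new):
    """Number of bit positions that differ between two non-negative integers."""
    return bin(old ^ new).count("1")


class GatedRegister:
    """Behavioural model of a register whose clock is gated by an enable signal."""

    def __init__(self, value=0):
        self.value = value
        self.flips = 0        # accumulated output bit transitions
        self.clock_edges = 0  # clock edges that actually reached the flip-flops

    def rising_edge(self, data_in, enable=True):
        if not enable:
            return self.value  # clock suppressed: output holds, nothing toggles
        self.clock_edges += 1
        self.flips += count_bit_flips(self.value, data_in)
        self.value = data_in
        return self.value


def simulate(stream, use_gating):
    """Feed (data, result_needed) pairs to a register; return (bit flips, clock edges)."""
    reg = GatedRegister()
    for data, result_needed in stream:
        # Without gating the register loads every cycle, needed or not.
        reg.rising_edge(data, enable=result_needed or not use_gating)
    return reg.flips, reg.clock_edges


if __name__ == "__main__":
    stream = [(0b1010, True), (0b0101, False), (0b1111, False), (0b0011, True)]
    print("ungated:", simulate(stream, use_gating=False))
    print("gated:  ", simulate(stream, use_gating=True))
```

In this model, every cycle in which the result is not needed leaves the register output, and therefore the inputs of the logic it drives, unchanged.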

• Pre-computation logic

Pre-computation relies on duplicating part of the logic in order to pre-compute the circuit output values one clock cycle before they are required; these values are then used to reduce the total amount of switching in the circuit during the next clock cycle.

[25][26] present an algorithm to synthesize pre-computation logic for the complete input-disabling architecture, in which the pre-computation logic is a function of all of the input variables. This architecture, shown in Figure 2-10, can reduce power dissipation for a larger class of sequential circuits.

Figure 2-10 Pre-computation logic
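
The general idea of pre-computation can be sketched with a comparator that evaluates a > b. This is a simplified behavioural illustration, not the synthesis algorithm of [25][26]: the predictor here inspects only the most significant bits for brevity, and the width, names, and load counter are assumptions made for this example.

```python
WIDTH = 8  # hypothetical operand width in bits


def msb(x, width=WIDTH):
    """Most significant bit of a `width`-bit unsigned value."""
    return (x >> (width - 1)) & 1


def precompute(a, b):
    """Predictor functions: g1 means the output is certainly 1, g0 certainly 0."""
    g1 = msb(a) == 1 and msb(b) == 0
    g0 = msb(a) == 0 and msb(b) == 1
    return g1, g0


def run(pairs):
    """Evaluate a > b per pair; return (outputs, cycles in which the full comparator was loaded)."""
    outputs, loads = [], 0
    for a, b in pairs:
        g1, g0 = precompute(a, b)
        if g1 or g0:
            outputs.append(g1)     # result known early; input register stays disabled
        else:
            loads += 1             # MSBs are equal, so the full comparator must evaluate
            outputs.append(a > b)
    return outputs, loads


if __name__ == "__main__":
    pairs = [(200, 10), (5, 130), (100, 90), (250, 251)]
    print(run(pairs))
```

Whenever g1 or g0 is true, the inputs of the main logic block are not loaded, so that block does not switch in the following cycle while the output is still correct.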

• Computational kernel

This technique also duplicates part of the original circuit. The duplicated subset logic is smaller and faster, so it dissipates less power. Most of the time the subset logic alone can perform the circuit's operation, and the original circuit is turned off.

Figure 2-11 (a) shows an example with the standard topology. The paradigm for improving its quality with respect to a given cost function is based on the architecture shown in Figure 2-11 (b). The architecture consists of the combinational portion of the original circuit (block CL), the computational kernel (block K), the selector function (block S), the double state flip-flops (DSFF), and the output multiplexers (MUX).

Figure 2-11 Computational kernel [27]

[27] presents a power optimization technique that exploits the computational kernel of a sequential circuit: a highly simplified logic block that imitates the steady-state behavior of the original specification. This block is smaller, faster, and consumes less power than the circuit from which it is extracted, and it can replace the original network for a large fraction of the operation time.

[28] presents a low-power adder for SIMD data paths. By exploiting the difference in critical-path length between the operation types (e.g., 4x8/2x16/1x32), energy-efficient SIMD adders can be developed. 8-bit adders use smaller gates and consume less energy; hence, 4x8-bit operations on an 8-bit ripple adder consume 1.8 times less energy than a 1x32-bit operation on a 32-bit adder. To reduce power dissipation, the design combines four energy-optimized 8-bit adders with one 32-bit adder to support SIMD.
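
The three operation types can be illustrated functionally as lane-wise addition on a 32-bit word. This sketch shows only the arithmetic behaviour of the 4x8, 2x16, and 1x32 modes; it does not model the adder structure or the energy figures of [28], and the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def simd_add(a, b, lane_bits):
    """Add two 32-bit words as independent unsigned lanes of `lane_bits` bits (8, 16 or 32)."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane_bits must be 8, 16 or 32")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        # Keep only the lane's own bits: the carry out of a lane is dropped,
        # so it never propagates into the neighbouring lane.
        result |= (lane_sum & lane_mask) << shift
    return result & MASK32


if __name__ == "__main__":
    a, b = 0x00FFFFFF, 0x00000001
    for bits in (8, 16, 32):
        print(f"{32 // bits}x{bits}: {simd_add(a, b, bits):#010x}")
```

In the narrow modes the carry chain is cut at every lane boundary, which is why the critical path of a 4x8 operation is much shorter than that of a 1x32 operation.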
