In synchronous circuit design, traditional optimization strategies are based on the worst case (the critical path). For a given clock period, the critical paths of the circuit must be optimized to meet that period, which usually requires considerable energy effort. This energy effort comes from gate sizing, circuit structure, and voltage.
As observed in [7][8][9], circuit delay is strongly data dependent, and a circuit exhibits its critical path delay only for very specific data sequences. This is because CMOS circuit delay equals the time needed to charge and discharge the circuit capacitances [31]. The computation time for each input pattern therefore depends on the initial state of the circuit capacitances. The same input pattern applied with different initial capacitance states activates different paths, so the computation times differ.
Hence, estimating the circuit delay or path delay requires a two-pattern sequence — the first pattern initializes the circuit while the second pattern causes and propagates the desired transition [32][33].
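The two-pattern idea can be illustrated with a minimal sketch. It assumes a toy unit-delay model of an 8-bit ripple-carry chain rather than the carry-save-array multiplier and cell library used in this work. The names `step`, `settle`, and `two_pattern_delay` are hypothetical helpers introduced only for this illustration. The first pattern fixes the internal carry state, and the delay of the second pattern is counted as the number of unit-delay steps until the carries stop changing.

```python
def to_bits(x, n):
    """Return the n low-order bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]

def step(a_bits, b_bits, carries):
    """One unit-delay step: each carry is recomputed from the previous carries."""
    new = [0]  # carry into bit 0 is fixed at 0
    for i in range(len(a_bits)):
        a, b, c = a_bits[i], b_bits[i], carries[i]
        new.append((a & b) | (c & (a ^ b)))
    return new

def settle(a, b, carries, n):
    """Apply inputs (a, b) and step until stable; return (steps, final carries)."""
    a_bits, b_bits = to_bits(a, n), to_bits(b, n)
    steps = 0
    while True:
        nxt = step(a_bits, b_bits, carries)
        if nxt == carries:
            return steps, carries
        carries = nxt
        steps += 1

def two_pattern_delay(p1, p2, n=8):
    """Delay (in unit steps) of pattern p2 when the circuit was initialized by p1."""
    _, state = settle(p1[0], p1[1], [0] * (n + 1), n)  # first pattern: initialize
    delay, _ = settle(p2[0], p2[1], state, n)          # second pattern: measure
    return delay

# The same second pattern takes a different time depending on the first pattern.
second = (0xFF, 0x01)
for first in [(0x00, 0x00), (0x0F, 0x01), (0xFF, 0x01)]:
    print(first, "->", second, "delay =", two_pattern_delay(first, second))
```

In this toy model the same second pattern settles in a different number of steps depending on the state left by the first pattern, which is the effect the two-pattern sequence is meant to capture.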
To observe the delay of CMOS circuits, we synthesized an 8-bit unsigned carry-save-array multiplier using the UMC 90nm CMOS cell library. After gate-level synthesis, we applied 10,000 random pattern sequences in gate-level simulation. Figure 3-1 shows the resulting path delay distribution of the 8-bit carry-save-array multiplier. The x-axis represents the delay of each data computation (path delay), and the y-axis represents the number of patterns.
The green line represents the path delay distribution of the multiplier. The clock period is 1.6ns, so the critical path delay of the multiplier cannot exceed the clock period. The path delay distribution resembles a normal distribution, and the probability of sensitizing the critical paths is very low.
Figure 3-1 Path delay distribution (8-bit multiplier)
For a 1.6ns clock period, the conventional design method synthesizes the circuit directly with a 1.6ns timing constraint. The path delay distribution, however, shows that the delay of most patterns is even smaller than 1.4ns. The circuit energy can therefore be optimized for the common case rather than for the few critical cases. In other words, we can relax the synthesis timing constraint to reduce energy and tolerate the few critical cases.
Next, we examine the relationship between circuit energy and the synthesis timing constraint. We use the UMC 90nm CMOS cell library and 10,000 random patterns to estimate the energy consumption (average energy per operation). Figure 3-2 shows the energy curve of the 8-bit carry-save-array multiplier under different synthesis timing constraints. The x-axis represents the synthesis timing constraint (circuit delay), and the y-axis represents the energy per multiplication.
Figure 3-2 Energy curve of 8-bit multiplier
Tightening the timing constraint of the multiplier increases the energy per multiplication. In particular, as the timing constraint approaches its tightest value (1.6ns), the energy consumption increases drastically. Even when the circuit is synthesized with a power optimization constraint, the circuit energy still decreases as the synthesis timing constraint is relaxed.
Optimizing the circuit delay requires a large energy effort. The path delay distribution shows that this effort is spent on only a few critical paths. The energy effort consists of optimizing the circuit structure and upsizing the gates, both of which reduce the circuit delay (critical path delay).
However, optimizing the circuit structure or upsizing the gates usually increases the circuit capacitance, and therefore increases the circuit energy.
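The link between capacitance and energy follows from the standard first-order model of CMOS dynamic (switching) energy, sketched below. The symbols are not used in the original text and are introduced here only for illustration: \alpha is the switching activity, C_L the switched capacitance, and V_{DD} the supply voltage.

```latex
E_{\mathrm{dyn}} \propto \alpha \, C_{L} \, V_{DD}^{2}
```

Under this model, with activity and supply voltage held fixed, any increase in switched capacitance raises the energy per operation roughly in proportion.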
Multimedia systems demand not only low energy consumption but also high speed (high performance). Relaxing the timing constraint is an effective method for energy reduction: in Figure 3-2, relaxing the timing constraint from 1.6ns to 1.9ns reduces the energy consumption by about 45%. However, it also directly degrades the clock frequency (performance).
Hence, exploiting the data-dependent delay of the circuit can deliver the energy reduction while avoiding the clock frequency degradation. For instance, for the multiplier above, if the operating clock period is kept at 1.6ns, the synthesis timing constraint can be relaxed to 1.9ns, and the energy can therefore be reduced by about 45%. Figure 3-3 shows the path delay distribution of the multiplier with a 1.9ns critical path.
Figure 3-3 Path delay distribution (8-bit multiplier)
The delay of most patterns (98.88%) is less than the 1.6ns clock period; only 1.12% of the patterns have a delay greater than 1.6ns. These few input patterns, whose computation time exceeds 1.6ns, may incur computing errors. Such errors can be detected and then corrected by a two-cycle operation, which implies a one-cycle latency penalty. The detection and correction are discussed in detail later.
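The one-cycle versus two-cycle behaviour can be sketched as a simple bookkeeping model. This is only an illustration, not the detection mechanism of this work, which is described later. The helper names `cycles_for` and `summarize` are hypothetical, and the delay values in `sample` are made up for the example; they are not simulation results.

```python
def cycles_for(delay_ns, clock_ns):
    """One cycle if the result settles within the clock period, otherwise two."""
    return 1 if delay_ns <= clock_ns else 2

def summarize(delays_ns, clock_ns):
    """Count the patterns that need the extra cycle and the total cycles spent."""
    cycles = [cycles_for(d, clock_ns) for d in delays_ns]
    late = sum(1 for c in cycles if c == 2)
    return {
        "late_fraction": late / len(delays_ns),
        "total_cycles": sum(cycles),
    }

# Hypothetical per-pattern delays in ns (illustrative values only).
sample = [1.10, 1.25, 1.32, 1.48, 1.55, 1.71, 1.20, 1.38]
result = summarize(sample, clock_ns=1.6)
print(result)
```

With real gate-level simulation delays in place of `sample`, `late_fraction` would correspond to the share of patterns that exceed the clock period.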
From Figure 3-2 and Figure 3-3, only 1.12% of the patterns have a delay greater than 1.6ns, so the probability of incurring the one-cycle latency penalty is 1.12%, and the performance penalty is negligible. If the multiplier design can tolerate these few errors (one-cycle latency penalties), the energy per multiplication can be reduced by about 45%.
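The size of the performance penalty can be checked with a short calculation using the figures quoted above. The sketch assumes that every operation is independent and that the 1.12% rate measured with random patterns also holds for the actual workload, which may not be the case for real data. The function names are hypothetical.

```python
# Numbers quoted in the text for the 8-bit multiplier.
CLOCK_NS = 1.6        # operating clock period
RELAXED_NS = 1.9      # relaxed synthesis timing constraint
ERROR_RATE = 0.0112   # fraction of patterns slower than the clock period

def average_cycles(error_rate, penalty_cycles=1):
    """Expected cycles per operation when a late pattern costs extra cycles."""
    return 1 + error_rate * penalty_cycles

def average_time_ns(clock_ns, error_rate, penalty_cycles=1):
    """Expected time per operation at the given clock period."""
    return clock_ns * average_cycles(error_rate, penalty_cycles)

avg_cycles = average_cycles(ERROR_RATE)
avg_ns = average_time_ns(CLOCK_NS, ERROR_RATE)
print(f"average cycles per operation: {avg_cycles:.4f}")
print(f"average time per operation:   {avg_ns:.3f} ns (vs {RELAXED_NS} ns clock)")
```

Under these assumptions the average cost is about 1.011 cycles per multiplication, which stays well below the 1.9ns per operation that simply slowing the clock would impose.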
To detect the computing errors and accommodate the one-cycle latency penalty, we propose a variable-latency design that can be easily integrated into other systems.