Statistical Timing-Yield Optimization via Latch Substitution ∗


Szu-Jui Chou†, Chin-Hsiung Hsu†, Jie-Hong Roland Jiang†‡, and Yao-Wen Chang†‡

†Graduate Institute of Electronics Engineering, National Taiwan University, Taipei

‡Department of Electrical Engineering, National Taiwan University, Taipei

{rerechou, arious}@eda.ee.ntu.edu.tw; [email protected]; [email protected]

Abstract

The continuous miniaturization of semiconductor devices poses serious threats to design robustness against process variations and environmental fluctuations. Modern circuit designs may suffer from uncertain delays that are not predictable in the design phase or even after manufacturing. This paper presents an optimization technique to make sequential circuits robust against delay variations and thus maximize timing yield. By trading larger flip-flops for smaller latches, the proposed approach can be used as a post-synthesis or post-layout optimization tool, allowing accurate timing information to be available. Experimental results show an average of 26% timing yield improvement, and suggest that our method is promising for high speed designs tolerating clock variations.

1 Introduction

As semiconductor fabrication technology advances into the sub-100nm feature size regime, the sensitivity of IC designs to process variations and environmental fluctuations is ever-increasing. To maintain design robustness against these uncertainties, it becomes more and more apparent that traditional design methodologies need to be modified to take variations into account at an early stage of the design flow, since not all process variations can be diminished by technology advances.

In recent years, statistical approaches to circuit analysis and optimization have been revolutionizing the EDA community. They are mostly centered around delay and power issues, the two main concerns affected by design uncertainties. In this paper, we focus on the former issue. Based on statistical timing analysis, most existing statistical optimization approaches focused on gate sizing, e.g., [2, 9, 12, 5], and clock skew scheduling, e.g., [13, 6]. Rather, we propose a new statistical optimization methodology, which is orthogonal and complementary to gate sizing and can possibly be combined with clock scheduling for further improvement. We take advantage of the transparency property of level-sensitive latches for tolerating delay uncertainties. In fact, there was other work substituting latches for flip-flops in other optimization contexts. For instance, flip-flops may be replaced with latches to optimize storage [14], or to obtain better performance while considering crosstalk [8].

However, to the best of our knowledge, there was no work done in the context of optimizing timing yield in the statistical domain.

Given a design with edge-triggered D-type flip-flop (D-FF) implementation of state-holding elements (i.e. registers), we substitute level-sensitive latches for D-FFs such that timing yield is maximally improved. Based on dynamic programming, we devise an optimal algorithm for pipelined circuits, and generalize it for arbitrary sequential circuits. Because latches are of small sizes compared with D-FFs, the substitution is possible without affecting nearby circuit structures and thus can be performed even after physical design.

Thereby, accurate timing information may be used.

We organize our explanation as follows. Section 2 gives some preliminaries of our models and the underlying timing analysis. Section 3 analyzes the effect of substituting latches for D-FFs, and formalizes our optimization objectives. Section 4 presents our algorithms, which are evaluated with experimental results in Section 5. Finally, concluding remarks are given in Section 6.

∗This work was supported in part by NSC of Taiwan under Grant Nos. NSC 94-2218-E-002-083, NSC 94-2215-E-002-030 and NSC 94-2752-E-002-008-PAE.

2 Preliminaries

2.1 Statistical Timing Models and Analysis

To simplify our exposition, in the sequel we shall assume that gates are the main delay source, like most statistical static timing analysis algorithms do. In addition, we assume a gate delay is a random variable of Gaussian distribution independent of other gate delays. (As our method is not restricted to these assumptions, other more accurate timing models can be easily integrated into our framework.) Based on this timing model, in our development we use two statistical static timing analysis approaches, Monte Carlo simulation and block-based distribution propagation [3], to justify our optimization algorithm. With these analysis tools, we are able to approximate or simulate the delay distribution of the output signals of a combinational circuit from the distributions of individual gate delays. For a sequential circuit, we may conduct similar timing analysis on its combinational blocks, each of which is a maximal connected combinational sub-circuit with primary inputs or register outputs as its inputs and with primary outputs or register inputs as its outputs. After performing timing analysis on a combinational block, we can obtain the static delay distributions of all paths. In particular, in block-based timing analysis we may derive the longest combinational-path delay distribution ∆(r_i, r_j) (respectively the shortest combinational-path delay distribution δ(r_i, r_j)) from register r_i to register r_j by Gaussian-approximating max [1] (respectively min) and sum operations over Gaussian random variables. While δ(r_i, r_j) is immaterial in combinational timing analysis, it is crucial in analyzing sequential circuits involving latches.

Note that ∆(r_i, r_j) (similarly δ(r_i, r_j)) is not the distribution of a single fixed path; rather, it may probabilistically correspond to different paths.
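As an illustration of the Gaussian-approximated max, min, and sum operations mentioned above, the following is a minimal sketch of the moment-matching formulas of Clark [1] for two Gaussian delays. The function names (`clark_max`, `clark_min`, `gaussian_sum`) and the `rho` correlation parameter are hypothetical choices for this sketch, not names from our implementation, and `gaussian_sum` assumes independent operands.

```python
import math

def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clark_max(mu1, var1, mu2, var2, rho=0.0):
    """Mean and variance of a Gaussian fitted to max(X, Y) by moment matching."""
    a2 = var1 + var2 - 2.0 * rho * math.sqrt(var1 * var2)
    if a2 <= 0.0:
        # X - Y is (numerically) constant: the larger mean always wins.
        return (mu1, var1) if mu1 >= mu2 else (mu2, var2)
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    # First and second moments of max(X, Y).
    m1 = mu1 * _cdf(alpha) + mu2 * _cdf(-alpha) + a * _pdf(alpha)
    m2 = ((mu1 ** 2 + var1) * _cdf(alpha)
          + (mu2 ** 2 + var2) * _cdf(-alpha)
          + (mu1 + mu2) * a * _pdf(alpha))
    return m1, max(m2 - m1 * m1, 0.0)

def clark_min(mu1, var1, mu2, var2, rho=0.0):
    """min(X, Y) = -max(-X, -Y), so the same formulas apply."""
    m, v = clark_max(-mu1, var1, -mu2, var2, rho)
    return -m, v

def gaussian_sum(mu1, var1, mu2, var2):
    """Sum of two independent Gaussians (assumption: zero correlation)."""
    return mu1 + mu2, var1 + var2
```

Applying `clark_max` and `gaussian_sum` along a combinational block would give an approximation of ∆(r_i, r_j), and `clark_min` with `gaussian_sum` an approximation of δ(r_i, r_j); the fitted result of a max or min is only approximately Gaussian.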

2.2 Timing Yield of Sequential Circuits

Based on timing analysis techniques for combinational circuits, we may calculate timing yield for sequential circuits. For simplicity, we shall assume all registers are triggered by a single global clock, and have zero setup time and hold time. We further assume D-FFs are positive-edge triggered. In the sequel, we assume a register is of one of the three types {D, H, L}, that is, a D-FF, an active-high latch, or an active-low latch, and denote the type of a register r by type(r). (Our discussion below can be straightforwardly extended to cases without these simplifications based on [11].) Let T = T_H + T_L be the clock period with high interval T_H and low interval T_L.

Given a design with some target operation speed, its timing yield is the probability that no violation occurs with respect to timing constraints, see e.g. [3, 4]. In the simplest case, when a circuit is implemented with D-FFs for all of its registers, its timing yield is the probability

$$\Pr\Big[\bigwedge_{(r_i, r_j)} \big(\Delta(r_i, r_j) \le T\big)\Big], \tag{1}$$

where the conjunction ranges over every register pair (r_i, r_j) with a combinational path from r_i to r_j.

[Figure 1: a single-path pipeline r0 → C1 → r1 → C2 → r2 (registers and combinational blocks), with timing diagrams (a)-(c) marking Delay(r0, r1), Delay(r1, r2), the clock period T with intervals T_H and T_L, the active interval of r1, and cases (1), (2), (1′), (2′).]

Figure 1: A single-path pipelined circuit and timing diagrams. (a) type(r0) = type(r1) = type(r2) = D; (b) type(r1) = H and type(r0) = type(r2) = D; (c) type(r1) = L and type(r0) = type(r2) = D.

For example, in Figure 1 (a), where registers r0, r1, r2 are of type D, the yield is

$$\Pr\big[(\Delta(r_0, r_1) \le T) \land (\Delta(r_1, r_2) \le T)\big]. \tag{2}$$

3 Yield & Register Configuration

3.1 Yield Changed by Latch Replacement

We study the effects of substituting latches for D-FFs. To begin with, consider the single-path pipelined circuit of Figure 1. Intuitively, an active-high latch can tolerate a longer delay of its fan-in combinational block than a D-FF. If the type of r1 is changed to H as shown in Figure 1 (b), the longest delay of combinational block C1 can exceed T. Essentially, two legal cases need to be analyzed depending on ∆(r0, r1):

case 1. T_H ≤ ∆(r0, r1) < T: The signal of C1 arrives at r1 while r1 is closed, so it must wait until r1 becomes active again at T. C2 must satisfy ∆(r1, r2) ≤ T. In addition, C1 must satisfy δ(r0, r1) > T_H; otherwise the earliest and latest signals of C1 arrive at C2 in different clock cycles.

case 2. T ≤ ∆(r0, r1) < T + T_H: The signal of C1 arrives within the active interval of r1 and passes directly to C2, so T < ∆(r0, r1) + ∆(r1, r2) ≤ 2T must hold. Also, δ(r0, r1) > T_H must hold for the same reason as in case 1.

The other cases, 0 ≤ ∆(r0, r1) < T_H and T + T_H ≤ ∆(r0, r1) < 2T, might be problematic for later stages, so we exclude them from our yield calculation. Thus, the yield equals

$$\Pr\big[(T_H \le \Delta(r_0, r_1) < T + T_H) \land (\delta(r_0, r_1) > T_H) \land (\max\{\Delta(r_0, r_1), T\} + \Delta(r_1, r_2) \le 2T)\big]. \tag{3}$$

In contrast, if the type of r1 is changed to L as shown in Figure 1 (c), two legal cases need to be analyzed depending on ∆(r0, r1):

case 1′. 0 ≤ ∆(r0, r1) < T_H: The signal of C1 arrives at r1 before r1 is turned on, so it must wait until r1 becomes active at T_H. C2 must satisfy T < T_H + ∆(r1, r2) ≤ 2T and T_H + δ(r1, r2) > T, that is, T_L < ∆(r1, r2) ≤ T + T_L and δ(r1, r2) > T_L.

case 2′. T_H ≤ ∆(r0, r1) < T: The signal of C1 arrives at r1 within the active interval and passes directly to C2, so T < ∆(r0, r1) + ∆(r1, r2) ≤ 2T and δ(r1, r2) + max{δ(r0, r1), T_H} > T must hold.

Although the other cases, T ≤ ∆(r0, r1) < T + T_H and T + T_H ≤ ∆(r0, r1) < 2T, incur no timing violation in this example, they are problematic if r1 fans out to a primary output, since the output may not latch the right value. We exclude them from our yield calculation. Thus, the yield equals

$$\Pr\big[(\Delta(r_0, r_1) < T) \land (\max\{\delta(r_0, r_1), T_H\} + \delta(r_1, r_2) > T) \land (T < \max\{\Delta(r_0, r_1), T_H\} + \Delta(r_1, r_2) \le 2T)\big]. \tag{4}$$

The above analysis forms the basis of our yield calculation. It can be extended to the analysis of general pipelined circuits. (Note that it may not be directly applicable to the analysis of cyclic sequential circuits, because cyclic delay dependencies leave block-based timing analysis without a legal starting point. However, the cyclic dependency problem does not occur in our analysis because we impose latch replacement constraints to maintain the number of pipeline stages.)
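To make the three yield expressions concrete, the sketch below estimates Eqs. (2), (3), and (4) for the single-path pipeline of Figure 1 by counting the delay samples that satisfy each condition. The names `pipeline_yields` and `sample_block`, and the way a longest and a shortest delay are drawn per block, are assumptions made only for this illustration; any delay sampler could be plugged in instead.

```python
import random

def pipeline_yields(samples, T, TH):
    """Fraction of samples satisfying Eq. (2), Eq. (3), and Eq. (4).

    Each sample is (D01, d01, D12, d12): the longest and shortest delays
    of C1 (r0 to r1) and of C2 (r1 to r2).
    """
    n = ok_d = ok_h = ok_l = 0
    for D01, d01, D12, d12 in samples:
        n += 1
        # Eq. (2): r1 is a D-FF.
        ok_d += D01 <= T and D12 <= T
        # Eq. (3): r1 is an active-high latch (type H).
        ok_h += (TH <= D01 < T + TH and d01 > TH
                 and max(D01, T) + D12 <= 2 * T)
        # Eq. (4): r1 is an active-low latch (type L).
        ok_l += (D01 < T and max(d01, TH) + d12 > T
                 and T < max(D01, TH) + D12 <= 2 * T)
    return ok_d / n, ok_h / n, ok_l / n

def sample_block(rng, mean_long, mean_short, sigma):
    """Hypothetical sampler: two Gaussian draws ordered as (longest, shortest)."""
    a = rng.gauss(mean_long, sigma)
    b = rng.gauss(mean_short, sigma)
    return max(a, b), min(a, b)

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_block(rng, 10.5, 8.0, 1.0) + sample_block(rng, 8.5, 7.0, 1.0)
             for _ in range(10000)]
    print(pipeline_yields(draws, T=10.0, TH=5.0))
```

The three returned fractions can be compared directly to decide which type of r1 gives the highest estimated yield for a given clock period.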

3.2 Problem Formulation

Definition 1 Let R be a nonempty set of registers of a sequential circuit. A register configuration of R is a total function ρ : R → {D, H, L}.

D-FFs are the most common implementation of state-holding elements of sequential circuits due to their simple edge-triggered timing constraints. We assume that a given design is in D-FF implementation initially. It is possible to change this initial register configuration while maintaining the circuit behavior. Essentially, we require that the pipeline stages remain unchanged when register configurations are modified. Thus, no two latches of the same type can be connected by a combinational path. Furthermore, even two latches of different types cannot be connected by a combinational path because the number of pipeline stages will decrease if the total number of registers cannot increase. In essence, we require

• The fan-in and fan-out registers of a latch need to be of type D.

(Note that this criterion allows us to perform block-based timing analysis for pipelined circuits.) The optimization problem is as follows.

Yield optimization problem: Given a sequential circuit with ρ(r) = D for every register r, and the distributions of its gate delays, find the register configuration such that timing yield is maximally improved subject to the above replacement criterion.
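The replacement criterion can be checked mechanically: it holds exactly when every combinational edge between two registers has at least one D-FF endpoint. The following minimal sketch assumes a register configuration stored as a dictionary and the register-to-register connections stored as (source, destination) pairs; the function name `is_legal` is hypothetical.

```python
def is_legal(config, edges):
    """True if every latch has only D-FFs as fan-in and fan-out registers.

    config: dict mapping register name to "D", "H", or "L".
    edges:  iterable of (src, dst) pairs, one per combinational connection.
    """
    return all(config[u] == "D" or config[v] == "D" for u, v in edges)

# A three-stage chain r0 -> r1 -> r2 -> r3.
chain = [("r0", "r1"), ("r1", "r2"), ("r2", "r3")]
print(is_legal({"r0": "D", "r1": "H", "r2": "D", "r3": "D"}, chain))
print(is_legal({"r0": "D", "r1": "H", "r2": "L", "r3": "D"}, chain))
```

Note that a register with a combinational self-loop is its own fan-in, so under this check it can only remain a D-FF.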

4 Statistical Latch Replacement

4.1 Optimization Flow Overview

The flow of our algorithm is shown in Figure 2. First, the input circuit is converted to a register dependency graph, with statistical timing models and analysis used to abstract the essential timing information. Second, all cycles of the register dependency graph are broken with respect to a chosen minimal feedback vertex set. Third, the resultant acyclic graph is levelized in topological order from inputs to outputs. Fourth, our statistical dynamic programming algorithm is run forward over the levelized acyclic graph. The optimal configuration can then be derived by tracing backward from outputs to inputs. Finally, Monte Carlo simulation can optionally be applied to justify the yield improvement.

4.2 Preprocessing Steps

Conversion from Circuit to Register Dependency Graph. We abstract a given input circuit C with a register dependency graph G = (V, E), where a vertex v_i ∈ V represents a register r_i in C and there is a directed edge (v_i, v_j) ∈ E if and only if there is a combinational path from r_i to r_j in C. (To simplify our discussion, we view any primary input as the output of a D-FF, and any primary output as the input of a D-FF.) Also, the register-to-register distributions ∆(r_i, r_j) and δ(r_i, r_j) are computed according to the delay distributions of C, and are associated with the corresponding edge (v_i, v_j) ∈ E.

[Figure 2 flowchart boxes: Start; Circuit; Library; Graph conversion; Acyclic graph? (Yes / No); Cycle breaking; Statistical dynamic programming; Register configuration; Estimated yield improvement; Monte Carlo justification; End.]

Figure 2: The flowchart of statistical latch replacement.

Cycle Breaking. If a circuit has feedback, there will be cycles in the converted graph. In order to levelize the register dependency graph, we break all cycles by finding a minimal feedback vertex set (FVS). (Finding a minimum feedback vertex set is NP-hard. Also, even if a minimum FVS is found, its vertices are not necessarily the best breaking points for our problem. Thus we resort to a heuristic approach [7].)
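For illustration only, the sketch below shows one simple way to obtain a feedback vertex set greedily: vertices that cannot lie on a cycle (no fan-in or no fan-out) are trimmed, and then the vertex with the largest in-degree times out-degree is removed, until nothing is left. This is not the contraction-based algorithm of [7]; the function name `greedy_fvs` and the degree-product rule are assumptions of this sketch, and the result is a valid but not necessarily minimal FVS.

```python
def greedy_fvs(vertices, edges):
    """Return a list of vertices whose removal leaves the graph acyclic."""
    succ = {v: set() for v in vertices}
    pred = {v: set() for v in vertices}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    alive = set(vertices)

    def drop(v):
        # Remove v and all edges touching it.
        alive.discard(v)
        for w in succ.pop(v):
            pred[w].discard(v)
        for w in pred.pop(v):
            succ[w].discard(v)

    fvs = []
    while alive:
        # Trim vertices that cannot be on any cycle.
        trimmed = True
        while trimmed:
            trimmed = False
            for v in list(alive):
                if not succ[v] or not pred[v]:
                    drop(v)
                    trimmed = True
        if alive:
            # Every remaining vertex has fan-in and fan-out; pick the busiest.
            v = max(sorted(alive), key=lambda x: len(succ[x]) * len(pred[x]))
            fvs.append(v)
            drop(v)
    return fvs
```

Sorting before taking the maximum only makes tie-breaking deterministic; a different tie-breaking rule could yield a different, equally valid set.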

Levelization. After making a register dependency graph acyclic, we levelize it in a topological order such that each vertex is labelled with the longest distance from an input vertex.
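The levelization step can be sketched as a longest-path labelling computed in topological order. The sketch below assumes the graph is already acyclic and labels input vertices (those without fan-in) with level 1; the function name `levelize` is hypothetical.

```python
from collections import deque

def levelize(vertices, edges):
    """Label each vertex with its longest distance (in levels) from an input vertex."""
    succ = {v: [] for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    # Input vertices have no fan-in and sit at level 1.
    level = {v: 1 for v in vertices if indeg[v] == 0}
    queue = deque(level)
    done = 0
    while queue:
        u = queue.popleft()
        done += 1
        for v in succ[u]:
            level[v] = max(level.get(v, 1), level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if done != len(indeg):
        raise ValueError("graph still contains a cycle")
    return level
```

A vertex is finalized only after all of its predecessors have been processed, so its label is the maximum over all incoming paths.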

4.3 Statistical Dynamic Programming

4.3.1 Algorithm Overview

Given an acyclic register dependency graph, we derive a register configuration with maximal timing yield by the statistical dynamic programming algorithm outlined in Figure 3.

Algorithm: StatisticalDynamicProgramming
input: levelized register dependency graph G = (V, E) and delay distributions on E
output: optimal register configuration for yield
begin
    set level-1 registers to D-FFs with local yield 1
    ℓ := level-count(G)
    for i = 2, ..., ℓ
        let R_i be the set of registers at level i
        for every register configuration α of R_i
            compute the highest local yield Y_α of α subject to the configurations of R_{i−1} and their local yields
            record the config. of R_{i−1} responsible for Y_α
    set R_ℓ to the config. β_ℓ of all D-FFs
    for i := ℓ − 1, ℓ − 2, ..., 2
        set R_i to the config. β_i responsible for β_{i+1}
    return β's
end

Figure 3: The Statistical Dynamic Programming Algorithm.

Recall that we add artificial D-FFs at the primary inputs and outputs when converting a circuit to a register dependency graph. Hence we set level-1 and level-ℓ registers to be of type D, where ℓ is the number of levels in the levelized register dependency graph.

In addition, we define the local yield of a register to be the accumulated yield computed forward from level-1 registers, each having local yield 1. The statistical dynamic programming algorithm computes and stores the optimal configurations and the corresponding local yields in a forward direction based on the timing analysis introduced in Section 3. Note that other timing analysis algorithms may be easily incorporated into our optimization framework for latch replacement.

[Figure 4: three panels (a), (b), (c), each showing registers r0, r1, r2 with candidate types D, H, L; panels (b) and (c) are marked "Conflict".]

Figure 4: Some examples of conflicts. (a) is a legal case; (b) and (c) are conflict cases.
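The forward pass and backward trace of Figure 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: edges are assumed to connect consecutive levels only, the timing engine is abstracted as a hypothetical callback `stage_yield(i, prev_cfg, cur_cfg)` returning a value in [0, 1], and local yields are accumulated by multiplication, which ignores the cross-stage dependencies visible in Eqs. (3) and (4).

```python
from itertools import product

TYPES = ("D", "H", "L")

def best_configuration(levels, edges, stage_yield):
    """levels: list of register-name lists, level 1 first.
    edges: set of (src, dst) pairs between consecutive levels (assumption).
    Returns (estimated yield, list of {register: type} dicts per level)."""
    last = len(levels) - 1

    def options(i):
        # First and last levels are forced to D-FFs.
        choices = ("D",) if i in (0, last) else TYPES
        return list(product(choices, repeat=len(levels[i])))

    def legal(i, prev, cur):
        # Replacement criterion: every edge needs a D-FF endpoint.
        kind = dict(zip(levels[i - 1], prev))
        kind.update(zip(levels[i], cur))
        return all("D" in (kind[u], kind[v])
                   for u, v in edges if u in kind and v in kind)

    # table[i][cfg] = (best local yield, responsible config of level i-1)
    table = [{cfg: (1.0, None) for cfg in options(0)}]
    for i in range(1, len(levels)):
        row = {}
        for cur in options(i):
            scored = [(y * stage_yield(i, prev, cur), prev)
                      for prev, (y, _) in table[i - 1].items()
                      if legal(i, prev, cur)]
            if scored:
                row[cur] = max(scored, key=lambda s: s[0])
        table.append(row)

    # Backward trace from the all-D last level.
    cfg = max(table[last], key=lambda c: table[last][c][0])
    total = table[last][cfg][0]
    chosen = [cfg]
    for i in range(last, 0, -1):
        cfg = table[i][cfg][1]
        chosen.append(cfg)
    chosen.reverse()
    return total, [dict(zip(regs, c)) for regs, c in zip(levels, chosen)]
```

Because each level enumerates every assignment of its registers, the table grows exponentially with the level width, which is the cost analyzed in Section 4.3.2.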

4.3.2 Analysis

Optimality. The dynamic programming algorithm is optimal with respect to the levelized register dependency graph and its corresponding edge delay distributions. However, we may lose optimality due to the preprocessing steps, more precisely, in deriving the delay distributions of edges of the register graph and in cycle breaking.

Complexity. The computational complexity of the overall optimization flow is dominated by the statistical dynamic programming. Suppose, in a levelized register dependency graph, the pipeline widths are upper bounded by w. The statistical dynamic programming invokes O(ℓ · 3^w) function calls to the timing analysis engine. Due to this potential exponential overhead, we resort to some heuristics in our implementation.
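To give a feel for this growth, consider hypothetical values ℓ = 5 with w = 10 and w = 20 (chosen only for illustration, and ignoring the constant hidden by the O-notation):

```latex
\[
  \ell \cdot 3^{w} = 5 \cdot 3^{10} = 5 \cdot 59049 = 295245,
  \qquad
  \ell \cdot 3^{w} = 5 \cdot 3^{20} = 5 \cdot 3486784401 = 17433922005 .
\]
```

Doubling the width from 10 to 20 thus multiplies the number of configurations to evaluate by 3^10 = 59049, which is why exhaustive enumeration per level is practical only for narrow pipelines.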

4.4 Implementation Issues

4.4.1 Large Pipeline Width

For a register dependency graph with large pipeline widths, statistical dynamic programming becomes inefficient since it considers all possible configurations for registers at each level. We alleviate this problem by greedily optimizing one register at a time without considering the configurations of other registers at the same level. Thus, we may need to handle the consistency problem for conflicting register type assignments.

When dealing with a combinational block with multiple fan-in registers, we maintain the correlations among all fan-in registers and propagate distributions to the combinational block. On the other hand, when dealing with a combinational block with multiple fan-out registers, each fan-out is computed independently while the correlations among them are kept. Note that because we only consider one register at a time, the result may differ from the global optimum. It is a tradeoff between optimality and efficiency.

4.4.2 Consistency

The consistency problem occurs when greedy approaches are used in the statistical dynamic programming algorithm. When the algorithm enters the second phase to trace out an optimal register configuration, conflicting optimal configurations for different registers show up. Whether a conflict happens on a register depends on the configurations of its multiple fan-out registers. For example, as shown in Figure 4 (a), no conflict occurs since D is not the best configuration for both registers on the fan-out side. However, it is not the case as shown in Figure 4 (b) and 4 (c). Because this situation is inevitable, we propose a method to control the consistency in dynamic programming.

Suppose register r has multiple fan-out registers. When the best configurations of the fan-out registers require r to have different types, a conflict arises in backward tracing. One straightforward solution is to calculate local yields for all possible configurations, but it would take exponential time and space in the number of fan-out registers.


Circuit  # of pipeline  Clock    # of total  # of replaced  Original   Final      Yield            CPU
         stages         period   registers   registers      yield (%)  yield (%)  improvement (%)  time (s)
c432     5              8.58     214         28             63.2       97.2       34.0             0.20
c499     5              9.34     186         8              59.3       100.0      40.7             0.11
c880     5              7.74     242         16             62.7       98.7       36.0             0.13
c1355    5              10.18    218         10             62.3       99.8       37.5             0.16
c1908    5              14.26    240         19             64.0       98.1       34.1             0.19
c2540    5              11.96    278         62             62.3       93.9       31.6             0.40
c7552    5              12.12    879         69             63.5       99.9       36.4             0.68
s1196    -              53.54    18          4              59.7       62.4       2.7              0.05
s5378    -              52.98    179         10             61.1       65.2       4.1              0.45
s9234    -              118.86   211         8              57.8       59.3       1.5              0.89
Average                                                                           25.9             0.33

Table 1: ISCAS85 benchmark circuits with 20% delay deviation.

Instead of keeping all configurations, we propose a data structure that takes only linear time and space in the fan-out number of a register to keep consistent optimal solutions such that no conflict occurs during backward tracing. When one fan-out register is considered at a time, there exist redundant results. Based on this observation, we give a key to each distinct combination of r and one of its fan-out registers. These keys are distinct prime numbers. If r has n fan-out registers, we need just 3n keys, which is linear in the fan-out number. We represent different configurations by multiplying the corresponding keys. Because all keys are prime numbers, we can easily distinguish the configurations by factoring the resulting number.
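One possible reading of this prime-key encoding is sketched below: each (fan-out register, type of r) pair receives its own prime, a set of pairs is stored as the product of their primes, and the set is recovered by testing divisibility. The pairing of keys with (fan-out, type) combinations and all function names are assumptions of this sketch rather than a description of our data structure.

```python
def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    cand = 2
    while len(primes) < count:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes

def make_keys(fanouts):
    """Assign a distinct prime to every (fan-out register, type) pair: 3n keys."""
    primes = iter(first_primes(3 * len(fanouts)))
    return {(f, t): next(primes) for f in fanouts for t in "DHL"}

def encode(keys, choices):
    """Multiply the keys of the selected (fan-out, type) pairs."""
    code = 1
    for choice in choices:
        code *= keys[choice]
    return code

def decode(keys, code):
    """Recover the selected pairs: a key is present iff it divides the code."""
    return {choice for choice, p in keys.items() if code % p == 0}

keys = make_keys(["r1", "r2"])
code = encode(keys, [("r1", "D"), ("r2", "H")])
print(code, decode(keys, code))
```

Unique factorization guarantees that decoding is unambiguous, and Python integers do not overflow, although the product grows with the number of selected keys.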

5 Experimental Results

We have implemented our algorithm in the C++ language. The experiments were conducted on a Linux machine with a Pentium IV 3.2GHz CPU and 3GB memory. Two sets of circuits are used: pipeline circuits and general sequential circuits, all from the ISCAS benchmark suites. The pipeline circuits are generated from combinational circuits by adding 4- to 5-stage pipelines and then retimed by SIS [10]. By SIS technology mapping with a library, the delay information can be obtained from the lookup table. In addition, the circuits are synthesized to balance long and short combinational paths. All delay variations are normally distributed with 20% deviation.

Table 1 shows the results for 20% delay deviations. The clock period, shown in the third column, is determined by requiring the timing yield of the original circuit to fall between 60% and 70%. The yield improvements, shown in the eighth column, are justified with Monte Carlo simulation. The CPU times shown in the ninth column do not include Monte Carlo simulation. From Table 1, we note that our method achieves substantial yield improvement mostly when the longest and shortest delays differ only slightly. For cyclic sequential circuits, such as s1196, s5378, and s9234, our approach yields only small improvements because their register dependency graphs are close to complete graphs, making latch replacement almost impossible.

6 Conclusions and Future Work

Based on statistical timing analysis, we proposed an algorithm to optimize the timing yield of a sequential circuit. Experimental results show that, by substituting latches for D-FFs, timing yield can be improved by about 26% on average. In addition, the results suggest that latch replacement tends to tolerate clock variations. Complementary to other design-for-yield methodologies like gate sizing and clock skew scheduling, our technique may be combined with these techniques for further improvement. Since most circuits use D-FFs for register implementation, our approach can be widely applicable to circuit designs. Since replacing D-FFs with latches incurs no area penalty, the proposed algorithm can be used for not only pre-layout but also post-layout optimization, where accurate timing information is available.

We made certain assumptions to simplify our development. As future work, we would like to relax our assumptions to handle multiple-phase clocking schemes, which may lead to further yield improvement. Also, we neglected the setup time and hold time constraints, and the correlation between longest and shortest path delay distributions. Even though our current method is accurate enough for optimization, we may obtain higher accuracy by adding these considerations to our framework.

References

[1] C. E. Clark. The greatest of a finite set of random variables. Operations Research, vol. 9, no. 2, pp. 145-162, 1961.

[2] S.-H. Choi, B. Paul, and K. Roy. Novel sizing algorithm for yield improvement under process variation in nanometer technology. In Proc. Design Automation Conf., 2004.

[3] C.-T. Chao, L.-C. Wang, K.-T. Cheng, and S. Kundu. Static statistical timing analysis for latch-based pipeline designs. In Proc. Int’l Conf. on Computer-Aided Design, 2004.

[4] R. Chen and H. Zhou. Clock schedule verification under process variations. In Proc. Int’l Conf. on Computer-Aided Design, 2004.

[5] M. Guthaus, N. Venkateswaran, C. Visweswariah, and V. Zolotov. Gate sizing using incremental parameterized statistical timing analysis. In Proc. Int’l Conf. on Computer-Aided Design, 2005.

[6] A. Hurst and R. Brayton. Computing clock skew schedules under normal process variation. In Proc. Int’l Workshop on Logic and Synthesis, 2005.

[7] H.-M. Lin and J.-Y. Jou. On computing the minimum feedback vertex set of a directed graph by contraction operations. IEEE Transactions on Computer-Aided Design, vol. 19, no. 3, 2000.

[8] C. Lin and H. Zhou. Trade-off between latch and flop for min-period sequential circuit designs with crosstalk. In Proc. Int’l Conf. on Computer-Aided Design, 2005.

[9] S. Raj, S. Vrudhula, and J. Wang. A methodology to improve timing yield in the presence of process variations. In Proc. Design Automation Conference, pp. 448-453, 2004.

[10] E. M. Sentovich et al. SIS: a system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, Univ. of California, Berkeley, 1992.

[11] K. Sakallah, T. Mudge, and O. Olukotun. checkTc and minTc: Timing verification and optimal clocking of synchronous digital circuits. In Proc. Int’l Conf. on Computer-Aided Design, pp. 552-555, 1990.

[12] D. Sinha, N. Shenoy, and H. Zhou. Statistical gate sizing for timing yield optimization. In Proc. Int’l Conf. on Computer-Aided Design, 2005.

[13] J.-L. Tsai, D. Baik, C.-P. Chen, and K. Saluja. A yield improvement methodology using pre- and post-silicon statistical clock scheduling. In Proc. Int’l Conf. on Computer-Aided Design, pp. 611-618, 2004.

[14] T.-Y. Wu and Y.-L. Lin. Storage optimization by replacing some flip-flops with latches. In Proc. Design Automation Conference, 1996.