
Statistical Timing-Yield Optimization via Latch Substitution

Szu-Jui Chou, Chin-Hsiung Hsu, Jie-Hong Roland Jiang†‡, and Yao-Wen Chang†‡

Graduate Institute of Electronics Engineering, National Taiwan University, Taipei

Department of Electrical Engineering, National Taiwan University, Taipei

{rerechou, arious}@eda.ee.ntu.edu.tw; [email protected]; [email protected]

Abstract

The continuous miniaturization of semiconductor devices poses serious threats to design robustness against process variations and environmental fluctuations. Modern circuit designs may suffer from uncertain delays that are not predictable in the design phase or even after manufacturing. This paper presents an optimization technique to make sequential circuits robust against delay variations and thus maximize timing yield. By trading larger flip-flops for smaller latches, the proposed approach can be used as a post-synthesis or post-layout optimization tool, allowing accurate timing information to be available. Experimental results show an average of 26% timing yield improvement, and suggest that our method is promising for high-speed designs tolerating clock variations.

1 Introduction

As semiconductor fabrication technology advances to the sub-100nm feature size regime, sensitivities of IC designs to process variations and environmental fluctuations are ever-increasing. To maintain design robustness against these uncertainties, it becomes more and more apparent that traditional design methodologies need to be modified to take variations into account at the early stage of a design flow, since not all process variations can be diminished with technology advances.

In recent years, statistical approaches to circuit analysis and optimization have been revolutionizing the EDA community. They are mostly centered around delay and power issues, the two main concerns affected by design uncertainties. In this paper, we focus on the former issue. Based on statistical timing analysis, most existing statistical optimization approaches have focused on gate sizing, e.g., [2, 9, 12, 5], and clock skew scheduling, e.g., [13, 6]. In contrast, we propose a new statistical optimization methodology, which is orthogonal and complementary to gate sizing and can possibly be combined with clock scheduling for further improvement. We take advantage of the transparency property of level-sensitive latches for tolerating delay uncertainties. Other work has substituted latches for flip-flops in other optimization contexts. For instance, flip-flops may be replaced with latches to optimize storage [14], or to obtain better performance while considering crosstalk [8]. However, to the best of our knowledge, no work has been done in the context of optimizing timing yield in the statistical domain.

Given a design with an edge-triggered D-type flip-flop (D-FF) implementation of state-holding elements (i.e., registers), we substitute level-sensitive latches for D-FFs such that timing yield is maximally improved. Based on dynamic programming, we devise an optimal algorithm for pipelined circuits, and generalize it to arbitrary sequential circuits. Because latches are smaller than D-FFs, the substitution is possible without affecting nearby circuit structures and thus can be performed even after physical design. Thereby, accurate timing information may be used.

We organize our explanation as follows. Section 2 gives some preliminaries of our models and the underlying timing analysis. Section 3 analyzes the effect of substituting latches for D-FFs, and formalizes our optimization objectives. Section 4 presents our algorithms, which are evaluated with experimental results in Section 5. Finally, concluding remarks are given in Section 6.

This work was supported in part by NSC of Taiwan under Grant Nos. NSC 94-2218-E-002-083, NSC 94-2215-E-002-030 and NSC 94-2752-E-002-008-PAE.

2 Preliminaries

2.1 Statistical Timing Models and Analysis

To simplify our exposition, in the sequel we shall assume that gates are the main delay source, as most statistical static timing analysis algorithms do. In addition, we assume a gate delay is a Gaussian random variable independent of other gate delays. (As our method is not restricted to these assumptions, other more accurate timing models can be easily integrated into our framework.)

Based on this timing model, in our development we use two statistical static timing analysis approaches, Monte Carlo simulation and block-based distribution propagation [3], to justify our optimization algorithm. With these analysis tools, we are able to approximate or simulate the delay distribution of the output signals of a combinational circuit from the distributions of individual gate delays. For a sequential circuit, we may conduct similar timing analysis on its combinational blocks, each of which is a maximal connected combinational sub-circuit with primary inputs or register outputs as its inputs and with primary outputs or register inputs as its outputs. After performing timing analysis on a combinational block, we can obtain the static delay distributions of all paths. In particular, in block-based timing analysis we may derive the longest combinational-path delay distribution $\Delta(r_i, r_j)$ (respectively the shortest combinational-path delay distribution $\delta(r_i, r_j)$) from register $r_i$ to register $r_j$ by Gaussian-approximating max [1] (respectively min) and sum operations over Gaussian random variables. While $\delta(r_i, r_j)$ is immaterial in combinational timing analysis, it is crucial in analyzing sequential circuits involving latches.

Note that $\Delta(r_i, r_j)$ (similarly $\delta(r_i, r_j)$) is not the distribution of some single fixed path; rather, it may probabilistically correspond to different paths.
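The Gaussian approximation of max and min mentioned above can be illustrated with the moment-matching formulas of Clark [1] for two jointly Gaussian variables. The following is a minimal sketch, not the authors' implementation; the function names `clark_max` and `clark_min` and the correlation parameter `rho` are assumptions made for illustration.

```python
import math

def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clark_max(mu1, var1, mu2, var2, rho=0.0):
    """Mean and variance of a Gaussian matched to max(X1, X2) (Clark's moments)."""
    a2 = var1 + var2 - 2.0 * rho * math.sqrt(var1 * var2)
    if a2 <= 0.0:
        # X1 - X2 is deterministic, so the max is simply the larger-mean variable.
        return (mu1, var1) if mu1 >= mu2 else (mu2, var2)
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    first = mu1 * _cdf(alpha) + mu2 * _cdf(-alpha) + a * _pdf(alpha)
    second = ((mu1 ** 2 + var1) * _cdf(alpha)
              + (mu2 ** 2 + var2) * _cdf(-alpha)
              + (mu1 + mu2) * a * _pdf(alpha))
    return first, max(second - first * first, 0.0)

def clark_min(mu1, var1, mu2, var2, rho=0.0):
    """min(X1, X2) = -max(-X1, -X2), so reuse clark_max on the negated means."""
    mean, var = clark_max(-mu1, var1, -mu2, var2, rho)
    return -mean, var
```

The sum of two independent Gaussian delays needs no approximation (means and variances add), so repeated sum, `clark_max`, and `clark_min` steps are enough to propagate approximate distributions through a combinational block under the independence assumption stated above.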

2.2 Timing Yield of Sequential Circuits

Based on timing analysis techniques for combinational circuits, we may calculate the timing yield of sequential circuits. For simplicity, we shall assume all registers are triggered by a single global clock and have zero setup and hold times. We further assume D-FFs are positive-edge triggered. In the sequel, we assume a register is of one of the three types {D, H, L}, that is, a D-FF, an active-high latch, or an active-low latch, and denote the type of a register $r$ by type($r$). (Our discussion below can be straightforwardly extended to cases without these simplifications based on [11].) Let $T = T_H + T_L$ be the clock period with high interval $T_H$ and low interval $T_L$.

Given a design with some target operation speed, its timing yield is the probability that no violation occurs with respect to timing constraints; see, e.g., [3, 4]. In the simplest case, when a circuit is implemented with D-FFs for all of its registers, its timing yield is the probability

$$\Pr\Big[\bigwedge_{(r_i, r_j)} \big(\Delta(r_i, r_j) \le T\big)\Big], \tag{1}$$

where the conjunction ranges over every register pair $(r_i, r_j)$ with a combinational path from $r_i$ to $r_j$.

Figure 1: A single-path pipelined circuit (registers $r_0$, $r_1$, $r_2$ and combinational blocks $C_1$, $C_2$) and timing diagrams. (a) type($r_0$) = type($r_1$) = type($r_2$) = D; (b) type($r_1$) = H and type($r_0$) = type($r_2$) = D; (c) type($r_1$) = L and type($r_0$) = type($r_2$) = D.

For example, in Figure 1 (a), where registers $r_0$, $r_1$, $r_2$ are of type D, the yield is

$$\Pr\big[(\Delta(r_0, r_1) \le T) \wedge (\Delta(r_1, r_2) \le T)\big]. \tag{2}$$

3 Yield & Register Configuration

3.1 Yield Changed by Latch Replacement

We study the effects of substituting latches for D-FFs. To begin with, consider the single-path pipelined circuit of Figure 1. Intuitively, an active-high latch can tolerate a longer delay of its fan-in combinational block than a D-FF can. If the type of $r_1$ is changed to H as shown in Figure 1 (b), the longest delay of combinational block $C_1$ can exceed $T$. Essentially, two legal cases need to be analyzed depending on $\Delta(r_0, r_1)$:

Case 1, $T_H \le \Delta(r_0, r_1) < T$: The signal of $C_1$ arrives at $r_1$ before $r_1$ is turned on, so it must wait until $r_1$ is active again at $T$. $C_2$ must satisfy $\Delta(r_1, r_2) \le T$. In addition, $C_1$ must satisfy $\delta(r_0, r_1) > T_H$; otherwise the earliest and latest signals of $C_1$ arrive at $C_2$ in different clock cycles.

Case 2, $T \le \Delta(r_0, r_1) < T + T_H$: The delay of $C_1$ falls in the active interval of $r_1$ and the signal can pass directly to $C_2$, so $T < \Delta(r_0, r_1) + \Delta(r_1, r_2) \le 2T$ must hold. Also, $\delta(r_0, r_1) > T_H$ must hold for the same reason as in case 1.

The other cases, $0 \le \Delta(r_0, r_1) < T_H$ and $T + T_H \le \Delta(r_0, r_1) < 2T$, might be problematic for the later stages, so we exclude them from our yield calculation. Thus, the yield equals

$$\Pr\big[(T_H \le \Delta(r_0, r_1) < T + T_H) \wedge (\delta(r_0, r_1) > T_H) \wedge (\max\{\Delta(r_0, r_1), T\} + \Delta(r_1, r_2) \le 2T)\big]. \tag{3}$$

In contrast, if the type of $r_1$ is changed to L as shown in Figure 1 (c), two legal cases need to be analyzed depending on $\Delta(r_0, r_1)$:

Case 1′, $0 \le \Delta(r_0, r_1) < T_H$: The signal of $C_1$ arrives at $r_1$ before $r_1$ is turned on, so it must wait until $r_1$ is active again at $T_H$. $C_2$ must satisfy $T < T_H + \Delta(r_1, r_2) \le 2T$ and $T_H + \delta(r_1, r_2) > T$, that is, $T_L < \Delta(r_1, r_2) \le T + T_L$ and $\delta(r_1, r_2) > T_L$.

Case 2′, $T_H \le \Delta(r_0, r_1) < T$: The signal of $C_1$ arrives at $r_1$ within the active interval and can pass directly to $C_2$, so $T < \Delta(r_0, r_1) + \Delta(r_1, r_2) \le 2T$ and $\delta(r_1, r_2) + \max\{\delta(r_0, r_1), T_H\} > T$ must hold.

Although the other cases, $T \le \Delta(r_0, r_1) < T + T_H$ and $T + T_H \le \Delta(r_0, r_1) < 2T$, incur no timing violation in this example, they are problematic if $r_1$ fans out to a primary output, since it may not latch the right value. We exclude them from our yield calculation. Thus, the yield equals

$$\Pr\big[(\Delta(r_0, r_1) < T) \wedge (\max\{\delta(r_0, r_1), T_H\} + \delta(r_1, r_2) > T) \wedge (T < \max\{\Delta(r_0, r_1), T_H\} + \Delta(r_1, r_2) \le 2T)\big]. \tag{4}$$

The above analysis forms the basis of our yield calculation. It can be extended to the analysis of general pipelined circuits. (Note that it may not be directly applicable to the analysis of cyclic sequential circuits, because cyclic delay dependencies leave the block-based timing analysis without a legal starting point. However, the cyclic dependency problem does not occur in our analysis, because we need to impose some latch replacement constraints to maintain the number of pipeline stages.)
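The three yield expressions (2), (3), and (4) for the circuit of Figure 1 can be estimated by Monte Carlo sampling. The sketch below is illustrative only: it assumes each combinational block is a set of independent paths with Gaussian delays given as hypothetical `(mean, sigma)` pairs, takes the longest and shortest sampled path as $\Delta$ and $\delta$, and clamps negative samples to zero. None of these modeling choices come from the paper.

```python
import random

def sample_block(paths, rng):
    """Sample (longest, shortest) delay of a block from independent Gaussian paths."""
    delays = [max(0.0, rng.gauss(mu, sigma)) for mu, sigma in paths]
    return max(delays), min(delays)

def ok_all_d(D01, d01, D12, d12, T, TH):
    """Eq. (2): r0, r1, r2 are all D-FFs."""
    return D01 <= T and D12 <= T

def ok_r1_high(D01, d01, D12, d12, T, TH):
    """Eq. (3): r1 is an active-high latch (type H)."""
    return (TH <= D01 < T + TH
            and d01 > TH
            and max(D01, T) + D12 <= 2 * T)

def ok_r1_low(D01, d01, D12, d12, T, TH):
    """Eq. (4): r1 is an active-low latch (type L)."""
    return (D01 < T
            and max(d01, TH) + d12 > T
            and T < max(D01, TH) + D12 <= 2 * T)

def timing_yield(check, c1_paths, c2_paths, T, TH, samples=20000, seed=0):
    """Fraction of sampled delay assignments that satisfy the given constraint."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(samples):
        D01, d01 = sample_block(c1_paths, rng)
        D12, d12 = sample_block(c2_paths, rng)
        passed += check(D01, d01, D12, d12, T, TH)
    return passed / samples
```

Calling `timing_yield` once per checker with the same path lists, `T`, and `TH` gives three comparable estimates, which is the comparison the analysis above performs symbolically.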

3.2 Problem Formulation

Definition 1 Let R be a nonempty set of registers of a sequential circuit. A register configuration of R is a total function ρ : R → {D, H, L}.

D-FFs are the most common implementation of state-holding elements of sequential circuits due to their simple edge-triggered timing constraints. We assume that a given design is in D-FF implementation initially. It is possible to change this initial register configuration while maintaining the circuit behavior. Essentially, we require that pipeline stages not be changed by modifying register configurations. Thus, no two latches of the same type can be connected by a combinational path. Furthermore, even two latches of different types cannot be connected by a combinational path, because the number of pipeline stages will decrease if the total number of registers cannot increase. In essence, we require

• The fan-in and fan-out registers of a latch need to be of type D.

(Note that this criterion allows us to perform block-based timing analysis for pipelined circuits.) The optimization problem is as follows.

Yield optimization problem: Given a sequential circuit with $\rho(r) = D$ for every register $r$, and the distributions of its gate delays, find the register configuration such that timing yield is maximally improved subject to the above replacement criterion.
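The replacement criterion amounts to forbidding any combinational path between two latches. A minimal sketch of such a check follows; the function name `is_legal` and the dictionary and edge-list representation of the register configuration $\rho$ are assumptions made for illustration.

```python
def is_legal(config, edges):
    """Check the replacement criterion.

    config: dict mapping register name -> 'D', 'H', or 'L' (the configuration rho).
    edges:  iterable of (ri, rj) pairs, one per combinational path from ri to rj.
    A latch must have only D-FFs as fan-in and fan-out registers, which is
    equivalent to requiring that no edge joins two non-D registers.
    """
    for ri, rj in edges:
        if config[ri] != "D" and config[rj] != "D":
            return False
    return True
```

For the pipeline of Figure 1, the configurations of panels (a), (b), and (c) all pass this check, while making both $r_1$ and $r_2$ latches does not.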

4 Statistical Latch Replacement

4.1 Optimization Flow Overview

The flow of our algorithm is shown in Figure 2. Firstly, the input circuit is converted to a register dependency graph, with statistical timing models and analysis used to abstract the essential timing information. Secondly, the register dependency graph is made acyclic by breaking all cycles with respect to a chosen minimal feedback vertex set. Thirdly, the resultant acyclic graph is levelized in topological order from inputs to outputs. Fourthly, our statistical dynamic programming algorithm is conducted forward over the levelized acyclic graph. The optimal configuration can then be derived by tracing backward from outputs to inputs. Finally, Monte Carlo simulation can optionally be applied to justify the yield improvement.

4.2 Preprocessing Steps

Conversion from Circuit to Register Dependency Graph. We abstract a given input circuit $C$ with a register dependency graph $G = (V, E)$, where a vertex $v_i \in V$ represents a register $r_i$ in $C$ and there is a directed edge $(v_i, v_j) \in E$ if and only if there is a combinational path from $r_i$ to $r_j$ in $C$. (To simplify our discussion, we view any primary input as the output of a D-FF, and any primary output as the input of a D-FF.) Also, the register-to-register distributions $\Delta(r_i, r_j)$ and $\delta(r_i, r_j)$ are computed according to the delay distributions of $C$, and are associated with the corresponding edge $(v_i, v_j) \in E$.

Figure 2: The flowchart of statistical latch replacement. (Boxes: Start; Graph conversion; Acyclic graph? with Yes/No branches; Cycle breaking; Statistical dynamic programming; Register configuration; Monte Carlo justification; End. Inputs: Circuit, Library. Output: Estimated yield improvement.)

Cycle Breaking. If a circuit has feedback, there will be cycles in the converted graph. In order to levelize the register dependency graph, we break all cycles by finding a minimal feedback vertex set (FVS). (Finding an exact minimum feedback vertex set is NP-complete. Also, even if the minimum FVS is found, its vertices are not necessarily the best breaking points for our problem. Thus we resort to a heuristic approach [7].)

Levelization. After making a register dependency graph acyclic, we levelize it in a topological order such that each vertex is labelled with the longest distance from an input vertex.
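The levelization step can be sketched as a longest-distance labelling computed in topological order. The code below is an illustrative sketch rather than the authors' implementation; it assumes the graph has already been made acyclic, numbers input vertices as level 1 to match the level-1 registers used later, and uses the hypothetical name `levelize`.

```python
from collections import defaultdict, deque

def levelize(vertices, edges):
    """Label each vertex with its longest distance from an input vertex (inputs = 1)."""
    successors = defaultdict(list)
    indegree = {v: 0 for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Vertices without predecessors are the input vertices.
    level = {v: 1 for v in vertices if indegree[v] == 0}
    queue = deque(level)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v in successors[u]:
            level[v] = max(level.get(v, 1), level[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:      # all predecessors processed
                queue.append(v)

    if visited != len(indegree):
        raise ValueError("graph has a cycle; break cycles before levelizing")
    return level
```

Grouping vertices by the returned level gives the register sets that the dynamic programming of Section 4.3 processes one level at a time.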

4.3 Statistical Dynamic Programming

4.3.1 Algorithm Overview

Given an acyclic register dependency graph, we derive a register configuration with maximal timing yield by the statistical dynamic programming algorithm outlined in Figure 3.

Algorithm: StatisticalDynamicProgramming
input: levelized register dependency graph G = (V, E) and delay distributions on E
output: optimal register configuration for yield
begin
    set level-1 registers to D-FFs with local yield 1
    ℓ := level-count(G)
    for i = 2, ..., ℓ
        let R_i be the set of registers at level i
        for every register configuration α of R_i
            compute the highest local yield Y_α of α subject to the configurations of R_{i−1} and their local yields
            record the config. of R_{i−1} responsible for Y_α
    set R_ℓ to the config. β_ℓ of all D-FFs
    for i := ℓ − 1, ℓ − 2, ..., 2
        set R_i to the config. β_i responsible for β_{i+1}
    return β's
end

Figure 3: The Statistical Dynamic Programming Algorithm.

Recall that we add artificial D-FFs at the primary inputs and outputs when converting a circuit to a register dependency graph. Hence we set level-1 and level-$\ell$ registers to be of type D, where $\ell$ is the number of levels in the levelized register dependency graph.

In addition, we define the local yield of a register to be the accumulated yield computed forward from level-1 registers, each having local yield 1. The statistical dynamic programming algorithm computes and stores the optimal configurations and the corresponding local yields in a forward direction based on the timing analysis introduced in Section 3. Note that other timing analysis algorithms may be easily incorporated into our optimization framework for latch replacement.

Figure 4: Some examples of conflicts among the candidate types (D, H, L) of registers $r_0$, $r_1$, $r_2$. (a) is a legal case; (b) and (c) are conflict cases.
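The forward pass and backward trace of Figure 3 can be sketched for the simplest setting, a single-path pipeline with one register per level. This sketch makes strong simplifying assumptions that are not in the paper: the yield of each stage is supplied by a hypothetical callback `stage_yield(level, prev_type, cur_type)`, and local yields are accumulated by multiplication, which treats stages as independent. The paper's analysis instead evaluates joint conditions such as Eq. (3) through its timing analysis engine.

```python
TYPES = ("D", "H", "L")

def best_chain_config(num_levels, stage_yield):
    """Forward dynamic programming with backward tracing on a single-path pipeline.

    Returns (best accumulated yield, list of register types for levels 1..num_levels).
    """
    local = {"D": 1.0}          # level 1 is a D-FF with local yield 1
    back_pointers = []          # one dict per level 2..num_levels: cur type -> prev type
    for i in range(2, num_levels + 1):
        allowed = ("D",) if i == num_levels else TYPES   # the last level stays D
        nxt, ptr = {}, {}
        for cur in allowed:
            for prev, y in local.items():
                if prev != "D" and cur != "D":
                    continue    # replacement criterion: latches cannot be adjacent
                cand = y * stage_yield(i, prev, cur)
                if cand > nxt.get(cur, -1.0):
                    nxt[cur], ptr[cur] = cand, prev
        local = nxt
        back_pointers.append(ptr)

    # Trace backward from the all-D last level to recover the configuration.
    config = ["D"]
    for ptr in reversed(back_pointers):
        config.append(ptr[config[-1]])
    config.reverse()
    return local["D"], config
```

For wider pipelines, each DP state would be a configuration of a whole level rather than a single type, which is where the exponential cost analyzed in Section 4.3.2 comes from.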

4.3.2 Analysis

Optimality. The dynamic programming algorithm is optimal with respect to the levelized register dependency graph and its corresponding edge delay distributions. However, we may lose optimality due to the preprocessing steps, more precisely, in deriving the delay distributions of edges of the register graph and in cycle breaking.

Complexity. The computational complexity of the overall optimization flow is dominated by the statistical dynamic programming. Suppose, in a levelized register dependency graph, the pipeline widths are upper bounded by $w$. The statistical dynamic programming invokes $O(\ell \cdot 3^w)$ function calls to the timing analysis engine. Due to this potentially exponential overhead, we resort to some heuristics in our implementation.
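To give a feel for this bound, the following worked arithmetic evaluates $\ell \cdot 3^w$ for a five-level graph at two illustrative widths. The values $\ell = 5$, $w = 10$, and $w = 20$ are chosen here for illustration and are not measurements from the paper.

$$\ell \cdot 3^{w}\Big|_{\ell = 5,\ w = 10} = 5 \cdot 59{,}049 = 295{,}245, \qquad \ell \cdot 3^{w}\Big|_{\ell = 5,\ w = 20} = 5 \cdot 3{,}486{,}784{,}401 = 17{,}433{,}922{,}005.$$

Doubling the width thus raises the count by a factor of $3^{10} = 59{,}049$, which is why exhaustive enumeration per level is only practical for narrow pipelines.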

4.4 Implementation Issues

4.4.1 Large Pipeline Width

For a register dependency graph with large pipeline widths, statistical dynamic programming becomes inefficient since it considers all possible configurations for registers at each level. We alleviate this problem by greedily optimizing one register at a time without considering the configurations of other registers at the same level. As a result, we may need to handle the consistency problem of conflicting register type assignments.

When dealing with a combinational block with multiple fan-in registers, we maintain the correlations among all fan-in registers and propagate distributions to the combinational block. On the other hand, when dealing with a combinational block with multiple fan-out registers, each fan-out is computed independently while the correlations among them are kept. Note that because we only consider one register at a time, the result may differ from the global optimum. It is a tradeoff between optimality and efficiency.

4.4.2 Consistency

The consistency problem occurs when greedy approaches are used in the statistical dynamic programming algorithm. When the algorithm enters the second phase to trace out an optimal register configuration, conflicting optimal configurations for different registers may show up. Whether a conflict happens on a register depends on the configurations of its multiple fan-out registers. For example, as shown in Figure 4 (a), no conflict occurs since D is not the best configuration for both registers on the fan-out side. However, this is not the case in Figure 4 (b) and (c). Because this situation is inevitable, we propose a method to control consistency in dynamic programming.

Suppose register $r$ has multiple fan-out registers. When the best configurations of the fan-out registers require $r$ to have different types, a conflict arises in backward tracing. One straightforward solution is to calculate local yields for all possible configurations, but it would take exponential time and space in the number of fan-out registers.

| Circuit | # of pipeline stages | Clock period | # of total registers | # of replaced registers | Original yield (%) | Final yield (%) | Yield improvement (%) | CPU time (s) |
|---|---|---|---|---|---|---|---|---|
| c432 | 5 | 8.58 | 214 | 28 | 63.2 | 97.2 | 34.0 | 0.20 |
| c499 | 5 | 9.34 | 186 | 8 | 59.3 | 100.0 | 40.7 | 0.11 |
| c880 | 5 | 7.74 | 242 | 16 | 62.7 | 98.7 | 36.0 | 0.13 |
| c1355 | 5 | 10.18 | 218 | 10 | 62.3 | 99.8 | 37.5 | 0.16 |
| c1908 | 5 | 14.26 | 240 | 19 | 64.0 | 98.1 | 34.1 | 0.19 |
| c2540 | 5 | 11.96 | 278 | 62 | 62.3 | 93.9 | 31.6 | 0.40 |
| c7552 | 5 | 12.12 | 879 | 69 | 63.5 | 99.9 | 36.4 | 0.68 |
| s1196 | - | 53.54 | 18 | 4 | 59.7 | 62.4 | 2.7 | 0.05 |
| s5378 | - | 52.98 | 179 | 10 | 61.1 | 65.2 | 4.1 | 0.45 |
| s9234 | - | 118.86 | 211 | 8 | 57.8 | 59.3 | 1.5 | 0.89 |
| Average | | | | | | | 25.9 | 0.33 |

Table 1: ISCAS85 benchmark circuits with 20% delay deviation.

Instead of keeping all configurations, we propose a data structure that takes only linear time and space in the fan-out number of a register to keep consistent optimal solutions, such that no conflict occurs during backward tracing. When one fan-out register is considered at a time, there exist redundant results. Based on this observation, we give a key to each distinct combination of $r$ and one of its fan-out registers. These keys are distinct prime numbers. If $r$ has $n$ fan-out registers, we need just $3n$ keys, which is linear in the fan-out number. We represent different configurations by multiplying the corresponding keys. Because all keys are prime numbers, we can easily distinguish the configuration by factoring the resulting number.
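The prime-key encoding can be sketched as follows. This is one possible reading, not the authors' data structure: it assumes one key per (fan-out index, register type) pair, which gives the $3n$ keys mentioned above, and the helper names `make_keys`, `encode`, and `decode` are hypothetical.

```python
from itertools import count

def primes():
    """Yield 2, 3, 5, ... by trial division against the primes found so far."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

def make_keys(n_fanouts, types=("D", "H", "L")):
    """Assign a distinct prime to every (fan-out index, type) pair: 3n keys in total."""
    gen = primes()
    return {(j, t): next(gen) for j in range(n_fanouts) for t in types}

def encode(keys, choices):
    """Represent a set of (fan-out index, type) choices as the product of their keys."""
    code = 1
    for pair in choices:
        code *= keys[pair]
    return code

def decode(keys, code):
    """Recover the choices by testing which keys divide the code."""
    return {pair for pair, p in keys.items() if code % p == 0}
```

Because the keys are distinct primes, unique factorization guarantees that `decode(keys, encode(keys, choices))` returns exactly the pairs that were multiplied in, and decoding needs only one divisibility test per key.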

5 Experimental Results

We have implemented our algorithm in the C++ language. The experiments were conducted on a Linux machine with a Pentium IV 3.2GHz CPU and 3GB of memory. Two sets of circuits are used: pipeline circuits and general sequential circuits, all from ISCAS benchmark suites. The pipeline circuits are generated from combinational circuits by adding 4- to 5-stage pipelines and then retimed by SIS [10]. By SIS technology mapping with a library, the delay information can be obtained from the lookup table. In addition, the circuits are synthesized to balance long and short combinational paths. All delay variations follow a normal distribution with 20% deviation.

Table 1 shows the results for 20% delay deviations. The clock period, shown in the third column, is determined by requiring the timing yield of the original circuit to fall at approximately 60-70%. The yield improvements, shown in the eighth column, are justified with Monte Carlo simulation. The CPU times, shown in the ninth column, do not include Monte Carlo simulation. From Table 1, we note that our method achieves substantial yield improvement mostly when the differences between longest and shortest delays are small. For cyclic sequential circuits, such as s1196, s5378, and s9234, our approach yields only small improvements because their register dependency graphs are close to complete graphs, making latch replacement almost impossible.

6 Conclusions and Future Work

Based on statistical timing analysis, we proposed an algorithm to optimize the timing yield of a sequential circuit. Experimental results show that, by substituting latches for D-FFs, timing yield can be improved by about 26% on average. In addition, the results suggest that latch replacement tends to tolerate clock variations. Complementary to other design-for-yield methodologies like gate sizing and clock skew scheduling, our technique may be combined with these techniques for further improvement. Since most circuits use D-FFs for register implementation, our approach can be widely applicable to circuit designs. Since replacing D-FFs with latches incurs no area penalty, the proposed algorithm can be used for not only pre-layout but also post-layout optimization, where accurate timing information is available.

We made certain assumptions to simplify our development. As future work, we would like to relax our assumptions to handle multiple-phase clocking schemes, which may lead to further yield improvement. Also, we neglected the setup time and hold time constraints, and the correlation between longest and shortest path delay distributions. Even though our current method is accurate enough for optimization, we may obtain higher accuracy by adding these considerations to our framework.

References

[1] C. E. Clark. The greatest of a finite set of random variables. Operations Research, vol. 9, no. 2, pp. 145-162, 1961.

[2] S.-H. Choi, B. Paul, and K. Roy. Novel sizing algorithm for yield improvement under process variation in nanometer technology. In Proc. Design Automation Conf., 2004.

[3] C.-T. Chao, L.-C. Wang, K.-T. Cheng, and S. Kundu. Static statistical timing analysis for latch-based pipeline designs. In Proc. Int'l Conf. on Computer-Aided Design, 2004.

[4] R. Chen and H. Zhou. Clock schedule verification under process variations. In Proc. Int'l Conf. on Computer-Aided Design, 2004.

[5] M. Guthaus, N. Venkateswaran, C. Visweswariah, and V. Zolotov. Gate sizing using incremental parameterized statistical timing analysis. In Proc. Int'l Conf. on Computer-Aided Design, 2005.

[6] A. Hurst and R. Brayton. Computing clock skew schedules under normal process variation. In Proc. Int'l Workshop on Logic and Synthesis, 2005.

[7] H.-M. Lin and J.-Y. Jou. On computing the minimum feedback vertex set of a directed graph by contraction operations. IEEE Transactions on Computer-Aided Design, vol. 19, no. 3, 2000.

[8] C. Lin and H. Zhou. Trade-off between latch and flop for min-period sequential circuit designs with crosstalk. In Proc. Int'l Conf. on Computer-Aided Design, 2005.

[9] S. Raj, S. Vrudhula, and J. Wang. A methodology to improve timing yield in the presence of process variations. In Proc. Design Automation Conf., pp. 448-453, 2004.

[10] E. M. Sentovich et al. SIS: a system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, Univ. of California, Berkeley, 1992.

[11] K. Sakallah, T. Mudge, and O. Olukotun. checkTc and minTc: Timing verification and optimal clocking of synchronous digital circuits. In Proc. Int'l Conf. on Computer-Aided Design, pp. 552-555, 1990.

[12] D. Sinha, N. Shenoy, and H. Zhou. Statistical gate sizing for timing yield optimization. In Proc. Int'l Conf. on Computer-Aided Design, 2005.

[13] J.-L. Tsai, D. Baik, C.-P. Chen, and K. Saluja. A yield improvement methodology using pre- and post-silicon statistical clock scheduling. In Proc. Int'l Conf. on Computer-Aided Design, pp. 611-618, 2004.

[14] T.-Y. Wu and Y.-L. Lin. Storage optimization by replacing some flip-flops with latches. In Proc. Design Automation Conf., 1996.
