Clock Tree Synthesis Considering Slew Effect on Supply Voltage Variation


CHUN-KAI WANG, YEH-CHI CHANG, HUNG-MING CHEN, and CHING-YU CHIN,

National Chiao Tung University

This work tackles the problem of clock power minimization within a skew constraint under supply voltage variation, as defined in the ISPD 2010 benchmark. Unlike mesh and cross-link structures, which reduce clock skew uncertainty through multiple driving paths, our focus is on controlling skew uncertainty within the structure of the tree. We observe that slow slew amplifies the effect of supply voltage variation, which induces larger path delay variation and skew uncertainty. To obtain optimality, we formulate symmetric clock tree synthesis as a mathematical programming problem in which the slew effect is considered by an NLDM-like cell delay variation model. A symmetry-to-asymmetry tree transformation is proposed to further reduce wire loading. Experimental results show that the proposed four methods save up to 20% of clock tree capacitance loading. Beyond controlling slew to suppress supply-voltage-variation-induced skew, we also discuss the strategies of clock tree synthesis under different variation scenarios and the limitations of the ISPD 2010 benchmark.

Categories and Subject Descriptors: B.7.2 [Integrated Circuits]: Design Aids

General Terms: Algorithms, Design

Additional Key Words and Phrases: Clock tree optimization, slew, voltage variation, robust design

ACM Reference Format:

Chun-Kai Wang, Yeh-Chi Chang, Hung-Ming Chen, and Ching-Yu Chin. 2014. Clock tree synthesis considering slew effect on supply voltage variation. ACM Trans. Des. Autom. Electron. Syst. 20, 1, Article 3 (November 2014), 23 pages.

DOI: http://dx.doi.org/10.1145/2651401

1. INTRODUCTION

The clock network consumes the most power in a synchronous design and directly affects circuit speed. With shrinking technology, variations result from manufacturing, the operating environment, and even analysis [Blaauw et al. 2008]. These variations require a larger guard band in the design process. In addition, to achieve lower power dissipation, voltage scaling is broadly adopted in modern designs, which introduces severe voltage variation.

One way to reduce clock skew uncertainty is improving the delay correlation between clock paths by using a multiple-driving-paths clock network such as mesh [Restle et al. 2001; Xiao et al. 2010], cross link [Rajaram et al. 2006; Mittal and Koh 2011], and multilevel tree [Lee and Markov 2011]. However, this work adopts the other way, that is, to reduce skew uncertainty by reducing path delay variability. Therefore we focus on the network structure of the tree.

To reduce voltage-variation-induced path delay variability, previous works [Shih et al. 2010; Bujimalla and Koh 2011] minimize the number of buffer stages by full-filling buffers. However, the mapping from voltage variation to delay variation depends on the buffer’s input signal transition. Slow transition amplifies supply-voltage-variation-induced path delay variation and skew uncertainty. In this work, the slew effect is considered so that a clock buffer insertion and wire sizing method can efficiently reduce clock latency variation and skew uncertainty.

A preliminary version of this article was presented at the ISPD 2012 Conference [Chang et al. 2012]. Authors’ addresses: C.-K. Wang (corresponding author), Y.-C. Chang, H.-M. Chen, and C.-Y. Chin, Department of Electronics Engineering, National Chiao Tung University, Hsinchu 30010, Taiwan; email: oldkai.ee90@gmail.com.

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purpose only.

2014 Copyright is held by the author/owner. Publication rights licensed to ACM. 1084-4309/2014/11-ART3 $15.00

The contributions of this work are as follows.

—We formulate buffer insertion and wire sizing on a symmetric clock tree as a mathematical programming problem, which ensures optimality.

—The slew effect on voltage variation to delay variation is considered by a NonLinear Delay Model (NLDM)-like cell delay variation model.

—A technique projecting a symmetric tree to an asymmetric tree is proposed to further reduce clock wire loading.

—We discuss the strategies of Clock Tree Synthesis (CTS) for different variation scenarios and the limitations of the ISPD 2010 benchmark [Sze 2010].

The organization of this article is as follows. Section 2 introduces the problem formulation, our observation/motivation on the slew effect, and the overall flow in this work. Section 3 introduces a global optimization for the problem involving buffer insertion and wire sizing on a symmetric tree. Section 4 introduces a local optimization that projects a symmetric tree solution to an asymmetric tree. Section 5 presents experimental results, the discussion of CTS strategies under different variation scenarios, and the limitations of the ISPD 2010 benchmark.

2. PRELIMINARIES

In this section, we first review the problems of the ISPD 2010 benchmark and then deliver our observation about the slew effect and the overall flow optimizing a clock tree.

2.1. Review of ISPD 2010 Problem

To achieve a low-power and robust clock network, ISPD held a High-Performance Clock Network Contest in 2009 and 2010 [Sze et al. 2009; Sze 2010]. To reflect the variation effect realistically, instead of the clock latency range used in 2009, which is the maximum difference of clock arrival time of arbitrary sink pairs by two different supply voltages, ISPD 2010 measured the skew using a Monte Carlo method with the variation source of wire dimension and supply voltage. The details of the ISPD 2010 problem are as follows.

A synchronous circuit layout with edge-triggered flip-flops as sinks of the clock network is the starting input.

Given.

(1) a set of sinks $S = \{s_1, s_2, \ldots, s_n\}$ with physical position and capacitance loading,
(2) a clock source $s_0$ and its position,
(3) a buffer library,
(4) a wire library,
(5) two variation sources: the width of a wire segment varies uniformly in ±5%, and the voltage source of a buffer varies uniformly in ±7.5%,
(6) a set of placement blockages $B = \{b_1, b_2, \ldots, b_m\}$,
(7) a $W \times H$ layout region, and
(8) a Local Clock Skew (LCS) distance: a skew is negligible if the distance of its sink pair is larger than the LCS distance.


Objective.

Minimize the capacitance loading of a clock network.

Constraints.

(1) no 95% LCS violation: at most 5% of the LCS values (in 500 Monte Carlo simulations) can exceed a skew limit,
(2) no slew-rate violation, and
(3) no buffer overlaps a blockage.

The blockages are placement blockages, so clock routing can go through them; however, clock buffers cannot be placed on them. LCS is the worst local clock skew for one sample of NGSPICE simulation under wire dimension variation and supply voltage variation. To evaluate the robustness of a clock network, the 95% LCS is measured: it is the value below which the LCS of 95% of the samples falls, and a clock network with a lower 95% LCS is more robust. In the ISPD 2010 benchmark, the evaluator performs a 500-sample Monte Carlo simulation to derive the 95% LCS of a clock network. In general, maximum loading and transition speed are defined in the cell library for signal integrity, and the ISPD 2010 benchmark defines a slew constraint: the slew at any position of a clock network must be faster than 100ps.
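The 95% LCS metric described above can be illustrated with a minimal Python sketch. It is not the contest evaluator: the function names, the use of Manhattan distance between sinks, and the nearest-rank percentile rule are assumptions made here for illustration only.

```python
def worst_local_skew(arrival, position, lcs_distance):
    """Worst skew over sink pairs that lie within the LCS distance.

    Assumption: sink distance is measured as Manhattan distance.
    """
    sinks = list(arrival)
    worst = 0.0
    for idx, a in enumerate(sinks):
        for b in sinks[idx + 1:]:
            dist = (abs(position[a][0] - position[b][0])
                    + abs(position[a][1] - position[b][1]))
            if dist <= lcs_distance:  # farther pairs are ignored
                worst = max(worst, abs(arrival[a] - arrival[b]))
    return worst


def lcs_95(lcs_samples):
    """Value below which 95% of the sampled LCS values fall.

    Assumption: nearest-rank percentile, computed with integer arithmetic.
    """
    ordered = sorted(lcs_samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]
```

With 500 Monte Carlo samples, `lcs_95` returns the 475th smallest LCS value, which is then compared against the skew limit.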

2.2. Our Observation on Slew Effect and Overall Flow

To reduce voltage-variation-induced skew, previous works [Bujimalla and Koh 2011; Shih and Chang 2010; Shih et al. 2010] minimized the number of buffer stages by full-filling buffers, or minimized nominal clock latency by assuming that latency variation is proportional to nominal latency [Lee et al. 2010]. Bujimalla and Koh [2011] assumed that, for a buffer stage, the mapping from voltage variation to delay variation is constant. However, this mapping depends on the slew rate; a slow input slew amplifies the delay variation that voltage variation induces in a buffer stage. Figure 1 shows the slew effect for a single buffer stage with two input slews, 30ps and 50ps. Regardless of a rising or falling transition, the 30ps input slew yields a more compact histogram of arrival time. The reason is that a faster input slew reduces the uncertainty of a gate’s switching time, as shown in Figure 2.

The slew effect is also examined for multiple buffer stages, as shown in Figure 3. A buffer distance of 0.4mm gives a slew of about 30ps, and a buffer distance of 0.9mm gives a slew of about 50ps. The results show that minimizing buffer levels does not sufficiently minimize voltage-variation-induced skew, and a smaller clock latency does not guarantee less clock latency variation.

Although the example demonstrates that controlling slew is effective in reducing voltage-variation-induced skew, inserting buffers with slew taken into consideration is still a problem. Conventional Ginneken’s buffer insertion [van Ginneken 1990] minimizes nominal delay by dynamic programming (DP) from sinks toward the source; however, it is not suitable for minimizing delay variation, because doing so requires slew to be considered, and slew is propagated from predecessor buffer stages that are still unknown during the bottom-up DP. Hu et al. [2007] proposed a slew-constrained buffer insertion based on Ginneken’s method. But applying this method induces another problem: which value should be used as the constraint? In addition, it gives up more solution space than only applying a single value as the slew constraint.

We propose a two-stage overall flow, shown in Figure 4. In the first stage, we solve buffer insertion and wire sizing for a symmetric tree. There are two buffer insertion methods in the first stage: (1) length-based buffer insertion and (2) mathematical-programming-based buffer insertion and wire sizing. Either one of the two methods is selected. The length-based buffer insertion runs fast, but the mathematical-programming-based buffer insertion and wire sizing explores better solutions. The results of the first stage can be viewed as a final solution or can serve as the input of the second stage. The second stage further saves wire loading by projecting a symmetric tree to an asymmetric tree. The two-stage overall flow is a heuristic in which the first stage sacrifices solution space to facilitate global optimization, and the second stage searches the unexplored solution space of the asymmetric structure to improve quality.

Fig. 1. Slew effect in single-buffer stage. (a) Two ramp input signals with different slew rate 30ps and 50ps are experimented to drive a buffer stage; (b) Monte Carlo simulations are performed by the voltage variation setting of the ISPD 2010 benchmark. The histogram shows that the slower input slew amplifies the supply-voltage-variation-induced delay variation.

Fig. 2. Consider a gate switch when the input signal is VDD/2. A gate of slow input slew has more uncertainty than that of sharp input slew.

The two-stage method avoids some difficulties that arise when a process directly attempts to design buffer stages for an asymmetric tree. First, fine-tuning an asymmetric tree is a complex process. Assuming a skew estimation reports that variation-induced skew is too large for an asymmetric tree, a process may adjust the buffer position, buffer size, and wire size to reduce clock latency variation. However, at the same time, these adjustments on an asymmetric clock tree always generate new nominal skew. The process of rebalancing nominal skew and that of reducing clock latency variation are coupled to each other, which is complex and time consuming. Second, it is difficult to analyze variation-induced skew for an asymmetric tree because the paths from the clock source to sinks are all different. In contrast, the identical paths of a symmetric tree facilitate the analysis of path delay variation. Third, it is difficult to design a slew rate, to design buffer stages, and to ensure a valid variation-induced skew at the same time for an asymmetric tree. Assuming that the slew of a node has been planned, achieving this plan requires knowing the driving buffer of this node and the input slew of that driving buffer; therefore, the input slew and buffer size of every stage must be known. Nevertheless, it will not be known whether the plan of buffer insertion and wire sizing is able to completely drive an asymmetric tree until a Deferred-Merge Embedding (DME) bottom-up phase is finished. So the process becomes one of first guessing a buffer insertion and wire sizing solution and then performing DME to verify whether the solution is feasible, and a method for guessing a new solution is needed when the current solution is infeasible.

Fig. 3. Slew effect in multiple-buffer stages. Two buffer insertions drive the same loading. Their costs of buffer loading are the same. (a) One is 9 buffer stages with 12x inv-1, and the other is 4 buffer stages with 27x inv-1; (b) variation of arrival time of 9 stages is less than that of 4 stages. The experiment shows: (1) minimizing the number of buffer stages results in slow slew and may enlarge timing uncertainty; (2) a smaller latency does not guarantee less latency variation.

3. GLOBAL OPTIMIZATION ON A SYMMETRIC TREE

To facilitate global optimization of CTS with a slew effect consideration, in the first stage we sacrifice solution space, only considering the symmetric structure. We explain the first stage in Section 3.1 by introducing a symmetric tree generation. Section 3.2 then introduces a skew estimation that will be utilized in buffer insertion and wire sizing later. Two types of buffer insertion and wire sizing are introduced. Section 3.3 introduces a length-based buffer insertion and wire sizing that generates quality solutions in fast runtime. And Section 3.4 introduces a mathematical-programming-based buffer insertion and wire sizing that applies a mathematical programming solver to boost solution performance.

Fig. 4. Overview of optimizing a clock tree. The first stage is a global optimization by the symmetric structure, and the second stage is local optimization that transforms a symmetric tree to an asymmetric tree. Two buffer insertion methods are proposed, and either one of two methods would be selected in the first stage. The slew effect is considered by an NLDM-like cell delay variation model.

3.1. The Symmetric Tree

We adopt a symmetric tree generation method proposed by Shih et al. [2010]. An inherent weakness of the symmetric structure is that it requires longer wire length to maintain symmetry. First, if we adopt a traditional H-tree, the longer wire length needs more buffer stages to drive the tree. Consequently, it costs more wire and buffer capacitance loading, and variation-induced skew increases. Second, when the gap between a symmetric and an asymmetric tree is large, the projection from symmetry to asymmetry is less accurate. Shih et al. [2010] proposed a method to minimize the wire length of a symmetric tree, which is adopted in this work as the starting point. The difference is that Shih et al. [2010] adopt a symmetric tree to drive a mesh, whereas this work adopts a symmetric tree to drive a set of subtrees, all of which have an asymmetric structure. Before applying the symmetric tree generation by Shih et al. [2010], we have to generate this set of subtrees. The subtrees must have an identical driving buffer so that the top-level symmetric tree sees an identical loading. The other parameter to be decided is the number of subtrees; since this work applies a symmetric tree in binary structure, the number of subtrees must be a power of two.


ALGORITHM 1: subtreeGeneration(allsinks, B)
Input: allsinks and buffer library B
Output: driving buffer b* and the number of subtrees n*

sort the buffers in B in increasing order of size;
min_cost ← inf;
for b ∈ B do
    V ← set of subtrees generated by DME using b as the driving buffer;
    n ← 2^⌈log2 |V|⌉;
    cost ← power dissipation of b and n;
    if min_cost > cost then
        min_cost ← cost;
        b* ← b;
        n* ← n;
    end
end
return b* and n*

The subtree generation steps are listed in Algorithm 1. To decide the number of subtrees and the type of their driving buffer, we sweep the buffer types in the library and generate a set of subtrees by DME for each type. If the number of subtrees generated is not a power of two, the smallest power of two larger than it is selected. Each set has a cost given by its power dissipation, and the set with the lowest cost determines the number of subtrees and the driving buffer.

After Algorithm 1 returns the target number of subtrees and the driving buffer, these two values are used as the arguments for the DME function. The DME function then builds subtrees bottom-up with the driving buffer and terminates when the number of subtrees is equal to the target.
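The sweep in Algorithm 1 can be sketched in Python as follows. The callbacks `count_subtrees` (standing in for the DME run with a given driving buffer) and `power_cost` (standing in for the power estimate) are hypothetical placeholders, as is the dictionary representation of a buffer; only the sweep, the rounding to a power of two, and the minimum-cost selection follow the text.

```python
def next_power_of_two(count):
    """Smallest power of two that is >= count (for count >= 1)."""
    n = 1
    while n < count:
        n *= 2
    return n


def subtree_generation(buffers, count_subtrees, power_cost):
    """Sweep buffer types and return the cheapest (buffer, subtree count).

    count_subtrees(b): hypothetical stand-in for a DME run driven by b.
    power_cost(b, n):  hypothetical stand-in for the power estimate.
    """
    best = None
    for b in sorted(buffers, key=lambda buf: buf["size"]):
        n = next_power_of_two(count_subtrees(b))
        cost = power_cost(b, n)
        if best is None or cost < best[0]:
            best = (cost, b, n)
    _, best_buffer, best_n = best
    return best_buffer, best_n
```

Because the comparison is strict, ties are resolved in favor of the smaller buffer, which mirrors the sorted sweep order of Algorithm 1.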

3.2. Skew Estimation

Skew estimation avoids reserving an excessive guard band. An asymptotic approximation for the mean and variance of clock skew is proposed by Kugelmass and Steighlitz [1990] as

$$E(\mathit{skew}) = \sigma \left[ \frac{4 \ln N - \ln \ln N - \ln 4\pi + 2C}{(2 \ln N)^{1/2}} + O\!\left(\frac{1}{\log N}\right) \right], \tag{1}$$

$$\mathrm{Var}(\mathit{skew}) = \frac{\sigma^2}{\ln N}\,\frac{\pi^2}{6} + O\!\left(\frac{1}{\log^2 N}\right), \tag{2}$$

where $\sigma$ is the standard deviation of clock latency, $N$ is the number of sinks, and $C$ (= 0.5772...) is Euler’s constant. Bujimalla and Koh [2011] applied (1) and (2), with the assumption that skew follows a normal distribution, to estimate the 95% skew:

$$95\%\mathit{skew} = E(\mathit{skew}) + 2 \times \sqrt{\mathrm{Var}(\mathit{skew})}. \tag{3}$$
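As a rough numerical illustration of this estimate, the following Python sketch evaluates the leading terms of the asymptotic mean and variance and combines them into a 95% skew. It drops the O(.) remainder terms and assumes that the 95% point is taken as the mean plus two standard deviations of the skew; the function names are chosen here for illustration.

```python
import math

EULER_C = 0.5772156649015329  # Euler's constant


def expected_skew(sigma, n_sinks):
    """Leading term of the asymptotic mean skew for n_sinks sinks."""
    ln_n = math.log(n_sinks)
    numerator = 4 * ln_n - math.log(ln_n) - math.log(4 * math.pi) + 2 * EULER_C
    return sigma * numerator / math.sqrt(2 * ln_n)


def skew_variance(sigma, n_sinks):
    """Leading term of the asymptotic skew variance."""
    return sigma ** 2 / math.log(n_sinks) * math.pi ** 2 / 6


def skew_95(sigma, n_sinks):
    """Mean plus two standard deviations of the skew (normality assumed)."""
    return expected_skew(sigma, n_sinks) + 2 * math.sqrt(skew_variance(sigma, n_sinks))
```

The estimate scales linearly with the latency standard deviation, so halving the per-path delay variation halves the estimated 95% skew for a fixed number of sinks.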

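To utilize (1), (2), and (3) in this work, computing $\sigma$ is needed:

$$\sigma^2 = \sum_{i=0}^{n} \sigma_i^2 + \sum_{i=0}^{n-1} (\rho_R e_{Ri} + \rho_F e_{Fi})\,\sigma_i \sigma_{i+1}, \tag{4}$$

$$(e_{Ri}, e_{Fi}) = \begin{cases} (0, 1), & \text{if falling input} \\ (1, 0), & \text{if rising input,} \end{cases} \tag{5}$$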


Fig. 5. Length-based buffer insertion. (a) Flow of length-based buffer insertion; (b) a symmetric topology is adopted; (c) inserting buffers on all branches; (d) inserting buffers by a distance d; (e) enlarging the buffer size.

where $\sigma_i$ is the standard deviation of a buffer-stage delay, $\rho_R$/$\rho_F$ is the correlation coefficient for a rising/falling transition before a falling/rising transition, and $e_R$/$e_F$ denotes that the buffer-stage input transition is rising/falling. $\sigma_i$ is looked up by an NLDM-like cell delay variation model. In this work, the NLDM-like cell delay variation model is extracted by SPICE simulations and Response Surface Model (RSM) fitting [NIST 2012]. The parameters used to look up $\sigma_i$ describe the input signal transition and the output RC network; their details are given in Section 3.4.3.
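The latency standard deviation built from per-stage values can be sketched in Python as below. The per-stage standard deviations are passed in directly, standing in for the NLDM-like table lookup, and the argument names are assumptions made for this illustration.

```python
import math


def latency_sigma(stage_sigmas, input_is_rising, rho_rise, rho_fall):
    """Standard deviation of clock latency along one source-to-sink path.

    stage_sigmas[i]:    delay standard deviation of buffer stage i.
    input_is_rising[i]: True when the input transition of stage i is rising.
    rho_rise, rho_fall: correlation coefficients for the two transition orders.
    """
    variance = sum(s * s for s in stage_sigmas)
    for i in range(len(stage_sigmas) - 1):
        # Exactly one of the rising/falling indicators is 1 for each stage.
        rho = rho_rise if input_is_rising[i] else rho_fall
        variance += rho * stage_sigmas[i] * stage_sigmas[i + 1]
    return math.sqrt(variance)
```

With both correlation coefficients set to zero the result reduces to the root sum of squares of the stage values, which is a convenient sanity check.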

3.3. Length-Based Buffer Insertion

Our first buffer insertion method is a length-based one [Alpert and Devgan 1997]. The length-based method inserts buffers by a constant distance d. To minimize supply-voltage-variation-induced skew, d is decided by experiments of different distances on a long wire. According to Monte Carlo simulations, the distance of minimum delay variation is defined as d.

Figure 5 illustrates the overall flow of the length-based buffer insertion in this work. We first insert buffers on branches and then insert buffers by d from sinks to root. This ensures that the loading of one buffer stage is not particularly larger than any of the other buffer stages. If skew estimation reports a skew violation, we enlarge the buffer. If all types of buffer cannot satisfy the skew constraint, the buffer of smallest variation-induced skew would be used.
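A minimal Python sketch of the two decisions in this flow, placing buffers at a constant distance and enlarging the buffer until the skew estimate passes, is given below. The callback `estimate_skew` is a hypothetical stand-in for the skew estimation of Section 3.2, and the treatment of a single unbranched wire is a simplification made for illustration.

```python
def buffer_positions(wire_length, d):
    """Distances from the sink end at which buffers are placed every d."""
    positions = []
    k = 1
    while k * d < wire_length:
        positions.append(k * d)
        k += 1
    return positions


def pick_buffer(buffers_small_to_large, estimate_skew, skew_limit):
    """Smallest buffer meeting the skew limit, else the one with least skew.

    estimate_skew(buf): hypothetical callback returning the estimated skew.
    """
    best = None
    for buf in buffers_small_to_large:
        skew = estimate_skew(buf)
        if skew <= skew_limit:
            return buf
        if best is None or skew < best[0]:
            best = (skew, buf)
    return best[1]
```

The fallback branch corresponds to the case in which no buffer type satisfies the skew constraint and the buffer with the smallest variation-induced skew is used.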

3.4. Mathematical Programming Buffer Insertion and Wire Sizing

The other buffer insertion and wire sizing method is based on mathematical programming, in which the objective is to minimize power dissipation and the two constraints are skew and slew. Sections 3.4.1 to 3.4.3 introduce the programming variables, the objective formulation, and the constraint formulations. Section 3.4.4 enhances performance by reducing the complexity of slew propagation. All notations used in the mathematical formulations are listed in Table I.

Table I. Notations Used in Mathematical-Programming-Based Buffer Insertion and Wire Sizing

Notation                        Description
$v_i$                           number of buffer stages in $i$th level
$b_{ijk}$                       binary variable, buffer size of $j$th buffer stage in $i$th level is $k$
$w_{ijk}$                       binary variable, wire size of $j$th buffer stage in $i$th level is $k$
$\eta_i$                        lower bound of buffer stages in $i$th level
$\xi_i$                         upper bound of buffer stages in $i$th level
$d_{max}$                       maximum distance between two buffer stages
$d_{min}$                       minimum distance between two buffer stages
$L_i$                           wire length of $i$th level
$l_{ij}$                        wire length of $j$th buffer stage in $i$th level
$P_i$                           power cost of $i$th level
$B_{ij}$                        power cost of buffer of $j$th buffer stage in $i$th level
$W_{ij}$                        power cost of wire of $j$th buffer stage in $i$th level
$\beta_i$                       power cost of buffer size $i$
$\omega_i$                      power cost of wire size $i$
$\sigma$                        standard deviation of clock latency
$\sigma_{ij}$                   standard deviation of $j$th buffer stage delay in $i$th level
$\widehat{\mathit{slew}}_{ij}$  terminal slew of $j$th buffer stage in $i$th level
$\mathit{slew}_{ij}$            input slew of $j$th buffer stage in $i$th level
$e_{ij}$                        binary variable, input signal is rising/falling for $j$th buffer stage in $i$th level
$h_{ij}$                        binary variable, $j$th buffer stage in $i$th level is on branch
$b_{ij}$                        buffer size of $j$th buffer stage in $i$th level
$\hat{b}_{ij}$                  next stage buffer size of $j$th buffer stage in $i$th level
$w_{ij}$                        wire size of $j$th buffer stage in $i$th level
$y_{ij}$                        binary variable, $j$th buffer stage in $i$th level is realized
$z_{ij}$                        binary variable, $j$th buffer stage in $i$th level is the last buffer stage

3.4.1. Programming Variables. Figure 6 illustrates how the programming variables indicate a solution. In the beginning, we insert buffers at every branch, which levels the mathematical programming model and simplifies the RC topology for each buffer stage. For level $i$, three programming variables describe a solution: $v_i$ denotes the number of buffer stages, $b_{ijk}$ denotes that the buffer size of the $j$th stage is equal to type $k$, and $w_{ijk}$ denotes that the wire size of the $j$th stage is equal to type $k$.

$$v_i \in \mathbb{N} \tag{6}$$

$$b_{ijk} \in \{0, 1\} \tag{7}$$

$$w_{ijk} \in \{0, 1\} \tag{8}$$

Constraints on $b_{ijk}$ and $w_{ijk}$ ensure the solution’s uniqueness, and $v_i$ is bounded to reduce the solution space.

$$\sum_{k=1}^{|b|} b_{ijk} = 1 \tag{10}$$

$$\sum_{k=1}^{|w|} w_{ijk} = 1 \tag{11}$$

Fig. 6. In mathematical-programming-based buffer insertion and wire sizing, buffers are inserted on branches to level the problem. Three programming variables include: $v_i$, denoting the number of buffer stages in the $i$th level; $b_{ijk}$, denoting that the size of the $j$th buffer is equal to $k$; and $w_{ijk}$, denoting that the size of the $j$th wire is equal to $k$. $l_{ij}$ is the wire length of the $j$th stage in the $i$th level and is defined by the level wire length $L_i$ and $v_i$. In this example, $v_i$ is bounded by [1,3]. And in the example of $v_i = 3$ (3 buffer stages in the $i$th level), $b_{i31} = 1$ represents that the 3rd-stage buffer size is equal to type-1 (a smaller buffer), and $w_{i12} = 1$ represents that the 1st-stage wire size is equal to type-2 (a wider wire width).

To balance loading for buffer stages, buffers in a level are inserted uniformly. $l_{ij}$ denotes the wire length of the $j$th buffer stage. It is defined by the level wire length $L_i$ and $v_i$:

$$l_{ij} = \begin{cases} \dfrac{L_i}{2v_i - 1}, & \text{if } j = 1 \\[4pt] \dfrac{L_i}{2v_i - 1} \times 2, & \text{otherwise.} \end{cases} \tag{12}$$

Since the stage of $j = 1$ is on a branch, compared to other stages, it has double capacitance loading per unit stage wire length. We make its wire length half of other stages to balance loading.
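The stage wire lengths of one level can be computed with a short Python sketch; the function name and the list return format are choices made here for illustration. The first stage receives half the length of the other stages, and the lengths sum to the level wire length.

```python
def stage_wire_lengths(level_length, num_stages):
    """Wire length of each buffer stage (j = 1..num_stages) in one level."""
    unit = level_length / (2 * num_stages - 1)
    # Stage 1 sits on a branch and gets half the length of later stages.
    return [unit if j == 1 else 2 * unit for j in range(1, num_stages + 1)]
```

For example, a level of length 10 with three stages is split into lengths 2, 4, and 4.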

3.4.2. Power Formulation. Total power is the summation of all levels’ buffer loading and wire loading:

$$\mathit{power} = \sum_{i=1}^{n} 2^{i-1} P_i, \tag{13}$$

$$P_i = B_{i,1} + 2 \sum_{j=2}^{v_i} B_{ij} + 2 \sum_{j=1}^{v_i} W_{ij}, \tag{14}$$

$$B_{ij} = \sum_{k=1}^{|b|} \beta_k \times b_{ijk}, \tag{15}$$

$$W_{ij} = \sum_{k=1}^{|w|} l_{ij} \times \omega_k \times w_{ijk}, \tag{16}$$

where $B_{ij}$ denotes the buffer capacitance on the $j$th stage in the $i$th level, $\beta_k$ is the buffer capacitance of buffer size $k$, $W_{ij}$ denotes the wire capacitance, and $\omega_k$ is the wire capacitance of wire size $k$.

Fig. 7. Boolean variables $y_{ij}$ and $z_{ij}$. In the example of four levels, each level has a maximum of 3 buffer stages and a minimum of 1 buffer stage. A true $y_{ij}$ denotes that the corresponding buffer stage is realized, and a true $z_{ij}$ denotes that the corresponding buffer stage is the last buffer stage in the corresponding level. They are utilized to formulate skew and slew constraints.
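The power objective can be evaluated for a fixed assignment with the Python sketch below. It takes one-hot selection vectors for buffer and wire types, as in the binary programming variables, and is only an evaluator for a given solution, not the mathematical programming solver; all function and argument names are assumptions.

```python
def one_hot_cost(selection, costs):
    """Cost of the single type picked by a one-hot 0/1 selection vector."""
    assert sum(selection) == 1, "exactly one type must be selected"
    return sum(c * s for c, s in zip(costs, selection))


def level_power(buffer_sel, wire_sel, stage_lengths, beta, omega):
    """Power of one level: stage-1 buffer once, other terms doubled."""
    buf = [one_hot_cost(sel, beta) for sel in buffer_sel]
    wire = [length * one_hot_cost(sel, omega)
            for sel, length in zip(wire_sel, stage_lengths)]
    return buf[0] + 2 * sum(buf[1:]) + 2 * sum(wire)


def total_power(level_powers):
    """Sum over levels 1..n, where level i is weighted by 2^(i-1)."""
    return sum(2 ** idx * p for idx, p in enumerate(level_powers))
```

The weight doubles per level because a binary tree has twice as many branches at each deeper level.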

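3.4.3. Skew and Slew Constraint. We rewrite (4) as

$$\sigma^2 = \sum_{i}\sum_{j} y_{ij}\,\sigma_{ij}^2 + \sum_{i}\sum_{j} (\rho_R e_{Rij} + \rho_F e_{Fij})\,\sigma_{ij}\,\sigma_{ij}^{+}, \tag{17}$$

where $\sigma_{ij}^{+}$ denotes the standard deviation of the buffer stage that follows the $j$th stage of the $i$th level, playing the role of $\sigma_{i+1}$ in (4). Substituting $\sigma$ in (1), (2), and (3) formulates the skew constraint. In (17), a boolean variable $y_{ij}$ is introduced:

$$y_{ij} = \begin{cases} 1, & \text{if } v_i \ge j \\ 0, & \text{otherwise,} \end{cases} \tag{18}$$

$y_{ij}$ denotes that the $j$th buffer stage in the $i$th level is realized. For example, in Figure 7, the upper bound of the number of buffer stages in the second level of the tree is three, but $v_2$ is set to 2. Therefore, the third-stage buffer is not realized ($y_{23} = 0$).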

To look up $\sigma_{ij}$ by the NLDM-like cell delay variation model requires

$$\sigma_{ij} = g(e_{ij}, \mathit{slew}_{ij}, h_{ij}, w_{ij}, l_{ij}, \hat{b}_{ij}, b_{ij}), \tag{19}$$

$$\widehat{\mathit{slew}}_{ij} = f(e_{ij}, \mathit{slew}_{ij}, h_{ij}, w_{ij}, l_{ij}, \hat{b}_{ij}, b_{ij}). \tag{20}$$

The parameters are:

—rise/fall transition $e_{ij}$;
—input slew $\mathit{slew}_{ij}$;
—output branch/unbranch $h_{ij}$;
—output wire size $w_{ij}$;
—output wire length $l_{ij}$;
—next stage buffer input capacitance $\hat{b}_{ij}$; and
—current stage buffer $b_{ij}$.

$\widehat{\mathit{slew}}_{ij}$ is the terminal slew value of the $j$th stage in the $i$th level and is looked up in the same manner.

Fig. 8. Slew of a node is dominated by its near predecessors. After propagating through three buffer stages, the difference in slew decreases from 40ps to 1.2ps.

3.4.4. Enhancement on Efficiency. It can be seen in (19) and (20) that looking up $\sigma_{ij}$ requires the input slew $\mathit{slew}_{ij}$, so slew must first be propagated from source to sinks. To propagate slew across levels, a boolean variable $z_{ij}$ is utilized, which denotes that the $j$th buffer stage is the last buffer stage in the $i$th level:

$$z_{ij} = \begin{cases} 1, & \text{if } v_i = j \\ 0, & \text{otherwise.} \end{cases} \tag{21}$$

Then the input slew of the first buffer stage for all levels can be derived by

$$\mathit{slew}_{i+1,1} = \sum_{k=\eta_i}^{\xi_i} z_{ik}\,\widehat{\mathit{slew}}_{ik}, \tag{22}$$

where $\widehat{\mathit{slew}}_{ik}$ is the terminal slew of the $k$th stage in the $i$th level. However, slew propagation across levels causes a problem: the expression for $\mathit{slew}_{i+1,1}$ grows exponentially, by a factor of $(\xi_i - \eta_i + 1)$ per level.
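The two indicator variables and the selection of the next level's input slew can be sketched in Python as follows. The dictionary `terminal_slew`, keyed by stage index, is a hypothetical container for the per-stage terminal slews of one level; in the actual formulation these are expressions of the programming variables rather than known numbers.

```python
def realized(v_i, j):
    """Indicator y: 1 when stage j exists in a level holding v_i stages."""
    return 1 if v_i >= j else 0


def is_last(v_i, j):
    """Indicator z: 1 when stage j is the last stage of the level."""
    return 1 if v_i == j else 0


def next_level_input_slew(v_i, terminal_slew, lower, upper):
    """Input slew of stage 1 in the next level, selected through z.

    Only the term of the last realized stage survives the sum.
    """
    return sum(is_last(v_i, k) * terminal_slew[k]
               for k in range(lower, upper + 1))
```

Since every candidate stage count between the bounds contributes one term, chaining this selection across levels multiplies the number of terms per level, which is the growth noted above.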

Because slew is dominated by near predecessors, neglecting far predecessors has little effect on accuracy. An example in Figure 8 shows that, after three-stage slew propagation, the difference in slew decreases from 40ps to only 1.2ps. This slew deviation corresponds to a 3% deviation of a stage delay variation. This work adopts three-stage slew propagation as shown in Figure 9, and the formulation is

$$\widehat{\mathit{slew}}_{ij} = \mathit{slew}^{(3)}_{ij}, \tag{23}$$

$$\mathit{slew}^{(k)}_{ij} = f\left(\mathit{slew}^{(k-1)}_{ij}, \ldots\right), \tag{24}$$

$$\mathit{slew}^{(0)}_{ij} = \mathit{slew}^{(0)}, \tag{25}$$

where $\mathit{slew}^{(k)}_{ij}$ means that the leaf terminal slew of the $j$th buffer stage in the $i$th level is calculated by $k$-stage slew propagation, $\mathit{slew}^{(0)}$ is a user-defined parameter, and $f(\cdot)$ in (24) is equivalent to (20).
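The effect of truncating slew propagation can be illustrated with a toy Python sketch. The stage model `toy_stage` is purely hypothetical (it is not the NLDM-like model of this work); it only mimics the property that output slew depends weakly on input slew, so that different starting guesses converge after a few stages.

```python
def truncated_slew(stage_model, depth=3, initial_slew=50.0):
    """Terminal slew after propagating `depth` stages from a guessed slew."""
    slew = initial_slew
    for _ in range(depth):
        slew = stage_model(slew)
    return slew


def toy_stage(input_slew):
    """Hypothetical stage: output slew in ps, weakly tied to input slew."""
    return 25.0 + 0.3 * input_slew


# Two starting guesses that differ by 40ps end up about 1.08ps apart.
gap = truncated_slew(toy_stage, 3, 70.0) - truncated_slew(toy_stage, 3, 30.0)
```

Under this toy model the initial 40ps gap shrinks by a factor of 0.3 per stage; the numbers are illustrative and are not the measured values of Figure 8.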

4. ASYMMETRIC CLOCK TREE

After global optimization, the second stage relaxes the constraint of symmetric structure. To reduce wire loading, a symmetric tree from the first stage is projected onto an asymmetric tree. In the projection, the skew performance needs to be preserved.


Fig. 9. For mathematical programming, slew propagation from the clock source results in exponential growth with the number of tree levels as shown in (a). Since slew is dominated by near predecessors, its efficiency can be enhanced by only local propagation as shown in (b).

Fig. 10. Parameters affect variation of a buffer-stage delay.

The idea to preserve skew performance is to control the delay variation of each buffer stage. If each buffer stage in an asymmetric tree is well controlled, the total clock latency variation and the variation-induced skew would match those of the symmetric tree. Such buffer-stage delay variation control relies on three parameters, namely slew, buffer type, and wire size, which are introduced in Section 4.1. A solution refinement between the symmetric and the asymmetric tree synthesis is introduced in Section 4.2; it addresses a weakness of the symmetric tree solution in order to reduce the number of buffers inserted in the asymmetric tree. Based on DME, the asymmetric tree synthesis that controls the variation of each buffer stage is introduced in Section 4.3.

4.1. Performance Preservation of Skew

By observing the NLDM-like cell delay variation model, we learn how each factor affects delay variation. Figure 10 shows part of the raw data of the NLDM-like cell delay variation model, which supports our observation on the ISPD 2010 benchmark.


Fig. 11. When transferring a symmetric to an asymmetric tree, the extreme condition for a buffer stage changing is from symmetric fanout to no fanout; however, the wire length deviation is bounded because of the correlation between slew and wire length.

Input slew, buffer size, wire size, and wire length strongly influence the delay variation of a buffer stage; however, the asymmetric tree synthesis of this work neglects wire length. The reason is that wire length and slew are highly correlated: for example, in Figure 11(a), when the slews of two buffer stages in series are constrained, the wire length deviation is limited. The effect of neglecting wire length is discussed in the next paragraph. Consequently, in the asymmetric tree synthesis of this work, delay variation of a buffer stage is controlled by slew, buffer size, and wire size. An example projecting a buffer stage from a symmetric tree to an asymmetric tree is illustrated in Figure 11(b), in which the input slews $\mathit{slew}_i$ and $\mathit{slew}_{i+1}$ in the asymmetric tree are the same as they are in the symmetric tree, and so are the buffer size and the wire size. (In Section 4.3, the buffer size and the wire size of a buffer stage may change; however, their new values are restricted so that the delay variation of the buffer stage does not increase. Note that the asymmetric tree synthesis does not consider the location of the buffer and wire in the symmetric tree, and its detailed process is introduced in Section 4.3.)

Neglecting wire length causes a deviation in performance preservation, but the amount of this deviation is tolerable. In the extreme condition, the deviation increases the delay variation of a buffer stage by 10%. The performance preservation considers slew, buffer size, and wire size for a buffer stage; in other words, it neglects the buffer stage’s RC tree topology. The RC tree topology changes from a symmetric to an asymmetric tree in that the tapping point of an asymmetric tree is skewed because of a delay difference between the two merged subtrees. The long side will dominate delay variation, and the extreme condition is that the short side degenerates to zero length (no fanout) as shown in Figure 11(b); however, the wire length deviation is bounded because of the correlation between slew and wire length. According to our experiments, for a buffer stage, the extreme condition in which a symmetric fanout stage is transferred to a no-fanout stage results in 10% larger delay variation; for a whole tree, the asymmetric tree has on average 9.6% larger skew than the original symmetric tree.

Fig. 12. Solution refinement is a process between the symmetric and the asymmetric clock tree synthesis. The refinement shifts stages of small input slew toward the root. It reduces capacitance loading near leaf level. (a) An asymmetric tree without solution refinement; (b) an asymmetric tree with solution refinement.

4.2. Solution Refinement between a Symmetric and an Asymmetric Clock Tree

A solution refinement, which moves the buffer stages of sharp input slew toward the tree root, is adopted to reduce the number of buffers inserted in asymmetric tree synthesis. A tight stage slew constraint limits a stage’s wire length; as a result, before being merged with other subtrees, a subtree may need a preceding driving buffer stage to maintain sharp slew. When such a condition occurs near leaf level, a large number of preceding buffers would be inserted, which costs buffer loading. It is sometimes inevitable that a short-level wire length is generated near the leaf level of a symmetric tree as shown in Figure 12(a); however, when synthesizing an asymmetric tree, we can move these sharp slew buffer stages, shown as (s3, w3, b3) in Figure 12(b), toward the root so that leaf-level subtrees can be merged as deeply as possible, thus reducing buffer loading from 14 to 9. Therefore, after receiving the buffer insertion and wire sizing solution of a symmetric tree, we shift the sharp slew buffer stages toward the tree root in order to apply this buffer insertion plan in the asymmetric tree synthesis. Note that the refinement only modifies the plan of buffer insertion for the asymmetric tree synthesis; in other words, no modification is actually made to the original symmetric tree.

4.3. DME-Based Projection

The asymmetric CTS is based on DME and, during the DME bottom-up phase, the parameters, namely input slew of a stage s, leaf slew of a stage s, buffer size b, and wire size w, are maintained for each stage.

Asymmetric CTS steps are listed in Algorithm 2. For each stage, (s, s, b, w) from a refined solution of a symmetric tree are read. According to these parameters, genStageSubtree generates a set of stage subtrees N and a table T, which records pairs of a stage subtree and the corresponding stage buffer. Then stage buffers are inserted as roots of stage subtrees and become merging candidates for the next stage. The while-loop (lines 2–6) continues until only one tree is in the merging candidate pool, that is, our final asymmetric clock tree.

We can elaborate more on genStageSubtree. In Section 4.2, the solution refinement shifts the small slew stage toward the tree root to reduce the number of buffers. genStageSubtree further reduces the number of buffers by using buffers of stronger driving strength. A stage buffer may be swapped for one of stronger driving strength if the stronger buffer drives more subtrees, which in turn saves power. Note that the variability of the stronger buffer must be less than that of the original stage buffer. Therefore, genStageSubtree records the original stage buffer for all subtrees (lines 11–13) and invokes genStageSubtreeBySingleBuf several times (lines 14–16) to test whether a stronger buffer saves power.

genStageSubtreeBySingleBuf generates a set of stage subtrees L and records stage buffers in T. In genStageSubtreeBySingleBuf, the subtree of smallest delay, n1, has the highest priority to be merged. n2 is the merging partner of n1, and together they must satisfy the stage slew constraint when driven by a buffer b. After n1 and n2 are selected (lines 23–37), they are merged into a new subtree nnew. Merging candidates are updated and b is recorded in T as the stage buffer of nnew (lines 38–42). If n1 finds no merging partner, n1 is removed from the merging candidate container N and saved in the stage subtree container L (lines 43–46). The merging process continues until no more merging is possible and all stage subtrees are stored in L.

4.3.1. Bottleneck and Complexity. The runtime bottleneck of Algorithm 2 is line 5, that is, inserting the buffer and adjusting the wire length to match the slew target. It costs the most runtime because it embeds NGSPICE simulation to derive an accurate slew rate and delay.

The complexity of Algorithm 2 is analyzed as follows.

(1) The complexity of genStageSubtreeBySingleBuf is equal to that of DME.

—If the driving buffer is so strong that one buffer can drive the whole tree, the first genStageSubtreeBySingleBuf in line 15 will return |N| = 1, and the later iterations of genStageSubtreeBySingleBuf will cost nothing because no subtrees are merged. As a result, an asymmetric tree is completed by one genStageSubtree, and the complexity is equal to that of DME. The slew check in line 28 of genStageSubtreeBySingleBuf is an additional constant cost for each merging of DME, which does not increase complexity.

—When a driving buffer cannot drive a whole tree, genStageSubtree returns a set of subtrees in N, which is L in genStageSubtreeBySingleBuf. The cost of generating a subtree in L is equal to that of a successful merging, because generating a subtree in L and a successful merging both remove one subtree from the merging candidates N in genStageSubtreeBySingleBuf. The only difference is that n1 finds no merging partner with a valid slew in lines 26–37.


ALGORITHM 2: AsymmetryTree(all sinks, B)

Input: all sinks, buffer library B, wire library W, a refined solution of a symmetric clock tree

Output: an asymmetric tree

 1  N ← all sinks;
 2  while |N| > 1 do
 3      (s, s, w, b) ← read solution of symmetric tree for current stage, or there is no feasible solution and exit;
 4      (N, T) ← genStageSubtree(s, s, w, b, N);
 5      for all sub-trees ∈ N, insert buffers as their roots by a table of stage buffer T and adjust wire length to match s; // concurrent
 6  end
 7  There is only one sub-tree in N; connect its root to clock source and top down node embedding;

10  genStageSubtree(s, s, w, b, N)
11  for all sub-trees n ∈ N do
12      T(n) ← b;
13  end
14  for all b′ ∈ B && driving strength satisfies that strength(b) ≤ strength(b′) ≤ α × strength(b), by order of strength do
15      (N, T) ← genStageSubtreeBySingleBuf(b′, s, s, w, N);
16  end
17  return (N, T);

20  genStageSubtreeBySingleBuf(b, s, s, w, N)
21  L ← φ;
22  while |N| > 1 do
23      n1 ← smallest delay sub-tree ∈ N;
24      isMergeble ← false;
25      minCost ← inf;
26      for ntest ← all other sub-trees ∈ N do
27          nnew ← merge(n1, ntest) by wire size w;
28          stest ← calcSlew(b, nnew); // buffer b drives sub-tree nnew
29          if stest ≤ s then
30              cost ← calcMergeCost(nnew);
31              if cost < minCost then
32                  minCost ← cost;
33                  n2 ← ntest;
34              end
35              isMergeble ← true;
36          end
37      end
38      if isMergeble then
39          nnew ← merge(n1, n2);
40          N ← N \ {n1, n2};
41          N ← N ∪ {nnew};
42          T(nnew) ← b;
43      end
44      else
45          N ← N \ n1;
46          L ← L ∪ n1;
47      end
48  end
49  L ← L ∪ n0, n0 is the last sub-tree ∈ N;
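The greedy loop of genStageSubtreeBySingleBuf (lines 20–49) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' C++ implementation: `calc_slew` and `merge` are hypothetical toy models (slew taken as driver resistance times load, merging as adding loads plus a fixed wire capacitance), and merged capacitance stands in for calcMergeCost. In the paper the slew and delay come from NGSPICE simulation.

```python
def calc_slew(buf_res, node):
    # Toy slew model (assumption): driver resistance times driven load.
    return buf_res * node["cap"]


def merge(n1, n2, wire_cap):
    # Toy merge (assumption): loads add up, plus a fixed wire capacitance.
    return {
        "name": "(" + n1["name"] + "+" + n2["name"] + ")",
        "delay": max(n1["delay"], n2["delay"]) + 1.0,
        "cap": n1["cap"] + n2["cap"] + wire_cap,
    }


def gen_stage_subtrees_by_single_buf(buf_name, buf_res, slew_limit, wire_cap, candidates):
    pool = list(candidates)  # N: merging candidates
    done = []                # L: finished stage subtrees
    table = {}               # T: stage buffer of each merged subtree
    while len(pool) > 1:
        n1 = min(pool, key=lambda n: n["delay"])  # smallest-delay subtree first
        best, partner, min_cost = None, None, float("inf")
        for n_test in pool:
            if n_test is n1:
                continue
            n_new = merge(n1, n_test, wire_cap)
            # Keep only merges whose slew meets the stage constraint.
            if calc_slew(buf_res, n_new) <= slew_limit and n_new["cap"] < min_cost:
                best, partner, min_cost = n_new, n_test, n_new["cap"]
        if best is not None:
            pool = [n for n in pool if n is not n1 and n is not partner]
            pool.append(best)
            table[best["name"]] = buf_name
        else:
            # No valid partner: n1 becomes a finished stage subtree.
            pool = [n for n in pool if n is not n1]
            done.append(n1)
    done.extend(pool)  # the last remaining subtree
    return done, table


sinks = [{"name": "s%d" % i, "delay": 0.0, "cap": 1.0} for i in range(4)]
stage, table = gen_stage_subtrees_by_single_buf("inv-0", 1.0, 4.0, 0.5, sinks)
stage_loose, _ = gen_stage_subtrees_by_single_buf("inv-0", 1.0, 10.0, 0.5, sinks)
print([n["name"] for n in stage])
```

With the tight slew limit the four sinks end up in two stage subtrees; with the loose limit a single buffer drives all four, which mirrors the two cases discussed in the complexity analysis.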

Table II. ISPD 2010 Benchmark Information

BM     #Sinks  LCS (ps)  LCS Dist. (μm)  W (μm)  H (μm)  #Blocks
cns01  1107    7.5       600             8000    8000    4
cns02  2249    7.5       600             13000   7000    1
cns03  1200    4.9       370             3072    493     2
cns04  1845    7.5       600             2130    2690    2
cns05  1016    7.5       600             2319    2545    1
cns06  981     7.5       600             1950    891     0
cns07  1915    7.5       600             2537    1448    0
cns08  1134    7.5       600             1837    1628    0

Table III. Physical Properties of Buffers

Buffer  Inverted  Input Cap (fF)  Output Cap (fF)  Output Res (Ω)
inv-0   True      35              80               61.2
inv-1   True      4.2             6.1              440

Table IV. Physical Properties of Wires

Wire    Unit Res (Ω/nm)  Unit Cap (fF/nm)
wire-0  0.0001           0.0002
wire-1  0.0003           0.00016

(2) The complexity of Algorithm 2 is as follows. Assuming each genStageSubtreeBySingleBuf scales down the number of subtrees by α, we have two cases.

—Worst case (α = 1). Assuming a genStageSubtree sweeps β buffer sizes and that AsymmetryTree calls genStageSubtree at most γ times, there are in total β × γ calls of genStageSubtreeBySingleBuf. The complexity is $O(\beta \times \gamma \times n^2)$.

—General case (0 < α < 1). m is an integer satisfying two rules: $n \times \alpha^{m} \le 1$ and $1 < n \times \alpha^{m-1}$. This means that a tree is completed after m calls of genStageSubtreeBySingleBuf. The complexity is $O\left(\sum_{i=0}^{m-1} (n \times \alpha^{i})^2\right) = O\left(\sum_{i=0}^{m-1} \alpha^{2i} n^2\right)$.

5. EXPERIMENTAL RESULTS AND LIMITATIONS OF ISPD 2010 BENCHMARK

This section demonstrates experimental results and compares our methods in Section 5.1. Section 5.2 discusses limitations of the ISPD 2010 benchmark and strategies of CTS for different variation scenarios.

5.1. Experimental Results

The proposed approach is implemented in C++, and the mathematical programming solver is IBM ILOG CPLEX v12.2 [CPLEX 2010]. Experimental results are evaluated by the ISPD 2010 benchmark (Table II), which is based on IBM and Intel real-case microprocessor designs. The buffer library (Table III) and wire library (Table IV) are based on PTM [2011] 45nm technology. The variation setting is the same as that of the contest, that is, ±7.5% vdd variation and ±5% wire width variation. Monte Carlo simulations by NGSPICE are performed to evaluate performance.
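As a rough illustration of how a 95% LCS figure can be read off a set of Monte Carlo runs, the sketch below samples a supply-voltage deviation within ±7.5% for two paths and takes the 95th percentile of the resulting skew. Everything except the ±7.5% range is an assumption made for illustration: the uniform distribution, the linear delay model, the constants `NOMINAL_DELAY_PS` and `SENSITIVITY`, and the nearest-rank reading of "95%". The paper itself obtains delays from NGSPICE.

```python
import random

NOMINAL_DELAY_PS = 100.0  # hypothetical nominal path delay
SENSITIVITY = 1.0         # hypothetical relative delay change per relative vdd change
VDD_RANGE = 0.075         # +/-7.5% vdd variation, as in the contest setting


def percentile_nearest_rank(samples, pct):
    # Nearest-rank percentile for an integer pct, using integer arithmetic
    # for the rank so that no floating-point rounding is involved.
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def sample_skews(runs, seed=0):
    rng = random.Random(seed)
    skews = []
    for _ in range(runs):
        # Each path sees its own supply deviation (toy assumption: uniform).
        dv_a = rng.uniform(-VDD_RANGE, VDD_RANGE)
        dv_b = rng.uniform(-VDD_RANGE, VDD_RANGE)
        delay_a = NOMINAL_DELAY_PS * (1.0 - SENSITIVITY * dv_a)
        delay_b = NOMINAL_DELAY_PS * (1.0 - SENSITIVITY * dv_b)
        skews.append(abs(delay_a - delay_b))
    return skews


skews = sample_skews(500)
lcs_95 = percentile_nearest_rank(skews, 95)
print(round(lcs_95, 2))
```

The percentile is reported rather than the maximum so that a few extreme samples do not dominate the figure.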

The experiments are carried out on a 2.4 GHz Intel Xeon CPU Linux workstation with 16GB memory. The runtime limit of mathematical programming is set to 600 seconds and 14 threads are utilized for concurrent SPICE simulations of asymmetric CTS. The correlation coefficients of stage delay variation are $\rho_R$ of 0.4 and $\rho_F$ of 0. These values are extracted by a least-square error fitting on an inverter chain, as shown in Figure 13.

Fig. 13. Extraction of $\rho_R$ and $\rho_F$ by an inverter chain. Variance of two-stage delay is collected by Monte Carlo simulations, and a least-square fitting (26)–(29) derives $\rho_R$ and $\rho_F$.

$$\min \sum_i \left[ (\sigma_{RF_i} - \hat{\sigma}_{RF_i})^2 + (\sigma_{FR_i} - \hat{\sigma}_{FR_i})^2 \right] \tag{26}$$

such that

$$\sigma_{total}^2 = \sum_i \left( \sigma_{R_i}^2 + \sigma_{F_i}^2 \right) + 2 \sum_i \left( \sigma_{R_i}\sigma_{F_i}\rho_R + \sigma_{F_i}\sigma_{R_i}\rho_F \right) + O(n^3) \tag{27}$$

$$\hat{\sigma}_{FR_i}^2 = \sigma_{F_i}^2 + \sigma_{R_i}^2 + 2\sigma_{F_i}\sigma_{R_i}\rho_F \tag{28}$$

$$\hat{\sigma}_{RF_i}^2 = \sigma_{R_i}^2 + \sigma_{F_{i+1}}^2 + 2\sigma_{R_i}\sigma_{F_{i+1}}\rho_R \tag{29}$$

All $\sigma_{F_i}$ and $\sigma_{R_i}$ are looked up by the NLDM-like cell delay variation model, and $\sigma_{total}$, $\sigma_{RF_i}$, and $\sigma_{FR_i}$ are derived by Monte Carlo simulations. The higher-order terms $O(n^3)$ are neglected.
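The fitting in (26), (28), and (29) can be illustrated with a small Python sketch. Under (28) and (29), the FR terms of (26) depend only on ρF and the RF terms only on ρR, so each coefficient can be fitted by an independent one-dimensional search. The sketch uses a plain grid search in place of whatever least-square solver the authors used, and the stage sigmas below are made-up numbers for illustration, not data from the paper; `two_stage_sigma` and `fit_rho` are hypothetical helper names.

```python
import math


def two_stage_sigma(sigma_a, sigma_b, rho):
    # Model of (28)/(29): sigma of the summed delay of two adjacent stages.
    var = sigma_a ** 2 + sigma_b ** 2 + 2.0 * sigma_a * sigma_b * rho
    return math.sqrt(max(0.0, var))  # guard against tiny negative rounding


def fit_rho(pairs, measured, steps=2000):
    # Grid search over rho in [-1, 1] minimising the squared error between
    # measured and modelled two-stage sigma (one half of objective (26)).
    best_rho, best_err = 0.0, float("inf")
    for k in range(steps + 1):
        rho = -1.0 + 2.0 * k / steps
        err = sum((m - two_stage_sigma(a, b, rho)) ** 2
                  for (a, b), m in zip(pairs, measured))
        if err < best_err:
            best_rho, best_err = rho, err
    return best_rho


# Made-up per-stage sigmas standing in for the NLDM-like table look-ups.
sigma_r = [1.0, 1.2, 0.9, 1.1]
sigma_f = [0.8, 1.0, 1.1, 0.7, 0.9]
fr_pairs = [(sigma_f[i], sigma_r[i]) for i in range(4)]      # as in (28)
rf_pairs = [(sigma_r[i], sigma_f[i + 1]) for i in range(4)]  # as in (29)

# Synthetic "Monte Carlo" measurements generated with rho_R = 0.4, rho_F = 0.
measured_fr = [two_stage_sigma(a, b, 0.0) for a, b in fr_pairs]
measured_rf = [two_stage_sigma(a, b, 0.4) for a, b in rf_pairs]

rho_f = fit_rho(fr_pairs, measured_fr)
rho_r = fit_rho(rf_pairs, measured_rf)
print(round(rho_r, 3), round(rho_f, 3))
```

On noise-free synthetic data the search recovers the coefficients used to generate it; with real Monte Carlo data the residual would be nonzero and a finer or gradient-based search might be preferable.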

Table V shows a comparison of the statistics on LCS, capacitance loading, CPU time, and wall-clock time. Acronyms are used to denote the following methods.

—SMeshMB is a symmetric tree driving a bottom mesh, which is done in Shih et al. [2010].

—Contango 2.0 is an asymmetric tree with clock latency minimization, which is done in Lee et al. [2010].

—AMB is an asymmetric tree with buffer-stage minimization, which is done in Bujimalla and Koh [2011].

—AMB CL inserts a cross link based on AMB, which is done in Mittal and Koh [2011].

Our four methods are as follows.

—length-symm is a symmetric clock tree of length-based buffer insertion.

—mp-symm is a symmetric clock tree of mathematical-programming-based buffer insertion.

—length-asym is an asymmetric clock tree transformed from length-symm.

—mp-asym is an asymmetric clock tree transformed from mp-symm.

The clock network produced by mp-asym has the smallest capacitance loading. Its capacitance is smaller than that of SMeshMB [Shih et al. 2010], Contango 2.0 [Lee et al. 2010], AMB [Bujimalla and Koh 2011], and AMB CL [Mittal and Koh 2011] by up to 1.2×. Among our four methods, mathematical-programming-based buffer insertion and wire sizing improves capacitance by 4%, and the transformation from symmetric to asymmetric versions improves capacitance by up to 5%. The runtime overhead results from (1) the mathematical programming solver and (2) SPICE simulations of asymmetric CTS, especially in the large cases cns01 and cns02. The results show that concurrent SPICE simulations of asymmetric CTS effectively reduce wall-clock time relative to CPU time. Runtime can be further reduced by replacing SPICE simulation with a static timing analysis tool, for example, one based on a composite current source model.

Table V. Experimental Results of ISPD 2010 Benchmark: Skew, Capacitance, and Runtime

BM     Metric            SMeshMB  Contango 2.0  AMB    AMB CL  length-symm¹  mp-symm  length-asym  mp-asym
cns01  95% LCS (ps)      7.16     7.01          5.79   7.32    7.32          7.35     7.77         6.41
       cap (pF)          445.3    198.3         177.5  142.6   146.0         124.4    124.5        143.3
       cpu time (sec)    0.4      12015         2790   1092    114           795      1477         2426
       wall clock (sec)  -        -             -      -       97            688      335          860
cns02  95% LCS (ps)      7.33     7.34          6.69   7.42    7.38          7.49     8.93         6.73
       cap (pF)          933.6    375.9         329.9  265.2   268.3         255.3    250.4        275.2
       cpu time (sec)    2.42     25006         7787   4314    295           935      3319         6223
       wall clock (sec)  -        -             -      -       120           763      800          1659
cns03  95% LCS (ps)      4.88     4.18          3.46   4.49    4.76          4.64     6.41         4.83
       cap (pF)          183.7    55.86         50.81  36.61   34.17         34.33    34.24        33.21
       cpu time (sec)    1.57     3840          2094   383     71            274      313          441
       wall clock (sec)  -        -             -      -       37            241      120          228
cns04  95% LCS (ps)      4.01     4.46          3.79   6.70    7.14          6.70     7.64         6.96
       cap (pF)          196.3    71.84         57.44  51.07   42.77         41.78    40.00        38.03
       cpu time (sec)    0.27     6075          2763   934     73            244      335          970
       wall clock (sec)  -        -             -      -       27            199      127          720
cns05  95% LCS (ps)      3.81     4.41          3.68   4.78    5.88          6.22     5.72         5.80
       cap (pF)          89.09    37.69         28.93  25.13   22.13         20.98    19.50        18.33
       cpu time (sec)    0.10     2406          1110   278     36            207      150          716
       wall clock (sec)  -        -             -      -       13            185      65           609
cns06  95% LCS (ps)      7.40     6.05          4.01   6.41    5.61          5.82     5.75         7.04
       cap (pF)          160.4    47.81         36.12  32.68   28.55         28.01    26.03        23.78
       cpu time (sec)    0.28     2660          1142   285     70            75       184          232
cns07  95% LCS (ps)      6.24     4.58          5.65   5.86    6.62          6.80     7.08         6.75
       cap (pF)          228.2    72.66         57.93  48.32   43.91         43.39    39.79        39.30
       cpu time (sec)    0.30     2351          2968   818     75            122      283          511
cns08  95% LCS (ps)      7.64     5.15          4.24   5.07    6.50          6.89     6.58         6.95
       cap (pF)          228.2    52.49         40.43  32.70   28.41         28.08    27.25        25.69
       cpu time (sec)    0.28     1987          1497   327     76            82       206          241
geo mean of cap          5.17     1.76          1.45   1.20    1.09          1.05     1.04         1.00
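The "geo mean of cap" row of Table V appears to be the geometric mean, over the eight benchmarks, of each method's capacitance divided by that of mp-asym. The short Python check below uses the cap (pF) values from Table V for the AMB CL, mp-symm, and mp-asym columns; the normalization to mp-asym is inferred from the 1.00 entry in that row rather than stated in the text, and `geo_mean_ratio` is a hypothetical helper name.

```python
import math

# cap (pF) per benchmark cns01..cns08, copied from Table V.
CAP = {
    "AMB CL":  [142.6, 265.2, 36.61, 51.07, 25.13, 32.68, 48.32, 32.70],
    "mp-symm": [124.4, 255.3, 34.33, 41.78, 20.98, 28.01, 43.39, 28.08],
    "mp-asym": [143.3, 275.2, 33.21, 38.03, 18.33, 23.78, 39.30, 25.69],
}


def geo_mean_ratio(method, baseline="mp-asym"):
    # Geometric mean of per-benchmark capacitance ratios, computed via logs.
    logs = [math.log(c / b) for c, b in zip(CAP[method], CAP[baseline])]
    return math.exp(sum(logs) / len(logs))


for name in CAP:
    print(name, round(geo_mean_ratio(name), 2))
```

A geometric mean is the natural choice here because the benchmarks differ in size by more than an order of magnitude, and an arithmetic mean of raw capacitance would be dominated by cns01 and cns02.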

Figure 14 describes experiments on benchmark cns01 with different skew constraints. The figure shows that the mp-based methods are more flexible than the length-based ones for different skew specifications. Table VI shows the comparison between Monte Carlo results and the asymptotic skew approximation.

5.2. Limitations of ISPD 2010 Benchmark

To evaluate a clock network by means of the ISPD 2010 benchmark, one should be aware of the variation setting. We list the notable aspects of the variation setting as follows.

¹length-symm is slightly different from Chang et al. [2012]. Because the symmetric tree in Chang et al.

Fig. 14. Trade-off between skew and capacitance on cns01. mp-asym has the smallest capacitance. Mathematical-programming-based buffer insertion wire sizing is more flexible for different skew constraints.

Table VI. Skew Estimation vs. Monte Carlo Result

BM     Monte Carlo 95% LCS  Estimated 95% LCS
cns01  6.41                 8.58
cns02  6.73                 10.21
cns03  4.83                 5.00
cns04  6.96                 7.49
cns05  5.80                 7.29
cns06  7.04                 7.18
cns07  6.75                 7.42
cns08  6.95                 7.10

—The problem addressed by Bujimalla and Koh [2011] and Lee and Markov [2011] is that the setting of the ISPD 2010 benchmark allows the variation of a buffer stage to be reduced by stacking buffers such that each buffer has its own voltage source. Increasing the number of stacked buffers smooths the effects of voltage variation. Bujimalla and Koh [2011] and Lee and Markov [2011] adopt a single-location single-voltage setting to eliminate the smoothing effect of stacked buffers, as shown in Figure 15. It is worth mentioning that the comparison in Table V is fair because the number of stacked buffers used in our proposed method is not greater than that used in the others. The maximum numbers of stacked buffers used in length-asym, mp-asym, Contango 2.0 [Lee et al. 2010], AMB [Bujimalla and Koh 2011], and AMB CL [Mittal and Koh 2011] are all 30x inv-1, while those used in length-symm, mp-symm, and SMeshMB [Shih et al. 2010] are all 20x inv-1.

—The slew effect addressed in this work affects performance evaluation of a multiple-driving-paths network. Networks such as mesh and cross link reduce the variation-induced skew by improving the delay correlation between paths. However, when the slew effect is not controlled well, the baseline of path delay variation is different, possibly misleading the real performance of a network.

—The ISPD 2010 benchmark considers variation of supply voltage and wire width only. When more variation sources are considered, such as threshold voltage and gate length, the primitive delay variation of a buffer stage increases. This scenario affects the proposed CTS method, and a CTS strategy to minimize skew uncertainty should seek to minimize the number of buffer stages. For minimizing the delay variation of a buffered long wire, the slew effect considered in this work implies that there is an optimal number of buffer stages. When the number of buffer stages is less than the optimal value, inserting an additional buffer stage sharpens slew and reduces delay variation. When the number is more than the optimal value, the slew effect remains but is weaker than the primitive delay variation of an additional buffer stage. Once the primitive delay variation of a buffer stage becomes severe, the optimal number of buffer stages decreases. In the worst case, minimizing delay variation is simply minimizing the number of buffer stages. We call this scenario primitive variation dominance.

Fig. 15. By the setup of the ISPD 2010 benchmark, buffers stacked at the same location have different voltage sources. This smooths voltage variation for a buffer stage. By the setup of Single-Location Single-Voltage (SLSV), all buffers have only one voltage source.

6. CONCLUSIONS

This work proposed a method to tackle supply voltage variation and to synthesize a lower-power and robust clock tree. The proposed method includes two stages. The first stage facilitates global optimization by adopting a symmetric structure. Buffer insertion and wire sizing are formulated in mathematical programming, and the slew effect is considered by an NLDM-like cell delay variation model. The second stage performs local optimization in which a transformation from a symmetric to an asymmetric tree further saves wire and buffer loading. Experimental results demonstrate that the proposed method saves capacitance loading by up to 20%.

Beyond the proposed method, limitations of the ISPD 2010 benchmark are addressed. To evaluate the performance of a clock network synthesizer by the ISPD 2010 benchmark, one should be aware of the variation setting. For example, a single-location single-voltage setting can prevent the smoothing of voltage variation by stacking buffers; when evaluating the performance of a multiple-driving-paths clock network, ignoring the slew effect may result in a different baseline of path delay variation and give a misleading picture of performance; when variation becomes more severe, primitive variation dominance may occur, and the strategy of CTS in this scenario should seek to minimize the number of buffer stages.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their comments that guided us to improve the quality of the article.


REFERENCES

C. J. Alpert and A. Devgan. 1997. Wire segmenting for improved buffer insertion. In Proceedings of the Design Automation Conference (DAC’97). 588–593.

D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer. 2008. Statistical timing analysis: From basic principles to state of the art. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 27, 4, 589–607.

S. Bujimalla and C.-K. Koh. 2011. Synthesis of low power clock trees for handling power-supply variations. In Proceedings of the International Symposium on Physical Design (ISPD’11). 37–44.

Y.-C. Chang, C. K. Wang, and H.-M. Chen. 2012. On constructing low power and robust clock tree via slew budgeting. In Proceedings of the International Symposium on Physical Design (ISPD’12). 129–136.

CPLEX. 2010. IBM ilog cplex optimizer v12.2. http://www01.ibm.com/software/integration/optimization/cplex-optimizer/.

S. Hu, C. J. Alpert, J. Hu, S. K. Karandikar, Z. Li, W. Shi, and C. N. Sze. 2007. Fast algorithms for slew-constrained minimum cost buffering. IEEE Trans. Comput.-Aided Des. 26, 11, 2009–2022.

S. D. Kugelmass and K. Steighlitz. 1990. An upper bound on expected clock skew in synchronous systems. IEEE Trans. Comput. 39, 12, 1475–1477.

D.-J. Lee, M.-C. Kim, and I. L. Markov. 2010. Low-power clock trees for cpus. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’10). 444–451.

D.-J. Lee and I. L. Markov. 2011. Multilevel tree fusion for robust clock networks. In Proceedings of the International Conference on Computer Aided Design (ICCAD’11). 632–639.

T. Mittal and C.-K. Koh. 2011. Cross link insertion for improving tolerance to variations in clock network synthesis. In Proceedings of the International Symposium on Physical Design (ISPD’11). 29–36.

NIST. 2012. NIST/SEMATECH e-handbook of statistical methods. http://www.itl.nist.gov/div898/handbook/.

PTM. 2011. Predictive technology model. http://ptm.asu.edu/.

A. Rajaram, J. Hu, and R. Mahapatra. 2006. Reducing clock skew variability via crosslinks. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 25, 6, 1176–1182.

P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petrovick, B. L. Krauter, and B. D. McCredie. 2001. A clock distribution network for microprocessors. IEEE J. Solid-State Circ. 36, 5, 792–799.

X.-W. Shih and Y.-W. Chang. 2010. Fast timing-model independent buffered clock-tree synthesis. In Proceedings of the Design Automation Conference (DAC’10). 80–85.

X.-W. Shih, H.-C. Lee, K.-H. Ho, and Y.-W. Chang. 2010. High variation-tolerant obstacle-avoiding clock mesh synthesis with symmetrical driving trees. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’10). 452–457.

C. N. Sze. 2010. ISPD 2010 high performance clock network synthesis contest: Benchmark suite and results. In Proceedings of the International Symposium on Physical Design (ISPD’10). 143–143.

C. N. Sze, P. Restle, G.-J. Nam, and C. Alpert. 2009. ISPD 2009 clock network synthesis contest. In Proceedings of the International Symposium on Physical Design (ISPD’09). 149–150.

L. P. P. P. van Ginneken. 1990. Buffer placement in distributed rc-tree networks for minimal elmore delay. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’90). 865–868.

L. Xiao, Z. Xiao, Z. Qian, Y. Jiang, T. Huang, H. Tian, and E. F. Y. Young. 2010. Local clock skew minimization using blockage-aware mixed tree-mesh clock network. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’10). 458–462.