LPGC:一個基於最佳閘控制時脈拓樸結構之新穎的低功耗電路擺置演算法

全文

(1)國立交通大學電信工程學系碩士論文 LPGC:一個基於最佳閘控制時脈拓樸結構之新穎的低功耗電路擺置演算法 LPGC: A Novel Low Power Driven Placement Algorithm Based on Optimal Gated Clock Topology. 研究生：陳逸宏指導教授：李育民. 教授. 中華民國九十四年七月.

(2) LPGC:一個基於最佳閘控制時脈拓樸結構之新穎的低功耗電路擺置演算法 LPGC: A Novel Low Power Driven Placement Algorithm Based on Optimal Gated Clock Topology. 研究生：陳逸宏. Student：Simon Yi-Hung Chen. 指導教授：李育民. Advisor：Yu-Min Lee. 國立交通大學電信工程系碩士論文. A Thesis Submitted to Department of Communication Engineering College of Electrical Engineering and Computer Science National Chiao Tung University in partial Fulfillment of the Requirements for the Degree of Master in. Communication Engineering July 2005 Hsinchu, Taiwan, Republic of China. 中華民國九十四年七月.

(3) LPGC:一個基於最佳閘控制時脈拓樸結構之新穎的低功耗電路擺置演算法研究生：陳逸宏. 指導教授：李育民博士. 國立交通大學電信工程學系碩士班. 摘. 要. 本論文研製之電路擺置器，乃基於將用於低功耗設計下之閘控制時脈網路之功率消耗最佳化。首先，基於考慮時脈網路之實體設計連線關係及將其切換機率最小化，先建製一個最佳化閘控制時脈拓樸結構。再透過新穎的評估函數，本論文研製之擺置器可提供讓最佳化閘控制時脈拓樸結構成為可繞結構之適當電路擺置。本擺置器可提供專為低功耗時脈網路繞線演算法之專用電路擺置，這是因為本擺置器提供該演算法將積體電路連線功耗最小化之全域最佳化解。積體電路連線之功耗是一項在奈米製程技術下的嚴重問題。實驗結果證明本擺置器可將閘控制時脈網路功耗減少至少 25%。. i.

(4) LPGC: A Novel Low Power Driven Placement Algorithm Based on Optimal Gated Clock Topology Student：Simon Yi-Hung Chen. Advisor：Dr.Yu-Min Lee. Department of Communication Engineering National Chiao Tung University. ABSTRACT. In this thesis, a placer based on optimizing the power consumption of clock gating network is presented. First, we construct an optimal gated clock topology based on considering the physical connectivity, and minimizing the total switching activity of the clock network. Then, with a novel measure function, our placer can build a suitable placement solution which makes the optimal gated clock topology being routable. Our placer can give a dedicated placement seed to any low power gated clock routing algorithm since we provide them a global optimized solution for reducing the interconnect power which is a severe factor in the coming nano-scale era. Experimental results show that our placer indeed works and can help gated clock routing algorithms to additionally save power consumption by 25%.. ii.

(5) 誌. 謝. 本篇論文得以順利完成，首先要感謝指導教授李育民博士，當研究出現困難之時，李老師總是給予相當多的建議，在關鍵的時刻，也總是由李老師扮演推手，將研究中不足的地方加強。其次，我由衷的感謝實驗室與我ㄧ起奮鬥的朋友們，飛鴻、震軒、阿琅與小宇，有各位的陪伴，才能讓我在交大的學習生活更有效率，更多彩多姿。最後，我將最誠摯的謝意獻給一直支持我的家人，這包含了我的父母、我即將就讀交大研究所的弟弟、我深愛的女友芝婷，因為有你們無悔的付出與關懷，我才能從低潮中振作起來。因為你們的陪伴，我的喜悅與快樂才得以分享。. iii.

(6) Contents 1 Introduction 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Organization of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 2 Preliminaries and Literatures Survey 2.1 Basic Concepts of General Placement . . . . . . . . . . . . . . . 2.1.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . 2.1.2 Wire-length Estimation . . . . . . . . . . . . . . . . . . . 2.2 Previous Work Related to Low Power Placement . . . . . . . . . 2.3 Basic Concept of General Clock Routing . . . . . . . . . . . . . . 2.3.1 Elmore Delay Model . . . . . . . . . . . . . . . . . . . . 2.3.2 Clock Timing Considerations . . . . . . . . . . . . . . . 2.3.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . 2.4 Previous Work Related to General and Low Power Clock Routing 2.4.1 Zero-skew Clock Routing . . . . . . . . . . . . . . . . . 2.4.2 Low Power Zero-skew Clock Routing . . . . . . . . . . . 3 Activity analysis and optimal gated clock topology 3.1 Activity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Extraction from High Level Description of the System 3.1.2 Instruction Level Simulation . . . . . . . . . . . . . . 3.2 Optimal Gated Clock Topology . . . . . . . . . . . . . . . . .. . . . .. . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . .. 1 1 3 4. . . . . . . . . . . .. 5 5 5 7 8 11 12 12 16 17 17 22. . . . .. 26 26 27 27 29. 4 LPGC Placer 4.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Data Structrue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 LPGC Novel Gated Clock Network Construction Flow . . . . . . . . . . 4.4 The Novel Low Power Placement Algorithm . . . . . . . . . . . . . . . . 4.4.1 The Optimal Gated Clock Topology Construction . . . . . . . . . 4.4.2 Approaching to Optimal Gated Clock Topology and Pseudo Clock Router Heuristic . . . . . . . . . . . . . . . . . . . . . . . . . .. 33 33 34 34 36 37. 5 Experimental Result. 45. 6 Conclusions and Future Work. 50. iv. 38.

(7) List of Figures 2.1. Comparison between two placements. (a) The placement with shorter total wire-length. (b) The placement with longer total wire-length. . . . . 2.2 Models of wire-length estimation. (a) Steiner tree. (b) Spanning tree. (c) Semi-perimeter. (d) Complete graph. . . . . . . . . . . . . . . . . . . . . 2.3 (a) Clock sinks’ locations for general placement. (b) Gated clock routing under general placement. (c) After performing the existed low power driven placement algorithm, the skew problem becomes much more severe(under optimal gated clock topology). (d) Dedicated low power placement and gated clock routing. . . . . . . . . . . . . . . . . . . . . . . . . 2.4 (a) Elmore delay model for a segment of wire. (b) Elmore delay model for a driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 A typical synchronous system. . . . . . . . . . . . . . . . . . . . . . . . 2.6 A simple diagram which demonstrates clock skew effect. . . . . . . . . . 2.7 Method of mean and median clock routing algorithm. . . . . . . . . . . . 2.8 Geometric matching based clock routing algorithm. . . . . . . . . . . . . 2.9 A simple example of tiled rectangle region(TRR). . . . . . . . . . . . . . 2.10 A tree of merging segments. Solid lines are merging segments and dotted lines indicate edges between merging segments. The squares represent the clock sinks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1. 3.2 3.3. 4.1 4.2. 4.3 4.4. Methodology for transformation of an ASIC function from hardware description language to a control-data flow graph. (a) Transformation from a simple ASIC function to hardware description language. (b) A simple example of CDFG after performing HDL scheduling and allocation. . . . A typical topology of gated clocking circuitry. . . . . . . . . . . . . . . . An example for switching activities comparison under two different clock topology. (a) Topology with smaller average switching activity. (b) Topology with larger average switching activity. . . . . . . . . . . . . . . . . . B*-tree data structure. (a) An eleven blocks B*-tree representation. (b) Modules’ actual positions and placement of (a). . . . . . . . . . . . . . . (a) Traditional design flow of the gated clock network construction. (b)The LPGC novel flow for gated clock network construction. The dotted rectangle circles the operation flow of LPGC Placer. . . . . . . . . . . . . . Optimal gated clock topology construction procedure. . . . . . . . . . . . Gated clock placement cost computation procedure for pseudo clock router heuristic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v. 6 8. 9 13 14 16 19 21 22. 23. 28 30. 31 35. 36 38 40.

(8) 4.5. 4.6 4.7. 5.1. 5.2 5.3. (a) A typical placement result under general constraints, the optimal topology seems to be unroutable. (b) A dedicated placement which makes the optimal topology routable. . . . . . . . . . . . . . . . . . . . . . . . . . Evaluate a placement through pseudo clock router heuristic. (a) Longer pseudo clock wire-length. (b) Shorter pseudo cLock wire-length. . . . . . (a) A dedicated placement with considering the optimal gated clock topology. (b) A typical placement without considering the optimal gated clock topology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 40 42. 44. (a) Gated clock network comparison between a general placer and LPGC Placer with additionally consideration on gated clock routing. (b)Additional gated clock network Wire-length Saving. (c) Additional gated clock control network wire-length saving. (d) Reduction rate comparison. . . . . . 47 Snapshot of the placement result on circuit plbench576. . . . . . . . . . . 48 Snapshot of the placement result on circuit plbench3136. . . . . . . . . . 49. vi.

(9) List of Tables 3.1 3.2. A simple RTL description of instructions for a microprocessor. . . . . . . Instruction level simulation for 15 Clock Cycles. . . . . . . . . . . . . .. 29 29. 5.1. Comparison between typical placer with LPGC Placer on benchmark circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 48. vii.

(10) Chapter 1 Introduction 1.1 Introduction The need for low power driven VLSI designs has emerged rapidly in past ten years because of the growing demands for the high-performance portable computing devices. Specifically, the saving of power dissipation for the portable electronics is extremely important to extend the battery life and improve device usability. Modern digital systems are designed with a target clock period (or clock frequency) which determines the rate of data processing. The clock network distributes the clock signal from the clock generator/source to the clock inputs/sinks of the synchronous modules. As we know, the clock distribution network consumes a large portion (more than 40%) [19] of the total circuit power. Therefore, the design of a low power synchronous system tends to minimize the total power consumed by the clock tree subject to performance constraints, such as the operating frequency and the maximum clock skew. Clock gating [19]–[26] is a well-known technique to reduce the dynamic clock network power consumption of a digital circuit. It disables portion of the clock distribution network to save the dynamic power. The existing gated clock routing algorithms first formulate a specifically objective function based on the switching activities, capacitances of clock sinks, gated clock control circuits, the wire length of clock network and gated clock control network, then optimize the power consumption through different gated clock routing algorithms.. 1.

(11) According to [21], most of the publications [20, 22, 23] have suffered from the same shortcoming that there is no deeply physical level consideration while performing the gated clock routing algorithm. As the result, the clock tree constraints might not be met or the clock power dissipation might go up after the gated clock routing. Furthermore, the trend of continuously increasing interconnect power is another more severe problem which has been pointed out by [19]. Therefore, we need to pay more effort in reducing the interconnect power to save more power. These facts motivate us to develop a dedicated placement algorithm for reducing the power dissipation of clock distribution network. Placement/Floorplanning is the first stage of the VLSI physical layout design. Several techniques [5]–[10] have been developed to reduce the power consumption at this level, but none of them cooperated with the technique of clock gating. They only optimized dynamic switching power through different approaches with consideration on the combinational networks. This thesis revisits the low power driven placement problem with a focus on gated clock and provides a novel placer to reduce the interconnect power of the gated clock network and gated clock control network. In other word, we can provide a good seed for those gated clock routing algorithms [20]–[23]. The power consumed by complementary metal-oxide-semiconductor (CMOS) circuits consists of two components: dynamic and static power. The static power is mainly determined by the technology. In this thesis, we only consider the dynamic power. The 2 CL , where α is the circuit dynamic power consumed by a synchronous module is αf Vdd. activity, f is the clock frequency, Vdd is the supply voltage, and CL is the total load capacitance of the circuit. Therefore, a given synchronous CMOS circuits with fixed clock frequency, supply voltage, and total load capacitance, the power consumption can be reduced by decreasing its total activity. The effects of interconnect such as signal integrity, transmission line effect, and I-R drop · · · etc, are revealed in the dep-submicron circuit design, and those effects become more serious because of the continuously shrinking of feature size. Furthermore, the power dissipation of interconnects gets closer to the loading power as the line is long 2.

(12) enough. Hence, the clock network power consumption is no longer dominated by the loading registers. Therefore, the design method need to be modified to deal with those effects. This thesis develops a novel way to optimize the power consumption of the gated clock network.. 1.2 Our Contributions In this thesis, a low power driven placer based on the optimal gated clock topology is developed to minimize the power consumption of clock network with taking the total wire-length and area occupation of each module into account. We call this placer as “LPGC Placer”. Firstly, the activity pattern is acquired through the analysis of functional blocks, and the optimal gated clock topology is constructed by minimizing the total activity. After that, by using a simulated annealing based optimization algorithm with concurrently evaluating the routability of optimal gated clock topology, the blocks are placed to their suitable positions. Finally, we modify the low power clock routing algorithm [20, 23], and compare our results with the general placement without considering the optimal gated clock topology. The major contribution of this thesis is to provide a good seed for constructing the low power gated clock routing network. Therefore, the low power gated clock routing algorithms are more likely to find the global optimal solutions instead of the local optimal solutions. We also develop an efficient technique to estimate the total wire length of clock network at the placement level, and utilize this technique to test the routability of the optimal gated clock topology. The experimental results are excellent. With feeding the placement of the proposed placer into a gated clock routing algorithm [20]–[23], this gated clock algorithm can save more than 25% of the clock power consumption compared with its routing result by using the placement of a general placement algorithm [48], at the cost of the total wire-length increase of 1.3% on average.. 3.

(13) 1.3 Organization of this Thesis The remainder of this thesis is organized as follows. First, in Chapter 2, the basic ideas of circuit placement and clock network distribution will be introduced. The related literatures of low power design will also be surveyed thoroughly in Chapter 2. The activity analysis method which we utilize to obtain the precisely distribution of modules’ activity and the optimal gated clock topology will be described in Chapter 3. In Chapter 4, the techniques constructing the LPGC Placer will be presented. In order to demonstrate our LPGC Placer, both the placement results from LPCG Placer and a general placement method without considering the gated clock topology will be fed into a real low power gated clock router [20, 23] to find out the extra power consumption saving percentage of LPGC Placer in Chapter 5. Finally, the conclusions and future work will also be given in the remaining of Chapter 5.. 4.

(14) Chapter 2 Preliminaries and Literatures Survey In this chapter, the basic concepts of placement and clock routing problems will be presented, and the previous related researches will be reviewed.. 2.1 Basic Concepts of General Placement Placement is a crucial step in physical design cycle. A poor placement results in larger area occupation, performance degradation, higher power consumption, and even the unevenly thermal distribution. A poor placement also leads to a difficult or impossible routing task. The input of the placement phase is a set of blocks, the information of terminals of each block, and the netlist which describes the connectivity of all modules. The placement has a deeply impact on the performance of overall layout because an ill-placed layout cannot be improved by any high quality routing algorithm. In other word, the overall quality of the layout such as area, power, total wire-length or thermal distribution is mainly determined in the placement phase. For example, Fig. 2.1 shows two different placements of a circuit, you can see that Fig. 2.1(a) is better than Fig. 2.1(b) since the total wire-length of Fig. 2.1(a) is seem to be shorter than Fig. 2.1(b)’s.. 2.1.1 Problem Formulation Given a set of circuit modules, a placement problem is to arrange modules on the layout surface to satisfy some prescribed constraints such as non-overlapping criterion or circuit boundary constraint, and to optimize certain objective functions such as total area, 5.

(15) H. D. A. H. J. J L. L. A. K C. D. I. I. G. G. K. F. F. B. C B. E. E. (b). (a). Fig. 2.1: Comparison between two placements. (a) The placement with shorter total wirelength. (b) The placement with longer total wire-length. total estimated wire-length, total power consumption or/and thermal balancing. At the placement level, a circuit is considered as a group of rectangular modules with fixed geometric ratios. The interconnect relationships among each modules are described in the netlist of interconnect. The different operations such as rotate, flip, move, and swap can be performed on the modules to achieve an optimal placement according to different cost functions. The formal definition of placement problem is as follows. Given a set of modules M = {m1 , m2 , · · · , mn }, a set of signals S = {s1 , s2 , · · · , sk }, and a set of slots/locations L = {L1 , L2 , · · · , Lp } where p ≥ n, Smi is a set of signals with Smi ⊆ S for each module mi , and Msi is a set of modules with Msi = {mj | si ∈ Smj ; ∀mj ∈ M } for each signal si . Msi is called a signal net. The placement problem is to assign each mj ∈ M to an unique location Lj such that some objective function, f (M, S, L), is 6.

(16) optimized. Sometimes a subset of modules in M is pre-assigned to several specific locations. The placement problem is extended to be the floorplanning problem if the geometries or aspect ratios of given modules are more flexible. For example, the modules can be rotated, their shapes can be modified, or the locations of pins can be changed. It is obvious that floorplanning is a generalized version of placement problem, and has higher flexibility to place the modules than the placement’s. Of course, it is more complicated.. 2.1.2 Wire-length Estimation The first objective of the high performance integrated circuits is to maximize the operating frequency or to minimize the total delay of the whole system. This objective can be complete by minimizing the length of the critical nets. Usually, it could be approximated by minimization of the length of the longest signal wire-length. After this approximated approach, the objective for maximizing the operating frequency becomes a simpler problem. Although the real signal wire-lengths are not available at placement level, a placer needs to model the actual topology of the interconnection effectively and accurately. By using the interconnection graph structure, the requirements could be satisfied. The interconnection graph structure for modeling the connection between two terminals is simply an edge between the two vertices corresponding to the terminals. For the purpose of modeling a net with more terminals, rectilinear steiner trees are used as shown in Fig. 2.2(a) to estimate the optimal signal paths for a net. This model is seldom used by placers, because of the NP-completeness of steiner tree problem. As a result, minimum spanning tree(shown in Fig. 2.2(b)) and semi-perimeter representations(shown in Fig. 2.2(c)) are the most two commonly used models to represent the connections of a net in the placement level. In th Minimum spanning tree model, connections are allowed to branch only at the pins. The other one is semi-perimeter representation, it can estimate the wire-length effectively by bounding pins with a rectangle 7.

(17) (a). (b). (c). (d). Fig. 2.2: Models of wire-length estimation. (a) Steiner tree. (b) Spanning tree. (c) Semiperimeter. (d) Complete graph. then calculates its semi-perimeter. Complete graph interconnection is shown in Fig. 2.2(d). This method causes many redundant wires, and result in longer wire length. The purpose of such model is to give one the upper bound of a net’s wire-length.. 2.2 Previous Work Related to Low Power Placement As we mentioned before, placement/floorplanning is the first stage of the VLSI physical layout design methodology. Several techniques [5]–[10] have been developed to reduce the power consumption at this level, but none of them cooperated with the technique of clock gating. They only optimized dynamic switching power through different approaches with consideration on the combinational networks. Generally speaking, the total power consumption, Pm , in a typical CMOS circuit can. 8.

(18) d. c. d. c e. h C. a. e. h C. a. g. g b. b. f. f. (b). (a). d. a g. e. b a. h. f. C. c. e. C f. d. g. c. h. b. (d). (c). Fig. 2.3: (a) Clock sinks’ locations for general placement. (b) Gated clock routing under general placement. (c) After performing the existed low power driven placement algorithm, the skew problem becomes much more severe(under optimal gated clock topology). (d) Dedicated low power placement and gated clock routing. be formulated as follow [5] Ptotal =. X. Pm. (2.1). ∀module m 2 Pwire = αwire Cload,wire f Vdd. Pm =. X gate∈m. (. ). 2 X Vdd 2 Pwire , + αgate Cload,gate f Vdd + Rd,gate wire∈m. where Ptotal is the total power consumption, Pm is the power consumption of module m, Vdd is the supply voltage, Rd,gate is the equivalent depletion resistance of a gate, Cload,gate is the total loading capacitance that a gate drives, Pwire is the power consumption of a wire, Cload,wire is the wire capacitance, f is the clock frequency, αwire is the activity of a wire, and αgate is the activity of a gate. The existing low power driven placement methods [5]–[10] optimized the following cost function through different approaches since they believed that this cost function was 9.

(19) the only one related to the power consumption at the placement level. Cost =. X. αi Ciwire ,. (2.2). ∀net i. where αi is the activity of net i, and Ciwire is the wire capacitance of net i. Optimizing the above cost function would tend to shrink the interconnect wire length of high switching activity modules. However, the form of Equation (2.2) can not be directly applied to the low power driven placement with considering the clock routing since the clock routing problem and the general routing problem are completely different [27]. The reason is that the clock distribution inherently consists skew problem while the general interconnect network does not. The skew problem becomes worsen for gated clock routing. Suppose we have a gated clock topology which minimizes the average activity. As we will see in the Section 3.2, we define this topology as optimal gated clock topology. Suppose we could make this topology routable then we will save the additional power. For an example, given a distribution of clock sinks under general placement as shown in Fig. 2.3(a), pin C is the clock source, pins a, c, d, e, f and h are very high activity clock sinks, and pins b and g are very low activity clock sinks. Fig. 2.3(b) is a gated clock routing after performing the general placement. We say this solution is feasible since the skew problem is less. But it depends on luck since it places modules in a uniformly distribution. However, a dedicated low power placement for gated clock routing could save more as we could see from Fig. 2.3(c) under minimizing skew problem. If we adopt the existing low power placement methods would probably be placed as Fig. 2.3(d), which make skew problem much worse. If we want to solve the problem in Fig. 2.3(b) and Fig. 2.3(d), one might re-synthesis the clock topology hence in order to minimize the skew. Even we can handle the problem of skew, the overhead may increase the total clock wire-length. Therefore, the total clock network power consumption will actually be increased. Specifically, this solution is not good since a dedicated placement could replace it as it says in Fig. 2.3(c). 10.

(20) To the best knowledge of the authors, all existing gated clock routing algorithms are restricted by the placement. In other word, they couldn’t achieve minimum power consumption unless they have a “dedicated seed at placement level for the the gated clock routing.. 2.3 Basic Concept of General Clock Routing In a typical synchronous system, its performance is directly proportional to the clock operational frequency since it handles more instructions or commands within a same duration. Clock signal is global in nature and the length of a clock line is very long. Hence, the clock network is accompanied by great capacitances so that it limits the performance of the whole system. Therefore, clock nets should be routed with great precision, since the actual length of the path of a net from its entry point to its terminals determines the maximum operation frequency for a chip. By the way, a clock router needs to take several factors into account, including the resistance and capacitance of the routed wire, the noise and cross-talk between wires, the type of load to be driven, the arrival time from clock source to sinks and the total clock network’s power consumption. Among all the clock routing requirements, the most important factor is “clock skew”. The clock signals must arrive simultaneously at all functional units with slightly or no waveform distortion. Because of the importance of clock network, optimization of the clock signal can have a significant impact on the chip’s cycle time, especially in high-performance VLSI designs. Non-optimal clock behavior is caused by either of two phenomena: the routing to the chip’s synchronizing elements(flip-flops or registers), or in the non-symmetric behavior of the clock distribution. Therefore, specialized routers are required for clock nets due to strict specifications for routing such network. Before the introduction of these algorithms, we discuss some fundamentals of clock network such as delay model, timing and clock skew.. 11.

(21) 2.3.1 Elmore Delay Model The well-known and powerful model for wire delay estimation is Elmore delay model [47]. Let’s see Fig. 2.4(a), a Π(pi) model is used to model a wire segment. A via is also modeled as a Π model, with R and C are twice of a wire segment. The source and all sinks of a net are treated as drivers(shown in Fig. 2.4(a) ) which can be modeled by a capacitance connected to the upstream and a resistance connected to the downstream. Calculation of Elmore delays is the sum of resistance×downstream capacitances. Let T be an RC tree, with vertices V = {v1 , v2 , · · · , vn }, ck is the capacitance of a vertex vk , rk is the resistance of the edge between vk and its immediate predecessor. The subtree capacitance at node k is given as Ck = ck +. X. Ci. (2.3). i∈Sk. where Sk is the set of all the immediate successor of vk . Let δm,n be the path between vm and vn , excluding vm and including vn . Then the delay between two nodes m and n is tmn =. X. rn Cn. (2.4). n∈δm,n. 2.3.2 Clock Timing Considerations The majority of digital chips are synchronous in nature. Synchronous designs are often modeled as a Moore finite state machine for the purpose of ease for analysis. The topological requirement imposed on such a finite state machine is that all closed signal paths must contain at lease one synchronous element. Satisfaction of this constraint has several benefits, two of which are: the assurance of deterministic behavior if the physical aspects of the design are correct, and the elimination of the requirement that the combinational logic be free of transient as long as next state sampling is performed after after the longest path has settled to its final value.. 12.

(22) Wire. Driver. R. c 2. r. c 2. CL. (a). (b). Fig. 2.4: (a) Elmore delay model for a segment of wire. (b) Elmore delay model for a driver. Clock Period Constraint For simplicity, and without loss of generality, let the synchronizing elements be edgetriggered. Furthermore, let CP denotes the clock period, dL the longest path delay through the combinational logic(longest combinational logic path delay), tSKEW the clock skew, tSU the set-up time of the edge-triggered synchronizing elements, and tCQ the delay from the synchronizing element’s clock pins to its out pins. In order to guarantee that no long-path timing violations occur in the design, the following equation must be satisfied CP ≥ dL + tSKEW + tSU + tCQ. (2.5). This expression demonstrates the important relationship between the clock period, the longest path delay, and the clock skew.. 13.

(23) RAM, Memory and I/O circuitry CPUs. Clock Generator. Fig. 2.5: A typical synchronous system. Clock Skew Fig 2.5 demonstrate a typical synchronous system, dotted lines represent clock path. The clock signal is often generated external to the chip and supplied to the chips through the clock source pins. All chips would have their own clock network distribution by some dedicated clock router in order to transmit clock signals from clock source pin to every functional unit which needs clock signal. Hence, clock controls the flow of data transmission within the synchronous system. In this figure we omitted the internal clock networks for a brief introduction. 14.

(24) Designers hope that the arrive times from clock source pin to all the functional units are completely and precisely the same so that all data transmissions can start at the same time without data stall and the all functional units can be designed faster to achieve maximum performance. Actually, the clock signals do not arrive at all functional units simultaneously. We give the definition for the maximum difference from the clock source to each of the clock pins clock skew, just as Fig.2.6 demonstrate. With the clock skew limitation, we can not design our system ideally. In other word, one should first calculate the budget or the upper bound for the maximum clock skew can affect. As skew increases with the clock period held fixed, the efficiency of the digital system is reduced because valuable computation time is “stolen” from the total cycle time. Frequently in high performance design environments, skew is constrained to be less than five percent of the clock period. Thus, in a 1 Ghz design, skew would be less than 5000ps. However, there is still some clock skew that we cannot solve which comes from manufacturing process. It forces us to be conservative and use little bigger clock period in order to minimize the skew effect so that we sacrifice the maximum frequency from malfunction comes from clock skew. We adopts the period which allows for the maximum completion time of all the functional units’ in addition to the extra time for handling the maximum clock skew as the system clock period. As we know, we should minimize the clock period as hard we can so that we should evaluate for the second term in Equation (2.5) which represent the bound of the clock skew. However, this term is hard to be evaluated since it deviated from logic style, clock routing network, register distribution and technology. More precisely, clock skew is caused by several phenomena: asymmetric routes to the clocked elements, differing interconnect line parameters, different delays through the clock distribution elements, and different device threshold voltages for the clock distribution logic. Therefore, one should keep any possibilities in mind so as to handle the worst skew problem. By the way, once we minimize the skew effect in every level of the design, we will have higher probability for improving the performance. 15.

(25) Register B. Register A. Clock Source. SkewAC = |tSA – tSC|. Register C. Fig. 2.6: A simple diagram which demonstrates clock skew effect. In the following subsections, we will introduce some classic algorithm which minimize the skew effect even minimize the clock power simultaneously.. 2.3.3 Problem Formulation The clock routing problem defines as follows: Given the routing plane and a set of sinks C = {c1 , c2 , · · · , cn } lying within the plane and cs represent the clock source pin. The target is to construct a network to connect all of the clock pins ci with clock source cs hence to minimize the source to sink delay, clock skew and total clock wire-length. The cost function can be formulated as follows: Clock Routing Cost = M inimize( max , max | τ(cs ,cm ) −τ(cs ,cn ) |, wirelengthclock ) τ(cs ,cj ) (m,n)∈C. 16.

(26) 2.4 Previous Work Related to General and Low Power Clock Routing Prior to delving into the clock routing algorithm, it is necessary to define its role in the context of the overall design flow. In the contemporary physical design flow, it consist three classical steps: floorplanning/placement, global routing and detailed routing. Clock routing is a specific problem interposed between placement and global routing. Specifically, during the clock routing step, the global and detail routes of the clock net are determined and passed to the global router as blockages. The determined routes are constructed so that the clock signal behavior is optimized.. 2.4.1 Zero-skew Clock Routing The skew can be minimized by distributing the clock sinks in a way such that for each path from the clock source to sink has same delay time under some wire delay model. The higher precision of a delay model is, the fewer impact of the clock skew has. Many approaches are devoted to minimize the skew problem in this field such as [15, 14, 17, 16, 11, 13, 12, 18]. They design their routing algorithms to solve the zero-skew clock routing problem distinctly , and their various skew minimization techniques also motivate many works on low power zero-skew clock routing. Additionally, as we will introduce in the coming chapters, the LPGC Placer built an “pseudo clock router” for evaluation on gated clock topology. This pseudo clock router is also motivated by some of the zero-skew clock router. We introduce the classic zero-skew clock algorithms in the coming sections. The Method of Means and Medians The first clock routing algorithm is the H-tree clock router which is a well-known algorithm. However, it could not solve the problem with unevenly distribution of clock sinks. This motivated authors in [12] to invent a top-down clock routing algorithm called method of means and medians(MMM). The MMM algorithm follows a technique very similar to the H-tree algorithm by 17.

(27) recursively partitioning a circuit into two equal parts, and then connects the center of the mass of the whole circuit to the centers of the mass of the two sub-circuits.The algorithm is simple and yields good results. The definition of some terminologies is that: Let si be the clock sink i, and Ps = {s1 , s2 , · · · , sn } is the set of the clock sinks. Lx is the list of clock sinks which sorted according to the their x-coordinate. Pxm is the median of the Lx . Given Pxm , assign points in list Lx to the left of Pxm to the set PLef t . Assign the remaining points to the set PRight . Similarly, PBottom and PT op represent the division of P into two sets about Pym ,the median of y axis. Because of the geometric nature of the problem, one may consider the partition of the set of clock sinks as the partition of a region. Thus PBottom and PT op , PLef t and PRight partition the original region by y-median or x-median into two sub-regions with an approximately equal number of sinks in each sub-regions. The algorithm operates as follows: First, splits Ps into two sets(in the x or y directions). Suppose that a split Ps into PBottom and PT op is chose. Then, the algorithm connects the center of the mass of Ps to each of the center of mass of PT op and PBottom respectively. Then split the regions PT op and PBottom in the x direction respectively. Thus splits between x and y directions are introduced on the set of points recursively until there is only one point in each sub-region. A simple example of this algorithm is shown in Fig. 2.7. The Geometric Matching based Algorithm Another binary tree based routing algorithm [14, 15] is presented by Kahng, Cong and Robins. In their approach, clock routing is completed by constructing a binary tree through recursively geometric matching. This algorithm is so called Geometric Matching Algorithm(GMA). It works in a bottom-up fashion while MMM algorithm is a top down algorithm. Given a set Ps of n sinks, a geometric matching on Ps is a set of n/2 line segments 18.

(28) (a). (c). (b). (d). centers of mass. Fig. 2.7: Method of mean and median clock routing algorithm.. 19. (e). (f).

(29) whose endpoints are in Ps , with no two line segments sharing the same endpoint. Every line segment in the matching is defined as an edge. The geometric matching cost is the sum of the lengths of its edges. The target is to construct a clock tree by recursive matching. It starts by considering a forest of n isolated sinks, The minimum-cost matching on these sinks yields n/2 line segments, each of which defines a subtree with two leaf nodes. The center point of each line segment is called the tapping point and if the clock signal is transmitted through the tapping point, then the clock signal will arrive at the two endpoints of the line segment with zero skew. The set of tapping points serves as the set of points for the next iteration of the matching algorithm. This algorithm recursively reduce the number of points by. 1 2. once a iteration. By the way, the algorithm optimizes the wire-length in each iteration of matching. In each iteration, the algorithm selects the root of the newly merged tree to be the tapping point which minimizes the path length skew to the leaves of the two subtrees. Because this algorithm is based on geometric matching, its time complexity is O(n3 ). Fig 2.8 shows GMA algorithm running on eight-point set. Fig. 2.8(a) and Fig. 2.8(b) represent the first iteration of matching, the points is reduced to be 4. Fig. 2.8(c) demonstrates the next matching iteration. However, the figure shows that the connection path between the tapping points of T12 and T34 is impossible. Therefore, this algorithm has a refinement step called “H-flipping”(shown in Fig. 2.8(d)) in order to fix the problem of unbalance path length when merging two points. The Deferred Merge Embedding Algorithm The purpose of deferred merge embedding algorithm [16, 17, 13] was to improve the two techniques which we have discussed. This DME algorithm could minimize the total clock wirelength with zero-skew under any given clock topology. In other word, this algorithm should be performed after the clock topology is determined. The MMM and GMA are the algorithms could solve clock topology construction and clock routing at the same time while DME just solve the second one. However, this prop-. 20.

(30) (a). (c). (b). H-flipping. (d). Fig. 2.8: Geometric matching based clock routing algorithm. erty make DME a multi-purpose clock router since different clock topology could brings different advantages and disadvantages. We will introduce the optimal clock topology which optimized the total clock power in Section 3.2. A generic DME algorithm is a two phase: bottom up merge and top down placement. The name placement here just referred to determine the precise locations for every steiner point (merge point) in order to eliminate the skew. First, we define merging segment and Tilted Rectangle Region(TRR) for the clarity in this section. A manhattan arc is defined to be a line segment, possibly of zero length, with slope -1 or +1(45◦ tilted). The collection of the different manhattan arcs with same slope and fixed distance is called tilted rectangular region. We can examine Fig. 2.9 for an example. The central manhattan arc in a TRR is called its core and the radius is the distance between the core to boundary. The definition of merging segment is simpler, it represents the possible internal nodes’ locations with the minimum wire cost when we merge two nodes. The bottom-up phase is recursively merging each node’s TRR in order to find out all 21.

(31) core. radius. Fig. 2.9: A simple example of tiled rectangle region(TRR). the merging segments which we call it tree of merging segments, Fig. 2.10 is an example. Given the tree of merging segments corresponding to a clock topology, the top-down phase resolves the exact positions of the internal nodes with minimum wire cost while preserving zero skew. DME algorithm requires the pre-determined clock topology so that this algorithm could be applied to many clock routing problem under different objective function as long as providing the related clock topology.. 2.4.2 Low Power Zero-skew Clock Routing For a long time, the general theory behind clock design was that the signal should be kept as clean as possible, and that a circuit designer should not interrupt or disable a clock signal. Traditionally, all the clock routing algorithms aim to solve the clock skew and minimum clock wirelength. However, as the technology keep improving and keep shrinking the feature size, the interconnect power has the same magnitude as compared to gate power [2, 3, 4, 19]. Clock network is a network with long wires and highly switching. 22.

(32) 3. 1. 4. 2. 7 5 root merging segment. 8. 6. Fig. 2.10: A tree of merging segments. Solid lines are merging segments and dotted lines indicate edges between merging segments. The squares represent the clock sinks. activity, hence, the higher power. It has been researched that the clock power is at least 40% of the total power for a typical microprocessor [19]. Therefore, today’s research focus on distributing a clock network with minimizing clock skew, wirelength, timing and finally clock power. It is a new field with many chances and improvements in it. Activity Driven Clock Design Clock power can be saved by disabling clock signals from inactive registers(flip-flops) in idle circuit portions. We can carefully insert the controlling logic gates to control the clock signals in front of registers. By shutting down idle registers, the downstream combinational circuitry and data transmission path would not be active so that saving a substantial amount of power. This technique is so-called gated clock routing or clock gating. By performing sophisticated algorithms, one could achieved the low power clock net23.

(33) work with inserting the control gates. Since registers that should be gated or merged together may be placed far apart, it is possible that gating control wirelength will finally be increased and the total so that the clock power finally goes up. The authors in [22, 25] suggested that by putting the pseudo net between commonlygated registers together in the netlist and then the placer would tends to put them closely. However, here are the drawbacks in this idea: 1. The amount of pseudo nets would be enormously so that the weight of net which made by manually dictated would probably be failed when handling larger circuits. 2. There is no way to handle the global optimization. 3. The pseudo net would destroy the timing of critical nets. Gated clock routing for microprocessor design. In this publication, the authors have proposed an approach for clock tree routing, that ensures zero-skew and reduces the clock power dissipation by minimizing the effective switched capacitance at the internal nodes of the tree. This algorithm refined the DME algorithm which we have known as a multi-purpose clock router. The authors [23, 20] replace the traditional clock routing objective from minimizing total wirelength to minimizing total switched capacitance. The algorithm first formulates the clock network power and clock control network power as follows: Consider a clock tree without gates, the power dissipation on a particular clock edge ei is defined as 2 P owerei = |ei |c0 f Vdd. (2.6). where |ei |c0 represent the switched capacitance of this clock edge(c0 is the unit wire capacitance and ei is the wirelength of the clock edge ei ). The power dissipation for gated clock network, the Equation (2.6) should be modified as: 2 P owerei ,gated = |ei |c0 f Vdd P (ei ). 24. (2.7).

(34) with additional consideration on the switching activity of the clock edge ei . Since the operation frequency and supply voltage are fixed we define effective switched capacitance at any edge as: SCei = (|ei |c0 + Ci )P (ei ). (2.8). where the Ci represents the load capacitance of the clock edge ei . Thus, the switched capacitance of the entire gated clock network can be defined as: SCGCN =. X. SCei. (2.9). ∀ei. Finally, we take the gated clock control network into account. We can follow the definition as same as Equation (2.9): SCGCCN =. 1X (c0 |gei | + Cg )Ptr (gei ) 2 ∀ge. (2.10). i. where the |gei | and Ptr (gei )represent the wirelength and transition probability of the gated clock control wire gei . We assume that the source of the gated clock control network is be laid in the center of the chip, Cg means the control gate’s capacitance. The objective of this routing algorithm is to minimize the term SC = SCGCN + SCGCCN through DME algorithm. First, given all the clock sinks’ locations, they could built a binary tree which represent the gated clock topology which could minimize SC. Then, by passing this topology to DME router hence force it to route this clock network with zero-skew. As we can realize from the idea of this algorithm, there are many aspects which we could improve. Because they have no process for global optimization, they may achieve either a nonoptimal solution or a poor solution. Our work improves these drawbacks then provides a process or a good seed for global optimization.. 25.

(35) Chapter 3 Activity analysis and optimal gated clock topology In this chapter, the activity analysis and optimal gated clock topology will be introduced. As we mentioned in the previously chapters, the dynamic power of a CMOS circuitry is proportional to its switching activity. The VLSI designers can utilize different techniques on their designs to reduce the power consumption as long as the correct distribution of activity among every portion of their circuits is available. However, obtaining the precise activity distribution of each module is not an easy task. After the correct activity distribution is obtained, a procedure of constructing the optimal low power gated clock topology can be executed. LPGC Placer develops a procedure to build the optimal gated clock topology. Before introducing the em LPGC placer, the essential ideas of activity analysis and the optimal gated clock topology are presented in this chapter.. 3.1 Activity Analysis Since the problem of placement which we deal with is at the physical design level, the information of activity from the higher level of the design must be acquired. Many different methods have been investigated in this field [20, 22, 50, 51, 52, 53, 54, 55, 56, 57], two of them which are related to our work will be introduced.. 26.

(36) 3.1.1 Extraction from High Level Description of the System In [22], the authors started from a high-level description of the system, and transformed a hardware description language (HDL) to the representation of control-data flow. Then, all of the modules were scheduled and allocated into the control steps. In other word, they analyzed the data flow in every clock cycle, then calculated the distribution of activity. This method might be failed if the circuit if complicated. The reason is that one may not have the time to simulate the whole system in this manner because it’s too slow. One simple example from [22] is shown in Fig. 3.1. This simple case can be perfectly transformed since its function is very simple. However, as we face the more complicated designs such as SOC or CPU design, this method is inefficient. As we may see from the example that the authors provided which we put in Fig. 3.1, this case can be perfectly transformed since the function is simple. However, when we face the more complicated designs such as SOC or CPU design, this method is inefficient.. 3.1.2 Instruction Level Simulation The other group provided another solution of activity extraction [20, 23]. Their work was based on detailed statistics about the instruction frequency because different instructions had different operation probabilities. Hence, the activities of the modules related to the instructions can be obtained. They simulated in the instruction level with numbers of CPU benchmarks then analyzed how often each instruction was executed. They developed an algorithm to calculated each module’s activity effectively. We also demonstrate this method through an example. For simplicity, we assume that the processor which we shall analyze has only three instruction and five modules. The RTL description of each instruction lists the modules used when executing as what Table 3.1 shows. We can simulate the processor with benchmarks so as to obtain the instruction stream that is occurred most frequently. For example, we get the instruction stream through simulations for 15 clock cycles as Table 3.2 shows. Then, we can scan the table in order to get the executing probability for each module. For instance, P (M2 ) = 27.

(37) y’’+3zy’+3y. u. dz. 3. z. 3. y. =0. dz. u. a. z. dz. C1 HDL. M1. M2. M3. C2. +. C3 while (z < a ) do z1 := z + dz; u1 := u – (3*z*u*dz) – (3*y*dz); z := z1; u := u1; y := y1 ; end do. M4. +. M1. <. C4 M2. C5. -. C6. u1. (a). y1. z1. ctr. (b). Fig. 3.1: Methodology for transformation of an ASIC function from hardware description language to a control-data flow graph. (a) Transformation from a simple ASIC function to hardware description language. (b) A simple example of CDFG after performing HDL scheduling and allocation.. 28.

(38) Instruction IA IB IC. M1 √ √. M2 √. M3 √. M5 √. M4 √ √. √. Table 3.1: A simple RTL description of instructions for a microprocessor. Clock Period Instruction. C1 IA. C2 IB. C3 IC. C4 IA. C5 IA. C6 IB. C7 IC. C8 IA. C9 IC. C10 IA. C11 IB. C12 IA. C13 IC. C14 IA. C15 IC. Table 3.2: Instruction level simulation for 15 Clock Cycles. P (IC ) =. C3 +C7P +C9 +C13 +C15 15 C i=1 i. = 0.33.. According to these tables, if we build a clock tree which merges module 1 and 5, we can calculate the probability of their control gate. We should take consideration on all instructions which contain module 1 and 5. Therefore, P (gateM1 ,M5 = P (M1 ∪ M5 )) = P (IA ) + P (IB ) = 0.67. For the simplicity, we adopt this method to obtain the activity distribution. Researching on how to do it with more efficiently is one of our future work.. 3.2 Optimal Gated Clock Topology Fig. 3.2 demonstrates the circuitry with gated clock network. The power management unit is essentially a finite state machine which determines the active/inactive state of the gates on the clock network. The power management unit send the command to turn on or turn off a gate through gated clock control network. In Fig. 3.2, the rectangle “CLK” denotes the clock source. For the simplicity, we use “GCN” and “GCCN” to denote the gated clock network and gated clock control network. The total power consumption of this system under such topology is Psys = PGCN + PGCCN + PCOM. (3.1). where PGCN and PGCCN refer to the power consumption of the gated clock network and gated clock control network respectively, PCOM is the power consumption of the combinational circuit that can be calculated by using Equation (2.1). 29.

(39) Clock Wire Gate Control Network Combinational Logic Wire Clock Sink Combinational Module. CLK. Power Management Unit. a. b. c. d. e. f. g. h. Fig. 3.2: A typical topology of gated clocking circuitry.  2  PGCN = f Vdd.  X. αk Ck +. ∀register k. 2 PGCCN = f Vdd. X. αtopology,ei c0 |ei |. (3.2). ∀ei. X. (αwj cwj + CGate,j ). (3.3). ∀wj. In Equation (3.3), wj represents a gated clock control wire with capacitance cwj , and αwj denotes the switching activity of wj . Each wj is connected to a control gate with input capacitance CGate,j . In Equation (3.2), the power consumption of gated clock network can be divided into two parts, say registers’ power and clock wire segments’ power. As we can see in Equation (3.2), Ck denotes the input capacitance of a register, c0 is the unit clock wire capacitance, and αtopology,ei is the switching activity at each edge of the gated clock topology. Each value of αtopology,ei is highly dependent on the gated clock topology. Let’s consider Fig. 3.3 for an example to explain the relationship between switching activity and gated clock topology. There are two different gated clock topologies in 30.

(40) CLK. CLK. 40%. 10%. 40%. 30%. 30%. 10%. 10%. 20%. 30%. a. b. c. d. e. f. g. 40%. 10%. 20%. 10%. 30%. a. c. b. d. h. (a). e. f. g. h. (b). Fig. 3.3: An example for switching activities comparison under two different clock topology. (a) Topology with smaller average switching activity. (b) Topology with larger average switching activity.. 31.

(41) Fig. 3.3. For simplicity, we assume that the right hand side of each topology is the same. The activity of each clock module activity is: a=10%, b=10%, c=20%, and d=30%. We also assume that the union activities for (a,b) pair and (c,d) pair are 10% and 30% respectively. By the way, the union activities for (a,c) pair and (b,d) pair are 30% and 40%. Therefore, we can easily compare the average activity for this two topology. The average switching = activity for the gates in the left hand side of Fig. 3.3(a) is 10%+10%+20%+30%+10%+30%+40% 7 21.43% while. 10%+20%+10%+30%+30%+40%+40% 7. = 25.71% for Fig. 3.3(b). If each mod-. ule in each topology in the figure has been placed in the same location, the topology in Fig. 3.3(a) could save additional power from wasted. Therefore, we define the topology which has the smallest activity at each tree node as the “optimal gated clock topology”. As we mentioned in the last section, we could develop an optimal gated clock topology which minimizing the Psys . This would be the first goal of our placer. The second goal of our placer is to place each module at suitable position in order to make the optimal gated clock topology become routable, and then minimize Psys , while the gated clock routing algorithms would probably view that topology as an unroutable structure since the placements provided by the typical placement algorithms do not consider the factor of gated clock network. The major techniques of our placer will be introduced in the next chapter.. 32.

(42) Chapter 4 LPGC Placer In this chapter, the details of LPGC Placer will be described. First, we formulate the problem for LPGC Placer. Second, the data structure adopted by LPGC Placer will be introduced. Then, this chapter will present the comparison between the traditional gated clock network construction flow and the novel flow proposed by LPGC Placer. We will also introduce the notions of “optimal gated clock topology construction” and “pseudo clock router heuristic” in this chapter.. 4.1 Problem Formulation The low power driven placement problem based on optimal gated clock topology can be formulated as follows. Given a set of modules M = {m1 , m2 , · · · , mn }, the interconnect netlist I, and a clock activity relationship matrix, Lr×r , where r is the number of clock pins, and each entry lij in Lr×r represents the joint activity of clock pins i and j, the goal of the low power driven placement based on optimal gated clock topology is to force the placement to approach the optimal gated clock topology through novel pseudo clock router heuristics. At the same time, the traditional placement constraints such as area and the total wire-length are also taken into account.. 33.

(43) 4.2 Data Structrue Many different representations and data structures, such as B*-tree [32, 48], O-tree [33], corner block lists [34], sequence-pair [35], TCG [37], and the other outstanding works [42, 41, 40, 39, 38, 36], have been proposed to handle the floorplanning/placement problem. In order to make a comparison between a dedicated placement and typical placement for the gated clock power optimization, we adopt the data structure of B*-tree [32, 48]. B*-tree is an efficient, flexible, and effective representation of geometric relationship for floorplanning/placement designs. It is essentially a binary tree as shown in Fig. 4.1(a), but with several efficient operations, rotate, flip, move, and swap, for the placement problem. After a series of the above operations onto the B*-tree data structure, a “packing” function can be utilized to map the B*-tree representation to a real placement as shown in Fig. 4.1(b). The above mechanisms of B*-tree could provide us a simple way to “disturb” the placement. By using the optimization algorithms such as simulated annealing,force directed and simulated evolution we can take the advantage of this merit since these algorithms need disturb the target thoroughly. By the way, B*-tree is a good data structure for the area constraint since it places modules compactly from the bottom left corner, and this can minimize the dead space. We adopt this data structure for the LPGC Placer, and compare the experimental results of LPGC Placer with an outstanding B*-tree placer [48].. 4.3 LPGC Novel Gated Clock Network Construction Flow The novel flow for gated clock network construction will be introduced in this section. Fig. 4.2 is the comparison between the traditional gated clock network construction flow and the LPGC novel flow. Traditionally, a placer doesn’t consider the distribution of the clock sinks which has a dramatically impact on the clock power. As shown in Fig. 4.2, the traditional gated clock network construction flow is a simply general placement followed by the gated clock routing. Therefore, the gated clock routing algorithms usually obtain 34.

(44) n0 b5. b6. n1. n7. b3. b1. n5. n2. n8. b10. b4 b2. n11. n9. n3. n6 b9 b0. n10. b8. n4. b11 b7. (b). (a). Fig. 4.1: B*-tree data structure. (a) An eleven blocks B*-tree representation. (b) Modules’ actual positions and placement of (a). a “uniform distribution” of clock sinks so that the chance to get the optimal solution is limited. LPGC Placer provides a novel flow. First, activity analysis is performed to obtain the activity distribution among all the clock sinks. Second, according to the activity distribution, an optimal gated clock topology is constructed. Then a “dedicated seed” for the gated clock routing is constructed by using the placement core and pseudo clock router heuristic as shown in Fig. 4.2(b). After clock gating, an arbitrary gated clock router [21, 22, 23, 20] can be applied to the placement to construct the low power gated clock network. The key role of the LPGC Novel flow is the “pseudo clock router heuristic” for evaluating whether a placement is prone to make the optimal topology routable or not. The pseudo clock router uses a quick way to estimate the clock wire-length instead of performing a real routing algorithm. With the criterion of minimizing the total estimated clock wire-length, LPGC Placer can place the modules close to their optimal locations. 35.

(45) Netlist. Netlist LPGC Placer. Activity Analysis. General Placement. Optimal Gated Clock Topology Construction Placement Core. Pseudo Clock Router Heuristic. Gated Clock Routing. Gated Clock Routing. (a). (b). Fig. 4.2: (a) Traditional design flow of the gated clock network construction. (b)The LPGC novel flow for gated clock network construction. The dotted rectangle circles the operation flow of LPGC Placer. for the routing result to approach the optimal gated clock topology. The placement core of LPGC Placer can be any kind of placement. Here, the simulated annealing method is used to set up the placement core. All the details of LPGC Placer will be described in the next section.. 4.4 The Novel Low Power Placement Algorithm In this section, two main techniques will be described to set up the LPGC Placer, a low power driven placer based on the optimal gated clock topology. LPGC Placer is a simulated annealing based placer with consideration on minimizing area, total wirelength, gated clock network power, and gated clock control network power. The cost function of LPGC Placer is defined as CostLP GC = α · Area + β · T otalW ireLength + γ · GatedClockCost, (4.1) 36.

(46) where α, β, and γ are the weighting factors corresponding to Area, TotalWireLength, and GatedClockCost, and α + β + γ = 1. Here, the GatedClockCost is a function proportional to the power consumption of clock and gated clock control networks. First, a circuit’s activity pattern will be acquired by simulation with the same technique shown in [20, 23] which has been introduced in Section 3.1.2. After that, the LPGC Placer will construct the optimal gated clock topology. Then, the LPGC Placer will also minimize the CostLP GC through the optimizing algorithm, say simulated annealing. By the way, the LPGC Placer minimizes the GatedClockCost by utilizing the pseudo clock router heuristic based on the optimal gated clock topology.. 4.4.1 The Optimal Gated Clock Topology Construction The first step of LPGC Placer is to determine the topology structure of optimal gated clock network distribution. In other word, we seek for the routing tree which could optimize the total switching activity at every node in the gated clock network under the constraint of routability. We modify the recursive construction method used in [20]–[23] with an additional control parameter, the physical connectivity of different blocks, to build the optimal gated clock routing tree. The physical connectivity is indeed an important parameter since it directly determines whether the activity-optimized clock topology can be achieved or not. The other difference is that our procedure determines the optimal gated clock topology since each block’s location on the plane hasn’t been determined. In other word, we try to find a topology which has the minimum average activity. This construction procedure is described in Fig. 4.3. The Huffman coding algorithm [49] is recursively used to construct the clock tree since this algorithm is a effective procedure for constructing a binary tree with the lowest average activity. Each new node is created at each time, and its activity is the joint activity of its two children. With the bottom-up manner of the Huffman coding algorithm, there is a sequence of merging order. We utilize this sequence of merging order to speed up the calculation of GatedClockCost. 37.

(47) Optimal Gated Clock Topology Construction Procedure Input: Activity Patterns of the Modules Output: Optimal Gated Clock Topology T 1 set S ← total modules 2 T ← null 3 While(S.size 6= 1) 4 find the two smallest joint activity modules in S → (x, y) 5 if tie, choose stronger connectivity with T 6 z ← newnode(x, y) //parent of x,y 7 S.insert(z) 8 T.insert(x) 9 S.delete(x) 10 T.insert(y) 11 S.delete(y) 12 T.insert(the one left in S) //Construct the root 13 return T Fig. 4.3: Optimal gated clock topology construction procedure. The binary tree is very suitable for clock distribution [27]. As we introduced in Section 2.3, minimizing the skew effect is necessary in the clock routing. Therefore, using a binary tree for building the optimal gated clock topology is very reasonable. The connectivity is a metric of connection between two modules and can be estimated from the netlist of non-placed modules. If block a and b are highly connected, they should be placed closer than other blocks. With this criterion, the total wire-length between each block is potentially reduced.. 4.4.2 Approaching to Optimal Gated Clock Topology and Pseudo Clock Router Heuristic The optimal gated clock topology T can be constructed after performing the above procedure. Equation (3.2) can be rewritten as  2  PGCN = f Vdd.  X. αk Ck +. X. αOptimal topology,ei c0 |ei |. (4.2). ∀ei. ∀register k. where αOptimal topology,ei is the switching activity at each node of the optimal gated clock topology. 38.

(48) Since the control gates are not inserted at the placement level, the second term of the Equation (3.1) can not be directly calculated. However, all clock gating algorithms tend to insert gates at the positions which can minimize both the control network’s wire-length and the gated clock network power. In other word, we try to disturb the placement to make the optimal gated clock topology becoming a routable structure. We observe that if we could place those modules which are in the same families of the optimal gated clock topology closer, the gated clock router would tend to actually route the topology. Hence, the power consumption of the gated clock control network can be reduced. For example, given an optimal clock topology T as shown in Fig. 3.2, its typical placement without considering the clock gating is illustrated in Fig. 4.5(a). LPGC Placer can move the cells to approach a suitable solution of placement which makes the optimal gated clock topology being routable, and the wire-length of gated clock control network is minimized by the clock gating router. As a result, Fig. 4.5(b) does have smaller clock and control network wire-length. The reason is that the clock gating routers [20]–[23] always merge the modules with minimum switched capacitances to achieve the optimal gated clock topology, and LPGC Placer places the modules based on their optimal topology to provide a better placement for the gated clock router. In other word, since the routers cannot move the modules, they can only construct a local optimal clock gating network with an arbitrary placement while we can provide a global optimal placement for the gated clock routing methods. Given an optimal gated clock topology T and a placement P , LPGC Placer utilizes the following mechanism to drive the given placement to a better solution. LPGC Placer first maps each leaf node of T to the corresponding module of P . Then, it estimates the total clock wire length as if the gated clock network has been really routed as topology T , and chooses this estimate as the cost function. To estimate the total wire length, LPGC Placer utilizes the idea behind the zero-skew clock routing algorithms proposed by [12, 14], and uses a special data structure to store T in the sequence of bottom-up merging in order to speed up the procedure of updating the total estimated clock wire length. We call 39.

(49) Gated Clock Placement Cost Computation Procedure Input: Optimal Gated Clock Topology T; Given Placement P Output: Clock Cost of Some Placement 1 left ← Head(T) 2 ClockCost ← 0 3 While(Next(left) 6= null) 4 right ← Next(left) 5 UpdateLocation(left, right, P, T) /*compute their parent’s approximately zero skew location by LPGC pseudo clock router*/ 6 ClockCost ← ClockCost + WireLength(left, right) 7 left ← Next(right) 8 ClockCost ← ClockCost + WireLength(left, clock source) 9 return ClockCost Fig. 4.4: Gated clock placement cost computation procedure for pseudo clock router heuristic.. Clock Source. 。c 。b. 。c. 。a. 。d. 。g. 。a 。f. 。h 。e. 。g. 。e 。b. 。d. 。f. 。h. (b). (a). Fig. 4.5: (a) A typical placement result under general constraints, the optimal topology seems to be unroutable. (b) A dedicated placement which makes the optimal topology routable.. 40.

(50) this idea as “pseudo clock router heuristic”. The idea behind the real zero-skew clock router [12, 14] is that minimizing the skew problem by routing through the central of the mass. The procedure of constructing the optimal gated clock topology combines a pair of modules once a time. Therefore, memorizing the sequence of the merging order of the Huffman coding algorithm can speed up the calculation process of the estimated central of the mass. ♠ Pseudo Clock Router Heuristic ♠ As we introduced in Section 2.4.1, the MMM algorithm is a top-down algorithm and the GMA algorithm is a bottom-up algorithm. They route the clock net to achieve zero-skew and minimize the clock wire-length. Our optimal gated clock topology can not be a routable topology if the placement is not “suitable”. We evaluate whether a placement is suitable or not by using a pseudo clock router which performs pseudo clock routing algorithm according to the optimal gated clock topology. This pseudo clock router executes the following steps. Step 1 Dictate the siblings and father for each clock module top-down according to the optimal gated clock topology. Step 2 Create steiner points for storing the merging points of each level, and those merging points are called “fathers”. Step 3 Compute the location of each merging point bottom-up by using the idea of means of medians. Step 4 Sum the total pseudo clock wire-length to estimate the total clock wire-length. Fig. 4.6 shows two different placements according to the optimal gated clock topology of Fig. 3.2. You can see that the placement shown in Fig. 4.6(a) has longer pseudo clock wire-length than the placement’s in Fig. 4.6(b). Therefore, the placement of Fig. 4.6(b) is more suitable than Fig. 4.6(a) since it seems to be more possible for a real router to route 41.

(51) H. D. A G. H D clk. clk. C. E. G C. A. F B. F B. E. (b). (a). Fig. 4.6: Evaluate a placement through pseudo clock router heuristic. (a) Longer pseudo clock wire-length. (b) Shorter pseudo cLock wire-length. it. Therefore, by using this pseudo clock router heuristic, we could evaluate whether a placement is more suitable to let optimal gated clock topology become routable or not. One goal of Equation (4.1) is to minimize the GatedClockCost which can be approximated by the pseudo clock wire-length. Therefore, the cost function of gated clock routing can be formulated as GatedClockCost = P seudo Clock W irelength. (4.3). The procedure of calculating the GatedClockCost by using the pseudo clock router heuristic is shown in Fig. 4.4. LPGC Placer utilizes this procedure to update the value of GatedClockCost for each different placement, and tries to minimize it to make the optimal gated clock topology routable. Actually, this procedure is a revised version cost function. We revisited the essential core of the pseudo clock router and then redesign our algorithm since we should calculate it as fast as we can. We just “memorize” the sequence of the merging order which was occured in the construction of optimal gated clock topology. By 42.

(52) design a new data structure for storing the pseudo clock router, our procedure could be performed faster. The computational complexity of the pseudo clock router estimation heuristic is O(n). The simple proof of the complexity is described as follows: Given an ordered sequence P = p1 , p2 , · · · , p2n−1 which is referred to the positions of the clock sinks or internal nodes. The number n denotes the total number of the clock sinks. We store the sequence P as the merging order in the procedure of the optimal gated clock topology construction. The pseudo clock wirelength can be calculated as: Ãn−1 X. !. |P2i−1 − P2i | + |P2n−1 − Ps |. (4.4). i=1. where Ps denotes the position of the clock source. In Equation (4.4), |P2i−1 − P2i | represent the pseudo clock wire-length calculation with complexity O(1). Finally we compute the pseudo clock wire-length with n times so that the complexity of pseudo clock router heuristic is O(n). Minimizing Equation (4.3) can reduce the power consumption gated clock network under the optimal topology. Let’s compare two different placements with the same optimal gated clock topology as shown in Fig. 4.7. Fig. 4.7(a) represents a dedicated placement with considering the gated clock topology. In Fig. 4.7(a), the wire length of each clock net can be reduced since the clock sinks are close to the clock source. Therefore, each Cclock. wire,i. is reduced, and so is Psys . However, a typical placement without con-. sidering the gated clock topology might prohibit the gated clock routing algorithm constructing the optimal gated clock network, as illustrated in Fig. 4.7(b). The clock sinks (a and b, or c and d) which should be merged are too far to be merged. This is the reason why a dedicated placer is important for clock gating routers.. 43.

(53) Power Management Unit. a a. b. c. b. d. d c. (a). Clock Source (b). Fig. 4.7: (a) A dedicated placement with considering the optimal gated clock topology. (b) A typical placement without considering the optimal gated clock topology.. 44.