Approaching to Optimal Gated Clock Topology and Pseudo Clock

4.4 The Novel Low Power Placement Algorithm

4.4.2 Approaching to Optimal Gated Clock Topology and Pseudo Clock

The optimal gated clock topology T can be constructed after performing the above procedure. Equation (3.2) can be rewritten as

PGCN = f V_dd²

where α_{optimal topology, e_i} is the switching activity at each node of the optimal gated clock topology.

Since the control gates are not inserted at the placement level, the second term of Equation (3.1) cannot be calculated directly. However, all clock gating algorithms tend to insert gates at positions that minimize both the wire-length of the control network and the power of the gated clock network. In other words, we try to perturb the placement so that the optimal gated clock topology becomes a routable structure. We observe that if modules belonging to the same family of the optimal gated clock topology are placed closer together, the gated clock router tends to actually route that topology. Hence, the power consumption of the gated clock control network can be reduced.

For example, given the optimal clock topology T shown in Fig. 3.2, a typical placement that does not consider clock gating is illustrated in Fig. 4.5(a). LPGC Placer moves the cells toward a placement that makes the optimal gated clock topology routable, so that the clock gating router can minimize the wire-length of the gated clock control network. As a result, the placement in Fig. 4.5(b) has smaller clock and control network wire-length. The reason is that the clock gating routers [20]–[23] always merge the modules with minimum switched capacitance to achieve the optimal gated clock topology, while LPGC Placer places the modules according to their optimal topology to provide a better placement for the gated clock router. In other words, since the routers cannot move the modules, they can only construct a locally optimal clock gating network for an arbitrary placement, whereas we can provide a globally optimal placement for the gated clock routing methods.

Given an optimal gated clock topology T and a placement P, LPGC Placer uses the following mechanism to drive the given placement toward a better solution. LPGC Placer first maps each leaf node of T to the corresponding module of P. Then, it estimates the total clock wire length as if the gated clock network had actually been routed according to topology T, and uses this estimate as the cost function. To estimate the total wire length, LPGC Placer borrows the idea behind the zero-skew clock routing algorithms proposed in [12, 14], and uses a special data structure that stores T in the order of bottom-up merging to speed up updating the total estimated clock wire length. We call this idea the “pseudo clock router heuristic”. The idea behind real zero-skew clock routers [12, 14] is to minimize skew by routing through the center of mass. The procedure that constructs the optimal gated clock topology merges one pair of modules at a time. Therefore, memorizing the merging order of the Huffman coding algorithm speeds up the calculation of the estimated center of mass.

Gated Clock Placement Cost Computation Procedure
Input: Optimal Gated Clock Topology T; Given Placement P
Output: Clock Cost of the Placement

    left ← Head(T)
    ClockCost ← 0
    while Next(left) ≠ null
        right ← Next(left)
        UpdateLocation(left, right, P, T)
        /* compute their parent’s approximately zero-skew location by the LPGC pseudo clock router */
        ClockCost ← ClockCost + WireLength(left, right)
        left ← Next(right)
    ClockCost ← ClockCost + WireLength(left, clock source)
    return ClockCost

Fig. 4.4: Gated clock placement cost computation procedure for pseudo clock router heuristic.

Fig. 4.5: (a) A typical placement result under general constraints, for which the optimal topology appears to be unroutable. (b) A dedicated placement which makes the optimal topology routable.
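The procedure of Fig. 4.4 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the LPGC Placer implementation: the `Entry` record, the `parent` index, the use of Manhattan distance for `wire_length`, and the midpoint used in place of the approximately zero-skew location are all hypothetical choices made for the example. The sequence is assumed to be stored in merging order, so every pair is visited exactly once.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Entry:
    """One element of the merge-ordered sequence: a module or a merging point."""
    pos: Optional[Point] = None   # known for modules, computed for merging points
    parent: Optional[int] = None  # index of the parent entry; None for the root


def wire_length(a: Point, b: Point) -> float:
    """Manhattan distance, used here as the wire-length estimate (assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def clock_cost(seq: List[Entry], source: Point) -> float:
    """Walk the sequence pair by pair, as in Fig. 4.4, in a single pass."""
    cost = 0.0
    i = 0
    while i + 1 < len(seq):
        left, right = seq[i], seq[i + 1]
        # UpdateLocation: place the parent at the midpoint of its children
        # (an assumed stand-in for the approximately zero-skew location).
        seq[left.parent].pos = ((left.pos[0] + right.pos[0]) / 2,
                                (left.pos[1] + right.pos[1]) / 2)
        cost += wire_length(left.pos, right.pos)
        i += 2
    # The last entry is the root; connect it to the clock source.
    return cost + wire_length(seq[i].pos, source)


# Hypothetical four-module example: (a, b) -> e, (c, d) -> f, (e, f) -> g.
seq = [Entry((0, 0), 4), Entry((4, 0), 4), Entry((0, 4), 5), Entry((4, 4), 5),
       Entry(parent=6), Entry(parent=6), Entry()]
print(clock_cost(seq, source=(2, 5)))  # 15.0
```

Because the merging order is stored once, re-evaluating the cost after a module moves only requires updating the leaf positions and repeating the single pass.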

♠ Pseudo Clock Router Heuristic ♠

As we introduced in Section 2.4.1, the MMM algorithm is a top-down algorithm and the GMA algorithm is a bottom-up algorithm. They route the clock net to achieve zero-skew and minimize the clock wire-length.

Our optimal gated clock topology cannot be routed if the placement is not “suitable”. We evaluate whether a placement is suitable by using a pseudo clock router, which performs a pseudo clock routing algorithm according to the optimal gated clock topology.

This pseudo clock router executes the following steps.

Step 1 Determine the siblings and father of each clock module top-down according to the optimal gated clock topology.

Step 2 Create Steiner points for storing the merging points of each level; these merging points are called “fathers”.

Step 3 Compute the location of each merging point bottom-up by using the idea of means and medians.

Step 4 Sum the total pseudo clock wire-length to estimate the total clock wire-length.
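The four steps can be illustrated with a small recursive sketch. It is a simplified illustration, not the router used in the thesis: the topology is assumed to be given as nested pairs of module names, each “father” is placed at the midpoint of its two children, and wire-length is measured as Manhattan distance. The names `merge` and `pseudo_clock_wirelength` are hypothetical.

```python
from typing import Dict, Tuple, Union

Point = Tuple[float, float]
# A topology is either a module name (leaf) or a pair of sub-topologies.
Topology = Union[str, Tuple["Topology", "Topology"]]


def manhattan(a: Point, b: Point) -> float:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def merge(tree: Topology, placement: Dict[str, Point]) -> Tuple[Point, float]:
    """Return (merging point, pseudo wire-length) of a subtree, bottom-up."""
    if isinstance(tree, str):               # leaf: a placed clock module
        return placement[tree], 0.0
    (lp, lw) = merge(tree[0], placement)    # Step 1: the two siblings
    (rp, rw) = merge(tree[1], placement)
    # Steps 2 and 3: create the father and place it between its children
    # (midpoint is an assumption made for this sketch).
    father = ((lp[0] + rp[0]) / 2, (lp[1] + rp[1]) / 2)
    return father, lw + rw + manhattan(lp, rp)


def pseudo_clock_wirelength(tree: Topology, placement: Dict[str, Point],
                            source: Point) -> float:
    """Step 4: total pseudo clock wire-length, including the source link."""
    root, length = merge(tree, placement)
    return length + manhattan(root, source)


topology = (("a", "b"), ("c", "d"))
near = {"a": (0, 0), "b": (1, 0), "c": (0, 3), "d": (1, 3)}  # siblings adjacent
far = {"a": (0, 0), "b": (1, 3), "c": (1, 0), "d": (0, 3)}   # siblings apart
print(pseudo_clock_wirelength(topology, near, (0.5, 1.5)))  # 5.0
print(pseudo_clock_wirelength(topology, far, (0.5, 1.5)))   # 8.0
```

In this toy case the same four positions are used in both placements; only the assignment of modules to positions differs, and the placement that keeps siblings together yields the shorter pseudo clock wire-length.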

Fig. 4.6 shows two different placements for the optimal gated clock topology of Fig. 3.2. The placement shown in Fig. 4.6(a) has a longer pseudo clock wire-length than the placement in Fig. 4.6(b). Therefore, the placement of Fig. 4.6(b) is more suitable than that of Fig. 4.6(a), since a real router is more likely to be able to route it. Hence, by using this pseudo clock router heuristic, we can evaluate whether a placement is more suitable for making the optimal gated clock topology routable.

Fig. 4.6: Evaluating a placement through the pseudo clock router heuristic. (a) Longer pseudo clock wire-length. (b) Shorter pseudo clock wire-length.

One goal of Equation (4.1) is to minimize the GatedClockCost, which can be approximated by the pseudo clock wire-length. Therefore, the cost function of gated clock routing can be formulated as

GatedClockCost = Pseudo Clock Wirelength    (4.3)

The procedure for calculating the GatedClockCost with the pseudo clock router heuristic is shown in Fig. 4.4. LPGC Placer uses this procedure to update the value of GatedClockCost for each candidate placement, and tries to minimize it so that the optimal gated clock topology becomes routable. In fact, this procedure is a revised version of the cost function.

We revisited the core of the pseudo clock router and redesigned our algorithm, since the cost must be calculated as fast as possible. We simply “memorize” the merging order that occurred during the construction of the optimal gated clock topology. By designing a new data structure for storing the pseudo clock router, our procedure can be performed faster.

The computational complexity of the pseudo clock router estimation heuristic is O(n).

A simple proof of the complexity is as follows. Consider an ordered sequence P = p₁, p₂, · · · , p_{2n−1} of the positions of the clock sinks and internal nodes, where n denotes the total number of clock sinks. We store the sequence P in the merging order of the optimal gated clock topology construction.

The pseudo clock wirelength can be calculated as:

$$\left( \sum_{i=1}^{n-1} \left| P_{2i-1} - P_{2i} \right| \right) + \left| P_{2n-1} - P_s \right| \qquad (4.4)$$

where P_s denotes the position of the clock source. In Equation (4.4), each term |P_{2i−1} − P_{2i}| represents one pseudo clock wire-length calculation with complexity O(1). Since the pseudo clock wire-length is accumulated over n such terms, the complexity of the pseudo clock router heuristic is O(n).
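As a small worked instance of Equation (4.4), consider n = 2 clock sinks. The coordinates below are hypothetical, |·| is taken to be the Manhattan distance, and the merging point P_3 is assumed to lie at the midpoint of its two children; none of these choices is prescribed by the text above.

```latex
\begin{align*}
P_1 &= (0,0), \quad P_2 = (4,0), \quad P_3 = (2,0), \quad P_s = (2,3) \\
\left( \sum_{i=1}^{1} \left| P_{2i-1} - P_{2i} \right| \right) + \left| P_3 - P_s \right|
  &= \left| P_1 - P_2 \right| + \left| P_3 - P_s \right| \\
  &= 4 + 3 = 7
\end{align*}
```

Two distance evaluations are needed for n = 2, in line with the n constant-time terms counted in the complexity argument.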

Minimizing Equation (4.3) reduces the power consumption of the gated clock network under the optimal topology. Let us compare two different placements with the same optimal gated clock topology, as shown in Fig. 4.7. Fig. 4.7(a) represents a dedicated placement that considers the gated clock topology. In Fig. 4.7(a), the wire length of each clock net is reduced since the clock sinks are close to the clock source. Therefore, each C_{clock wire,i} is reduced, and so is P_sys. However, a typical placement that does not consider the gated clock topology might prevent the gated clock routing algorithm from constructing the optimal gated clock network, as illustrated in Fig. 4.7(b). The clock sinks that should be merged (a and b, or c and d) are too far apart to be merged. This is why a dedicated placer is important for clock gating routers.

[Figure 4.7: two placements, (a) and (b), of the clock sinks a, b, c, and d relative to the Power Management Unit and the Clock Source.]

Fig. 4.7: (a) A dedicated placement that considers the optimal gated clock topology. (b) A typical placement that does not consider the optimal gated clock topology.

Chapter 5 Experimental Result

We have implemented LPGC Placer in C++ on a 1 GHz SUN Blade 1500 workstation with 2.5 GB of memory, and acquired the source code of the B*-tree placer [48] and a gated clock router [20] from their authors. We apply both LPGC Placer and the B*-tree placer to the placement benchmark circuits. The benchmarks are designed to be reasonable and suitable for this experiment.

The placement benchmark “plbench100” denotes a benchmark with 100 blocks. The design process of a benchmark with n blocks is as follows. First, we construct n blocks with the same width and the same height; each of them is a 1000x1000 unit rectangle. Second, we randomly shrink the width and height of each block. The bounds of the shrinking ratio are $0.2 < \frac{\text{height}}{\text{width}} < 5$ and $0.6 < \frac{\text{height} \times \text{width}}{100 \times 100} < 1$. Finally, we randomly assign the interconnections among the blocks.
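The benchmark construction described above might be sketched as follows. This is a hedged illustration rather than the generator actually used: the function names, the uniform sampling, the fixed seed, and the restriction to two-pin connections are assumptions. In particular, the area bound is normalized here by the area of the unshrunk block (`side * side`), which is an interpretation of the stated bound.

```python
import random
from typing import List, Tuple


def make_blocks(n: int, side: float = 1000.0, seed: int = 0) -> List[Tuple[float, float]]:
    """Shrink n square blocks at random, keeping only shapes within the bounds."""
    rng = random.Random(seed)
    blocks = []
    while len(blocks) < n:
        # Both factors are at most 1, so an area ratio above 0.6 requires each
        # factor to exceed 0.6; sampling in [0.6, 1] keeps rejections rare.
        w = side * rng.uniform(0.6, 1.0)
        h = side * rng.uniform(0.6, 1.0)
        if 0.2 < h / w < 5 and 0.6 < (h * w) / (side * side) < 1:
            blocks.append((w, h))
    return blocks


def random_connections(n: int, count: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Randomly connect pairs of distinct blocks (two-pin nets, assumed)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(range(n), 2)) for _ in range(count)]


blocks = make_blocks(100)               # e.g. the blocks of "plbench100"
nets = random_connections(100, 200)     # 200 is an arbitrary illustrative count
```

Rejection sampling is used so that both bounds hold simultaneously for every block without having to derive the joint distribution of width and height.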

To demonstrate the importance of a dedicated placement for clock gating, we compare the two gated clock network construction flows shown in Fig. 4.2. In this experiment, we adopt the B*-tree placer [48] and the gated clock router [20] to represent the traditional flow. Then, we replace the B*-tree placer with LPGC Placer for the novel flow in Fig. 4.2(b). We first compare the general metrics of a placement, namely the total wire length and the total area. Then, we compare the wire-length of the gated clock networks, the wire-length of the gated clock control networks, and the total switched capacitance. We use the total switched capacitance to represent the total clock power.

The results are reported in Fig. 5.1 and Table 5.1. Fig. 5.1 shows the trends under different average module activities for benchmark plbench576. “General Placement” denotes the result of the traditional gated clock network construction flow, while “LPGC Placer” denotes the result of the novel flow. Fig. 5.1(a), (b) and (c) show that LPGC Placer reduces the total switched capacitance by about 20%, the gated clock network’s wire-length by about 6%, and the gated clock control network’s wire-length by about 28% across the different average module activities in plbench576.

Fig. 5.1(b) shows that LPGC Placer indeed saves gated clock network power by shrinking the clock wire length. The authors of [20] pointed out that the power consumption of the gated clock control network could be ignored because it was very small. However, Fig. 5.1(c) shows that this is not always true, and LPGC Placer can significantly reduce the gated clock control network power. Hence, as Fig. 5.1(a) shows, LPGC Placer still reduces the power consumption when the average activity of the modules is high. These factors account for the gap in power reduction between the flows with and without a dedicated placement.

Table 5.1 shows the comparison between the traditional flow and the novel flow on different benchmarks. “Est. Wire” is the estimated total wire length between modules, excluding the gated clock and gated clock control networks. In Table 5.1, we use SCC, the total switched capacitance of the gated clock and gated clock control networks, to represent the power consumption of the clock routing. GCW is the actual wire length of the gated clock network, and GCCW is the actual wire length of the gated clock control network.

The results are excellent: on average, LPGC Placer provides a dedicated placement that lets clock gating methods additionally reduce the gated clock network’s and the gated clock control network’s wire-length by 19.2% and 33.2%, respectively. The power consumption of the total gated clock network is reduced by 27.4%. The total area is also reduced by 0.5%. However, the total wire length is increased by 0.4%. We should note that the clock power saved by LPGC Placer is an additional amount that a typical placer could not provide. Fig. 5.2 and Fig. 5.3 present snapshots of the placement results of benchmarks plbench576 and plbench3136, respectively, produced by LPGC Placer.

[Figure 5.1, panels (a) to (d). Horizontal axis of each panel: Average Module Activity (14.6%, 20.5%, 32.6%, 45.3%, 57.0%, 64.5%, 75.5%, 86.2%).]

Fig. 5.1: (a) Gated clock network comparison between a general placer and LPGC Placer with additional consideration of gated clock routing. (b) Additional gated clock network wire-length saving. (c) Additional gated clock control network wire-length saving. (d) Reduction rate comparison.

Fig. 5.2: Snapshot of the placement result on circuit plbench576.

Fig. 5.3: Snapshot of the placement result on circuit plbench3136.

B*-tree Placer [48]

Benchmark       Est. Wire (mm)    Area (mm²)    SCC (pF)    GCW (mm)    GCCW (mm)
plbench250      66.5              16.1          33.7        60.2        500.2
plbench576      1940.5            568.5         242.2       547.5       6608.9
plbench1000     614.1             1012.0        569.2       983.5       15903.3
plbench3136     5135.3            3646.1        3007.1      4333.8      94829.4
plbench10000    20104.0           10264.1       16222.9     10451.0     593076.9

LPGC Placer

Benchmark       Est. Wire (mm)     Area (mm²)         SCC (pF)           GCW (mm)           GCCW (mm)
plbench250      70.6 (+6.1%)       16.1 (+0.0%)       27.1 (-20.0%)      48.1 (-20.2%)      344.5 (-31.1%)
plbench576      1905.9 (-1.8%)     544.5 (-4.2%)      178.9 (-20.4%)     513.5 (-6.20%)     4727.6 (-28.5%)
plbench1000     616.0 (+0.3%)      1019.1 (+0.7%)     403.0 (-29.2%)     721.5 (-26.6%)     10128.8 (-36.3%)
plbench3136     4972.0 (-3.2%)     3698.1 (+1.4%)     2151.1 (-28.5%)    3354.2 (-22.6%)    67297.0 (-29.0%)
plbench10000    20204.0 (+0.5%)    10212.1 (-0.5%)    9819.0 (-39.4%)    8312.4 (-19.3%)    349180.4 (-36.6%)
Avg. Reduction  +0.387%            -0.523%            -27.4%             -19.2%             -33.2%

Table 5.1: Comparison between a typical placer and LPGC Placer on benchmark circuits.

Chapter 6 Conclusions and Future Work

In the coming nano-scale era, the power consumption of interconnects will become a serious problem that is hard to solve, and more and more researchers are devoting themselves to this field. The number of electronic products is growing at a rate that follows Moore’s law, so the total power consumption of these products becomes enormous.

Therefore, an integrated solution at every level is no doubt highly desirable. We have developed a novel placer, LPGC Placer, to provide a low-power solution at the placement level, and we have demonstrated through experimental results that it works.

We truly hope that our work will inspire other designers working on low-power design and help them minimize the power consumption of their products.

We will revisit the following subjects as our future work.

Thermal Balancing

We believe that in the coming SOC era the thermal problem will become worse than before, since so many devices are placed on a small area. In recent decades, chip area has shown no obvious increasing trend, but the number of devices on a chip has increased at a very sharp rate. With continuously advancing technology and design methods, the thermal balancing problem becomes more important. In particular, since EDA tools that can handle this problem are lacking, we intend to pioneer work on it.

Gate Insertion

We find that every low power clock gating router makes an over-optimistic assumption: after the sophisticated calculation of the locations for the additional gates, each gate can always be placed at its best location.

As designs become more complicated, the available space for gate insertion becomes scarce. We will investigate this problem at a higher level of physical design.

Multilevel Placement

Multilevel methodology is a new trend that forces every traditional algorithm to be revisited in order to handle extremely large designs. We will explore every possibility of adopting the multilevel technique in our LPGC Placer.

Reconstruction of Optimal Gated Clock Topology

As the feature size shrinks further, today’s optimal topology might become tomorrow’s non-optimal one when nano-scale problems such as signal integrity, crosstalk, or manufacturing yield are considered. We will revisit the problem of constructing the optimal gated clock topology from different directions.

Bibliography

[1] A. P. Chandrakasan, S. Sheng, R. W. Brodersen, “Low-Power CMOS Digital Design,” IEEE Journal of Solid-State Circuits, Vol. 27, No. 4, pp. 473–484, April 1992.

[2] Ralph H. J. M. Otten, Raul Camposano and Patrick R. Groeneveld, “Design Automation for Deepsubmicron: present and future,” Proceedings of the 2002 Design, Automation and Test in Europe Conference and Exhibition (DATE’02), pp. 650-659.

[3] D. Duarte, V. Narayanan, and M. J. Irwin, “Impact of Technology Scaling in the Clock System Power,” Proceedings of the IEEE Computer Society Annual Symposium on VLSI, p. 59, April 25-26, 2002.

[4] David Garrett, Mircea Stan and Alvar Dean, “Challenges in clockgating for a low power ASIC methodology,” Proceedings of the 1999 international symposium on Low power electronics and design, pp. 176-181, August 16-17, 1999, San Diego, California, United States.

[5] Chao, K.-Y., and Wong, D.F., “Floorplanning for low power designs,” IEEE International Symposium on Circuits and Systems, Volume 1, 28 April-3 May 1995, pp. 45-48.

[6] Hirendu Vaishnav, and Massoud Pedram, “PCUBE: A Performance Driven Placement Algorithm for Low Power Designs,” Proceedings of European Design Automation Conference, pp. 72-77, September 1993.

[7] Sadiq M. Sait, Mahmood R. Minhas, and Junaid A. Khan, “Performance and low power driven VLSI standard cell placement using tabu search,” IEEE Congress on Evolutionary Computation, May 2002, Honolulu, Hawaii, USA, pp 372-377.

[8] Sadiq M. Sait, Habib Youssef, Junaid A. Khan, and Aiman El-Maleh “Fuzzified Iterative Algorithms for Performance Driven Low Power VLSI Placement,” 19th International Conference on Computer Design (ICCD 2001), 23-26 Sept. 2001 Pages:484 - 487

[9] M. Jiménez, and M. Shanblatt, “Integrating a Low-Power Objective into the Placement of Macro Block-based Layouts,” Proceedings of the IEEE Midwest Symposium on Circuits and Systems, pp. 62-65, Ohio, Aug. 2001.

[10] Sadiq M. Sait, H. Youssef, and Junaid Khan, “Fuzzy simulated evolution for power and performance optimization of VLSI placement” International Joint INNS-IEEE Conference on Neural Networks, Volume: 1, pages: 738-743, Washington DC, July 14-19, 2001.

[11] T.-H. Chao , J.-M. Ho , Y.-C. Hsu, “Zero skew clock net routing,” Proceedings of the 29th ACM/IEEE conference on Design automation, p.518-523, June 08-12, 1992, Anaheim, California, United States

[12] Michael A. B. Jackson, Arvind Srinivasan, E. S. Kuh, “Clock routing for high-performance ICs,” Proceedings of the 27th ACM/IEEE conference on Design automation, pp. 573-579, June 24-27, 1990, Orlando, Florida, United States.

[13] Masato Edahiro, “A clustering-based optimization algorithm in zero-skew routings,” Proceedings of the 30th international conference on Design automation, pp. 612-616, June 14-18, 1993, Dallas, Texas, United States.

[14] J. Cong, A. B. Kahng, and G. Robins, “Matching-Based Methods for High-Performance Clock Routing,” IEEE Transactions on Computer-Aided Design, 12(8), August 1993, pp. 1157-1169.

[15] A. B. Kahng, J. Cong and G. Robins, “High-Performance Clock Routing Based on Recursive Geometric Matching,” Proc. ACM/IEEE Design Automation Conference, June 1991, pp. 322-327.

[16] K. D. Boese, A. B. Kahng, “Zero-Skew Clock Routing Trees with Minimum Wire Length,” IEEE International Conference on ASIC, pp. 1.1.1–1.1.5, Rochester, NY, September 1992.

[17] Ting-Hai Chao, Yu-Chin Hsu, Jan-Ming Ho, Kenneth D. Boese, and Andrew B. Kahng, “Zero skew clock routing with minimum wirelength,” IEEE Transactions on Circuits and Systems, Volume 39, Issue 11, Nov. 1992, pp. 799-814.

[18] Ran-Song Tsay, “Exact Zero Skew,” IEEE Int. Conference on Computer-Aided Design (ICCAD-91), pp. 336-339, Nov. 1991.

[19] M. Donno, A. Ivaldi, L. Benini, and E. Macii, “Clock-tree power optimization based on RTL clock-gating,” ACM/IEEE 40th Design Automation Conference (DAC-03), pages 622-627, Anaheim, CA, June 2-6 2003.

[20] J. Oh, and M. Pedram, “Gated Clock Routing for Low-Power Microprocessor Design,” IEEE Transactions on CAD/ICAS, Vol. 20, No. 6, pp. 715-722, June 2001.

[21] Q. Wang, and S. Roy, “Power minimization by clock root gating,” Proceedings of the ASP-DAC 2003, pp. 249-254, 21-24 Jan. 2003.

[22] Amir H. Farrahi, Chunhong Chen, Ankur Srivastava, Gustavo Tellez, and Majid Sarrafzadeh, “Activity-Driven Clock Design,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 20, No. 6, June 2001, pp. 705-714.

[23] J. Oh, and M. Pedram, “Gated clock routing minimizing the switched capacitance,” Proc. of Design Automation and Test in Europe, Feb. 1998, pp. 692-697.

[24] L. Benini, G. De Micheli, “Transformation and Synthesis of FSMs for Low-Power Gated-Clock Implementation,” IEEE Transactions on CAD/ICAS, Vol. 15, No. 6, pp. 630–643, June 1996

[25] Chunhong Chen, Changjun Kang and Majid Sarrafzadeh, “Activity-sensitive clock tree construction for low power,” Proceedings of the 2002 international symposium on Low power electronics and design, August 12-14, 2002, Monterey, California, USA

[26] Q. Wu, M. Pedram, and X. Wu. “Clock-gating and its application to low power design of sequential circuits,” IEEE Custom Integrated Circuits Conference, 1997, pp. 479-482

[27] Naveed Sherwani, Algorithms for VLSI Physical Design Automation, Kluwer Academic Publishers, 3rd edition, 1999

[28] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, 2nd Ed, MIT Press and McGraw-Hill Book Company, 2001

[29] Sadiq M. Sait and H. Youssef, VLSI Physical Design Automation: Theory and
