A Low Power Scheduling Method using Dual Vdd and Dual Vth

Kun-Lin Tsai (1), Szu-Wei Chang (2), Feipei Lai (1,2), and Shanq-Jang Ruan (3)

(1) Department of EE, National Taiwan University, Taipei 106, Taiwan. Email: kunlin@orchid.ee.ntu.edu.tw

(2) Department of CSIE, National Taiwan University, Taipei 106, Taiwan. Email: flai@ntu.edu.tw

(3) Department of ET, National Taiwan University of Science and Technology, Taipei 106, Taiwan. Email: sjruan@et.ntust.edu.tw

Abstract— As technology scales down to nanometer dimensions, static power consumption has become increasingly important. To manage power consumption, we propose a low power method that considers the dual supply voltage (Vdd) and the dual threshold voltage (Vth) at the same time to solve the scheduling problem in the behavioral synthesis stage. The proposed method offers a flexible power design space and better performance. A combined algorithm of GA (Genetic Algorithm) and SA (Simulated Annealing) is used to solve the scheduling problem. Experimental results show a 46.1% power reduction on average.

I. INTRODUCTION

In recent years, the power consumption of a chip has become a very important issue, especially for SoC design. It is obvious that in the next decade low power design would be a big challenge for the IC design companies [1]. Among lots of design methods, the most effective way to reduce power consumption is to lower the supply voltage (Vdd) of a circuit.

Reducing the supply voltage, however, increases the circuit delay. One solution is to use dual or multiple supply voltages. The dual Vdd method has been used at every level of low power circuit design, such as the behavioral level [2], [3] and the gate level [4]. However, taking only the supply voltage into account is not enough. In deep sub-micron design, leakage power consumption is also a very important issue. The authors of [5], [6] proposed using a dual threshold voltage (Vth) to tackle the leakage power optimization problem. Although the leakage power is greatly reduced, the total power consumption still hangs in the balance. Hence, some papers proposed design methods that use dual Vdd and dual Vth at the same time, at the gate level [7] and the circuit level [8], to further reduce power dissipation.

The work presented in this paper focuses on high level power optimization. We address the problem of scheduling a data-flow graph for the case when the resources operate at dual supply voltages and dual threshold voltages. An algorithm that combines the genetic algorithm with simulated annealing is used to assign the voltages for each node of the CDFG (Control/Data Flow Graph). The contributions of this paper are: 1) both dynamic and static power consumption are taken into account; 2) a novel genetic algorithm based simulated annealing algorithm is applied to high level synthesis.

II. SIMULATED ANNEALING AND GENETIC ALGORITHM

The goal of high level synthesis is to map high level descriptions to hardware structures that meet design constraints such as area, latency, and power consumption. Many algorithms are available for high level synthesis; simulated annealing (SA) and the genetic algorithm (GA) are two of them.

Simulated annealing [9], SA for short, is an optimization technique motivated by the physical process of annealing. Simulated annealing starts with a high temperature T. By applying a neighborhood operation, a current state i (with energy $E_i$) changes to the state j (with energy $E_j$) when $E_j < E_i$. If $E_j > E_i$, the state i is replaced by the state j with probability $e^{(E_i - E_j)/T}$. The process is repeated with the new state and a lower temperature given by the cooling function, until the temperature is smaller than the termination temperature $T_f$.
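The acceptance rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `neighbor` callback, and the geometric cooling step are assumptions made for the example.

```python
import math
import random

def accept(e_current, e_candidate, temperature, rng=random):
    """Return True if the candidate state j replaces the current state i."""
    if e_candidate < e_current:
        return True  # downhill moves are always taken
    # uphill moves are taken with probability exp((Ei - Ej) / T)
    return rng.random() < math.exp((e_current - e_candidate) / temperature)

def anneal(state, energy, neighbor, t_start, t_final, cooling=0.9, rng=random):
    """Minimize energy(state); geometric cooling is an assumption of this sketch."""
    t = t_start
    while t >= t_final:  # stop once the temperature is smaller than Tf
        candidate = neighbor(state, rng)
        if accept(energy(state), energy(candidate), t, rng):
            state = candidate
        t *= cooling
    return state
```

At a high temperature the exponent is close to zero, so uphill moves are accepted often; as the temperature falls, the same energy increase is accepted less and less frequently.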

The genetic algorithm (GA) is a method which explores the design space to find a locally optimal solution. The genetic algorithm consists of four steps: crossover, mutation, natural selection, and survival of the fittest. Detailed descriptions can be found in [10].

Both of the above algorithms can be used to solve the optimization problem. Generally, simulated annealing is hard to parallelize, whereas the genetic algorithm is naturally parallel. However, the genetic algorithm has difficulty converging to a good result. The proposed GASA (Genetic Algorithm based Simulated Annealing) algorithm inherits the strengths of both GA and SA while avoiding their disadvantages. GASA can be easily implemented in parallel, so several machines can be combined to reduce the computing time. In addition, the design space can be explored by performing the neighborhood operation from SA.

III. LOW POWER SCHEDULING WITH GASA

In this section, we show our GASA (Genetic Algorithm based Simulated Annealing) scheduling method. First, we introduce the data flow of the GASA algorithm. Second, we describe the dual Vdd and dual Vth library. Third, the chromosome representation and the GASA operations and parameters are introduced in detail. Finally, an example of the GASA scheduling is illustrated.

684 0-7803-8834-8/05/$20.00 ©2005 IEEE.

[Figure: flowchart. Inputs to the scheduler: CDFG & parameters, library. Steps: initiate first generation; select two individuals (two parents); crossover & mutation; evaluation of power & delay for two parents and two children; Boltzmann trial; replace two parents with trial winners; if Tf has been reached, output the schedule result, otherwise repeat with all individuals including the trial winners.]

Fig. 1. The GASA scheduling algorithm flowchart.

A. Data flow of GASA

The GASA algorithm runs several simulated annealing processes in parallel. The mutation operation in GA is analogous to the neighborhood operation in SA, and the crossover operation plays the role of recombining independent solutions. Before we come to the GASA flow, one term must be defined.

Definition 1: A Boltzmann trial is a competition between states i and j, in which the probability that state i wins is $1/(1 + e^{(E_i - E_j)/T})$. Here, e is the base of the natural logarithm, $E_i$ and $E_j$ denote the energies of state i and state j respectively, and T represents the temperature in the SA algorithm.

In our scheduler, the energies $E_i$ and $E_j$ are the power-delay products of the scheduling results. If the power-delay product of state i is smaller than that of state j, then $E_i < E_j$. Note that if the temperature T is large enough, the next state j may be accepted even if the energy of j is larger than the energy of i. By using the Boltzmann trial, the uphill move of SA is thus present in our scheduling algorithm.
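The Boltzmann trial of Definition 1 can be written down directly. The sketch below is illustrative only; the function names and the overflow guard are assumptions of the example, not part of the paper.

```python
import math
import random

def win_probability(e_i, e_j, temperature):
    """Probability that state i beats state j: 1 / (1 + exp((Ei - Ej) / T))."""
    x = (e_i - e_j) / temperature
    if x > 700:  # math.exp would overflow; the probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

def boltzmann_trial(e_i, e_j, temperature, rng=random):
    """Return 'i' or 'j', the winner of one Boltzmann trial."""
    return "i" if rng.random() < win_probability(e_i, e_j, temperature) else "j"
```

With equal energies the probability is exactly 0.5. When T is much larger than the energy difference the probability approaches 0.5 from either side, which is how the higher-energy state can still win; when T is small the lower-energy state wins almost every time.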

The data flow of the GASA algorithm is shown in Fig. 1. In this flow, the inputs are the CDFG (Control/Data Flow Graph) and some parameters. The main scheduler assigns a Vdd and a Vth to each node in the CDFG. Thus, a library which consists of several dual Vdd/Vth components is necessary for the scheduler. The scheduler begins by creating the first generation, which contains many individuals. Then, it randomly selects two individuals as the parents, and performs the crossover and mutation operations to generate two children. After that, the scheduler evaluates the power and delay of the two parents and two children, and decides the Boltzmann trial winners by Definition 1. If the temperature has cooled down to $T_f$ (the termination temperature), the scheduler outputs the trial winner; otherwise it continues the GASA loop.

In our scheduling algorithm, we try to recombine the results of each individual rather than just randomly generating a new individual. By combining the results of the individuals, we can improve the convergence speed. The recombining phase selects two parents from the selection pool and produces two children. The two children may carry some essential genes that make their fitness better or worse than that of their parents. Here, the “essential” genes are those of nodes that belong to the critical paths, or of nodes whose power is reduced by a large amount if another Vdd or Vth is chosen.

TABLE I
AN EXAMPLE OF TECHNOLOGY LIBRARY WITH DUAL Vdd AND DUAL Vth.

| Multiplier | Vdd H / Vth H | Vdd H / Vth L | Vdd L / Vth H | Vdd L / Vth L |
|---|---|---|---|---|
| Power | 561.6 | 754.7 | 273.5 | 365.3 |
| Delay | 18 | 17 | 25 | 24 |

(Power: mW) (Delay: control cycle time)

TABLE II
CHROMOSOME REPRESENTATION

| One individual | n1 | n2 | ··· | nk |
|---|---|---|---|---|
| Relative CS | 0 | 1 | ··· | 1 |
| Instance (Vdd / Vth) | L / H | H / L | ··· | L / L |
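The loop of Fig. 1 can be sketched as follows. This is a simplified reading under stated assumptions, not the authors' code: `energy`, `crossover`, and `mutate` are caller-supplied placeholders, each parent is assumed to compete against one child, one pair is processed per temperature step, and the function returns the lowest-energy individual of the final population.

```python
import math
import random

def boltzmann_win(e_i, e_j, t, rng):
    """True if state i wins the Boltzmann trial against state j."""
    x = (e_i - e_j) / t
    p_i = 0.0 if x > 700 else 1.0 / (1.0 + math.exp(x))
    return rng.random() < p_i

def gasa(population, energy, crossover, mutate, t_start, t_final, cooling, rng):
    """Run a GASA-style loop and return the best individual that remains."""
    population = list(population)
    t = t_start
    while t >= t_final:
        # select two distinct individuals as the parents
        a, b = rng.sample(range(len(population)), 2)
        child_a, child_b = crossover(population[a], population[b], rng)
        child_a, child_b = mutate(child_a, rng), mutate(child_b, rng)
        # assumption: parent a competes with child a, parent b with child b
        for idx, child in ((a, child_a), (b, child_b)):
            parent = population[idx]
            if not boltzmann_win(energy(parent), energy(child), t, rng):
                population[idx] = child  # the trial winner replaces the parent
        t *= cooling  # geometric cooling with a constant 0 < cooling < 1
    return min(population, key=energy)
```

In the scheduler described here, `energy` would be the power-delay product of the schedule that an individual encodes.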

B. Dual Vdd / Vth library

An essential component of the GASA algorithm is the cell library, in which each cell has four instance types, as shown in Table I. Table I shows the power and delay of a multiplier with different Vdd and Vth. In Table I, Vdd H represents a cell with high supply voltage. Similarly, Vth L means a cell with low threshold voltage. In order to simplify the calculation, the delay of each instance type is given as a number of control cycles rather than the actual delay time.
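One possible in-memory form of such a library is a nested dictionary keyed by cell name and (Vdd, Vth) level. The numbers below are the multiplier values of Table I; the dictionary layout and function names are assumptions made for illustration.

```python
# Power in mW and delay in control cycles, copied from Table I.
# Keys are (Vdd level, Vth level) with "H" = high and "L" = low.
MULTIPLIER = {
    ("H", "H"): {"power": 561.6, "delay": 18},
    ("H", "L"): {"power": 754.7, "delay": 17},
    ("L", "H"): {"power": 273.5, "delay": 25},
    ("L", "L"): {"power": 365.3, "delay": 24},
}
LIBRARY = {"multiplier": MULTIPLIER}

def lookup(cell, vdd, vth):
    """Return (power, delay) for one instance type of a cell."""
    entry = LIBRARY[cell][(vdd, vth)]
    return entry["power"], entry["delay"]

def power_delay_product(cell, vdd, vth):
    """Power-delay product of one instance type."""
    power, delay = lookup(cell, vdd, vth)
    return power * delay
```

For this multiplier, the low-Vdd, high-Vth instance has the smallest power-delay product (273.5 × 25 = 6837.5), while the high-Vdd, low-Vth instance is the fastest at 17 control cycles.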

C. Individual and Chromosome Representation

A suitable chromosome representation is needed to represent the individual in the GASA scheduling algorithm, since it affects the running time of the algorithm. The chromosome representation must include the supply voltage, the threshold voltage, and the control cycles. Table II shows the chromosome representation in our GASA algorithm. In Table II, $n_i$ denotes the ith node in the data flow graph, and the relative CS of $n_i$ is the number of control steps between the starting control step of $n_i$ and the maximum occupied control step of all preceding nodes of $n_i$. The instance field records what kind of resource is allocated to the operation; H indicates the high voltage and L indicates the low voltage. By checking the instance field of a node, we can look up the power consumption, area cost, and delay information in the library. Each column in the table represents a chromosome. If there are k nodes in the CDFG, an individual needs k chromosomes to represent itself.
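A small sketch of how such an individual might be decoded into absolute control steps is given below. The class and function names are hypothetical, and the sketch assumes that nodes are listed in topological order and that a node occupies the steps from `start` up to, but not including, `finish`.

```python
from dataclasses import dataclass

@dataclass
class Chromosome:
    relative_cs: int  # control steps to wait after all predecessors finish
    vdd: str          # "H" or "L"
    vth: str          # "H" or "L"

def decode(individual, predecessors, delay_of):
    """Turn an individual into (start, finish) control steps per node.

    individual   : list of Chromosome, one per node, in topological order
    predecessors : list of lists of node indices
    delay_of     : function (node index, vdd, vth) -> delay in control steps
    """
    schedule = []
    for i, gene in enumerate(individual):
        # latest finish step among the preceding nodes (0 if there are none)
        ready = max((schedule[p][1] for p in predecessors[i]), default=0)
        start = ready + gene.relative_cs
        schedule.append((start, start + delay_of(i, gene.vdd, gene.vth)))
    return schedule
```

Because the start step is stored relative to the predecessors, any combination of non-negative relative CS values decodes to a schedule that respects the data dependencies.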

[Figure: four tables (Parent A, Parent B, Child A, Child B), each listing the relative CS and the instance (Vdd / Vth) of nodes N1 to N6.]

Fig. 2. An example of one point crossover operation.

D. GASA Operations and Parameters

1) Mutation operation: When performing the mutation operation on one individual, we choose some chromosomes and mutate their values according to a specific mutation rate. For example, if there are k nodes in one individual and the mutation rate is $P_m$, we choose $k \times P_m$ chromosomes to mutate and randomly change their genes (relative CS and instance). Through the mutation operation, variants of one individual are produced to explore the neighbors of the current position in the design space. The mutation rate $P_m$ is discussed later in this section.
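The mutation step can be illustrated as below. This is a sketch under assumptions: a chromosome is a `(relative_cs, vdd, vth)` tuple, `max_relative_cs` is a hypothetical bound on the relative CS, and $k \times P_m$ is rounded to the nearest integer.

```python
import random

LEVELS = ("H", "L")

def mutate(individual, rate, max_relative_cs, rng=random):
    """Return a copy of the individual with round(k * rate) chromosomes re-drawn."""
    k = len(individual)
    count = round(k * rate)
    mutant = list(individual)  # the parent itself is left unchanged
    for idx in rng.sample(range(k), count):
        mutant[idx] = (
            rng.randint(0, max_relative_cs),  # new relative CS
            rng.choice(LEVELS),               # new Vdd level
            rng.choice(LEVELS),               # new Vth level
        )
    return mutant
```

With k = 10 nodes and a mutation rate of 20%, two chromosomes are re-drawn per mutation. A re-drawn chromosome can happen to equal its old value, so the mutant differs from the parent in at most two positions.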

2) Crossover operation: Crossover is a kind of recombination operation. Fig. 2 shows an example of the crossover operation. In the GASA scheduling, we adopt the one point crossover operation, which exchanges the right-hand parts of two individuals (the chromosomes after the crossover point) with each other.
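One point crossover on list-shaped individuals takes only a few lines. The function name and the explicit `point` parameter are assumptions of this sketch, and the values in the usage note are illustrative.

```python
def one_point_crossover(parent_a, parent_b, point):
    """Swap the chromosomes from index `point` onward between two parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must describe the same nodes")
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b
```

For example, with relative CS lists `[0, 1, 1, 3, 0, 1]` and `[1, 2, 0, 3, 1, 1]` and a cut after the fourth node, the children are `[0, 1, 1, 3, 1, 1]` and `[1, 2, 0, 3, 0, 1]`. Each child keeps the left part of one parent and takes the right part of the other.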

3) Population size: The population size is one of the major control parameters of the GASA. Generally, the larger the population, the better the result we can obtain. However, a larger population also requires more memory and a longer operation time.

4) Cooling procedure: The cooling procedure in our GASA scheduling multiplies the current temperature by a cooling constant CC (0 < CC < 1). If CC is set to a large value, the temperature decreases slowly and the algorithm runs for many generations. In general, the population size and the cooling constant both affect the optimization gain and the computing time. The designer should tune both of these parameters to meet the design constraints.
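The effect of CC on the number of generations can be checked with a short loop. The function below is an illustrative sketch; its name and the choice to stop once the temperature is below the final temperature are assumptions.

```python
def generations(t_start, t_final, cc):
    """Count cooling steps until t_start * cc**n drops below t_final."""
    if not 0.0 < cc < 1.0:
        raise ValueError("cooling constant must satisfy 0 < CC < 1")
    n = 0
    t = t_start
    while t >= t_final:
        t *= cc  # one cooling step per generation
        n += 1
    return n
```

Starting from a temperature of 20 with a final temperature of 0.1, CC = 0.5 gives 8 generations, while CC = 0.9 gives several times as many. The run time grows accordingly.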

5) Mutation rate: The mutation rate affects the difference between parents and children. If the mutation rate is too large, some good chromosomes will be destroyed. If the mutation rate is too small, the children will resemble their parents too closely, and the search will end in a local minimum. In our algorithm, we set the mutation rate to 20%. A larger mutation rate is used if the result seems to be stuck in a local minimum.

[Figure: three panels. (a) DFG with source, sink, and addition/subtraction nodes. (b) Node instance (Vdd / Vth) for nodes 1 to 10. (c) Schedule plotted against control steps.]

Fig. 3. An example of GASA scheduling. (a) Original DFG. (b) Vdd/Vth of each node. (c) Scheduling result.

6) Temperature: The last two parameters are the starting temperature $T_s$ and the terminating temperature $T_f$. We set the terminating temperature to 0.1. The starting temperature is set analytically as

$$T_s = \frac{E_j - E_i}{-\ln\left(\frac{1}{k} - 1\right)},$$

where $E_i$ and $E_j$ are the initial energies of states i and j, and k is the probability of $E_i$ being larger than $E_j$. The starting temperature influences the convergence speed and the acceptance rate of the parent individuals at the beginning of the optimization process.
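One way to read the expression for $T_s$ is as the Boltzmann trial probability of Definition 1 solved for the temperature. The paper does not spell out this derivation, so the steps below are an interpretation: they assume that k is the probability that state i wins the trial at temperature $T_s$.

```latex
\begin{align*}
k &= \frac{1}{1 + e^{(E_i - E_j)/T_s}} \\
\frac{1}{k} - 1 &= e^{(E_i - E_j)/T_s} \\
\ln\!\left(\frac{1}{k} - 1\right) &= \frac{E_i - E_j}{T_s} \\
T_s &= \frac{E_i - E_j}{\ln\!\left(\frac{1}{k} - 1\right)}
     = \frac{E_j - E_i}{-\ln\!\left(\frac{1}{k} - 1\right)}
\end{align*}
```

Under this reading, when $E_i > E_j$ and $k < 1/2$ both the numerator and the denominator of the final expression are negative, so $T_s$ is positive, as a temperature should be.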

E. Example of GASA Scheduling

Suppose we have a library which consists of several components. Each component is implemented with four different Vdd / Vth combinations, as shown in Table I. The input is the DFG shown in Fig. 3(a). At the beginning of the scheduling, each node is assigned a Vdd and a Vth. After many loops, the scheduling result is as shown in Fig. 3(c), and the Vdd / Vth assignment is as shown in Fig. 3(b). Note that the instances used in the scheduling process influence the final delay and the power consumption of the whole design.

The goal of GASA is to minimize the power and delay penalty of a system. Through our GA-based simulated annealing approach, a nearly optimal solution can be achieved in a tolerable processing time.

IV. EXPERIMENTAL RESULT

To show the effectiveness of our method, we compare the power consumption and delay overhead of three scheduling algorithms. The first is the ASAP (As Soon As Possible) scheduling method. The second is the dual Vdd only scheduling method, and the third is the proposed dual Vdd and dual Vth scheduling method. A dual Vdd/Vth library, which contains several components such as a multiplier and an adder, is used for low power scheduling. In this library, each component was designed in the TSMC 0.18 µm process. Vdd H was 1.8 V, Vdd L was 1.26 V, Vth H was 0.558 V, and Vth L was 0.458 V.

The experimental result is shown in Table III. Six CDFG benchmarks are used to examine the scheduling algorithms. The proposed dual Vdd/Vth method obtains the best power-delay product on every benchmark, and it achieves about 46.1% power saving with 12.3% delay overhead on average when compared with the ASAP scheduling method. Fig. 4 shows the comparison of power saving and delay overhead between the ASAP and dual Vdd/Vth scheduling methods. In this figure, the solid lines represent the dual Vdd/Vth scheduling relative to the ASAP method.

TABLE III
EXPERIMENTAL RESULT

| Benchmark | ASAP Power | ASAP Delay | ASAP Power × Delay | Dual Vdd Power | Dual Vdd Delay | Dual Vdd Power × Delay | Proposed Dual Vdd/Vth Power | Proposed Dual Vdd/Vth Delay | Proposed Dual Vdd/Vth Power × Delay |
|---|---|---|---|---|---|---|---|---|---|
| diffeq | 5199.4 | 54 | 280767.6 | 3079.0 | 62 | 190898 | 2300.2 | 64 | 147212.8 |
| fir11 | 9979.5 | 109 | 1087765.5 | 5435.7 | 130 | 706641 | 5797.9 | 112 | 649364.8 |
| dct | 16436.8 | 72 | 1183449.6 | 10941.0 | 83 | 908103 | 9332 | 84 | 783888 |
| iir7 | 13669.3 | 144 | 1968379.2 | 8519.8 | 158 | 1346128.4 | 7656.9 | 156 | 1194476.4 |
| wdf7 | 17023.8 | 125 | 2127975 | 10576.0 | 150 | 1586400 | 9485.1 | 150 | 1422765 |
| nc | 28177.3 | 135 | 3803935.5 | 17834.9 | 151 | 2693069.9 | 14814.2 | 145 | 2148059 |

(Power: mW) (Delay: number of control cycles)

[Figure: power saving and delay overhead, on a scale of 0% to 100%, for the benchmarks diffeq, fir11, dct, iir7, wdf7, and nc.]

Fig. 4. A comparison of power saving and delay overhead between ASAP scheduling and dual Vdd/Vth scheduling.
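The averages quoted for the proposed method can be reproduced from the power and delay columns of Table III. The sketch below assumes that the averages are unweighted means of the per-benchmark ratios, which is a guess about how they were computed; the variable names are illustrative.

```python
# (benchmark, ASAP power, ASAP delay, proposed power, proposed delay), Table III
ROWS = [
    ("diffeq", 5199.4, 54, 2300.2, 64),
    ("fir11", 9979.5, 109, 5797.9, 112),
    ("dct", 16436.8, 72, 9332.0, 84),
    ("iir7", 13669.3, 144, 7656.9, 156),
    ("wdf7", 17023.8, 125, 9485.1, 150),
    ("nc", 28177.3, 135, 14814.2, 145),
]

def averages(rows):
    """Mean per-benchmark power saving and delay overhead, as fractions."""
    savings = [1.0 - p_new / p_ref for _, p_ref, _, p_new, _ in rows]
    overheads = [d_new / d_ref - 1.0 for _, _, d_ref, _, d_new in rows]
    return sum(savings) / len(rows), sum(overheads) / len(rows)

saving, overhead = averages(ROWS)
print(f"power saving {saving:.1%}, delay overhead {overhead:.1%}")
# prints: power saving 46.1%, delay overhead 12.3%
```

Under that assumption the table data gives the same 46.1% power saving and 12.3% delay overhead that the text reports.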

We also show the convergence results for the diffeq benchmark. Fig. 5(a) and (b) show the power and delay convergence of the dual Vdd only scheduling method. Fig. 5(c) and (d) show the power and delay convergence of the dual Vdd and dual Vth scheduling method.

The experimental result shows that we can get further power reduction with limited delay overhead when using a dual Vdd and dual Vth library. This means that the proposed method trades delay for power consumption. In this algorithm, most controlling parameters can be set automatically to reduce the designers' load. More constraints, such as area constraints and additional penalties, can be added to this algorithm to provide a more flexible design space.

V. CONCLUSIONS

In this paper, a dual Vdd / Vth scheduling method is proposed for low power high level synthesis. In the proposed method, the dynamic and static power consumption are considered simultaneously. By using the GASA (genetic algorithm based simulated annealing) algorithm, each node in the CDFG (Control Data Flow Graph) is assigned either a high or a low Vdd / Vth to achieve the low power goal and to control the computing time. The experimental result shows that our method is feasible. The contribution of this paper is that the GASA method can be used for multiple Vdd / Vth scheduling and takes both power and performance into account at the same time.

[Figure: four convergence plots against temperature. Panels (a) and (c) plot power; panels (b) and (d) plot control steps.]

Fig. 5. Convergence result of diffeq benchmark. (a) power result with dual Vdd. (b) delay result with dual Vdd. (c) power result with dual Vdd/Vth. (d) delay result with dual Vdd/Vth.

REFERENCES

[1] http://public.itrs.net/

[2] M. A. Elgamel, and M. A. Bayoumi, “On low power high level synthesis using genetic algorithms,” in IEEE Proc. of ICECS 2002, vol. 2, pp. 725–728, Sept. 2002.

[3] S. P. Mohanty, and N. Ranganathan, “A framework for energy and transient power reduction during behavioral synthesis,” IEEE Trans. on VLSI Systems, vol. 12, No. 6, pp. 562–572, June 2004.

[4] K. Usami, and M. Igarashi, “Low-power design methodology and application utilizing dual supply voltage,” in IEEE Proc. of ASP-DAC, pp. 123–128, Jan. 2000.

[5] D. Samanta, and A. Pal, “Synthesis of dual-VT dynamic CMOS circuits,” in Proc. of VLSI Design 2003, pp. 303–308, Jan. 2003.

[6] K. S. Khouri, and N. K. Jha, “Leakage Power Analysis and Reduction During Behavioral Synthesis,” IEEE Trans. on VLSI Systems, vol. 10, No. 6, pp. 876–885, Dec. 2002.

[7] S. Augsburger, and B. Nikolić, “Combining dual-supply, dual-threshold and transistor sizing for power reduction,” in IEEE Proc. of ICCD’02, pp. 316–321, Sept. 2002.

[8] A. Srivastava, and D. Sylvester, “Minimizing total power by simultaneous Vdd/Vth assignment,” IEEE Trans. on CAD, vol. 23, No. 5, pp. 665–677, May 2004.

[9] P. J. M. van Laarhoven and E. H. L. Aarts, Simulated annealing: theory and applications, Kluwer Academic Publishers, 1987.

[10] D. E. Goldberg, Genetic algorithms in search, optimization, and machine learning, Addison-Wesley, 1989.