Intra-task DVS - Related Work - Energy-Efficient Intra-task Dynamic Voltage Scaling Strategies for Realistic Processors (真實處理器之高能源效率任務內動態電壓調整策略)

2. Related Work

2.3 Intra-task DVS

Intra-task DVS algorithms adjust the CPU frequency while a task is running, based on its deadline and remaining execution cycles. Shin et al. [10] first proposed an intra-task DVS algorithm for hard real-time applications that assigns speeds according to the remaining worst-case execution cycles (WCEC); voltage scaling decisions are made at compile time to reduce runtime overhead. However, it is not energy efficient if the CPU speed is changed frequently.

AbouGhazaleh et al. [6] applied a collaborative operating system and compiler algorithm to reduce voltage scaling overhead. The compiler inserts WCEC information into the source code, and the operating system controls when the CPU frequency is changed, at points called power management points. These two DVS algorithms share the property that tasks execute at a high frequency first and then decelerate [5][14], which wastes energy when tasks rarely take their full WCET. Shin and Kim [18] used the average-case execution path (ACEP) as a reference path to lower the CPU frequency at the beginning of a task, but Seo et al. [14] claimed that an ACEP-based algorithm does not always achieve the minimum average energy consumption and proposed an optimal-case execution path based on probability. However, this algorithm only supports control flow graphs in which each branch has exactly two successors. Although intra-task DVS algorithms based on program analysis do not require OS modification [10][18], their shortcoming is that they cannot collaborate with inter-task DVS algorithms, because the frequency/voltage scaling code is inserted at compile time and the CPU frequency schedule is therefore static.

Gruian [5] first proposed a stochastic model for intra-task DVS algorithms, which performs better than reclaim-based intra-task ones [6][10][18]. In a reclaim-based intra-task DVS algorithm, the CPU frequency is calculated dynamically while a task is running. Lorch and Smith [2] proved that, in Gruian's stochastic model, the optimal speed is inversely proportional to the cube root of the tail probability, but this optimal speed schedule is not energy efficient in practice if the CPU supports only a limited set of frequency and voltage levels.

Chapter 3 Optimal Speed Schedule for Ideal CPUs

In this chapter we discuss an optimal speed schedule with minimal energy consumption for ideal CPUs. The clock frequency is assumed to be linearly related to the supply voltage, and the supply voltage can be changed along with any change in CPU frequency. Although Lorch and Smith [2] have derived such an optimal schedule, called PACE, their proof is tedious and does not take into account the limited set of supported frequency/voltage levels or the actual power consumption of a CPU. Therefore, we first formulate the optimal schedule problem as a constrained nonlinear programming problem and derive an optimal speed schedule for ideal CPUs using the Lagrange multiplier procedure.

3.1 Stochastic Intra-task DVS

The switching power dissipation (Psw) in CMOS circuits is defined as [1]

$$P_{sw} = C_{eff} \cdot V_{dd}^2 \cdot f \qquad (1)$$

where C_eff is the effective switching capacitance, V_dd is the supply voltage, and f is the CPU frequency. Reducing the supply voltage therefore lowers the power consumption. The relation between circuit delay (T_d) and supply voltage (V_dd) is approximated by [1]

$$T_d \propto \frac{C_L \cdot V_{dd}}{(V_{dd} - V_{th})^{\alpha}} \qquad (2)$$

where C_L is the load capacitance, V_th is the threshold voltage, and α is the velocity saturation index, which varies between 1 and 2. Because the clock frequency is proportional to the inverse of the circuit delay, reducing the supply voltage also reduces the clock frequency [1]. In [2][4][5][13][14][20], a linear relation between V_dd and f is assumed. It would not be beneficial to reduce the CPU frequency without also reducing the voltage, because in this case the energy per operation would be constant [29]. The energy (E) consumed by executing a task is defined as [4]

$$E = P_{sw} \cdot t = C_{eff} \cdot V_{dd}^2 \cdot f \cdot t = C_{eff} \cdot V_{dd}^2 \cdot C_a \qquad (3)$$

where t is the actual execution time of the task and C_a = f · t is the actual number of execution cycles of the task.
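The dependence of energy on frequency can be made explicit with a short worked step. This is only a sketch: the constant k relating V_dd to f is introduced here for illustration and does not appear in the thesis.

```latex
\begin{align*}
t &= \frac{C_a}{f}, \qquad
E = P_{sw}\, t = C_{eff} V_{dd}^2 f \cdot \frac{C_a}{f} = C_{eff} V_{dd}^2 C_a, \\
V_{dd} = k f \;&\Rightarrow\; P_{sw} = C_{eff} k^2 f^3, \qquad E = C_{eff} k^2 C_a f^2 .
\end{align*}
```

With V_dd held fixed, E does not depend on f, which is why lowering the frequency alone saves no energy per cycle. When V_dd is scaled together with f, the energy per cycle falls with the square of the frequency.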

In this thesis we focus on intra-task DVS, so we consider only one task.

Assume the WCEC of a task T is C_w. Let the random variable X denote the actual number of cycles used by task T, taking values in the interval (0, C_w). The cumulative distribution function of X is F(x) = P(X ≤ x), and the tail distribution function [5] is F^c(x) = 1 − F(x) = P(X > x). While a task is running, a stochastic DVS algorithm assigns an appropriate CPU speed depending on how many cycles the task has completed. As in [2][5], we denote the speed by an ascending function s(x). Because the relationship between voltage and clock rate is linear [20], the energy of one cycle at speed s(x) is proportional to s(x)^2, and the cycle at position x is executed with probability F^c(x). The expected energy consumption of executing a task is therefore proportional to

$$\int_{0}^{C_w} F^c(x) \cdot s(x)^2 \, dx \qquad (4)$$

3.2 Simplifying the Derivation in PACE by Lagrange Multiplier Procedure

Because it is infeasible to change the CPU frequency/voltage continuously, we choose a set of n possible execution cycle counts sorted from minimum to maximum, {C_1, C_2, …, C_n}, and only change the CPU frequency/voltage after the task has executed C_i cycles, where i = 0, …, n−1 and C_0 = 0. In other words, the CPU frequency between C_{i−1} and C_i is constant. We denote the constant frequency assigned to partition [C_{i−1}, C_i) as f_i and rewrite equation (4) in discrete form:

$$\sum_{i=1}^{n} F^c(C_{i-1}) \cdot (C_i - C_{i-1}) \cdot f_i^2 \qquad (5)$$

Our goal is to find an optimal speed schedule that minimizes objective function (5) under the time constraint

$$\sum_{i=1}^{n} \frac{C_i - C_{i-1}}{f_i} = d \qquad (6)$$

where d is the deadline of the task.

Finding the minimum value of objective function (5) under constraint (6) is a constrained nonlinear programming problem. We use the Lagrange multiplier procedure to solve it [24]. First, we relax the constrained model into an unconstrained form by weighting the constraint in the objective function with a Lagrange multiplier v.

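The result is the Lagrangian function

$$L(f_1, \ldots, f_n, v) = \sum_{i=1}^{n} F^c(C_{i-1}) (C_i - C_{i-1}) f_i^2 + v \left( \sum_{i=1}^{n} \frac{C_i - C_{i-1}}{f_i} - d \right) \qquad (7)$$

Solving for the stationary point of the Lagrangian function, we obtain the optimal solution f_i^*,

$$f_i^* = \frac{\sum_{j=1}^{n} (C_j - C_{j-1}) \sqrt[3]{F^c(C_{j-1})}}{d \cdot \sqrt[3]{F^c(C_{i-1})}} \qquad (8)$$

which is the same as that in [2]. We now solve for the stationary point of the Lagrangian function (7). First, the derivative of the Lagrangian function (7) with respect to v is set to 0:

$$\frac{\partial L}{\partial v} = \sum_{i=1}^{n} \frac{C_i - C_{i-1}}{f_i} - d = 0$$

Next, the derivative with respect to each f_i is set to 0:

$$\frac{\partial L}{\partial f_i} = 2 F^c(C_{i-1}) (C_i - C_{i-1}) f_i - v \frac{C_i - C_{i-1}}{f_i^2} = 0 \;\Rightarrow\; f_i = \sqrt[3]{\frac{v}{2 F^c(C_{i-1})}}$$

Substituting f_i into the first condition determines v and yields equation (8).

To verify that the computed stationary point is indeed a minimum, consider the Hessian matrix of the Lagrangian function L with respect to f_1, …, f_n. It is diagonal, with entries

$$\frac{\partial^2 L}{\partial f_i^2} = 2 F^c(C_{i-1}) (C_i - C_{i-1}) + 2 v \frac{C_i - C_{i-1}}{f_i^3}$$

Because v > 0 and f_i > 0 at the stationary point, every entry is positive and the Hessian is positive definite. The stationary point of the Lagrangian function L is therefore an unconstrained local minimum and is optimal for the objective function (5) [24].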
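The cube-root rule can be checked numerically. The following is a minimal sketch, assuming partition end points in millions of cycles (Mc), the deadline in ms, and tail probabilities taken at the start of each partition; the function names `ideal_schedule` and `total_time_ms` are ours, not the thesis's. Each speed is made inversely proportional to the cube root of the tail probability [2] and scaled so that the deadline is used up exactly.

```python
def ideal_schedule(bounds_mc, tail, deadline_ms):
    """Ideal (continuous-frequency) speed per partition, in MHz.

    bounds_mc: partition end points C_1..C_n in millions of cycles (C_0 = 0).
    tail: tail probability at the start of each partition.
    """
    starts = [0.0] + list(bounds_mc[:-1])
    sizes = [end - start for start, end in zip(starts, bounds_mc)]
    roots = [p ** (1.0 / 3.0) for p in tail]
    # Scale factor chosen so that the partition times add up to the deadline.
    scale = sum(c * r for c, r in zip(sizes, roots)) / deadline_ms
    # Mc per ms equals GHz, so multiply by 1000 to get MHz.
    return [1000.0 * scale / r for r in roots]


def total_time_ms(bounds_mc, freqs_mhz):
    """Worst-case running time when every partition is executed."""
    starts = [0.0] + list(bounds_mc[:-1])
    return sum(1000.0 * (end - start) / f
               for start, end, f in zip(starts, bounds_mc, freqs_mhz))


# Example: two partitions ending at 5 Mc and 15 Mc, 50 ms deadline.
speeds = ideal_schedule([5, 15], [1.0, 0.2], 50)
print([round(f) for f in speeds], total_time_ms([5, 15], speeds))
```

The second speed is the first divided by the cube root of 0.2, that is, about 1.71 times higher.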

Chapter 4 Proposed Optimal Schedule in Realistic CPUs (OSRC)

4.1 The Problem of the PACE in Realistic CPUs

The main problem with the optimal speed schedule from PACE [2] is that realistic CPUs, such as the Intel XScale [15][16], AMD Duron [17], and Transmeta Crusoe [18], support only a limited set of voltage levels for corresponding frequency ranges. This means the optimal speed schedule from PACE is not directly applicable to these CPUs. A second problem is that, if we consider only the dynamic power given by equation (1) when calculating the optimal speed schedule and neglect the static and leakage power, the resulting schedule may actually save less energy. It is therefore more reasonable to obtain the optimal speed schedule from measured CPU power consumption data.

Table 1 and Table 2 show the valid CPU frequencies and the corresponding voltages and power consumption of the Intel PXA255 and PXA270 CPUs, respectively. The power consumption values differ from the theoretical values calculated from equation (1), because equation (1), like most DVS research, does not account for short-circuit and leakage power [1]. It is therefore more reasonable to adopt realistic power consumption values in DVS algorithms instead of those from equation (1). In this thesis, we used the power consumption values directly from Table 1 and Table 2 for evaluation.

Table 1: Power consumption specification for Intel PXA255.

Frequency (MHz)   Voltage (V)   Power (mW)
33 (idle)         1.0           45
200               1.0           178
300               1.1           283
400               1.3           411

Table 2: Power consumption specification for Intel PXA270.

Frequency (MHz)   Voltage (V)   Power (mW)
13 (idle)         0.85          44.2
104               0.9           115
208               1.15          279
312               1.25          390
416               1.35          570
520               1.45          747
624               1.55          925

For illustration, we use PACE and our proposed OSRC, respectively, to find optimal schedules in realistic CPUs for two example tasks. Table 3 gives the specifications of these two tasks.

Table 3: Specifications of two example tasks

Task     C_i (Mc)      F^c(C_{i-1})     Deadline (ms)
Task 1   {5, 15}       {1, 0.2}         50
Task 2   {5, 10, 15}   {1, 0.3, 0.1}    50

4.2 PACE Approach: Rounding the Frequencies Obtained from PACE to the Nearest Available Frequencies

First, we take task 1 running on the Intel PXA255 as an example. From equation (8), if the start-up speed of task 1 is c and the task is still running after 5 million cycles (Mc), the CPU speed should be raised to 1.71c. Using equation (8), we found that the start-up speed c is 217 MHz and the speed after 5 Mc is 370 MHz. Because the Intel PXA255 does not support these speed settings, it is reasonable to choose the next higher available frequencies to avoid missing the deadline. After rounding up the ideal frequencies, the speed schedule is 300 MHz at start-up and 400 MHz after 5 Mc have been executed. We use the function p(f), which is C_eff · f^3 for an ideal CPU, to describe the power consumption at speed f; for the PXA255, p(f) can be obtained from Table 1. We now rewrite equation (5) as

$$\sum_{i=1}^{n} F^c(C_{i-1}) \cdot \frac{C_i - C_{i-1}}{f_i} \cdot p(f_i) \qquad (9)$$

From equation (9), the expected energy consumption is 7.15 mJ if the energy consumption during idle time is included.

Now we take task 2 running on the Intel PXA255 as another example. In a similar way, PACE gives an optimal schedule of {213 MHz, 319 MHz, 460 MHz}. Because the maximum frequency supported by the Intel PXA255 is 400 MHz, no available frequency exists to which 460 MHz can be rounded up.
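The two examples in this section can be reproduced with a few lines of code. This is a sketch under stated assumptions: the ideal speeds are the rounded values quoted in the text, the power values are copied from Table 1, and the idle energy is charged only for the slack left when every partition runs (that accounting reproduces the 7.15 mJ figure, but it is our reading rather than a formula stated in the text). The helper names are hypothetical.

```python
# Operating points of the Intel PXA255 from Table 1: frequency (MHz) -> power (mW).
PXA255_MW = {200: 178.0, 300: 283.0, 400: 411.0}
IDLE_MW = 45.0  # power in the 33 MHz idle state (Table 1)


def round_up(freq_mhz, levels):
    """Smallest available level >= freq_mhz, or None if there is none."""
    candidates = [s for s in sorted(levels) if s >= freq_mhz]
    return candidates[0] if candidates else None


def expected_energy_mj(sizes_mc, tail, freqs_mhz, deadline_ms):
    """Expected active energy per partition plus idle energy for the slack."""
    times_ms = [1000.0 * c / f for c, f in zip(sizes_mc, freqs_mhz)]
    active_uj = sum(p * t * PXA255_MW[f]
                    for p, t, f in zip(tail, times_ms, freqs_mhz))
    idle_uj = (deadline_ms - sum(times_ms)) * IDLE_MW  # assumed idle model
    return (active_uj + idle_uj) / 1000.0


# Task 1: ideal PACE speeds quoted in the text, rounded up to available levels.
task1 = [round_up(f, PXA255_MW) for f in (217, 370)]
print(task1, round(expected_energy_mj([5, 10], [1.0, 0.2], task1, 50), 2))

# Task 2: the last ideal speed exceeds every available level.
task2 = [round_up(f, PXA255_MW) for f in (213, 319, 460)]
print(task2)
```

For task 2 the last entry comes back as `None`, which is the rounding failure described above.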

4.3 OSRC Approach

Expressing the selected frequency as a linear combination of the available frequencies is a common technique [22][25][26][27][28]. By rewriting the objective function (5) in this way, the stochastic DVS model can be formulated as MKP. If a CPU has a limited set of m speeds, {s_1, s_2, …, s_m}, it is better to formulate the original constrained nonlinear programming problem (equations (9) and (6)) as MKP. We denote f_i, the CPU frequency that is set after the task has executed C_{i−1} cycles and stays constant until it has executed C_i cycles, as a linear combination of the s_j:

$$f_i = \sum_{j=1}^{m} x_{i,j} \cdot s_j, \qquad x_{i,j} \in \{0, 1\}, \qquad \sum_{j=1}^{m} x_{i,j} = 1$$

Using the same concept, the expected energy consumption E(f_i) of executing (C_i − C_{i−1}) cycles under the constant CPU frequency f_i is

$$E(f_i) = \sum_{j=1}^{m} F^c(C_{i-1}) \cdot t_i(s_j) \cdot p(s_j) \cdot x_{i,j}, \qquad t_i(s_j) = \frac{C_i - C_{i-1}}{s_j}$$

From equation (5), the expected energy consumption in the intra-task DVS stochastic model is the sum of the expected energy consumption over the partitions [C_{i−1}, C_i). Therefore, the expected energy consumption under a limited set of frequencies is given by

$$E = \sum_{i=1}^{n} \sum_{j=1}^{m} F^c(C_{i-1}) \cdot t_i(s_j) \cdot p(s_j) \cdot x_{i,j}$$

and the time constraint becomes

$$\sum_{i=1}^{n} \sum_{j=1}^{m} t_i(s_j) \cdot x_{i,j} \le d$$

Note that the constraint uses the "≤" relation. This is because, under a limited set of frequencies, the sum of t_i(s_j) · x_{i,j} is hard to fit to the deadline exactly. The problem is now an MKP: minimize E subject to the time constraint, where the x_{i,j} are the only variables to be solved.

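In an intra-task DVS schedule, the maximum number of frequency changes while a task is running is the number of CPU frequency levels minus one. Because realistic CPUs [15][16][17][18] have only a few frequency levels, we use dynamic programming to solve the MKP [31]. From Section 3.2, a task consists of n partitions, {p_1, p_2, …, p_n} = {C_1, C_2 − C_1, …, C_n − C_{n−1}}. The best feasible solution for the last r partitions {p_i, …, p_n}, where r = n − i + 1, is denoted S_r; it is a set {x_{i,j}, …, x_{n,j}}, where j ∈ {1, …, m}. Solving S_r recursively, we have

$$E(S_r) = \min_{j \in \{1, \ldots, m\}} \left\{ F^c(C_{n-r}) \cdot e(s_j) + E_j(S_{r-1}) \right\}$$

where e(s_j) = t_{n−r+1}(s_j) · p(s_j) is the energy consumed in partition p_{n−r+1} at speed s_j, t_i(s_j) = (C_i − C_{i−1})/s_j is the time partition p_i takes at speed s_j, and p(s_j) is the power consumption at that speed. Because the deadline of S_{r−1} varies with the speed s_j selected in partition p_{n−r+1}, the energy consumption of S_{r−1} is given by

$$E_j(S_{r-1}) = \sum_{i=n-r+2}^{n} \sum_{k=1}^{m} F^c(C_{i-1}) \cdot t_i(s_k) \cdot p(s_k) \cdot x_{i,k}$$

evaluated for the best S_{r−1} that fits the time remaining after p_{n−r+1} has run at speed s_j.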

In Sr , under the following two conditions, deadline miss will occur:

1) Sr-1 is null

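Procedure Discrete-Optimal-Speed(S_r)
  if (r == 1) then    /* S_1 means the last partition. */
    return {x_{n,j}} for the s_j that meets the deadline with minimum F^c(C_{n-1}) * e(s_j), or null if none;
  else
    for (each speed s_j) do    /* S_{r-1} gets the deadline left after p_{n-r+1} runs at s_j */
      temp_S[j] = Discrete-Optimal-Speed(S_{r-1});
    end for
    for (each temp_S[j] that is not null) do
      Find j where E_j(S_{r-1}) + F^c(C_{n-r}) * e(s_j) is minimum;
      best_j = j;
    end for
    if (found best_j) then
      return {x_{n-r+1, best_j}} ∪ temp_S[best_j];
    else
      return null;
    end if
  end if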

Fig. 1. OSRC procedure.

The proposed OSRC procedure is shown in Fig. 1. Although the procedure is recursive, its computation does not take much time because the set of available CPU frequencies is limited. Because voltage scaling is computationally expensive and hampers the possible energy saving [32], n (the number of possible execution cycle counts of a task) should be small. In addition, the main idea of stochastic DVS is that if the actual execution cycles (AEC) of a periodic task follow a distribution, the optimal speed schedule saves the most energy. Since the distribution can be obtained offline or online, the optimal speed schedule needs to be calculated only once and can be used for a long time. For these reasons, the computation time of stochastic intra-task DVS is not an issue.
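The recursion of Fig. 1 can be sketched in runnable form. This is a plain recursive search without memoization, not the authors' implementation: the function name `osrc`, the units (Mc, MHz, ms, mW), and the choice to charge idle power for the worst-case slack are all assumptions made for illustration. Power values are copied from Table 1.

```python
# Operating points from Table 1 (Intel PXA255): frequency in MHz -> power in mW.
PXA255_MW = {200: 178.0, 300: 283.0, 400: 411.0}
IDLE_MW = 45.0   # idle-state power from Table 1
EPS = 1e-9       # tolerance for floating-point deadline comparisons


def osrc(sizes_mc, tail, deadline_ms, power_mw=PXA255_MW, idle_mw=IDLE_MW):
    """Return (expected energy in mJ, frequency per partition) or None.

    sizes_mc: partition sizes p_1..p_n in millions of cycles.
    tail: tail probability at the start of each partition.
    """
    def solve(i, remaining_ms):
        if i == len(sizes_mc):
            # Assumption: the worst-case slack is spent in the idle state.
            return max(remaining_ms, 0.0) * idle_mw / 1000.0, []
        best = None
        for f_mhz, p_mw in power_mw.items():
            t_ms = 1000.0 * sizes_mc[i] / f_mhz
            if t_ms > remaining_ms + EPS:
                continue  # this speed alone would miss the deadline
            rest = solve(i + 1, remaining_ms - t_ms)
            if rest is None:
                continue  # no feasible choice for the later partitions
            energy = tail[i] * t_ms * p_mw / 1000.0 + rest[0]
            if best is None or energy < best[0]:
                best = (energy, [f_mhz] + rest[1])
        return best

    return solve(0, deadline_ms)


print(osrc([5, 10], [1.0, 0.2], 50))        # example task 1 from Table 3
print(osrc([5, 5, 5], [1.0, 0.3, 0.1], 50))  # example task 2 from Table 3
```

Because each partition tries at most m speeds, the search visits at most m^n schedules, which stays small for the three to six levels and the handful of partitions considered here.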

To demonstrate the merit of OSRC over PACE, let us return to example task 1 in Table 3, which consists of two partitions {p_1, p_2} = {C_1, C_2 − C_1} = {5 Mc, 10 Mc}, with corresponding tail distribution function values of 1 and 0.2. Fig. 2 shows the recursion graph for task 1. Each line is labeled with a selected frequency; dotted lines represent possible paths and solid lines represent the optimal path. Each node is labeled with two values: the upper value is the energy consumption and the lower value is the time consumed in each partition. Our goal is to find the path with the minimum total energy consumption whose total time does not exceed the deadline. Using OSRC, the optimal speed schedule is {200 MHz, 400 MHz} and the energy consumption is 6.51 mJ, about 9% less than with PACE (7.15 mJ from Section 4.2). Similarly, using OSRC, the optimal speed schedule of example task 2 is {200 MHz, 400 MHz, 400 MHz}, a case that PACE could not solve, as shown in Section 4.2.

Fig. 2. The recursion graph for Task 1.

Chapter 5 Evaluation and Discussion

5.1 Simulation Model

First, we examined our optimal speed procedure on CPUs with different numbers of frequency/voltage levels: the Intel PXA255 and PXA270. The energy consumption at each frequency/voltage level was taken from Table 1 and Table 2, and the energy consumption in the idle state was also considered. The WCEC of a single task was set to the worst-case execution time (WCET) × f_max, where the WCET equals 50 ms and f_max is the maximum CPU frequency. The ratio α = BCEC/WCEC, where BCEC is the best-case execution cycles, takes values in {0.2, 0.5, 0.8} and controls the variation of a task's AEC, which lies between BCEC and WCEC and follows a normal distribution [5][32]. The mean and standard deviation were set to (WCEC + BCEC)/2 and (WCEC − BCEC)/6, so that 99.7 percent of the execution cycles fall in the interval [BCEC, WCEC] [13]. Because the speed schedule varies with the CPU utilization, we simulated the task's allowed execution time (AET) over the range [WCEC/f_max, WCEC/f_min]. All simulation parameters are listed in Table 4.

Table 4: Simulation parameters

Figure   CPU      WCEC (Mc)   BCEC (Mc)   AET (ms)   N(µ, σ)
Fig. 3   PXA255   20          4           50~100     (12, 2.7)
Fig. 4   PXA270   31.2        6.24        50~300     (18.7, 4.2)
Fig. 5   PXA255   20          10          50~100     (15, 1.7)
Fig. 6   PXA255   20          16          50~100     (18, 1.3)
Fig. 7   PXA270   31.2        15.6        50~300     (23.4, 2.6)
Fig. 8   PXA270   31.2        24.96       50~300     (28.1, 1.0)
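The way the simulation parameters follow from WCET, the frequency range, and α can be written out as a short sketch. The function name `workload_parameters` and its argument order are our own; the formulas are the ones given in Section 5.1.

```python
import math


def workload_parameters(wcet_ms, f_max_mhz, f_min_mhz, alpha):
    """Derive WCEC, BCEC, the normal-distribution parameters and the AET range."""
    wcec_mc = wcet_ms * f_max_mhz / 1000.0   # ms * MHz = thousands of cycles
    bcec_mc = alpha * wcec_mc
    mean_mc = (wcec_mc + bcec_mc) / 2.0
    std_mc = (wcec_mc - bcec_mc) / 6.0
    aet_ms = (1000.0 * wcec_mc / f_max_mhz, 1000.0 * wcec_mc / f_min_mhz)
    return wcec_mc, bcec_mc, mean_mc, std_mc, aet_ms


# Probability mass of a normal distribution within three standard deviations
# of the mean, i.e. within [BCEC, WCEC] for the parameters above.
coverage = math.erf(3.0 / math.sqrt(2.0))

print(workload_parameters(50, 400, 200, 0.2))   # PXA255, alpha = 0.2
print(workload_parameters(50, 624, 104, 0.2))   # PXA270, alpha = 0.2
print(round(coverage, 4))
```

The standard deviation is one sixth of the BCEC-to-WCEC span, so the interval covers three standard deviations on each side of the mean.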

We implemented four schemes, including OSRC, for performance evaluation:

․ WCE-stretch [5]: The speed schedule assumes that the task will exhibit its worst-case behavior and chooses the minimum static frequency that meets the deadline.

․ PACE [2]: The optimal speed schedule is calculated from the theoretical values for an ideal CPU, as described in Section 4, and unavailable frequencies are rounded up to the nearest available ones.

․ OSRC: The speed schedule is calculated by the proposed OSRC procedure in Fig. 1.

․ LB (lower bound): An oracle algorithm that knows the AEC in advance.
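Of the four schemes, WCE-stretch is the simplest to state in code. A minimal sketch, assuming WCEC in Mc, AET in ms, and the PXA255 frequency levels from Table 1; the function name `wce_stretch` is hypothetical.

```python
def wce_stretch(wcec_mc, aet_ms, levels_mhz):
    """Lowest available frequency that finishes WCEC within AET, else None."""
    for f_mhz in sorted(levels_mhz):
        if 1000.0 * wcec_mc / f_mhz <= aet_ms + 1e-9:
            return f_mhz
    return None


# PXA255 with WCEC = 20 Mc (WCET = 50 ms at 400 MHz), for several AET/WCET ratios.
for ratio in (1.0, 1.2, 1.4, 2.0):
    print(ratio, wce_stretch(20, 50 * ratio, [200, 300, 400]))
```

Since all other schemes are normalized to this baseline, each step down in the selected frequency shows up as a jump in the normalized curves.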

Because of the limited set of CPU frequency/voltage levels, unavailable frequencies in the lower-bound scheme were replaced by linear combinations of their two neighboring available frequencies for maximum energy saving. Unlike other stochastic-related papers [2][5], the performance comparison is based on the expected energy consumption, and all schemes are normalized with respect to WCE-stretch.
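One common way to realize such a linear combination is to split the work between the two neighboring levels so that the total time matches the target exactly. The sketch below illustrates that idea only; the thesis does not spell out its exact procedure, and the function name `split_between_levels` and the example numbers are ours.

```python
def split_between_levels(cycles_mc, time_ms, low_mhz, high_mhz):
    """Cycles (in Mc) to run at each of two levels so the work takes time_ms.

    Solves low_mc / low_mhz + (cycles_mc - low_mc) / high_mhz = time_ms / 1000,
    where Mc divided by MHz gives seconds.
    """
    time_s = time_ms / 1000.0
    low_mc = (time_s - cycles_mc / high_mhz) / (1.0 / low_mhz - 1.0 / high_mhz)
    return low_mc, cycles_mc - low_mc


# 20 Mc in 60 ms would need about 333 MHz, which lies between 300 and 400 MHz.
low_mc, high_mc = split_between_levels(20, 60, 300, 400)
print(low_mc, high_mc)
```

In this example roughly 12 Mc run at 300 MHz (40 ms) and 8 Mc at 400 MHz (20 ms), which together fill the 60 ms.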

5.2 Impact of CPU Levels

Fig. 3 and Fig. 4 compare the expected energy consumption of all schemes on CPUs with different numbers of voltage/frequency levels: the Intel PXA255 and PXA270. α is set to 0.2. In Fig. 3, the sudden transition between AET/WCET = 1.2 and 1.4 in the lower-bound curve occurs because the WCE-stretch speed schedule drops from 400 MHz to 300 MHz. When AET/WCET ≥ 2, the curve rises to 1 for the same reason. On the three-level CPU, the Intel PXA255, OSRC reduces CPU energy consumption by between 0% and 10.2%, with an average of 6.5%; PACE reduces it by between -1.2% and 9.9%, with an average of 2.0%; and the lower-bound scheme reduces it by between 0% and 18.5%, with an average of 10.8%. On the six-level CPU, the Intel PXA270, OSRC reduces CPU energy consumption by between 0% and 24.8%, with an average of 15.9%; PACE reduces it by between -1.6% and 22.9%, with an average of 5.6%; and the lower-bound scheme reduces it by between 0% and 26%, with an average of 19.2%. The results show that the more levels a CPU has, the more energy is saved.

Fig. 3. The impact of CPU levels on expected energy consumption in Intel PXA255. (Plot: expected energy consumption w.r.t. WCE-stretch versus AET/WCET, for OSRC, PACE, and LB.)

Fig. 4. The impact of CPU levels on expected energy consumption in Intel PXA270. (Plot: expected energy consumption w.r.t. WCE-stretch versus AET/WCET, for OSRC, PACE, and LB.)

5.3 Impact of BCEC/WCEC(α) Ratio

We set α to 0.5 and 0.8 and repeated the simulations for both types of CPU. On the Intel PXA255, as shown in Fig. 5 and Fig. 6, OSRC reduces energy consumption by 5.7% and 2.9% on average (upper bound: 11.5% and 8.9%), respectively; the corresponding values for PACE are 2.1% and 1.0%. On the Intel PXA270, as shown in Fig. 7 and Fig. 8, OSRC reduces energy consumption by 13.4% and 6.7% on average (upper bound: 17.3% and 13.3%), respectively; the corresponding values for PACE are 4.4% and 2.0%. These results show that when α is set to 0.5, the impact on energy reduction is small for both types of CPU. When α is raised to 0.8, however, the optimal schedules are close to the WCE-stretch scheme, especially on the three-level CPU. This is because little slack time limits aggressive frequency/voltage reduction in the optimal schedule, and it happens in all offline DVS schedules.

Fig. 5. The impact of α on expected energy consumption in PXA255 (α = 0.5). (Plot: expected energy consumption w.r.t. WCE-stretch versus AET/WCET, for OSRC, PACE, and LB.)

Fig. 6. The impact of α on expected energy consumption in PXA255 (α = 0.8). (Plot: expected energy consumption w.r.t. WCE-stretch versus AET/WCET, for OSRC, PACE, and LB.)

Fig. 7. The impact of α on expected energy consumption in PXA270 (α = 0.5). (Plot: expected energy consumption w.r.t. WCE-stretch versus AET/WCET, for OSRC, PACE, and LB.)

Fig. 8. The impact of α on expected energy consumption in PXA270 (α = 0.8). (Plot: expected energy consumption w.r.t. WCE-stretch versus AET/WCET, for OSRC, PACE, and LB.)

The average energy saving percentage with respect to WCE-stretch for each scheme in Fig. 3 through Fig. 8 is defined as 1 minus the average expected energy normalized to WCE-stretch. The results are summarized in Table 5; on average, the energy saving of the proposed OSRC is about three times that of PACE for realistic CPUs.

Table 5: Average energy saving percentage with respect to WCE-stretch

Figure   OSRC    PACE   LB
Fig. 3   6.5%    2.0%   10.8%
Fig. 4   15.9%   5.6%   19.2%
Fig. 5   5.7%    2.1%   11.5%
Fig. 6   2.9%    1.0%   8.9%
Fig. 7   13.4%   4.4%   17.3%
Fig. 8   6.7%    2.0%   13.3%
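The "three times" figure can be checked directly from Table 5. The sketch below assumes it refers to the mean of the per-figure ratios of OSRC saving to PACE saving, which is one plausible reading; the variable names are ours.

```python
# Average savings in percent from Table 5: figure -> (OSRC, PACE).
savings = {
    "Fig. 3": (6.5, 2.0),
    "Fig. 4": (15.9, 5.6),
    "Fig. 5": (5.7, 2.1),
    "Fig. 6": (2.9, 1.0),
    "Fig. 7": (13.4, 4.4),
    "Fig. 8": (6.7, 2.0),
}

# Ratio of OSRC saving to PACE saving for each figure, then the mean ratio.
ratios = {fig: osrc / pace for fig, (osrc, pace) in savings.items()}
mean_ratio = sum(ratios.values()) / len(ratios)
print({fig: round(r, 2) for fig, r in ratios.items()}, round(mean_ratio, 2))
```

The individual ratios range from roughly 2.7 to 3.4, so the factor of three holds across both CPUs and all three values of α.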

Chapter 6 Conclusions and Future Work

6.1 Concluding Remarks

In this thesis, we have derived an optimal speed schedule for ideal CPUs in hard real-time systems using the Lagrange multiplier procedure, which is simpler and more elegant than the derivation in PACE [2]. Because realistic CPUs offer only a limited number of frequency/voltage levels, the optimal speed schedule for ideal CPUs cannot be applied to them directly. To find an optimal speed schedule for realistic CPUs, we transformed the original nonlinear programming problem into an MKP based on the frequency/voltage levels and power consumption of a realistic CPU. With a limited number of CPU frequency/voltage levels, the problem can be solved feasibly by the OSRC procedure. To evaluate the merits of the proposed OSRC, actual data for the Intel PXA255 and PXA270 CPUs were used in the analysis. We make the following remarks.

First, the analysis results show that PACE saves little energy on realistic CPUs, performing almost the same as WCE-stretch, whereas the results of OSRC are very close to the lower bound derived from an oracle algorithm. Second, we observed that the number of CPU frequency/voltage levels affects the energy efficiency of the optimal speed schedule in the stochastic DVS model: the more levels, the more energy is saved. Third, we found that compiler-assisted intra-task DVS algorithms are hard to combine with inter-task DVS algorithms if the frequency/voltage scaling code is inserted in the source code. Lastly, under the stochastic DVS model, our scheme can provide the best solution for realistic CPUs using dynamic programming. The evaluation shows that the energy saving of OSRC is on average three times that of PACE on realistic CPUs.

6.2 Future Work

In stochastic intra-task DVS algorithms, the speed schedule is calculated only once for each AET, so it is easy to combine with most inter-task DVS algorithms. However, some issues in OSRC remain unresolved. First, all of the approaches addressed in this thesis neglect the time and energy consumed by frequency/voltage transitions. If transitions take too long, a task may miss its deadline. If the energy consumption due to transitions is too much, it
